System dataset drive died, can't reboot, how to proceed

Status
Not open for further replies.

Hazar

Cadet
Joined
Dec 9, 2016
Messages
9
I received an e-mail from my FreeNAS installation, which says:
Code:
The volume incihdd (ZFS) state is UNAVAIL: One or more devices are faulted in response to IO failures.

When I connected to the console to check, it appeared that the drive had disconnected and reconnected itself.

(console screenshot: https://www.dropbox.com/s/a9y4szvuwl4h7h6/incihdd.png?dl=0)

This drive holds only the FreeNAS system dataset. My boot and data drives are separate.

I can't get a shell: it asks for a username and password, then hangs. I can switch consoles with ALT+F2, F3, etc., but the same thing happens. The web GUI and shares are unavailable as well.

  1. Is there a way to reboot the machine gracefully (other than pulling the plug)?
  2. I'm assuming the drive has died even though it reconnected itself. What is the procedure for replacing a drive that contains the system dataset?
This is the second time I've had problems with the system dataset. The first was caused by using a USB drive; since then I've been using a standard 2.5" HDD.

System:
FreeNAS-9.10.2-U3 (e1497f269)
HP Microserver G8 with 12G RAM

Thanks.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479

Hazar

Cadet
Joined
Dec 9, 2016
Messages
9
Are your hard drives connected to a RAID card in that machine?

My disks are connected via the on-board B120i controller, which is configured to use AHCI mode (all RAID-related features are disabled). This is the only way to connect hard drives to this system.

I'm including my last zpool status report to give more information about my disk arrangement:

freenas-boot: 1x MicroSD card
incihdd: 1x 2.5" HDD
tank: 4x 3.5" HDD

Code:
########## ZPool status report for freenas-boot ##########

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h15m with 0 errors on Sun Oct  8 04:01:00 2017
config:

	NAME										  STATE	 READ WRITE CKSUM
	freenas-boot								  ONLINE	   0	 0	 0
	  gptid/7308fc34-c0a0-11e6-a247-94188237f074  ONLINE	   0	 0	 0

errors: No known data errors



########## ZPool status report for incihdd ##########

  pool: incihdd
state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Oct  1 00:00:13 2017
config:

	NAME										  STATE	 READ WRITE CKSUM
	incihdd									   ONLINE	   0	 0	 0
	  gptid/cc16a746-5cea-11e7-9ccb-94188237f074  ONLINE	   0	 0	 0

errors: No known data errors



########## ZPool status report for tank ##########

  pool: tank
state: ONLINE
  scan: scrub repaired 0 in 7h13m with 0 errors on Sun Oct  1 12:13:36 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	tank											ONLINE	   0	 0	 0
	  raidz1-0									  ONLINE	   0	 0	 0
		gptid/94e45b7b-c0c4-11e6-9124-94188237f074  ONLINE	   0	 0	 0
		gptid/9aac2ee0-c0c4-11e6-9124-94188237f074  ONLINE	   0	 0	 0
		gptid/a01b1ae7-c0c4-11e6-9124-94188237f074  ONLINE	   0	 0	 0
		gptid/a604509c-c0c4-11e6-9124-94188237f074  ONLINE	   0	 0	 0

errors: No known data errors
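
For anyone who wants to check reports like the ones above automatically, here is a minimal Python sketch that pulls each pool's name and state out of `zpool status` text. The function name `pool_states` and the short sample string are only illustrative (not part of FreeNAS), and it assumes nothing beyond the `pool:` / `state:` line layout shown above. The UNAVAIL state in the sample is the one from the alert e-mail, not from this report.

```python
import re

def pool_states(report: str) -> dict:
    """Map each pool name to its state, using the 'pool:' and 'state:' lines."""
    states = {}
    current = None
    for line in report.splitlines():
        m = re.match(r"\s*pool:\s*(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"\s*state:\s*(\S+)", line)
        if m and current is not None:
            states[current] = m.group(1)
            current = None
    return states

# Hypothetical sample: incihdd as reported in the alert e-mail, tank as above.
sample = """\
  pool: incihdd
 state: UNAVAIL
  pool: tank
 state: ONLINE
"""

unhealthy = [name for name, state in pool_states(sample).items() if state != "ONLINE"]
print(unhealthy)
```

Anything other than ONLINE is treated as unhealthy here, which is a deliberately blunt choice for a sketch.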



Last SMART report of the failed drive (apparently it died in the middle of a SMART test; note the very interesting start-stop count):
Code:
########## SMART status report for ada4 drive (Seagate Momentus 5400.6: 5XT03G8L) ##########
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   117   099   006	Pre-fail  Always	   -	   120162798
  3 Spin_Up_Time			0x0003   099   099   085	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   037   037   020	Old_age   Always	   -	   65535
  5 Reallocated_Sector_Ct   0x0033   100   100   036	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   077   060   030	Pre-fail  Always	   -	   52224768
  9 Power_On_Hours		  0x0032   097   097   000	Old_age   Always	   -	   3030
10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
12 Power_Cycle_Count	   0x0032   099   099   020	Old_age   Always	   -	   2035
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   100   100   000	Old_age   Always	   -	   0
189 High_Fly_Writes		 0x003a   100   100   000	Old_age   Always	   -	   0
190 Airflow_Temperature_Cel 0x0022   064   054   045	Old_age   Always	   -	   36 (Min/Max 31/41)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   57
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   74
193 Load_Cycle_Count		0x0032   034   034   000	Old_age   Always	   -	   133704
194 Temperature_Celsius	 0x0022   036   046   000	Old_age   Always	   -	   36 (0 15 0 0 0)
195 Hardware_ECC_Recovered  0x001a   055   055   000	Old_age   Always	   -	   120162798
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   1820 (28 180 0)
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   1731402232
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   3350734571
254 Free_Fall_Sensor		0x0032   100   100   000	Old_age   Always	   -	   0

No Errors Logged

Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline	Self-test routine in progress 20%	  3030		 -
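
On the start-stop count: 65535 is the largest value a 16-bit counter can hold (2^16 - 1), so the raw value may simply be saturated rather than exact. Below is a rough Python sketch that reads raw values from attribute rows like the ones above and flags that case. The function names are made up for illustration, the two sample rows are copied from the report, and whether this drive really keeps the counter in 16 bits is only my assumption.

```python
def parse_smart_attributes(text: str) -> dict:
    """Return {attribute_name: raw_value} from smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the 10th column is the (leading part of the) raw value.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue
    return attrs

def saturated_16bit(attrs: dict) -> list:
    """Names of attributes whose raw value equals 0xFFFF (65535)."""
    return [name for name, raw in attrs.items() if raw == 0xFFFF]

# Two rows taken from the report above.
sample = """\
  4 Start_Stop_Count        0x0032   037   037   020    Old_age   Always       -       65535
193 Load_Cycle_Count        0x0032   034   034   000    Old_age   Always       -       133704
"""

attrs = parse_smart_attributes(sample)
print(saturated_16bit(attrs))
```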
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Having a separate drive just for your System Dataset files is non-standard. I'm guessing
that you are aware of this fact. As far as recovery goes, if you have a copy of your configuration
saved to another device, you will probably want to perform a new install. (SEE BELOW)


In regards to the future, my recommendation would be to do away with the separate drive for
the system dataset, purchase a good quality SSD drive (64GB - 128GB) and install to that
SSD drive and keep your system dataset on that same drive. Do away with the microSD card.

I personally follow the above-mentioned configuration and swear by its stability,
durability and cost effectiveness. What I consider to be the proper type of boot drive
is linked below.

https://www.ebay.com/i/382238920102?rt=nc

http://www.ebay.com/itm/Intel-32GB-...293441?hash=item46617d0301:g:A20AAOSw7XBY8My~

Please note that my personal experience is that the 32GB size has been
historically MUCH better than the 64GB model in terms of overall condition from the used market.
 

Hazar

Cadet
Joined
Dec 9, 2016
Messages
9
Having a separate drive just for your System Dataset files is non-standard.
I'm doing this because I boot from an SD card; otherwise the system dataset would kill it very quickly with its constant writes. This way the boot drive does read-only operations almost all the time (and won't die quickly).

In regards to the future .. purchase a good quality SSD drive.
Fully agreed. This is also my plan, but I can't yet find a relatively small SSD from a reputable source (i.e. not from unknown Chinese brands) in my region.

Thanks for the reply. I still can't access/reboot my FreeNAS machine, and the idea of hard resetting a machine that runs ZFS is a bit scary. Any ideas are appreciated.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
I still can't access/reboot my FreeNAS machine, and the idea of hard resetting a machine that runs ZFS is a bit scary. Any ideas are appreciated.
If you are worried about your data pool drives becoming corrupted, you can always unplug them (I'm assuming
the server is currently shut down). The data pool cannot be affected if the drives are not powered up. Having
done that, you are free to do whatever you need or want to do to get FreeNAS reinstalled and working again.
Once your boot drive/dataset issue is solved to your satisfaction, shut down, hook up your data pool drives and
import the volume into your fresh install of FreeNAS.
 

Hazar

Cadet
Joined
Dec 9, 2016
Messages
9
If you are worried about your data pool drives becoming corrupted, you can always unplug them (I'm assuming
the server is currently shut down).

My server is still running, but I can't shut it down gracefully because I can't get a shell, web GUI, etc. Console sessions (via IPMI, HP iLO) and SSH attempts ask for a username and password; I enter them, then the session hangs. I can switch consoles (on IPMI) with ALT+F2, F3, etc., but the end result is the same. It still responds to pings and accepts SSH connections, but hangs on login. I'm trying to find a way to power off or reboot it so that my filesystems are synced, reducing the risk of ZFS corruption. If there isn't another way, I'll force a hard reset.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504

Hazar

Cadet
Joined
Dec 9, 2016
Messages
9
Ctrl-Alt-Del?

(console screenshot: upload_2017-10-10_21-31-30.png)


Now it hangs there. It doesn't accept any input, ALT+F1/F2/F3 have stopped working (probably because of single-user mode?), and SSH attempts are now refused.

Is this the appropriate time to say "it's dead, Jim"? Somehow the USB subsystem is still working: when I reconnect to the remote console, the usual "virtual keyboard connected" messages still appear.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
If there isn't another way, I'll force a hard reset.
I would just hold in the power button until it shuts down, open the case and unplug the data pool drives.
 