FreeNAS 10.3-STABLE Hard crash

freenas-supero · Jun 28, 2018

Hello,

My Freenas server has just crashed in a way I've never seen before. First of all I got an email from the box saying that a drive was faulted and that the array was degraded.

The volume zpool (ZFS) state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.

Then I tried logging in via web UI and I get this error page:

An error occurred.
Sorry, the page you are looking for is currently unavailable.
Please try again later.

If you are the system administrator of this resource then you should check the error log for details.

Faithfully yours, nginx.

Using SSH and dmesg shows:

Code:

	(da3:mps0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 677 command timeout cm 0xfffffe00009f1890 ccb 0xfffff8010c140800
   (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe00009f1890
mps0: Sending reset from mpssas_send_abort for target ID 4
   (da3:mps0:0:4:0): WRITE(10). CDB: 2a 00 ba 72 af 48 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0 state c xfer 0
mps0: (da3:mps0:0:4:0): WRITE(10). CDB: 2a 00 ba 72 af 48 00 00 10 00
Unfreezing devq for target ID 4
(da3:mps0:0:4:0): CAM status: CCB request completed with an error
(da3:mps0:0:4:0): Retrying command
(da3:mps0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mps0:0:4:0): CAM status: Command timeout
(da3:mps0:0:4:0): Retrying command
(da3:mps0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mps0:0:4:0): CAM status: SCSI Status Error
(da3:mps0:0:4:0): SCSI status: Check Condition
(da3:mps0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da3:mps0:0:4:0): Error 6, Retries exhausted
(da3:mps0:0:4:0): Invalidating pack
GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=27906048, length=4096)]
swap_pager: I/O error - pagein failed; blkno 3152547,size 4096, error 6
vm_fault: pager read error, pid 4099 (zfsd)
GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=27906048, length=4096)]
swap_pager: I/O error - pagein failed; blkno 3152547,size 4096, error 6
vm_fault: pager read error, pid 4099 (zfsd)
Failed to write core file for process zfsd (error 14)
pid 4099 (zfsd), uid 0: exited on signal 11
GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=5980160, length=4096)]
swap_pager: I/O error - pagein failed; blkno 3147194,size 4096, error 6
vm_fault: pager read error, pid 1253 (devd)
GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=5943296, length=28672)]
swap_pager: I/O error - pagein failed; blkno 3147185,size 28672, error 6
vm_fault: pager read error, pid 1253 (devd)
Failed to write core file for process devd (error 14)
pid 1253 (devd), uid 0: exited on signal 11
GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=19120128, length=8192)]
swap_pager: I/O error - pagein failed; blkno 3150402,size 8192, error 6
vm_fault: pager read error, pid 3064 (python2.7)
pid 3064 (python2.7), uid 0: exited on signal 11
GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=589824, length=32768)]
swap_pager: I/O error - pagein failed; blkno 3145878,size 32768, error 6
vm_fault: pager read error, pid 3087 (python2.7)
GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=16900096, length=4096)]
swap_pager: I/O error - pagein failed; blkno 3149860,size 4096, error 6
vm_fault: pager read error, pid 3087 (python2.7)
Failed to write core file for process python2.7 (error 14)
pid 3087 (python2.7), uid 0: exited on signal 11

zpool status zpool:

Code:

[root@freenas] ~# zpool status zpool
  pool: zpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
   Sufficient replicas exist for the pool to continue functioning in a
   degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
   repaired.
  scan: scrub in progress since Thu Jun 28 03:00:05 2018
		11.0T scanned out of 11.1T at 175M/s, 0h16m to go
		0 repaired, 98.52% done
config:

   NAME											STATE	 READ WRITE CKSUM
   zpool										   DEGRADED	 0	 0	 0
	 raidz3-0									  DEGRADED	 0	 0	 0
	   gptid/4a751424-5a4a-11e5-82f2-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/7231ce76-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/74010031-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/3010e8b6-1d80-11e7-ac2f-0025907ad3a1  ONLINE	   0	 0	 0
	   gptid/7577d07e-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/7799b692-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/7979c1c6-0fb8-11e4-9267-0030487f11ba  FAULTED	  6   616	 0  too many errors
	   gptid/7ba4673f-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0

I do not have a spare hard drive at hand and need to order one. Should I shut it down until I have a hard drive handy? Array uses a Z3 RAID, so technically, I would have to lose 3 drives to be in real danger but who knows...

Best practices?

Thanks a bunch!!

EDIT: OK I managed to reboot the WebUI with

Code:

service django stop
service django start

The webUI is back. Strange that it crashed at the same time the HDD faulted. I also narrowed down which one of the HDD failed and its indeed a Seagate drive. So far, all 6 Seagates drives I initially had have failed. Nice.

I ordered 2 WD RED drives but they wont likely be here before next wednesday or thursday (so 6 or 7 days from now). Would it be advisable to shutdown the server until then or let it run on degraded array? I do have backups but I wont need the server until at least Sunday afternoon so....

Chris Moore · Jun 28, 2018

First, you should always include a description of your hardware in the post when you are asking a question so we can get clued in to what you have and it often helps with finding the problem, plus it saves having to ask a bunch of questions like this... What version of FreeNAS are you running?
Here is a forum post that tells about the kind of information you should include:
https://forums.freenas.org/index.php?threads/updated-forum-rules-4-11-17.45124/

Second, if I had to guess what happened (from the dmesg listing alone) I would say that you are running an older version of FreeNAS that uses individual drives for swap space instead of mirrors. When the drive faulted out of the pool, some portion of the operating code or data that the system needed was in the swap that was on the faulted drive. That caused the GUI to fail and in many situations, it would cause the system to reboot entirely. See this: swap_pager: I/O error - pagein failed That you were still able to SSH in and do diagnostics is pretty fortunate. Anyhow, I could be all wrong on that because I don't know what version of FreeNAS you have.

freenas-supero said:
I ordered 2 WD RED drives but they won't likely be here before next wednesday or thursday (so 6 or 7 days from now). Would it be advisable to shutdown the server until then or let it run on degraded array? I do have backups but I won't need the server until at least Sunday afternoon so....

You are running RAIDz3, so you still have 2 drives of parity. I wouldn't worry about running until the spare arrives. Personally I run RAIDz2 and only have 2 parity drives all the time, but I also keep a couple spares on hand so I could change a drive the same day it fails.

Do you have monitoring scripts setup?

Github repository for FreeNAS scripts, including disk burnin
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/

Because I am thinking you should have known there was something going wrong long before it fully faulted out of the pool. I run all Seagate drives (32 in the main NAS and 12 more in the backup NAS) and they don't usually fail without plenty of warning. What model are the Seagate drives you are having trouble with?

freenas-supero said:
Best practices?

Thanks a bunch!!

Do a burn-in (see the scripts) on your drives before adding them to the pool.

Chris Moore · Jun 28, 2018

I realize that the title says:

freenas-supero said:
Freenas 10.3-STABLE Hard crash

but there is no such version of FreeNAS.

freenas-supero · Jun 28, 2018

Chris Moore said:
I realize that the title says: but there is no such version of FreeNAS.

Exactly! The title was based on the output provided by "uname -a" and it was FreeBSD not FreeNAS...

Sorry for not including more info, in the "excitement" I forgot to copy paste a text file I had prepared which contained the relevant hardware and OS info :

Code:

Build		 FreeNAS-9.10.2-U6 (561f0d7a1)	
Platform   Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz	
Memory	32709MB	
System Time  Thu Jun 28 23:08:13 EDT 2018	
Uptime		   11:08PM up 263 days, 10:14, 0 users	
Load Average 0.43, 0.25, 0.16

Code:

uname -a
FreeBSD 10.3-STABLE #0 r295946+21897e6695f (HEAD): Tue Jul 25 00:03:12 UTC 2017

Motherboard
Supermicro X9SCL-F
Firmware Revision : 03.38
Firmware Build Time : 2015-01-05

SAS Controller
IBM ServeRAID M1015 (Flashed to IT)
[Not sure how to retrieve the FW version from SSH CLI]

CPU
Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz

System RAM
Kingston 32GB (4 x 8GB) 240-Pin DDR3
SDRAM ECC Unbuffered DDR3 1600 Server
Memory w/TS Model KVR16E11K4/32

Hard drives
1x SEAGATE Model number: ST2000DM001
1x SEAGATE Model number: ST32000542AS
1x HITACHI Model number: HDS5C302
1x HITACHI Model number: HDS5C302
1x SEAGATE Model number: ST2000DL003-9VT1 **** The Faulted drive ****
1x HITACHI Model number: HDS5C302
1x HITACHI Model number: HDN724030ALE640
1x HITACHI Model number: HDN724030ALE640

Raid Z3 (single pool using all 8 drives)

Network cards
Using onboard Intel LAN chipsets

Regarding the monitoring scripts, I do not use them yet, but will most definitely look in to them. I fully agree, I should have seen this coming, especially for a drive completely faulting out like this. The Seagate models that have failed so far are:

SEAGATE ST2000DM001
SEAGATE ST2000DM001
SEAGATE ST3000DM001-1CH1
SEAGATE ST2000DM001

So thats pretty much all 2TB's drives from the initial Freenas built (2012). I kept replacing them one after the other with HGST or Hitachi's because I had more luck with HGST than Seagate, but now I just ordered WD's because I didnt think HGST's product line was interesting (3TB not NAS capable at $199 CAD??). WD's were more reasonably priced.

Some more data from FreeNAS on the faulty drive:

Code:

Geom name: da3
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 3907029134
first: 34
entries: 128
scheme: GPT
Providers:
1. Name: da3p1
   Mediasize: 2147483648 (2.0G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   rawuuid: 7954af28-0fb8-11e4-9267-0030487f11ba
   rawtype: 516e7cb5-6ecf-11d6-8ff8-00022d09712b
   label: (null)
   length: 2147483648
   offset: 65536
   type: freebsd-swap
   index: 1
   end: 4194431
   start: 128
2. Name: da3p2
   Mediasize: 1998251364352 (1.8T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   rawuuid: 7979c1c6-0fb8-11e4-9267-0030487f11ba
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: (null)
   length: 1998251364352
   offset: 2147549184
   type: freebsd-zfs
   index: 2
   end: 3907029127
   start: 4194432
Consumers:
1. Name: da3
   Mediasize: 2000398934016 (1.8T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r2w2e5

For now I will keep the server up & running until the spares arrive then I will resilver the array.

Thanks Chris, this is really appreciated!

Chris Moore · Jun 28, 2018

freenas-supero said:
Uptime 11:08PM up 263 days, 10:14, 0 users

Wow, 263 days without a reboot for any reason, even a power outage.

freenas-supero said:
IBM ServeRAID M1015 (Flashed to IT)
[Not sure how to retrieve the FW version from SSH CLI]

The command sas2flash -list will give you:

Code:

LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

		Adapter Selected is a LSI SAS: SAS2308_2(D1)

		Controller Number			  : 0
		Controller					 : SAS2308_2(D1)
		PCI Address					: 00:03:00:00
		SAS Address					: 500605b-0-09ef-7220
		NVDATA Version (Default)	   : 14.01.00.06
		NVDATA Version (Persistent)	: 14.01.00.06
		Firmware Product ID			: 0x2214 (IT)
		Firmware Version			   : 20.00.07.00
		NVDATA Vendor				  : LSI
		NVDATA Product ID			  : SAS9207-8i
		BIOS Version				   : N/A
		UEFI BSD Version			   : N/A
		FCODE Version				  : N/A
		Board Name					 : SAS9207-8i
		Board Assembly				 : H3-25412-00J
		Board Tracer Number			: SV45308383

		Finished Processing Commands Successfully.
		Exiting SAS2Flash.

You should be able to get output on the SMART status of the faulted drive with smartctl -a /dev/da3 and it should look like:

Code:

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Barracuda 7200.14 (AF)
Device Model:	 ST2000DM001-1ER164
Serial Number:	Z4Z3ARLR
LU WWN Device Id: 5 000c50 086c31b39
Firmware Version: CC25
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	7200 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Thu Jun 28 23:41:55 2018 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
										was completed without error.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(   89) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   1) minutes.
Extended self-test routine
recommended polling time:		( 218) minutes.
Conveyance self-test routine
recommended polling time:		(   2) minutes.
SCT capabilities:			  (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   119   099   006	Pre-fail  Always	   -	   232600360
  3 Spin_Up_Time			0x0003   096   096   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   36
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   075   060   030	Pre-fail  Always	   -	   41470062
  9 Power_On_Hours		  0x0032   085   085   000	Old_age   Always	   -	   13841
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   36
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   100   100   000	Old_age   Always	   -	   0 0 0
189 High_Fly_Writes		 0x003a   098   098   000	Old_age   Always	   -	   2
190 Airflow_Temperature_Cel 0x0022   062   057   045	Old_age   Always	   -	   38 (Min/Max 37/41)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   33
193 Load_Cycle_Count		0x0032   100   100   000	Old_age   Always	   -	   536
194 Temperature_Celsius	 0x0022   038   043   000	Old_age   Always	   -	   38 (0 23 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   13820h+11m+07.042s
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   9300028576
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   1164690014691

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 13828		 -
# 2  Short offline	   Completed without error	   00%	 13817		 -
# 3  Extended offline	Completed without error	   00%	 13804		 -
# 4  Short offline	   Completed without error	   00%	 13793		 -
# 5  Extended offline	Completed without error	   00%	 13780		 -
# 6  Short offline	   Completed without error	   00%	 13769		 -
# 7  Extended offline	Completed without error	   00%	 13757		 -
# 8  Short offline	   Completed without error	   00%	 13745		 -
# 9  Extended offline	Completed without error	   00%	 13735		 -
#10  Short offline	   Completed without error	   00%	 13721		 -
#11  Extended offline	Completed without error	   00%	 13709		 -
#12  Short offline	   Completed without error	   00%	 13697		 -
#13  Extended offline	Completed without error	   00%	 13684		 -
#14  Short offline	   Completed without error	   00%	 13673		 -
#15  Extended offline	Completed without error	   00%	 13660		 -
#16  Short offline	   Completed without error	   00%	 13649		 -
#17  Extended offline	Completed without error	   00%	 13637		 -
#18  Short offline	   Completed without error	   00%	 13625		 -
#19  Extended offline	Completed without error	   00%	 13612		 -
#20  Short offline	   Completed without error	   00%	 13601		 -
#21  Extended offline	Completed without error	   00%	 13588		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Some of the older model drives were more troublesome. I recently replaced a batch of drives that had gone over 5 years of age and subsequently started failing more frequently. It might shed some light on the, "why," of this failure if you share the SMART status.

freenas-supero said:
1x SEAGATE Model number: ST32000542AS

That is a really old model drive. You might want to consider replacing that drive preemptively, but look at the SMART status first.
It is generally a good idea to plan to replace drives by the time they reach 5 years of age because it is around that age that the rate of failure begins to increase.
If your build was in 2012, any of the original drives would be nearly 6 years old. I don't know if you have any of them still in service, but the age of a drive contributes to the rate of failure. Kind of like tires on your car, they need to be replaced from time to time.

freenas-supero said:
Build FreeNAS-9.10.2-U6 (561f0d7a1)

Is there some reason why you are not updating to the current release?

freenas-supero · Jun 29, 2018

Yeah, the uptime is pretty OK, but nothing close to the maximum this server has provided. Last uptime was 768 days and thats only because my power utility did some "rolling blackout" due to system issues. The UPS could not supply power for more than 5 hours so the server shut itself down.

LSI's sas2flash (ah good memories) gives:

Code:

LSI Corporation SAS2 Flash Utility																								 
Version 16.00.00.00 (2013.03.01)																									
Copyright (c) 2008-2013 LSI Corporation. All rights reserved																		
																																   
	   Adapter Selected is a LSI SAS: SAS2008(B1)																				 
																																   
	   Controller Number			  : 0																						 
	   Controller					 : SAS2008(B1)																				
	   PCI Address					: 00:01:00:00																				
	   SAS Address					: 500605b-0-0227-cbc0																		
	   NVDATA Version (Default)	   : 14.01.00.08																				
	   NVDATA Version (Persistent)	: 14.01.00.08																				
	   Firmware Product ID			: 0x2213 (IT)																				
	   Firmware Version			   : 20.00.07.00																				
	   NVDATA Vendor				  : LSI																						
	   NVDATA Product ID			  : SAS9211-8i																				 
	   BIOS Version				   : N/A																						
	   UEFI BSD Version			   : N/A																						
	   FCODE Version				  : N/A																						
	   Board Name					 : SAS9211-8i																				 
	   Board Assembly				 : N/A																						
	   Board Tracer Number			: N/A																						
																																   
	   Finished Processing Commands Successfully.																				 
	   Exiting SAS2Flash.

I am not an expert on SMART commands but "smartctl -a /dev/da3" gives:

Code:

[root@freenas] ~# smartctl -a /dev/da3
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Barracuda Green (AF)
Device Model:	 ST2000DL003-9VT166
Serial Number:	5YD9RJXW
LU WWN Device Id: 5 000c50 0530de955
Firmware Version: CC45
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5900 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Fri Jun 29 06:49:39 2018 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)   Offline data collection activity
				   was completed without error.
				   Auto Offline Data Collection: Enabled.
Self-test execution status:	  (  41)   The self-test routine was interrupted
				   by the host with a hard or soft reset.
Total time to complete Offline
data collection:		(  623) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
				   Auto Offline data collection on/off support.
				   Suspend Offline collection upon new
				   command.
				   Offline surface scan supported.
				   Self-test supported.
				   Conveyance Self-test supported.
				   Selective Self-test supported.
SMART capabilities:			(0x0003)   Saves SMART data before entering
				   power-saving mode.
				   Supports SMART auto save timer.
Error logging capability:		(0x01)   Error logging supported.
				   General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (   1) minutes.
Extended self-test routine
recommended polling time:	 ( 353) minutes.
Conveyance self-test routine
recommended polling time:	 (   2) minutes.
SCT capabilities:		   (0x30b7)   SCT Status supported.
				   SCT Feature Control supported.
				   SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   119   099   006	Pre-fail  Always	   -	   215982960
  3 Spin_Up_Time			0x0003   085   070   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   104
  5 Reallocated_Sector_Ct   0x0033   100   100   036	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   087   060   030	Pre-fail  Always	   -	   504268892
  9 Power_On_Hours		  0x0032   055   055   000	Old_age   Always	   -	   40069
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   103
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   099   099   000	Old_age   Always	   -	   8590065666
189 High_Fly_Writes		 0x003a   100   100   000	Old_age   Always	   -	   0
190 Airflow_Temperature_Cel 0x0022   066   058   045	Old_age   Always	   -	   34 (Min/Max 26/40)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   98
193 Load_Cycle_Count		0x0032   100   100   000	Old_age   Always	   -	   104
194 Temperature_Celsius	 0x0022   034   042   000	Old_age   Always	   -	   34 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   034   004   000	Old_age   Always	   -	   215982960
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   40064 (71 71 0)
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   281936614
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   3822250229

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Interrupted (host reset)	  90%	 40058		 -
# 2  Short offline	   Completed without error	   00%	 40017		 -
# 3  Short offline	   Completed without error	   00%	 39969		 -
# 4  Short offline	   Completed without error	   00%	 39897		 -
# 5  Short offline	   Completed without error	   00%	 39825		 -
# 6  Short offline	   Completed without error	   00%	 39753		 -
# 7  Extended offline	Completed without error	   00%	 39712		 -
# 8  Short offline	   Completed without error	   00%	 39681		 -
# 9  Short offline	   Completed without error	   00%	 39609		 -
#10  Short offline	   Completed without error	   00%	 39537		 -
#11  Short offline	   Completed without error	   00%	 39465		 -
#12  Short offline	   Completed without error	   00%	 39393		 -
#13  Extended offline	Completed without error	   00%	 39320		 -
#14  Short offline	   Completed without error	   00%	 39273		 -
#15  Short offline	   Completed without error	   00%	 39225		 -
#16  Short offline	   Completed without error	   00%	 39153		 -
#17  Short offline	   Completed without error	   00%	 39081		 -
#18  Short offline	   Completed without error	   00%	 39009		 -
#19  Extended offline	Completed without error	   00%	 38968		 -
#20  Short offline	   Completed without error	   00%	 38937		 -
#21  Short offline	   Completed without error	   00%	 38865		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

As for the Freenas version, you bet I will be upgrading right after replacement of the dead drive.... Reason for not upgrading more often are risk adverse, I just dont like touching what works well. I've had so many issues with updates in the past on all kind of systems, I generally avoid updating unless there's a stability or performance or integrity benefit. I guess here I should have updated before..

Chris Moore · Jun 29, 2018

freenas-supero said:
Code:
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 40064 (71 71 0)

That is about 4.5 years, so I would say it gave you good service.

freenas-supero said:
Code:
188 Command_Timeout 0x0032 099 099 000 Old_age Always - 8590065666

This is a bad indicator. The system is giving the drives commands and the drive is not responding, or not responding fast enough.

freenas-supero said:
I've had so many issues with updates in the past on all kind of systems, I generally avoid updating unless there's a stability or performance or integrity benefit.

The release that is out right this moment, FreeNAS-11.1-U5, has been working nicely for me and one of the improvements is that the swap space is in mirrors now so a single disk failure won't cause the kind of crash you had.

Johnnie Black · Jun 29, 2018

The RAW command timeout attribute on most Seagate drives is a multi byte value and can't be read directly, there's no cause for concern with the current value, disk looks perfectly healthy.

Chris Moore · Jun 29, 2018

Johnnie Black said:
The RAW command timeout attribute on most Seagate drives is a multi byte value and can't be read directly, there's no cause for concern with the current value, disk looks perfectly healthy.

I have 32 Seagate drives in my Main NAS and another 12 in my Backup NAS. That is not one of the values that looks crazy.
The "Timeout" values on my Seagate drives that are healthy is ZERO.
This drive that was dropped from the pool for, "too many errors", has a non-zero number.
Tell me again how healthy the drive looks?

Johnnie Black · Jun 29, 2018

Chris Moore said:
Tell me again how healthy the drive looks?

Perfectly healthy, like mentioned that value is multi byte for that model and many other Seagates, though not for all, if you want more info about these google Seagate high command timeout value

Chris Moore · Jun 29, 2018

Johnnie Black said:
Perfectly healthy, like mentioned that value is multi byte for that model and many other Seagates, though not for all, if you want more info about these google Seagate high command timeout value

If it is so healthy, why is it dropped from the pool?

Johnnie Black · Jun 29, 2018

Chris Moore said:
If it is so healthy, why is it dropped from the pool?

I'm saying the SMART report is showing a perfectly heathy drive, and the command timeout raw value is not any indication that the drive is failing, for these drives you need to look at the attributes current and worst values, currently the same at 99, so perfectly fine, though a good SMART report does not always equal an healthy drive, but it could also have dropped from the pool due to a bad cable/connection, what I usually do in these situations is to replace/swap cables/backplane with a different drive to rule them out, then if there are more problems with the same drive replace it, despite the healthy SMART.

Code:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
188 Command_Timeout		 0x0032   099   099   000	Old_age   Always	   -	   8590065666

Chris Moore · Jun 29, 2018

Johnnie Black said:
I'm saying the SMART report is showing a perfectly heathy drive

I would agree with you if we were talking about Seek Error Rate, Raw Read Error Rate, or Hardware ECC Recovered; but I think you are mistaken about this value being indicative of a healthy drive, especially given that the drive has already been ejected from the pool by ZFS for having to many errors.

freenas-supero · Jun 29, 2018

OK So today I tried upgrading Freenas to 11. Things got weird fast.

First of all the browser crashed while downloading the packages. Then I assumed I didnt have to relaunch the update as the process was happening on the server ()i.e. my firefox browser is not necessary?). So I waited a bit for the packages to download, then re-logged in and tried to run the update once more. FreeNAS said that the update was already done and to activate the boot environment under system>Boot. I did, and clicked reboot.

It never rebooted. Using IPMI console I saw endless Swap IO error messages .. I forced a reset using power cycle, and let the server reboot. It booted just fine and then I went to the update section once more to see what happened, and tried to update, this time, it said:

Update: FreeNAS-11.1-U5 has already been applied. Please Reboot the system to avail of this update

So I went to reboot the server but was warned that it was resilvering!!! What? I didnt replace the faulty drive so how can it be resilvering?

zpool status shows:

Code:

[root@freenas] ~# zpool status zpool
  pool: zpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
   continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jun 29 18:00:41 2018
		21.2G scanned out of 11.1T at 101M/s, 31h44m to go
		2.46G resilvered, 0.19% done
config:

   NAME											STATE	 READ WRITE CKSUM
   zpool										   ONLINE	   0	 0	 0
	 raidz3-0									  ONLINE	   0	 0	 0
	   gptid/4a751424-5a4a-11e5-82f2-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/7231ce76-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/74010031-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/3010e8b6-1d80-11e7-ac2f-0025907ad3a1  ONLINE	   0	 0	 0
	   gptid/7577d07e-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/7799b692-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/7979c1c6-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0  (resilvering)
	   gptid/7ba4673f-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0

So apparently the "faulty" drive is no longer faulty and is being resilvered?

I intend to change it no matter what. Should I stop the resilver process? Isn't that wearing out the drives in a useless manner?

freenas-supero · Jun 29, 2018

The resilver just finished. less than 5 minutes later.

WT* is going on???

Need Freenas Guru's input... I no longer thrust that server.

Code:

[root@freenas] ~# zpool status zpool
  pool: zpool
 state: ONLINE
  scan: resilvered 7.50G in 0h11m with 0 errors on Fri Jun 29 18:12:27 2018
config:

   NAME											STATE	 READ WRITE CKSUM
   zpool										   ONLINE	   0	 0	 0
	 raidz3-0									  ONLINE	   0	 0	 0
	   gptid/4a751424-5a4a-11e5-82f2-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/7231ce76-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/74010031-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/3010e8b6-1d80-11e7-ac2f-0025907ad3a1  ONLINE	   0	 0	 0
	   gptid/7577d07e-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/7799b692-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/7979c1c6-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0
	   gptid/7ba4673f-0fb8-11e4-9267-0030487f11ba  ONLINE	   0	 0	 0

EDIT: OK I rebooted the server, and even if 11.1-U5 is active and says "On reboot", it didnt activate at teboot and I am still running 9.10...

Worst, now freenas says that the 2nd USB Stick (it is mirrored) is FAULTY!

Code:

Boot Volume Condition: DEGRADED One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state

Chris Moore · Jun 29, 2018

It only needs to resilver what changes have happened since the drive was dropped off the pool.

Sent from my SAMSUNG-SGH-I537 using Tapatalk

Jailer · Jun 29, 2018

Delete the 11.1-U5 boot environment and run the update again.

freenas-supero · Jul 2, 2018

I think a faulty USB stick was the culprit. I tried replacing the faulty boot stick with a new one, and the resilvering took 6 hours... Not normal.

Then the server never rebooted.

I had to re-flash 11.1-U5 on a new drive, install it then mirror it.

Now everything is working once again.

Thanks for the help guys!!

Important Announcement for the TrueNAS Community.

FreeNAS 10.3-STABLE Hard crash

freenas-supero

Contributor

Chris Moore

Hall of Famer

Chris Moore

Hall of Famer

freenas-supero

Contributor

Chris Moore

Hall of Famer

freenas-supero

Contributor

Chris Moore

Hall of Famer

Johnnie Black

Guru

Chris Moore

Hall of Famer

Johnnie Black

Guru

Chris Moore

Hall of Famer

Johnnie Black

Guru

Chris Moore

Hall of Famer

freenas-supero

Contributor

freenas-supero

Contributor

Chris Moore

Hall of Famer

Jailer

Not strong, but bad

freenas-supero

Contributor

Similar threads

Important Announcement for the TrueNAS Community.

FreeNAS 10.3-STABLE Hard crash

Contributor

Hall of Famer

Hall of Famer

Contributor

Hall of Famer

Contributor

Hall of Famer

Guru

Hall of Famer

Guru

Hall of Famer

Guru

Hall of Famer

Contributor

Contributor

Hall of Famer

Not strong, but bad

Contributor

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "FreeNAS 10.3-STABLE Hard crash"

Similar threads