Intermittent Critical Error

Status
Not open for further replies.

interrupt21h

Cadet
Joined
Oct 16, 2016
Messages
4
My FreeNAS box (ASRock C2750D4I, 16 GB ECC memory) has been running flawlessly for the last six months. I have a pool of two mirrored pairs built from 4 × 4 TB Seagate NAS drives.
I'm new to ZFS but have some experience with *nix systems.

Three days ago I received the following error by email:
Code:
The volume data (ZFS) state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
Device: /dev/ada2, unable to open device


So I checked the pool status to confirm, and ada2 was indeed offline. At first I thought the drive had died on me prematurely, but before writing it off I swapped it into another bay and rebooted the system. No error occurred and the pool showed as Healthy.

To verify data integrity I ran a scrub during the night, and midway through it another critical error occurred; the pool once again showed as degraded.

Code:
  pool: data
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 51.4M in 2h46m with 0 errors on Sun Oct 16 01:32:55 2016
config:

	NAME                                            STATE     READ WRITE CKSUM
	data                                            DEGRADED     0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    gptid/99c05557-f6ca-11e5-be76-d05099c08a48  ONLINE       0     0     0
	    gptid/9a69f4f8-f6ca-11e5-be76-d05099c08a48  ONLINE       0     0     0
	  mirror-1                                      DEGRADED     0     0     0
	    gptid/9b18fd4a-f6ca-11e5-be76-d05099c08a48  ONLINE       0     0     0
	    gptid/9bcdc0e4-f6ca-11e5-be76-d05099c08a48  DEGRADED     0     0 12.7K  too many errors

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h3m with 0 errors on Mon Oct  3 03:48:43 2016
config:

	NAME          STATE     READ WRITE CKSUM
	freenas-boot  ONLINE       0     0     0
	  da0p2       ONLINE       0     0     0

errors: No known data errors
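
For reference, the cleanup path the status output points at would be something along these lines; I haven't run these yet, and new_disk is just a placeholder for a replacement device:
Code:
zpool clear data    # reset the error counters if the disk is judged healthy
zpool replace data gptid/9bcdc0e4-f6ca-11e5-be76-d05099c08a48 new_disk    # or swap in a replacement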


I have short SMART tests enabled every 30 minutes, but realized that I never scheduled long SMART tests.

So I ran a long test and then checked the SMART output, but I'm not super comfortable interpreting it. There are no reallocated sectors showing in the SMART status.
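For reference, starting the long test and reading the results back goes roughly like this (assuming the disk is still at ada2 after the bay swap):
Code:
smartctl -t long /dev/ada2    # kick off the extended self-test in the background
smartctl -a /dev/ada2         # read attributes, error log and self-test log afterwards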

Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate NAS HDD
Device Model:	 ST4000VN000-1H4168
Serial Number:	Z304WQNB
LU WWN Device Id: 5 000c50 086e996b2
Firmware Version: SC46
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5900 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Mon Oct 17 18:09:40 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection:		 (  107) seconds.
Offline data collection
capabilities:			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	 (   1) minutes.
Extended self-test routine
recommended polling time:	 ( 485) minutes.
Conveyance self-test routine
recommended polling time:	 (   2) minutes.
SCT capabilities:			(0x10bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   117   100   006	Pre-fail  Always	   -	   157170104
  3 Spin_Up_Time			0x0003   091   091   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   22
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   063   060   030	Pre-fail  Always	   -	   8594435583
  9 Power_On_Hours		  0x0032   096   096   000	Old_age   Always	   -	   3780
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   22
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   100   100   000	Old_age   Always	   -	   0
189 High_Fly_Writes		 0x003a   100   100   000	Old_age   Always	   -	   0
190 Airflow_Temperature_Cel 0x0022   066   057   045	Old_age   Always	   -	   34 (Min/Max 21/40)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   0
193 Load_Cycle_Count		0x0032   014   014   000	Old_age   Always	   -	   173479
194 Temperature_Celsius	 0x0022   034   043   000	Old_age   Always	   -	   34 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0

SMART Error Log Version: 1
ATA Error Count: 1
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 3742 hours (155 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 03 80 04 11 40

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 40 00  12d+17:31:00.627  FLUSH CACHE EXT
  61 00 08 ff ff ff 4f 00  12d+17:31:00.609  WRITE FPDMA QUEUED
  61 00 28 ff ff ff 4f 00  12d+17:30:57.234  WRITE FPDMA QUEUED
  61 00 18 ff ff ff 4f 00  12d+17:30:57.233  WRITE FPDMA QUEUED
  61 00 10 ff ff ff 4f 00  12d+17:30:57.233  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	  3765		 -
# 2  Short offline	   Completed without error	   00%	  3596		 -
# 3  Short offline	   Completed without error	   00%	  3584		 -
# 4  Short offline	   Completed without error	   00%	  3411		 -
# 5  Short offline	   Completed without error	   00%	  3399		 -
# 6  Short offline	   Completed without error	   00%	  3243		 -
# 7  Short offline	   Completed without error	   00%	  3231		 -
# 8  Short offline	   Completed without error	   00%	  3075		 -
# 9  Short offline	   Completed without error	   00%	  3063		 -
#10  Extended offline	Completed without error	   00%	  3010		 -
#11  Short offline	   Completed without error	   00%	  2907		 -
#12  Short offline	   Completed without error	   00%	  2895		 -
#13  Short offline	   Completed without error	   00%	  2739		 -
#14  Short offline	   Completed without error	   00%	  2727		 -
#15  Short offline	   Completed without error	   00%	  2571		 -
#16  Short offline	   Completed without error	   00%	  2559		 -
#17  Short offline	   Completed without error	   00%	  2309		 -
#18  Short offline	   Completed without error	   00%	  2297		 -
#19  Short offline	   Completed without error	   00%	  2141		 -
#20  Short offline	   Completed without error	   00%	  2129		 -
#21  Short offline	   Completed without error	   00%	  1973		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



After all that, I rebooted the server one last time a day ago, and no error has occurred since.

I have read https://forums.freenas.org/index.php?threads/is-my-drive-going-bad.42742/
and
https://forums.freenas.org/index.php?threads/critical-alert-how-best-to-proceed.30645/

Is this a sign that the drive will likely fail soon, or is there another more likely cause, such as the internal SATA cabling? And why would it be intermittent?

I also realized during all of this that I had no spare drive at home, so I'll go and buy one. Should I replace the drive that caused those critical errors?

What are the advantages/disadvantages of running a hot spare vs. having the spare drive sit quietly in my closet? I still have four bays available in my chassis.

Thanks for your help
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Hardware and FreeNAS version? That drive is dead. Drives sitting in your closet are usually best; make sure the spare is burned in and ready to go.

Sent from my Nexus 5X using Tapatalk
 

interrupt21h

Cadet
Joined
Oct 16, 2016
Messages
4
That drive is dead.
Do you infer that from the SMART status I posted? If so, what are the indicators, so I can spot them myself in the future?

System is:
FreeNAS-9.10.1-U2 (f045a8b)

ASRock C2750D4I (Intel Avoton C2750 8-core CPU)
2 × 8 GB Crucial ECC DDR3
4 × 4 TB Seagate ST4000VN000
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
short SMART tests enabled every 30 minutes
That would be absurdly frequent, but from your self-test log it looks like they're actually running every 12 hours. In my opinion that's still far too frequent, but not so bad. I'm guessing your SMART checks are every 30 minutes, which is the default.
That drive is dead
What makes you say this?
What are the advantages/disadvantages of running a hot spare
If you have a pool made of mirrored pairs, a single drive failure leaves you with a vdev with no redundancy, but a hot spare can jump right in and start resilvering. In general, a hot spare makes a lot more sense when you have multiple vdevs. For example, if you have two RAIDZ1 vdevs, a hot spare might make sense, but if you have a single RAIDZ1 vdev with a hot spare, you should have built a RAIDZ2 vdev instead.
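For illustration, attaching a hot spare to an existing pool is a single command from the shell (ada6 here is a hypothetical device name; adjust to your system):
Code:
zpool add data spare /dev/ada6    # registers ada6 as a pool-wide spare, idle until needed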
 

interrupt21h

Cadet
Joined
Oct 16, 2016
Messages
4
I'm guessing your SMART checks are every 30 minutes, which is the default.

Yes, sorry for the confusion, that's what I meant. Thanks for clearing that up.

In general, a hot spare makes a lot more sense when you have multiple vdevs
So if I want to keep using mirrored pairs, a hot spare might make more sense if, for example, I had three or more 2-disk mirror vdevs? When I first configured my server, it seemed more resilient to use mirrored pairs instead of RAIDZ2.

High availability / constant uptime is not a constraint at home, so I might as well keep the disk in the closet and replace drives as needed. Still, it's interesting to think about these things for a future setup.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
When I first configured my server, it seemed more resilient to use mirrored pairs instead of RAIDZ2.
No, not at all.
That drive is dead.
I wouldn't jump to that conclusion. The error the drive logged might have been caused by the interface. Definitely worth thoroughly testing the drive before possibly getting rid of it.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
a hot spare might make more sense if, for example, I had three or more 2-disk mirror vdevs?
You already have multiple vdevs: your pool is made of two mirrored pairs. Therefore, there is a case to be made for a hot spare. However, as long as you have a burned-in spare on hand, it won't make much difference. The advantages of a hot spare are much clearer when you don't have immediate physical access to the server.

As for resiliency, RAIDZ2 is clearly more resilient than mirrored pairs, because it can lose any two disks without data loss, whereas mirrored pairs can only survive the loss of one drive per mirror. The primary benefit of mirrored pairs for the average user is flexibility, i.e. you can grow your storage two drives at a time.
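To make the trade-off concrete, here is how the two four-disk layouts would be declared at pool creation (device names are hypothetical):
Code:
zpool create tank mirror ada0 ada1 mirror ada2 ada3    # two mirrored pairs: one failure per pair survivable
zpool create tank raidz2 ada0 ada1 ada2 ada3           # one RAIDZ2 vdev: any two failures survivable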
 

interrupt21h

Cadet
Joined
Oct 16, 2016
Messages
4
Definitely worth thoroughly testing the drive
Besides the long SMART test, which I already did, what other forms of testing are you suggesting?

As for resiliency, RAIDZ2 is clearly more resilient than mirrored pairs, because it can lose any two disks without data loss, whereas mirrored pairs can only survive the loss of one drive per mirror.
What do you think about this article: http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/
That's what convinced me to use mirrors in the first place.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Besides the long SMART test, which I already did, what other forms of testing are you suggesting?
Badblocks is a good place to start.
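Something like the following, for example; note that badblocks -w is destructive and will overwrite the entire disk, so only run it on a drive holding no data you need (ada2 assumed here):
Code:
badblocks -ws -b 4096 /dev/ada2    # four full write+verify passes over the whole disk
smartctl -a /dev/ada2              # then re-check the SMART attributes for new reallocations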
What do you think about this article: http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/
That's what convinced me to use mirrors in the first place.
tl;dr - it's best ignored. If anyone wishes to disagree, I invite them to bankroll a real study on the matter, with real servers and real drives.
 