Do these drives have life left in them? (SMART errors, need an eye to check interpretation)

Stilez · Jan 21, 2018

Over the last week or so, 3 of my old 4TBs all started throwing errors in the messages log. The pool is fine - there's plenty of redundancy and I swapped them over in good time - so now the 3 of them are sitting on the side, waiting for me to figure out whether they're worth a hit with smartctl/badblocks, or a quick trip to the metal recyclers. It does seem, probably at least, to be the drives: when I swapped cables around, the errors stayed with the drive, not the cable. I used different baseboard ports and new, branded cables for all tests below.

I don't mind losing any or all of them. All 3 have outlived their expected lifetime and they weren't server drives, so they've managed particularly well. But I have a 4TB vdev configured as a 3-way mirror that's currently using 2 x 4TB + 1 x 6TB, and I'd really like to get the 6TB back if one or more of the pulled 4TBs has a few months more of working life :D

I don't need much help with actioning any tests or replacement, it's more about accurately interpreting what I see, and an experienced eye to check I'm not missing anything important.

This is what I know of each of the 3 drives:

Drive 1: Seagate desktop ST4000DM000, bought 2013 (serial ****DBVN, sharing ada6)
The drive started throwing persistent "8 unallocated/uncorrectable/pending sectors" a week ago (you know the message!). No panic, spare was ready, watched error count. The count didn't go up, it's remained at 8 for a long time. So I swapped the drive out of the pool and ran a smartctl long test. But by the time I could type the next line and check the test was running ("smartctl -c /dev/ada6"), it had stopped. The status given was "Self-test execution status: ( 121) The previous self-test completed having the read element of the test failed". The self-test log reports "Extended offline Completed: read failure 90%".

The LBA seemed pretty consistent in the log ("UNC at LBA = 0x0fffffff = 268435455") so I hit it with "badblocks -b 4096 -wsv -c 64 -p 10 /dev/ada6 268535455 268335455", and it completed with 0/0/0 errors every time. So I ran smartctl -t long again. Before I could type "smartctl -c", it seems it had already ended self-testing. I'm not sure how to explain what I'm apparently seeing - errors at a sector, as reported by smart, and a read error killing self-test, but badblocks finding no errors and no issue for 100k blocks either side of the reported LBA. So I don't know what to make of this drive.
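A note on units for anyone repeating the badblocks step: badblocks counts its last-block/first-block arguments in units of the -b block size, whereas the LBA in the smartctl error log is in logical sectors (512 bytes on this drive, per the "Sector Sizes" line in the output). With -b 4096 the two differ by a factor of 8. A minimal Python sketch of the conversion follows; the function names and the 100,000-block margin are illustrative choices, not anything provided by badblocks itself.

```python
LOGICAL_SECTOR_BYTES = 512     # assumption: taken from the smartctl "Sector Sizes" line
BADBLOCKS_BLOCK_BYTES = 4096   # the value passed to badblocks -b


def lba_to_badblocks_block(lba,
                           sector_bytes=LOGICAL_SECTOR_BYTES,
                           block_bytes=BADBLOCKS_BLOCK_BYTES):
    """Index of the block_bytes-sized block that contains the given LBA."""
    return (lba * sector_bytes) // block_bytes


def badblocks_range(lba, margin_blocks=100_000):
    """Return (last_block, first_block), the order badblocks takes them in."""
    centre = lba_to_badblocks_block(lba)
    first_block = max(0, centre - margin_blocks)
    last_block = centre + margin_blocks
    return last_block, first_block


reported_lba = 0x0FFFFFFF  # the LBA shown in the SMART error log
last_block, first_block = badblocks_range(reported_lba)
print(last_block, first_block)
```

For LBA 268435455 this gives block 33554431, so a range of 268535455 to 268335455 with -b 4096 covers an area roughly 1.1 TB into the disk rather than the region about 137 GB in that the logged LBA would point to.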

Drive 2: Seagate desktop ST4000DM000, bought 2014 (serial ****813L, sharing ada6)
Has thrown error 184 (end to end) 6 times, which apparently means its internal error checking (cache ram, controller, no idea what) is suspect. Sounds like a fairly fatal error and that the drive is dead on its feet. Am I correct, or is there any hope for it? I don't mind trying, but this sounds like an error that can't be fixed or trusted, and if so - disposal.

*Update* Since typing the above yesterday, I held off posting in order to run a long self test on it first anyway, on a different port/cable. Surprisingly, it completed without errors (smart log report states "SMART Error Log Version: 1 No Errors Logged", 3 hours ago). So now I don't know what to make of it. The NAS is still constantly logging error 184 in messages but I think that's just a reminder and that it doesn't look like new events have occurred for days or maybe even a week or so.

As event 184 has happened 6 times now according to the output, I'm guessing that there is definitely an internal fault in the component, is that correct? But equally it looks like it may be infrequent and stable, that the HDD's error detection is catching it for now, and ZFS of course double checks and has found no errors during a recent scrub. How nuts would it be to semi-trust the drive (especially backed with ZFS checksumming and tolerance for double failure on the vdev) for a while more, and keep an eye on whether the count increases, or is that foolish?
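For reference, the FAILING_NOW flag that smartctl prints against attribute 184 comes from comparing the normalized VALUE with THRESH, not from the raw count of 6. Below is a minimal Python sketch of that comparison, assuming the whitespace-separated column layout of the smartctl attribute table shown further down. The helper names are made up for illustration, and the rule used (value at or below a non-zero threshold) is my understanding of smartmontools' behaviour rather than something stated in this thread.

```python
def parse_attribute_row(row):
    """Split one row of the smartctl attribute table into named fields."""
    fields = row.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),    # normalized current value
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": " ".join(fields[9:]),
    }


def is_failing_now(attr):
    """Assumed rule: failing when VALUE is at or below a non-zero THRESH."""
    return attr["thresh"] > 0 and attr["value"] <= attr["thresh"]


# Rows copied from the Drive 2 output, with whitespace collapsed.
row_184 = "184 End-to-End_Error 0x0032 094 094 099 Old_age Always FAILING_NOW 6"
row_5 = "5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0"

for row in (row_184, row_5):
    attr = parse_attribute_row(row)
    print(attr["id"], attr["name"], "failing" if is_failing_now(attr) else "ok")
```

With these two rows, 184 is flagged because 094 is below 099, even though the raw count is only 6, while attribute 5 is not flagged.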

Drive 3: HGST Deskstar 7K4000 a.k.a. HDS724040ALE640, bought end 2013 (serial ****K67S, ada8)
Started throwing Interface CRC errors shortly after attaching to the pool, with a variety of not-dissimilar addresses/LBAs in the output. But after removing it and using a different port/cable, a long self-test completed without error. I didn't try a specific badblocks wipe as it wasn't clear whether the smartctl result is definitive of the device being ok, or what addresses to hit with badblocks given the output (see below). Instead I ran the entire drive through a single pass with "badblocks -b 4096 -wsv -c 64 /dev/ada8", which has been running 12 hours and will take about another 12 to finish, in case smartctl isn't telling me how it is ;) but so far the error count is 0/0/0 (I'll update if it changes). So I'm basically not sure what to make of this drive. Perhaps it's fine and it was the port or cable, and perhaps not. There are no errors in messages for ada8, which is also suggestive. How can I be sure?
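One way to approach the "how can I be sure" question for interface CRC errors is to note the raw UDMA_CRC_Error_Count before a test run and again afterwards. The counter is cumulative, so the existing 271 will stay; what matters is whether it keeps rising on the new port and cable. A small Python sketch, assuming two saved copies of the smartctl attribute table as text; the helper names are illustrative and the "after" reading is a made-up example value, not a measurement.

```python
def raw_value(smart_output, attribute_id):
    """Return the integer RAW_VALUE for one attribute, or None if not found."""
    for line in smart_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == str(attribute_id):
            return int(fields[9])
    return None


def crc_errors_since(before, after, attribute_id=199):
    """Number of new CRC errors logged between two snapshots."""
    old = raw_value(before, attribute_id)
    new = raw_value(after, attribute_id)
    if old is None or new is None:
        raise ValueError("attribute %d missing from a snapshot" % attribute_id)
    return new - old


# "before" is the row from the Drive 3 output; "after" is hypothetical.
before = "199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 271"
after = "199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 271"

print(crc_errors_since(before, after))
```

A difference of zero after a full badblocks pass on the new cable would point towards the old port or cable; a rising count would point back at the drive or its connector.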

Full smartctl -a output is below. Thanks in advance for taking the time to read and help!

Drive 1 smartctl (****DBVN, sharing ada6):

Code:

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Desktop HDD.15
Device Model:	 ST4000DM000-1F2168
Serial Number:	****DBVN
LU WWN Device Id: 5 000c50 ********
Firmware Version: CC52
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5900 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sun Jan 21 04:11:52 2018 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  ( 121) The previous self-test completed having
										the read element of the test failed.
Total time to complete Offline
data collection:				(  612) seconds.
Offline data collection
capabilities:					(0x73) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										No Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   1) minutes.
Extended self-test routine
recommended polling time:		( 529) minutes.
Conveyance self-test routine
recommended polling time:		(   2) minutes.
SCT capabilities:			  (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   117   099   006	Pre-fail  Always	   -	   147347016
  3 Spin_Up_Time			0x0003   092   091   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   097   097   020	Old_age   Always	   -	   3140
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   052   051   030	Pre-fail  Always	   -	   5278693935299
  9 Power_On_Hours		  0x0032   054   054   000	Old_age   Always	   -	   41170
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   171
183 Runtime_Bad_Block	   0x0032   099   099   000	Old_age   Always	   -	   1
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   092   092   000	Old_age   Always	   -	   8
188 Command_Timeout		 0x0032   100   052   000	Old_age   Always	   -	   0 0 51
189 High_Fly_Writes		 0x003a   001   001   000	Old_age   Always	   -	   128
190 Airflow_Temperature_Cel 0x0022   076   044   045	Old_age   Always   In_the_past 24 (0 8 29 23 0)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   130
193 Load_Cycle_Count		0x0032   001   001   000	Old_age   Always	   -	   265589
194 Temperature_Celsius	 0x0022   024   056   000	Old_age   Always	   -	   24 (0 12 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   8
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   8
199 UDMA_CRC_Error_Count	0x003e   200   189   000	Old_age   Always	   -	   81
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   27359h+43m+05.529s
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   548859535772
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   362273111179

SMART Error Log Version: 1
ATA Error Count: 8 (device log contains only the most recent five errors)
		CR = Command Register [HEX]
		FR = Features Register [HEX]
		SC = Sector Count Register [HEX]
		SN = Sector Number Register [HEX]
		CL = Cylinder Low Register [HEX]
		CH = Cylinder High Register [HEX]
		DH = Device/Head Register [HEX]
		DC = Device Command Register [HEX]
		ER = Error register [HEX]
		ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 8 occurred at disk power-on lifetime: 41082 hours (1711 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00	  14:04:16.193  READ FPDMA QUEUED
  ef 02 00 00 00 00 a0 00	  14:04:15.427  SET FEATURES [Enable write cache]
  00 00 00 00 00 00 00 ff	  14:04:15.196  NOP [Abort queued commands]
  60 00 80 ff ff ff 4f 00	  14:03:40.565  READ FPDMA QUEUED
  ef 02 00 00 00 00 a0 00	  14:03:39.797  SET FEATURES [Enable write cache]

Error 7 occurred at disk power-on lifetime: 41082 hours (1711 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00	  14:03:40.565  READ FPDMA QUEUED
  ef 02 00 00 00 00 a0 00	  14:03:39.797  SET FEATURES [Enable write cache]
  00 00 00 00 00 00 00 ff	  14:03:39.568  NOP [Abort queued commands]
  60 00 80 ff ff ff 4f 00	  14:03:02.895  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00	  14:03:02.856  READ FPDMA QUEUED

Error 6 occurred at disk power-on lifetime: 41082 hours (1711 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00	  14:03:02.895  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00	  14:03:02.856  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00	  14:03:01.166  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00	  14:03:01.157  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00	  14:02:33.768  READ FPDMA QUEUED

Error 5 occurred at disk power-on lifetime: 40670 hours (1694 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 a8 ff ff ff 4f 00   7d+15:21:16.198  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   7d+15:21:16.198  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   7d+15:21:16.197  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00   7d+15:21:16.163  READ LOG EXT
  60 00 a8 ff ff ff 4f 00   7d+15:21:12.735  READ FPDMA QUEUED

Error 4 occurred at disk power-on lifetime: 40670 hours (1694 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 a8 ff ff ff 4f 00   7d+15:21:12.735  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   7d+15:21:12.735  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   7d+15:21:12.734  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00   7d+15:21:12.674  READ LOG EXT
  60 00 a8 ff ff ff 4f 00   7d+15:21:09.250  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	 41169		 -
# 2  Extended offline	Completed: read failure	   90%	 41157		 -
# 3  Extended offline	Completed: read failure	   90%	 41157		 -
# 4  Extended offline	Completed: read failure	   90%	 41157		 -
# 5  Short offline	   Completed without error	   00%	 41076		 -
# 6  Short offline	   Completed without error	   00%	 40672		 -
# 7  Extended offline	Interrupted (host reset)	  00%	 37009		 -
# 8  Extended offline	Aborted by host			   90%	 37007		 -
# 9  Short offline	   Completed without error	   00%	 37006		 -
#10  Extended offline	Aborted by host			   90%	 37006		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Drive 2 smartctl (****813L, sharing ada6):

Code:

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Desktop HDD.15
Device Model:	 ST4000DM000-1F2168
Serial Number:	****813L
LU WWN Device Id: 5 000c50 ********
Firmware Version: CC51
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5900 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sun Jan 21 15:45:02 2018 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(  602) seconds.
Offline data collection
capabilities:					(0x73) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										No Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   1) minutes.
Extended self-test routine
recommended polling time:		( 541) minutes.
Conveyance self-test routine
recommended polling time:		(   2) minutes.
SCT capabilities:			  (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   108   099   006	Pre-fail  Always	   -	   14991216
  3 Spin_Up_Time			0x0003   092   092   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   098   098   020	Old_age   Always	   -	   2642
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   085   060   030	Pre-fail  Always	   -	   342676084
  9 Power_On_Hours		  0x0032   054   054   000	Old_age   Always	   -	   40858
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   133
183 Runtime_Bad_Block	   0x0032   095   095   000	Old_age   Always	   -	   5
184 End-to-End_Error		0x0032   094   094   099	Old_age   Always   FAILING_NOW 6
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   100   099   000	Old_age   Always	   -	   0 0 3
189 High_Fly_Writes		 0x003a   001   001   000	Old_age   Always	   -	   369
190 Airflow_Temperature_Cel 0x0022   077   046   045	Old_age   Always	   -	   23 (Min/Max 22/27)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   100
193 Load_Cycle_Count		0x0032   001   001   000	Old_age   Always	   -	   231382
194 Temperature_Celsius	 0x0022   023   054   000	Old_age   Always	   -	   23 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   5
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   34438h+01m+35.215s
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   462626101174
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   473185202346

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 40855		 -
# 2  Extended offline	Completed without error	   00%	 37665		 -
# 3  Short offline	   Completed without error	   00%	 37650		 -
# 4  Extended offline	Aborted by host			   90%	 37650		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Drive 3 smartctl (****K67S, ada8):

Code:

=== START OF INFORMATION SECTION ===
Model Family:	 Hitachi/HGST Deskstar 7K4000
Device Model:	 HGST HDS724040ALE640
Serial Number:	****K67S
LU WWN Device Id: 5 000cca ********
Firmware Version: MJAOA580
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	7200 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sun Jan 21 04:11:57 2018 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(   24) seconds.
Offline data collection
capabilities:					(0x5b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										No Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   1) minutes.
Extended self-test routine
recommended polling time:		( 569) minutes.
SCT capabilities:			  (0x003d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000b   100   100   016	Pre-fail  Always	   -	   0
  2 Throughput_Performance  0x0005   137   137   054	Pre-fail  Offline	  -	   79
  3 Spin_Up_Time			0x0007   127   127   024	Pre-fail  Always	   -	   613 (Average 602)
  4 Start_Stop_Count		0x0012   100   100   000	Old_age   Always	   -	   3627
  5 Reallocated_Sector_Ct   0x0033   100   100   005	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000b   100   100   067	Pre-fail  Always	   -	   0
  8 Seek_Time_Performance   0x0005   119   119   020	Pre-fail  Offline	  -	   35
  9 Power_On_Hours		  0x0012   096   096   000	Old_age   Always	   -	   33498
 10 Spin_Retry_Count		0x0013   100   100   060	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   259
192 Power-Off_Retract_Count 0x0032   082   082   000	Old_age   Always	   -	   22432
193 Load_Cycle_Count		0x0012   082   082   000	Old_age   Always	   -	   22432
194 Temperature_Celsius	 0x0002   200   200   000	Old_age   Always	   -	   30 (Min/Max 20/58)
196 Reallocated_Event_Count 0x0032   100   100   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0022   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0008   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x000a   200   200   000	Old_age   Always	   -	   271

SMART Error Log Version: 1
ATA Error Count: 271 (device log contains only the most recent five errors)
		CR = Command Register [HEX]
		FR = Features Register [HEX]
		SC = Sector Count Register [HEX]
		SN = Sector Number Register [HEX]
		CL = Cylinder Low Register [HEX]
		CH = Cylinder High Register [HEX]
		DH = Device/Head Register [HEX]
		DC = Device Command Register [HEX]
		ER = Error register [HEX]
		ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 271 occurred at disk power-on lifetime: 33131 hours (1380 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 31 3f 8f 60 00  Error: ICRC, ABRT at LBA = 0x00608f3f = 6328127

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 70 70 93 60 40 00	  00:00:54.003  READ FPDMA QUEUED
  60 00 68 70 92 60 40 00	  00:00:54.003  READ FPDMA QUEUED
  60 00 60 70 91 60 40 00	  00:00:54.003  READ FPDMA QUEUED
  60 00 58 70 90 60 40 00	  00:00:54.003  READ FPDMA QUEUED
  60 00 50 70 8f 60 40 00	  00:00:54.003  READ FPDMA QUEUED

Error 270 occurred at disk power-on lifetime: 33122 hours (1380 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 51 1f 9b 60 00  Error: ICRC, ABRT at LBA = 0x00609b1f = 6331167

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 90 70 9b 60 40 00   5d+03:42:37.133  READ FPDMA QUEUED
  60 00 88 70 9a 60 40 00   5d+03:42:37.133  READ FPDMA QUEUED
  60 00 80 70 99 60 40 00   5d+03:42:37.133  READ FPDMA QUEUED
  60 00 78 70 98 60 40 00   5d+03:42:37.133  READ FPDMA QUEUED
  60 00 70 70 97 60 40 00   5d+03:42:37.133  READ FPDMA QUEUED

Error 269 occurred at disk power-on lifetime: 33083 hours (1378 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 e9 0f 08 64 00  Error: ICRC, ABRT at LBA = 0x0064080f = 6555663

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 68 e0 0b 64 40 00   3d+13:13:39.139  READ FPDMA QUEUED
  60 f8 60 e8 0a 64 40 00   3d+13:13:39.139  READ FPDMA QUEUED
  60 f8 58 f0 09 64 40 00   3d+13:13:39.139  READ FPDMA QUEUED
  60 f8 50 f8 08 64 40 00   3d+13:13:39.139  READ FPDMA QUEUED
  60 f8 48 00 08 64 40 00   3d+13:13:39.139  READ FPDMA QUEUED

Error 268 occurred at disk power-on lifetime: 33083 hours (1378 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 d9 1f 08 64 00  Error: ICRC, ABRT at LBA = 0x0064081f = 6555679

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 48 e0 0b 64 40 00   3d+13:13:37.997  READ FPDMA QUEUED
  60 f8 40 e8 0a 64 40 00   3d+13:13:37.997  READ FPDMA QUEUED
  60 f8 38 f0 09 64 40 00   3d+13:13:37.997  READ FPDMA QUEUED
  60 f8 30 f8 08 64 40 00   3d+13:13:37.997  READ FPDMA QUEUED
  60 f8 28 00 08 64 40 00   3d+13:13:37.997  READ FPDMA QUEUED

Error 267 occurred at disk power-on lifetime: 32607 hours (1358 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 71 87 c4 19 0f  Error: ICRC, ABRT at LBA = 0x0f19c487 = 253346951

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 c8 f8 c3 19 40 00	  17:10:21.881  READ FPDMA QUEUED
  60 00 c0 f8 c2 19 40 00	  17:10:21.881  READ FPDMA QUEUED
  60 00 b8 f8 c1 19 40 00	  17:10:21.880  READ FPDMA QUEUED
  60 00 b0 f8 c0 19 40 00	  17:10:21.879  READ FPDMA QUEUED
  60 00 a8 f8 bf 19 40 00	  17:10:21.878  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 33494		 -
# 2  Short offline	   Completed without error	   00%	 33404		 -
# 3  Short offline	   Completed without error	   00%	 29911		 -
# 4  Short offline	   Completed without error	   00%	 29492		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Johnnie Black · Jan 21, 2018

Stilez said:
"UNC at LBA = 0x0fffffff = 268435455")

That's a bogus sector, it would be past the end of the disk, reported by Seagates when there's an uncorrected error without a real sector error, and since it's also failing the extended SMART test it should be replaced.
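The number itself can be checked against the register dump in the error log: the 28-bit LBA shown there is assembled from the low nibble of DH followed by CH, CL and SN. A short Python sketch of that decoding, with a function name chosen purely for illustration:

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from the ATA error-log register bytes."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


MAX_LBA28 = (1 << 28) - 1  # largest value a 28-bit field can hold

# Drive 1, every logged error: SN CL CH DH = ff ff ff 0f
seagate_lba = lba28_from_registers(0xFF, 0xFF, 0xFF, 0x0F)
# Drive 3, error 271: SN CL CH DH = 3f 8f 60 00
hgst_lba = lba28_from_registers(0x3F, 0x8F, 0x60, 0x00)

print(seagate_lba, seagate_lba == MAX_LBA28)
print(hgst_lba)
```

Every Drive 1 entry decodes to the same all-ones value, 268435455, which is the largest number the 28-bit field can hold, whereas the Drive 3 entries decode to varied addresses such as 6328127.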

Disk2 should also be replaced, end-to-end error means there's internal problems that could compromise the data integrity.

Disk3 looks in perfect health, UDMA_CRC errors are usually the result of a bad connection, most times a bad SATA cable.

Stilez · Jan 21, 2018

Johnnie Black said:
That's a bogus sector, it would be past the end of the disk, reported by Seagates when there's an uncorrected error without a real sector error, and since it's also failing the extended SMART test it should be replaced.

Disk2 should also be replaced, end-to-end error means there's internal problems that could compromise the data integrity.

Disk3 looks in perfect health, UDMA_CRC errors are usually the result of a bad connection, most times a bad SATA cable.

Perfect information, thank you!

2 disks recycling, 1 resilvering, and the 6TB detached for reuse elsewhere. Thanks again!

(Bogus sectors. Winner of the 1832 "Bad Ideas People Thought Were Good" Platinum Awards. Because accurate information helps. Can you tell I'm unimpressed? :p)

wblock · Jan 22, 2018

The Seagates are, well, Seagates. But what happened to that poor Hitachi to give it 22,000 power-off retracts?

Johnnie Black · Jan 22, 2018

wblock said:
But what happened to that poor Hitachi to give it 22,000 power-off retracts?

It's apparently common for some Hitachi drives, not sure where it comes from, I have one with a similar high number that I can't explain, and it's not from sleeping, as sleep also increases the start_stop count.

Code:

########## SMART status report for ada3 drive (Hitachi Ultrastar A7K2000: BEGW16WW) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  4 Start_Stop_Count		0x0012   100   100   000	Old_age   Always	   -	   99  
  9 Power_On_Hours		  0x0012   095   095   000	Old_age   Always	   -	   39652
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   92
192 Power-Off_Retract_Count 0x0032   051   051   000	Old_age   Always	   -	   59561
193 Load_Cycle_Count		0x0012   051   051   000	Old_age   Always	   -	   59561

To the OP, one thing I forgot to mention is to keep an eye on the temps, at some point in the past the disk got way too hot:

194 Temperature_Celsius 0x0002 200 200 000 Old_age Always - 30 (Min/Max 20/58)

Stilez · Jan 23, 2018

Johnnie Black said:
To the OP, one thing I forgot to mention is to keep an eye on the temps, at some point in the past the disk got way too hot

You're right, they did - and they never will again. I gave up looking for what I wanted online, and just built a decent external fanned enclosure in December. Now they are all mounted vertically (for heat escape) with 15mm anti-vibration buffers, 20mm drive separation (airflow), forced cooling with a run of high static pressure 120mm fans from below, and enough free space above/below the fans for the disks not to impede the airflow. Also trivially accessible/pullable. Now they sit around 27-30C and on a busy day sometimes get up to 35C - but the NAS has to be doing a lot to get there.

wblock said:
The Seagates are, well, Seagates. But what happened to that poor Hitachi to give it 22,000 power-off retracts?

I have no idea about the Hitachi either. I know that there was some legal issue a few years back where some company hit others with a lawsuit about AAM/APM and as a result drives couldn't be parked properly. There was also an issue with a couple of drives that kept parking repeatedly on very short idle times of 3 - 6 seconds even in use (click---click---click) and needed software to control/prevent it. I don't recollect HGST/Hitachi being affected by either of those, but it was in a server with the drive that was affected, so perhaps the software used to enforce parking on the other drives somehow ended up causing excessive cycling on the Hitachi. Unlikely though, because the same server also had the 2 Seagates for much of their lives, until they were moved to the FreeNAS server, and they have low cycle counts. So I don't really have any idea.

Although aware of the Backblaze and other posts about Seagate, in fairness I've never hit reliability issues with Seagate myself. The few times I've used it, their warranty service has consistently been very supportive and quick, which I value. I even had 2 drives of the exact kind in the current lawsuit (ST3000DM's) and they both outlasted their warranties by years not months - the last one died just 3 months ago. The disks in this post were budget 4TBs but both lasted around 4 - 4.5 years (H2 2013/early 2014) compared to a warranty of 2. The two 1TB Constellations I bought in 2010 as a RAID1 pair are still working in my desktop nearly 8 years on. Maybe it's classic YMMV at a guess? I did notice the concerns but they seem to have passed me by. Lucky me? :D
