Is one of my drives failing?

AndrewParsons · Oct 12, 2016

Hey team,

Recently I have been getting a red alert on the da1 drive. I do have a replacement drive, but I am scared to swap the drive out since I am only running RAIDZ1 (media server) and I have heard some horror stories. Below is my smartctl -a output and a snippet from my daily security email.

Code:

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD40EFRX-68WT0N0
Serial Number:	WD-WCC4E4YSVP2Z
LU WWN Device Id: 5 0014ee 20d42354d
Firmware Version: 82.00A82
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Wed Oct 12 11:14:15 2016 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(53160) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 532) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x703d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   199   051	Pre-fail  Always	   -	   1918
  3 Spin_Up_Time			0x0027   199   188   021	Pre-fail  Always	   -	   7033
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   14
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   097   097   000	Old_age   Always	   -	   2613
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   14
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   12
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   20
194 Temperature_Celsius	 0x0022   108   102   000	Old_age   Always	   -	   44
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   93

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	  2605		 -
# 2  Extended offline	Completed: read failure	   40%	  2515		 269778072
# 3  Short offline	   Completed without error	   00%	  2437		 -
# 4  Short offline	   Completed without error	   00%	  2222		 -
# 5  Extended offline	Completed: read failure	   10%	  2135		 2902720392
# 6  Short offline	   Completed without error	   00%	  2054		 -
# 7  Short offline	   Completed without error	   00%	  1886		 -
# 8  Extended offline	Completed without error	   00%	  1801		 -
# 9  Short offline	   Completed without error	   00%	  1718		 -
#10  Short offline	   Completed without error	   00%	  1636		 -
#11  Short offline	   Completed without error	   00%	  1478		 -
#12  Extended offline	Completed without error	   00%	  1394		 -
#13  Short offline	   Completed without error	   00%	  1311		 -
#14  Short offline	   Completed without error	   00%	  1143		 -
#15  Short offline	   Completed without error	   00%	  1083		 -
#16  Extended offline	Completed: read failure	   20%	  1054		 1540799104
#17  Short offline	   Completed without error	   00%	   975		 -
#18  Short offline	   Completed without error	   00%	   735		 -
#19  Extended offline	Completed without error	   00%	   661		 -
#20  Short offline	   Completed without error	   00%	   578		 -
#21  Short offline	   Completed without error	   00%	   411		 -
1 of 3 failed self-tests are outdated by newer successful extended offline self-test # 8

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Below is a snippet from my security output as well.

Code:

freenas.local kernel log messages:

>   (da1:mps0:0:1:0): READ(10). CDB: 28 00 71 33 1c d0 00 00 08 00 length 4096 SMID 523 terminated ioc 804b scsi 0 state 0 xfer 0

>   (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 01 02 82 55 60 00 00

> 01 00 00 00 length 131072 SMID 917 terminated ioc 804b

> s(da1:mps0:0:1:0): READ(10). CDB: 28 00 71 33 1c d0 00 00 08 00 csi 0

> state 0 xfer 0

> (da1:mps0:0:1:0): CAM status: CCB request completed with an error

>   (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 01 02 82 56 60 00 00

> 01 00 00 00 length 131072 SMID 911 terminated ioc 804b s(da1:csi 0

> state 0 xfer 0

> mps0:0:1:0): Retrying command

> (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 01 02 82 55 60 00 00

> 01 00 00 00

> (da1:mps0:0:1:0): CAM status: CCB request completed with an error

> (da1:mps0:0:1:0): Retrying command

> (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 01 02 82 56 60 00 00

> 01 00 00 00

> (da1:mps0:0:1:0): CAM status: CCB request completed with an error

> (da1:mps0:0:1:0): Retrying command

> (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 01 02 82 54 78 00 00

> 00 e8 00 00

> (da1:mps0:0:1:0): CAM status: SCSI Status Error

> (da1:mps0:0:1:0): SCSI status: Check Condition

> (da1:mps0:0:1:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read

> error)

> (da1:mps0:0:1:0): Info: 0x102825478

> (da1:mps0:0:1:0): Error 5, Unretryable error


-- End of security output --

Please let me know your thoughts on the drive and my best course of action to replace the drive if needed. Thank you in advance.
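For anyone reading the kernel log above: the CDB bytes that CAM prints encode the starting LBA and block count of each failed read. Here is a minimal Python sketch, assuming the standard SCSI READ(10) and READ(16) command layouts (the function name `decode_read_cdb` is made up for illustration). It suggests the last READ(16) starts at the same LBA reported on the `Info: 0x102825478` line.

```python
def decode_read_cdb(cdb_hex: str) -> tuple:
    """Return (starting LBA, block count) for a READ(10) or READ(16) CDB.

    cdb_hex is the space-separated hex byte string as printed in the log.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    opcode = cdb[0]
    if opcode == 0x28:
        # READ(10): 4-byte LBA in bytes 2-5, 2-byte transfer length in bytes 7-8.
        return int.from_bytes(cdb[2:6], "big"), int.from_bytes(cdb[7:9], "big")
    if opcode == 0x88:
        # READ(16): 8-byte LBA in bytes 2-9, 4-byte transfer length in bytes 10-13.
        return int.from_bytes(cdb[2:10], "big"), int.from_bytes(cdb[10:14], "big")
    raise ValueError(f"not a READ(10)/READ(16) CDB: opcode {opcode:#x}")


# The last READ(16) in the log, with its wrapped second line joined back on.
lba, blocks = decode_read_cdb("88 00 00 00 00 01 02 82 54 78 00 00 00 e8 00 00")
print(hex(lba), blocks)
```

The same function handles the READ(10) lines, for example `decode_read_cdb("28 00 71 33 1c d0 00 00 08 00")`.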

anodos · Oct 12, 2016

SMART info doesn't have obvious indications of impending doom. Perhaps try replacing the SATA cable and using a different power connector. Is this connected to an HBA or directly to the motherboard? What type of PSU? Give full hardware specs.

AndrewParsons · Oct 12, 2016

anodos,

Thank you for your fast reply.

Specs
Supermicro X10SL7-F uATX DDR3 1600 LGA 1150 Motherboard (controller is flashed)
Build FreeNAS-9.10.1-U2 (f045a8b)
Platform Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz
Memory 32697MB
(9) 4 TB WD Reds for my media volume
(2) Kingston 120 GB SSDs for my jail volume
All drives are connected directly to the motherboard through SAS and SATA3.

During a previous scrub (not the most recent, and I do not know how to go through past scrubs) there was a small amount of bad data on the da1 drive that the system was able to recover.

Please let me know if you need any more info.

Thank you in advance for helping me.

m0nkey_ · Oct 12, 2016

AndrewParsons said:
194 Temperature_Celsius 0x0022 108 102 000 Old_age Always - 44

I would say the drive is a little toasty for my liking. Can you also post the output of smartctl -x /dev/da1 as well?

AndrewParsons · Oct 12, 2016

Thank you for your help. Below is the output from smartctl -x /dev/da1:

Code:

[root@freenas] ~# smartctl -x /dev/da1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD40EFRX-68WT0N0
Serial Number:	WD-WCC4E4YSVP2Z
LU WWN Device Id: 5 0014ee 20d42354d
Firmware Version: 82.00A82
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Wed Oct 12 13:05:22 2016 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(53160) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 532) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x703d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 POSR-K   200   199   051	-	1918
  3 Spin_Up_Time			POS--K   199   188   021	-	7033
  4 Start_Stop_Count		-O--CK   100   100   000	-	14
  5 Reallocated_Sector_Ct   PO--CK   200   200   140	-	0
  7 Seek_Error_Rate		 -OSR-K   200   200   000	-	0
  9 Power_On_Hours		  -O--CK   097   097   000	-	2615
 10 Spin_Retry_Count		-O--CK   100   253   000	-	0
 11 Calibration_Retry_Count -O--CK   100   253   000	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   000	-	14
192 Power-Off_Retract_Count -O--CK   200   200   000	-	12
193 Load_Cycle_Count		-O--CK   200   200   000	-	20
194 Temperature_Celsius	 -O---K   108   102   000	-	44
196 Reallocated_Event_Count -O--CK   200   200   000	-	0
197 Current_Pending_Sector  -O--CK   200   200   000	-	0
198 Offline_Uncorrectable   ----CK   100   253   000	-	0
199 UDMA_CRC_Error_Count	-O--CK   200   200   000	-	0
200 Multi_Zone_Error_Rate   ---R--   200   200   000	-	93
							||||||_ K auto-keep
							|||||__ C event count
							||||___ R error rate
							|||____ S speed/performance
							||_____ O updated online
							|______ P prefailure warning

General Purpose Log Directory Version 1
SMART		   Log Directory Version 1 [multi-sector log support]
Address	Access  R/W   Size  Description
0x00	   GPL,SL  R/O	  1  Log Directory
0x01		   SL  R/O	  1  Summary SMART error log
0x02		   SL  R/O	  5  Comprehensive SMART error log
0x03	   GPL	 R/O	  6  Ext. Comprehensive SMART error log
0x06		   SL  R/O	  1  SMART self-test log
0x07	   GPL	 R/O	  1  Extended self-test log
0x09		   SL  R/W	  1  Selective self-test log
0x10	   GPL	 R/O	  1  SATA NCQ Queued Error log
0x11	   GPL	 R/O	  1  SATA Phy Event Counters log
0x21	   GPL	 R/O	  1  Write stream error log
0x22	   GPL	 R/O	  1  Read stream error log
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS	  16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS	   1  Device vendor specific log
0xb7	   GPL,SL  VS	  39  Device vendor specific log
0xbd	   GPL,SL  VS	   1  Device vendor specific log
0xc0	   GPL,SL  VS	   1  Device vendor specific log
0xc1	   GPL	 VS	  93  Device vendor specific log
0xe0	   GPL,SL  R/W	  1  SCT Command/Status
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 32 (device log contains only the most recent 24 errors)
		CR	 = Command Register
		FEATR  = Features Register
		COUNT  = Count (was: Sector Count) Register
		LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
		LH	 = LBA High (was: Cylinder High) Register	]   LBA
		LM	 = LBA Mid (was: Cylinder Low) Register	  ] Register
		LL	 = LBA Low (was: Sector Number) Register	 ]
		DV	 = Device (was: Device/Head) Register
		DC	 = Device Control Register
		ER	 = Error register
		ST	 = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 32 [7] occurred at disk power-on lifetime: 2611 hours (108 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 74 9b 29 e0 40 00  Error: UNC at LBA = 0x749b29e0 = 1956325856

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 20 00 00 00 00 74 9b 29 d8 40 00 20d+20:15:35.482  READ FPDMA QUEUED
  60 00 08 00 00 00 00 74 9b 3f e0 40 00 20d+20:15:35.457  READ FPDMA QUEUED
  60 00 18 00 00 00 00 74 90 44 e8 40 00 20d+20:15:35.441  READ FPDMA QUEUED
  60 00 18 00 00 00 00 72 1d 2d 00 40 00 20d+20:15:35.401  READ FPDMA QUEUED
  60 00 08 00 00 00 00 72 15 16 a0 40 00 20d+20:15:35.381  READ FPDMA QUEUED

Error 31 [6] occurred at disk power-on lifetime: 2605 hours (108 days + 13 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 19 5e 17 9f 40 00  Error: UNC at LBA = 0x1195e179f = 4720564127

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 00 00 01 19 5e 17 e8 40 00 20d+14:44:42.561  READ FPDMA QUEUED
  60 01 00 00 18 00 01 19 5e 16 e8 40 00 20d+14:44:42.527  READ FPDMA QUEUED
  60 01 00 00 08 00 01 19 5e 15 e8 40 00 20d+14:44:42.526  READ FPDMA QUEUED
  60 01 00 00 00 00 01 19 5e 14 e8 40 00 20d+14:44:42.525  READ FPDMA QUEUED
  60 01 00 00 18 00 01 19 5e 13 e8 40 00 20d+14:44:42.191  READ FPDMA QUEUED

Error 30 [5] occurred at disk power-on lifetime: 2605 hours (108 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 19 34 49 60 40 00  Error: WP at LBA = 0x119344960 = 4717824352

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 10 00 10 00 01 19 c7 4f 10 40 00 20d+14:41:24.171  WRITE FPDMA QUEUED
  61 00 20 00 10 00 01 a3 df 73 b0 40 00 20d+14:41:24.171  WRITE FPDMA QUEUED
  61 00 20 00 10 00 01 55 d0 bd 88 40 00 20d+14:41:24.171  WRITE FPDMA QUEUED
  61 00 08 00 10 00 01 19 c7 4f 20 40 00 20d+14:41:24.171  WRITE FPDMA QUEUED
  61 00 10 00 10 00 01 19 c7 4f 10 40 00 20d+14:41:24.171  WRITE FPDMA QUEUED

Error 29 [4] occurred at disk power-on lifetime: 2605 hours (108 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 19 34 46 77 40 00  Error: WP at LBA = 0x119344677 = 4717823607

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 08 00 01 55 d0 bd 88 40 00 20d+14:41:17.167  WRITE FPDMA QUEUED
  60 01 00 00 18 00 01 19 34 48 30 40 00 20d+14:41:17.167  READ FPDMA QUEUED
  60 01 00 00 00 00 01 19 34 47 30 40 00 20d+14:41:17.166  READ FPDMA QUEUED
  60 01 00 00 10 00 01 19 34 46 30 40 00 20d+14:41:15.536  READ FPDMA QUEUED
  61 00 10 00 08 00 01 55 d0 bd 78 40 00 20d+14:41:09.575  WRITE FPDMA QUEUED

Error 28 [3] occurred at disk power-on lifetime: 2605 hours (108 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 19 33 e9 48 40 00  Error: WP at LBA = 0x11933e948 = 4717799752

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 30 00 01 d1 c0 bd 98 40 00 20d+14:40:40.655  WRITE FPDMA QUEUED
  61 00 08 00 28 00 01 d1 c0 bb 98 40 00 20d+14:40:40.655  WRITE FPDMA QUEUED
  61 00 08 00 20 00 00 00 40 03 98 40 00 20d+14:40:40.655  WRITE FPDMA QUEUED
  61 00 08 00 18 00 00 00 40 01 98 40 00 20d+14:40:40.655  WRITE FPDMA QUEUED
  60 01 00 00 10 00 01 19 33 e8 80 40 00 20d+14:40:40.654  READ FPDMA QUEUED

Error 27 [2] occurred at disk power-on lifetime: 2605 hours (108 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 19 b4 01 20 40 00  Error: WP at LBA = 0x119b40120 = 4726194464

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 10 00 18 00 01 19 c7 46 88 40 00 20d+14:37:04.789  WRITE FPDMA QUEUED
  60 00 e0 00 28 00 01 19 b4 02 58 40 00 20d+14:37:04.789  READ FPDMA QUEUED
  60 00 e0 00 20 00 01 19 b3 fe b8 40 00 20d+14:37:04.789  READ FPDMA QUEUED
  61 00 08 00 18 00 01 19 c7 46 80 40 00 20d+14:37:04.789  WRITE FPDMA QUEUED
  60 00 10 00 10 00 01 d1 c0 bc 90 40 00 20d+14:37:04.789  READ FPDMA QUEUED

Error 26 [1] occurred at disk power-on lifetime: 2605 hours (108 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 19 b3 fd 70 40 00  Error: UNC at LBA = 0x119b3fd70 = 4726193520

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 e0 00 18 00 01 19 b3 fe b8 40 00 20d+14:36:58.963  READ FPDMA QUEUED
  61 00 08 00 00 00 01 19 c7 46 80 40 00 20d+14:36:58.963  WRITE FPDMA QUEUED
  61 00 10 00 18 00 00 00 40 02 90 40 00 20d+14:36:56.553  WRITE FPDMA QUEUED
  60 00 10 00 28 00 01 d1 c0 bc 90 40 00 20d+14:36:56.546  READ FPDMA QUEUED
  60 00 10 00 20 00 01 d1 c0 ba 90 40 00 20d+14:36:56.546  READ FPDMA QUEUED

Error 25 [0] occurred at disk power-on lifetime: 2605 hours (108 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 19 b3 fc d8 40 00  Error: UNC at LBA = 0x119b3fcd8 = 4726193368

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 20 00 10 00 01 19 b3 fd 38 40 00 20d+14:36:52.875  READ FPDMA QUEUED
  60 00 20 00 08 00 01 19 b3 fc f8 40 00 20d+14:36:52.875  READ FPDMA QUEUED
  60 00 20 00 00 00 01 19 b3 fc d8 40 00 20d+14:36:52.874  READ FPDMA QUEUED
  60 00 20 00 00 00 01 19 b3 fc b8 40 00 20d+14:36:52.874  READ FPDMA QUEUED
  60 00 20 00 00 00 01 19 b3 fc 98 40 00 20d+14:36:52.874  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	  2605		 -
# 2  Extended offline	Completed: read failure	   40%	  2515		 4564745368
# 3  Short offline	   Completed without error	   00%	  2437		 -
# 4  Short offline	   Completed without error	   00%	  2222		 -
# 5  Extended offline	Completed: read failure	   10%	  2135		 7197687688
# 6  Short offline	   Completed without error	   00%	  2054		 -
# 7  Short offline	   Completed without error	   00%	  1886		 -
# 8  Extended offline	Completed without error	   00%	  1801		 -
# 9  Short offline	   Completed without error	   00%	  1718		 -
#10  Short offline	   Completed without error	   00%	  1636		 -
#11  Short offline	   Completed without error	   00%	  1478		 -
#12  Extended offline	Completed without error	   00%	  1394		 -
#13  Short offline	   Completed without error	   00%	  1311		 -
#14  Short offline	   Completed without error	   00%	  1143		 -
#15  Short offline	   Completed without error	   00%	  1083		 -
#16  Extended offline	Completed: read failure	   20%	  1054		 5835766400
#17  Short offline	   Completed without error	   00%	   975		 -
#18  Short offline	   Completed without error	   00%	   735		 -
1 of 3 failed self-tests are outdated by newer successful extended offline self-test # 8

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:				  3
SCT Version (vendor specific):	   258 (0x0102)
SCT Support Level:				   1
Device State:						Active (0)
Current Temperature:					44 Celsius
Power Cycle Min/Max Temperature:	 43/50 Celsius
Lifetime	Min/Max Temperature:	 23/50 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:	 2
Temperature Sampling Period:		 1 minute
Temperature Logging Interval:		1 minute
Min/Max recommended Temperature:	  0/60 Celsius
Min/Max Temperature Limit:		   -41/85 Celsius
Temperature History Size (Index):	478 (53)

Index	Estimated Time   Temperature Celsius
  54	2016-10-12 05:08	44  *************************
 ...	..( 98 skipped).	..  *************************
 153	2016-10-12 06:47	44  *************************
 154	2016-10-12 06:48	43  ************************
 ...	..(118 skipped).	..  ************************
 273	2016-10-12 08:47	43  ************************
 274	2016-10-12 08:48	44  *************************
 ...	..(256 skipped).	..  *************************
  53	2016-10-12 13:05	44  *************************

SCT Error Recovery Control:
		   Read:	 70 (7.0 seconds)
		  Write:	 70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID	  Size	 Value  Description
0x0001  2			0  Command failed due to ICRC error
0x0002  2			0  R_ERR response for data FIS
0x0003  2			0  R_ERR response for device-to-host data FIS
0x0004  2			0  R_ERR response for host-to-device data FIS
0x0005  2			0  R_ERR response for non-data FIS
0x0006  2			0  R_ERR response for device-to-host non-data FIS
0x0007  2			0  R_ERR response for host-to-device non-data FIS
0x0008  2			0  Device-to-host non-data FIS retries
0x0009  2			5  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2			6  Device-to-host register FISes sent due to a COMRESET
0x000b  2			0  CRC errors within host-to-device FIS
0x000f  2			0  R_ERR response for host-to-device data FIS, CRC
0x0012  2			0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4	  1816538  Vendor specific

[root@freenas] ~#

Please let me know if you need anything else. Thanks again.
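One thing that can be confusing when comparing the two outputs: the failed-test LBAs in the `smartctl -a` self-test log do not match the ones in the `-x` extended self-test log. The values differ by exactly 2^32, which would be consistent with the older log format only holding the low 32 bits of the LBA. That interpretation is an assumption; the numbers are copied straight from the two logs above. A quick Python check:

```python
# (test number, LBA from `smartctl -a`, LBA from `smartctl -x`), copied from the logs above.
failed_tests = [
    (2, 269778072, 4564745368),
    (5, 2902720392, 7197687688),
    (16, 1540799104, 5835766400),
]

for num, lba_short, lba_ext in failed_tests:
    # Keep only the low 32 bits of the extended-log LBA and compare.
    low32 = lba_ext & 0xFFFFFFFF
    print(f"test #{num}: extended={lba_ext} low32={low32} matches={low32 == lba_short}")
```

If that reading is right, the `-x` values are the ones to trust when working out where on the disk the read failures happened.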

anodos · Oct 12, 2016

9-wide RAIDZ1? Sounds like it might be a good time to make sure you've backed up all the things, and then re-think how you're doing your pool (might be a good 'come to God' moment).

What type of PSU?

A couple of random thoughts:

- Try swapping things around (power / SATA cable) and see if you can finish a long SMART test. If this doesn't fix the issue, then try to RMA the drive.
- Transition to RAIDZ2

Ericloewe · Oct 12, 2016

anodos said:
SMART info doesn't have obvious indications of impending doom. Perhaps try replacing the SATA cable and using a different power connector. Is this connected to an HBA or directly to the motherboard? What type of PSU? Give full hardware specs.

m0nkey_ said:
I would say the drive is a little toasty for my liking. Can you also post the output of smartctl -x /dev/da1 as well?

Uhmm, guys? The drive has a significant Multi_Zone_Error_Rate, it's well on its way to the bit bucket in the sky. Raw read error rate is also abnormally high.
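To make it harder to miss a non-zero raw value in a long attribute table, something like the following Python sketch could pull the raw values out of `smartctl -a` output. The sample rows are copied from the output earlier in the thread; the function name `raw_values` and the `WATCHED` list of attribute IDs are assumptions for illustration, not an official list of failure indicators.

```python
# A few rows copied from the `smartctl -a` attribute table earlier in the thread.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   199   051    Pre-fail  Always       -       1918
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       93
"""

# Hypothetical watch list: attribute IDs whose raw value should normally stay at 0.
WATCHED = {5, 197, 198, 200}


def raw_values(table: str) -> dict:
    """Map attribute ID -> (name, raw value) for each row of the table."""
    out = {}
    for line in table.splitlines():
        fields = line.split()
        # A data row has ten columns and starts with the numeric attribute ID.
        if len(fields) >= 10 and fields[0].isdigit():
            out[int(fields[0])] = (fields[1], int(fields[9]))
    return out


attrs = raw_values(SAMPLE)
flagged = {name: raw for attr_id, (name, raw) in attrs.items()
           if attr_id in WATCHED and raw > 0}
print(flagged)
```

With these sample rows, only Multi_Zone_Error_Rate ends up in `flagged`.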

AndrewParsons · Oct 12, 2016

I have an EVGA SuperNOVA 550 G2 80 Plus Gold Rated, Fully Modular ATX 12V/EPS 12V ECO Mode Power Supply 220-G2-0550-Y1

Okay, so I think it is pretty clear that da1 needs to be replaced. Based on my setup, is there anything special I have to do besides taking the drive offline, then powering down and replacing it?

Any info would be much appreciated, and thank you for all your help so far.

Ericloewe · Oct 12, 2016

If you have a spare port, resilver the new drive with the old one in place, to minimize risk. Just follow the manual's instructions.

anodos · Oct 12, 2016

Ericloewe said:
Uhmm, guys? The drive has a significant Multi_Zone_Error_Rate, it's well on its way to the bit bucket in the sky. Raw read error rate is also abnormally high.

D'oh I missed that. Side effect of using a mobile phone.

AndrewParsons · Oct 12, 2016

I cannot find any information in the manual about resilvering with the failing drive in place. Is it the same as if I pulled the bad drive out? Do I still take it offline, or do I expand the pool? Thank you.

joeschmuck · Oct 12, 2016

Yeah, I was just going to point this out. The drive is failing. If you see Multi_Zone_Error_Rate errors alongside some other failure, even if it's not the typical hard failure, it still means the drive is failing. In this case, the other failure is the Extended tests which wouldn't pass.

The ID 1 value doesn't mean anything for most drives. Sometimes it's as simple as the drive reading extra blocks of data to cache them in case they are requested; then the next request comes along for different data, and now you have a read error because the data isn't what was requested.
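The failed Extended tests are easy to overlook between all the passing Short tests, so here is a rough Python sketch that picks them out of the self-test log. The sample lines are taken from the `smartctl -x` output earlier in the thread; the regular expression and the function name `failed_tests` are assumptions based on that one log layout and may need adjusting for other drives or smartctl versions.

```python
import re

# A few lines copied from the extended self-test log earlier in the thread.
SAMPLE = """\
# 1  Short offline       Completed without error       00%      2605         -
# 2  Extended offline    Completed: read failure       40%      2515         4564745368
# 5  Extended offline    Completed: read failure       10%      2135         7197687688
# 8  Extended offline    Completed without error       00%      1801         -
#16  Extended offline    Completed: read failure       20%      1054         5835766400
"""

ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<kind>Short|Extended) offline\s+"
    r"(?P<status>Completed[^0-9]*?)\s+(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+(?P<lba>\d+|-)\s*$"
)


def failed_tests(log: str) -> list:
    """Return (test number, kind, power-on hours, first bad LBA) for failed tests."""
    failures = []
    for line in log.splitlines():
        m = ROW.match(line)
        if m and "failure" in m["status"]:
            lba = None if m["lba"] == "-" else int(m["lba"])
            failures.append((int(m["num"]), m["kind"], int(m["hours"]), lba))
    return failures


for num, kind, hours, lba in failed_tests(SAMPLE):
    print(f"#{num}: {kind} test failed at {hours} power-on hours, first bad LBA {lba}")
```

Here all three failures are Extended tests, and each one reports a different first bad LBA.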

AndrewParsons · Oct 12, 2016

Team,

I have been researching away and I am still really foggy on how to resilver with the failing drive in the array during the process. Any help at all will be much appreciated. Please see my post two messages up.

Thank you very much

Robert Trevellyan · Oct 12, 2016

Take a look at the directions for replacing drives to grow a pool.

danb35 · Oct 12, 2016

1. If you have a spare SATA port, install the new drive.
2. Go to Storage -> select your pool -> Volume Status (the button that looks like a sheet of notebook paper) -> click on da1 -> Replace -> select your new disk -> Replace Disk.
3. Once the resilver finishes, remove the old disk.

SweetAndLow · Oct 12, 2016

You should also look into better cooling for your drives. They probably overheat during scrubs.

Stux · Oct 15, 2016

How have you distributed power to the 9 drives? Used any splitters?

rs225 · Oct 16, 2016

No, you do not offline the drive before replacing it. It creates the possibility of offlining the wrong drive or swapping the wrong drive.

NetworkNoobie · Jan 1, 2019

Any new developments on this post? I have a similar issue but I wanted to do as much research as possible before I post.

Chris Moore · Jan 1, 2019

rs225 said:
No, you do not offline the drive before replacing it. It creates the possibility of offlining the wrong drive or swapping the wrong drive.

You identify the drive to remove by serial number. That way there is no chance of mistake.
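As a rough illustration of matching a device name to the label on the physical drive, this Python sketch pulls the serial number out of the information section that `smartctl` prints. The sample text is copied from the da1 output earlier in the thread; the function name `serial_number` and the `outputs` mapping are hypothetical, and the captured text would have to be collected for every disk in the pool.

```python
# Information-section lines copied from the da1 output earlier in the thread.
SAMPLE_INFO = """\
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E4YSVP2Z
"""


def serial_number(info: str):
    """Return the value of the 'Serial Number' line, or None if it is missing."""
    for line in info.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Serial Number":
            return value.strip()
    return None


# Hypothetical mapping of device name -> captured smartctl text for that device.
outputs = {"da1": SAMPLE_INFO}
by_device = {dev: serial_number(text) for dev, text in outputs.items()}
print(by_device)
```

The serial reported for da1 can then be checked against the sticker on the drive before anything is unplugged.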
