Possible Failing Disk?

Status
Not open for further replies.

xCatalystx

Contributor
Joined
Dec 3, 2014
Messages
117
So I noticed this morning, after my disks finished scrubbing one of my pools, that "zpool status" reported 4.05 GB repaired. I have a cron job scheduled one day after each scrub to record the latest zpool status, and I can't see any data having been repaired in the last three months. I run scrubs on the 1st and 15th of each month.

Is there any way I can see what data was repaired, or at least which disk it was on?

The disks are 3-4 years old, so I just want to know whether I should be expecting more issues soon.

EDIT: I should note that under zpool status, the read/write/cksum counters are all listed as 0.

EDIT2: SMART data lists one disk with Raw_Read_Error_Rate = 458, but no other values look abnormal. The overall SMART health assessment passes as well.

EDIT3: Well, OK, now I have reservations. SMART says it passed, but FreeNAS is now warning "Device: /dev/da4 [SAT], 2 Currently unreadable (pending) sectors" AND SMART reports Current_Pending_Sector = 8.
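
(For the record, here's how I'd pin down which disk it was; a minimal sketch, assuming the pool is named "tank". As far as I know, ZFS doesn't itemize what it repaired, but the per-device counters and SMART attributes point at the culprit.)

Code:
# per-device READ/WRITE/CKSUM counters, plus any files with permanent errors
zpool status -v tank
# sector-health attributes for the suspect disk
smartctl -A /dev/da4 | egrep "Raw_Read_Error_Rate|Reallocated_Sector_Ct|Current_Pending_Sector"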
 
Last edited:

farmerpling2

Patron
Joined
Mar 20, 2017
Messages
224
Remove it from the RAID if it's a RAIDZ2 vdev. If it's RAID1 (a mirror), make sure you have a backup first.

Wipe the disk with zeros and that error should go away.
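
A minimal sketch of that wipe, assuming the disk has already been removed from the vdev and really is da4 (destructive, so check the device name twice):

Code:
# WARNING: destroys everything on the target disk
dd if=/dev/zero of=/dev/da4 bs=1m
# then see whether the pending sectors were reallocated or simply cleared
smartctl -A /dev/da4 | egrep "Reallocated_Sector_Ct|Current_Pending_Sector"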
 

xCatalystx

Contributor
Joined
Dec 3, 2014
Messages
117
The pool is 6x 3 TB WD Reds in RAIDZ2.

Should I take the pool offline in the meantime while I remove the disk and write zeros to it? I know scrubs take about 25 hours on my system at roughly 80% usage, so I'm trying to mitigate risk.

I do have an up-to-date backup of the 1 TB of super important stuff; the rest is mostly media or VM backups, i.e. less stressful to lose, but still a huge PITA!
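
(For my own notes, offlining just the one disk rather than the whole pool would look roughly like this; a sketch, assuming the pool is named "tank". FreeNAS usually lists members by gptid label rather than daX, so I'd use whatever identifier zpool status shows.)

Code:
zpool offline tank da4   # pool stays up, degraded
zpool status tank
# after testing or replacing, bring it back and ZFS resilvers the difference
zpool online tank da4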
 
Joined
Apr 9, 2015
Messages
1,258
Probably best to shut down and offline the one disk. Personally, I pull the disk and test it on a different system using badblocks, running a couple of passes, followed by a long SMART test.

I would not recommend doing the test on the same system where your pool lives, since at the command line it only takes one little "oops" to kill something you didn't intend to.

Once the disk is pulled you could run the system, but if you can wait the 20 hours or so to test the disk, it wouldn't hurt. If you had RAIDZ3 I would say run it and don't worry, but RAIDZ2 with one disk down could bite you in the butt.
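
A sketch of that sequence on a Linux test box (badblocks -w is destructive; I'm assuming the pulled disk shows up as /dev/sdb there, so verify before running):

Code:
# destructive write-mode test; -b 4096 matches the 4 KiB physical sectors,
# -p 2 requires two consecutive clean passes before it stops
badblocks -wsv -b 4096 -p 2 -o bad-sectors.txt /dev/sdb
# follow up with a long SMART self-test and review the result
smartctl -t long /dev/sdb
smartctl -a /dev/sdb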
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Post the output of smartctl -x /dev/da4 in code tags.

No need to take a RAIDZ2 pool offline with one failing drive.
 

xCatalystx

Contributor
Joined
Dec 3, 2014
Messages
117
Dammit, I was planning to buy 6 TB disks during the Black Friday sales. Guess I'll leave it for tonight and confirm that my important stuff is all backed up.

I am aware I don't "need" to take the pool offline, but lowering my risk until I can replace or overwrite the drive seems like a good idea. That pool is just for the Plex archive and VM/device backups, which I can live without for a few days.

output as requested:
Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD30EFRX-68EUZN0
Serial Number:	<removed>
LU WWN Device Id: 5 0014ee 604fe63ee
Firmware Version: 82.00A82
User Capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Thu Nov 16 19:39:16 2017 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:	  ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline
data collection:		 (39060) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 ( 392) minutes.
Conveyance self-test routine
recommended polling time:	 (   5) minutes.
SCT capabilities:			(0x703d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 POSR-K   200   200   051	-	1538
  3 Spin_Up_Time			POS--K   175   175   021	-	6225
  4 Start_Stop_Count		-O--CK   100   100   000	-	42
  5 Reallocated_Sector_Ct   PO--CK   200   200   140	-	0
  7 Seek_Error_Rate		 -OSR-K   200   200   000	-	0
  9 Power_On_Hours		  -O--CK   066   066   000	-	25292
10 Spin_Retry_Count		-O--CK   100   253   000	-	0
11 Calibration_Retry_Count -O--CK   100   253   000	-	0
12 Power_Cycle_Count	   -O--CK   100   100   000	-	42
192 Power-Off_Retract_Count -O--CK   200   200   000	-	40
193 Load_Cycle_Count		-O--CK   200   200   000	-	750
194 Temperature_Celsius	 -O---K   120   103   000	-	30
196 Reallocated_Event_Count -O--CK   200   200   000	-	0
197 Current_Pending_Sector  -O--CK   200   200   000	-	8
198 Offline_Uncorrectable   ----CK   100   253   000	-	0
199 UDMA_CRC_Error_Count	-O--CK   200   200   000	-	0
200 Multi_Zone_Error_Rate   ---R--   200   200   000	-	4
							||||||_ K auto-keep
							|||||__ C event count
							||||___ R error rate
							|||____ S speed/performance
							||_____ O updated online
							|______ P prefailure warning

General Purpose Log Directory Version 1
SMART		   Log Directory Version 1 [multi-sector log support]
Address	Access  R/W   Size  Description
0x00	   GPL,SL  R/O	  1  Log Directory
0x01		   SL  R/O	  1  Summary SMART error log
0x02		   SL  R/O	  5  Comprehensive SMART error log
0x03	   GPL	 R/O	  6  Ext. Comprehensive SMART error log
0x06		   SL  R/O	  1  SMART self-test log
0x07	   GPL	 R/O	  1  Extended self-test log
0x09		   SL  R/W	  1  Selective self-test log
0x10	   GPL	 R/O	  1  SATA NCQ Queued Error log
0x11	   GPL	 R/O	  1  SATA Phy Event Counters log
0x21	   GPL	 R/O	  1  Write stream error log
0x22	   GPL	 R/O	  1  Read stream error log
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS	  16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS	   1  Device vendor specific log
0xbd	   GPL,SL  VS	   1  Device vendor specific log
0xc0	   GPL,SL  VS	   1  Device vendor specific log
0xc1	   GPL	 VS	  93  Device vendor specific log
0xe0	   GPL,SL  R/W	  1  SCT Command/Status
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 90 (device log contains only the most recent 24 errors)
	CR	 = Command Register
	FEATR  = Features Register
	COUNT  = Count (was: Sector Count) Register
	LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
	LH	 = LBA High (was: Cylinder High) Register	]   LBA
	LM	 = LBA Mid (was: Cylinder Low) Register	  ] Register
	LL	 = LBA Low (was: Sector Number) Register	 ]
	DV	 = Device (was: Device/Head) Register
	DC	 = Device Control Register
	ER	 = Error register
	ST	 = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 90 [17] occurred at disk power-on lifetime: 25290 hours (1053 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 72 fd 63 ae 40 00  Error: UNC at LBA = 0x72fd63ae = 1929208750

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 01 00 00 00 00 72 fd 63 ae 40 00 47d+02:49:08.493  READ FPDMA QUEUED
  ea 00 00 00 00 00 00 00 00 00 00 40 00 47d+02:49:05.919  FLUSH CACHE EXT
  61 00 08 00 10 00 01 5d 50 a1 f8 40 00 47d+02:49:05.919  WRITE FPDMA QUEUED
  61 00 08 00 00 00 01 5d 50 9f f8 40 00 47d+02:49:05.919  WRITE FPDMA QUEUED
  61 00 08 00 08 00 00 00 40 03 f8 40 00 47d+02:49:05.919  WRITE FPDMA QUEUED

Error 89 [16] occurred at disk power-on lifetime: 25288 hours (1053 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 54 1f b0 a8 40 00  Error: WP at LBA = 0x541fb0a8 = 1411362984

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 18 00 08 00 01 50 df 21 08 40 00 47d+01:17:52.392  WRITE FPDMA QUEUED
  61 00 48 00 08 00 00 c8 25 6e e0 40 00 47d+01:17:52.392  WRITE FPDMA QUEUED
  61 00 40 00 08 00 00 72 0b 39 38 40 00 47d+01:17:52.391  WRITE FPDMA QUEUED
  61 00 18 00 08 00 01 51 ac 6c f8 40 00 47d+01:17:52.391  WRITE FPDMA QUEUED
  61 00 10 00 08 00 01 50 df 21 28 40 00 47d+01:17:52.391  WRITE FPDMA QUEUED

Error 88 [15] occurred at disk power-on lifetime: 25288 hours (1053 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 54 1f 2e a8 40 00  Error: WP at LBA = 0x541f2ea8 = 1411329704

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 00 00 00 72 0b 19 a0 40 00 47d+01:15:23.811  WRITE FPDMA QUEUED
  61 00 18 00 00 00 01 51 ac 66 48 40 00 47d+01:15:23.811  WRITE FPDMA QUEUED
  61 00 28 00 00 00 01 50 df 0d d0 40 00 47d+01:15:23.811  WRITE FPDMA QUEUED
  61 00 40 00 00 00 00 c8 25 4f b0 40 00 47d+01:15:23.811  WRITE FPDMA QUEUED
  61 00 20 00 00 00 00 72 0b 19 c0 40 00 47d+01:15:23.811  WRITE FPDMA QUEUED

Error 87 [14] occurred at disk power-on lifetime: 25288 hours (1053 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 53 d7 13 b0 40 00  Error: UNC at LBA = 0x53d713b0 = 1406604208

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 00 00 00 53 d7 13 98 40 00 47d+01:14:59.509  READ FPDMA QUEUED
  60 00 40 00 00 00 00 53 d6 bf b0 40 00 47d+01:14:59.486  READ FPDMA QUEUED
  60 00 40 00 00 00 00 53 d6 bd 88 40 00 47d+01:14:59.485  READ FPDMA QUEUED
  60 00 40 00 00 00 00 53 d6 bd 48 40 00 47d+01:14:59.462  READ FPDMA QUEUED
  60 00 40 00 00 00 00 53 f5 17 18 40 00 47d+01:14:59.439  READ FPDMA QUEUED

Error 86 [13] occurred at disk power-on lifetime: 25288 hours (1053 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 54 48 88 20 40 00  Error: UNC at LBA = 0x54488820 = 1414039584

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 00 00 00 54 48 88 10 40 00 47d+01:14:49.481  READ FPDMA QUEUED
  60 00 40 00 00 00 00 54 3e 22 88 40 00 47d+01:14:49.481  READ FPDMA QUEUED
  60 00 40 00 00 00 00 54 3e 22 48 40 00 47d+01:14:49.468  READ FPDMA QUEUED
  60 00 40 00 08 00 00 54 3e 18 90 40 00 47d+01:14:49.467  READ FPDMA QUEUED
  60 00 40 00 00 00 00 54 3e 18 10 40 00 47d+01:14:49.467  READ FPDMA QUEUED

Error 85 [12] occurred at disk power-on lifetime: 25286 hours (1053 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 53 f9 68 78 40 00  Error: UNC at LBA = 0x53f96878 = 1408854136

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 28 00 00 53 f9 68 98 40 00 46d+23:00:18.474  READ FPDMA QUEUED
  60 00 40 00 20 00 00 53 f9 68 58 40 00 46d+23:00:18.474  READ FPDMA QUEUED
  61 00 08 00 18 00 01 5d 50 a2 28 40 00 46d+23:00:18.474  WRITE FPDMA QUEUED
  61 00 08 00 10 00 01 5d 50 a0 28 40 00 46d+23:00:18.474  WRITE FPDMA QUEUED
  61 00 08 00 08 00 00 00 40 04 28 40 00 46d+23:00:18.474  WRITE FPDMA QUEUED

Error 84 [11] occurred at disk power-on lifetime: 25286 hours (1053 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 54 6d 23 80 40 00  Error: UNC at LBA = 0x546d2380 = 1416438656

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 c0 00 10 00 00 54 6d a6 b0 40 00 46d+22:57:27.547  READ FPDMA QUEUED
  60 00 40 00 00 00 00 54 6d 32 b0 40 00 46d+22:57:27.547  READ FPDMA QUEUED
  60 00 80 00 08 00 00 54 6d 23 20 40 00 46d+22:57:27.534  READ FPDMA QUEUED
  60 00 40 00 08 00 00 54 4a ae 30 40 00 46d+22:57:27.524  READ FPDMA QUEUED
  60 00 40 00 10 00 00 54 4a 96 f0 40 00 46d+22:57:27.524  READ FPDMA QUEUED

Error 83 [10] occurred at disk power-on lifetime: 25276 hours (1053 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 59 ec 44 b0 40 00  Error: UNC at LBA = 0x159ec44b0 = 5803623600

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 00 00 01 59 ec 44 b0 40 00 46d+12:46:03.542  READ FPDMA QUEUED
  60 00 08 00 00 00 01 18 7e 12 c8 40 00 46d+12:46:03.194  READ FPDMA QUEUED
  60 00 08 00 00 00 01 18 46 33 00 40 00 46d+12:46:03.030  READ FPDMA QUEUED
  60 00 08 00 00 00 00 72 46 8f 78 40 00 46d+12:46:02.919  READ FPDMA QUEUED
  60 00 08 00 00 00 00 72 46 8f 80 40 00 46d+12:46:02.865  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	 25290		 1929208745
# 2  Extended offline	Completed: read failure	   90%	 25290		 1929208746
# 3  Extended offline	Completed: read failure	   90%	 25289		 1929208750
# 4  Short offline	   Completed without error	   00%	 25250		 -
# 5  Short offline	   Completed without error	   00%	 25130		 -
# 6  Short offline	   Completed without error	   00%	 25010		 -
# 7  Short offline	   Completed without error	   00%	 24866		 -
# 8  Extended offline	Completed without error	   00%	 24755		 -
# 9  Short offline	   Completed without error	   00%	 24506		 -
#10  Short offline	   Completed without error	   00%	 24386		 -
#11  Short offline	   Completed without error	   00%	 24267		 -
#12  Short offline	   Completed without error	   00%	 24148		 -
#13  Extended offline	Completed without error	   00%	 24037		 -
#14  Short offline	   Completed without error	   00%	 23789		 -
#15  Short offline	   Completed without error	   00%	 23669		 -
#16  Short offline	   Completed without error	   00%	 23550		 -
#17  Short offline	   Completed without error	   00%	 23406		 -
#18  Extended offline	Completed without error	   00%	 23294		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:				  3
SCT Version (vendor specific):	   258 (0x0102)
SCT Support Level:				   1
Device State:						Active (0)
Current Temperature:					30 Celsius
Power Cycle Min/Max Temperature:	 22/39 Celsius
Lifetime	Min/Max Temperature:	  2/47 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:	 2
Temperature Sampling Period:		 1 minute
Temperature Logging Interval:		1 minute
Min/Max recommended Temperature:	  0/60 Celsius
Min/Max Temperature Limit:		   -41/85 Celsius
Temperature History Size (Index):	478 (198)

Index	Estimated Time   Temperature Celsius
199	2017-11-16 11:42	30  ***********
...	..( 35 skipped).	..  ***********
235	2017-11-16 12:18	30  ***********
236	2017-11-16 12:19	31  ************
...	..( 14 skipped).	..  ************
251	2017-11-16 12:34	31  ************
252	2017-11-16 12:35	30  ***********
...	..(139 skipped).	..  ***********
392	2017-11-16 14:55	30  ***********
393	2017-11-16 14:56	29  **********
...	..(174 skipped).	..  **********
  90	2017-11-16 17:51	29  **********
  91	2017-11-16 17:52	30  ***********
...	..(106 skipped).	..  ***********
198	2017-11-16 19:39	30  ***********

SCT Error Recovery Control:
		   Read:	 70 (7.0 seconds)
		  Write:	 70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID	  Size	 Value  Description
0x0001  2			0  Command failed due to ICRC error
0x0002  2			0  R_ERR response for data FIS
0x0003  2			0  R_ERR response for device-to-host data FIS
0x0004  2			0  R_ERR response for host-to-device data FIS
0x0005  2			0  R_ERR response for non-data FIS
0x0006  2			0  R_ERR response for device-to-host non-data FIS
0x0007  2			0  R_ERR response for host-to-device non-data FIS
0x0008  2			0  Device-to-host non-data FIS retries
0x0009  2			1  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2			2  Device-to-host register FISes sent due to a COMRESET
0x000b  2			0  CRC errors within host-to-device FIS
0x000f  2			0  R_ERR response for host-to-device data FIS, CRC
0x0012  2			0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4	  4077887  Vendor specific

 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Yeah da4 is toast, replace it asap.
 

farmerpling2

Patron
Joined
Mar 20, 2017
Messages
224
The pool is 6x 3 TB WD Reds in RAIDZ2.

Should I take the pool offline in the meantime while I remove the disk and write zeros to it? I know scrubs take about 25 hours on my system at roughly 80% usage, so I'm trying to mitigate risk.

I do have an up-to-date backup of the 1 TB of super important stuff; the rest is mostly media or VM backups, i.e. less stressful to lose, but still a huge PITA!

You could remove it from the vdev and then "wipe" it from the GUI fairly easily. As long as you can recover the data if it goes south, I would not shut the system down. It would take two more disks going south to fail the vdev.

Just wipe it and bring it back into the vdev.
 

xCatalystx

Contributor
Joined
Dec 3, 2014
Messages
117
Well, I've zeroed it out and I'm running the extended test now. I would assume that, for the short term, as long as Raw_Read_Error_Rate doesn't increase during the test and no other nasty attributes pop up, the disk should be fine.

I had a note to re-create the pool with ashift=12 (4K blocks), so if this can tide me over a few weeks, I might just do that anyway.
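
(Side note on the ashift=12 plan: on FreeBSD 11 the knob is a sysctl. Just a sketch, and newer FreeNAS builds should already default to 12; the pool name "tank" is a stand-in.)

Code:
# force at least 4 KiB alignment for vdevs created from here on
sysctl vfs.zfs.min_auto_ashift=12
# verify on an existing pool (FreeNAS keeps its cachefile at /data/zfs/zpool.cache)
zdb -U /data/zfs/zpool.cache -C tank | grep ashift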

I have submitted an RMA for the drive anyway, so it depends on whether WD wants to honor it (it expired 2 days ago, DAMMIT!).
 
Last edited:

farmerpling2

Patron
Joined
Mar 20, 2017
Messages
224
Do a wipe of the disk. You have about 3 years of runtime on it.

You are getting CRC errors on certain areas of the disk. I have seen wipes extend usage for another 1-2 years. I have also seen platter rot occur, where areas of the disk start failing and it gets worse over time. I would suggest getting a replacement, and if you can get this drive through an extended test without failure, you could keep it as a spare.

Best of luck!
 

farmerpling2

Patron
Joined
Mar 20, 2017
Messages
224
Dammit, I was planning to buy 6 TB disks during the Black Friday sales. Guess I'll leave it for tonight and confirm that my important stuff is all backed up.

Go ahead and buy 6 TB on Black Friday. You will likely be OK.

Take a look at the Disk Price/Performance Analysis Buying Information resource; you can get an idea of what a good deal looks like. Do not be afraid to buy enterprise drives if they cost only a little more: longer warranty, faster drives, better made...
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
You are getting CRC errors on certain areas of disk. I have seen wipes extend usage for another 1-2 years.
Those are read errors, not CRC errors.
and if you can get this drive to do extended without failure
It has already had 3 extended tests fail. Did you even look at the SMART output?

I have a Seagate drive with numerous errors that is still functional, but I would NEVER trust it with my valuable data.

Replace the drive.
 

xCatalystx

Contributor
Joined
Dec 3, 2014
Messages
117
To be fair, those extended tests were run yesterday by me to determine whether I could easily fix it with dd, but I wasn't making much progress. So I wiped it and am currently running the extended test, about 3/4 of the way in.

The last extended test to run successfully was on the 7th of Nov. The scrub was run on the 14th-15th, at which point it triggered the pending-sector spam.

My concern is that, while the pending sectors are now gone, I noticed that about 65% into the extended SMART test, Raw_Read_Error_Rate crept up by another 152. I think I might just dispose of the drive if WD doesn't let me RMA it.

I might just disable all the read/write stuff directed at the pool and hope/pray it lasts a week. I made an up-to-date backup of the critical stuff last night, so I am a bit more flexible now.

Quick question: if I decide to unmount the pool and move it to another system, does it matter what order I plug the drives in? I assume it wouldn't, as the system should mount by partition GUID? Figured I'd best double-check.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Quick question: if I decide to unmount the pool and move it to another system, does it matter what order I plug the drives in?
Nope.
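
ZFS identifies pool members by the labels written on the disks themselves, so the cabling order is irrelevant. Roughly (a sketch, assuming the pool is named "tank"):

Code:
zpool export tank   # on the old system
# plug the disks into the new system in any order, then:
zpool import        # scans devices and lists importable pools
zpool import tank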
I might just disable all the read/write stuff directed at the pool and hope/pray it lasts a week.
If there is just one failing drive, you should be fine running in a degraded state for a week. I recently did the very same thing when one of my drives suddenly failed.

If you're concerned about it, run smartctl -x /dev/daX (replacing daX with the corresponding device in your system) and post the output for each drive here in code tags.
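
Something like this grabs them all in one shot (adjust the device list to match your system):

Code:
for d in da0 da1 da2 da3 da4 da5; do
	smartctl -x /dev/$d > /tmp/smart_$d.txt
done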
 

farmerpling2

Patron
Joined
Mar 20, 2017
Messages
224
Those are read errors, not CRC errors.

It has already had 3 extended tests fail. Did you even look at the SMART output?

My friend,

I am not familiar with smart output, nor do I know how to read... :confused:

but that being neither here nor there...

I will feebly attempt to share what little I know about the CRC errors on his disk. I have no doubt you will correct my incorrect assumptions. o_O

EDIT3: Well, OK, now I have reservations. SMART says it passed, but FreeNAS is now warning "Device: /dev/da4 [SAT], 2 Currently unreadable (pending) sectors" AND SMART reports Current_Pending_Sector = 8.

The above quoted information from xCatalystx is what told me there were CRC errors on his disk... To clear them up, you write to the logical block that is failing. The write will either cause the failing block to be revectored, which the disk drive handles itself, or the error will magically disappear (a transient error: bitrot, etc.).

That information is saying that a CRC check failed for the on-disk 4K physical sector, which the drive computes on every read/write. The drive detected the CRC error and will keep returning an error until a write lands on that sector or a read finally succeeds.

Normally a bad block would be detected on read/write and revectored to a good block. Nothing wrong or unusual about this; it is expected and architected to work this way. The SMART counters get bumped and everyone is happy.

So in the end, wiping the disk can cause these to disappear. If you see a lot of them (tens to hundreds), SMART will let you know that a failure is possibly coming and it is time to replace the drive.

Since you think I am pulling this from my arse and not from experience, I will share a well-written article that explains it much better than I can (yes, I just read it for the first time, just now :confused:).

https://unix.stackexchange.com/ques...my-disk-unmap-pending-unreadable-sectors#1928

I hope this helps explain why I said what I said.

Have a great day! ;)
 

farmerpling2

Patron
Joined
Mar 20, 2017
Messages
224
To be fair, those extended tests were run yesterday by me to determine if I could easily fix it with dd but I wasn't making much progress. So I wiped it and am currently running the extended test. 3/4 in.

Code:
# 1 Extended offline	Completed: read failure	 90%	25290		1929208745
# 2 Extended offline	Completed: read failure	 90%	25290		1929208746
# 3 Extended offline	Completed: read failure	 90%	25289		1929208750
# 4 Short offline	 Completed without error	 00%	25250		-
# 5 Short offline	 Completed without error	 00%	25130		-
# 6 Short offline	 Completed without error	 00%	25010		-
# 7 Short offline	 Completed without error	 00%	24866		-
# 8 Extended offline	Completed without error	 00%	24755		-


We can see that the last time the extended scan completed successfully was at 24755 hours and the first failure was at 25289 hours. That is 534 hours, or 22.25 days, apart. Most of us read the three failing runs as effectively one run, given the short time between them.

These errors only started in the last month, and we do not know the rate at which they have been occurring.

These types of errors have been around since the early '80s, when smarter disk drives and controllers were created: 30+ years ago, back when 450+ MB drives lived on 14" platters. The good old days. :rolleyes:

Let's see the data from the wipe and then the extended scan before we throw the baby out with the bathwater.

I would suggest keeping a close eye on this drive and planning to replace it sooner rather than later. I think you will likely make it to Black Friday with it.

Cheers!
 

xCatalystx

Contributor
Joined
Dec 3, 2014
Messages
117
We can see that the last time the extended scan completed successfully was at 24755 hours and the first failure was at 25289 hours. That is 534 hours, or 22.25 days, apart.
You're right; it seems I was going by days. My calendar said the 7th, so I must have changed it for some reason. Oh well.

Let's see the data from the wipe and then the extended scan before we throw the baby out with the bathwater.

Well, the extended test failed, so I assume that's game over for that drive? Raw_Read_Error_Rate = 2240, but none of the other SMART values changed.

I forgot to grab the smartctl output because I live-booted that machine >_> Here's a snip I just took from Windows.

[screenshot of SMART data]


It's pretty amazing: I have 4x 2 TB HGST drives (either CoolSpin or Deskstar) that have been running almost non-stop for 6 years (come December) with almost no changes in their SMART data since the day I bought them. So weird.

EDIT: The good news is I might be able to borrow one of our burnt-in, on-shelf replacements from work (gimme some of that WD Gold =D), so that might hold me over. Otherwise, I might just offline the pool and disks until Black Friday.
 
Last edited:

farmerpling2

Patron
Joined
Mar 20, 2017
Messages
224
Well, the extended test failed, so I assume that's game over for that drive? Raw_Read_Error_Rate = 2240, but none of the other SMART values changed.

If you want to play, you can dd zeros over the specific LBA using seek=n with bs=4096. The SMART log reports LBAs in 512-byte sectors, so n is the logged LBA divided by 8 (i.e., with the low three bits cleared), which starts the write at the beginning of the containing 4 KiB physical sector. (The conv=noerror,notrunc flags matter when reading off a damaged disk; for writing zeros they are not needed.)

Should look something like below. I have not tested this, so I might be forgetting something.

Code:
LBA=1929208745   # first failing LBA from the self-test log, in 512-byte sectors
smartctl -x /dev/ada4 > 1a.txt
# integer division by 8 targets the 4 KiB physical sector containing that LBA
dd if=/dev/zero of=/dev/ada4 seek=$((LBA / 8)) bs=4096 count=1
smartctl -x /dev/ada4 > 1b.txt
diff 1a.txt 1b.txt > 1c.txt


See what changed between the two runs.

For the second LBA, use 2a.txt, 2b.txt, 2c.txt, and so on.

This will give you a paper trail to look back at if you have any questions. Your goal is to cause the failing LBA to get revectored.

Run the SMART extended test again.

It's pretty amazing: I have 4x 2 TB HGST drives (either CoolSpin or Deskstar) that have been running almost non-stop for 6 years (come December) with almost no changes in their SMART data since the day I bought them. So weird.

I run HGST Ultrastars. They just work. From what I can tell, WD is basically taking HGST Ultrastars and selling them as WD Gold enterprise drives (WD owns HGST now).

The Red/Red Pro, IronWolf/IronWolf Pro, and N300 are not made to the same standards as the enterprise drives. There is a reason the MTBF is lower, the warranty is shorter, the rated error rate is 10x different, and the rated workload is lower.

EDIT: The good news is I might be able to borrow one of our burnt-in, on-shelf replacements from work (gimme some of that WD Gold =D), so that might hold me over. Otherwise, I might just offline the pool and disks until Black Friday.

I would not take the pool offline just because you are going from Z2 to Z2-1 for such a short time. It would take two more disk failures to wipe you out. For many years people ran RAID 5 and never batted an eye about it. You have backups and recovery methods, so you should be fine.

WD Gold will do you fine. If you get a larger drive you will notice a performance increase, as the heads will not have to move as far to access the same amount of data! Not a huge difference, but watch the reports for the fun of it.

At least wipe the WD before you swap it in for the old drive. Compare before/after smartctl -x output to see if anything concerning shows up.

Have a thumb drive with one of the many variants of Linux (Knoppix is a good one, as it has smartctl and other utilities already installed) so you can do this from a standalone system. Always verify via smartctl that the serial number of the drive is the one you expect to be working on.
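
For example (assuming the disk under test shows up as /dev/sdb on the Knoppix box):

Code:
# the serial printed here must match the sticker on the drive you pulled
smartctl -i /dev/sdb | grep -i serial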

Take a look at some of the specs in the disk price/performance guide for ideas on pricing. You just might find a good enterprise drive at around the same price you were willing to pay for a consumer brand: https://forums.freenas.org/index.ph...e-performance-analysis-buying-information.62/

Cheers!
 
Last edited:

xCatalystx

Contributor
Joined
Dec 3, 2014
Messages
117
First off thanks for all the replies guys.

If you want to play, you can dd zeros over the specific LBA using seek=n with bs=4096.
Yeah, I already tried using dd. We actually have a really nice script at work that does something similar to what you posted, because, as mentioned, occasionally you'll get one or two bad blocks, and after reallocating/remapping the disk will be fine for years. But even after dd, a full write-over, and extended tests, this disk appears to be getting worse.


I would not take the pool offline just because you are going from Z2 to Z2-1 for such a short time. It would take two more disk failures to wipe you out. For many years people ran RAID 5 and never batted an eye about it. You have backups and recovery methods, so you should be fine.
Yeah, but part of me just doesn't want to deal with it. My live data (everything tied to everyday use or work) is live-replicated to my backup box and AWS, so I'm doing fine at the moment.

Have a thumb drive with one of the many variants of Linux
Yep, I use Knoppix and a customized PartedMagic (work-provided). It was just one of those "in a hurry to use the test bench for something else" moments where I forgot to save the output to the USB stick (it boots into RAM).

I've been meaning to build a snazzy M.2 SSD live-boot drive with multiboot; I just haven't gotten around to it. It's not like I have plenty of 128 GB Intel SSDs lying around =)

Take a look at some of the specs in the disk price/performance guide for ideas on pricing.
Yep, I saw you post the link above. Just playing around with ideas in my head at the moment.

Thinking either 6x 6 TB or 6x 8 TB; I just have to wait for the sales.

You just might find a good enterprise drive at around the same price you were willing to pay for a consumer brand
I do try to aim for lower-RPM drives because heat is an issue here, and so is power usage. The only reason I would go higher RPM is if I could get away with fewer drives (I was playing around with the idea of 4 or 5 10-12 TB drives in RAIDZ2 or RAIDZ3).

Bloody AU pricing >_> I do use WD Re's in my backup box, but that is mostly idle/asleep and only has 4 drives in it.
 

farmerpling2

Patron
Joined
Mar 20, 2017
Messages
224
I do try to aim for lower-RPM drives because heat is an issue here, and so is power usage. The only reason I would go higher RPM is if I could get away with fewer drives (I was playing around with the idea of 4 or 5 10-12 TB drives in RAIDZ2 or RAIDZ3).

Bloody AU pricing >_> I do use WD Re's in my backup box, but that is mostly idle/asleep and only has 4 drives in it.

The power "issue" is a non starter. Take a look at the power usage columns and the difference between 7200 and 5x00 is not that much.

Depending on how many drives you run, it might be only tens of dollars a year.

With helium drives, the heat issue is also reduced: less gas friction and turbulence means less energy used by the motors.

Example (using WD Gold since you like them):

8 TB drives, cost of electricity per year at a 20% duty cycle:

WD Red NAS (5400 RPM): $5.96
WD Red NAS Pro (7200 RPM): $6.38
WD Gold Enterprise (7200 RPM): $6.05
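
Those figures line up with a quick back-of-the-envelope check (my assumptions, not from the guide: ~6 W average draw with the drive idle most of the time, at roughly $0.115/kWh):

Code:
6 W x 24 h/day x 365 days ~= 52.6 kWh/year
52.6 kWh x $0.115/kWh     ~= $6.05/year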

How much hotter is the Red NAS going to run compared to the Gold Enterprise when they use almost the same amount of energy per year? You can ignore heat and power usage as comparison points; there is virtually no difference. o_O

Just trying to get you to ignore some of the old wives' tales from the last 20-30 years. Technology has changed.

I could quote a lot of numbers to show you why the Gold is better than the Red NAS, but it would be better for you to look at the numbers a bit yourself.

The real difference is going to be price. How much extra are you willing to pay for an enterprise drive with better specs, warranty, durability, and performance? $20? $40?

Maybe you can buy some from work at a better deal!

Keep up the good fight!
 