When do you actually replace a failing drive?

Status
Not open for further replies.

aadje93

Explorer
Joined
Sep 25, 2015
Messages
60
After about 3 years of running 24/7, my FreeNAS server is now reporting the beginnings of failing drive(s).

My setup:

Norco 24-bay 4U case.
Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
Supermicro X10SRL-F
128 GB Samsung memory (at the time it was cheap enough to go straight to 128; a good choice now, it seems)

24x WD Red series
4 vdevs of 6-disk RAIDZ2, in a single pool.

Now in 2 of the vdevs I get SMART long test read errors listed:

Code:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   8
  3 Spin_Up_Time			0x0027   178   178   021	Pre-fail  Always	   -	   8066
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   15
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   069   069   000	Old_age   Always	   -	   22952
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   15
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   13
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   472
194 Temperature_Celsius	 0x0022   116   108   000	Old_age   Always	   -	   36
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 19500		 -
# 2  Extended offline	Completed: read failure	   90%	 19406		 116357160
# 3  Short offline	   Completed without error	   00%	 19333		 -
# 4  Short offline	   Completed without error	   00%	 19165		 -
# 5  Extended offline	Completed: read failure	   90%	 19070		 116357160
# 6  Short offline	   Completed without error	   00%	 18997		 -
# 7  Short offline	   Completed without error	   00%	 18757		 -
# 8  Extended offline	Completed: read failure	   90%	 18662		 116357160
# 9  Short offline	   Completed without error	   00%	 18589		 -
#10  Short offline	   Completed without error	   00%	 18422		 -
#11  Extended offline	Completed: read failure	   90%	 18327		 116357160
#12  Short offline	   Completed without error	   00%	 18254		 -
#13  Short offline	   Completed without error	   00%	 18039		 -
#14  Extended offline	Completed without error	   00%	 17953		 -
#15  Short offline	   Completed without error	   00%	 17871		 -
#16  Short offline	   Completed without error	   00%	 17703		 -
#17  Extended offline	Completed: read failure	   20%	 17615		 1775422536
#18  Short offline	   Completed without error	   00%	 17535		 -
#19  Short offline	   Completed without error	   00%	 17294		 -
#20  Extended offline	Completed without error	   00%	 17212		 -
#21  Short offline	   Completed without error	   00%	 17127		 -
1 of 5 failed self-tests are outdated by newer successful extended offline self-test #14



And disk 2:
Code:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   41
  3 Spin_Up_Time			0x0027   172   172   021	Pre-fail  Always	   -	   8400
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   15
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   071   070   000	Old_age   Always	   -	   21760
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   15
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   13
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   1977
194 Temperature_Celsius	 0x0022   119   108   000	Old_age   Always	   -	   33
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed: read failure	   90%	 19501		 2589409
# 2  Extended offline	Completed: read failure	   90%	 19406		 2589408
# 3  Short offline	   Completed: read failure	   90%	 19333		 2589408
# 4  Short offline	   Completed: read failure	   50%	 19165		 2589408
# 5  Extended offline	Completed without error	   00%	 19080		 -
# 6  Short offline	   Completed without error	   00%	 18997		 -
# 7  Short offline	   Completed without error	   00%	 18758		 -
# 8  Extended offline	Completed without error	   00%	 18672		 -
# 9  Short offline	   Completed without error	   00%	 18590		 -
#10  Short offline	   Completed without error	   00%	 18422		 -
#11  Extended offline	Completed without error	   00%	 18337		 -
#12  Short offline	   Completed without error	   00%	 18255		 -
#13  Short offline	   Completed without error	   00%	 18039		 -
#14  Extended offline	Completed without error	   00%	 17954		 -
#15  Short offline	   Completed without error	   00%	 17871		 -
#16  Short offline	   Completed without error	   00%	 17703		 -
#17  Extended offline	Completed without error	   00%	 17618		 -
#18  Short offline	   Completed without error	   00%	 17535		 -
#19  Short offline	   Completed without error	   00%	 17295		 -
#20  Extended offline	Completed without error	   00%	 17212		 -
#21  Short offline	   Completed without error	   00%	 17127		 -



Should I replace those disks right now? Or can I keep them until the point they go "offline"?

The pool seems to be working fine, but it's a little unclear to me whether ZFS has actually "dropped" the disks and is now running with 2 Z1 vdevs in the pool instead of Z2, or whether these are just notifications of a soon-to-be-dead disk.
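
On the Z1-versus-Z2 question, the arithmetic can be sketched as below. A RAIDZ2 vdev tolerates two missing disks, so losing one leaves it with single-disk redundancy. The function and variable names are illustrative, and the "one disk lost in each of two vdevs" input is a what-if, not something the SMART output confirms; whether ZFS has actually dropped a disk is reported by `zpool status`, not by SMART self-test results.

```python
# Minimal sketch: remaining redundancy per RAIDZ vdev.
# Layout from the post: 4 vdevs, each a 6-disk RAIDZ2 (parity = 2).
PARITY = 2


def remaining_redundancy(lost_disks, parity=PARITY):
    """How many more disks the vdev can lose before data is lost.

    A negative result means the vdev (and with it the pool) has already failed.
    """
    return parity - lost_disks


# Hypothetical what-if: one disk gone in vdev 0 and in vdev 1, none in the others.
lost_per_vdev = [1, 1, 0, 0]
for index, lost in enumerate(lost_per_vdev):
    print(f"vdev {index}: can still lose {remaining_redundancy(lost)} disk(s)")
```

Even in that what-if, each affected vdev could lose one more disk before data is at risk.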
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
There's no single hard rule. If the warranty is still valid, definitely have the failing disks replaced.

Most people replace disks if:
  • They're accumulating bad sectors (more than a small handful and increasing)
  • They're failing SMART tests
  • They're slowing things down to a crawl
  • SMART is fine but they're otherwise misbehaving
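
The "accumulating bad sectors" check can be sketched roughly as follows, assuming the attribute table has the ten-column layout shown in the smartctl output earlier in this thread. The function names and the choice of attributes 5, 197 and 198 as the sector-related ones are this sketch's own assumptions, not a rule from the thread.

```python
# Attributes whose raw value counts bad or suspect sectors (IDs as printed by smartctl).
SECTOR_ATTRIBUTES = {5, 197, 198}


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for each row of the attribute table."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        # A table row has 10 columns: numeric ID, name, hex FLAG, ..., RAW_VALUE.
        if len(fields) < 10 or not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue
        if fields[9].isdigit():
            attributes[int(fields[0])] = (fields[1], int(fields[9]))
    return attributes


def sector_warnings(attributes):
    """Names of sector-related attributes with a non-zero raw value."""
    return sorted(
        name
        for attr_id, (name, raw) in attributes.items()
        if attr_id in SECTOR_ATTRIBUTES and raw > 0
    )


# A few rows copied from the first drive's output in this thread.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       8
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

print(sector_warnings(parse_attributes(SAMPLE)))  # prints []
```

For this drive the list comes back empty even though its extended self-tests were failing, which is why the criteria above are best checked together rather than one at a time.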
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
In warranty or not, it is time to replace the drives that are shown. If you have not been doing it, it is time to keep a couple spare drives ready. You might also want to consider a plan to replace them all because they will become more failure prone as they approach the 5 year point. Last year, I had to replace my entire pool. They had gone past 5 years old and I was having a drive go every month, two in the same day one time.
Best to plan for it.

 

aadje93

Explorer
Joined
Sep 25, 2015
Messages
60
Wow, thanks for all the fast replies!

So I'm going to order some replacement disks now and get some spare disks for future failures.

How can I get warranty on those disks? Technically they are still working fine; if the shop gets them, they connect them to a PC, format them, put files on them, and send them back with a note that they're working fine...

They have been running for ~2.5 years now, as you can see in the SMART logs.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
How can I get warranty on those disks? Technically they are still working fine; if the shop gets them, they connect them to a PC, format them, put files on them, and send them back with a note that they're working fine...
You go direct to Western Digital for the warranty and you tell them that the drive is giving the errors that you posted above. That simple.
Both of them failed the long self test with a read error. That should NOT happen. The drive has failed.
Why would that be a question?
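
Pulling the failed runs out of a self-test log can be sketched like this, assuming rows shaped like the ones posted above. The regular expression and function names are illustrative only, and the sample is four rows copied from the first drive's log.

```python
import re

# One row of the "SMART Self-test log" table, e.g.
# "# 2  Extended offline  Completed: read failure  90%  19406  116357160"
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<kind>Short offline|Extended offline)\s+"
    r"(?P<status>.+?)\s+(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\d+|-)$"
)


def parse_selftests(text):
    """Return one dict per self-test row found in the log text."""
    tests = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            lba = match["lba"]
            tests.append({
                "kind": match["kind"],
                "status": match["status"],
                "hours": int(match["hours"]),
                "lba": None if lba == "-" else int(lba),
            })
    return tests


def read_failures(tests):
    """Keep only the tests that ended with a read failure."""
    return [t for t in tests if "read failure" in t["status"]]


SAMPLE = """\
# 1  Short offline       Completed without error       00%     19500         -
# 2  Extended offline    Completed: read failure       90%     19406         116357160
# 5  Extended offline    Completed: read failure       90%     19070         116357160
#14  Extended offline    Completed without error       00%     17953         -
"""

failures = read_failures(parse_selftests(SAMPLE))
print(len(failures), {t["lba"] for t in failures})  # prints 2 {116357160}
```

Repeated failures at the same LBA, as in this log, suggest one persistently unreadable spot rather than a one-off glitch.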
 

aadje93

Explorer
Joined
Sep 25, 2015
Messages
60
All right, I opened a support request with WD to ask how to do the RMA and so on (this is totally new to me with HDDs; on my Synology they just plain died and that was it :eek: no warnings beforehand, not even reallocation errors in the mail).
 

melloa

Wizard
Joined
May 22, 2016
Messages
1,749
So I'm going to order some replacement disks now and get some spare disks for future failures.

And burn them in to: (1) ensure they are good and (2) have them ready for action, as burning in an HDD takes time, lots of it. Also be aware that when you replace a disk, the pool resilvers, and that also stresses the disks and could cause another one to fail.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
And burn them in to: (1) ensure they are good and (2) have them ready for action, as burning in an HDD takes time, lots of it. Also be aware that when you replace a disk, the pool resilvers, and that also stresses the disks and could cause another one to fail.
All good advice, but it is no more stressful on the pool to resilver than it is to scrub.


 

melloa

Wizard
Joined
May 22, 2016
Messages
1,749
it is no more stressful on the pool to resilver than it is to scrub.

Point was that other disks can fail, not that one is more or less stressful.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Point was that other disks can fail, not that one is more or less stressful.
You are missing the point.
The idea that a resilver would cause a fault should not be a thing to think about because a resilver is not overly stressful on the pool.
 

aadje93

Explorer
Joined
Sep 25, 2015
Messages
60
The new disks will get a 24-hour burn-in before replacement; that's what I did with the others too, and luckily it showed all were fine.

But the reallocated sector count is 140, so it's not yet >200 to be classified as a "failed disk" by WD, I suppose? (They have ~90 days of WD warranty left.)
 

aadje93

Explorer
Joined
Sep 25, 2015
Messages
60
All right, thanks for the confirmation, Johnnie.

Now I've got another problem (I didn't notice it before because mail was not working right): I have something like 23K UDMA CRC errors on another drive. Does that mean a cable connection problem? (No other drive has this many.)

They are in a Norco 4224 with the Z2s spread out across all the backplanes (sort of 4 Z2s stacked next to each other), so it means 1 drive, or its backplane connector, or the cable is bad. As the other 3 have none of these errors, could it be the backplane connector for that disk that is bad? Or is it the cable (Supermicro) that's probably bad?
 
Joined
May 10, 2017
Messages
838
A cable is the most likely cause of those errors, so replace that first; if the attribute continues to increase, then swap backplane slots with another disk.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
The new disks will get a 24-hour burn-in before replacement; that's what I did with the others too, and luckily it showed all were fine.

But the reallocated sector count is 140, so it's not yet >200 to be classified as a "failed disk" by WD, I suppose? (They have ~90 days of WD warranty left.)
The last time I burned in a set of disks, it took 5 days. If you are only giving 24 hours, it isn't enough.

You can actually get a drive replaced under warranty for a single reallocated sector. I have done it. You don't have to wait for some threshold of failed. If it is broken, even just a little, it is bad and they will replace it.
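
To see why 24 hours can be too short, here is a back-of-the-envelope estimate of how long full-surface passes take. The drive capacity, average throughput and pass count below are assumptions picked for illustration; none of them come from this thread. Eight passes corresponds to a four-pattern write-then-read test in the style of `badblocks -w`.

```python
def full_pass_hours(capacity_tb, avg_mb_per_s):
    """Hours needed to read or write the whole disk once at a given average speed."""
    total_bytes = capacity_tb * 1e12          # decimal terabytes, as printed on drive labels
    seconds = total_bytes / (avg_mb_per_s * 1e6)
    return seconds / 3600


def burn_in_days(capacity_tb, avg_mb_per_s, passes):
    """Days needed for a burn-in made of several full-surface passes."""
    return full_pass_hours(capacity_tb, avg_mb_per_s) * passes / 24


# Assumed example: a 4 TB drive averaging 120 MB/s over 8 full passes.
print(f"one pass: {full_pass_hours(4, 120):.1f} h")
print(f"burn-in:  {burn_in_days(4, 120, 8):.1f} days")
```

Under those assumptions a single pass already takes most of a working day, so a multi-pass burn-in lands in the range of days rather than hours.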
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
A cable is the most likely cause of those errors, so replace that first; if the attribute continues to increase, then swap backplane slots with another disk.
What errors are you blaming on the cable?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
All right, thanks for the confirmation, Johnnie.

Now I've got another problem (I didn't notice it before because mail was not working right): I have something like 23K UDMA CRC errors on another drive. Does that mean a cable connection problem? (No other drive has this many.)

They are in a Norco 4224 with the Z2s spread out across all the backplanes (sort of 4 Z2s stacked next to each other), so it means 1 drive, or its backplane connector, or the cable is bad. As the other 3 have none of these errors, could it be the backplane connector for that disk that is bad? Or is it the cable (Supermicro) that's probably bad?
The backplane in the Norco case is basically just a breakout cable. It can be the problem, but like Johnnie said, the cable is a more likely cause. I have seen a kinked cable cause the CRC errors.
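
The "replace the cable first, then watch the counter" approach can be sketched as below. The raw value of attribute 199 is normally a lifetime total that does not reset, so what matters is whether it keeps growing after the swap. The readings are made-up placeholders (the thread only says "like 23K"), and the function names are illustrative.

```python
def crc_errors_added(readings):
    """Increase between consecutive raw readings of UDMA_CRC_Error_Count, oldest first."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


def next_step(readings_after_cable_swap):
    """Suggest a follow-up based on readings taken after the cable was replaced."""
    if sum(crc_errors_added(readings_after_cable_swap)) == 0:
        return "counter stable: the old cable was the likely cause"
    return "counter still rising: swap backplane slots with another disk"


# Hypothetical readings taken over a few days after replacing the cable.
print(next_step([23000, 23000, 23000]))
print(next_step([23000, 23150, 23400]))
```

If the counter keeps rising after the slot swap as well, it is worth noting whether the errors follow the disk or stay with the slot.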


 

aadje93

Explorer
Joined
Sep 25, 2015
Messages
60
Thanks for all your support. I've sent out for a price quote on some new disks (including some cold spares). But how does WD Gold compare to WD Red? They are more expensive, but they are truly rated for database use, rather than for the more consumer-oriented 8-disk NAS devices (I have 24 disks now, soon to be joined by a second backup machine, also running 24 disks, right above it).

Also, you say a tight bend is not good. Should I modify the fan-mounting plate in the middle of the case so the cables can go a little more straight through it, instead of through the bottom opening in the middle and back up again?
 