When do you actually replace a failing drive?

aadje93 · Jun 19, 2018

After about 3 years of running 24/7, my FreeNAS server is now reporting the beginnings of failing drive(s).

my setup:

Norco 24bay 4U case.
Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
Supermicro X10SRL-F
128 GB Samsung memory (at the time it was cheap enough to go straight to 128, which seems like a good choice now)

24x WD Red series
4 RAIDZ2 vdevs of 6 disks each, in a single pool.

Now in 2 of the vdevs I get read errors listed for the SMART long self-test:

Code:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   8
  3 Spin_Up_Time			0x0027   178   178   021	Pre-fail  Always	   -	   8066
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   15
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   069   069   000	Old_age   Always	   -	   22952
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   15
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   13
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   472
194 Temperature_Celsius	 0x0022   116   108   000	Old_age   Always	   -	   36
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 19500		 -
# 2  Extended offline	Completed: read failure	   90%	 19406		 116357160
# 3  Short offline	   Completed without error	   00%	 19333		 -
# 4  Short offline	   Completed without error	   00%	 19165		 -
# 5  Extended offline	Completed: read failure	   90%	 19070		 116357160
# 6  Short offline	   Completed without error	   00%	 18997		 -
# 7  Short offline	   Completed without error	   00%	 18757		 -
# 8  Extended offline	Completed: read failure	   90%	 18662		 116357160
# 9  Short offline	   Completed without error	   00%	 18589		 -
#10  Short offline	   Completed without error	   00%	 18422		 -
#11  Extended offline	Completed: read failure	   90%	 18327		 116357160
#12  Short offline	   Completed without error	   00%	 18254		 -
#13  Short offline	   Completed without error	   00%	 18039		 -
#14  Extended offline	Completed without error	   00%	 17953		 -
#15  Short offline	   Completed without error	   00%	 17871		 -
#16  Short offline	   Completed without error	   00%	 17703		 -
#17  Extended offline	Completed: read failure	   20%	 17615		 1775422536
#18  Short offline	   Completed without error	   00%	 17535		 -
#19  Short offline	   Completed without error	   00%	 17294		 -
#20  Extended offline	Completed without error	   00%	 17212		 -
#21  Short offline	   Completed without error	   00%	 17127		 -
1 of 5 failed self-tests are outdated by newer successful extended offline self-test #14

and disk 2:

Code:

ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   41
  3 Spin_Up_Time			0x0027   172   172   021	Pre-fail  Always	   -	   8400
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   15
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   071   070   000	Old_age   Always	   -	   21760
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   15
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   13
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   1977
194 Temperature_Celsius	 0x0022   119   108   000	Old_age   Always	   -	   33
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed: read failure	   90%	 19501		 2589409
# 2  Extended offline	Completed: read failure	   90%	 19406		 2589408
# 3  Short offline	   Completed: read failure	   90%	 19333		 2589408
# 4  Short offline	   Completed: read failure	   50%	 19165		 2589408
# 5  Extended offline	Completed without error	   00%	 19080		 -
# 6  Short offline	   Completed without error	   00%	 18997		 -
# 7  Short offline	   Completed without error	   00%	 18758		 -
# 8  Extended offline	Completed without error	   00%	 18672		 -
# 9  Short offline	   Completed without error	   00%	 18590		 -
#10  Short offline	   Completed without error	   00%	 18422		 -
#11  Extended offline	Completed without error	   00%	 18337		 -
#12  Short offline	   Completed without error	   00%	 18255		 -
#13  Short offline	   Completed without error	   00%	 18039		 -
#14  Extended offline	Completed without error	   00%	 17954		 -
#15  Short offline	   Completed without error	   00%	 17871		 -
#16  Short offline	   Completed without error	   00%	 17703		 -
#17  Extended offline	Completed without error	   00%	 17618		 -
#18  Short offline	   Completed without error	   00%	 17535		 -
#19  Short offline	   Completed without error	   00%	 17295		 -
#20  Extended offline	Completed without error	   00%	 17212		 -
#21  Short offline	   Completed without error	   00%	 17127		 -

Should I replace those disks right now, or can I keep them until the point they go "offline"?

The pool seems to be working fine, but it's a little unclear to me whether ZFS has actually "dropped" the disks and is now running with 2 Z1 vdevs in the pool instead of Z2, or whether these are just notifications of a soon-to-be-dead disk.

Johnnie Black · Jun 19, 2018

aadje93 said:
Should I replace those disks right now?

Yes.

Ericloewe · Jun 19, 2018

There's no single hard rule. If the warranty is still valid, definitely have the failing disks replaced.

Most people replace disks if:

They're accumulating bad sectors (more than a small handful and increasing)
They're failing SMART tests
They're slowing things down to a crawl
SMART is fine but they're otherwise misbehaving
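
The rules of thumb above can be written down as a small checklist. This is only a sketch: the `DiskObservation` fields and the `SMALL_HANDFUL` cutoff of 5 sectors are hypothetical choices made for illustration, not values taken from FreeNAS or from any vendor.

```python
from dataclasses import dataclass


@dataclass
class DiskObservation:
    """Hand-collected facts about one disk (all field names are hypothetical)."""
    reallocated_now: int          # current raw bad/reallocated sector count
    reallocated_before: int       # same counter at an earlier check
    failed_self_tests: int        # number of failed SMART self-tests in the log
    causing_slowdowns: bool = False
    otherwise_misbehaving: bool = False


# Assumed cutoff for "more than a small handful" of bad sectors.
SMALL_HANDFUL = 5


def replacement_reasons(obs: DiskObservation) -> list[str]:
    """Return the rules of thumb that this disk trips (empty list means none)."""
    reasons = []
    if obs.reallocated_now > SMALL_HANDFUL and obs.reallocated_now > obs.reallocated_before:
        reasons.append("accumulating bad sectors")
    if obs.failed_self_tests > 0:
        reasons.append("failing SMART tests")
    if obs.causing_slowdowns:
        reasons.append("slowing things down")
    if obs.otherwise_misbehaving:
        reasons.append("misbehaving despite clean SMART data")
    return reasons


# A disk with no bad sectors but five failed self-tests still trips a rule.
disk = DiskObservation(reallocated_now=0, reallocated_before=0, failed_self_tests=5)
print(replacement_reasons(disk))
```

Any non-empty result is a prompt to look closer, not an automatic verdict; as noted above, there is no single hard rule.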

Chris Moore · Jun 19, 2018

In warranty or not, it is time to replace the drives that are shown. If you have not been doing it, it is time to keep a couple spare drives ready. You might also want to consider a plan to replace them all because they will become more failure prone as they approach the 5 year point. Last year, I had to replace my entire pool. They had gone past 5 years old and I was having a drive go every month, two in the same day one time.
Best to plan for it.


aadje93 · Jun 19, 2018

Wow, thanks for all the fast replies!

So I'm going to order some replacement disks now, plus some spare disks for future failures.

How can I get warranty on those disks? Technically they are still working fine. If the shop gets them, they will connect them to a PC, format them, put files on them, and send them back with a notice that they are working fine...

They have been running for ~2.5 years now, as you can see in the SMART logs.

Chris Moore · Jun 19, 2018

aadje93 said:
How can I get warranty on those disks? Technically they are still working fine. If the shop gets them, they will connect them to a PC, format them, put files on them, and send them back with a notice that they are working fine...

You go direct to Western Digital for the warranty and you tell them that the drive is giving the errors that you posted above. That simple.
Both of them failed the long self test with a read error. That should NOT happen. The drive has failed.
Why would that be a question?
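
To spot such failures without reading the log by eye, here is a minimal Python sketch that scans the self-test section of smartctl output for failed tests. The regular expression and the function name `failed_self_tests` are assumptions for this example. They are based only on the log layout posted in this thread, and other drives or smartctl versions may format the rows differently.

```python
import re

# One row of the self-test log, e.g.
# "# 2  Extended offline  Completed: read failure  90%  19406  116357160"
SELF_TEST_ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>\w+ offline)\s+(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def failed_self_tests(log_text: str) -> list[dict]:
    """Return one dict per self-test row whose status mentions a failure."""
    failures = []
    for line in log_text.splitlines():
        match = SELF_TEST_ROW.match(line.strip())
        if match is None:
            continue  # header lines, blank lines, summary lines
        if "failure" not in match["status"].lower():
            continue
        lba = match["lba"]
        failures.append({
            "num": int(match["num"]),
            "test": match["test"],
            "status": match["status"],
            "hours": int(match["hours"]),
            "lba": int(lba) if lba.isdigit() else None,
        })
    return failures


# Three rows copied from the first disk's log in this thread.
SAMPLE = """\
# 1  Short offline       Completed without error       00%     19500         -
# 2  Extended offline    Completed: read failure       90%     19406         116357160
#14  Extended offline    Completed without error       00%     17953         -
"""

for failure in failed_self_tests(SAMPLE):
    print(failure["test"], "failed at", failure["hours"], "hours, LBA", failure["lba"])
```

The same LBA showing up in several failed extended tests, as it does for both disks here, points to a persistent unreadable spot rather than a one-off glitch.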

aadje93 · Jun 19, 2018

All right, I opened a support question with WD to ask how to do the RMA and so on. This is totally new to me with HDDs; on Synology they just plain died and that was it, with no warnings beforehand, not even reallocation errors in the mail.

melloa · Jun 19, 2018

aadje93 said:
So I'm going to order some replacement disks now, plus some spare disks for future failures.

And burn them in to: (1) ensure they are good and (2) be ready for action, as burning in an HD takes time, lots of it. Also be aware that when replacing a disk, the pool resilvers, and that also stresses the disks and could cause another one to fail.

Chris Moore · Jun 19, 2018

melloa said:
And burn them in to: (1) ensure they are good and (2) be ready for action, as burning in an HD takes time, lots of it. Also be aware that when replacing a disk, the pool resilvers, and that also stresses the disks and could cause another one to fail.

All good advice, but it is no more stressful on the pool to resilver than it is to scrub.


melloa · Jun 19, 2018

Chris Moore said:
it is no more stressful on the pool to resilver than it is to scrub.

Point was that other disks can fail, not that one is more or less stressful.

Chris Moore · Jun 19, 2018

melloa said:
Point was that other disks can fail, not that one is more or less stressful.

You are missing the point.
The idea that a resilver would cause a fault should not be a thing to think about because a resilver is not overly stressful on the pool.

aadje93 · Jun 19, 2018

The new disks will get a 24-hour burn-in before replacement. That's what I did on the others too, and luckily it showed they were all fine.

But the reallocated sector count is 140, so it's not yet >200 to be classified as a "failed disk" by WD, I suppose? (They have ~90 days of WD warranty left.)

Johnnie Black · Jun 19, 2018

A failed SMART test is more than enough to qualify for an RMA.

aadje93 · Jun 19, 2018

All right, thanks for the confirmation, Johnnie.

Now I've got another problem (I didn't notice it before because mail was not working right): I have about 23K UDMA CRC errors on another drive. Does that mean a cable connection problem? (No other drive has this many.)

The drives are in a Norco 4224 with the Z2s spread out across all the backplanes (sort of 4 Z2s stacked next to each other), so it means one drive, its backplane connector, or its cable is bad. As the other 3 have none of these errors, could the backplane connector for that disk be bad? Or is it probably the cable (Supermicro) that's bad?

Johnnie Black · Jun 19, 2018

The cable is the most likely cause of those errors, so replace that first. If the attribute continues to increase, then swap backplane slots with another disk.
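
The raw UDMA_CRC_Error_Count is a cumulative counter that generally does not reset after a fix, so what matters after a cable swap is whether it keeps growing, not its absolute value. A minimal sketch of that check follows; the function name and the example readings are hypothetical, loosely based on the "23K" figure mentioned above.

```python
def crc_errors_still_increasing(readings: list[int]) -> bool:
    """Return True if the raw attribute-199 value grew between any two
    consecutive readings (readings must be in chronological order,
    all taken after the cable was replaced)."""
    return any(later > earlier for earlier, later in zip(readings, readings[1:]))


# Hypothetical readings noted down on successive days after the cable swap.
after_cable_swap = [23000, 23000, 23000]
if crc_errors_still_increasing(after_cable_swap):
    print("Counter still climbing: try a different backplane slot next.")
else:
    print("Counter is stable: the old cable was the likely culprit.")
```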

Chris Moore · Jun 20, 2018

aadje93 said:
The new disks will get a 24-hour burn-in before replacement. That's what I did on the others too, and luckily it showed they were all fine.

But the reallocated sector count is 140, so it's not yet >200 to be classified as a "failed disk" by WD, I suppose? (They have ~90 days of WD warranty left.)

The last time I burned in a set of disks, it took 5 days. If you are only giving 24 hours, it isn't enough.

You can actually get a drive replaced under warranty for a single reallocated sector. I have done it. You don't have to wait for some threshold of failed. If it is broken, even just a little, it is bad and they will replace it.
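
On reading the columns: in the attribute rows posted at the top of this thread, the 140 for Reallocated_Sector_Ct sits in the THRESH column, while the normalized VALUE is 200 and the RAW_VALUE is 0. In smartctl output the normalized VALUE counts down toward THRESH, and the actual count is the raw value in the last column. A small sketch that splits one such row into named fields (the function name `parse_attribute_row` is an assumption, and the row layout is assumed to match the output shown in this thread):

```python
def parse_attribute_row(row: str) -> dict:
    """Split one 'smartctl -A' attribute row into named fields.

    Expected column order, as in the output posted in this thread:
    ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
    """
    fields = row.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),    # normalized, higher is healthier
        "worst": int(fields[4]),
        "thresh": int(fields[5]),   # vendor failure threshold for VALUE
        "raw": " ".join(fields[9:]),  # raw value may contain spaces on some drives
    }


# Row copied from the first disk's attribute table.
row = "  5 Reallocated_Sector_Ct   0x0033   200   200   140  Pre-fail  Always  -  0"
attr = parse_attribute_row(row)
print(attr["value"], attr["thresh"], attr["raw"])  # prints "200 140 0"
```

So for the disks in question no sectors have been reallocated yet; the case for replacement rests on the failed self-tests.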

Chris Moore · Jun 20, 2018

Johnnie Black said:
The cable is the most likely cause of those errors, so replace that first. If the attribute continues to increase, then swap backplane slots with another disk.

What errors are you blaming on the cable?

Johnnie Black · Jun 20, 2018

Chris Moore said:
What errors are you blaming on the cable?

aadje93 said:
I got like 23K UDMA CRC errors on another drive

Chris Moore · Jun 20, 2018

aadje93 said:
All right, thanks for the confirmation, Johnnie.

Now I've got another problem (I didn't notice it before because mail was not working right): I have about 23K UDMA CRC errors on another drive. Does that mean a cable connection problem? (No other drive has this many.)

The drives are in a Norco 4224 with the Z2s spread out across all the backplanes (sort of 4 Z2s stacked next to each other), so it means one drive, its backplane connector, or its cable is bad. As the other 3 have none of these errors, could the backplane connector for that disk be bad? Or is it probably the cable (Supermicro) that's bad?

The backplane in the Norco case is basically just a breakout cable. It can be the problem, but like Johnnie said, the cable is a more likely cause. I have seen a kinked cable cause the CRC errors.


aadje93 · Jun 20, 2018

Thanks for all your support. I've sent out for a price quote on some new disks (plus some cold spares). But how does WD Gold compare to WD Red? The Golds are more expensive, but they are truly rated for database use rather than for the more consumer-oriented 8-disk NAS devices (24 disks now, and soon a second backup machine, also running 24 disks, right above it).

Also, you say tight bends are not good. Should I modify the fan-mounting plate in the middle of the case so the cables can run a little straighter through it, instead of going through the bottom opening in the middle and back up again?
