Read errors - When to ditch drive?

HarambeLives

Contributor
Joined
Jul 19, 2021
Messages
153
For my main array made of 6 x 4TB Disks in 3 x Mirrors, I have a bunch of known good 4TB SATA disks and then some used 4TB SAS disks

Right out of the gate when creating the pool, I got 4 read errors on one of the used 4TB SAS disks. I took it out and put a spare in, since I have so many. The array has been fine since

Is this enough to condemn the drive? Or should I do further testing and possibly use it? SMART shows all good, and I'm currently running a full write + read test in HD Sentinel

If it fails the test, I'll toss it. If it passes, then what?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi,

You mentioned that they were used, but did not say whether they were still under warranty. Here, the moment one of my HDDs shows the first sign of problems (bad sectors, failed SMART test, ...), I RMA it. I have 2 in the mail as of now (one from Hades and one from Atlas).

Were they not under warranty, I might have handled it differently.

I now have 2 cold spares on each site (Hades and Atlas). Hades being RaidZ2, I would wait for more than a few bad sectors before replacing a drive. Should I push the drive too far, the dual redundancy plus the cold spare on site will let me recover in a short time.

Atlas being only regular mirrors like yours, its risk is different. In this case, I would keep Thanatos (also on this site) more up-to-date, but I guess it would not be long before I replaced the drive. The pain of rebuilding everything and going through my DR process is not worth an HDD.

So how are your backups? Do you already have spares on site? What is your risk appetite? How is your warranty?
 

HarambeLives

Contributor
Joined
Jul 19, 2021
Messages
153
No drives under warranty, all of these are older used drives. If it's under warranty, I return it at the first sign of trouble

I currently have 2 systems, one with 12 x 8TB in RAIDZ2 and then the main system with mirrors. I have 3 x 4TB spares and 15 x 8TB spares, so I'm well stocked

The main system is replicated to the secondary system every 15m, and I have multiple backups in different sites. So if I did lose the array, it wouldn't really be the end of the world but obviously I'd like to avoid that

Assuming it passes a surface test and extended SMART, would you throw the disk with 4 read errors back into the pile?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Since it's not critical, and not a pressing need, I'd probably toss it in another system, and torture it with badblocks(8) for several days, just to see if I could tip it into full failure. If it stays at 4 errors, and you're satisfied with the risk, run it as a primary device, not a hot spare.
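A minimal sketch of what such a badblocks run might look like. The device name /dev/da5 is a hypothetical placeholder, and the sketch assumes the disk holds nothing worth keeping, because the -w write-mode test overwrites the whole disk. It only prints the command unless RUN=1 is set, so it is safe to paste and inspect first.

```shell
#!/bin/sh
# Hypothetical device name; replace with the disk under test.
DISK="${DISK:-/dev/da5}"

# Build the badblocks command line:
#   -w       destructive write-mode test (writes test patterns, then reads them back)
#   -s       show progress while scanning
#   -b 4096  block size in bytes; a larger block size is commonly used on big disks
badblocks_cmd() {
    echo "badblocks -b 4096 -ws $1"
}

CMD="$(badblocks_cmd "$DISK")"

if [ "${RUN:-0}" = "1" ]; then
    # DESTROYS ALL DATA on $DISK.
    $CMD
else
    echo "dry run: $CMD"
fi
```

Re-running it a few times back to back, and comparing the drive's error counters before and after, is one way to see whether the error count stays put.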
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Agree, and will now humbly direct @HarambeLives to my disk burn-in repo:

 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
You can fix a drive's unreadable sector error by forcing a write to the bad sector. The write will force the drive to reallocate the bad sector, as all drives are made with spare sectors for exactly this purpose. Then run a S.M.A.R.T. long test to ensure the error is gone, scrub the pool to rebuild the reallocated sector(s) from redundancy (the mirror copy or parity), and you're good.

The command would be:
Code:
sysctl kern.geom.debugflags=16
dd if=/dev/zero of=/dev/drive bs= count=1 seek= conv=noerror,sync

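bs= stripesize (block size in bytes)
seek= bad sector from the SMART error, counted in blocks of bs; if bs differs from the drive's logical sector size, convert the reported LBA first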
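As a rough illustration of filling in those two values: dd's seek= counts blocks of size bs, so an LBA reported in logical sectors has to be scaled when bs is larger. Every value below (device name, LBA, sector sizes, and the pool name tank) is a made-up placeholder, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Convert an LBA given in logical sectors into a dd seek value for block size bs.
# Integer division rounds down to the bs-sized block that contains the LBA.
seek_for_lba() {
    lba="$1"; logical="$2"; bs="$3"
    echo $(( lba * logical / bs ))
}

# Hypothetical values for illustration only.
DRIVE="/dev/da5"   # disk with the unreadable sector
LBA=1234567        # bad LBA as reported in the SMART error log
LOGICAL=512        # logical sector size in bytes
BS=4096            # stripesize used as dd block size

SEEK="$(seek_for_lba "$LBA" "$LOGICAL" "$BS")"

# Print the sequence described above instead of executing it.
# Note: the dd write zeroes one whole bs-sized block, not just the single bad sector.
echo "sysctl kern.geom.debugflags=16"
echo "dd if=/dev/zero of=$DRIVE bs=$BS count=1 seek=$SEEK conv=noerror,sync"
echo "smartctl -t long $DRIVE"
echo "zpool scrub tank"
```

When bs equals the logical sector size, the seek value is simply the reported LBA.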
 