Read errors - When to ditch drive?

HarambeLives · Aug 10, 2021

For my main array made of 6 x 4TB Disks in 3 x Mirrors, I have a bunch of known good 4TB SATA disks and then some used 4TB SAS disks

Right out of the gate when creating the pool, I got 4 read errors on one of the used 4TB SAS disks. I took it out and put a spare in, since I have so many. The array has been fine since

Is this enough to condemn the drive? Or should I do further testing and possibly use it? SMART shows all good, and I'm currently running a full write + read test in HD Sentinel

If it fails the test, I'll toss it. If it passes, then what?

Heracles · Aug 10, 2021

Hi,

You mentioned that they were used, but did not say whether they are still under warranty. Here, the moment one of my HDDs shows the first sign of trouble (bad sectors, failed SMART test, ...), I RMA it. I have 2 in the mail right now (one from Hades and one from Atlas).

If they were not under warranty, I might have handled it differently.

I now have 2 cold spares on each site (Hades and Atlas). Hades being RaidZ2, I would wait for more than a few bad sectors before replacing a drive. Should I push the drive too far, the dual redundancy plus the cold spare on site will let me recover in a short time.

Atlas being only regular mirrors like yours, its risk is different. In this case, I would keep Thanatos (also on this site) more up-to-date, but I guess it would not be long before I replaced the drive. The pain of rebuilding everything and going through my DR process is not worth an HDD.

So how are your backups? Do you already have spares on site? What is your risk appetite? How is your warranty?

HarambeLives · Aug 11, 2021

No drives under warranty; all of these are older used drives. If it's under warranty, I return it at the first sign of trouble

I currently have 2 systems, one with 12 x 8TB in RAIDZ2 and then the main system with mirrors. I have 3 x 4TB spares and 15 x 8TB spares, so I'm well stocked

The main system is replicated to the secondary system every 15 minutes, and I have multiple backups in different sites. So if I did lose the array, it wouldn't really be the end of the world, but obviously I'd like to avoid that

Assuming it passes a surface test and extended SMART, would you throw the disk with 4 read errors back into the pile?

rvassar · Aug 11, 2021

Since it's not critical, and not a pressing need, I'd probably toss it in another system, and torture it with badblocks(8) for several days, just to see if I could tip it into full failure. If it stays at 4 errors, and you're satisfied with the risk, run it as a primary device, not a hot spare.
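A rough sketch of what that torture run could look like, printed as a dry run rather than executed. The device name /dev/sdX and the 4096-byte block size are assumptions for illustration, and badblocks in write mode erases the disk, so this only belongs on a drive that is out of the pool.

```shell
#!/bin/sh
# Sketch only: prints the commands for a destructive burn-in; it does not run them.
# DISK is a hypothetical device name; substitute the drive under test.
DISK="${DISK:-/dev/sdX}"

burnin_plan() {
    disk="$1"
    # Baseline SMART report, to compare error counters afterwards.
    echo "smartctl -a $disk"
    # Destructive write+read pattern test: -w write mode, -s progress,
    # -v verbose, -b 4096 uses 4 KiB blocks. This ERASES the disk.
    echo "badblocks -b 4096 -wsv $disk"
    # Start the extended self-test, then take a second report once it is done.
    echo "smartctl -t long $disk"
    echo "smartctl -a $disk"
}

burnin_plan "$DISK"
```

Comparing the two smartctl reports shows whether the error count stayed at 4 or kept climbing during the run.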

Spearfoot · Aug 11, 2021

rvassar said:
Since it's not critical, and not a pressing need, I'd probably toss it in another system, and torture it with badblocks(8) for several days, just to see if I could tip it into full failure. If it stays at 4 errors, and you're satisfied with the risk, run it as a primary device, not a hot spare.

Agree, and will now humbly direct @HarambeLives to my disk burn-in repo:

GitHub - Spearfoot/disk-burnin-and-testing: Shell script for burn-in and testing of new or re-purposed drives


Mlovelace · Aug 11, 2021

You can often fix a drive's unreadable sector error by forcing a write to the bad sector. The write forces the drive to re-allocate the bad sector, since drives are made with spare sectors for exactly this purpose. Then run a S.M.A.R.T. long test to ensure the error is gone, scrub the pool to rebuild the affected data from redundancy (the mirror copy or parity), and you're good.

The command would be:

Code:

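# Allow raw writes to a disk that is in use (FreeBSD GEOM)
sysctl kern.geom.debugflags=16
# /dev/drive is a placeholder; fill in bs= and seek= as described below
dd if=/dev/zero of=/dev/drive bs= count=1 seek= conv=noerror,sync

bs= stripesize of the drive
seek= bad sector from the SMART error, counted in blocks of bs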
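One thing worth checking before running the dd above: seek= counts blocks of bs bytes, so if the SMART log reports the LBA in logical sectors that are smaller than bs, the number has to be rescaled. A minimal sketch of that arithmetic follows, assuming the LBA is in logical-sector units. The LBA, the sector size and the block size are made-up values, seek_for_lba is a hypothetical helper, and the script only prints the resulting command.

```shell
#!/bin/sh
# Sketch: convert an LBA from the SMART error log into a dd seek value.
# Assumes the LBA is counted in logical sectors; all numbers below are made up.

# seek_for_lba LBA LOGICAL_SECTOR_BYTES DD_BLOCK_BYTES
seek_for_lba() {
    # dd's seek= counts output blocks of bs bytes, so rescale the LBA.
    echo $(( $1 * $2 / $3 ))
}

LBA=123456792      # hypothetical failing LBA
LOGICAL=512        # hypothetical logical sector size in bytes
BS=4096            # hypothetical stripesize used as bs

SEEK=$(seek_for_lba "$LBA" "$LOGICAL" "$BS")
# Print the command instead of running it; /dev/drive is the placeholder from the post.
echo "dd if=/dev/zero of=/dev/drive bs=$BS count=1 seek=$SEEK conv=noerror,sync"
```

The integer division rounds down to the bs-sized block that contains the bad sector, so a single write with count=1 zeroes every logical sector in that block. If the logical sector size already equals bs, the LBA can be used as seek unchanged.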
