Failing SSDs: best way not to lose data

goosesensor

Dabbler
Joined
Jan 7, 2019
Messages
10
I set up a FreeNAS machine about four years ago with 8x1TB SSDs in RAIDZ2.

21 days ago I finally updated to TrueNAS Core.

Today I logged in and noticed an alert: 2 drives are degraded (66 and 99 read errors respectively). I was hoping to get more life out of these drives, and I was surprised (and somewhat horrified) to see two start failing in such close temporal proximity.

I want to give myself the best odds of saving my data in the event a third drive fails. What is the very next thing I should do?

1. Replace both drives and let it rebuild (reading and writing).
2. Replace only one drive and let it rebuild (reading and writing). I have a spare taped inside the case.
3. Plug in a USB SATA enclosure with a large HDD and use rsync or something similar to copy the pool's contents (mostly reading).
4. Copy to another machine via SSH/rsync (mostly reading). Probably the same as #3.

Any thoughts appreciated.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
First and foremost: always have backups. So, item 4 or 3, and not just this once but permanently.

If you have enough spare ports, it's possible to replace drives without unplugging the old ones, and thus without reducing redundancy.
Put the spare in use, without removing any drive. If possible, add a new drive and do both replacements at the same time without removing the old drives.
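
A rough command-line sketch of that (the pool and device names here are invented; use whatever zpool status actually shows, and note that on TrueNAS the web UI's Replace action is generally preferred since it also handles partitioning):

  zpool status tank                # find the identifier of the failing disk
  zpool replace tank ada3 ada8     # new disk (ada8) resilvers while the old one (ada3) stays attached
  zpool status -v tank             # watch resilver progress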

4 years is not that bad. What's the drive model? Are they all the same?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I would also add that a test of the "spare taped inside the case" should be done, before using it. Basically a bad blocks test on another system.

I also second the replace-in-place approach, meaning installing the replacement drive in the NAS and then using ZFS's option to replace a bad drive. Having the existing drive still available means whatever redundancy it (and the other bad drive) can still provide remains usable.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
@Arwen - SSD - surely you aren't advocating for a badblocks run on an SSD?
 

goosesensor

Dabbler
Joined
Jan 7, 2019
Messages
10
First and foremost: always have backups. So, item 4 or 3, and not just this once but permanently.

If you have enough spare ports, it's possible to replace drives without unplugging the old ones, and thus without reducing redundancy.
Put the spare in use, without removing any drive. If possible, add a new drive and do both replacements at the same time without removing the old drives.

4 years is not that bad. What's the drive model? Are they all the same?
I do swap physical backups at my folks' place, but it's been several months since I was there.

I didn't know you could replace them in that manner. That is great.

It's an ITX mobo and the single PCI-E slot has an SFP+ card in it. I will find a cheap supported SATA card to swap in, and use the mobo's built-in Ethernet for access while replacing the drives.

They are all 1TB "Pioneer" brand, model APS-SL3N-1TB.

I would also add that a test of the "spare taped inside the case" should be done, before using it. Basically a bad blocks test on another system.

I also second the replace-in-place approach, meaning installing the replacement drive in the NAS and then using ZFS's option to replace a bad drive. Having the existing drive still available means whatever redundancy it (and the other bad drive) can still provide remains usable.

Seems like a wise idea.

@Arwen - SSD - surely you aren't advocating for a badblocks run on an SSD?

Can you explain what you are saying here? Is there something wrong with running a check on an SSD or something? What software would you recommend? Linux preferred. Thanks.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Can you explain what you are saying here? Is there something wrong with running a check on an SSD or something? What software would you recommend? Linux preferred. Thanks.

Running badblocks on an SSD burns useful life for little to no return. The act of writing to an SSD wears it out, and badblocks performs exactly that as an exercise. An SSD already remaps bad LBAs as part of its wear leveling. Which isn't to say it's entirely useless, but a pool scrub catches 90+% of problems, so why burn the reserve flash over-provisioning just to prove the firmware works?
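
If the aim is simply to exercise reads on the pool's SSDs, a scrub already does that without any extra writes (a sketch, assuming a pool named tank):

  zpool scrub tank      # reads and verifies every allocated block
  zpool status tank     # shows scrub progress and any errors it found or repaired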
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Can you explain what you are saying here? Is there something wrong with running a check on an SSD or something? What software would you recommend? Linux preferred. Thanks.
badblocks isn't even designed to check drives, although it is commonly used for this purpose. It may be acceptable for HDDs, but it is not suitable for SSDs.
You may use solnet-test-array (BSD), since it is read only. Or just run SMART tests and trust the manufacturer…
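
For instance, on a Linux box with smartmontools installed (assuming the spare shows up as /dev/sdb):

  smartctl -t long /dev/sdb    # start the drive's extended self-test (runs inside the firmware)
  smartctl -a /dev/sdb         # afterwards: self-test log, error counters, wear indicators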
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
@Arwen - SSD - surely you aren't advocating for a badblocks run on an SSD?
I was suggesting a read test. Probably should have been clearer.

A read test of an SSD should / will refresh blocks that are close to losing their charge. Basically, any read of a flash cell uses up some of the electrical charge. So does just sitting for a long time, and so do warm or hot environments.

Causing an SSD to re-read all the blocks will find any that are low on charge (voltage). The SSD should / will automatically re-write them. And if you find any blocks that have already lost their ability to recover, then you know about it before putting the drive into service. A simple write to the "bad" block will cause a SATA device to swap the bad block out for a good spare block.
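
A couple of ways to force that full read pass on a Linux machine, assuming the spare is /dev/sdb (both are read-only, so they cost no write endurance):

  dd if=/dev/sdb of=/dev/null bs=1M status=progress   # sequential read of the whole device; bad sectors show up as I/O errors
  badblocks -sv /dev/sdb                              # badblocks' default mode is read-only (-w is the destructive one)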
 

goosesensor

Dabbler
Joined
Jan 7, 2019
Messages
10
@Etorix, you say that 4 years is not bad, so it sounds like these SSDs are failing due to normal use, not due to, say, random manufacturing tolerances or something of that sort? Heat degradation, etc. If that is the case, it seems fair that I could expect other drives to start showing faults in the near future, too? Recall that everything was fine less than a month ago, and since then two drives have shown faults.

What I'm getting at is, should I just replace all of them now with larger capacity drives? I don't want to replace two 1TB drives now, only for the remaining 6 drives to fail in the coming months. If that is likely to be the case, I would rather put 8 new 2TB drives in now.


A second, somewhat unrelated question: I don't have any spare SATA ports, but I do have a USB<->SATA adapter. When I connect my spare with this adapter, it shows up as da1 (da0 is the OS USB stick, ada0-ada7 are the RAIDZ2 pool drives). Not only does it show up as da1 rather than ada8 or similar, its serial number also seems to be emulated by the adapter's chip: it shows up as "123456789012". Can I still use the USB adapter to replace one of the failing drives and resilver, or will the device ID/name (da1) and/or fake serial number cause problems once I swap the bad SATA-connected drive out for the new, resilvered USB-connected drive?
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
@Etorix, you say that 4 years is not bad, so it sounds like these SSDs are failing due to normal use, not due to, say, random manufacturing tolerances or something of that sort? Heat degradation, etc. If that is the case, it seems fair that I could expect other drives to start showing faults in the near future, too? Recall that everything was fine less than a month ago, and since then two drives have shown faults.
That's my concern. A single failure could be just that: A single failure due to random variance.
Two identical drives failing in short succession may be drives reaching end-of-life under the load caused by the pool (e.g. reaching maximum writes), and if so, one may indeed expect that further drives will follow.

What I'm getting at is, should I just replace all of them now with larger capacity drives? I don't want to replace two 1TB drives now, only for the remaining 6 drives to fail in the coming months. If that is likely to be the case, I would rather put 8 new 2TB drives in now.
The two failing drives should certainly be replaced. Larger drives would not hurt.
You may also want to consider replacing the pool with another one, with a different geometry: fewer drives, but larger ones.
8*1 TB in Z2 provides "only" 6 TB raw (about 4 TB usable before the pool has to grow).
With 2 TB drives, 6 drives in Z2 already provide 8 TB raw.
With 4 TB drives (TLC rather than QLC), just 4 drives in Z2 provide 8 TB, 6 drives provide 16 TB.
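The arithmetic behind those figures is simply raw capacity = (drives - 2) x drive size for a RAIDZ2 vdev. A throwaway shell check, if you want to compare layouts:

  for size in 2 4; do
    for n in 4 6 8; do
      echo "$n x ${size} TB in RAIDZ2 -> $(( (n - 2) * size )) TB raw"
    done
  done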

In this case, you'll need either an HBA to attach all drives, or another (temporary) server, and then replicate the old pool to the new one.

A second, somewhat unrelated question: I don't have any spare SATA ports, but I do have a USB<->SATA adapter. When I connect my spare with this adapter, it shows up as da1 (da0 is the OS USB stick, ada0-ada7 are the RAIDZ2 pool drives). Not only does it show up as da1 rather than ada8 or similar, its serial number also seems to be emulated by the adapter's chip: it shows up as "123456789012". Can I still use the USB adapter to replace one of the failing drives and resilver, or will the device ID/name (da1) and/or fake serial number cause problems once I swap the bad SATA-connected drive out for the new, resilvered USB-connected drive?
'daN' names are no concern (ZFS tracks drives by gptid anyway), but a USB adapter is not reliable enough to be used for long-term storage in a data pool.
If replacing drives to keep the 8-wide Z2, though, I'd be tempted to try and use the adapter during the resilver, maybe with the new drive already attached to a SATA port and the old drive attached to the USB adapter so it can still provide full redundancy. If it works, repeat. If issues arise (e.g. the USB-adapted drive dropping during resilver), revert to the traditional method: offline the old drive, plug the new drive in its place, replace and resilver with reduced redundancy. (It's a RAIDZ2 anyway, so there's still one degree of redundancy left.)
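
As a sketch of both paths (the pool and device names are made up; use the identifier zpool status shows for the old disk):

  # Path keeping full redundancy: new disk on a SATA port (ada8), old disk still attached via USB
  zpool replace tank <old-disk-id> ada8
  # Traditional path, with one drive's worth of redundancy lost while resilvering:
  zpool offline tank <old-disk-id>
  # ...physically swap the failing disk for the new one on that SATA port...
  zpool replace tank <old-disk-id> ada3
  zpool status -v tank               # watch the resilver either way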
 

goosesensor

Dabbler
Joined
Jan 7, 2019
Messages
10
I was able to offline/replace one of the failing SATA drives with the new spare connected via USB <-> SATA adapter, then swap the new drive into the failing drive's SATA port and it all worked fine. Waiting on 2nd replacement in the mail.

Thanks for your help.
 

Alex_K

Explorer
Joined
Sep 4, 2016
Messages
64
I thought that when a drive is showing as degraded, it means the drive has failed and is not used, but what is written above implies degraded drives are still used for redundancy? Is that correct? To what extent? Like, if we have 2 drives "degraded" in a Z2 VDEV and a 3rd becomes "degraded", does that not mean the whole vdev's data is lost?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I thought that when a drive is showing as degraded, it means the drive has failed and is not used, but what is written above implies degraded drives are still used for redundancy? Is that correct? To what extent? Like, if we have 2 drives "degraded" in a Z2 VDEV and a 3rd becomes "degraded", does that not mean the whole vdev's data is lost?

Pools become degraded. Drives have different behaviors depending on the type of device and the failure experienced. HDDs can generate uncorrectable read errors, map the affected sectors out of use, and run on for years. Or they can work perfectly one day, get powered off, and fail to spin back up because the spindle bearings have seized. For the uncorrectable read errors, SSDs are built to do an equivalent substitution: when cells wear out, spare replacements are mapped in. These spares are engineered into the devices and are often referred to as flash over-provisioning. Your 1 TB flash drive may actually contain 1.5 TB of flash memory to achieve a specified life-span, with 500 GB sitting around like a hidden partial hot spare.

The problem with SSDs is that their entire structure is stored in the very thing that fails. At the onset of failure, a failing HDD may still respond to commands and may produce data for 90+% of the read requests it receives, aiding the resilvering process. If the failure leads to cascading damage, this number may drop faster than you can resilver in a replacement. When an SSD runs out of replacement cells, how the failure propagates becomes very dependent on where the next bad LBA appears. I've seen enterprise NVMe drives function perfectly one minute and have no addressable namespace the next. The structure describing the namespace was simply gone.

With Z2 VDEVs, you need to meet the minimum number of parity stripes per VDEV to reconstruct the data at a logical block address (LBA). With four devices being the minimum for Z2, you need two readable devices for each parity stripe of an LBA. VDEVs with more devices will have different minimums. For the 4-device VDEV these don't need to be the same two devices: the faulty devices may still return valid data. It's possible to "puncture" individual device LBAs and create a checkerboard of failures that still results in ZFS reconstructing valid data. But once you lose the VDEV, the pool is gone. A RAIDZ2 pool comprised of 3 VDEVs of 4 devices each will be destroyed by losing 3 devices in the same VDEV.
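
You can see exactly how ZFS currently judges each device and vdev (a sketch, pool name assumed):

  zpool status -v tank   # per-device READ/WRITE/CKSUM counters and ONLINE/DEGRADED/FAULTED states
  zpool status -x        # short summary; prints "all pools are healthy" when nothing is wrong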
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Like @rvassar said, drives (HDD or SSD) that still have usable data can be used to assist with recovery. That was not common when ZFS came out.

One point I would like to make is that ZFS handles recovery stripe by stripe of disk blocks. Simple example: you have a 4-disk RAID-Z2:

disk 0 - Failing block 10
disk 1 - Failing block 10 too
disk 2 - Failing block 20
disk 3 - Failing block 20 too

This allows the vDev to be fully recoverable: block 10 is recoverable from disks 2 & 3, with disks 0 & 1 supplying block 20. But you have to use replace-in-place, because pulling any disk removes one stripe's worth of redundancy.

Plus, ZFS only rebuilds data that is actually in use, so failures in unused storage blocks are not relevant during resilvers.
 