Do SATA drives do onboard ECC?

javaguru · Dec 31, 2021

Hi all,

I've been searching for reputable information about this but have not had any luck. Searching for "hard drive ECC" on Google just gives a flood of results about ECC RAM.

It is my understanding that SAS drives can do on-drive parity-based ECC to recover from bit errors on the stored data (I'm not referring to data in-flight which I know is part of the SATA spec). Is that feature available on any SATA drives too, such as white-label Exos? If so, then that would reduce the usefulness of data scrubbing for detecting bit-flips in particular.

Ericloewe · Dec 31, 2021

All vaguely-modern disks would be impossible without advanced error correction. This applies to both spinning rust and SSDs.

javaguru said:
If so, then that would reduce the usefulness of data scrubbing for detecting bit-flips in particular.

Not in the slightest. ZFS exists as it does despite all the measures taken by drive manufacturers.

Arwen · Dec 31, 2021

ZFS data scrubbing does something no other conventional file system does: it reads the metadata (directories, free block lists, etc.), the data blocks and any redundancy. This becomes key when you have lots of static, rarely used data, like on a media server.

For example, my media server has 3,700 videos (though many hundreds are shorter special features from the source media) and 1,100 songs. I basically put any new purchase on my media server because I don't own a disc player, then use my media player to watch it. But I rarely re-watch most of them, and it might be years before I return to something. So an unrecoverable bad block can happen, and without my twice-a-month ZFS scrubbing I might never know about it.

To be clear, I have temporarily lost perhaps 30 movies or songs since I made that media server in 2014. ZFS scrubbing found problems, let me know the exact file, and then I simply restored it from backup. (My miniature media server does not have any redundancy for the media because I needed all the space.)
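As a rough illustration of why a scrub can name the exact damaged file, here is a minimal Python sketch of the idea: record a checksum when data is written, then re-read everything later and compare. It is not how ZFS is implemented; SHA-256 stands in for whatever checksum the pool uses, and the file names, `write_blocks` and `scrub` are made up for the example.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum of one block of data (SHA-256 as a stand-in)."""
    return hashlib.sha256(block).hexdigest()


def write_blocks(files: dict[str, bytes]) -> dict[str, str]:
    """Record one checksum per file at write time."""
    return {name: checksum(data) for name, data in files.items()}


def scrub(files: dict[str, bytes], recorded: dict[str, str]) -> list[str]:
    """Re-read every file and return the names whose data no longer matches."""
    return sorted(
        name for name, data in files.items() if checksum(data) != recorded[name]
    )


# Hypothetical media library with two files.
files = {"movie_a.mkv": b"A" * 4096, "song_b.flac": b"B" * 4096}
recorded = write_blocks(files)

# Simulate silent corruption: flip one bit in a file nobody has opened in years.
damaged = bytearray(files["song_b.flac"])
damaged[100] ^= 0x01
files["song_b.flac"] = bytes(damaged)

damaged_files = scrub(files, recorded)
print(damaged_files)
```

Without redundancy, the scrub can only report the damage, which is why the fix in that situation is a restore from backup.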

Other file systems like BTRFS, HAMMER and I think BCache do some data scrubbing. But BTRFS does not checksum RAID-5/6 parity, so the parity itself is not scrubbed.

However, your point about a simple bit flip is accurate. Any modern disk should automatically correct such a problem, and return good data blocks to the user.

Constantin · Dec 31, 2021

Moreover, another amazing feature, snapshots, allows recovery from user errors such as mangling directory contents. I recently managed to do just that when I switched from iTunes in OS X Sierra to Music in OS X Monterey. Thank you, Apple, for making my day more interesting.

With the help of snapshots and now bliss, I am gradually fixing the issues that I have encountered to allow my iTunes content to appear in Music as it did in iTunes.

winnielinnie · Dec 31, 2021

Arwen said:
I have temporarily lost perhaps 30 movies or songs since I made that media server in 2014. ZFS scrubbing found problems, let me know the exact file, and then I simply restored it from backup.

Did this coincide with any IO errors or SMART errors? Or was the corruption (bad checksum) discovered by routine ZFS scrubbing in the absence of any other errors/warnings?

Losing / corrupting 30 or so files in a span of 7 years seems like something terrible is going on, especially if you had no other warnings. Technically, if this was a standard filesystem, such as NTFS, ExFAT, Ext4, or XFS, you would have had 30+ corrupted files on your drive, since 2014, without any warnings or SMART errors, and never know about it until you tried to open the files?

(I'm going on three years on my current FreeNAS/TrueNAS pool, and have never had even a single corruption found on my monthly scrubs.)

javaguru · Jan 1, 2022

Ericloewe said:
All vaguely-modern disks would be impossible without advanced error correction. This applies to both spinning rust and SSDs.

So, to be clear, if a cosmic ray flips a bit on my platter and I go to read that bit, the disk will correct it?

Arwen said:
However, your point about a simple bit flip is accurate. Any modern disk should automatically correct such a problem, and return good data blocks to the user.

If so, I'm hoping that I can reduce disk wear by using longer scrubbing intervals. I'm certainly glad to have ZFS's excellent scrubbing features, but it would be nice to know the disk has some facilities of its own.

Ericloewe · Jan 1, 2022

javaguru said:
So, to be clear, if a cosmic ray flips a bit on my platter and I go to read that bit, the disk will correct it?

Conceptually, yes. In practice, it depends on many factors. None of this should be viewed as an endorsement of HDD reliability.

joeschmuck · Jan 1, 2022

The answer I think you are looking for is CRC error checking, which is built into the hard drive for writing and reading data. It is also part of the data transfer between the hard drive and the hard drive controller, and errors there are what cause the UDMA_CRC_Errors count to increase in the SMART data. Here is a decent link to explain it; I'm sure there is more out there as well. Here is another link to a white paper by WD on ECC/DSP controller use in NAND SSDs. Lastly, here is a link specifically dealing with Exos hard drives, per your question.
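To make the detection part concrete, here is a small Python sketch of a CRC check. It uses `zlib.crc32` from the standard library purely as a stand-in; the exact polynomial and framing a drive or the SATA link uses are not shown here, and the 512-byte payload and `crc_ok` helper are made up for the example.

```python
import zlib

# A hypothetical 512-byte sector payload.
payload = bytes(range(256)) * 2

# The sender (or the drive at write time) stores a CRC next to the data.
stored_crc = zlib.crc32(payload)


def crc_ok(data: bytes, expected: int) -> bool:
    """Return True if the data still matches the CRC recorded earlier."""
    return zlib.crc32(data) == expected


# Flip a single bit to simulate a transfer or media error.
corrupted = bytearray(payload)
corrupted[10] ^= 0x04

print(crc_ok(payload, stored_crc))           # intact data passes
print(crc_ok(bytes(corrupted), stored_crc))  # damaged data fails
```

A CRC on its own only detects the error. Repairing it takes a retry, ECC, or a second copy of the data.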

javaguru said:
If so, then that would reduce the usefulness of data scrubbing for detecting bit-flips in particular.

But as the others have said, ZFS takes data integrity to the next level, for when your data MUST be correct. You can change the scrub schedule as you desire, but be aware that no one here will tell you it's sound advice to scrub less often. I use the original default of every 30 days; it works for me, and my important data is all backed up on an external device.

javaguru said:
If so, I'm hoping that I can reduce disk wear by using longer scrubbing intervals.

Hmmm, if you run your system the way most of us do, the hard drives are always spinning, which reduces wear on the drive motor and power circuits. The drive electronics automatically pull the heads off the platters when needed, and the heads sweep across the entire platter surface to ensure even wear. A scrub does not cause any physical harm to a hard drive; it just moves the heads around, like normal use. The heads are low-mass items, unlike when I was in my 20s and we used stepper motors to move them. I would worry more about damage from sleeping the drives or from frequently powering the system off and on than about a scrub. I'm trying to put this into perspective.

There is a lot of data out on the internet, you just need to look hard for it.

Arwen · Jan 1, 2022

Back about 20 years ago, when 512-byte sectors were the norm, I vaguely recall that the per-sector CRC & ECC on many disks was capable of fixing up to 11 bad bits in a single sector. Beyond that, the disk would have to attempt multiple reads of the sector (and its CRC / ECC too) to see if the data in the sector could be recovered.
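For anyone curious how stored redundancy can fix a flipped bit rather than just detect it, here is a toy Python sketch using a Hamming(7,4) code, which corrects any single-bit error in a 7-bit codeword. This is only an illustration of the principle: real drives use far stronger codes over whole sectors, and the `encode` and `decode` names are made up for the example.

```python
def encode(data: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword: list[int]) -> tuple[list[int], int]:
    """Correct up to one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome gives the 1-based bad position
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


data = [1, 0, 1, 1]
stored = encode(data)

# Simulate a single bit flip on the platter at position 5.
read_back = list(stored)
read_back[4] ^= 1

recovered, error_position = decode(read_back)
print(recovered, error_position)
```

Two flipped bits in the same codeword would defeat this toy code, which is the small-scale version of a sector with more bad bits than its ECC can fix.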

This "extreme" attempt by hard disks is not needed in most NAS & SAN implementations due to redundancy. The process can take more than 1 minute PER SECTOR. So most manufacturers of disks targeted at NAS and SAN use time-limited error recovery. Western Digital abbreviates that to TLER; Seagate uses a different name for the same concept. Remember, even if ZFS gives up sooner and gets the data from redundancy, the disk with the potentially bad block is hung until error recovery is complete. This probably means ZFS will give up using it completely and degrade the pool. Not perfect.

ZFS can get stuck waiting for desktop disk drives (aka non-NAS or SAN) to finish their error recovery. It is generally recommended to limit error recovery to about 7 seconds using TLER. Then, when ZFS detects a bad disk block, it simply recovers it from redundancy.
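On drives that support it, this timeout can usually be set through smartmontools with `smartctl -l scterc,READ,WRITE`, where the values are in tenths of a second. Below is a small Python sketch that only builds that command for a 7 second limit. The device path and the `scterc_command` helper are hypothetical, nothing is executed here, and many drives either reject the setting or forget it after a power cycle.

```python
def scterc_command(device: str, seconds: float = 7.0) -> list[str]:
    """Build the smartctl command that caps read and write error recovery time.

    smartctl expects the limits in units of 100 ms, so 7 seconds becomes 70.
    """
    deciseconds = round(seconds * 10)
    return ["smartctl", "-l", f"scterc,{deciseconds},{deciseconds}", device]


# Hypothetical device path; substitute the disk you actually mean.
command = scterc_command("/dev/sda")
print(" ".join(command))
```

Running `smartctl -l scterc` against a device without any values reports the current setting, which is a safe way to check whether a given drive supports the feature at all.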

winnielinnie said:
Did this coincide with any IO errors or SMART errors? Or was the corruption (bad checksum) discovered by routine ZFS scrubbing in the absence of any other errors/warnings?

Losing / corrupting 30 or so files in a span of 7 years seems like something terrible is going on, especially if you had no other warnings. Technically, if this was a standard filesystem, such as NTFS, ExFAT, Ext4, or XFS, you would have had 30+ corrupted files on your drive, since 2014, without any warnings or SMART errors, and never know about it until you tried to open the files?

(I'm going on three years on my current FreeNAS/TrueNAS pool, and have never had even a single corruption found on my monthly scrubs.)

No, I don't remember any I/O errors or SMART errors. The corruption was found by regular ZFS scrubs, (twice a month).

That miniature media server, while GOOD solid hardware, could get warm, especially during ZFS scrubs or Gentoo Linux updates (where any package to be updated is compiled from source). It is designed to be fan-less. I've since added a USB-powered fan to blow air across the top. (I was unable to purchase the heat sink top separately.) That fan seemed to reduce the errors somewhat. Plus, it now has a dedicated UPS that can probably last more than 1 hour. The UPS may have helped too.

Last, the 2.5" 2TB drive is likely an SMR drive (it's a laptop type). I'm not saying it's responsible for all the errors, as the mSATA SSD has also "lost" data.
