One or more devices has experienced an error resulting in data corruption.

Status
Not open for further replies.

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
At $OldJob, we used to qualify embedded computers for operation from -40C to +85C. It was only that high because they were sealed, and therefore the internal temperature got up to 85 or so.

Yeah, I come from the land of hospital monitoring devices. It's a lot more entertaining when you ratchet up the paranoia levels to reach outside the normal environmental range.

-40C... something outdoorsy?
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
-40C... something outdoorsy?

Yeah, oil field equipment that quite commonly operated in northern Alberta and the territories. Because it was sealed, and could very well be in direct sunlight at 30-35C, its internal temperature could end up at 85.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
When I worked at a nuclear power plant we had a few embedded systems that were a little computer with something like rubber cement poured into the cavity so that it filled the entire "case". For us, that was part of the ratings needed to ensure proper operation under water, in high humidity (such as steam... whoops), and in other extreme conditions. It was pretty cool to see, but there was no fixing it. If it didn't work, you replaced the whole thing.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I remember when companies like Commodore used to fill their power supply bricks with epoxy, just to be sphincters.
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Even if his RAM is bad, it doesn't necessarily mean the data on his pool is corrupted (even though ZFS thinks that it is).

Bad RAM can certainly trigger cksum errors to be logged to the pool during a scrub, but due to the precise way ZFS is designed, it's pretty unlikely that any data corruption due to bad RAM would actually be committed to disk for files that were already on the disk before the RAM went bad.

Either the RAM has to be so horribly broken that I don't know how the system hasn't crashed yet, or you have to hit the unlikely scenario of a checksum collision, for existing data already on disk to be overwritten with corrupted data because of bad RAM.

In my time with ZFS I've certainly seen systems and pools that reported files as corrupt, but when the disks were transplanted into a non-faulting system, the files were recoverable and, when compared to the original files, were actually not corrupted at all. ZFS can and does report false corruption in the face of bad RAM.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That's not necessarily true. ZFS does attempt to correct errors when a read results in a corrupt block.
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
That's not necessarily true. ZFS does attempt to correct errors when a read results in a corrupt block.

But that relies on a bad HDD. I was really referring to the case where the disks are working fine and the RAM is the only thing malfunctioning. The OP posted his SMART status, and there is no real reason at this time to think his HDDs are returning any errors.

If ZFS detects a block being read doesn't match the checksum, but the disk didn't return a hard read error, then ZFS will only repair the block in memory so it can serve the user a "repaired" copy of that block. It doesn't actually edit the block on-disk in a case such as this.

A read will only cause ZFS to overwrite a block on the disk if the read is accompanied by a URE.

Note that when I say overwrite, I mean write a new block and update the block pointer to that new block.

A scrub can update blocks on-disk, but only if ZFS knows that it found and verified an un-corrupted "copy" of the block to replace the block it thinks is corrupt with.

It works like this. The scrub reads a block into RAM, then it calculates the checksum and compares it with the checksum ZFS stored back when the block was originally written. Say the RAM is bad and it either corrupted the block as it was read into RAM, or corrupted the checksum as it was read into RAM. Either way, when ZFS compares the two it will say there is a cksum error since they don't match (even though the block on the disk is actually fine). ZFS will log this cksum error to the zpool status.

Now it will attempt to repair the data. First it reads in the parity or mirror data for the block and, from that, reconstructs a redundant "copy" of the block. It also reads in the checksum for this redundant copy that was stored when the mirror or parity blocks were first created. Then it compares the checksum it just computed for the redundant block with the checksum stored when it was originally written.

Two things can happen here. Either this redundant copy was read into good RAM this time and ZFS finds that it matches its original checksum (thus verifying the redundant copy is "good"), in which case it replaces the original "bad" data block with this newly verified good block.

Or, when ZFS reads in the redundant copy and its checksum, the bad RAM also corrupts one of them. When ZFS compares the two this time, it sees that they don't match either. In this case ZFS still doesn't overwrite anything on-disk, because it hasn't yet found a verified good copy to use. It then checks the next redundant copy (a higher n-way mirror, or RAIDZ2/Z3).

If all of these additional copies also get corrupted as they are read into RAM, then ZFS will ultimately abort the repair operation for that block and will not overwrite anything on-disk. It will then report that there are corrupt files in the pool that it cannot repair. But in reality, the original block is still on the disk and is in fact not corrupt, even though ZFS thinks it is.
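To make the decision flow easier to follow, here's a rough Python sketch of what I'm describing. It's purely illustrative pseudocode, not actual ZFS code; the helper names and the random bit-flip standing in for bad RAM are made up for the example:

import hashlib
import random

def cksum(block: bytes) -> str:
    # Stand-in for the block checksum ZFS stored at write time (fletcher4/sha256 in reality).
    return hashlib.sha256(block).hexdigest()

def read_into_ram(block: bytes, ram_is_bad: bool) -> bytes:
    # Simulate bad RAM sometimes flipping a bit as the block lands in memory.
    if ram_is_bad and random.random() < 0.5:
        return bytes([block[0] ^ 0x01]) + block[1:]
    return block

def scrub_block(on_disk_block, stored_cksum, redundant_copies, ram_is_bad):
    """Model of the scrub/repair decision flow described above (not real ZFS internals)."""
    in_ram = read_into_ram(on_disk_block, ram_is_bad)
    if cksum(in_ram) == stored_cksum:
        return "block verified, nothing to do"

    print("cksum error logged to zpool status")       # possibly a false positive
    for copy, copy_cksum in redundant_copies:         # mirror halves / RAIDZ reconstructions
        copy_in_ram = read_into_ram(copy, ram_is_bad)
        if cksum(copy_in_ram) == copy_cksum:
            # Only a copy that verifies against its own checksum ever gets written back.
            return "repair: block rewritten from verified copy"

    # No copy verified (bad RAM mangled every read): the on-disk block is left alone.
    return "reported as unrecoverable, but the on-disk data may actually be fine"

# Example: scrub a perfectly healthy block through bad RAM a few times.
block = b"original data"
copies = [(block, cksum(block))]
for _ in range(3):
    print(scrub_block(block, cksum(block), copies, ram_is_bad=True))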
 

devnullius

Patron
Joined
Dec 9, 2015
Messages
289
It feels to me like that is exactly what's happening here: a haunted consumer PC with some small but fatal error 'somewhere'. I'd love to hear whether importing the disks into another machine would solve all the problems. I must say I really like what you wrote down here! Gives a secure feeling too :)

Thanks!

Devvie
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
But that relies on a bad HDD. I was really referring to the case where the disks are working fine and the RAM is the only thing malfunctioning. The OP posted his SMART status, and there is no real reason at this time to think his HDDs are returning any errors.

If ZFS detects a block being read doesn't match the checksum, but the disk didn't return a hard read error, then ZFS will only repair the block in memory so it can serve the user a "repaired" copy of that block. It doesn't actually edit the block on-disk in a case such as this.

A read will only cause ZFS to overwrite a block on the disk if the read is accompanied by a URE.

My understanding of this is that ZFS will 'self-heal' the disks whenever it detects checksum errors. It doesn't matter if this happens via a regular CIFS read of the file or via a scrub.

Take a two-drive mirror. If you manually erase some sectors from one of the disks, then reimport the pool and read back a file that occupied some of those blocks, ZFS will realize that one of the disks is returning bad data. It will then read the data from the other disk, which will correctly match the checksum. The good data is returned to the client, and also written back to the 'bad' disk so that full redundancy is restored.

This is one of the points of a self-healing filesystem. But you do have to trust the memory subsystem, because if the memory corrupts the data, and not the disk, ZFS has no way to know.
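If it helps, here's a tiny Python model of that read path. It's just an illustration of the idea, not ZFS internals; the in-memory "disks" and the function names are invented for the example:

import hashlib

def cksum(block: bytes) -> str:
    # Stand-in for the checksum ZFS stored when the block was originally written.
    return hashlib.sha256(block).hexdigest()

def read_from_mirror(disks, offset, length, stored_cksum):
    """Illustrative self-healing read from a mirror (RAM is trusted here)."""
    copies = [bytes(d[offset:offset + length]) for d in disks]
    good = next((c for c in copies if cksum(c) == stored_cksum), None)
    if good is None:
        raise IOError("no copy matches the checksum: unrecoverable")
    # Self-heal: any copy that failed verification is rewritten with the good data.
    for disk, copy in zip(disks, copies):
        if copy != good:
            disk[offset:offset + length] = good
    return good  # the client only ever sees verified data

# "Manually erase some sectors" on one half of a two-drive mirror, then read back.
data = b"important file block"
disk_a, disk_b = bytearray(data), bytearray(data)
stored = cksum(data)
disk_a[:] = b"\x00" * len(data)            # corrupt one disk, not the RAM
print(read_from_mirror([disk_a, disk_b], 0, len(data), stored))
print(disk_a == disk_b)                    # True: the bad half was healed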

This was gone through step by step in one of Jeff Bonwick and Bill Moore's presentations on ZFS.

I think this is about where in the video they describe it:

https://youtu.be/NRoUC9P1PmA?t=57m45s

Edit: Specifically time 1:00:15 is where he describes the 'bad' disks data being updated with good data after an application read. 57:45 is where the self healing section starts though.
 
Last edited:

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Yes, but the repair process ZFS uses is as I described whether it happens during a read or a scrub. It verifies that the redundant copy (the one ZFS wants to use to repair the original block it thinks is corrupt) is itself valid. If the redundant copy is not verified valid (because RAM also corrupted it), then it won't overwrite the original block. If it does check out as valid (the bad RAM didn't corrupt it that time), then ZFS overwrites the block with that data, whether or not the block it is fixing was actually bad in the first place.

My original point is just that bad RAM can also create false positive "corrupt" data as far as ZFS is concerned.

People readily accept that ZFS can't function entirely as it should if the RAM is bad; it's obvious that your data can't really be trusted anymore. But in the case of bad RAM, they for some reason still seem to fully trust "zpool status". What they fail to realize is that ZFS's bookkeeping, the metadata that tracks which blocks on disk are good and which aren't, can also be wrongly influenced by bad RAM. "zpool status" can easily be lying because it got confused by the bad RAM. When you have bad RAM, all sorts of funky ZFS failure scenarios can happen, including the case where the data is fine but ZFS incorrectly thinks it's corrupt.

At least in my experience debugging ZFS with faulty RAM, I saw ZFS log cksum errors and mark data as corrupt far more often than it actually corrupted data that was already good on-disk. Of course, new incoming data can easily be written corrupt from the start, and there isn't really anything reasonable ZFS can do about that.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, I have a stick of ECC RAM (server-grade stuff) that is bad. It has lots of multi-bit errors that should have triggered an MCE for the user, but for some reason didn't. I plan to put it in a test system (along with known-good RAM) so this kind of thing can be put to rest for good. ;)
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Well, I have a stick of ECC RAM (server-grade stuff) that is bad. It has lots of multi-bit errors that should have triggered an MCE for the user, but for some reason didn't. I plan to put it in a test system (along with known-good RAM) so this kind of thing can be put to rest for good. ;)

Sweet. More data points on this stuff are great, as there aren't many people testing this kind of thing publicly. No doubt the Sun devs tested this stuff a LOT while developing ZFS, but it's not like we can simply read through all their old test results, unfortunately.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah. I've been trying to get my hands on a stick of RAM that was ECC and bad so we can test the living heck out of it. I'm glad that the person was willing to ship it to me. More will come once I get some time to test this further. :D
 

devnullius

Patron
Joined
Dec 9, 2015
Messages
289
Well, please keep us posted in this thread with any results? :)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I will be making a thread based on my findings, because the discussion is *far* larger than this thread.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I will be making a thread based on my findings, because the discussion is *far* larger than this thread.
Looking forward to it. Some weird cases have left me a bit worried...
 
Joined
Nov 11, 2014
Messages
1,174
You, uh, are nowhere near 53% done. You might be 53% done with a single pass, but you need to run memtest for days or even weeks to be reasonably assured of detecting problem memory.

Do you run memtest86 v4 in BIOS mode or memtest86 v6 in UEFI mode?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'm actually not too fussy. Since the systems I'm testing are usually ECC, the big thing is simply to get memtest pounding on the memory subsystem as hard as it can. All CPUs, all memory.
 
Joined
Nov 11, 2014
Messages
1,174
I'm actually not too fussy. Since the systems I'm testing are usually ECC, the big thing is simply to get memtest pounding on the memory subsystem as hard as it can. All CPUs, all memory.

By default it runs on one CPU core unless the user changes it. Are you saying it's better to run it on all cores?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
By default it runs on one CPU core unless the user changes it. Are you saying it's better to run it on all cores?

Yes, it generates more traffic to the memory subsystem and exercises more parts of the system. The goal is to tease out any flaws in the system. Driving it harder as part of the burn-in process means that it shouldn't have any trouble shouldering a lighter NAS load later.
 