I've seen many people unfortunately lose their zpools over this issue, so I'm going to try to provide as much detail as possible. If you don't want to read to the end, then just go with ECC RAM. For those of you who want to understand just how destructive non-ECC RAM can be, I'd encourage you to keep reading. Remember, ZFS does all of its work in system RAM. Normally your hardware RAID controller would perform the same function as the ZFS code, and every hardware RAID controller you've ever used that has a cache has ECC cache. The simple reason: the manufacturers know how important it is to keep a few stuck bits from trashing your entire array. The hardware RAID controller (just like ZFS) absolutely NEEDS to trust that the data in RAM is correct.
For those that don't want to read, just understand that ECC is one of the legs on your kitchen table, and you've removed that leg because you wanted to reuse old hardware that uses non-ECC RAM. Just buy ECC RAM and trust ZFS. Bad RAM is like your computer having dementia: you can't ask it what it forgot, because it simply doesn't remember, and neither will your data.
Here are some assumptions I've made, along with some explanation for them:
1. We're going to deal with a system that has a single bit error: a memory location stuck at "1". Naturally, more than a single bit error is going to cause more widespread corruption. For my examples I'm going to use a hypothetical 8 bits of RAM, with bit number 2 stuck at "1". I'll point out which bits end up corrupted as we go. Later I will move this location, but I'll call that out when it happens.
2. Most servers have very little RAM used by system processes and large amounts of RAM used for cache, so we're going to assume the system runs stably, since the errors are most likely sitting in the cache. Even if the system does hit the bad RAM location and crashes, you are more than likely going to reset it and keep going until, after multiple crashes in a short period of time, you realize something else is wrong. At that point, you're already in "oh crap" territory.
3. I'm going to ignore any potential corruption of the file system for my examples. Clearly corrupting the file system itself is VERY serious and can be fatal for your data. I will cover this topic at the very end of this discussion.
4. No additional corruption from any other subsystem or data path will occur. Obviously any additional corruption won't make things any easier.
5. What I am about to explain is a limitation of ZFS. This is not FreeNAS specific. It is ZFS specific.
Now on to some examples:
What happens when non-ECC RAM goes bad in a system running a file system that doesn't do its own parity and checksumming?
So pretend a file is loaded into RAM and then saved to an NTFS/ext4/UFS/whatever-file-system-you-want-that-doesn't-do-its-own-parity/checksumming.
The file is 8 bits long: 00110011.
Since our file passed through the bad RAM location, it's been corrupted. It's now 01110011 (bit 2 got stuck at "1"). That corrupted copy is forwarded to the disk subsystem and stored on disk, already trashed. If this were file system metadata, you might have bigger problems too. So no matter what your destination is, a hard disk, an SSD, or a redundant array, the file will be saved wrong. No big deal, except for that one file. It might make the file unable to be opened in your favorite office suite, or it might cause a momentary glitch in your streaming video.
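To make that concrete, here's a minimal sketch of the failure in our hypothetical model: one RAM bit (bit 2, counting from the left of our imaginary 8 bits) stuck at "1". The function name and values are made up purely for illustration:

```python
# Minimal sketch: a single RAM bit stuck at "1" (hypothetical example values).
STUCK_BIT = 2  # bit position 2, counting from the left of an 8-bit value

def store_in_bad_ram(byte: int, stuck_bit: int = STUCK_BIT) -> int:
    """Simulate writing a byte through RAM with one bit stuck at 1."""
    mask = 1 << (8 - stuck_bit)   # bit 2 from the left of 8 bits -> 0b01000000
    return byte | mask            # the stuck bit always reads back as 1

original = 0b00110011             # the 8-bit "file" from the example
corrupted = store_in_bad_ram(original)
print(f"{original:08b} -> {corrupted:08b}")   # 00110011 -> 01110011
```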
What happens when non-ECC RAM goes bad in a ZFS system?
So pretend that our same 8-bit file is stored in RAM wrong.
Same file as above and same length: 00110011.
The file is loaded into RAM, and since it's going to be stored on your ultra-safe zpool, it goes to ZFS to have its parity and checksums calculated. So your RAIDZ2 zpool gets the file. But here's the tricky part: the file has already been corrupted by your bad RAM location.
ZFS gets 01110011 and is told to safely store that in the pool. So it calculates parity data and checksums for the corrupted data, and that is what gets saved to disk. Ok, not much worse than above, since the file is trashed either way. But your parity and checksums aren't going to help you, because they were calculated after the data was corrupted.
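Here's that point sketched in a few lines: a checksum computed on already-corrupted data will happily validate that corrupted data forever. I'm using SHA-256 as a stand-in for whatever checksum the pool actually uses (ZFS defaults to fletcher4):

```python
import hashlib

def checksum(data: bytes) -> bytes:
    """Stand-in block checksum (ZFS would use fletcher4 or SHA-256)."""
    return hashlib.sha256(data).digest()

corrupted_block = bytes([0b01110011])        # what ZFS received from bad RAM
stored_checksum = checksum(corrupted_block)  # checksum of already-bad data

# Later, on read-back, the corrupted block verifies perfectly:
assert checksum(corrupted_block) == stored_checksum
print("checksum matches -- ZFS has no way to know the data was already bad")
```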
But now things get messy. What happens when you read that data back? Let's pretend that the file lands at a slightly different spot in RAM this time, so the stuck bit lines up with bit position 5 instead of position 2. Now I read back the data:
Read from the disk: 01110011.
But what gets stored in RAM? 01111011.
Ok, no problem. ZFS will check the block against its checksum. Oops, the check fails, because we have a new corrupted bit. Remember, the checksum was calculated after the corruption from the first memory error occurred. So now the parity data is used to "repair" the bad data, and the data is "fixed" in RAM. It's supposed to be corrected back to 01110011, right? But since we have that bad 5th position, it's still bad! It's correct for maybe one clock cycle, and then, thanks to our bad RAM location, it's immediately corrupted again. So we really didn't "fix" anything: 01111011 is still in RAM. Now, since ZFS believes it detected corrupted data coming from a disk, it's going to write the "fix" back to the drive. Except it's actually corrupting the on-disk copy even more, because the repair didn't repair anything. So as you can see, things only get worse as time goes on.
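Here's a toy simulation of that read/"repair" loop, again using our hypothetical stuck bit (now at position 5) and treating the stored checksum as simply "the value ZFS expects". This is not real ZFS code, just the failure mode described above:

```python
# Toy simulation of the read/"repair" loop -- not real ZFS code.
# Assumptions: one 8-bit block, a checksum computed on already-corrupted data,
# and a RAM bit that is now stuck at position 5 (counting from the left).

def stuck_bit_ram(byte: int, position: int) -> int:
    """Every value that passes through RAM comes back with this bit forced to 1."""
    return byte | (1 << (8 - position))

on_disk = 0b01110011                    # corrupted data, written with a matching checksum
stored_checksum = on_disk               # stand-in: "checksum" is just the expected value

in_ram = stuck_bit_ram(on_disk, 5)      # read back through bad RAM -> 01111011
if in_ram != stored_checksum:           # checksum verification fails
    repaired = stored_checksum          # "repair" reconstructs 01110011 from parity/mirror
    in_ram = stuck_bit_ram(repaired, 5) # ...but it sits in the same bad RAM: 01111011 again
    on_disk = in_ram                    # and the bogus "fix" gets written back to disk

print(f"{on_disk:08b}")                 # 01111011 -- the disk copy is now worse than before
```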
Now let's think about your backups.
If you use rsync, then rsync is going to back up the file in its corrupted form. And what if the file was correct when you first backed it up and only got corrupted later? Well, thanks to rsync, the next sync overwrites your good backup with the corrupted copy.
What about ZFS replication? Surely that's better, right? Well, sort of. Thanks to those regular snapshots, your server will happily replicate the corruption to your backup server. And let's not forget the added risk of corruption during replication itself: when the ZFS checksums are being calculated for the stream that gets piped over SSH, those might be corrupted too!
But we're really smart. We also do religious zpool scrubs. Well, guess what happens when you scrub the pool. As that stuck memory location is continually read and written through, ZFS will attempt to "fix" corrupted data that it thinks came from your hard disk and write that data back. But it is actually reading good data from your drive, corrupting it in RAM, "fixing" it in RAM (which doesn't fix it, as I've shown above), and then writing the "fixed" data to your disk. This means the data in your entire pool is being trashed by the very act of scrubbing it.
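Carrying the same toy model one step further, this is roughly why a scrub spreads the damage across the whole pool rather than just one file. All names and values here are hypothetical:

```python
# Rough sketch of why a scrub with bad RAM damages the whole pool, not one file.
# Hypothetical model: every block read during the scrub passes through a RAM
# location with one bit stuck at 1, fails its checksum, and the bogus "repair"
# (still corrupted) gets written back to disk.

def through_bad_ram(block: int) -> int:
    return block | 0b00001000          # one stuck bit, as in the examples above

pool = [0b00110011, 0b10100101, 0b11110000, 0b00000111]   # good on-disk blocks
checksums = list(pool)                                     # stand-in "checksums"

for i, block in enumerate(pool):
    seen = through_bad_ram(block)      # scrub reads the block through bad RAM
    if seen != checksums[i]:           # checksum mismatch -> ZFS tries to "heal"
        pool[i] = through_bad_ram(checksums[i])   # "repaired" copy written back, still corrupted

damaged = sum(1 for a, b in zip(pool, checksums) if a != b)
print(damaged, "of", len(pool), "blocks now corrupted on disk")
```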
So in conclusion:
1. All that stuff about ZFS self-healing goes down the drain if the system isn't using ECC RAM.
2. Backups will quite possibly be trashed because of bad RAM. Based on forum users over the last 18 months, you've got almost no chance your backups will be safe by the time you realize your RAM is bad.
3. Scrubs are the best thing you can do for ZFS, but they can also be your worst enemy if you use bad RAM.
4. The parity data, checksums, and actual data all need to match. If they don't, repairs start taking place. And what are you to do when a disk needs to be replaced and the parity data and actual data don't match because of corruption? The data is lost.
To protect your data from loss with ZFS, here's what you need to know:
1. Use ECC RAM. It's a fundamental truth.
2. ZFS uses parity, checksums, mirrors, and the copies parameter to protect your data in various ways. Checksums prove whether the data on the disk is corrupted; parity/mirrors/copies correct those errors (see the sketch after this list). As long as you have enough parity/mirrors/copies to fix any error that ever occurs, your data is 100% safe (again, assuming you are using ECC RAM). That's why running a RAIDZ1 is so dangerous: when one disk fails, you have no more protection. During the long (and strenuous) task of resilvering your pool, you run a very high risk of encountering errors on the remaining disks, and any such error is detected but not corrected. Let's hope the error isn't in the file system structures themselves, where corruption could be fatal for your entire pool. In fact, about 90% of users that lose their data had a RAIDZ1 and suffered 2 disk failures.
3. If you run out of parity/mirrors and your pool is unmountable, you are in deep trouble. There are no recovery tools for ZFS, and quotes from data recovery specialists start in the 5-digit range. All those tools people normally use to recover desktop file systems don't work with ZFS; ZFS is nothing like those file systems, and generic recovery tools typically just find billions of 4k blocks that look like file fragments. Clearly it would be cheaper (and more reliable) to just keep a backup, even if you have to build a second FreeNAS server. And don't forget: if bad RAM corrupts ZFS just badly enough to make the pool unmountable, even if your files are mostly intact, you're looking at that 5-digit price tag too.
4. When RAM goes bad, you will usually lose more than a single memory location. The failure is often a breakdown of the insulation between locations, so adjacent locations start getting trashed too, which only creates more multi-bit errors.
5. ZFS is designed to repair corruption, not to limp along with corruption it can't correct. That's why there's no fsck/chkdsk for ZFS. So once ZFS' on-disk structures are corrupted and you can't repair them because you have no redundancy left, you are probably going to lose the pool (and the system will probably kernel panic).
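As promised in point 2, here's a minimal sketch of the "checksums detect, redundancy corrects" idea, assuming a block with two mirror copies and a stored checksum. The helper names are invented for illustration and are not the actual ZFS code; the key point is that this self-healing only works if the comparison itself runs in RAM you can trust:

```python
# Minimal sketch of self-healing with a checksum and two mirror copies.
# Hypothetical helpers -- not the real ZFS implementation.
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def read_self_healing(copies, stored_checksum):
    """Return the first copy whose checksum matches; rewrite the ones that don't."""
    good = next(c for c in copies if checksum(c) == stored_checksum)
    for i, c in enumerate(copies):
        if checksum(c) != stored_checksum:
            copies[i] = good            # "heal" the damaged copy from the good one
    return good

block = b"important data"
mirror = [block, b"important dat\x00"]  # second copy has silently rotted on disk
print(read_self_healing(mirror, checksum(block)) == block)  # True, and mirror[1] is repaired
```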
So now that you're convinced that ECC really is that important, you can build a system with ECC for relatively cheap...
Motherboard: Supermicro X9SCM-F ($160ish)
CPU: Pentium G2020 ($60ish)
RAM: KVR16E11K4/32, 32GB of DDR3 ECC 1600MHz RAM ($350ish)
So the total cost is about $570, less if you don't want to go to a full 32GB of RAM. If you went with a 2x8GB RAM kit you can get the total price down to about $370. The G2020 is a great CPU for FreeNAS.
Of course, if you plan to use plugins like Plex that can be CPU intensive for transcoding, you will need a little more power. Be careful about which CPUs do and don't support ECC: all Xeons do, and some i3s do. Check Intel's specification sheets to be sure before you spend the money. I use an E3-1230v2 (about $250) and it is AMAZING! No matter what I throw at it, I can't get more than about 30% CPU usage. Don't go by the TDP to try to pick the "greenest" CPU either. TDP describes heat output at full load; it tells you nothing about the power usage you will see at idle (which is what your system will probably be doing 99% of the time). My system with no disks installed sits at about 35W at idle. Unfortunately I can't help with AMD CPUs since I'm not a fan of AMD, but I do know that going with AMD's "Green" CPUs has disappointed many people here. "Green" CPUs perform slower than other CPUs, so be careful not to buy a CPU that can't perform fast enough to make you happy.
The motherboard I listed above is ideal too. It has dual Intel Gb LAN, IPMI (never need a keyboard, mouse, and monitor ever again!), and PCIe slots for that M1015 SAS controller you might want someday. I can't tell you how awesome IPMI is. It's basically remote desktop for your system, except you can use it even during bootup. For example, you can go into your BIOS and change settings remotely! And you can mount CDs without a CD-ROM drive on the computer, remotely! How cool is that!? Add in the ability to power the server on and off remotely with IPMI and you have something for that server you want to shove in the corner and forget about.
Typical ECC RAM can correct single-bit errors and detect, but not correct, multi-bit errors. When a multi-bit error occurs (if my experience is any indication), the system detects it and immediately halts with a warning message on the screen showing the bad memory location. Naturally, halting the system is bad because the system becomes unavailable. But it's far better to halt and let you pull the bad DIMM than to let your zpool get corrupted by bad RAM. Remember, the whole goal is to protect your zpool, and a halt is the best outcome once things are going badly in RAM.
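If you're curious what "correct a single-bit error, detect (and halt on) a multi-bit error" looks like, here's a toy sketch of an extended Hamming (SECDED) code over a single 4-bit value. Real ECC DIMMs apply the same idea to 64-bit words with 8 check bits in hardware; this is only meant to illustrate the behavior, not how the memory controller actually implements it:

```python
# Toy SECDED (single-error-correct, double-error-detect) demo for one 4-bit nibble.

def encode(d):                         # d = [d1, d2, d3, d4], each 0 or 1
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # Hamming(7,4) layout
    overall = sum(code) % 2                       # extra parity bit for SECDED
    return code + [overall]

def decode(c):                         # c = 8 received bits
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s3 * 4 + s2 * 2 + s1               # position of a single flipped bit
    overall_ok = (sum(c) % 2) == 0
    if syndrome == 0 and overall_ok:
        pass                                      # no error
    elif not overall_ok:
        if syndrome:                              # single-bit error: silently correct it
            c[syndrome - 1] ^= 1
    else:
        raise RuntimeError("uncorrectable multi-bit error -- halt the system")
    return [c[2], c[4], c[5], c[6]]               # recovered data bits

word = encode([1, 0, 1, 1])
word[4] ^= 1                                      # one bit flips: corrected transparently
assert decode(list(word)) == [1, 0, 1, 1]

word = encode([1, 0, 1, 1])
word[1] ^= 1; word[5] ^= 1                        # two bits flip: detected, "system halts"
try:
    decode(list(word))
except RuntimeError as e:
    print(e)
```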
Here's another way to look at it:
Example 1: Running a server with a conventional file system (probably ext3, NTFS, or HFS+) and non-ECC RAM. The only way you can expect to lose an entire drive's worth of data is if the drive actually fails completely. If your RAM goes bad, you'll potentially lose a few recently opened files to corruption. You may have to run a chkdsk/fsck on the partition to get it back in good shape, but you'll be able to take that disk, put it in another machine, and get most (if not all) of your data back. Even in a worst-case scenario, there are plenty of tools like Ontrack EasyRecovery DIY Software that work with NTFS, and you can expect a reasonable chance of getting most (if not all) of your data back. You can also call the data recovery professionals, and for 4 figures they might get your data back from a failed disk.
Example 2: Server with ZFS and non-ECC RAM. Now you have more ways to lose your data:
1. If the drive completely dies(obviously).
2. Based on prior users that have had non-ECC RAM fail, you may well reboot to find your pool unmountable. That means your data is gone. There are no data recovery tools out there for ZFS. NONE. It's ALL gone for good.
3. And what about all those other non-server-grade parts you used? Guess what: they can also trash your pool and make it unmountable. The outcome is exactly the same as #2. You just lost all of your data because the pool won't mount.
The difference is that conventional file systems have tools like chkdsk and fsck to fix file system errors, and you have plenty of options for software recovery with utilities like Ontrack EasyRecovery DIY Software.
But no matter how much searching you do, there are no ZFS recovery tools out there. You are welcome to call a company like Ontrack for data recovery. I know one person that did; they spent $3k just to find out whether their data was recoverable, then spent another $15k to get just 200GB of it back.
So tell me which example you'd rather fall into: #1, where you have fewer opportunities to lose everything and have recovery options, or #2, where you have quite a lot more opportunities to lose everything and zero recovery options to boot? I'd rather be in scenario 1, and I can't imagine you'd want scenario 2 either.
So when you read that ZFS is "all or none", I'm not making it up. I'm really serious; it really does work that way. ZFS either works great or doesn't work at all. That's truthfully how it works.
Don't like it? Then be one of the few that "stick it to the man" with non-recommended components. It's your win (or loss). But you will get absolutely no sympathy when you show up with a pool that won't mount, like so many people that thought they could build a cheap system and get away with it.
PLEASE TAKE THIS AS A WARNING TO NOT USE NON-ECC RAM WITH ZFS.
Cool presentation from Intel about RAM errors and non-ECC vs ECC.
Someone else that confirms this: https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/
List of threads where people unfortunately experienced data loss with bad non-ECC RAM. (Updated when I can remember to):
- http://forums.freenas.org/index.php?threads/freenas-failure.17995/
- https://forums.freenas.org/index.php?threads/one-or-more-devices-has-experienced-an-error.24377/
- https://forums.freenas.org/index.php?threads/permanent-errors-in-zfs-pool.14453/