RAIDing RAM as a solution to not having ECC

zealotx1

Cadet
Joined
Dec 4, 2021
Messages
3
I believe anyone serious about ZFS will have gone through all the reading about the importance of ECC and its implications.

During my research I came across this article:
[The Great Debate] ecc-vs-non-ecc-RAM by Andrew Galloway a.k.a. nex7 (link is dead but available through the Wayback Machine),
which is linked from
[Why I Chose Non-ECC RAM for my FreeNAS] by briancmoses, which I believe most people here will have read.
The article states ECC failure rates averaging 0.22/DIMM/yr, versus 8.2/DIMM/yr of correctable errors (which would have been uncorrectable if it weren't for ECC). Despite being significantly lower than non-ECC, the chance of failure is still there and non-zero.
My takeaway is that ECC still isn't the magic solution over non-ECC, just vastly better.

Meanwhile, the remaining articles at the bottom seem to suggest that ZFS is resilient to RAM errors.

Both sides of the ECC debate have valid points, yet the community ZFS stance is still the //nobody got fired for choosing IBM/Intel// one: go for ECC as much as possible.

Which got me thinking: if ECC isn't 100% reliable either, and we are already RAIDing our disks, why can't we apply the same idea to our RAM?
Essentially, RAM is not much different from disks, albeit volatile rather than persistent.
If it were possible to RAID5/RAIDZ the RAM, then the ECC requirement would matter less and error rates could be reduced even further.
It would also remove the need to choose certain brands or models over others because they are statistically more reliable, and it would remove the single point of failure, since ECC RAM can fail too.
The requirement would then shift from having ECC to having more RAM sticks.

I'm curious whether it is implementable and whether the tradeoffs (CPU overhead, etc.) are worth it.
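
To make the idea concrete, here is a rough sketch (just illustrative Python, all names made up) of what "RAID5 over RAM" might look like in software: stripe data across several memory regions plus one XOR parity region, so any single bad region can be rebuilt from the rest.

```python
# Hypothetical sketch of the "RAID5 over RAM" idea in software (not ZFS code):
# data is striped across N buffers plus one XOR parity buffer, so the loss of
# any single buffer can be reconstructed, at the cost of extra RAM and CPU.

def make_stripe(chunks):
    """Given N equal-length data chunks, return (chunks, parity)."""
    length = len(chunks[0])
    assert all(len(c) == length for c in chunks)
    parity = bytearray(length)
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return chunks, bytes(parity)

def rebuild_missing(chunks, parity, missing_index):
    """Reconstruct the chunk at missing_index by XOR-ing parity with the survivors."""
    rebuilt = bytearray(parity)
    for idx, chunk in enumerate(chunks):
        if idx == missing_index:
            continue
        for i, b in enumerate(chunk):
            rebuilt[i] ^= b
    return bytes(rebuilt)

# Example: 3 "RAM regions" of data plus 1 parity region.
chunks = [b"AAAA", b"BBBB", b"CCCC"]
chunks, parity = make_stripe(chunks)
assert rebuild_missing(chunks, parity, 1) == b"BBBB"
```

One gap I can already see: parity only lets you rebuild a region you already know is bad; unlike a failed disk, a bad RAM region doesn't announce itself, so you would still need per-region checksums (or ECC) to know which copy to distrust in the first place.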
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Many years ago (let's say 4.2BSD on a DEC VAX 11/750), the idea of ZFS-style protection of blocks of data on disk would have been unthinkable due to the code complexity involved, the CPU cycles it would cost, and the fact that disks already have basic checksum protection for each sector.

Let me soapbox for a second here:

ZFS, even when it was introduced, was still seen as a bit of an insane venture. It was often compared to dedicated SAN hardware arrays, which were indeed significantly faster at I/O, and remain so today. Even compared to a more modest solution, like a high-end LSI RAID controller, the LSI has an edge because it is a dedicated processor with a highly optimized codeset and no host CPU tax.

ZFS is almost always a loser on the "fastest possible" end of things, because it uses general-purpose hardware and soaks away some of your CPU cycles, but it can accomplish, at modest cost on relatively commodity hardware, a feat of scale -- management of hundreds of drives -- that normally only comes with six-plus-figure hardware storage systems.

ZFS used to get beat on all the time due to the CPU penalty of doing parity on the host CPU. However, we've moved from an era of servers with a small number of cores in the approximately one-gigahertz range to today, where we have many cores running at up to five gigahertz, plus many more highly optimized CPU extensions originally intended to accelerate graphics processing. I seem to recall this has influenced some of the choices of ZFS checksum and compression additions over the years.

My takeaway is that ECC still isn't the magic solution over non-ECC, just vastly better.

That's accurate.

Meanwhile, the remaining articles at the bottom seem to suggest that ZFS is resilient to RAM errors.

It isn't. It's just not horribly likely to be a problem. It's the seatbelt issue. There are still lots of people who drive around without seatbelts and survive their journeys just fine. Everyone could do this all the time and that'd be fine, except for the people who, for whatever reason, came to a screeching halt because they ran into an immovable object. I keep seeing Futurama: Roswell That Ends Well as I type this...

If it were possible to RAID5/RAIDZ the RAM, then the ECC requirement would matter less and error rates could be reduced even further.

This already exists. It's called memory mirroring. It's done in hardware by the better-quality Xeon platforms. Of course, it comes at a huge cost in additional memory and, in practice, at least in my experience, is only used on critical systems where you're trying to squeeze another 9 onto the end of 99.9whatever% reliability.

I'm curious whether it is implementable and whether the tradeoffs (CPU overhead, etc.) are worth it.

ZFS is software, so features can definitely be added. Your question is complex, because you've framed it as a ZFS feature, but it is really more a question of trust in a hardware platform. It requires an understanding of the failure points where corruption could be introduced, and finding correct ways to detect and mitigate them.

ZFS was originally built on top of servers that natively supported ECC, and as a result, the code generally trusted the compute platform it lived on. It generally assumes errors will be introduced by I/O devices, which, as we know, is definitely a place where errors do get introduced.

I suspect the biggest problem with your idea is that it would require a re-examination of all the assumptions implicitly written into the existing code. This ends up being nontrivial, as we've seen with block pointer rewrite. And to do that, you need to understand what threats you're protecting against and how likely they are. Are we just interested in protecting block data at rest in the ARC? Or are we planning to be suspicious of every interaction with memory?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I see this as 3 things:
  1. R/O code & data protection
  2. R/W data that is not going to, or coming from, disk / storage
  3. R/W data from disk / storage
The last may be covered by ZFS when in debug mode, which checksums the data in memory and verifies its checksum before use.
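
Conceptually, that in-memory verify looks something like this minimal sketch (hypothetical Python, not the actual ZFS code): record a checksum when a buffer enters the cache, and check it again before handing the data out.

```python
# Minimal sketch of "checksum in memory, verify before use" (hypothetical, not ZFS code).
import zlib

class CheckedBuffer:
    def __init__(self, data: bytes):
        self.data = data
        self.checksum = zlib.crc32(data)   # recorded when the buffer enters the cache

    def read(self) -> bytes:
        # Verify before use; a mismatch means the buffer changed while it sat in RAM.
        if zlib.crc32(self.data) != self.checksum:
            raise RuntimeError("in-memory corruption detected")
        return self.data

buf = CheckedBuffer(b"block contents")
assert buf.read() == b"block contents"
```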

Over the last 2 years or so, I have been thinking about #1, R/O code & data protection. Ideally it would be a scheme that attempts to protect binaries and R/O data all the way from the compiler. (Though simple checksums would not prevent purposeful criminal alterations, as the attacker could simply update the checksums.) So the intent behind the checksums is to detect accidental alterations.
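
As a rough illustration of the kind of scheme I mean (hypothetical Python, and again only useful against accidental alteration): record a checksum for each binary or R/O file at build/install time, then re-verify later.

```python
# Rough illustration only: record SHA-256 checksums at build/install time,
# then re-verify later to catch accidental alteration (a deliberate attacker
# could simply update the manifest too, as noted above).
import hashlib
import pathlib

def build_manifest(paths):
    """Checksum each file once, e.g. at build or install time."""
    return {str(p): hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest()
            for p in paths}

def verify_manifest(manifest):
    """Return the list of files whose current contents no longer match."""
    bad = []
    for path, expected in manifest.items():
        actual = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
        if actual != expected:
            bad.append(path)
    return bad

# Usage (hypothetical path): manifest = build_manifest(["/usr/local/bin/mytool"])
#                            assert verify_manifest(manifest) == []
```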

Here is a blog I wrote on the subject:
This was intended to be followed up by a more detailed implementation proposal, but I have not had time to finish it.
 

zealotx1

Cadet
Joined
Dec 4, 2021
Messages
3
Is it even possible to edit posts? I apparently seem to have posted non-working links.

Fixed the links:
[The Great Debate] ecc-vs-non-ecc-RAM by Andrew Galloway a.k.a. nex7(via wayback machine)
[Why I Chose Non-ECC RAM for my FreeNAS] by briancmoses


What I was thinking about data integrity is kind of like what Arwen has written: these machines are usually configured and then left in the closet to run indefinitely until something happens, and it is only when something happens that you realise it is too late and you cannot recover anything. The more data there is and the longer it runs, the bigger the target becomes.

People seem caught up in survivorship bias, swearing by ECC because they have been bitten once by not using it and thinking it is the solution that will solve their problem, when really it just makes the chances slimmer. It's still playing with fire, only with chopsticks rather than a tree branch. Coupled with the fact that the recommendations keep mentioning ECC, people naturally infer that there are only two choices, ECC vs non-ECC, which masks the fact that ECC is still not failsafe and that the real solution is to duplicate your data across many different places instead.

Once you realise ECC is not failsafe, regular RAM suddenly no longer seems that bad, since many computers run fine without ECC, and ECC's failure rate starts to seem closer to non-ECC's than to full data integrity.

But what jgreco has written also makes sense, as I do not understand the threats inherent to memory, so RAIDing it might or might not even target the kinds of corruption that affect data integrity.

Also, I don't see how mirroring on the Xeon platform protects data integrity. My understanding of mirroring is that it only tells you there is corruption in one of the copies; you can know corruption has occurred, but you have no way to determine which copy is the correct one, unless you have a third copy on disk or somewhere else to repair it from. The same goes for keeping three copies of every piece of data; even then, you would be relying on majority vote to determine the correctness of the data.

Then again, if we are questioning the assumption that RAM is inherently trustworthy, is it possible that other levels of memory are also affected: the L3, L2, and L1 caches, and even the registers themselves?

I'm sorry if the way I write seems ambiguous; I do not want to assume anything, because I do not have a very deep understanding of the topic. This post came out of me thinking about ECC while researching the best options to store my data; being unable to meet the ECC recommendation, I got caught on the anxiety train, just like many other people, thinking that no ECC = death.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Is it even possible to edit posts?

Yes, but the forum is set up so new users can't edit until they reach a certain number of posts, and "graduate" out of being n00bs.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
realise ECC is not failsafe, regular RAM suddenly no longer seems that bad, since many computers run fine without ECC, and ECC's failure rate starts to seem closer to non-ECC's than to full data integrity.

And once you discover that smoking crack and smoking cigarettes both involve smoking, you can deem crack to not seem "that bad" using that same reasoning. Doesn't make that reasoning true. It's a matter of degree and risk, so I hope you appreciate my comparison in that light.

There is nothing you can do to guarantee data protection in every case. I can sit there with a sledgehammer banging on your server. I can freeze or heat your server. I can aim a gamma ray gun at your server. I can vibrate your server until mechanically something goes twang. I can vary the power input in terrible ways.

We know RAM can fail, and that occasional bit flips actually do happen. We do not necessarily need to protect against every kind of RAM failure for ECC to be useful. Looking at this from a different angle, there's a purpose for ECC beyond what YOU are thinking it means: it's a way for us infrastructure engineers to be aware of evolving problems with a bit of gear and take proactive measures to replace the RAM before a module fails in a more catastrophic manner. It also happens to correct and repair the damage to the stored data in many cases.

This is really very similar to what ZFS does for your data on disk. It can detect errors, and in many cases correct them. But it isn't expected to be a Fort Knox vault where nothing can ever go wrong. Using ZFS to store your files does not relieve you of the necessity to back up your files and to have a game plan if your filer goes south.

These things all make your data safe_R_ (note that's SAFER with an R on the end), not SAFE. If you lose sight of that goal, it becomes harder to understand where the pain points are.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
@zealotx1 - I spoke to one of the engineers of the Sun UltraSPARC III processor shortly after it came out, and he mentioned that one of the intents was to have either parity or ECC on all lines inside the CPU. They failed on the first go-around, but nevertheless got close. If I understand correctly, the various cache RAMs had ECC, while the address lines used parity. If the parity failed during a data transfer, the error would be recorded and the transfer could be retried.

In any case, at least when Sun was engineering SPARC CPUs, they made a multi-pronged attempt at reliability.

For Xeon memory mirroring, it likely uses ECC to detect which copy is good or bad. The probable intent is not to have 100% reliable memory, but to reduce the downtime involved in replacing failing memory.


I was hoping DDR5 would support 80-bit memory, double the ECC of today's 72-bit memory. Memory densities today are at such insane levels compared to the 1980s, when 128GB could have been an entire data center's hard drive storage. So doubling the ECC makes sense to me.
 

zealotx1

Cadet
Joined
Dec 4, 2021
Messages
3
@Sam
Such a bummer.
I don't see any negatives with editing posts, but that's probably due to something I have not thought of yet.
Maybe the forum should implement snapshots, so I can go back and repost a fresh post instead.

@jgreco
Ah, OK.
It's a bit like S.M.A.R.T. for pre-emptive failure detection on HDDs, but with the added benefit of error correction (super simplified understanding).
I think it is best if I leave this topic as is, because I really need to acquire far more knowledge w.r.t. memory before I start commenting based on assumptions.

I appreciate you taking the time to share all this knowledge.

@Arwen
I think I really need to read up more on Xeon memory as well.

Quoting [Wikipedia article on DDR5]:
Each DIMM has two independent channels. While earlier SDRAM generations had one CA (Command/Address) bus controlling 64 or 72 (non-ECC/ECC) data lines, each DDR5 DIMM has two CA buses controlling 32 or 40 (non-ECC/ECC) data lines each, for a total of 64 or 80 data lines. This four-byte bus width times a doubled minimum burst length of 16 preserves the minimum access size of 64 bytes, which matches the cache line size used by x86 microprocessors.

I'm not sure how accurate this information is, but further googling found a page from [rambus] which shows the same info, so it seems that instead of having 8 bits of ECC per DIMM, it now has 8 bits for each half?
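
If I restate the quoted figures as arithmetic (assuming the Wikipedia/Rambus description is accurate), it seems to work out like this:

```python
# Restating the quoted DDR5 figures (assuming the Wikipedia/Rambus description is accurate).
data_bits_per_subchannel = 32           # 40 on the ECC variant: 32 data + 8 check bits
subchannels_per_dimm     = 2
burst_length             = 16

# Minimum access per subchannel: 4 bytes wide x burst of 16 = 64 bytes,
# which matches the x86 cache line size mentioned in the quote.
bytes_per_access = (data_bits_per_subchannel // 8) * burst_length
print(bytes_per_access)                 # 64

# Check bits per DIMM on the ECC variant: 8 per subchannel x 2 = 16 per 64 data bits,
# i.e. double the 8-per-64 of a 72-bit DDR4 ECC DIMM -- the "8 bits for each half" reading.
check_bits = (40 - 32) * subchannels_per_dimm
print(check_bits)                       # 16
```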

Still, despite that, the line before it mentions:
Unlike DDR4, all DDR5 DIMMs have on die ECC, where errors are detected and corrected before sending data to the CPU. This, however, is not the same as true ECC memory with an extra data correction chip on the ram module. DDR5's on die error correction is to improve reliability and to allow to denser RAM chips while lowering the defect rate for each RAM chip. There still exists non-ECC and ECC DDR5 DIMM variants; the ECC variants have extra data lines to the CPU to send error detection data, enabling the CPU to detect and correct errors that occurred in transit.

It seems true ECC is still not going to be mainstream, despite Torvalds' ranting. I don't think ECC is coming to consumer platforms anytime soon, since the next iteration would be DDR6, which is probably years away.

Further searching brings up this post from [reddit] with a [comment] that mentions other types of ECC as well, each of which seems to target different things.
Since each has its own benefits, I wonder if there is a limit to how much ECC is possible, or whether you can just go all the way down the ECC-inception rabbit hole.

Nonetheless, with the advent of RISC-V, I think anyone determined enough could pick up where SPARC left off and take a shot at designing their own CPUs, but I don't think many have the resources or means to do so, much less access to any information about that kind of processor design. Then comes the part of actually manufacturing, integrating, and selling them.
 

no_connection

Patron
Joined
Dec 15, 2013
Messages
480
In a way I think ZFS went a bit in the wrong direction by blindly trusting the hardware.
They didn't send people to the moon by adding ECC to the point where there were no errors. They kept rewriting the code until it didn't crash despite the errors.
However, ZFS already runs out of memory bandwidth as it is, and doing even more shuffling of data might worsen the performance disadvantage. Of course this depends heavily on what it is used for, but I doubt there would be a dual code path for it.

On a personal note, I really like ECC and almost never get crashes that weren't attributed to something else.
Sure, I can't be sure it ever did anything, and it might just be the conservative way I do things, but it sure feels more stable.

Also, I don't see how mirroring on the Xeon platform protects data integrity. My understanding of mirroring is that it only tells you there is corruption in one of the copies; you can know corruption has occurred, but you have no way to determine which copy is the correct one
Since you are mirroring ECC RAM, you would know which copy is corrupt and could correct multi-bit errors instead of just single-bit ones. Or am I missing something?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
@no_connection - You are right about ZFS blindly accepting the ECC memory and CPU.

However, ZFS was specifically designed not to blindly accept the more error-prone parts of disk storage:
  • Disk controller chips
  • Storage cables
  • Disks / SSDs
  • Power supplies
There is even a somewhat famous story about a ZFS engineer, (very early days), working on a desktop. His disk kept coming back with ZFS errors on read. The engineer did not understand why. Well, it turns out his disk WAS giving him bad data. He simply did not know it until he used ZFS.

Further, ZFS was designed to accept complete and total loss of power, at any time, without data loss. This does not mean data in flight would be stored and retrievable; no file system can guarantee that. But other file systems can corrupt data in a file, (half old file, half new file), or corrupt the directory entries that point to files. ZFS won't do either. It's all or nothing for a ZFS transaction group.

In regard to mirroring ECC RAM, if one side has a 2-bit error, (which is not correctable with normal ECC memory, but IS detectable), yet the other side is good, (FOR THAT BLOCK), then you know which copy you want to use. Remember, ECC is per 64-bit block. Having both sides of the mirrored RAM fail at the exact same block, before you can replace the faulty memory, would be a pretty extreme corner case.
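
As a toy illustration of that decision logic (hypothetical code; real platforms do this in the memory controller):

```python
# Toy model of mirrored SECDED ECC (hypothetical; real hardware does this in the memory
# controller). Each 64-bit block read returns a status from its ECC check:
#   "ok"            -- no error
#   "corrected"     -- single-bit error, already fixed by ECC
#   "uncorrectable" -- 2-bit error, detected but not fixable by ECC alone

def choose_copy(status_a, data_a, status_b, data_b):
    """Pick which mirrored copy of a block to trust."""
    if status_a != "uncorrectable":
        return data_a                      # primary copy is usable (possibly corrected)
    if status_b != "uncorrectable":
        return data_b                      # fall back to the mirror
    raise RuntimeError("both copies uncorrectable for this block")  # the extreme corner case

# Example: primary copy took a 2-bit hit, mirror is clean, so the mirror's data wins.
assert choose_copy("uncorrectable", b"??", "ok", b"good") == b"good"
```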

Is this mirrored ECC RAM scheme perfect?
No.
It does not account for cases where more than 2 bits in a block are bad but the ECC code cannot detect the problem, so it thinks that 64-bit block of memory is good. Thus my desire for doubling the amount of ECC.

Some other companies, IBM specifically, and I know for a fact some of the older Sun Microsystems servers had this, spread the data out across memory chips. This improved the ability to survive a total RAM chip failure: it might show up as an ECC-correctable error on all memory in that bank, but with no data loss, yet. That would give time for kernel software to migrate data off that bank of memory and blacklist it until repaired.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I think ZFS went a bit in the wrong direction by blindly trusting the hardware.

It wasn't an unreasonable assumption. Sun server gear was high-end, high-quality stuff. It really didn't go bad in ways that would cause the classes of corruption problems we're discussing, and when Sun hardware did go bad, Sun FEs would be onsite fixing it soon thereafter.

It was the open-sourcing and backporting to PC hardware that was the problem. Unlike the Sun hardware, it's not wise to blindly trust PC hardware.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I had an elaborate point to make, which included the following link, but I decided to leave just the link: 6000 hulls.
 

no_connection

Patron
Joined
Dec 15, 2013
Messages
480
It wasn't an unreasonable assumption
I agree it's not unreasonable, but there is something unnerving about a black box that bricks when something, even the tiniest thing, goes wrong inside it. It can handle a lot of the errors it was designed to handle, but if it bricks there is no way to fix it.
The solution is to throw a UPS and more hardware and a backup server at it and....

I had an elaborate point to make, which included the following link, but I decided to leave just the link: 6000 hulls.
If only it had 6001 hulls, then it would have been fine.
 