ECC vs non-ECC RAM and ZFS

jgreco · Oct 22, 2014

It would seem to depend on what sort of monitoring infrastructure your servers employ. I realize that the problem is that many of the people here only have the one server and therefore no infrastructure for monitoring their health. I'm not sure what actually works out best in that situation, because ideally you want some OTHER box to alert you of problems with your server, but for importing IPMI system event logs on the local system into the local system's logs I would expect it to be something like that, yes. The question is then what will happen when an anomaly is detected.

I'm not really able to test that since my FreeNAS boxes are virtualized, sorry.

cyberjock · Oct 22, 2014

The ECC algorithm provides for detecting multi-bit errors, but not being able to correct them. The chances of a multi-bit error being undetected is extremely small (think heat death of the universe, etc.). Note that a failure of 2 bits and a failure of "anything more than 2 bits" has the same chances statistically of being detected (or not being detected).

jobelz · Dec 23, 2014

Hey people!

I recently started trying FeeNAS by starting to read every relevant doc and post. It's a lot to take in for someone less technically savvy than most on these forums, but I'm trying and most importantly: learning. I'm running FreeNAS on a (non-ECC) test box right now to check it out and give myself some time to let it all sink in. After that I'll start looking for available compatible hardware in my region to build my actual NAS (and I'm sure I'll be asking expert advice on that

).

After reading up on ECC (that powerpoint slideshow really helped!) I've got a question. I tried searching for it but it seems it hasn't been asked/answered yet (my apologies if it was and I didn't find it).

It's clear that an ECC enabled system is desired for ZFS, however, would you state that in case you have non-ECC hardware (and for whatever reason you can't upgrade) you should choose another file system? Most pre-built NAS solutions I checked online only have ECC ram in the top tier of enterprise products, so I'd assume other filesystems are more forgiving on non-ECC systems, while lacking all the good stuff ZFS provides over them of course.

Thanks!
(Happy holidays!)

Knowltey · Dec 23, 2014

jobelz said:
Hey people!

I recently started trying FeeNAS by starting to read every relevant doc and post. It's a lot to take in for someone less technically savvy than most on these forums, but I'm trying and most importantly: learning. I'm running FreeNAS on a (non-ECC) test box right now to check it out and give myself some time to let it all sink in. After that I'll start looking for available compatible hardware in my region to build my actual NAS (and I'm sure I'll be asking expert advice on that ).

After reading up on ECC (that powerpoint slideshow really helped!) I've got a question. I tried searching for it but it seems it hasn't been asked/answered yet (my apologies if it was and I didn't find it).

It's clear that an ECC enabled system is desired for ZFS, however, would you state that in case you have non-ECC hardware (and for whatever reason you can't upgrade) you should choose another file system? Most pre-built NAS solutions I checked online only have ECC ram in the top tier of enterprise products, so I'd assume other filesystems are more forgiving on non-ECC systems, while lacking all the good stuff ZFS provides over them of course.

Thanks!
(Happy holidays!)

There isn't another file system. With FreeNAS you can choose whichever file system you want, as long as it is ZFS. (Unless you're using the outdated versions that used UFS)

And no, other file systems are just as prone if not more so. ZFS does at least have some checks so there are certain situations where ZFS can still save you where another file system might not with the same circumstances. But the real point is that running ZFS and the. Using non-ECC is like buying a car then continuing to walk to work through the bad part of town.

Ericloewe · Dec 23, 2014

The difference is that ZFS does a lot more in memory than other filesystems do. That said, every file server should use ECC RAM. That's one more reason not to trust your average synology and whatever NAS-in-a-box.

R.G. · Dec 23, 2014

Techno-ramble on ECC:

This is a fairly accessible paper on ECC with only mild excursions into hard math: http://www.hackersdelight.org/ecc.pdf

ECC as a theoretical discipline is an entire field of specialized coding theory. As such, there are many ways ECC is done, and many different ECC "code sets".

Each variant of ECC substitutes a "word" (i.e. specific set of bits) in the code set of valid ECC words for a word in the non-ECC word being protected. In this way, ECC can be viewed as a form of encryption - one plain-text word is represented by a different word in the ECC code set. It's not designed to hide the original bits, as the ECC word generally has the original bits in it and the ECC bits tacked onto the end or interspersed in the original, but the idea is that one plain-text word is equal to one and only one ECC'd word.

The words in the ECC code set have more bits than the original words. So a 64 bit "plaintext" word will have more bits in the ECC word that represents it. The whole point of adding those bits is that you can do logic operations on the plaintext and extra bits and come up with the result that the original word is correct or not, and may be able to figure out what the original plain-text word was and thereby "correct" the error by reporting the original plaintext bits that the extra ECC bits let you compute. ECC codes are obviously constructed so that this computation of whether there was an error and what the original plaintext was can be done VERY quickly, generally with a few hard-logic gates, not software.

One result of the design of the ECC code set is that each code set and checking logic has (potentially at least) different abilities to detect and correct errors, and different "blindnesses", errors it can't see at all. Parity is a simple single-error-detection code: it can "see" all single bit errors, but cannot correct any errors. But double bit errors cannot even be seen by parity, because two-bit errors transform one valid ECC word (the plain text word plus the parity bit) into another valid ECC word in the code set, but a different word than the original plaintext. In fact, parity detects all errors which involve one, three, five... odd numbers of bit error, but is blind to all even numbers of bit errors.

Adding more bits of ECC checking lets you start detecting more errors, cutting down that "blindness" to errors.

The ECC process used in computer RAM is designed to be single-error-correcting, double-error-detecting. What it does for three, four, and more errors depends on exactly what coding is done in the ECC process. It may or may not detect triple errors, quadruple errors, and so on.

Heat-death-of-the-universe arguments are fun, but they depend on the chances of each bit being in error being independent. The idea of speculating about what errors can happen and how many bits are likely to be in error (and, of course, being right about the speculations! :) ) supports many math and computer architecture professionals. What happens if, say, four bits are right next to each other on the RAM chip and a cosmic ray hits in the middle of the four, not just one? Or in a more pedestrian possibility, what happens if one entire RAM chip dies? The chip design and computer architecture pros have considered such matters, and so the insides of the RAM chips are messed with as are the ECC codes to minimize the exposures, but Mother Nature still has many ways to corrupt your data. For instance, one asteroid strike will likely get it all. :D

My own personal "right answer" is to do all I practically can to keep my data correct, but to spend only as much as the data is worth to do so. Backups are the first line of defense. Next is good quality hardware. Next is cleverness in selecting how your data is handled and stored. In all of this I feel that I need to stay flexible, because there is no The Answer, only more steps in the path.

DrKK · Dec 23, 2014

I think really, R.G., that there is a bit more to the story here. Bit errors in RAM have a particular, and peculiarly, pernicious effect (at least, in certain cases) with ZFS. What I'm referring to has been well curated elsewhere in this forum, particularly in a controversial post by Cyberjock. (hell, maybe even THIS post, I am not reviewing the 14 pages).

The question of ECC vs non-ECC, in a general, abstract sense, is not the same question as ECC vs non-ECC for ZFS, because of the pecularities of ZFS and the assumptions ZFS makes about things. Evaluating in the sense of "Meh, I can tolerate one bit error per X, given probability Y, blah blha blah" is not a particularly relevant calculus when, as is the case with ZFS, particular combinations of errors can metastatically have out-size ramifications.

Most of the most active FreeNAS guys will say, if asked, that ZFS without ECC is a force-multiplying mistake. If a user is unable or unwilling to find server-grade equipment, including ECC RAM, I think the small end-user is best advised to consider file storage solutions other than a ZFS-based appliance---or to keep virginal backups on hand, at a minimum.

I wish I could say that was "just an opinion", but really, it's not.

But I think we all tire of the ECC vs non-ECC jihad.

TXAG26 · Dec 23, 2014

Running non-ECC ram is not an option with ZFS. Moving right along...

ECC ram is like a seatbelt. Do you need to wear it to physically be capable of driving? Well no, not technically, but if you drive without it and get into a wreck I bet you'll wish you had worn it when you see what the windshield did to your face when it went through it.

TXAG26 · Dec 23, 2014

PS - DrKK will not be joining us for the next couple of days. Something about Sony benching/pulling him until Christmas Day. I don't know all the details...it's complicated...

cyberjock · Dec 23, 2014

TXAG26 said:
PS - DrKK will not be joining us for the next couple of days. Something about Sony benching/pulling him until Christmas Day. I don't know all the details...it's complicated...

ROFL. I'm talking to him right now and he's at home.

jgreco · Dec 24, 2014

cyberjock said:
ROFL. I'm talking to him right now and he's at home.

Did that actually just fly right over your head?

TXAG26 said:
PS - DrKK will not be joining us for the next couple of days. Something about Sony benching/pulling him until Christmas Day. I don't know all the details...it's complicated...

It's not complicated. Sony continues to fail. I used to be a diehard Sony guy dating back to my days in TV/video production, back when we hauled Portapaks and cameras around (shoulder *still* hurts!) and edited on a pair of Umatic recorders and an RM440. But I slowly lost respect for them over various technical issues (S-link was kind of a debacle) and then with the whole George Hotz thing... so I blacklisted Sony. Kind of tragic turn of events for what I considered to be a venerable company.

Sony had an opportunity, with The Interview, to show its true colors, and, I'm sorry to say, it has. I would have been impressed had they responded aggressively and, I dunno, maybe given away discounted tickets or distributed the movie online or whatever. Instead, they pulled the movie, and then attempted some mealy-mouthed PR backpedaling about how it wasn't actually them but that the theaters "made" them do it. I've raised kids and I'm familiar with "made me do it." You can actually release something and have everyone refuse to screen it. That's different than pulling the release because you can't find anyone willing to screen it. It has the same result in terms of screens it is shown on, but it is an entirely different beast in terms of the message you're sending to the public, to the bad guys, and to the fine folks over in North Korea who took offense.

Update: http://money.cnn.com/2014/12/24/media/interview-digital-release/index.html

I take it back. Good job, Sony, better a bit late than never.

TXAG26 · Dec 24, 2014

jgreco said:
Did that actually just fly right over your head?

I think it did! Chalk up a win, I just pulled one over on Cyberjock!

cyberjock · Dec 24, 2014

It clearly did. :(

cyberjock · Dec 24, 2014

http://hardware.slashdot.org/story/...les-vulnerable-to-bit-rot-by-a-simple-program

jgreco · Dec 24, 2014

Well &!#& !&(# @!#!& **!#!@$ !*#*!#$**! **@*#($*!* !*#!* #*!*$*%!*

pjc · Dec 24, 2014

As always, /. is way behind the times. The row hammer issue has been published and known for years, and this paper has been out for months. The 6.0 beta of memtest86 specifically has a row hammer test.

See this thread on the memtest86 forum for why their numbers are a bit exaggerated: http://www.passmark.com/forum/showthread.php?4836-MemTest86-v6-0-Beta&p=17941#post17941

TL;DR: They had to implement their own memory controller that didn't shuffle the addresses or invert the bits as normal controllers do. So while the chips themselves may be susceptible to row hammering, it's very hard to create an error under real-world conditions unless you have really bad memory. And then inducing anything beyond a single-bit error is nearly impossible.

So ECC to the rescue again. In the unlikely event that you get a row hammer in real-world conditions, it's probably corrected. And in the even unlikelier event that you get two errors, your system panics, and no data loss.

pjc · Dec 24, 2014

jobelz said:
I'd assume other filesystems are more forgiving on non-ECC systems, while lacking all the good stuff ZFS provides over them of course.

To answer the question I think you're asking:

Other filesystems will silently continue if you get in-memory errors. This may or may not cause data corruption and loss.

Because ZFS does such extensive checksumming, it will detect some (many?) in-memory errors and alert you to the data loss (if it can recover) or panic (if it can't).

So the difference is how memory errors are handled: silently failing (other filesystems) or loudly and noisily yelling (ZFS).

Note that ZFS can't detect all memory errors. For example if a bit flips in data that you're going to write to disk, it'll create the checksum from the bad data, since it has no way of knowing that the data is bad.

So as others have said, you should really be using ECC RAM in any file server, regardless of filesystem. The lack of ECC on affordable/home NAS appliances is what drove me to FreeNAS.

cyberjock · Dec 26, 2014

pjc said:
As always, /. is way behind the times. The row hammer issue has been published and known for years, and this paper has been out for months. The 6.0 beta of memtest86 specifically has a row hammer test.

See this thread on the memtest86 forum for why their numbers are a bit exaggerated: http://www.passmark.com/forum/showthread.php?4836-MemTest86-v6-0-Beta&p=17941#post17941

TL;DR: They had to implement their own memory controller that didn't shuffle the addresses or invert the bits as normal controllers do. So while the chips themselves may be susceptible to row hammering, it's very hard to create an error under real-world conditions unless you have really bad memory. And then inducing anything beyond a single-bit error is nearly impossible.

So ECC to the rescue again. In the unlikely event that you get a row hammer in real-world conditions, it's probably corrected. And in the even unlikelier event that you get two errors, your system panics, and no data loss.

I think that the "news" isn't that someone found this vulnerability (which you also said has been known for years). The news was that so many DIMMs out there *are* susceptible to it. That *is* news as previously there was lots of conjecture and nobody went out and tested lots of DIMMs to see what the result would be.

anodos · Dec 26, 2014

cyberjock said:
http://hardware.slashdot.org/story/...les-vulnerable-to-bit-rot-by-a-simple-program

Well, that's a rotten situation.

Knowltey · Jan 11, 2015

anodos said:
Well, that's a rotten situation.

The researchers don't delve deeply into applications of this, but hint at possible security exploits.

Hmm, sounds to me like it could be pretty useful for testing purposes.

Important Announcement for the TrueNAS Community.

ECC vs non-ECC RAM and ZFS

Resident Grinch

Inactive Account

Cadet

Patron

Server Wrangler

Explorer

FreeNAS Generalissimo

Patron

Patron

Inactive Account

Resident Grinch

Patron

Inactive Account

Inactive Account

Resident Grinch

Contributor

Contributor

Inactive Account

Sambassador

Patron

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "ECC vs non-ECC RAM and ZFS"

Similar threads