I'm really not challenging you on this.
Challenge away. I tend to be pragmatic about this sort of thing. With an EE/CS background, I understand why people expect to see a certain behaviour, and my best guess is that the issues aren't entirely understood. I reconcile the ground truth I observe with the theories of others, and here I come to the conclusion that theories fail in the face of facts. I am in no way married to my conclusions, if you can make a compelling challenge as to their correctness.
If cosmic radiation were flipping bits randomly (after all, Intel claims something like 300 random errors in RAM per year for a machine running 24x7x365), I'd expect ECC to correct approximately 300 random errors per year.
Yes, that's about what you'd expect, if we were to believe that.
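Just to put numbers on it, here's a quick back-of-the-envelope taking that ~300 soft errors per machine-year figure at face value (the number is the claim quoted above, not something I've measured):

    # Back-of-the-envelope: what a claimed ~300 soft errors per machine-year
    # would look like in the ECC logs if the claim actually held.
    CLAIMED_ERRORS_PER_YEAR = 300            # the figure quoted above

    per_day = CLAIMED_ERRORS_PER_YEAR / 365.0
    per_quarter = CLAIMED_ERRORS_PER_YEAR / 4.0
    over_20_years = CLAIMED_ERRORS_PER_YEAR * 20

    print(f"~{per_day:.2f} corrected errors per day")          # ~0.82
    print(f"~{per_quarter:.0f} corrected errors per quarter")  # 75
    print(f"{over_20_years} over a 20-year service life")      # 6000

So the claim works out to roughly one corrected error a day, about 75 a quarter, and thousands over the life of a long-serving box. Keep those numbers in mind against what follows.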
So here's my practical observation.
1) I'm fully aware of the "Google ECC" study and other claims that ECC errors happen at the rate of (essentially some random number) per day.
2) On a computer that was nearing the ripe old age of 20 (years), we were starting to see parity errors pop up every quarter or thereabouts. Parity errors caused panics, so the events did not pass unnoticed. Aging parts. These panics were not regularly happening during the first half (two thirds?) of the server's life, though there may have been one or two over the years.
3) Our other gear typically doesn't show ANY significant level of ECC errors. Unless there's a bad chip, in which case you can get what I shared above.
My first conclusion is that the Google ECC study is fatally flawed, and my best guess is that it's because Google is buying the cheapest possible $#!+ that they possibly can and cramming it in racks as close as they can, running it as hot as they can, to keep their capex and opex as low as possible. I am *shocked* that they're seeing signs of unhappy hardware.
My second conclusion is that there's something wrong with the cosmic radiation theories, or at least with how they've been applied to DRAM for error estimation purposes. We see this all day long in many industries. The FDA has just ordered a reduction in the recommended dosage of sleep medications because they made what's essentially a mistake the first time around. Stuff happens.
So here are some hints.
A) Buy name-brand RAM. If you're Google and you're buying thousand-lots from the lowest bidder, and the lowest bidder knows you're building ECC-protected servers for bottom dollar, you may well be getting something of lower quality than the memory being sold in the channel to HP, Dell, or other OEMs and retail vendors. Remember that even marginal silicon often ends up somewhere. Hopefully not in your machine. A known vendor willing to provide a lifetime guarantee may charge 10% more than the scuzzballs selling RAM over at never-effin-heard-of-us.com, but you know what? It's worth the extra. The known vendor is getting the stuff they know they're less likely to have to RMA.
B) Build a quality system that can watch for problems. Server-quality mainboard, from a vendor who does this stuff for the enterprise market. Because you know what? That no-name board won't have ECC error logging support. Your ASUS, Gigabyte, etc. boards may - but then again they may not. But you can be pretty sure that the HP, IBM, Supermicro, etc. stuff is intended to do that (there's a rough monitoring sketch after this list). Don't skimp on other parts like the power supply either.
C) Keep your system well cooled and unstressed.
D) Be proactive about bench testing (run memtest86 for a week!) and ongoing monitoring, and replace faulty modules as soon as they show up.
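As a concrete example of the monitoring in B) and D): on a Linux box where the EDAC driver for your memory controller is loaded, the kernel exposes corrected and uncorrected error counts under /sys/devices/system/edac/mc/. Here's a rough sketch of a poller; the sysfs layout shown is the common one, but verify what your particular kernel and chipset actually expose:

    #!/usr/bin/env python3
    # Rough sketch: poll the Linux EDAC counters so a flaky module shows up
    # in monitoring instead of in a panic. Assumes the EDAC driver for your
    # memory controller is loaded; the sysfs paths below are the usual
    # layout, but check what your kernel actually provides.
    import glob
    import os

    def read_count(path):
        try:
            with open(path) as f:
                return int(f.read().strip())
        except (OSError, ValueError):
            return None

    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
        ce = read_count(os.path.join(mc, "ce_count"))   # corrected errors
        ue = read_count(os.path.join(mc, "ue_count"))   # uncorrected errors
        print(f"{os.path.basename(mc)}: corrected={ce} uncorrected={ue}")

Wire something like that into whatever monitoring you already run; a steadily climbing corrected-error count on one controller points at a module to replace long before you're staring at a panic.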
Strikes me as a good idea to go survey a few other systems to see what's what around here. Hm.