ECC vs non-ECC RAM and ZFS

jgreco · Feb 8, 2014

No speed demon but sometimes it is just reliability that you need!

gpsguy · Feb 8, 2014

Maybe it's on his wedding registry.

DrKK said:
Perhaps you might want to update your CPU

cyberjock · Feb 8, 2014

Joshua Parker Ruehlig said:
Just upgraded my system to ECC ram, had to buy a new motherboard ($115) and ram ($260). Tested and ECC is working http://hardforum.com/showthread.php?t=1693051 :)

CPU - g530

Mobo - MBD-X9SCM-O

Ram - 4*8GB HMT41GU7MFR8C-PB

If I'm not mistaken, Supermicro has already said that the ECC function on celerons will not work, but that things like the ecc_check tool will falsely report that it is functional. I can't find the link right now. I'll have to search the forums for the other threads on the ecc_check tool as one of them is very detailed and clears up the details.

Joshua Parker Ruehlig · Feb 8, 2014

cyberjock said:
If I'm not mistaken, Supermicro has already said that the ECC function on celerons will not work, but that things like the ecc_check tool will falsely report that it is functional. I can't find the link right now. I'll have to search the forums for the other threads on the ecc_check tool as one of them is very detailed and clears up the details.

I would be interested if you could prove my system isn't actually giving me ECC functionality. I'll keep an eye out for it as well.

EDIT
Found the thread and see what you mean. I did read that Dell sells some workstation machines with ECC and Sandy Bridge Pentiums/Celerons, which is why I believe Intel ended up enabling it on these CPUs. I wish there was a way to be 100% sure. Maybe force a bit to flip and check the log to see fixed bits?

cyberjock · Feb 8, 2014

Joshua Parker Ruehlig said:
I would be interested if you could prove my system isn't actually giving me ECC functionality. I'll keep an eye out for it as well.

EDIT
Found the thread and see what you mean. I did read that Dell sells some workstation machines with ECC and Sandy Bridge Pentiums/Celerons, which is why I believe Intel ended up enabling it on these CPUs. I wish there was a way to be 100% sure. Maybe force a bit to flip and check the log to see fixed bits?

Oh, that's the major problem. ECC has almost no way to prove it does or doesn't work. Even if you simulate an ECC error, I'm not sure the simulation would respond appropriately if the ECC feature wasn't available and you tried to trigger it. The ecc_check tool behaves differently for different manufacturers, different motherboards, and different chipsets. So it's a big ol' hodge-podge of a mess.

So what I tell people is to buy hardware that absolutely is clearly marked as supporting ECC. It's just not worth the risk to use some component people think supports ECC, only to find out later that you never had the protection you thought you had the whole time. You can buy the $40 CPU that *might* have ECC support, or you can buy the $70 CPU that *has* ECC support. Seems like simple math to me. ;)

Intel's method for ECC is somewhat convoluted with the mixture of CPUs and motherboards necessary. AMD's is even worse. You literally have to go to the motherboard manufacturer to confirm whether it works or not. And from what I've read, a few manufacturers have claimed to "support ECC". But apparently it's been proven that their definition of "supported" is not what you and I take it to mean. For those manufacturers it meant that you could put ECC RAM on the motherboard and the RAM would work. They weren't saying that the ECC feature worked! So surprise surprise, you'd buy their motherboard, ECC RAM, and a compatible CPU and think all was great. Then you found out that your motherboard manufacturer f*cked you. How awesome is that?

This is yet another example where "attention to detail" really is important.

DrKK · Feb 9, 2014

cyberjock said:
Oh, that's the major problem. ECC has almost no way to prove it does or doesn't work. Even if you simulate an ECC error, I'm not sure the simulation would respond appropriately if the ECC feature wasn't available and you tried to trigger it. The ecc_check tool behaves differently for different manufacturers, different motherboards, and different chipsets. So it's a big ol' hodge-podge of a mess.

So what I tell people is to buy hardware that absolutely is clearly marked as supporting ECC. It's just not worth the risk to use some component people think supports ECC, only to find out later that you never had the protection you thought you had the whole time. You can buy the $40 CPU that *might* have ECC support, or you can buy the $70 CPU that *has* ECC support. Seems like simple math to me. ;)

Intel's method for ECC is somewhat convoluted with the mixture of CPUs and motherboards necessary. AMD's is even worse. You literally have to go to the motherboard manufacturer to confirm whether it works or not. And from what I've read, a few manufacturers have claimed to "support ECC". But apparently it's been proven that their definition of "supported" is not what you and I take it to mean. For those manufacturers it meant that you could put ECC RAM on the motherboard and the RAM would work. They weren't saying that the ECC feature worked! So surprise surprise, you'd buy their motherboard, ECC RAM, and a compatible CPU and think all was great. Then you found out that your motherboard manufacturer f*cked you. How awesome is that?

This is yet another example where "attention to detail" really is important.

So, my understanding of the Supermicro X10 board is that it actually will not POST unless ECC RAM is installed and performing the ECC function. I sure as hell hope that's right! I have the X10SLM+-F with a G3220 (and the ARK says ECC is supported), and it's definitely ECC RAM. But now you have me worried, as you say, because there is no real practical test to see if you are actually performing error-correction with the 9th bit.
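
To make "error-correction with the 9th bit" concrete, here is a toy sketch of the single-error-correct, double-error-detect (SECDED) idea behind ECC memory. It is an illustration only: ECC DIMMs store 8 check bits alongside every 64 data bits, while this example uses a much smaller extended Hamming code with 4 data bits and 4 check bits. The function names `encode` and `decode` are made up for the example and do not correspond to any real memory-controller interface.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (a list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # code[0] is overall parity; code[1..7] is a Hamming(7,4) word
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total number of 1s even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of the positions holding a 1
    parity_ok = sum(code) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        # Odd parity: treat it as one flipped bit, located at index `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Flipping any one bit of `encode(n)` still decodes to `n` with status `"corrected"`, and flipping any two bits is reported as `"uncorrectable"` rather than silently returning wrong data.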

Joshua Parker Ruehlig · Feb 9, 2014

DrKK said:
So, my understanding of the Supermicro X10 board is that it actually will not POST unless ECC RAM is installed and performing the ECC function. I sure as hell hope that's right! I have the X10SLM+-F with a G3220 (and the ARK says ECC is supported), and it's definitely ECC RAM. But now you have me worried, as you say, because there is no real practical test to see if you are actually performing error-correction with the 9th bit.

I'm not gonna worry about it too much until we get empirical evidence that ECC isn't working. The only empirical evidence we have thus far is the ecc_check program. It has in every case said ECC is enabled with a Xeon, or Sandy Bridge Celeron/Pentium/i3 + ECC RAM. It has in every case said ECC isn't enabled with an i5, or with non-ECC RAM. One guy even tested it with an ECC stick + non-ECC stick installed at the same time and the expected response was given. We also have rational evidence: Dell sold workstation systems with ECC RAM, low-power processors, and one of those obscure consumer/server PXX chipset motherboards to corporations. That explains why Intel would enable ECC (to make their vendor happy) while also not listing it on their page (to stop cheapskates like me from figuring them out).

cyberjock · Feb 9, 2014

Joshua Parker Ruehlig said:
I'm not gonna worry about it too much until we get empirical evidence that ECC isn't working. The only empirical evidence we have thus far is the ecc_check program. It has in every case said ECC is enabled with a Xeon, or Sandy Bridge Celeron/Pentium/i3 + ECC RAM. It has in every case said ECC isn't enabled with an i5, or with non-ECC RAM. One guy even tested it with an ECC stick + non-ECC stick installed at the same time and the expected response was given. We also have rational evidence: Dell sold workstation systems with ECC RAM, low-power processors, and one of those obscure consumer/server PXX chipset motherboards to corporations. That explains why Intel would enable ECC (to make their vendor happy) while also not listing it on their page (to stop cheapskates like me from figuring them out).

And that's totally fine and totally your choice (and your loss if you are wrong). But me, I take the conservative approach (especially when giving people advice!). Do you assume it is working unless you can prove it isn't, or assume it isn't working until you can prove it is? Which is more conservative? If you are happy to believe you have ECC support and that you "stuck it to the man" by not buying a $30 more expensive CPU (G3220), then great! Congratulations to you. Me, I'll stick to stuff where I know with 100% certainty what I'm getting for my dollar.

As for DrKK, if your motherboard and CPU are both officially listed as supporting ECC RAM, you are fine (yours are). My concern is that the Intel ARK for the Celerons does NOT say ECC is supported. And it's a bad idea (based on my conservative decision making) to make that leap of faith.

Joshua Parker Ruehlig · Feb 9, 2014

cyberjock said:
And that's totally fine and totally your choice (and your loss if you are wrong). But me, I take the conservative approach (especially when giving people advice!). Do you assume it is working unless you can prove it isn't, or assume it isn't working until you can prove it is? Which is more conservative? If you are happy to believe you have ECC support and that you "stuck it to the man" by not buying a $30 more expensive CPU (G3220), then great! Congratulations to you. Me, I'll stick to stuff where I know with 100% certainty what I'm getting for my dollar.

As for DrKK, if your motherboard and CPU are both officially listed as supporting ECC RAM, you are fine (yours are). My concern is that the Intel ARK for the Celerons does NOT say ECC is supported. And it's a bad idea (based on my conservative decision making) to make that leap of faith.

I'd also need to buy a different motherboard =/
I guess my future upgrade option is a Xeon. I don't see any LGA1155 CPUs with ECC & AES-NI.

cyberjock · Feb 9, 2014

Sorry, I forgot yours is the X9 and not the X10. You'll want to check out the G2020. It's $67.99 on Newegg with free shipping. And it officially supports ECC without a doubt. So still, about $30 more for certainty of ECC support. And you'll see a small performance boost to boot! Based on my 2 mins of googling, it's about 20-30% faster! On a cost vs performance basis, the G2020 is an amazing CPU! We recommend the G2020 all over the place here. As long as you don't plan to do Plex transcoding in a jail or use compression, it's more than enough for FreeNAS.

Joshua Parker Ruehlig · Feb 9, 2014

cyberjock said:
Sorry, I forgot yours is the X9 and not the X10. You'll want to check out the G2020. It's $67.99 on Newegg with free shipping. And it officially supports ECC without a doubt. So still, about $30 more for certainty of ECC support. And you'll see a small performance boost to boot! Based on my 2 mins of googling, it's about 20-30% faster! On a cost vs performance basis, the G2020 is an amazing CPU! We recommend the G2020 all over the place here. As long as you don't plan to do Plex transcoding in a jail or use compression, it's more than enough for FreeNAS.

I think I'll just get a Xeon in a year or two. I use transcoding in subsonic so it may be helpful =]

DJABE · Feb 22, 2014

http://ark.intel.com/products/77773/Intel-Pentium-Processor-G3220-3M-Cache-3_00-GHz
What do you say for this Haswell alongside SM http://www.supermicro.com/products/motherboard/xeon/c220/x10slm-f.cfm
Crucial DDR3 ECC unbuffered 2x8GB..

cyberjock · Feb 22, 2014

This is a thread about ECC versus non-ECC. If you have build questions, please ask them elsewhere.

nullfork · Feb 25, 2014

Forgive me if this has already been discussed, but as an alternative to rebuilding your system with ECC components, couldn't you have a script that runs daily, produces a recursive MD5 hash and individual timestamp for every file under the mount point, and compares this with the previous day's output? Any file whose timestamp hasn't changed but whose MD5 has would then be flagged as a failure.

I know it's hardly real-time, and of course there is then the issue of what you do with the corrupted files, but surely it's better than nothing, even if worse than a full ECC platform?
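
A minimal sketch of the daily check being proposed, purely to make the idea concrete. The function names `build_manifest` and `suspect_files` are hypothetical, MD5 is used only because the proposal names it, and the code does nothing about the objection raised in the replies that reading files through bad RAM can itself produce wrong hashes.

```python
import hashlib
import os


def build_manifest(root):
    """Map each file path (relative to root) to (mtime in ns, MD5 hex digest)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.md5()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            relative = os.path.relpath(path, root)
            manifest[relative] = (os.stat(path).st_mtime_ns, digest.hexdigest())
    return manifest


def suspect_files(previous, current):
    """Files in both runs whose timestamp is unchanged but whose hash differs."""
    return sorted(
        path
        for path, (mtime, digest) in current.items()
        if path in previous
        and previous[path][0] == mtime
        and previous[path][1] != digest
    )
```

Each day's manifest would have to be saved somewhere (for example serialized as JSON) so the next run can pass it in as `previous`; where it is stored is left to the caller.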

jgreco · Feb 25, 2014

Not to mention that the very act of reading the files with bad RAM could be contributing to the corruption.

fracai · Feb 25, 2014

nullfork said:
Forgive me if this has already been discussed, but as an alternative to rebuilding your system with ECC components, couldn't you have a script that runs daily, produces a recursive MD5 hash and individual timestamp for every file under the mount point, and compares this with the previous day's output? Any file whose timestamp hasn't changed but whose MD5 has would then be flagged as a failure.

I know it's hardly real-time, and of course there is then the issue of what you do with the corrupted files, but surely it's better than nothing, even if worse than a full ECC platform?

As jgreco said, the problem with this plan is that the bad RAM that caused the corruption you hope to detect with the hash script is just as likely to trigger a false positive when you create or check the hash. If the corruption is indeed in the file on disk, ZFS is going to detect this as well and start stomping on the data. If the disk activity then leads to corruption of ZFS metadata while creating or checking the hashes, you've just destroyed all your data.

You might be alerted to a problem, but it'll probably be too late.

It's a value judgement. How much is your data worth? What would you need to do to restore it if your pool went bad? Is there any that is irreplaceable? What is the cost of losing that data? Is it worth it to just put in the funds for ECC now?

nullfork · Feb 25, 2014

Just to clarify, my proposal was to store the hashes OFF the ZFS devices, either on a separate disk or network stored.

fracai said:
As jgreco said, the problem with this plan is that the bad RAM that caused the corruption you hope to detect with the hash script is just as likely to trigger a false positive when you create or check the hash.

True, but a false positive can be verified by re-running the hash on that one file to gain more confidence, surely?

fracai said:
If the corruption is indeed in the file on disk, ZFS is going to detect this as well and start stomping on the data. If the disk activity then leads to corruption of ZFS metadata while creating or checking the hashes, you've just destroyed all your data.

Not sure I follow you, when you say ZFS is going to detect this as well, how so? The major point is that ZFS may well store data on disk in a corrupted state and not know about it, so it wouldn't ever "detect" it - thus the ECC recommendation...

fracai said:
You might be alerted to a problem, but it'll probably be too late.

I can't disagree with that, but surely at that point you can put the brakes on and then decide what you're going to do next...

fracai said:
It's a value judgement. How much is your data worth? What would you need to do to restore it if your pool went bad? Is there any that is irreplaceable? What is the cost of losing that data? Is it worth it to just put in the funds for ECC now?

Yes, all true - however I still believe comparing daily hashes and timestamps, with the hash data stored off the ZFS pool, could be beneficial.

fracai · Feb 25, 2014

ZFS has its own checksums that it checks when writing to disk (initial storage) and when reading from disk (file access, create the hash, check the hash).

If you store a file, create a hash, and later check the hash you have to read or write the file each time. That's three opportunities for corruption from bad RAM.

If the corruption occurs when the file is written, ZFS and your own hash will never know.

If the corruption occurs as ZFS does the read (creating or verifying your checksum), ZFS will detect this and attempt to repair the file; in effect, ZFS will cause the corruption because RAM said there was an error. Really, the file was fine, but the ZFS checksum had an issue either when accessing the checksum or when accessing the file data.

You may be able to put on the brakes, but it's probably too late for that as well. If you experience a flipped bit (cosmic ray hits your RAM) you may indeed be lucky enough to have that bit only affect a file. If the flipped bit hits pool metadata or it's a stuck bit that can't be changed from 0 or 1 and is guaranteed to hit pool metadata, you're going to corrupt the pool and lose everything.

I used hashes when I was initially loading my pool, and it did indeed flag around four files with differing hashes. I created a checksum file on my old storage, verified the hashes, transferred the data, and verified the hashes again, which identified a few files with mismatches. I was then able to grab those few files a second time and verify again.
But, there's no point to calculating the hashes yourself. ZFS does that for you. Periodic scrubs will verify the hashes on all your data. Every file access verifies the hash for that file.

ECC protects you by correcting an error in RAM or by halting the system if it can't be corrected. By adding manual hash checking you're increasing the wear and tear on your drives and increasing the amount of data that is passing through your RAM. You could achieve the same thing by just checking "zpool status" at the end of the day. This will report any read / write / checksum errors that ZFS has encountered.

I'll give you that if you're lucky enough that a single flipped bit only caused a single corruption in file data, your checksum system would identify the bad file where "zpool status" would just report that a checksum error had been corrected. But, you're putting a whole lot of extra strain on your pool to identify a statistically unlikely event. I'd be concerned that the extra effort was going to lead to earlier hardware failure and replacement costs.
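
As a rough illustration of the end-of-day "zpool status" check, here is a small Python sketch. It assumes the `zpool` binary is on the PATH and relies on `zpool status -x`, which prints only pools that have problems and otherwise the line "all pools are healthy". The helper names `pools_healthy` and `check_pools` are made up for this example, and the exact wording of the healthy message is an assumption about the installed ZFS version.

```python
import subprocess


def pools_healthy(status_output):
    """True when the text from `zpool status -x` reports no problems."""
    # Assumption: a healthy system prints exactly this phrase.
    return "all pools are healthy" in status_output


def check_pools():
    """Run `zpool status -x` and return (healthy, raw_output)."""
    result = subprocess.run(
        ["zpool", "status", "-x"],
        capture_output=True,
        text=True,
        check=False,
    )
    return pools_healthy(result.stdout), result.stdout
```

Run from cron, a wrapper around `check_pools()` could mail the raw output whenever the first value is false, which covers the read / write / checksum counters without re-reading every file.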

jgreco · Feb 25, 2014

nullfork said:
Just to clarify, my proposal was to store the hashes OFF the ZFS devices, either on a separate disk or network stored.

True, but a false positive can be verified by re-running the hash on that one file to gain more confidence, surely?

If you read the data and it becomes corrupted in-core due to bad RAM, ZFS will helpfully correct the "errors" on the "disk". There is no do-over opportunity. Once ZFS has decided something is wrong, it will try to fix it. There is no point in "re-running the hash" because ZFS will already have "fixed" your data, so you'll get the same (wrong) hash on the next try, because the data's been "corrected" on disk.

This is what is known as a failure to understand the behaviour of a complex system. The mere act of reading the data can lead to the very corruption that you're engineering a half-assed system to try to detect. The proper solution is to ensure the design requirements for ZFS are met; you put ECC in the server, and then the chance of random memory errors being a problem drops to about 0%. And then you can also use mtree to quickly generate yourself a gorgeous checksum tree as well, safely, if you happen to be that sort of paranoid.

cyberjock · Feb 25, 2014

nullfork,

Sorry, but I just have to say that this won't do anything to save you. You're gaining nothing except the *chance* to know that your RAM is bad. Well, big whoop. Your pool is still almost certainly trashed, and your backups *could* already be trashed. And the amount of increased disk activity to test your pool is just not a smart choice.

You're trying to save some cash by avoiding ECC and trying to get around it. The best case you can hope for is to detect that your RAM is crap. There are already *better* tools for that. It's called a RAM test.

Bottom line: this idea won't work the way you want. I thought about doing something like this a year ago, and I dismissed it out of hand because there's no real value gained. You are trading a well-established error-detecting and error-correcting technology for a crappy perceived alternative that really isn't an alternative at all.
