Perhaps you might want to update your CPU
Just upgraded my system to ECC ram, had to buy a new motherboard ($115) and ram ($260). Tested and ECC is working http://hardforum.com/showthread.php?t=1693051 :)
- CPU - g530
- Mobo - MBD-X9SCM-O
- Ram - 4*8GB HMT41GU7MFR8C-PB
If I'm not mistaken, Supermicro has already said that the ECC function on Celerons will not work, but that things like the ecc_check tool will falsely report that it is functional. I can't find the link right now. I'll have to search the forums for the other threads on the ecc_check tool, as one of them is very detailed and clears this up.
I would be interested if you could prove my system isn't actually giving me ECC functionality. I'll keep an eye out for it as well.
EDIT
Found the thread and see what you mean. I did read that Dell sells some workstation machines with ECC and Sandy Bridge Pentiums/Celerons, which is why I would believe Intel ended up enabling it on these CPUs. I wish there was a way to be 100% sure, maybe force a bit to flip and check the log to see fixed bits?
Oh, that's the major problem. ECC has almost no way to prove it does or doesn't work. Even if you simulate an ECC error, I'm not sure the system would respond appropriately if the ECC feature wasn't available and you tried to trigger it. The ecc_check tool behaves differently for different manufacturers, different motherboards, and different chipsets. So it's a big ol' hodge-podge of a mess.
So what I tell people is to buy hardware that is clearly marked as supporting ECC. It's just not worth the risk to use some component people think supports ECC, only to find out later that you never had the protection you thought you had the whole time. You can buy the $40 CPU that *might* have ECC support, or you can buy the $70 CPU that *has* ECC support. Seems like simple math to me. ;)
Intel's method for ECC is somewhat convoluted with the mixture of CPUs and motherboards necessary. AMD's is even worse. You literally have to go to the motherboard manufacturer to validate it or not. And from what I've read, a few manufacturers have claimed to "support ECC". But apparently it's been proven that their definition of "supported" is not what you and I take it to mean. For those manufacturers it meant that you could put ECC RAM on the motherboard and the RAM would work. They weren't saying that the ECC feature worked! So surprise surprise, you'd buy their motherboard, ECC RAM, and a compatible CPU and think all was great. Then you found out that your motherboard manufacturer f*cked you. How awesome is that?
This is yet another example where "attention to detail" really is important.
So, my understanding of the Supermicro X10 board is that it actually will not POST unless ECC RAM is installed and performing the ECC function. I sure as hell hope that's right! I have the X10SLM+-F with a G3220 (and the ARK says ECC is supported), and it's definitely ECC RAM. But now you have me worried because, as you say, there is no real practical test to see if you are actually performing error correction with the 9th bit.
I'm gonna not worry about it too much until we get empirical evidence that ECC isn't working. The only empirical evidence we have thus far is the ecc_check program. It has in every case said ECC is enabled with a Xeon, or with a Sandy Bridge Celeron/Pentium/i3 + ECC RAM. It has in every case said ECC isn't enabled with an i5, or with non-ECC RAM. One guy even tested it with an ECC stick + non-ECC stick installed at the same time and the expected response was given. We also have rational evidence: Dell sold workstation systems with ECC RAM, low-power processors, and one of those obscure consumer/server PXX chipset motherboards to corporations. That explains why Intel would enable ECC (to make their vendor happy) while also not listing it on their page (to stop cheapskates like me from figuring them out).
And that's totally fine and totally your choice (and your loss if you are wrong). But me, I take the conservative approach (especially when giving people advice!). Do you assume it is working unless you can prove it isn't, or assume it isn't working until you can prove it is? Which is more conservative? If you are happy to believe you have ECC support and that you "stuck it to the man" by not buying a CPU that costs $30 more (G3220), then great! Congratulations to you. Me, I'll stick to stuff where I know with 100% certainty what I'm getting for my dollar.
As for DrKK, if your motherboard and CPU are both officially listed as supporting ECC RAM, you are fine (yours are). My concern is that the Intel ARK for the Celerons does NOT say ECC is supported. And it's a bad idea (based on my conservative decision making) to make that leap of faith.
Sorry, I forgot yours is the X9 and not the X10. You'll want to check out the G2020. It's $67.99 on Newegg with free shipping. And it officially supports ECC without a doubt. So still, about $30 more for certainty about ECC support. And you'll see a small performance boost to boot! Based on my 2 mins of googling, it's about 20-30% faster! On a cost vs performance basis, the G2020 is an amazing CPU! We recommend the G2020 all over the place here. As long as you don't plan to do Plex transcoding in a jail or use compression, it's more than enough for FreeNAS.
Forgive me if this has already been discussed, but as an alternative to rebuilding your system with ECC components, couldn't you have a script that runs daily, produces a recursive MD5 hash and individual timestamp for everything under the mount point, and compares this with the previous day's output? Any file whose timestamp hasn't changed but whose MD5 has would then be flagged as a failure.
I know it's hardly real-time, and of course there is then the issue of what you do with the corrupted files, but surely it's better than nothing, even if worse than a full ECC platform?
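For what it's worth, a rough sketch of that daily check in Python might look like the following. The function names (`hash_file`, `snapshot`, `suspects`) are made up for illustration, and it assumes the whole tree can be walked and hashed in one pass. It only flags candidates; it does not address whether bad RAM corrupted the hash itself.

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root):
    """Map each file path (relative to root) to its mtime and MD5."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            result[rel] = {
                "mtime": os.stat(full).st_mtime_ns,
                "md5": hash_file(full),
            }
    return result


def suspects(previous, current):
    """List files whose timestamp is unchanged but whose hash differs."""
    return sorted(
        path
        for path, entry in current.items()
        if path in previous
        and previous[path]["mtime"] == entry["mtime"]
        and previous[path]["md5"] != entry["md5"]
    )
```

You would run `snapshot()` once a day, keep yesterday's result somewhere (for example dumped to JSON), and feed both into `suspects()`.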
As jgreco said, the problem with this plan is that the bad RAM that caused the corruption you hope to detect with the hash script is just as likely to trigger a false positive when you create or check the hash.
If the corruption is indeed in the file on disk, ZFS is going to detect this as well and start stomping on the data. If the disk activity then leads to corruption of ZFS metadata while creating or checking the hashes, you've just destroyed all your data.
You might be alerted to a problem, but it'll probably be too late.
It's a value judgement. How much is your data worth? What would you need to do to restore it if your pool went bad? Is there any that is irreplaceable? What is the cost of losing that data? Is it worth it to just put in the funds for ECC now?
Just to clarify, my proposal was to store the hashes OFF the ZFS devices, either on a separate disk or network stored.
True, but a false positive can be verified by re-running the hash on that one file to gain more confidence, surely?
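Something like this is what I have in mind for the re-check, as a minimal sketch only. `md5_of` and `recheck` are hypothetical names, and the three result labels are my own. Bear in mind that repeated reads may be served from cache rather than from disk, so agreement between re-reads is weaker evidence than it looks.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recheck(path, recorded_md5, attempts=3):
    """Re-hash one flagged file several times and classify the outcome."""
    digests = {md5_of(path) for _ in range(attempts)}
    if digests == {recorded_md5}:
        # Every re-read agrees with the stored hash: the earlier alarm
        # was probably a false positive.
        return "matches"
    if len(digests) > 1:
        # The re-reads disagree with each other: the hashing itself is
        # unreliable, which points at RAM or the read path.
        return "unstable"
    # The re-reads agree with each other but not with the stored hash.
    return "mismatch"
```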