Why I love ECC.

Status
Not open for further replies.

ioquatix

Dabbler
Joined
May 9, 2017
Messages
48
I've got an old N40L. It's been an awesome first server but it's a bit long in the tooth.

Recently, it's been crashing on a semi-regular basis. It started off at about once every few months and eventually got as bad as once a week.

I chalked it up to ZFS on Linux issues and not enough RAM. After upgrading my system to 2x8GB, I had even more frequent problems.

When I ran memtest86, the system would almost always crash at the 9-minute mark. In some cases, the boot screen reported "unrecoverable ECC error detected". I checked in the BIOS and there were a LOT of errors in the log, going back quite a while. So the system was detecting the error and locking up to prevent any further issues.
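
For anyone who wants to check for the same kind of errors from a running Linux system rather than the BIOS log, here is a minimal sketch. It assumes the kernel's EDAC driver is loaded and exposes per-controller counters under `/sys/devices/system/edac/mc`; whether that is true depends on the board and kernel, and I have not confirmed it on an N40L. The function name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}}.

    "ce" is the corrected-error count, "ue" the uncorrected-error count.
    Assumes the Linux EDAC sysfs layout; returns {} if it is not present.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # EDAC driver not loaded, or not a Linux host.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                # Counter file missing or unreadable on this platform.
                entry[key] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for mc, c in read_edac_counts().items():
        print(f"{mc}: corrected={c['ce']} uncorrected={c['ue']}")
```

A non-zero corrected count that keeps climbing is the early warning. An uncorrected count above zero corresponds to the "unrecoverable" case described above.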

While this is far from ideal, ZFS was fine through all of it (I can't say the same for my ext4 boot drive). It has survived numerous hard crashes, and as far as I can tell (I've checked and scrubbed), all my data is absolutely fine.

I'm absolutely certain that without ECC, I would have experienced silent data corruption. I ingest 10-100 GB of data per week as backups, and the system has 16 GB of RAM, so it seems plausible that every bit of memory was utilised on a regular basis.

I will always use ECC in my servers, and I'm seriously considering it for my workstation too now.
 

scrappy

Patron
Joined
Mar 16, 2017
Messages
347
A good cautionary tale. Glad to hear ZFS kept your data safe and clear of corruption.
 

ioquatix

Dabbler
Joined
May 9, 2017
Messages
48
By the way, I don't know whether the problem was faulty memory chips, faulty connections, something wrong with one or both of the CPU cores, or something else, e.g. the memory controller. But in any case, ECC detected an issue. In my mind, this is simply fantastic.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Yes, this is a good story. It's always good to find a case where an issue did happen and ECC RAM helped. Too often all we can do is state the benefits of ECC RAM, and we can rarely prove them. So many people feel that RAM doesn't fail once it has passed testing that one initial time.

Thanks for sharing.
 