Good news, everyone! Haswell i3s do work with ECC.

Status
Not open for further replies.

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Following my unfortunate experience last weekend, I've been doing a thorough burn-in on my old server, to make sure everything's fine.

Several days of prime95 were uneventful. As you might have guessed from the thread title, memtest86+ was not:
[Screenshot: upload_2016-4-29_21-9-36.png]

For future reference, memtest86+ itself just happily chugged along reporting zero errors; the ECC errors only showed up in the IPMI logs.

The bad news for me is that I have further debugging and further RMAs to do. That makes at least two major system components knocked out in that mysterious event.

The excellent news for the community is that a very popular platform, Core i3 4xxx + Supermicro X10 board + QVL RAM, clearly supports ECC properly.
Also interesting is that the errors seem to be happening at regular intervals, suggesting it may be a single bad cell.

Now, let's get to the details:

Hardware:
  • Intel Core i3 4330
  • Supermicro X10SLM+-F
  • 2 * Crucial CT102472BD160B / Micron MT18KSF1G72AZ-1G6E1
  • Currently powered by a brand-new Seasonic X-650, powered by a Seasonic G-550 when the whole mess happened
So, what are we missing? Confirmation that the system will halt on an uncorrectable error. That's the outstanding question now.
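
For anyone wondering what separates a corrected error from an uncorrectable one, here is a toy sketch of a SECDED (single-error-correct, double-error-detect) code in Python. It uses an extended Hamming(8,4) code over 4 data bits purely for illustration. Real ECC DIMMs protect 64 data bits with 8 check bits, and the exact code this platform's memory controller uses is not something I can confirm here. The names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into a toy 8-bit SECDED codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of the whole word even
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the word is uncorrectable."""
    # XOR of the positions of all set bits is 0 for a valid Hamming codeword.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        fixed, status = code, "ok"
    elif overall == 1:
        # Single-bit error: the syndrome is the flipped position
        # (0 means the overall parity bit itself was hit).
        fixed = list(code)
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"

    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return nibble, status


# Small demo: flip one bit, then two bits, of the same stored word.
word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```

In this toy model a single flipped bit comes back as the original data with a "corrected" status, which is why a software tester would see nothing wrong. Two flipped bits can only be detected, which is the case where the system would be expected to halt.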
 
Joined
Dec 2, 2015
Messages
730
It would be interesting to put this RAM into service in a test system, with known data, and see what happens.

Wasn't Cyberjock looking for bad ECC RAM, so he could do some testing to confirm FreeNAS properly responded to the errors? Maybe he would be prepared to buy the bad RAM, rather than have you RMA it. Bad ECC RAM is so rare, it may be worth more dead than alive.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Wasn't Cyberjock looking for bad ECC RAM, so he could do some testing to confirm FreeNAS properly responded to the errors?
He got one some time ago and is currently working on documenting his findings, last I heard.

Right now, my working theory is that the DIMM has a bad cell or small region, which fails intermittently (memtest86+ ran 14+ passes and only 4-5 of them resulted in ECC errors). That reduces its value somewhat, in comparison to one which fails consistently.

I feel a bit silly, though: I should've checked the IPMI logs earlier in the memtest86+ run. It would've saved me a day.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I should add that I don't feel particularly at ease with doing any RMAs before I understand what happened to the server - and that's still quite the mystery.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
What I needed was RAM with uncorrectable errors (which I can happily say that I do have). Unfortunately, I'm still in the process of retiring my current main FreeNAS system for a new one. Once that gets done, I'll be able to use my old system for further testing. I'm having problems getting manpower to move my servers and such (putting servers in server racks takes 2 people, and I've had a hard time sucking a friend into helping me). This is a horribly busy time of year for my friends, so I may not get to testing for a month or two. :(

I need more geeky friends living near me. :p

Edit: The behavior you're seeing (ECC errors being logged but nothing showing up in memtest) is what I expected (and have been trying to share with people for the last year or more). ECC is handled at the hardware level, and memtest operates at the software level. That's why I tell people that memtest runs aren't necessarily "great" for testing ECC RAM: they won't necessarily show errors. Looking in the IPMI logs is where it really matters. Now, for ECC to do its "thing" you do need to read from RAM, which does mean you must use it.

So using memtest to cycle through all memory locations, and then checking IPMI to see if anything is bad, is the recommended way to test a system IMO. It's still not 100%, though, because of a bunch of edge cases. But if your system is in a condition where ECC is failing, you're hitting edge cases, and memtest isn't enough, you're probably dead because the heat death of the universe has likely already taken place. ;)
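
As a rough sketch of the "run memtest, then look at IPMI" workflow, the Python below counts ECC entries in a text dump of the system event log. It assumes the dump is pipe-separated text along the lines of what `ipmitool sel elist` prints, and that memory events contain the strings "Correctable ECC" or "Uncorrectable ECC". The exact wording can differ between BMCs and tools, so treat both as assumptions. The function name `count_ecc_events` and the sample lines are made up for illustration and are not taken from the log in this thread.

```python
from collections import Counter


def count_ecc_events(sel_text):
    """Count correctable vs. uncorrectable ECC entries in SEL-style text.

    Assumes one event per line and that the event description contains
    "Correctable ECC" or "Uncorrectable ECC" (wording is an assumption).
    """
    counts = Counter()
    for line in sel_text.splitlines():
        lowered = line.lower()
        # Check "uncorrectable" first: it contains "correctable" as a substring.
        if "uncorrectable ecc" in lowered:
            counts["uncorrectable"] += 1
        elif "correctable ecc" in lowered:
            counts["correctable"] += 1
    return counts


# Hypothetical sample, shaped like a SEL text dump; not real log data.
SAMPLE_SEL = """\
   1 | 01/01/2000 | 10:00:00 | Power Supply #0x01 | Presence detected | Asserted
   2 | 01/01/2000 | 12:30:00 | Memory #0xff | Correctable ECC | Asserted
   3 | 01/01/2000 | 15:00:00 | Memory #0xff | Correctable ECC | Asserted
"""

if __name__ == "__main__":
    counts = count_ecc_events(SAMPLE_SEL)
    print("correctable:", counts["correctable"])
    print("uncorrectable:", counts["uncorrectable"])
```

Running memtest and then feeding a fresh log dump through something like this gives a quick yes/no on whether any ECC events were recorded during the run, even when memtest itself reports zero errors.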
 