Ivy Bridge Core i3s and ECC - Santa Clara, we have a problem...

Status
Not open for further replies.

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
I mean, you'd THINK that at this point, Intel, if they made a massive error in showing ECC support for some of the lower-priced CPUs, would have gone through and updated the ARK entries on their website, right? I mean, they've done so with the Ivy Bridges. It doesn't take a genius, given that screw-up, for them to run through and make sure they didn't make similar errors in other CPUs.

The fact that the Haswell Pentiums, et al., STILL show ECC supported would almost certainly have to mean that the information is correct, at this point, right?

I'm just thinking out loud.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I mean, you'd THINK that at this point, Intel, if they made a massive error in showing ECC support for some of the lower-priced CPUs, would have gone through and updated the ARK entries on their website, right? I mean, they've done so with the Ivy Bridges. It doesn't take a genius, given that screw-up, for them to run through and make sure they didn't make similar errors in other CPUs.

The fact that the Haswell Pentiums, et al., STILL show ECC supported would almost certainly have to mean that the information is correct, at this point, right?

I'm just thinking out loud.

Right now, I think nobody at Intel has any idea what's going on. Hell, we have an Ivy Bridge showing ECC enabled a few comments up.
 

GrumpyBear

Contributor
Joined
Jan 28, 2015
Messages
141
It's interesting that if you look at the specifications for the Core processors up to the 3rd generation, Intel states in Section 1.2.1, System Memory Support:
• The type of the DIMM modules supported by the processor is dependent on the PCH SKU in the target platform:
— Desktop PCH platforms support non-ECC UDIMMs only
— All In One platforms (AIO) support SO-DIMMs
Starting with the 4th Generation datasheet, the information on memory from Sections 1 and 2 has been merged, and we find the same statement in Section 2.1:
• The type of the DIMM modules supported by the processor is dependent on the PCH SKU in the target platform:
— Desktop PCH platforms support non-ECC UDIMMs only
— All In One platforms (AIO) support SO-DIMMs
But in the latest 5th Generation datasheet, the memory specification has been rewritten and the above statement is missing.
The inference that has always been drawn from this statement is that server systems would support ECC with i3s and Pentiums, but what the specification actually says is that PCHs for desktop systems support only non-ECC.

Many have theorized that the i3s and the Pentiums have ECC support because there is no dual-core Xeon available and Intel left it in to satisfy the market segment looking for a low-cost, low-core-count server. Perhaps what all this is leading to is that, in tandem with the release of the new 5th Generation Core processors, we will be seeing a Xeon that fits this segment, which Intel has decided obviates the need to continue supporting ECC in the Pentium and i3 lines.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Many have theorized that the i3s and the Pentiums have ECC support because there is no dual-core Xeon available and Intel left it in to satisfy the market segment looking for a low-cost, low-core-count server. Perhaps what all this is leading to is that, in tandem with the release of the new 5th Generation Core processors, we will be seeing a Xeon that fits this segment, which Intel has decided obviates the need to continue supporting ECC in the Pentium and i3 lines.

That's my reasoning. Though Broadwell is probably lacking the statement because there are reportedly no plans for low-end socketed Broadwell processors. Only the high-end (presumably including a Xeon E3 v4 series) is getting socketed Broadwells.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Right now, I think nobody at Intel has any idea what's going on. Hell, we have an Ivy Bridge showing ECC enabled a few comments up.

And that is the *whole* damn problem with this. There is no definitive way to prove that the memory controller is actually operating in some kind of "ECC enabled" mode. That script just reads some bits and spits out the output. What those bits mean has to be interpreted. Until now the "meat world" has interpreted those bits as being ECC or non-ECC. But I've never liked or wanted to trust that validation because it didn't come "from Intel directly". I've trusted it because that's what everyone else in the IT industry trusted. But I still didn't like it. Not one bit.

What makes this all fugly as hell is that even on my own system (X9SCM-F, DDR3 with ECC, and Xeon E3-1230 v2), with my hardware setup (firmware versions, BIOS versions, etc.), I have no way of experimentally or programmatically proving with 100% certainty that I *am* actually using ECC. It's been one of those things where I've always said "Intel probably didn't screw this up and Supermicro hopefully didn't screw this up", so I've gone with it. If Supermicro made a mistake with their BIOS and broke ECC, nobody would be any the wiser until we had an actual real-world example. Even if we did, how do you prove with 100% certainty that it was the motherboard's design and not simply that this particular board was bad or had a manufacturing defect from day one? I could think up a million scenarios where you could think you have ECC and yet not have ECC. I'm a skeptic like that and I like to be able to prove things with certainty. This is one thing that nobody has ever been able to prove with certainty. You can get reasonably close by buying high-end stuff like Supermicro boards, Xeons, and ECC RAM.

In fact, a scenario like the one I just explained above, where a BIOS mistake or a problem with the CPU might never be known, is why I advocated so strongly against ASRock Rack when it first came out. They're new to the market and therefore even more likely to make a mistake than Supermicro is. Do *you* want to trust your data to a company based solely on their reputation (or lack thereof with ASRock Rack)? I don't.

And don't even bring AMD into this conversation. Anyone using AMD is even more clueless as to how to validate ECC is working than Intel without drastic measures! LOL

So you have:

AMD which is strictly what some company says they can support on a sheet of paper.
Intel which is strictly what some company says they can support on a sheet of paper, but because of their higher market share and therefore larger userbase, it would be slightly easier to prove that ECC doesn't work if there was some kind of problem. But considering that i3s don't have ECC support and we're only finding out a couple of years later, does that "slightly easier" mean that instead of being a 0% chance it's now "just a fraction above zero" and therefore still meaningless?

We all know how much faith we can put on a sheet of paper too....

The bottom line: you should be buying products from reputable companies that have a very vested interest in having full support for ECC. And you should be buying products that end up in large corporations that use that hardware; that way you are somewhat protected by the herd of people relying on the same technology you are. That is... if you are neurotic about file servers and actually want validation that you aren't the lone wolf on the island with major problems. ;)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
And that is the *whole* damn problem with this. There is no definitive way to prove that the memory controller is actually operating in some kind of "ECC enabled" mode. That script just reads some bits and spits out the output. What those bits mean has to be interpreted. Until now the "meat world" has interpreted those bits as being ECC or non-ECC. But I've never liked or wanted to trust that validation because it didn't come "from Intel directly". I've trusted it because that's what everyone else in the IT industry trusted. But I still didn't like it. Not one bit.

What makes this all fugly as hell is that even on my own system (X9SCM-F, DDR3 with ECC, and Xeon E3-1230 v2), with my hardware setup (firmware versions, BIOS versions, etc.), I have no way of experimentally or programmatically proving with 100% certainty that I *am* actually using ECC. It's been one of those things where I've always said "Intel probably didn't screw this up and Supermicro hopefully didn't screw this up", so I've gone with it. If Supermicro made a mistake with their BIOS and broke ECC, nobody would be any the wiser until we had an actual real-world example. Even if we did, how do you prove with 100% certainty that it was the motherboard's design and not simply that this particular board was bad or had a manufacturing defect from day one? I could think up a million scenarios where you could think you have ECC and yet not have ECC. I'm a skeptic like that and I like to be able to prove things with certainty. This is one thing that nobody has ever been able to prove with certainty. You can get reasonably close by buying high-end stuff like Supermicro boards, Xeons, and ECC RAM.

In fact, a scenario like the one I just explained above, where a BIOS mistake or a problem with the CPU might never be known, is why I advocated so strongly against ASRock Rack when it first came out. They're new to the market and therefore even more likely to make a mistake than Supermicro is. Do *you* want to trust your data to a company based solely on their reputation (or lack thereof with ASRock Rack)? I don't.

And don't even bring AMD into this conversation. Anyone using AMD is even more clueless as to how to validate ECC is working than Intel without drastic measures! LOL

So you have:

AMD which is strictly what some company says they can support on a sheet of paper.
Intel which is strictly what some company says they can support on a sheet of paper, but because of their higher market share and therefore larger userbase, it would be slightly easier to prove that ECC doesn't work if there was some kind of problem. But considering that i3s don't have ECC support and we're only finding out a couple of years later, does that "slightly easier" mean that instead of being a 0% chance it's now "just a fraction above zero" and therefore still meaningless?

We all know how much faith we can put on a sheet of paper too....

The bottom line: you should be buying products from reputable companies that have a very vested interest in having full support for ECC. And you should be buying products that end up in large corporations that use that hardware; that way you are somewhat protected by the herd of people relying on the same technology you are. That is... if you are neurotic about file servers and actually want validation that you aren't the lone wolf on the island with major problems. ;)

I agree with you in general. However, at least Haswell (maybe older families as well, but I haven't checked) implements the mythical ECC error injection functionality. Section 4.2.2 - "ECC - ECC DFT Register":

This register controls the ECC error injection DFT feature.

ECC error inject options:
000b = No ECC error injection.
001b = Reserved
011b = Inject correctable ECC error on ECC error insertion counter defined by ECC_INJECT_COUNT_C0.
101b = Reserved
111b = Inject non-recoverable ECC error on ECC error insertion counter defined by ECC_INJECT_COUNT_C0.

For some reason, there are three of these registers.
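
To make the encoding above concrete, here is a minimal Python sketch that decodes the 3-bit inject-options field. It is only an illustration, not Intel's tooling and not part of ecc_check.py: the names `ECC_INJECT_OPTIONS`, `decode_ecc_inject`, and `extract_field` are made up for this example, and the bit position of the field inside the register (`shift`) is an assumption, since the quoted excerpt doesn't give it. It doesn't touch hardware at all; it only interprets a value you've already read.

```python
# Illustrative decoder for the 3-bit "ECC error inject options" field.
# Only the encodings listed in the quoted datasheet excerpt are mapped;
# anything else is reported as not listed rather than guessed.

ECC_INJECT_OPTIONS = {
    0b000: "No ECC error injection",
    0b001: "Reserved",
    0b011: "Inject correctable ECC error (counter: ECC_INJECT_COUNT_C0)",
    0b101: "Reserved",
    0b111: "Inject non-recoverable ECC error (counter: ECC_INJECT_COUNT_C0)",
}


def decode_ecc_inject(field: int) -> str:
    """Describe a 3-bit inject-options value using the quoted table."""
    if not 0 <= field <= 0b111:
        raise ValueError("field must fit in 3 bits")
    return ECC_INJECT_OPTIONS.get(field, "Not listed in the quoted excerpt")


def extract_field(register_value: int, shift: int = 0) -> int:
    """Pull a 3-bit field out of a raw register value.

    ASSUMPTION: 'shift' (the field's bit offset) is hypothetical; the
    excerpt does not say where the field sits inside the register.
    """
    return (register_value >> shift) & 0b111


if __name__ == "__main__":
    for value in range(8):
        print(f"{value:03b}b = {decode_ecc_inject(value)}")
```

Running it just prints the table back out, with the encodings the excerpt doesn't mention (010b, 100b, 110b) flagged as not listed. Actually reading or writing the register on a live system is a separate problem and isn't attempted here.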
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ok, so I am confused now. I have an LGA 1155 i3-3225 CPU and I was thinking of just using that with a Supermicro server board and some ECC RAM. The Intel ARK says no ECC support, so would this combo work for ECC?

http://ark.intel.com/products/65692/Intel-Core-i3-3225-Processor-3M-Cache-3_30-GHz

According to all of the information currently available *right now*... you would not have ECC. If you had asked last week the answer would have been yes because the ARK incorrectly said that ECC was supported. Right now, you have to get a Pentium or a Xeon. I'm somewhat expecting the Pentiums to be dropped off the list soon, but time will tell. ;)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
@Ericloewe

Those registers are going to be the single biggest reason for me to upgrade my server later this year. I went with the Xeon when I did because at that time the general culture was that you went Xeon or you went home. The Pentiums and i3s were options, but I really questioned if that ECC = YES thing was really correct and as I wasn't going to take any chances I went with Xeon. Even before my FreeNAS server my Windows server had the lowest-end Xeon from that generation because I wasn't about to take the risk with Pentiums or i3s. ;)

In my opinion, it is very important that you be able to validate that ECC works (or doesn't work) properly for a given server. This is one thing that I've been wanting badly. If an Intel employee called me and offered to give me anything I want as an added feature to their CPUs, that feature would be absolute proof, via some kind of software test, of whether ECC is working properly. ;)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Right now, you have to get a Pentium or a Xeon. I'm somewhat expecting the Pentiums to be dropped off the list soon, but time will tell. ;)

I'm really, really unimpressed with Intel right now. I mean, none of the iron in my DC is on anything other than a Xeon or Opteron, but think of all the SMBs out there who bought an entry-level Dell/HP server with an i3 and ECC RAM.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm really, really unimpressed with Intel right now. I mean, none of the iron in my DC is on anything other than a Xeon or Opteron, but think of all the SMBs out there who bought an entry-level Dell/HP server with an i3 and ECC RAM.

Yep. That's why when I first responded to this post I said something like "I sure hope this is something that will be cleared up and they *do* have ECC support". If they truly don't support ECC, then the ecc_check.py and ecc_check.c code that's been floating around for all this time is useless and the only thing you can truly do to avoid the problem is buy the Xeons.
 

philiplu

Explorer
Joined
Aug 10, 2014
Messages
58
I read an interesting thread on HackerNews today here. There's a DRAM attack, "Rowhammer", where repeated accesses of alternating RAM rows can flip bits in adjoining rows. Seems that it's more likely in laptops than desktops or servers, but I wonder if it's triggerable in some server cases. If so, that might be a way to semi-reliably trigger an ECC error and detect whether the ECC functionality is working. I could see adding it to memtest86, where you wouldn't have to worry about virtual-to-physical translations complicating things.
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
Hi All,

So you're saying that ecc_check does not definitively show whether or not ECC is working? That's too bad.

The specs for my X9SCM board call for a Xeon® E3-1200, E3-1200 v2 series, 2nd or 3rd Gen Core i3, Pentium, or Celeron. Core i5 and i7 are missing from the list, so there must be some architectural difference between them and the others.
 

esamett

Patron
Joined
May 28, 2011
Messages
345
NOOB: can overclocking be used to create RAM errors for the desired testing? Just curious.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Maybe if you lower the RAM voltage at the same time. Or maybe you can even do it if you just lower the RAM voltage without overclocking, but I don't know if you can go low enough just with the BIOS settings.

I just thought about another method: radiation. We know cosmic background radiation can flip bits, so increasing this by orders of magnitude with some radioactive source near the RAM sticks should produce bit flips pretty frequently. Well, not for everyone, but if someone has a source that's safe to handle by hand they can easily try this :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
NOOB: can overclocking be used to create RAM errors for the desired testing? Just curious.

No, most server grade gear doesn't overclock. The traditional solution is to locate a bad RAM module and validate with that. The problem with that strategy is that it isn't always possible to locate a bad module of the latest technology.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Maybe if you lower the RAM voltage at the same time. Or maybe you can even do it if you just lower the RAM voltage without overclocking, but I don't know if you can go low enough just with the BIOS settings.

I just thought about another method: radiation. We know cosmic background radiation can flip bits, so increasing this by orders of magnitude with some radioactive source near the RAM sticks should produce bit flips pretty frequently. Well, not for everyone, but if someone has a source that's safe to handle by hand they can easily try this :)

I've been waiting for someone crazy enough to lend me a gamma source and a proper lab environment for a while now.

I read an interesting thread on HackerNews today here. There's a DRAM attack, "Rowhammer", where repeated accesses of alternating RAM rows can flip bits in adjoining rows. Seems that it's more likely in laptops than desktops or servers, but I wonder if it's triggerable in some server cases. If so, that might be a way to semi-reliably trigger an ECC error and detect whether the ECC functionality is working. I could see adding it to memtest86, where you wouldn't have to worry about virtual-to-physical translations complicating things.

memtest86+ has been as good as dead for over a year now. It doesn't even reliably detect that ECC modules are installed. Of course, in principle, it might work.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
No, most server grade gear doesn't overclock. The traditional solution is to locate a bad RAM module and validate with that. The problem with that strategy is that it isn't always possible to locate a bad module of the latest technology.

I agree that it won't OC but there may be memory adjustments exposed on certain boards. Overly aggressive CAS/RAS latencies might trigger it, but it's more likely to flip more than one bit as a result and just take your system down entirely.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Yep. That's why when I first responded to this post I said something like "I sure hope this is something that will be cleared up and they *do* have ECC support". If they truly don't support ECC, then the ecc_check.py and ecc_check.c code that's been floating around for all this time is useless and the only thing you can truly do to avoid the problem is buy the Xeons.

If I were Intel I'd be less worried about some Python scripts not working and way more worried about the fact that the system vendors of the world are going to be sticking them with a bill for misrepresentation of goods.
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
If I were Intel I'd be less worried about some Python scripts not working and way more worried about the fact that the system vendors of the world are going to be sticking them with a bill for misrepresentation of goods.
But doesn't Intel's fine print basically say that nothing they've published can be relied upon?
 