ECC Configuration Missing from Supermicro X11SCL-IF Motherboard BIOS

wallboston

Dabbler
Joined
Nov 20, 2022
Messages
12
I have seen one other post, from June 2020, that addresses this issue, but its resolution doesn't seem to apply to my problem. I just built my first TrueNAS system with the components below. I have run three different tests; all of them show that the system is ECC-capable, but none confirms that ECC is actually working.

The motherboard BIOS manual (page 73) says the BIOS should display an "ECC Support" option that can be enabled or disabled. That option does not appear in either BIOS version 1.6 or 1.8, the latest. Supermicro support claims the manual is out of date: "ECC Support" is no longer displayed in the BIOS, but ECC is enabled by default.

The most explicit of my tests was PassMark MemTest86 Pro v10.1, which supports ECC injection for this CPU. Errors are being injected, but the test never reports one being detected and corrected. A screenshot of intermediate test results and an explanation from PassMark are shown below.

Does anyone have this motherboard with "ECC Support" showing in the BIOS under Advanced/Chipset Configuration/System Agent (SA) Configuration/Memory Configuration? Does anyone know whether this option is supposed to appear in BIOS v1.8? I suspect a motherboard or BIOS issue is preventing ECC operation, but Supermicro isn't yet convinced. Does anyone have another explanation for ECC not working with these components?

Selected System Components
Motherboard: Supermicro X11SCL-IF
CPU: Intel Xeon E-2234
RAM: Samsung 64GB (32x2) DDR4 2666
BIOS: v1.6 and v1.8 (I tested both)

PassMark MemTest86 Pro Test Results and Explanation

Test results:
[Screenshot 1671741422906.png: MemTest86 intermediate test results]


ECC injection explanation from PassMark:
“If ECC errors were successfully injected and detected by the system, the user shall see an [ECC Inject] message followed by an [ECC Errors Detected] message. If [ECC Errors Detected] message does not appear, it is highly likely the ECC injection is locked or disabled by BIOS.”

Note that the [ECC Errors Detected] message would appear immediately after and under each [ECC Inject] message in the screenshot above. No errors are being detected, hence ECC detection is not functioning.
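
For anyone who does get hold of the MemTest86 log file, the check PassMark describes can be sketched in a few lines of Python. This is only an illustration: the two bracketed tags are taken from PassMark's explanation above, but the rest of the log line format is an assumption, and the function name is made up for this example.

```python
from typing import Iterable, Tuple

# Tags quoted from PassMark's explanation; everything else about the
# log line layout is an assumption for this sketch.
INJECT_TAG = "[ECC Inject]"
DETECT_TAG = "[ECC Errors Detected]"


def count_injections(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (injections seen, injections followed by a detection).

    A detection is credited to the most recent injection that has not
    already been credited, so it must appear before the next injection.
    """
    injections = 0
    confirmed = 0
    pending = False  # True while the latest injection has no detection yet
    for line in lines:
        if INJECT_TAG in line:
            injections += 1
            pending = True
        elif DETECT_TAG in line and pending:
            confirmed += 1
            pending = False
    return injections, confirmed


def summarize(lines: Iterable[str]) -> str:
    """Build a one-line summary of the injection results."""
    injections, confirmed = count_injections(lines)
    return f"{confirmed} of {injections} injected errors were detected"
```

If the log shows many injections and zero detections, that matches what the screenshot shows. By PassMark's own explanation it points to injection being locked or disabled by the BIOS; on its own it does not show whether ECC itself is working.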

Community thoughts, anyone? Thank you!
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
At the very least I suggest that you ask Supermicro specifically if ECC injection is locked or disabled by the BIOS on your board.

Also, ask PassMark if you can send them a debug file so that they can tell you whether ECC injection is locked or disabled by the BIOS on your board. They did so for me when I pursued this topic a few years ago on my FreeNAS Mini's ASRock board (it was disabled in that case); see post #65 at https://www.truenas.com/community/threads/memory-error-on-freenas-mini.59216/page-4
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The Pro version ends up being of dubious value because most vendors do not expose an option to unlock the IMC error injection capabilities, unfortunately.
 

wallboston

Redcoat said:
At the very least I suggest that you ask Supermicro specifically if ECC injection is locked or disabled by the BIOS on your board.

Also, ask PassMark if you can send them a debug file so that they can tell you whether ECC injection is locked or disabled by the BIOS on your board. They did so for me when I pursued this topic a few years ago on my FreeNAS Mini's ASRock board (it was disabled in that case); see post #65 at https://www.truenas.com/community/threads/memory-error-on-freenas-mini.59216/page-4
Thanks. I've been communicating with Supermicro and am waiting for a reply but earlier they had said--to paraphrase--"don't worry ECC is enabled in BIOS" but that doesn't really answer the question you posed. I will follow up with them on this question.

I'm communicating with PassMark. They asked me for the log file and said it is in EFI\BOOT\ on the USB flash drive, but I ran the test through IPMI from a virtual drive image, so I doubt I can retrieve a log until I can run the test from a real USB drive on the machine. I won't have physical access to my system for a week or more. I will follow up with PassMark as soon as I have a log file.
 

wallboston

Ericloewe said:
The Pro version ends up being of dubious value because most vendors do not expose an option to unlock the IMC error injection capabilities, unfortunately.
PassMark does give a list of CPUs they say "may" support their ECC injection test, stating that these CPUs "may support this feature [ECC Injection] but may be disabled (depending on your BIOS configurations)." The list includes "Intel 8th/9th Gen Core/Xeon E-2100 family (Coffee Lake)." Of course, my CPU is a Xeon E-2234 (Coffee Lake) so I assume it "may" support ECC injection. There is no feature in the Supermicro BIOS to enable ECC injection tests.

PassMark provides this more complete explanation of ECC Injection testing and various motherboards. It kind of leaves me even more confused, especially the statement at the bottom about using IPMI on Supermicro motherboards.

How do I know if my system supports ECC injection?

In general, ECC injection is not a feature that is normally accessible by end-users. Even if the chipset supports the ECC injection feature, details are often sparse and not described in publicly available datasheets. Consult the datasheet for your CPU/memory controller chipset to determine whether the ECC injection feature is available and fully specified.

In particular, some Intel chipsets (Broadwell, Xeon Scalable) use Intel Trusted Execution Technology (Intel TXT) to lock ECC injection. Intel TXT, using secure hardware modules, verifies the integrity of the BIOS, firmware, OS and hypervisor in order to guarantee a trusted operating environment. As a result, this requires preventing access to specific memory controller registers from being compromised, including ECC injection registers.

Some chipsets that support ECC injection have a locking mechanism that once enabled in the BIOS, effectively disables the ECC injection capability. For these cases, a BIOS option may be available to leave the feature unlocked. Otherwise, a custom BIOS is required for unlocking the feature.

Why are ECC errors not being reported on my AMD Ryzen system?

There is a possibility that a BIOS setting, Platform First Error Handling (PFEH), is preventing ECC errors from being reported to MemTest86.

An example of this setting is shown in the following screenshot.

[Screenshot: BIOS PFEH setting]

If this setting is enabled, set to disabled and try running MemTest86 again.

Another explanation is the use of out-of-band (OOB) monitoring solutions such as Baseboard Management Controller (BMC) and Intelligent Platform Management Interface (IPMI), which is used in server platforms (eg. Supermicro servers)
 

Ericloewe

wallboston said:
It kind of leaves me even more confused, especially the statement at the bottom about using IPMI on Supermicro motherboards.
The two topics are somewhat unrelated. You can still test for ECC correctable errors without injecting them - you might get lucky, or double-lucky and have a predictably-defective DIMM. IPMI logs will show memory errors, typically. The main exception is where vendors felt that their products needed a sprinkle of fascist naming conventions for a dumb anti-feature that benefits nobody and decided that "Platform-first error handling" was somehow not a terrible idea.
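
As a rough illustration of the passive approach (watching for correctable errors instead of injecting them), here is a minimal Python sketch that reads the Linux EDAC counters. It rests on several assumptions: the host runs Linux, the kernel has an EDAC driver loaded for this memory controller, and firmware is not hiding corrected errors from the OS. The function name is made up for this example. It does not apply to FreeBSD-based installs, and the IPMI event log is still worth checking separately.

```python
from pathlib import Path
from typing import Dict, Tuple

# Standard location of the Linux EDAC memory-controller counters.
# Whether any mcN directories exist depends on the kernel having an
# EDAC driver for the platform, which is an assumption here.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> Dict[str, Tuple[int, int]]:
    """Map each memory controller name to (corrected, uncorrected) counts."""
    counts: Dict[str, Tuple[int, int]] = {}
    if not root.is_dir():
        return counts  # no EDAC sysfs tree: nothing to report
    for mc in sorted(root.glob("mc[0-9]*")):
        ce_file = mc / "ce_count"  # corrected (correctable) errors
        ue_file = mc / "ue_count"  # uncorrected errors
        if ce_file.is_file() and ue_file.is_file():
            counts[mc.name] = (
                int(ce_file.read_text().strip()),
                int(ue_file.read_text().strip()),
            )
    return counts


def describe(counts: Dict[str, Tuple[int, int]]) -> str:
    """Summarize the counters in one human-readable line."""
    if not counts:
        return "no EDAC memory controllers found (driver missing or not Linux)"
    parts = [f"{name}: corrected={ce} uncorrected={ue}"
             for name, (ce, ue) in counts.items()]
    return "; ".join(parts)
```

Calling `describe(read_edac_counts())` on the host gives a quick snapshot. An empty result means only that the OS exposes no counters, and a corrected count of zero does not prove ECC is active; it may simply mean no errors have occurred or that they are not being reported.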
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Ericloewe said:
vendors felt that their products needed a sprinkle of fascist naming conventions for a dumb anti-feature that benefits nobody

That's like the second time in the space of a week, @Ericloewe ... STOP MAKING ME SPLORF MY DRINK. Or else! :smile:
 

Redcoat

Ericloewe said:
The main exception is where vendors felt that their products needed a sprinkle of fascist naming conventions for a dumb anti-feature that benefits nobody and decided that "Platform-first error handling" was somehow not a terrible idea.
Hmmm... Can I coax you to expand on this some for my own edification? What does "platform-first error handling" imply?
 

Ericloewe

If you get out your vendorese decoder ring, you'll see that "platform", when not used to mean the literal platform (CPU, chipset and associated capabilities), means "system firmware".
So you need to turn around the ring and look at the compound words section, "-first" means "only works with".
In other words, "error handling only works with system firmware".

"Wait, what?", you might ask - "That doesn't make any sense.". You'd be right, it doesn't.

What's going on here is that some idiot thought that they were seeing too many ECC correctable errors and that the solution was to suppress these errors from reaching the host OS or some relevant logs. Perhaps they desperately needed to dump a truckload of marginal DIMMs that would error out frequently, or they figured they could undercut someone, or fudge some requirement, or weasel out of warranties...

So, end result: Someone (likely at AMI, because I've heard of it both on AMD and Intel systems) implemented a hack that prevents ECC correctable errors from being reported and this "feature" has made its way into some systems. Worse, on some systems, it's apparently not even shown as an option and always enabled. So you'll end up with a system that all of a sudden has uncorrectable errors, despite you never having seen a single correctable error (which is a statistically dubious scenario if the latter are reported).

I first heard of this nonsense in this talk, where Bryan Cantrill describes a rather extreme scenario and discussions with an unnamed vendor. At the time of the talk, the slightly less opaque "firmware-first" terminology was in use.
 
Joined
Jun 15, 2022
Messages
674
Not reporting ECC correctable errors (a visual demonstration):

 

Redcoat

Ericloewe said:
In other words, "error handling only works with system firmware".

"Wait, what?", you might ask - "That doesn't make any sense.". You'd be right, it doesn't.
Thanks for taking the time - my non-decoder-ring inference was close enough to give me pause, that's why I asked.
 