Uncorrectable ECC during installation - compatibility problems?

Cargeh

Cadet
Joined
Mar 12, 2023
Messages
3
Hi! I'm in a bit of a pickle: I can't install TrueNAS Scale due to "Uncorrectable ECC" errors (reported as kernel panics during the installation), and I don't understand what's to blame. It might have been my poor choice of RAM.

Hardware:
CPU: Intel Xeon E3-1220 V6 (used, bundled with the motherboard)
Motherboard: Supermicro X11SSM-F (used, bundled with the CPU)
Memory: Samsung 2x32GB ECC UDIMM DDR4 3200 (M391A4G43BB1-CWE) (new, exact product that I bought)
Boot drive: x2 2.5" Kingston 120GB SSD (new)
Power supply: Seasonic PRIME PX-650 650 Watt Platinum (new, for 10 HDDs)

System info:
Firmware Revision: 01.63
Firmware Build Time: 09/04/2020
BIOS Version: 2.7
BIOS Build Time: 12/06/2021
Redfish Version: 1.0.1

Problem:

I'm installing TrueNAS Scale from a USB stick. It gets to the point where it starts extracting files, and then it fails with kernel panics:

2023-03-13_00-44-28.png


I see it reported in the "Health event log" as "Uncorrectable ECC @ DIMMB2 - Assertion".



What I tried:
1. Ran memtest86 for 8 hours with both sticks: no errors, all good. Nothing in the event log either.
2. Tried to install TrueNAS with only one stick installed, testing each stick in different slots: no difference, still fails.
3. Tried it with spare consumer (non-ECC) DDR4 2666MHz RAM: worked well, I was able to install TrueNAS and boot into it.
4. Updated the BIOS to the latest available (I had 2.0c, but I saw that the support for 2667MHz RAM was added later): no difference.
5. Updated the firmware to the latest just in case: no difference.

Memory is reported correctly in BIOS: 64GB, running at 2400MHz.



Assumptions:

  • I did get both sticks from one place, shipped together. Could they be faulty? How can I rule it out if I don't have any spare ECC memory? I assume the fact that it booted with non-ECC memory doesn't change much since there could be unreported errors?
  • I did get two 32GB sticks of 3200MHz because I thought I'd get clocked down anyway and I'd be able to re-use it in the future whenever I upgrade, and it was much cheaper than the alternatives anyway. I know that supermicro doesn't recommend using 2x32GB sticks, but I thought it was just an outdated instruction. I also looked for this specific model and some users reported that it worked fine. Could I have chosen bad memory?


Not sure what to try next, so any help would be much appreciated, thanks!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176
Could they be faulty?
They could be, but it's not super likely. One, sure, but two?
What's more likely is an incompatibility between your system and those DIMMs - there may be tweaking you can do in the system firmware setup menu, but it's very technical, way outside my area of expertise and possibly able to damage hardware.

I did get two 32GB sticks of 3200MHz because I thought I'd get clocked down anyway and I'd be able to re-use it in the future whenever I upgrade, and it was much cheaper than the alternatives anyway. I know that supermicro doesn't recommend using 2x32GB sticks, but I thought it was just an outdated instruction. I also looked for this specific model and some users reported that it worked fine. Could I have chosen bad memory?
Wait a minute, 32 GB DIMMs aren't supposed to work with Xeon E3 v5/v6. I can't tell you what arcane DDR4 specification is at fault here, just that support for 32 GB DIMMs was a notable feature for Xeon E when it came out.
That leaves the question of what Linux is doing that memtest isn't... Was it memtest86 the commercial thing or memtest86+ the recently not-abandoned open-source thing? Can you also try TrueNAS Core?
 

Cargeh

Cadet
Joined
Mar 12, 2023
Messages
3
Wait a minute, 32 GB DIMMs aren't supposed to work with Xeon E3 v5/v6. I can't tell you what arcane DDR4 specification is at fault here, just that support for 32 GB DIMMs was a notable feature for Xeon E when it came out.
Oh, ok... that explains it... Thank you!

I really wish it was mentioned in the Hardware Recommendations.

I bought and installed 4x16GB, everything works just fine, so the problem was RAM after all. What an obscure error..

That leaves the question of what Linux is doing that memtest isn't... Was it memtest86 the commercial thing or memtest86+ the recently not-abandoned open-source thing?
I ran this: https://www.memtest86.com/download.htm - is that the good commercial thing, or was I supposed to run something else?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,600
@Cargeh - Some of the processor information is available on https://ark.intel.com/, like this:

https://ark.intel.com/content/www/u...on-processor-e31220-v6-8m-cache-3-00-ghz.html

Now, it is not clear on the size per DIMM. But it states 2 memory channels (and I can assume 2 DIMMs per channel) with 64GB maximum memory, which gives 4 x 16GB. So, at a guess, 16GB is the maximum DIMM size.

Plus, the SuperMicro page for the X11SSM-F system board has 16GB listed as the maximum DIMM size:

https://www.supermicro.com/en/products/motherboard/x11ssm-f
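
The capacity reasoning above can be written out as a small sketch. It only reproduces the arithmetic from this post (64GB maximum, 2 channels, an assumed 2 DIMMs per channel); the function names are made up for illustration, and the result is a guess at the per-DIMM limit, not a substitute for the board's own specification page.

```python
def max_dimm_size_gb(max_memory_gb: int, channels: int, dimms_per_channel: int) -> float:
    """Estimate the largest DIMM the platform is likely to accept.

    Assumption: the maximum memory is reached with every slot populated
    by equally sized DIMMs, so the per-DIMM limit is total / slots.
    """
    slots = channels * dimms_per_channel
    return max_memory_gb / slots


def dimm_probably_fits(dimm_gb: int, max_memory_gb: int, channels: int,
                       dimms_per_channel: int) -> bool:
    """Return True if a DIMM is no larger than the estimated per-slot limit."""
    return dimm_gb <= max_dimm_size_gb(max_memory_gb, channels, dimms_per_channel)


if __name__ == "__main__":
    # Figures from the thread: 64GB maximum, 2 channels, 2 DIMMs per channel (assumed).
    limit = max_dimm_size_gb(64, 2, 2)
    print(f"Estimated per-DIMM limit: {limit:g} GB")
    for size in (16, 32):
        verdict = "ok" if dimm_probably_fits(size, 64, 2, 2) else "too large"
        print(f"{size} GB DIMM: {verdict}")
```

With the numbers from this thread, the 32GB sticks land above the estimated limit and the 16GB sticks land on it, which matches what the SuperMicro page lists.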


This is not meant to be a criticism, but to help you and others reading this thread see that there are hints if you have the right tools. It took me a long time to stumble across the Intel Ark site. And, unlike some of the other vendors, SuperMicro actually lists lots of information about its system boards.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176
This is actually one of those things that changes a bit between generations. Some CPUs are fine with larger DIMMs, up to their total maximum addressable capacity per channel, some will just not work.
Actually, I think it's caused by the DIMMs themselves more so than the CPUs, but the PC DRAM world is opaque and changes often, so I can't tell you exactly why that is.
 

a.sihanabun

Cadet
Joined
May 17, 2023
Messages
3
Hi, I need some help here. I can't install TrueNAS Scale due to "Uncorrectable ECC" errors as well, but in my case I already installed the correct RAM size (which is 16GB per DIMM).
Let me explain the details.

My hardware:
CPU: Intel Xeon E3-1245 V5 (used)
Motherboard: Supermicro X11SSM-F (new, bought from wiredzone)
Memory: Micron 2x16GB ECC UDIMM DDR4 3200 (MTA9ASF2G72AZ-3G2R) (new, this was the only Micron ECC UDIMM memory I could get locally, link to the product page)
Boot drive: 1x 2.5" Adata SU650 SSD 120GB (new)
Power supply: Fractal Design Ion Gold 750W (new, for 8 HDDs)

I've got the same BIOS version here:
Firmware Revision: 01.63
Firmware Build Time: 09/04/2020
BIOS Version: 2.7
BIOS Build Time: 12/06/2021
Redfish Version: 1.0.1

And exactly the same problem: the installer started extracting files and then failed with kernel panics, as seen below.
20230509_143607a.jpg


I checked the log in BIOS and it was reported as "Uncorrectable ECC @ DIMMB2 - Assertion".
Then I thought they might've sent me bad RAM, but the Memtest86 results proved me wrong, since no error was found in all 4 passes (as seen below) and the RAM was reported correctly in BIOS: 32GB, running at 2133MHz.
20230513_225502.jpg


Then I tried to:
1. Install TrueNAS Scale on the SSD with only 1 stick of RAM in DIMM A2. Nope, still got the same error.
2. Disable ECC in BIOS. I was able to install and run TrueNAS Scale without any problem, but after I re-enabled ECC, it wouldn't boot anymore.
3. Install TrueNAS Scale on a 500GB HDD. Got the same error.
4. Install and run TrueNAS Core. No problem was found, unless I tried to install a plugin (every plugin installation, including the official ones, got stuck at 20% for half an hour or so, followed by a reboot).
5. Disable the firewall on my router, to see if it had something to do with the failed plugin installation I mentioned above. No difference whatsoever.
6. Run Linux Mint on this machine just for testing purposes. It worked flawlessly.
7. Run Unraid on this machine. It worked flawlessly, but I really wanted TrueNAS Scale.

So I'm not sure what went wrong here. Please, help me with this.
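
When a Linux system such as the Linux Mint test above does boot, one possible extra check is the kernel's EDAC counters, which count corrected (`ce_count`) and uncorrected (`ue_count`) memory errors per memory controller under `/sys/devices/system/edac/mc`. The sketch below is only an illustration: the function name is made up, it assumes an EDAC driver for the platform is loaded (if not, the directory is missing or empty and the result is an empty dict), and it is independent of the BMC health event log quoted in this thread.

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Collect corrected/uncorrected error counts per memory controller.

    Returns a dict like {"mc0": {"ce_count": 0, "ue_count": 0}}.
    A counter that cannot be read or parsed is reported as None.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # No EDAC driver loaded, or not a Linux system.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found (driver not loaded?)")
    for controller, entry in result.items():
        print(controller, entry)
```

A non-zero `ue_count` after some load on the machine would point at the memory even when the OS appears to run flawlessly; zero counts do not prove the memory is good.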
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Running something else on the machine won't end well if there's a problem with the machine.

Four PASSES of memtest? That's nothing. Try four days or even four weeks. Wait for it to fall over. If the memory is just on the edge, you may need to be more patient or aggressive, such as making sure your memtest is utilizing all cores simultaneously. If you can easily get a warning out of SCALE, you should be able to replicate this without too much effort on Memtest.
 

a.sihanabun

Cadet
Joined
May 17, 2023
Messages
3
Running something else on the machine won't end well if there's a problem with the machine.

Four PASSES of memtest? That's nothing. Try four days or even four weeks. Wait for it to fall over. If the memory is just on the edge, you may need to be more patient or aggressive, such as making sure your memtest is utilizing all cores simultaneously. If you can easily get a warning out of SCALE, you should be able to replicate this without too much effort on Memtest.
Thank you for the quick response, I think I got your point.
Tonight I'll try to run Memtest86 again and this time I'll make it 12 passes for each stick.
I'll post the results when it's over. Let's hope I'll get an error or two.
 

a.sihanabun

Cadet
Joined
May 17, 2023
Messages
3
Update:
Last night I didn't run Memtest86. Instead, I installed an 8GB memory stick from my Dell T40 (it's an SK Hynix), and the TrueNAS Scale installation ran perfectly. I suspect that the memory sticks I bought have some compatibility issue with the X11SSM-F, since they aren't listed as compatible on the Crucial website. So I plan to return them and buy another pair from Crucial (link to product page) that has compatibility confirmation for the X11SSM-F. It'll take some time to arrive here, but I'll make sure to post an update afterward. Thanks for pointing out that the problem lies with the memory.
 