How to eliminate the culprit of my "bad build"?

SofaKingBoring

Dabbler
Joined
Mar 24, 2023
Messages
13
Hi Guys,

I really need help, because my first TrueNAS build is starting to turn into a minor catastrophe.
Sorry for the long read, but I believe some might find it interesting.

So when I planned out my build I followed a lot of the instructions here (thanks to all for that!!).
Once everything was assembled I followed the Burn-in and Testing guide here.
I also ran Memtest86+ for about a week.
Build:
Board: Fujitsu/Kontron D3644-B
CPU: Xeon 2144G
RAM: 2x 32GB ECC, different brands, always DDR4 unregistered ECC 3200 MHz
Power Supply: Corsair RM550x
Boot Pool: 250GB NVMe & 250GB 2.5" SSD
Data Pool: 1x WD Red 12TB, 1x IronWolf 12TB, mirrored



1) Initial findings

I found no errors after two weeks of testing and started to install TrueNAS.
Soon after the install I noticed a lot of errors like this on the console:
kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory (csrow:2 channel:1 page:0x0 offset:0x0 grain:1)
kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory (csrow:2 channel:1 page:0x0 offset:0x0 grain:1)
kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory (csrow:2 channel:1 page:0x0 offset:0x0 grain:1)

kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory (csrow:2 channel:1 page:0x0 offset:0x0 grain:1)
csrow and channel vary.
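
To see which csrow/channel combinations actually show up, and how often, the console lines can be tallied with a few lines of Python. This is only a minimal sketch: the regular expression is written against the message format quoted above (other kernels or EDAC drivers may format lines differently), `tally_edac` is a hypothetical helper name, and the third sample line is made up for illustration.

```python
import re
from collections import Counter

# Pattern written against the console lines quoted in this post; this exact
# format is an assumption and may differ on other kernels or EDAC drivers.
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>UE|CE) .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+)"
)


def tally_edac(lines):
    """Sum reported errors per (kind, controller, csrow, channel)."""
    totals = Counter()
    for line in lines:
        m = EDAC_LINE.search(line)
        if m is None:
            continue  # not an EDAC line in the expected format
        key = (m["kind"], int(m["mc"]), int(m["csrow"]), int(m["channel"]))
        totals[key] += int(m["count"])
    return totals


sample = [
    "kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory (csrow:2 channel:1 page:0x0 offset:0x0 grain:1)",
    "kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory (csrow:2 channel:1 page:0x0 offset:0x0 grain:1)",
    # Hypothetical line with different csrow/channel values, for illustration only.
    "kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory (csrow:0 channel:0 page:0x0 offset:0x0 grain:1)",
]

for key, count in sorted(tally_edac(sample).items()):
    print(key, count)
```

If the errors were tied to one physical module, I would expect them to cluster on one csrow/channel pair rather than wander around.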

I read up on this error and understood that it should represent uncorrectable ECC errors.
But I also found THIS link, which is unfortunately only available in German.
In essence it says that there is a bug in EDAC on some Linux systems that incorrectly reports these errors on healthy (properly tested) RAM modules.
In some cases the solution was to disable Quick Boot in the BIOS.
Unfortunately my motherboard's AMI BIOS doesn't have a Quick Boot option. My guess is that it's simply not there because this is a workstation/server board.
The other alternative: just discard those errors and ignore them, but make absolutely sure your config is OK.

So I did another round of RAM testing and also tried another tool (PassMark MemTest86).
And "ta-da", suddenly PassMark's tool consistently found "correctable ECC errors".

2) Following the wrong leads

So what did I do? I RMAed my 2 RAM modules and got new ones of another brand, twice.
(Always DDR4 unregistered ECC at 3200 MHz and 1.2 V maximum. The board only supports 2666 MHz, but it clocks the modules down correctly.)
When the test after the first return showed the same errors (and also exactly the same number of errors), I got suspicious.


3) Looking for help

I opened up a thread at the PassMark forums (LINK), to see if it could really be a coincidence.
For those that do not want to read through the other thread a quick summary:
PassMark suspects that this is not a RAM module defect but possibly a bug in the board's BIOS.
One thing is reproducible:
If you start the MemTest after a cold boot, it reports exactly 20 correctable ECC errors in the first few minutes, then nothing after that.
If you do a soft reboot after the first few minutes of testing -> no errors.

So PassMark believes that these are legitimate errors, not from the modules, but from the memory controller.
This would mean, I have to do something about it.

But what should I do exactly? Most obvious is returning the board.
Do you see any other failures in my build that should be rooted out?
What's bugging me a little:
If EDAC's uncorrectable errors are real, why did I never get a UE in more than a week of memory testing with 2 different tools?
So I really would like to get my next move right, before I end up in a trial-and-error spiral.
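
One way to cross-check the console messages is to read the counters the kernel's EDAC subsystem keeps per memory controller. The sketch below assumes the EDAC driver is loaded and exposes the usual sysfs layout (`/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count`); `read_edac_counts` is a hypothetical helper name, and on a system without that directory it simply returns an empty result.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    Assumes the standard Linux EDAC sysfs layout; returns {} if the
    directory does not exist (e.g. no EDAC driver loaded).
    """
    base = Path(root)
    counts = {}
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


for controller, entry in read_edac_counts().items():
    print(controller, "CE:", entry["ce_count"], "UE:", entry["ue_count"])
```

Comparing these numbers after a cold boot and after a soft reboot would show whether the kernel's own counters follow the same pattern PassMark's tool reported.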

If I return the board, any suggestions for boards you can actually get?
I'd prefer one with a real IPMI now. The current board only has the AMT/vPro stuff.

Thanks
Chris
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Is there a BIOS update available that might give you the options you're looking for, or maybe even solve the problem?

I have had no problems with ECC in my Ryzen-based TrueNAS box. It has been running since FreeNAS 11.

So if a BIOS up/downgrade doesn't fix it, RMA the board. I don't recall the model of my board, but it's an AM4 board with a B-series chipset. I know one of my TN boxes has an ASUS board, but that could be in the backup server that is not using ECC.

Once you find a board you like, perhaps reach out to the maker, or google that board for ECC support. All Ryzen supports ECC, but not all boards do. I would think it would be a yes/no thing, but maybe the ones that don't support it can "sometimes" work well enough to boot.
 

SofaKingBoring

In the meantime I decided to change boards.
The RMA went through, and with 100€ on top I bought a Supermicro X11SCH-LN4F.
I don't really need those 4 NICs, but the regular SCH with 2 interfaces was just not available at a reasonable price.
I have been running memory tests for the last 3 days without any issues so far.

Also the IPMI is really nice.

Cheers
Chris
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
It is possible that your BIOS did not support 32GB memory modules on that 2144G CPU.

Specifically, there is this note:
Description
Support for up to 128GB system memory capacity will be available in 2019 and requires both a BIOS update and hardware platform support. Please contact your hardware provider regarding availability for your system.
This implies that 2 memory channels, supporting 2 DIMMs each, would be 4 x 32GB DIMMs. But, if the BIOS or board has problems, then 32GB DIMMs don't appear to be supported.

Anyway, just for anyone else looking for this type of information.
 

SofaKingBoring

My board definitely only supported up to 64GB of RAM.
I couldn't find anything on the maximum module size, and Kontron doesn't provide a list of compatible RAM.
Crucial, Kingston and Co. also do not list this board in their compatibility checks.
Nevertheless the system was running "stable".
But the EDAC errors and the MemTest results had me too worried to leave it this way.

I'm very happy so far with the Supermicro board. Especially the IPMI makes things more comfortable.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
SofaKingBoring said: "My board definitely only supported up to 64GB of RAM."
If a maximum memory amount is put together with the number of slots in the board, you can work out what the maximum module size can be. (16GB in your case, it seems)
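
As a quick worked example of that rule of thumb (a sketch only: the slot count of 4 is an assumption inferred from the 16GB figure, and `max_module_size_gb` is a hypothetical helper name):

```python
def max_module_size_gb(max_total_gb, dimm_slots):
    """Rule-of-thumb upper bound: total supported memory divided by slot count."""
    if dimm_slots <= 0:
        raise ValueError("dimm_slots must be positive")
    return max_total_gb // dimm_slots


# 64GB maximum, assuming 4 DIMM slots
print(max_module_size_gb(64, 4))  # 16
```

This only gives an upper bound from the published maximum; a board or BIOS can still have other limits on which modules it accepts.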
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
SofaKingBoring said: "But I also found THIS link, which is unfortunately only available in German.
In essence it says that there is a bug in EDAC on some Linux systems that incorrectly reports these errors on healthy (properly tested) RAM modules."
Following this lead, have you tried TrueNAS CORE instead of SCALE?
 

SofaKingBoring

Etorix said: "Following this lead, have you tried TrueNAS CORE instead of SCALE?"
No, I didn't really consider it, since using the Docker/Kubernetes functions in SCALE was pretty essential for my plans and for the sizing of my build.
In hindsight I should have considered splitting this into two hosts: put that load on a separate Proxmox host and just use TrueNAS for the storage.
But now I have a system sized for way more than just that :smile:.
 