SofaKingBoring
Hi Guys,
I really need help, because my first TrueNAS build is starting to turn into a minor catastrophe.
Sorry for the long read, but I believe some might find it interesting.
So when I planned out my build I followed a lot of the instructions in here (thanks to all for that!!).
Once everything was assembled I followed the Burn-in and Testing guide in here.
I also ran Memtest86+ for about a week.
Build:
Board | Fujitsu/Kontron D3644-B |
CPU | Xeon E-2144G |
RAM | 2x 32 GB ECC, different brands, always DDR4 unregistered ECC 3200 MHz |
Power Supply | Corsair RM550x |
Boot Pool | 250 GB NVMe & 250 GB 2.5" SSD |
Data Pool | 1x WD Red 12TB, 1x IronWolf 12TB, mirrored |
1) Initial findings
I found no errors after two weeks of testing and started to install TrueNAS.
Soon after the install I noticed a lot of errors like this on the console:
kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory (csrow:2 channel:1 page:0x0 offset:0x0 grain:1)
kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory (csrow:2 channel:1 page:0x0 offset:0x0 grain:1)
kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory (csrow:2 channel:1 page:0x0 offset:0x0 grain:1)
kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory (csrow:2 channel:1 page:0x0 offset:0x0 grain:1)
The csrow and channel values vary.
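Since csrow and channel vary, it can help to tally the messages per location before drawing conclusions. A minimal sketch, assuming the log lines have exactly the ie31200 format quoted above (other EDAC drivers may format their messages differently); the function name `tally_edac` is made up for illustration:

```python
import re
from collections import Counter

# Matches the ie31200 EDAC lines quoted above. This pattern is an assumption
# based on those samples, not a general EDAC log grammar.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>UE|CE) .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+)"
)


def tally_edac(lines):
    """Sum the reported error counts per (kind, controller, csrow, channel)."""
    totals = Counter()
    for line in lines:
        m = EDAC_RE.search(line)
        if m is None:
            continue  # not an EDAC error line in the expected format
        key = (m["kind"], int(m["mc"]), int(m["csrow"]), int(m["channel"]))
        totals[key] += int(m["count"])
    return totals


sample = [
    "kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory "
    "(csrow:2 channel:1 page:0x0 offset:0x0 grain:1)",
] * 4

for key, n in sorted(tally_edac(sample).items()):
    print(key, n)
```

Feeding it a saved copy of the kernel log instead of `sample` shows whether the errors cluster on one csrow/channel or are spread evenly, which is one hint as to whether a single module is involved.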
I read up on this error and understand that it indicates uncorrectable ECC errors.
But I also found THIS link, which is unfortunately only available in German.
In essence it says there is a bug in EDAC on some Linux systems that incorrectly reports those errors on healthy (properly tested) RAM modules.
In some cases the solution was to disable Quick Boot in the BIOS.
Unfortunately my motherboard's AMI BIOS doesn't have a Quick Boot option. My guess is that it's just not there because it's a workstation/server board?
The other alternative: just discard those errors and ignore them, but make absolutely sure your config is OK.
So I did another round of RAM testing and also tried another tool (PassMark MemTest86).
And "ta-da", suddenly PassMark's tool consistently found "correctable ECC errors".
2) Following the wrong leads
So what did I do? I RMAed my 2 RAM modules, got new ones of another brand, twice.
(Always DDR4 unregistered ECC, 3200 MHz @ 1.2 V maximum. The board only supports 2666 MHz, but it clocks the modules down correctly.)
When the test after the first return showed the same errors (and also exactly the same number of errors), I got suspicious.
3) Looking for help
I opened up a thread at the PassMark forums (LINK), to see if it could really be a coincidence.
For those who do not want to read through the other thread, a quick summary:
PassMark suspects that this is not a RAM module defect but possibly a bug in the board's BIOS.
One thing that is reproducible:
If you start MemTest after a cold boot, it reports exactly 20 correctable ECC errors in the first few minutes, then nothing after that.
If you do a soft reboot after the first few minutes of testing -> No errors.
So PassMark believes that these are legitimate errors, not from the modules, but from the memory controller.
This would mean I have to do something about it.
But what should I do exactly? Most obvious is returning the board.
Do you see any other problems in my build that should be rooted out?
What's bugging me a little:
If EDAC's uncorrectable errors are real, why did I never get a single UE in more than a week of memory testing with 2 different tools?
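One way to cross-check the console messages against what the kernel has actually counted is to read the EDAC counters from sysfs. A minimal sketch, assuming the standard Linux layout with per-controller `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc`; the function name `read_edac_counts` is made up for illustration:

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} from EDAC sysfs.

    Returns an empty dict if the directory does not exist, e.g. when no
    EDAC driver is loaded or the script is not running on Linux.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


print(read_edac_counts())
```

Comparing these counters right after a cold boot and again after a soft reboot would show whether the kernel sees the same cold-boot-only pattern that the memory tester reported.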
So I really would like to get my next move right, before I end up in a trial-and-error spiral.
If I return the board, any suggestions for boards you can actually get?
I'd prefer one with real IPMI now. The current board only has the AMT/vPro stuff.
Thanks
Chris