A bunch of EDAC MC0: 1UE ie31200 messages (defective hardware?)

nasBuilder

Dabbler
Joined
Mar 25, 2023
Messages
26
Hi all, I started with truenas a couple of days ago. After reading a bunch on the internet I bought this hardware:

- MB: Fujitsu D-3644 - bought new
- 2 x Kingston 16 GB DDR4 ECC RAM (KSM26ES8/16HC) - bought new
- Intel I3-9100 - bought used
- 3 x WD Red SSD 2TB running in RaidZ1

While setting this all up I noticed that I was getting a bunch of EDAC messages posted directly to the TrueNAS console. After reading some stuff I managed to understand that it was probably trying to tell me that a bunch of ECC errors were happening and were being corrected? I removed one RAM module, but it didn't change anything; the messages kept coming, sometimes randomly, sometimes specifically after a click on something in the GUI.

Yesterday I ran 2 passes of memtest with the one module inserted. At first I got a couple of ECC errors that were corrected (pretty much how they are shown in the attachment). I'm new to memtest so I was tinkering around, interrupted the test, restarted and let it run overnight, today in the morning it didn't display a single error and I'm now running truenas and not seeing a single error (you can see that the last error was on Mar 25 at 04:04:18).

I'm super new to all this ECC-build-your-own-nas thing, does anybody have an idea what to do:

- do I have a hardware problem? Do I need to send any of this hardware back...?
- how come the errors are not coming right now
- can a memtest "burn-in" a memory module so that it becomes error free
- WTF is going on :-D

Thank you
 

Attachments

  • IMG_20230325_111909.jpg
  • Screenshot 2023-03-25 at 11.22.53.png

MisterE2002

Patron
Joined
Sep 5, 2015
Messages
211
You should have 0 memory errors. Burn-in does not exist. The memory should be error-free from the start.

Testing with 1 memory module is a smart idea. But did you also test the other one? That way you can verify which one is (possibly) broken.

You can use 2 memory test tools: PassMark MemTest86 and Memtest86+

If both report errors, then it is suspicious. You can also try updating the motherboard BIOS to the latest version.
 

nasBuilder

Dabbler
Joined
Mar 25, 2023
Messages
26
The BIOS is updated to the latest version (1.28.0). I was getting the EDAC messages on each module individually and also together, and I also tried different RAM sockets. I didn't run memtest on the other module, which is currently lying around. Will look into PassMark. I have to say though that it's been running for 3 hours since the restart after memtest and I didn't get a single EDAC message. I have a bad feeling about that, both memory modules were new and sealed...
 

nasBuilder

Dabbler
Joined
Mar 25, 2023
Messages
26
Ran memtest86+ without errors, I'm very confused. I guess I'm going to repeat with the second module now. I hope it's not one of those hard-to-identify errors that only show up sometimes, or some weird broken controller in the CPU which I'll never find out about...
 

nasBuilder

Dabbler
Joined
Mar 25, 2023
Messages
26
Memtest86+ didn't display any errors on the second module, after that I ran it again with both modules inserted: no errors, then I rebooted with both modules and am not seeing any EDAC messages. How this is possible and what it means, I have zero clue...
 

MisterE2002

Patron
Joined
Sep 5, 2015
Messages
211
If I understand correctly, you get the messages in TrueNAS and also errors in memtest. I think PassMark will not bring anything new to the table. Something seems wrong and I would not use this system with valuable data. Better to unplug your pool and investigate.

Some possibilities:
* Linux bug (not sure)
* Compatibility bug between motherboard and memory (is the memory on Fujitsu's validated list?)
* Temperature related
* Motherboard defect
 

nasBuilder

Dabbler
Joined
Mar 25, 2023
Messages
26
Truenas has been active now for several hours with both modules, not a single message. I dunno, I didn't insert the RAM modules properly...? Pretty sure they clicked though. Temperatures should be higher now than when I was seeing the messages yesterday if anything. The RAM is of course not validated by fujitsu, they validated like 3 modules which are old and impossible to find: I don't care to be honest :-D

You sound like the messages are bound to pop up again, though, which is worrying. I guess I'm going to let it run now while I read up on how to back up the entire pool to a hot-swap drive in TrueNAS :-D
 

nasBuilder

Dabbler
Joined
Mar 25, 2023
Messages
26
OK, after having no messages for several hours I needed to reboot, and now they are coming again sporadically. Super annoying.
 

Attachments

  • IMG_20230325_173356.jpg

MisterE2002

Patron
Joined
Sep 5, 2015
Messages
211
Quite dangerous to ignore, I think. The motherboard can also be broken. Do other community members have experience with this?
Is it an option to switch motherboards? ASRock Rack boards are popular options. Supermicro is also nice but fits best in Supermicro cases because of the connectors (you need a proprietary cable to convert the header connector).
I did not even know Fujitsu makes motherboards.
 

nasBuilder

Dabbler
Joined
Mar 25, 2023
Messages
26
This mobo is quite famous in the unraid community at least. It has very low power consumption, it doesn't cost too much, and it doesn't have IPMI, which draws about 3 watts from what I understand and which I don't really need.

My impression is that some reboots make the messages come up and others don't, if that makes sense. When they do come up I see them in /var/log/messages

- How can I try to replicate these messages in Linux? Are they going to show up in an Ubuntu live distro, or does it work differently there...
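One way to check this on any Linux system, including a live distro, is to read the EDAC counters that the kernel exposes through sysfs instead of waiting for console messages. The following is only a minimal sketch: it assumes an EDAC driver for the memory controller is loaded and that the counters sit in the usual `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` files. The function name `read_edac_counts` is made up for this example.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory.

    "ce" is the corrected-error count, "ue" the uncorrected-error count.
    The base path is the usual EDAC sysfs location (an assumption here);
    an empty dict means no EDAC memory controller was found there.
    """
    counts = {}
    root = Path(base)
    if not root.is_dir():
        return counts  # EDAC driver not loaded, or a different layout
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # file missing or not a number
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

If the script prints nothing, the live system most likely has no EDAC driver loaded for this memory controller, so missing messages there would not prove the memory is fine.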

In the meantime I'm a bit surprised how many bugs there are in the truenas gui:
An existing replication task can't be edited. This is apparently fixed in https://ixsystems.atlassian.net/browse/NAS-120432 but won't be officially released until mid-April, really? So I installed a nightly build where it's fixed. However, you still can't edit the advanced replication options while creating a replication task; it just closes the editor. C'mon.

And I absolutely do not understand replication, but that's on me obviously. I thought I could just keep replicating my dataset recursively onto a hot-swap drive every time I insert it and push a button, but it only works once. Currently trying to figure out the difference between:

- replication from scratch
- full filesystem replication
- recursive replication

Yeah, it's going to take me a long time to imitate my former Synology solution :-D:-D
 

Attachments

  • Screenshot 2023-03-25 at 18.54.45.png
  • Screenshot 2023-03-25 at 19.27.14.png

nasBuilder

Dabbler
Joined
Mar 25, 2023
Messages
26
Basically, my final impression is that it depends on the boot: some boots do not make the messages pop up at all (lucky boot, I guess), while others show them periodically. I guess it's pretty bad news.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This mobo is quite famous in the unraid community at least,

UnRAID, from what I've gathered, is designed to work on commodity hardware like that random desktop board you're talking about. TrueNAS is best run on the hardware we recommend, because the resulting product will end up resembling the official iXsystems hardware that the TrueNAS appliance is designed to run on.

ECC memory is strongly recommended for TrueNAS, and if you're not going to use ECC, then I strongly recommend running Memtest86 for weeks (we usually go for at least a month) before considering the memory safe to use. @MisterE2002 has also made good suggestions.
 

nasBuilder

Dabbler
Joined
Mar 25, 2023
Messages
26
UnRAID, from what I've gathered, is designed to work on commodity hardware like that random desktop board you're talking about.

I guess you didn't do any research because it's as far from a random desktop mainboard as it can only get. :-D
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I guess you didn't do any research

I actually did do some research on it. It's clearly not a server board as evidenced by the orientation of the RAM slots, onboard audio, lack of IPMI, etc. It's clearly not a workstation board because it has support for integrated GPU (Intel UHD, DP, DVI-D). It seems to be lacking gaming features like LED lighting and overclocking support. It's also not identified by Fujitsu as an industrial board, but rather as an "extended lifecycle" board. The presence of DP/DVI-D plus audio and a flimsy x4-in-x16 PCIe slot indicates that while you can place a GPU in the primary x16 slot at full speed, the secondary x4-in-x16 slot will support a GPU only at low performance, which is a classic less-than-workstation configuration for multiple display setups (otherwise having the second x4-in-x16 slot makes little sense). So that thing takes Coffee Lake and would therefore be comparable to something like an HP EliteDesk 800 G... G4? I think.

Yeah, nailed it. Same stupid x16 PCIe configuration. Clearly sold as a desktop.


I guess you didn't do any research because it's as far from a random desktop mainboard as it can only get. :-D

Well, we disagree, then. It looks like a desktop board. It's provisioned like a desktop board. It doesn't fit into other standard board categories. If it looks like a duck and quacks like a duck, ....

More importantly, none of that works to disprove the point I was making, which is that TrueNAS works best when you make the hardware look as close as possible to the hardware that iXsystems sells, and they sell neither random desktop mainboards nor your specific Fujitsu board in their NAS systems.
 

nasBuilder

Dabbler
Joined
Mar 25, 2023
Messages
26
Well, alright :-D Not going to argue since you obviously are more experienced than me. Maybe the board is in a weird spot: it's designed to run 24/7 though and has extremely low power consumption, which seems strange for a desktop board. But it's clear that it's not a server board in the purest sense since it lacks IPMI and such.
Anyway, one can also say that the i3-9100 is not meant to be run in a server either, and that you should take a Xeon for that. Sure. For my use case it's totally fine though, I'm not running anything critical in the slightest here. Maybe I'll get another board instead of this one, but I'm still curious to know what exactly is going on, if there is an issue in the first place of course.
 

MisterE2002

Patron
Joined
Sep 5, 2015
Messages
211
In my opinion you should not use this for your (valuable) data. Did you find users with similar setup combinations? Users with ECC memory?
On a Dutch site I found this: "ECC (alleen Xeon en Core i3 8th gen)", which translates to "ECC (only Xeon and Core i3 8th gen)". This may just be incomplete data on that site.

Also found this PDF: "ECC support for XEON processors only". Maybe check the latest firmware changelog (or ask Fujitsu) to see what exactly is supported.
 

MisterE2002

Patron
Joined
Sep 5, 2015
Messages
211
Also found "Compatibility_CPU_x_Board.pdf" (hidden under "manuals"); they only state ECC with Xeons. That said, I also found some tweakers confirming that it should work (or they never noticed the messages you get).

ECC is hard. Everything needs to be perfect: memory, motherboard and CPU. I think they won't give any guarantees about non-Xeon CPUs.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Sure, there are a bunch of users using it with ECC memory,

There's a difference between "people use it with ECC memory" and "it correctly works with ECC memory including all the little details such as reporting ECC errors to the operating system." This has been a problem for manufacturers in the past, who get it wrong more often than they get it right. You can probably put ECC memory into a non-ECC-supporting desktop system and that usually "works" as long as the memory specs are correct. That doesn't mean you are protected by ECC though.

Especially without IPMI, you need to know that the board has a working mechanism to report ECC errors back to the operating system, or else the value of the (airquotes) "ECC support" is rather questionable.
 

nasBuilder

Dabbler
Joined
Mar 25, 2023
Messages
26
I see. Initially I wanted to save a couple of watts (electricity is insanely expensive here) by going for a board without IPMI. The other model I was eyeing was the Supermicro X11SCH-F, which is at least available for purchase. It looks though like I can't attach a Noctua U9S to it, which is a bummer... Probably going to ask tomorrow if I can return the Fujitsu. However, it's going to be really interesting if I get these errors with another board.

Latest update:

I feel I can reproduce the correctable ECC errors when running memtest86: if I shut down AND turn off the PSU, then turn it on and go to memtest, they are reported during the first 2 tests. However, if after finding the errors I just reboot without turning off the PSU, they do not come back in memtest. That's weird.
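To compare a cold boot (PSU switched off) against a warm reboot under TrueNAS as well, one option is to save the log after each boot and count the EDAC lines in it. This is a rough sketch only: the regex assumes nothing more than the `EDAC MC0:` prefix seen in the messages, the function name `count_edac_lines` is invented for this example, and the sample lines in the usage note are made up rather than copied from a real log.

```python
import re
from collections import Counter

# Matches the "EDAC MC<n>:" prefix; everything after the colon is ignored
# because the rest of the message format is not assumed here.
EDAC_LINE = re.compile(r"EDAC (MC\d+):")


def count_edac_lines(lines):
    """Count log lines containing an EDAC message, per memory controller."""
    counts = Counter()
    for line in lines:
        match = EDAC_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Usage could be as simple as `count_edac_lines(open("/var/log/messages", errors="replace"))` after each boot, noting whether it was a cold or a warm one. If the counts are consistently non-zero only after cold boots, that matches the memtest observation.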
 

Attachments

  • Screenshot 2023-03-26 at 12.59.59.png