Hardware errors (ECC errors?) on two servers, can they be ignored?

AMiGAmann · Mar 30, 2024

Hi,

I have two TrueNAS SCALE builds with identical hardware (except the disks): ASRockRack X570D4U, Ryzen 9 5900X, Kingston KSM32ED8 ECC RAM, 750W power supply. On both systems I occasionally see hardware errors in the logs.

Right now I am rsyncing data from one system to the other. On the receiving system I see the hardware errors once every 3-4 hours in the logs:

Mar 30 03:24:19 TrueNAS kernel: mce: [Hardware Error]: Machine check events logged
Mar 30 03:24:19 TrueNAS kernel: [Hardware Error]: Corrected error, no action required.
Mar 30 03:24:19 TrueNAS kernel: [Hardware Error]: CPU:0 (19:21:2) MC17_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
Mar 30 03:24:19 TrueNAS kernel: [Hardware Error]: Error Addr: 0x0000000469e32d80
Mar 30 03:24:19 TrueNAS kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x447000200a800201
Mar 30 03:24:19 TrueNAS kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Mar 30 03:24:19 TrueNAS kernel: EDAC MC0: 1 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x11e78cb offset:0x480 grain:64 syndrome:0x20)
Mar 30 03:24:19 TrueNAS kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
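
For reference, the MC17_STATUS value in the log above can be decoded by hand. The following is a minimal Python sketch, not part of the original post. The bit positions are an assumption based on how the Linux kernel's AMD machine-check decoder prints these flags, and should be checked against AMD's documentation for the CPU family. The names `STATUS_BITS` and `decode_mca_status` are hypothetical.

```python
# Assumed bit positions of the MCA status flags (taken from how the Linux
# kernel's AMD MCE decoder prints them; verify before relying on them).
STATUS_BITS = {
    "Val": 63,
    "Overflow": 62,
    "UC": 61,        # uncorrected error; clear means the error was corrected
    "En": 60,
    "MiscV": 59,
    "AddrV": 58,
    "PCC": 57,       # processor context corrupt
    "TCC": 55,
    "SyndV": 53,
    "CECC": 46,      # corrected by ECC
    "UECC": 45,      # detected by ECC but not correctable
    "Deferred": 44,
    "Poison": 43,
}


def decode_mca_status(status: int) -> dict:
    """Return each named flag as a bool, plus the extended error code."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}
    flags["ext_error_code"] = (status >> 16) & 0x3F  # bits 21:16
    return flags


if __name__ == "__main__":
    # Status value copied from the log in the first post.
    for name, value in decode_mca_status(0x9C2040000000011B).items():
        print(f"{name:>15}: {value}")
```

Under those assumptions, the value from the log has CECC and SyndV set and UC, UECC and PCC clear, with extended error code 0. This agrees with the flag string and the "Ext. Error Code: 0" line the kernel printed, i.e. a corrected ECC error.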

To me this looks like a RAM ECC error. On both systems Memtest86+ has run for several days without a single error.

Can I safely ignore those errors or is there anything I should do?

Best regards,
AMiGAmann

PhilD13 · Mar 30, 2024

According to the Memtest86+ website, on AMD Ryzen systems the DRAM ECC Enable option needs to be set to Enabled. The default is Disabled.

Why are ECC errors not being reported on my AMD Ryzen system?

There is a possibility that a BIOS setting, Platform First Error Handling (PFEH), is preventing ECC errors from being reported to MemTest86.

So you might have issues that are not being reported by the Memtest86+ testing.

AMiGAmann · Mar 30, 2024

Hi,

thanks for the information.

If I understand it correctly, the Ryzen system detects and corrects 1-bit errors and informs the OS about it, which is when that hardware error is logged.

So ECC correction actually seems to work and ECC errors are reported. I still wonder why they are not reported to memtest86+. I am currently unable to reboot and check the BIOS settings, but I will do so asap.

Is there any way to make TrueNAS report the errors? I only noticed them in the logs by accident.

And why are those errors occurring on two different systems (with almost identical hardware)? I assume it might be an incompatibility between motherboard and RAM, but the Kingston RAM is listed on the Memory QVL of the motherboard.

Best regards,
AMiGAmann

PhilD13 · Mar 30, 2024

The memtest86+ (the open-source one) only has preliminary support for testing and reporting ECC on some Ryzen processors, starting at version 7.0; at least I found info from Jan 2024 saying that is the case. The memtest86 (the proprietary one) does support testing ECC on Ryzen if the BIOS setting I posted above is Enabled, according to their technical data page.

You can try enabling that parameter in the BIOS and see if that helps memtest86+.

If TrueNAS is reporting errors during operation, then you may have an issue with the memory.

I have found that the Kingston website is not always completely right about which memory goes with which board, due to processor/motherboard revisions etc.

Other than that I have nothing.

Fleshmauler · Mar 31, 2024

I think I've only ever gotten two of those errors in the two years I've run the X570D4U paired with the 16 GB version of those Kingston sticks, which sounds about right. One every 4 hours sounds bad. Are you actually running them at 3200 MHz or the 2933 that the motherboard supports? I've read about too many stability issues with the X570D4U on the level1 forums to bother going above 2933 MHz.

Apollo · Apr 1, 2024

I have watched a video from Gamers Nexus on the Threadripper 7000 series where they disabled ECC while running Memtest in order to catch any errors that would occur without the system trying to fix them. This gives an idea of memory stability in non-ECC mode; if no errors are detected, ECC can be enabled and the system should still be stable, with the added error correction of ECC.

AMiGAmann · Apr 1, 2024

Fleshmauler said:
One every 4 hours sounds bad. Are you actually running them at 3200 MHz or the 2933 that the motherboard supports?

I copied data (approx. 80 TB) from one TrueNAS SCALE build to the other over the last two days. The errors occurred even more often, I guess about hourly on both systems.

The memory runs at 2666 MT/s. With only 2 sticks installed they ran at 3200 MT/s; with 4 sticks they are automatically lowered to 2666 MT/s, which is also noted in the X570D4U manual.

Apollo said:
I have watched a video from Gamers Nexus on the Threadripper 7000 series where they disabled ECC while running Memtest in order to catch any errors that would occur without the system trying to fix them. This gives an idea of memory stability in non-ECC mode; if no errors are detected, ECC can be enabled and the system should still be stable, with the added error correction of ECC.

I can do that for testing. Although it probably would only confirm that there are actually memory issues.

I looked for that "Platform First Error Handling (PFEH)" BIOS option, but it is not available on my system.

There are several ECC-related DRAM options, which are all left at their default settings:

Code:
DRAM ECC Symbol Size=Auto
DRAM ECC Enable=Enabled (I guess this means that ECC error correction is actually active)
DRAM UECC Retry=Auto (this probably means "DRAM Uncorrectable Error Correcting Code Retry": it will probably enable additional correction attempts if the first try failed)
Disable Memory Error Injection=true

Meanwhile I tested one of the systems with MemTest86 v10.7 Pro, but this again shows no problems at all.

Well, I do seem to have problems, otherwise the errors logged in the first post would not have occurred. Luckily the error correction helped me avoid data corruption/crashes. But I guess something does not work as it should. Because I have two almost identical builds (only the disks differ), I am quite sure that there is no defective hardware.

But what else can I do to increase stability? Should I manually reduce the MT/s to an even lower value?

Best regards,
AMiGAmann

AMiGAmann · Apr 1, 2024

I have now disabled ECC in the BIOS (DRAM ECC Enable=Disabled) and run Memtest86 again. Memtest86 now displays "DDR4 2668MT/s" and not "DDR4 ECC 2668MT/s" as before, so ECC actually seems to be disabled.

After 2 hours the first pass is now finished without any errors.

I don't understand that.

Is my interpretation wrong that the error message in the first post is actually an ECC correction?

Fleshmauler · Apr 1, 2024

AMiGAmann said:
I looked for that "Platform First Error Handling (PFEH)" BIOS option, but it is not available on my system.

There are some different DRAM ECC relevant options, which are all left on default settings:

Code:
DRAM ECC Symbol Size=Auto
DRAM ECC Enable=Enabled (I guess this does mean that the ECC error correction is actually active)
DRAM UECC Retry=Auto (this probably does mean "DRAM Uncorrectable Error Correcting Code Retry": it will probably enable additional correction of errors if the first try failed)
Disable Memory Error Injection=true

Meanwhile I tested one of the systems with MemTest86 v10.7 pro, but this does again show no problems at all.

Yeah sadly no platform first error handling on our motherboard at all - from what I can recall blame was given to AMD for not making it a thing (take my memories & these allegations with a grain of salt); it is OS level only.

Also, if I'm not mistaken, if you want to test with memtest pro & catch ECC issues you would have to change "Disable Memory Error Injection" from "True" to "False". That way memtest can inject errors. I could be way off on that. Switch it back afterwards.

Realistically you can manually try to set voltage levels for stability, lower clocks and/or timings, etc. But this'd be something better asked of RAM overclockers than on a NAS forum. I'm not sure how much they play with ECC though. I'd recommend just RMAing to Kingston.

Oh, physical troubleshooting like removing & reseating the sticks, cleaning the contact points, ensuring gunk didn't get into the slots, testing them in different quantities and/or 1 at a time, etc. to try to narrow down which ones are problematic may yield something.
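
To help narrow down which sticks are affected without pulling hardware, the EDAC lines in the kernel log can be tallied by location. This is a minimal Python sketch, assuming log lines in the same format as the EDAC line in the first post. `tally_by_location` is a hypothetical helper, and how csrow/channel map to physical DIMM slots is board-specific and not something the sketch knows.

```python
import re
from collections import Counter

# Matches the EDAC summary line format seen in the first post, e.g.
# "EDAC MC0: 1 CE on ... (csrow:1 channel:0 page:... offset:...)"
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+)"
)


def tally_by_location(lines):
    """Sum reported errors per (mc, csrow, channel, kind) over log lines."""
    totals = Counter()
    for line in lines:
        m = EDAC_RE.search(line)
        if m:
            key = (int(m["mc"]), int(m["csrow"]), int(m["channel"]), m["kind"])
            totals[key] += int(m["count"])
    return totals
```

If nearly all corrected errors land on one csrow/channel pair, that points at a single stick or slot. An even spread across locations would fit a platform-level cause better.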

Davvo · Apr 1, 2024

If ECC kicks in and corrects the errors, you have no issue with your system. You should worry when the kernel goes into full panic mode because it cannot correct an error or, maybe worse, when it does not detect the errors and thus does not correct them.

TN has no way of telling whether you have an issue with ECC or with RAM.

AMiGAmann · Apr 2, 2024

I disabled ECC in the BIOS on one system (DRAM ECC Enable=Disabled) and left it enabled on the other. I ran MemTest86 on both systems for 4 passes (= 18 hours) without a single error. I know that a real "burn-in" would take days or weeks, but remember that the error messages in the first post popped up quite often under TrueNAS, so I would have expected to see some errors within 18 hours.

I changed "Disable Memory Error Injection" from "true" to "false" and ran MemTest86 again. I can see MemTest injecting the errors, but again no actual memory errors are detected.

On the MemTest website (https://www.memtest86.com/ecc.htm) they say at the very bottom:

Why are ECC errors not being reported on my AMD Ryzen system? There is a possibility that a BIOS setting, Platform First Error Handling (PFEH), is preventing ECC errors from being reported to MemTest86. Another explanation is the use of out-of-band (OOB) monitoring solutions such as Baseboard Management Controller (BMC) and Intelligent Platform Management Interface (IPMI).

There is actually no PFEH setting in the X570D4U BIOS when using a Ryzen 9 5900X. Yes, IPMI is enabled, but there are no ECC-related events in the IPMI event log.

Although I think I saw several people reporting MemTest86 displaying ECC errors on X470/X570 boards, I do not remember anybody actually using an ASRockRack X570D4U.

Is it possible that PFEH is always enabled on that board, so that ECC errors can never be reported to/by MemTest86?

Another thing mentioned on the website above at the very bottom is:

Why am I consistently seeing Correctable ECC / EDAC errors on my system?

There is likely a bug in your EDAC/BIOS. Your ECC RAM is OK, but was not initialized properly by the BIOS on boot.

In order to initialize ECC, memory has to be written before it can be used. Usually this is done by BIOS, but with some motherboards this step is skipped if "Quick Boot" is enabled.

Possible Solution: If your system allows for it, try disabling Quick Boot in the BIOS; some error messages should disappear. The boot process may take 30-60 seconds longer, but the EDAC error messages disappear due to the RAM check by the BIOS when booting.

X570D4U BIOS has "Fast Boot", but I disabled that from the beginning. Is there maybe any other relevant option to check?

Fleshmauler said:
Realistically you can manually try to set voltage levels for stability, lower clocks and/or timings, etc. But this'd be something better asked of RAM overclockers than on a NAS forum. I'm not sure how much they play with ECC though.

Well, I would actually like to try increasing the stability. But as long as MemTest does not throw any errors, how should I check for improvements? I would have to transfer a lot of data between the machines and check journalctl. Yes, this would probably work and I will try it later.
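
One way to compare runs before and after a BIOS change is to turn the journal output into an errors-per-hour figure. This is a minimal Python sketch, assuming syslog-style lines like those in the first post (timestamp without a year, followed by the EDAC summary line). `corrected_error_rate` is a hypothetical helper, and the year has to be supplied because the log format omits it.

```python
import re
from datetime import datetime

# Only the EDAC summary line is counted, so each event is counted once even
# though the kernel prints several "[Hardware Error]" lines per event.
CE_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) .*EDAC MC\d+: (?P<n>\d+) CE "
)


def corrected_error_rate(lines, year):
    """Return (total_corrected_errors, errors_per_hour or None).

    The rate is taken over the span between the first and last logged event;
    it is None when fewer than two events are available.
    """
    events = []
    for line in lines:
        m = CE_LINE.match(line)
        if m:
            stamp = " ".join(m["stamp"].split())  # normalize "Mar  3" padding
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            events.append((when, int(m["n"])))
    total = sum(n for _, n in events)
    if len(events) < 2:
        return total, None
    events.sort()
    hours = (events[-1][0] - events[0][0]).total_seconds() / 3600
    return total, (total / hours if hours > 0 else None)
```

Feeding it the saved kernel log from a transfer before and after a settings change would give two numbers to compare, rather than an impression from scrolling through the journal. Month-name parsing with `%b` assumes an English locale.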

Fleshmauler said:
I'd recommend just RMAing to Kingston.

I have two servers with 4 sticks each, and both servers have the same problem. One set of sticks is from the beginning of 2023, the other set is from late 2023. I cannot really imagine that both sets of sticks are defective. I would have to RMA all 8 sticks because I cannot prove that a single stick is the root cause.

Fleshmauler said:
Oh, physical troubleshooting like removing & reseating the sticks, cleaning the contact points, ensuring gunk didn't get into the slots, testing them in different quantities and/or 1 at a time, etc. to try to narrow down which ones are problematic may yield something.

One system was built by a professional in early 2023, the other was built very carefully by myself. I am quite sure that there are no physical problems like dust, dirty contact points, etc. One thing left to do is testing with 2 sticks (which then run at 3200 MT/s instead of 2666 MT/s). Can it actually be a problem that the sticks are running at a lower speed? I don't think so...

Davvo said:
If ECC kicks in and corrects the errors, you have no issue with your system. You should worry when the kernel goes into full panic mode because it cannot correct an error or, maybe worse, when it does not detect the errors and thus does not correct them.

What do you mean by "you have no issue with your system"? Yes, I am really glad that ECC kicks in and corrects all the errors and that I do not have actual data loss or data corruption. But don't you think that so many error messages are kind of abnormal?

Davvo said:
TN has no way of telling whether you have an issue with ECC or with RAM.

TN could IMHO at least watch and parse those error messages and offer an option in System Settings / Alert Settings to define a warning level for detected hardware errors.

Fleshmauler · Apr 2, 2024

AMiGAmann said:
There is actually no PFEH setting in the X570D4U BIOS when using a Ryzen 9 5900X. Yes, IPMI is enabled, but there are no ECC-related events in the IPMI event log.

I've personally confirmed with ASRock in the past that the X570D4U does not have IPMI-level ECC logging.

AMiGAmann said:
I have two servers with 4 sticks each, and both servers have the same problem

Sorry, I thought it was just one of the two having issues; hence why I suggested the more modular & hands-on levels of troubleshooting.

AMiGAmann said:
don't you think that so many error messages are kind of abnormal

100% agree - I'm unable to replicate on my X570D4U (doesn't help you much, but I'm happy), so I can't really suggest more...

I guess the last couple of things I can think of start with ensuring the BIOS is up to date on both motherboards. Heads-up: for the BMC update there was some special nonsense about having it clear all previously saved settings (sucks to reset fan curves); if you don't select the options to not keep the saved settings, then I remember the BMC updates being absolutely worthless & not actually fixing any features. I think I quoted them somewhere on the level 1 forums for the X570D4U, but they are pretty apparent when you do the BMC update if you keep an eye out.

What else... uhh - check RAM temps while these transfers are happening? Maybe a temperature-level issue causing mild instability? Chipset temps too? I remember my board had the chipset at like >70°C at IDLE until I slapped a fan towards it. For reference, under load I don't ever see my DIMMs much above 40°C, and the chipset is at 50-60°C after adding some more fans pointing directly at the devices. My case has so many stupid small fans pointing at equipment that expects server grade airflow...

If BIOS is up to date + other suggestions yield no results then I'm all out of ideas. Consider going to Asrock & see if they have any input? Reaching out to vendor can sometimes yield positive results.

Davvo · Apr 2, 2024

AMiGAmann said:
But don't you think that so many error messages are kind of abnormal?

I do, but I don't think they are a threat to your system. I don't know if it's enough to get an RMA, either.
It might also not be a RAM-related issue... as I wrote, if ECC is working correctly I would not worry too much.

Stux · Apr 2, 2024

When I had some recurring ECC errors in the past I removed the sticks, blew out the sockets with canned air, and reseated.

Haven’t had any errors since.

But at least I knew the ECC was working ;)

I find this situation can be worse when you expand the memory and start using sockets that were previously empty (and gathering dust)

AMiGAmann · Apr 4, 2024

Fleshmauler said:
I guess last couple of things I could possibly think of is ensuring BIOS is up to date on both motherboards.

Yes, it is up to date. I updated the one on the older system once, and the newer board came with the current BIOS/BMC. ASRockRack does not really maintain the BIOS; the current one is already 1.5 years old (October 2022).

Fleshmauler said:
check ram temps while these transfers are happening? Maybe a temperature level issue causing mild instability? Chipset temps too?

This is also not a problem. RAM temps are always below 40°C (IPMI sensors), and the chipset temp is around 50-55°C. There is a big fan blowing directly at the chipset. The systems are built in a quite big Fractal Design Define 7 XL tower (https://www.fractal-design.com/de/products/cases/define/define-7-xl/black/), and each system contains 8 (!) big fans. That's because I want to make sure the disks also run at comfortable temperatures (below 42°C).

Fleshmauler said:
Consider going to Asrock & see if they have any input?

Yes, I will probably do that.

Stux said:
When I had some recurring ECC errors in the past I removed the sticks, blew out the sockets with canned air, and reseated.

The newer system is a fresh build. The X570D4U came right out of the packaging and the sticks were carefully seated.

Well, I am really baffled. After the long MemTest run at the beginning of the week I checked some more BIOS settings. I had not really configured much in the BIOS, but I had CPU EcoMode enabled on both systems. I disabled EcoMode on one system and started TrueNAS on both again. I copied a lot of data from system A to B and from B to A (using Total Commander on a client) and deleted that data, then copied data (500 GB) from A to B with an rsync job as I did before and deleted it again. Since yesterday I have been comparing the data on A with B using Total Commander on a client; all good.

And here it comes: I have not had a single ECC error logged on either system since last weekend. Remember, last weekend I had lots of errors, sometimes only 2-3 minutes apart, on both systems! Now I am not able to reproduce an error (to validate whether EcoMode has anything to do with it).

This is really strange. I could actually delete the big 80TB dataset on the destination system and rsync it all over again, but I actually do not want to do this.

So I guess I have to monitor the journalctl on both systems and see what happens.
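
As an alternative to reading the journal, the Linux EDAC driver also exposes running counters under sysfs. This is a minimal Python sketch, assuming the EDAC driver is loaded and uses the usual `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` layout; `read_edac_counts` is a hypothetical helper. As far as I know the counters start from zero again after a reboot, so they only cover the current uptime.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory.

    An empty dict means the directory does not exist (EDAC driver not
    loaded, or not a Linux system).
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            counts[mc.name] = {
                "ce": int((mc / "ce_count").read_text().strip()),
                "ue": int((mc / "ue_count").read_text().strip()),
            }
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

Run periodically (for example from a cron job), a rising `ce` value would show new corrected errors without having to scan the logs. Any non-zero `ue` value would be the case to worry about.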

But I don't have an explanation for all that. Maybe the ECC RAM is *sometimes* not being initialized correctly by the board? As stated on the MemTest website linked above:

There is likely a bug in your EDAC/BIOS. Your ECC RAM is OK, but was not initialized properly by the BIOS on boot.

But it does not really make sense. Why should the initialization go wrong at the same time on both systems and then be okay on both systems at the same time?

Strange...

Fleshmauler · Apr 4, 2024

AMiGAmann said:
And here it comes: I did not have a single ECC error log on any of the systems since last weekend.

...All is well that ends well I guess - lol, maybe you had a localized cosmic event causing minor ECC errors?
