Memory errors after upgrade to 12.0U5?

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
I have a system that's been in service for quite some time with no issues. It's never been mistreated (e.g. over temp, power issues, etc.).

It serves as a lab test/bench box.

Hardware is X9SRL-F , Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
Memory is all Micron 36KSF2G72PZ-1G6.

After upgrading to 12.0-U5 from U2, I'm seeing memory errors in the logs:

Code:
Aug 21 12:21:22 nastest MCA: Bank 7, Status 0x8c00004000010093
Aug 21 12:21:22 nastest MCA: Global Cap 0x0000000001000c15, Status 0x0000000000000000
Aug 21 12:21:22 nastest MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 0
Aug 21 12:21:22 nastest MCA: CPU 0 COR (1) RD channel 3 memory error
Aug 21 12:21:22 nastest MCA: Address 0x18a9215c0
Aug 21 12:21:22 nastest MCA: Misc 0x214042c286
Aug 21 12:23:21 nastest MCA: Bank 7, Status 0x8c00004000010093
Aug 21 12:23:21 nastest MCA: Global Cap 0x0000000001000c15, Status 0x0000000000000000
Aug 21 12:23:21 nastest MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 0
Aug 21 12:23:21 nastest MCA: CPU 0 COR (1) RD channel 3 memory error
Aug 21 12:23:21 nastest MCA: Address 0x18a9215c0
Aug 21 12:23:21 nastest MCA: Misc 0x2140545486
Aug 21 12:27:04 nastest MCA: Bank 7, Status 0x8c00004000010093


The Address is consistent across all error messages. This address maps to DIMMA1 in the dmidecode data and I've replaced that DIMM, yet the errors continue.
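
For reference, the Status value in the log above can be unpacked by hand. The following is a minimal sketch, assuming the standard IA32_MCi_STATUS bit layout from the Intel SDM (valid bit 63, uncorrected bit 61, corrected-error count in bits 52:38, and a memory-controller error code of the form 0000 0000 1MMM CCCC in the low 16 bits). The function and field names are made up for illustration and are not part of FreeBSD or TrueNAS.

```python
# Sketch: decode an MCA "Status" value as printed in the log.
# Bit positions are assumed from the Intel SDM machine-check chapter;
# the function and key names are hypothetical.

REQUEST_TYPES = {
    0b000: "generic",
    0b001: "read",
    0b010: "write",
    0b011: "address/command",
    0b100: "scrub",
}


def decode_mca_status(status: int) -> dict:
    code = status & 0xFFFF  # architectural MCA error code, bits 15:0
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": code,
    }
    # Memory-controller errors match 0000 0000 1MMM CCCC.
    if code & 0xFF80 == 0x0080:
        mmm = (code >> 4) & 0x7
        cccc = code & 0xF
        info["request"] = REQUEST_TYPES.get(mmm, "reserved")
        info["channel"] = None if cccc == 0xF else cccc  # 0xF = not specified
    return info


if __name__ == "__main__":
    for key, value in decode_mca_status(0x8C00004000010093).items():
        print(f"{key}: {value}")
```

Under those assumptions the value decodes to a valid, corrected (not uncorrected) read error on channel 3 with a count of 1, which agrees with the "COR (1) RD channel 3" line the kernel prints. It says nothing about which DIMM slot is involved.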

Before I start replacing the guts of the entire system: these errors correlate precisely with the upgrade to -U5. Is there any chance a defect or change on the TrueNAS side could cause this? Even something cosmetic, such as previously unreported memory errors now being reported?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
I doubt it's related to the software, the changes in u5 were very small. You could always roll back to the previous release and verify.
 

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
The upgrade was u2 - u5, though. :)

The actual error messages are somewhat erratic, but I worked out a process that triggers an error within an hour or two.
After a number of cycles, it appears that -U2 does not emit errors, but -U5 does.

After swapping modules and versions and testing, I'm getting inconsistent results. I would not be surprised if there is a faulty module, but the info provided by dmidecode and the error messages does not accurately indicate the affected module.
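
When comparing runs across module swaps and versions, it can help to tally the MCA records by channel and by address instead of eyeballing the log. A rough sketch follows, assuming the messages keep the exact format shown in the first post. The regular expressions and the `tally` helper are illustrative only, and the sample lines are copied from that log.

```python
# Sketch: count MCA memory-error records per (request, channel) and per address.
# Assumes the log line format quoted earlier in this thread.
import re
from collections import Counter

ERROR_RE = re.compile(r"MCA: CPU (\d+) .*?(\w+) channel (\w+) memory error")
ADDR_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")

SAMPLE = """\
Aug 21 12:21:22 nastest MCA: Bank 7, Status 0x8c00004000010093
Aug 21 12:21:22 nastest MCA: CPU 0 COR (1) RD channel 3 memory error
Aug 21 12:21:22 nastest MCA: Address 0x18a9215c0
Aug 21 12:23:21 nastest MCA: Bank 7, Status 0x8c00004000010093
Aug 21 12:23:21 nastest MCA: CPU 0 COR (1) RD channel 3 memory error
Aug 21 12:23:21 nastest MCA: Address 0x18a9215c0
"""


def tally(lines):
    by_channel = Counter()  # keyed by (request type, channel)
    by_address = Counter()  # keyed by the reported physical address
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            by_channel[(m.group(2), m.group(3))] += 1
            continue
        m = ADDR_RE.search(line)
        if m:
            by_address[m.group(1)] += 1
    return by_channel, by_address


if __name__ == "__main__":
    channels, addresses = tally(SAMPLE.splitlines())
    print(dict(channels))
    print(dict(addresses))
```

Feeding it the saved log from each test cycle (for example via `open(path)` in place of `SAMPLE.splitlines()`) gives one count per configuration, so a channel or address that follows a particular module is easier to spot.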

At this point I'm going to let memtest beat on it for a while and see what happens. Swapping all the RAM out for new sticks seems to clear the problem, so I'm not sure how long I'll keep testing. It's just weird that it happened immediately after an update and that -U2 doesn't seem to generate any log messages.
 

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
Memory goes bad over time, it's as simple as that. The fact that it showed up after an upgrade may simply mean that an address that was previously unused is now being used regularly. Running memtest86 is a good idea. I have dealt with this issue a couple of times on my Supermicro system bought on eBay. Note that dmidecode is not reliable in identifying the bad RAM stick. Check the BIOS of the X9 board; my X9DRi board showed the exact memory stick that was problematic.
 

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
It's easy enough to replace the DIMMs and move on, but in my time running Linux and BSD systems in large production environments, I've seen far too many cases where an issue that looked like hardware was actually software, and a great amount of time and money was spent replacing fully functional hardware only to have the software issue remain.

In this case, the dmidecode data reporting a hardware problem was correct but not completely accurate, so while there was a bad RAM module, it's also true that the OS kernel is not providing accurate information about the state of the hardware. Further, when running -U2 on the failed module, the kernel does not provide any notification of the RAM issue. It is/was my assumption that this module has been failing for a while and that something may have changed in the codebase between -U2 and -U5 that caused the kernel to start reporting it, which is why I asked the question in the first place.
 

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
If money is not an issue or you have little RAM, then replacing all DIMMs may be the solution. I have 90 gigs in my system, and I was not willing to throw it all away for one bad stick. In any case, check whether the BIOS is reporting anything. If it logs these errors, then it's a hardware issue for sure.
 