Memory trouble?

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
I recently transitioned from Windows Server 2012 to FreeNAS 11.3-U3.2 (latest at time of writing) to host ~20 TB of data, a media server VM (Emby), and an UrBackup jail. The final install followed several months of testing, including CPU stress testing, memory testing, and HDD burn-in. Testing showed that one 8 GB memory stick, purchased on eBay as a set of 8, was faulty; the fault showed up as memtest86+ freezing and ECC memory errors being logged in the BIOS and IPMI logs. I replaced the bad stick and all testing completed successfully.

I recently discovered a lone error in the console (/var/log/messages):
Code:
Jul  8 11:00:56 max MCA: Bank 9, Status 0x8c000051000800c0
Jul  8 11:00:56 max MCA: Global Cap 0x0000000001000c17, Status 0x0000000000000000
Jul  8 11:00:56 max MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 33
Jul  8 11:00:56 max MCA: CPU 13 COR (1) MS channel 0 memory error
Jul  8 11:00:56 max MCA: Address 0xd2f503100
Jul  8 11:00:56 max MCA: Misc 0x90000200020228c


I checked the IPMI log and NO ECC errors are registered there. There are no other such errors in /var/log/messages dating back to 6/21/20. I have seen other posts on this error; while some posters clearly had significant issues, a few described a similar one-time occurrence, and those threads simply stopped without resolution. Since this seems to be a fluke, I am inclined to ignore the error and keep monitoring. Am I making a mistake with this approach?
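
For my own monitoring I plan to simply grep the live log every so often; something along these lines (paths as on my box, adjust as needed):
Code:
# count MCA records in the live system log
grep -c 'MCA:' /var/log/messages

# show the full entries with timestamps
grep 'MCA:' /var/log/messages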
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
No, this is an indication the memory controller on the CPU itself is faulty.
 

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
Thank you for the reply. Is this memory controller on the motherboard or in the CPU? Anything I should be doing?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
You will need to replace your CPU.
 

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
Hm, I am somewhat hesitant to buy new hardware based on a single error in a three week window. Is there a way to check that this is a persistent issue?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Try dmesg | grep MCA, and ignore the CPU capabilities lists printed at boot. This should give you an idea of whether this has occurred in the past.
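
Something like this should work (matching on "MCA:" with the colon keeps the boot-time CPU feature flags, which also contain "MCA", out of the results):
Code:
# MCA error records start with "MCA:"; the CPU capability lines at boot do not
dmesg | grep 'MCA:'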
 

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
I checked, and the message above was the single occurrence. No other errors of this type were found.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Also check the various /var/log/dmesg* logs.
 

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
I do not see any such files. I do have /var/log/messages, which I checked. There is a rotated file messages.0.bz2 as well, which does not contain any such error either. This file dates back to 6/17/20, when I installed the production OS. The log files from the burn-in period were erased during that install, but I do not recall any error messages from the test period standing out.
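
For completeness, this is roughly how I searched the rotated, compressed log (bzcat ships with the FreeBSD base system):
Code:
# search the bzip2-compressed rotated log for machine check entries
bzcat /var/log/messages.0.bz2 | grep 'MCA:'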
 

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
I am reopening this issue. After a month, this memory error has recurred, and I have looked into the cause. First, the CPU associated with the error changes between occurrences (I have had several in the last week). The command "mcelog" shows:
Code:
Hardware event. This is not a software error.
MCE 0
CPU 14 BANK 9 TSC 137e121a69e76c
MISC 90000200020228c ADDR d2f503100
TIME 1597346246 Thu Aug 13 15:17:26 2020
MCG status:
MemCtrl: Corrected patrol scrub error

STATUS 8c000051000800c0 MCGSTATUS 0
MCGCAP 1000c17 APICID 22 SOCKETID 0
CPUID Vendor Intel Family 6 Model 62

When the error occurs, the address "d2f503100" is always the same.

I then ran memtest86 (without plus) because it is able to detect ECC errors. I received several ECC errors for the same channel and slot, although memtest86 does not tell me which module this corresponds to.

Finally, I found the BIOS setting to report every ECC error rather than only after a threshold of 10 errors is exceeded. Now IPMI reports the ECC errors as "Assertion: Memory| Event = Correctable ECC@DIMME1(CPU2)". It is always the same DIMM in slot E1.
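
In case it helps someone later, I was reading the IPMI event log from the OS with ipmitool; roughly like this on my board (yours may differ, and the ipmi kernel module has to be loaded):
Code:
# load the IPMI driver if it is not already loaded
kldload ipmi

# list the system event log, including the corrected ECC assertions
ipmitool sel elist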

Based on this information, I am thinking that this memory stick is faulty rather than the CPU. Does this make sense?

(I have not yet rotated the memory sticks, but instead ordered a couple of new sticks as they are cheap enough.)
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Based on this information, I am thinking that this memory stick is faulty rather than the CPU. Does this make sense?

Yes, this is a reasonable conclusion.
 

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
Thank you. Then there is hope the errors will go away within the week, once the new (used) memory arrives.
 

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
At the end of the day, I counted 8 correctable ECC errors (all but one for the same address as noted above, one for a nearby address, and all on the same DIMM). Do I assume correctly that ECC is doing its thing and that it is safe to operate the server this way until the replacement RAM arrives? Or should I shut the server down because I am at risk of data loss?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
ECC is working hard to compensate. Unless you need to have the server in operation until the RAM arrives, I'd leave the server off. ECC can only handle single-bit errors. If you have 2-bit errors, ECC won't save you.
 

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
I have replaced the RAM in slots E1 and E2 (to match the brand for both channels), and 3 cycles of memtest86 (minus the Hammer test) completed without ECC errors. FreeNAS has been up for 24 hours without memory errors either. So hopefully case closed.

One mystery remains: during troubleshooting I googled for a method to identify the bad DIMM and came across this blog. Interestingly, that procedure pointed at the DIMM in slot E2 rather than the slot E1 flagged by the hardware. I gather the described mcelog/dmidecode method is unreliable, but since I replaced both modules I cannot say for sure. If someone has insight here, I will take it as final wisdom for the future.
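
For reference, the blog's approach boiled down to matching the channel/slot from the MCA record against the DIMM listing from dmidecode. A rough sketch from memory, not the blog's exact commands, and assuming dmidecode is installed:
Code:
# list populated DIMMs with their slot locators, sizes, and serial numbers
dmidecode -t memory | grep -E 'Locator|Size|Serial'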
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
I gather the described mcelog/dmidecode method is unreliable

Indeed. This is very hardware-dependent. I'd trust the IPMI on this; it's been written for the specific board it runs on.
 