HOWTO: Troubleshooting faulty RAM

Status
Not open for further replies.

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
I had some faulty RAM in my SuperMicro X11 system. The system has 512GB of RAM in 32GB ECC DDR4 modules, so swapping out modules and running memtest for many hours is neither cost-effective nor particularly fast, especially since the errors don't crop up continuously (I get between 3 and 10 per 24 hours).

These are the errors I got:

sudo cat /var/log/messages | grep MCA
Code:
MCA: Bank 7, Status 0xcc00008000010091
MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 16
MCA: CPU 16 COR (2) OVER RD channel 1 memory error
MCA: Address 0x64ec54e300
MCA: Misc 0x50028286
MCA: Bank 7, Status 0x8c00004000010091
MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 16
MCA: CPU 16 COR (1) RD channel 1 memory error
MCA: Address 0x64ec54e300
MCA: Misc 0x150028286
MCA: Bank 10, Status 0x8c00004d000800c1
MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 28
MCA: CPU 28 COR (1) MS channel 1 memory error
MCA: Address 0x64ec54e240
MCA: Misc 0x1221000100011a8c
MCA: Bank 10, Status 0x8c00004d000800c1
MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 29
MCA: CPU 29 COR (1) MS channel 1 memory error
MCA: Address 0x64ec54e240
MCA: Misc 0x1221000100011a8c
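
As an aside, the hex Status value in each record can be unpacked by hand. Below is a minimal Python sketch, assuming the values follow the architectural IA32_MCi_STATUS layout from Intel's Software Developer's Manual (VAL in bit 63, OVER in bit 62, UC in bit 61, corrected error count in bits 52:38, MCA error code in bits 15:0). The function name decode_mca_status and the dictionary keys are made up for illustration. Under this reading, "Bank" is the number of the CPU's machine-check register bank that reported the error, which would explain why it does not match anything printed on the board.

```python
# Sketch only: assumes the architectural IA32_MCi_STATUS bit layout.
# decode_mca_status and its keys are hypothetical names, not a real API.

MEMORY_ERROR_KINDS = {
    0: "generic",
    1: "read",             # printed as RD in the log
    2: "write",
    3: "address/command",
    4: "scrub",            # printed as MS in the log
}


def decode_mca_status(status):
    """Split a 64-bit MCA status value into its main fields."""
    code = status & 0xFFFF  # bits 15:0, MCA error code
    info = {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER, an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),  # UC, 0 means the error was corrected
        "addr_valid": bool((status >> 58) & 1),   # ADDRV, the Address line is meaningful
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
    }
    # Memory controller errors are encoded as 0000 0000 1MMM CCCC.
    if code & 0xFF80 == 0x0080:
        info["memory_error"] = MEMORY_ERROR_KINDS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        info["channel"] = None if channel == 0xF else channel  # 0xF means unspecified
    return info


for value in (0xCC00008000010091, 0x8C00004D000800C1):
    print(hex(value), decode_mca_status(value))
```

Applied to the first record above, this reports a corrected read error on channel 1 with a count of 2 and the overflow flag set, which agrees with the "COR (2) OVER RD channel 1" text in the log. The Bank 10 records decode as corrected scrub errors on channel 1.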


So initially I thought: two faulty memory sticks? The error log indicates Bank 7 and Bank 10, so I opened up the case and ... neither Bank 7 nor Bank 10 was marked anywhere. I looked all over the PCB and no such designations existed.

So I booted back up. Nothing in the BIOS or IPMI indicated which module could be faulty, so I decided to run "dmidecode" to see whether there was any indication or translation of Bank 7 or Bank 10. I'm not sure where the log files get that information from, but even dmidecode | grep -i "bank" only came up with things like Bank Locator: P1_Node1_Channel3_Dimm2. The hexadecimal values in the error logs didn't seem to indicate anything either, although after carefully scrolling through the entire dmidecode output, I found these entries:

sudo dmidecode -t 20
Code:
...
Handle 0x0047, DMI type 20, 35 bytes
Memory Device Mapped Address
		Starting Address: 0x01000000000
		Ending Address: 0x017FFFFFFFF
		Range Size: 32 GB
		Physical Device Handle: 0x002E
		Memory Array Mapped Address Handle: 0x0044
		Partition Row Position: 1
...


These entries apparently give the address range that each memory module covers, so I just looked for the one that contained Address 0x64ec... (the dmidecode addresses seem to have an extra 0 at the beginning). The range 0x06000000000-0x067FFFFFFFF matched, and its entry indicated Physical Device Handle: 0x0041. Grepping for that handle got me this:

sudo dmidecode -t memory | grep -A23 "0x0041"
Code:
Handle 0x0041, DMI type 17, 40 bytes

Memory Device
	Array Handle: 0x0037
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: P2-DIMMH1
	Bank Locator: P1_Node1_Channel3_Dimm0
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2133 MHz
	Manufacturer: Micron
	Serial Number: 1187AFDE
	Asset Tag: P2-DIMMH1_AssetTag (date:16/03)
	Part Number: 36ASF4G72PZ-2G1A1  
	Rank: 2
	Configured Clock Speed: 1866 MHz
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown
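
For anyone who would rather script the lookup than scroll through dmidecode by eye, here is a minimal Python sketch of the same two steps: find the type 20 range that contains the faulting address, then look up the Locator of the matching type 17 handle. It assumes the text layout shown in the outputs above. The function names are hypothetical, and the sample strings contain only the fields quoted in this thread.

```python
# Sketch only: parses the text printed by `dmidecode -t 20` and `dmidecode -t 17`.
# All function names here are hypothetical helpers, not part of dmidecode.

def parse_mapped_ranges(type20_text):
    """Return (start, end, physical_device_handle) for each mapped range."""
    ranges = []
    start = end = None
    for line in type20_text.splitlines():
        key, _, value = line.strip().partition(": ")
        if key == "Starting Address":
            start = int(value, 16)
        elif key == "Ending Address":
            end = int(value, 16)
        elif key == "Physical Device Handle" and start is not None and end is not None:
            ranges.append((start, end, value))
            start = end = None
    return ranges


def parse_locators(type17_text):
    """Return {handle: locator} for each memory device entry."""
    locators = {}
    handle = None
    for line in type17_text.splitlines():
        line = line.strip()
        if line.startswith("Handle "):
            # "Handle 0x0041, DMI type 17, 40 bytes" -> "0x0041"
            handle = line.split()[1].rstrip(",")
        elif line.startswith("Locator: ") and handle is not None:
            locators[handle] = line.partition(": ")[2]
    return locators


def locate(address, ranges, locators):
    """Return (handle, locator) for the range containing address, else None."""
    for start, end, handle in ranges:
        if start <= address <= end:
            return handle, locators.get(handle)
    return None


TYPE20_SAMPLE = """\
Starting Address: 0x01000000000
Ending Address: 0x017FFFFFFFF
Physical Device Handle: 0x002E

Starting Address: 0x06000000000
Ending Address: 0x067FFFFFFFF
Physical Device Handle: 0x0041
"""

TYPE17_SAMPLE = """\
Handle 0x0041, DMI type 17, 40 bytes
Memory Device
    Locator: P2-DIMMH1
    Bank Locator: P1_Node1_Channel3_Dimm0
"""

ranges = parse_mapped_ranges(TYPE20_SAMPLE)
locators = parse_locators(TYPE17_SAMPLE)
print(locate(0x64EC54E300, ranges, locators))
```

On a live system the two sample strings would be replaced by the real command output. Parsing the addresses as integers also takes care of the extra leading 0 in the dmidecode values.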


The Locator information WAS something that was printed on the PCB of the motherboard. So I shut down the system, removed the module and booted back up. The system now complained about mismatched DIMM configurations, but it booted. A good 24 hours later there were still no memory errors in the logs, so I'm pretty sure I had the right one. This one is going back to Crucial for warranty replacement.

But wait... didn't it indicate two different banks? Yes it did, but both banks indicated the same memory address space. I'm not sure what the deal is, perhaps it depends on which processor accesses the banks or perhaps it's because this is a 2-sided memory module.
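
To check whether different banks really are reporting the same region, the log records can be grouped by address. The sketch below is one way to do that in Python, assuming each record has a "MCA: Bank N, Status ..." line followed later by a "MCA: Address 0x..." line, as in the log above. Grouping by 4 KiB page is an arbitrary choice made here for illustration, and banks_by_page is a hypothetical name.

```python
# Sketch only: groups MCA records from syslog text by the page they fall in.
# The 4096-byte page size is an assumption chosen for illustration.
import re
from collections import defaultdict

BANK_RE = re.compile(r"MCA: Bank (\d+),")
ADDR_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")


def banks_by_page(log_text, page_size=4096):
    """Return {page_start_address: set_of_reporting_banks}."""
    pages = defaultdict(set)
    bank = None
    for line in log_text.splitlines():
        match = BANK_RE.search(line)
        if match:
            bank = int(match.group(1))  # remember which bank this record is from
            continue
        match = ADDR_RE.search(line)
        if match and bank is not None:
            address = int(match.group(1), 16)
            pages[address // page_size * page_size].add(bank)
    return dict(pages)


LOG_SAMPLE = """\
MCA: Bank 7, Status 0xcc00008000010091
MCA: Address 0x64ec54e300
MCA: Bank 10, Status 0x8c00004d000800c1
MCA: Address 0x64ec54e240
"""

for page, banks in banks_by_page(LOG_SAMPLE).items():
    print(hex(page), sorted(banks))
```

For the records in this thread, both addresses fall in the page starting at 0x64ec54e000 and are only 0xC0 bytes apart, so Bank 7 and Bank 10 end up in the same group.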

Hopefully this is of use to someone. This worked for me on a SuperMicro board; I'm not sure if it works for others, though I would imagine quality boards would have the correct location programmed in their BIOS.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
I suspect it's because the DIMM is dual ranked.

BTW, in theory it's possible the issue is the DIMM slot and not the DIMM. To rule this out you could swap the faulty DIMM with a DIMM in another position and see whether the error moves with the DIMM or stays with the slot.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Thanks for sharing.
It would be excellent if you could provide the commands you ran in each step.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
What does the IPMI event log show? In my experience (which has been more than I'd like), it will show the location of the error--though whether the error is actually in the DIMM itself, or something else between the CPU core and the DIMM, is another issue.
 

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
The IPMI event log shows nothing about memory failures (it does now that I have unplugged the power and opened the chassis). I think it's because it was a correctable error (ECC RAM).
 

RichR

Explorer
Joined
Oct 20, 2011
Messages
77
This might sound a little crazy, but... make sure you don't have any standoffs or anything else touching the motherboard where they shouldn't. It happened to me once.
 

droeders

Contributor
Joined
Mar 21, 2016
Messages
179
Evi Vanoost said:
The IPMI event log shows nothing about memory failures (it does now that I have unplugged the power and opened the chassis). I think it's because it was a correctable error (ECC RAM).

Thanks for writing this up - I could see it being useful for someone in the future.

I have at least one Supermicro BIOS (can't remember which board) that has a setting for the minimum number of ECC errors before they are logged to the SEL. I believe the default on my board was 10, and I changed it to 1.
 