Evi Vanoost
I had some faulty RAM in my SuperMicro X11 system. I have 512GB in 32GB ECC DDR4 modules, so swapping out modules and running memtest for many hours is neither cost-effective nor particularly fast, especially since the errors don't crop up continuously (I get between 3 and 10 per 24h).
These are the errors I got:
sudo cat /var/log/messages | grep MCA
Code:
MCA: Bank 7, Status 0xcc00008000010091
MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 16
MCA: CPU 16 COR (2) OVER RD channel 1 memory error
MCA: Address 0x64ec54e300
MCA: Misc 0x50028286
MCA: Bank 7, Status 0x8c00004000010091
MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 16
MCA: CPU 16 COR (1) RD channel 1 memory error
MCA: Address 0x64ec54e300
MCA: Misc 0x150028286
MCA: Bank 10, Status 0x8c00004d000800c1
MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 28
MCA: CPU 28 COR (1) MS channel 1 memory error
MCA: Address 0x64ec54e240
MCA: Misc 0x1221000100011a8c
MCA: Bank 10, Status 0x8c00004d000800c1
MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 29
MCA: CPU 29 COR (1) MS channel 1 memory error
MCA: Address 0x64ec54e240
MCA: Misc 0x1221000100011a8c
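With only a handful of errors per day it can help to see which addresses keep coming back. The following is a minimal sketch, not something from my original troubleshooting: mca_addresses is a made-up helper name, and it assumes the log lines look like the ones above (an "MCA:" token followed by "Address" and a hex value, possibly after a syslog prefix).

```shell
# Hypothetical helper: read log lines on stdin and print "count address"
# for every distinct address found in "MCA: Address 0x..." entries.
mca_addresses() {
    awk '{
        # Scan each line for the token sequence: MCA: Address <value>
        for (i = 1; i + 2 <= NF; i++)
            if ($i == "MCA:" && $(i + 1) == "Address")
                print $(i + 2)
    }' | sort | uniq -c | awk '{ print $1, $2 }'   # normalize uniq spacing
}
```

It could be fed from the same file as the grep above, for example: sudo cat /var/log/messages | mca_addresses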
So initially I thought: two faulty memory sticks? The error log indicates Bank 7 and Bank 10, so I opened up the case and ... neither Bank 7 nor Bank 10 was marked anywhere. I looked all over the PCB and no such designations existed.
So I booted back up. Nothing in the BIOS or IPMI indicated which module could be faulty, so I decided to run "dmidecode" to see whether there was any indication or translation of Bank 7 or Bank 10. I'm not sure where the log files get that information from, but even
dmidecode | grep -i "bank"
only came up with things like Bank Locator: P1_Node1_Channel3_Dimm2. The hexadecimal values in the error logs don't obviously point to anything either, but after carefully scrolling through the entire dmidecode output, I found these entries:
sudo dmidecode -t 20
Code:
...
Handle 0x0047, DMI type 20, 35 bytes
Memory Device Mapped Address
	Starting Address: 0x01000000000
	Ending Address: 0x017FFFFFFFF
	Range Size: 32 GB
	Physical Device Handle: 0x002E
	Memory Array Mapped Address Handle: 0x0044
	Partition Row Position: 1
...
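Since each of these entries ties an address range to a Physical Device Handle, the search for the range holding a logged address can be scripted instead of done by eye. This is only a sketch under assumptions: handle_for_address is a hypothetical helper name, it expects the plain hex address format shown above (some dmidecode versions may print addresses differently), and it relies on the shell doing 64-bit arithmetic with 0x constants.

```shell
# Hypothetical helper: read `dmidecode -t 20` output on stdin and print the
# Physical Device Handle of every mapped range that contains address $1.
handle_for_address() {
    addr=$(( $1 ))          # e.g. 0x64ec54e300, taken from the MCA log
    start=""
    end=""
    while IFS= read -r line; do
        case "$line" in
            *"Starting Address:"*) start=${line##*: } ;;
            *"Ending Address:"*)   end=${line##*: } ;;
            *"Physical Device Handle:"*)
                # The handle line follows the address lines of the same record.
                if [ -n "$start" ] && [ -n "$end" ] &&
                   [ "$addr" -ge $(( $start )) ] &&
                   [ "$addr" -le $(( $end )) ]; then
                    printf '%s\n' "${line##*: }"
                fi
                start=""
                end=""
                ;;
        esac
    done
}
```

Usage would be along the lines of: sudo dmidecode -t 20 | handle_for_address 0x64ec54e300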
Each of these dmidecode entries apparently gives the address range that one memory module covers, so I looked for the one containing the address 0x64ec... from the error log (the dmidecode addresses seem to have an extra 0 at the beginning). The matching range, 0x06000000000-0x067FFFFFFFF, indicated Physical Device Handle: 0x0041. Grepping for that handle got me this:
sudo dmidecode -t memory | grep -A23 "0x0041"
Code:
Handle 0x0041, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x0037
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: P2-DIMMH1
	Bank Locator: P1_Node1_Channel3_Dimm0
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2133 MHz
	Manufacturer: Micron
	Serial Number: 1187AFDE
	Asset Tag: P2-DIMMH1_AssetTag (date:16/03)
	Part Number: 36ASF4G72PZ-2G1A1
	Rank: 2
	Configured Clock Speed: 1866 MHz
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown
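The grep -A23 above depends on the record being exactly that long. A slightly less fragile way to pull just the slot name for a handle might look like the sketch below. locator_for_handle is a hypothetical name, and it assumes records start with a "Handle 0x...," line as in the output above.

```shell
# Hypothetical helper: read `dmidecode -t memory` output on stdin and print
# the Locator value of the record whose handle is $1 (e.g. 0x0041).
locator_for_handle() {
    awk -v h="$1" '
        # A "Handle 0x0041, DMI type ..." line starts a new record.
        /^Handle / { in_rec = ($2 == h ",") }
        # Match "Locator:" only; "Bank Locator:" has "Bank" as first field.
        in_rec && $1 == "Locator:" {
            sub(/^[ \t]*Locator:[ \t]*/, "")
            print
        }
    '
}
```

For example: sudo dmidecode -t memory | locator_for_handle 0x0041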
The Locator information WAS something that was printed on the PCB of the motherboard. So I shut down the system, removed the module and booted back up. The system now complained about mismatched DIMM configurations, but it booted. A good 24 hours later there are still no memory errors in the logs, so I'm pretty sure I had the right one. This one is going back to Crucial for warranty replacement.
But wait... didn't it indicate two different banks? Yes it did, but both banks reported addresses in the same memory address range. I'm not sure what the deal is; perhaps it depends on which processor accesses the banks, or perhaps it's because this is a 2-sided memory module.
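A quick bit of shell arithmetic shows how close the two logged addresses are. This is just an illustration: slice_of is a made-up helper, and it assumes every module maps one contiguous 32 GB range (0x800000000 bytes), as the dmidecode ranges above suggest.

```shell
# Hypothetical helper: index of the 32 GB slice an address falls into,
# assuming contiguous 32 GB (0x800000000 byte) ranges per module.
slice_of() {
    echo $(( $1 / 0x800000000 ))
}

slice_of 0x64ec54e300                     # address logged with Bank 7  -> 12
slice_of 0x64ec54e240                     # address logged with Bank 10 -> 12
echo $(( 0x64ec54e300 - 0x64ec54e240 ))   # distance in bytes -> 192
```

Slice 12 starts at 12 x 32 GB, which is 0x6000000000, the start of the range that pointed to handle 0x0041. So both banks complained about locations only 192 bytes apart on the same module.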
Hopefully this is of use to someone. This worked for me on a SuperMicro board; I'm not sure if it works for others. I would imagine quality boards would have the correct location programmed in their BIOS.