ECC or memory controller errors, not sure how to interpret or how severe.

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
My console dumped pages like this. It's clear it's about memory or the memory controller, but it's not obvious exactly what issue is being reported, how to interpret it, whether it's a warning or a hard problem, or whether to just reseat the RAM, replace it, or swap sticks around to test the controller, etc.

Help requested, thanks!

IMG_20220127_230502.jpg


Not sure why it's sideways but still... !!

Xeon E5 on supermicro X10, as in my SIG, main server.
 
Joined
Jan 7, 2015
Messages
1,155
This is telling you that the machine thinks one of your sticks of RAM is bad. 0x306f1. The easiest way to check is to run a lot of memtest, for a day or so if you can. You'll usually find out much sooner than a day if the RAM is bad; lots of times you'll get errors within the first pass or three. Any errors are bad news. However, I have seen these RAM bugaboos a lot over time, and in several (not all) cases a RAM reseat "fixed" it.

So if it's me, I'd do something like this, in this order:
Make a USB disk of memtest. It comes on several popular Linux "boot" disks, ready to go.
Shut down my TN server and remove power.
Pull all RAM sticks and reseat them one at a time. (Air duster? Absolutely.)
Boot up the Linux memtest USB. In my experience, if your RAM is really bad you'll get errors almost immediately, but I have also found errors deep into many passes. Let it run as long as you're willing. If you can make it to three passes error-free, I'd try booting TN back up and watching closely for more errors.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Can the output identify which stick, or channel, it thinks is suspect, and what exactly was detected?

Or, if easier, what does that output actually say at a technical level, decoded and not simplified down to "memory error"? How do I read the lines?
 
Joined
Jan 7, 2015
Messages
1,155
I'm not sure it's written down anywhere; if it is, I haven't found it.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Does
Code:
dmidecode
tell you anything?

What about your IPMI log?
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
I *think* I now understand what's going on, but it's a bit of a reach, so I wouldn't mind a sanity check. This may also help others, so I'll detail how I got there.

I was guided by Google searches for terms like FreeBSD MCA GenuineIntel (with slight variation for AMD CPUs, if needed), which led me to this thread in particular, and also this one.

TL;DR version: mcelog will dump the error messages, but for me at least it wouldn't decode the actual DIMM reference. dmidecode appears to provide the link, but I would really like my understanding checked.

Data output from mcelog and dmidecode:


Neatened mcelog output (removing obvious stuff that wasn't helpful):

Code:
MCE 0 BANK 8 TSC 184c510777e37a MISC 50525286 ADDR 1b93f085c0 STATUS cc00054000010090 MCGSTATUS 0 MCGCAP 7000c16
MCE 1 BANK 8 TSC 184d3c84b20d17 MISC 140484886 ADDR 1c56b48580 STATUS cc000cc000010090 MCGSTATUS 0 MCGCAP 7000c16
MCE 2 BANK 13 TSC 184d3c84b29d53 MISC 1221022022020c8c ADDR 1a90c08480 Corrected patrol scrub error STATUS cc000086000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 3 BANK 13 TSC 184d3c8c6fbc41 MISC 1221200202220c8c ADDR 1a90dc8480 Corrected patrol scrub error STATUS cc000386000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 4 BANK 13 TSC 184d3c9234f67e MISC 1221002022020c8c ADDR 1a91c08580 Corrected patrol scrub error STATUS cc000086000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 5 BANK 13 TSC 184d3c961c0d63 MISC 1221022022020c8c ADDR 1a91dc8580 Corrected patrol scrub error STATUS cc000346000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 6 BANK 13 TSC 184d3ca2089cc6 MISC 1221002022000c8c ADDR 1a92ec8480 Corrected patrol scrub error STATUS cc000206000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 7 BANK 13 TSC 184d3caf8a5dbd MISC 1221200222020c8c ADDR 1a93ec8580 Corrected patrol scrub error STATUS cc000406000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 8 BANK 13 TSC 184d3cc57cecf6 MISC 1221020222020c8c ADDR 1a958c8580 Corrected patrol scrub error STATUS cc000806000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 9 BANK 13 TSC 184d3cf8c4ddd0 MISC 1221000200220c8c ADDR 1a99588580 Corrected patrol scrub error STATUS cc001006000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 10 BANK 13 TSC 184d3d79b4d33c MISC 1221022022000c8c ADDR 1aa2e48480 Corrected patrol scrub error STATUS cc002046000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 11 BANK 13 TSC 184d3e51cec069 MISC 1221022222020c8c ADDR 1ab2e48480 Corrected patrol scrub error STATUS cc004006000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 12 BANK 13 TSC 184d4002010e52 MISC 1221002202020c8c ADDR 1ad2e48480 Corrected patrol scrub error STATUS cc008006000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 13 BANK 13 TSC 184d4362d0b8a9 MISC 1221022222000c8c ADDR 1b12ec8480 Corrected patrol scrub error STATUS cc010046000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 14 BANK 13 TSC 184d4b08c96bc1 MISC 1221202022020c8c ADDR 1ba3e48580 Corrected patrol scrub error STATUS cc020006000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 15 BANK 13 TSC 1855cfebb342be MISC 1221222200220c8c ADDR 1c7f3c8580 Corrected patrol scrub error STATUS cc036146000800c0 MCGSTATUS 0 MCGCAP 7000c16


Most, but not all, are labelled "Corrected patrol scrub error", suggesting they were fixed internally.
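
For the lines that carry no such label, the STATUS word can be picked apart by hand. The following is only a sketch, assuming the architectural IA32_MCi_STATUS bit layout described in the Intel SDM applies to this CPU; the function name and dictionary keys are invented for illustration and are not part of mcelog.

```python
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM machine-check chapter; the
# function name and dictionary keys are made up for this example.

MEM_ERROR_TYPES = {
    0b000: "generic",
    0b001: "read",
    0b010: "write",
    0b011: "address/command",
    0b100: "scrub",
}

def decode_mci_status(status: int) -> dict:
    mca_code = status & 0xFFFF                          # bits 15:0
    fields = {
        "valid": bool((status >> 63) & 1),              # VAL
        "overflow": bool((status >> 62) & 1),           # OVER: earlier events were overwritten
        "uncorrected": bool((status >> 61) & 1),        # UC
        "enabled": bool((status >> 60) & 1),            # EN
        "misc_valid": bool((status >> 59) & 1),         # MISCV
        "addr_valid": bool((status >> 58) & 1),         # ADDRV
        "context_corrupt": bool((status >> 57) & 1),    # PCC
        "corrected_count": (status >> 38) & 0x7FFF,     # bits 52:38, if the CPU reports it
        "model_code": (status >> 16) & 0xFFFF,          # bits 31:16, model specific
        "mca_code": mca_code,
    }
    # Memory controller errors use the pattern 000F 0000 1MMM CCCC.
    if (mca_code & 0xEF80) == 0x0080:
        fields["memory_error"] = MEM_ERROR_TYPES.get((mca_code >> 4) & 0x7, "reserved")
        fields["channel"] = mca_code & 0xF              # 0xF would mean "not specified"
    return fields

# One BANK 8 value and one BANK 13 value from the dump above.
for raw in ("cc00054000010090", "cc000086000800c0"):
    print(raw, decode_mci_status(int(raw, 16)))
```

If this reading is right, UC is clear in both sample values, which would mean the two BANK 8 events were corrected as well (reported as memory read errors rather than patrol scrub errors). OVER being set would mean some events were overwritten before they could be logged.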

Also, if ADDR is the memory address of the error, then all errors occur in the range 0x01A00000000 to 0x01CFFFFFFFF.
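
The range claim is easy to check mechanically. A minimal sketch, assuming ADDR really is a physical address printed in hex; the variable names are just for illustration:

```python
# ADDR values copied from the 16 mcelog lines above (hex, no 0x prefix).
ADDRS_HEX = """
1b93f085c0 1c56b48580 1a90c08480 1a90dc8480 1a91c08580 1a91dc8580
1a92ec8480 1a93ec8580 1a958c8580 1a99588580 1aa2e48480 1ab2e48480
1ad2e48480 1b12ec8480 1ba3e48580 1c7f3c8580
""".split()

addrs = [int(a, 16) for a in ADDRS_HEX]
lowest, highest = min(addrs), max(addrs)

# Zero-pad to 11 hex digits so the values line up with dmidecode's format.
print(f"lowest  0x{lowest:011X}")
print(f"highest 0x{highest:011X}")
print("all inside quoted range:",
      all(0x01A00000000 <= a <= 0x01CFFFFFFFF for a in addrs))
```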

I followed the hint from the post linked above and used dmidecode to try to identify the exact DIMM. I found that dmidecode lists memory twice: once as purely physical devices, then again as memory mapped devices. Each time it also splits them into 2 groups of 4 (my board has 8 slots). So here are the relevant 4 chunks.
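
For anyone who would rather not read the dump by eye, here is a rough sketch of splitting dmidecode's text output into records keyed by handle. It assumes the plain "Handle ..., DMI type ..., N bytes" layout shown in this post, and `parse_dmidecode` is a made-up helper, not something dmidecode provides.

```python
import re

# Matches record headers such as "Handle 0x002F, DMI type 17, 40 bytes".
HEADER = re.compile(r"^Handle (0x[0-9A-Fa-f]+), DMI type (\d+), \d+ bytes$")

def parse_dmidecode(text: str) -> dict:
    """Return {handle: {"type": int, "name": str, "fields": {key: value}}}."""
    records = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        match = HEADER.match(stripped)
        if match:
            current = {"type": int(match.group(2)), "name": None, "fields": {}}
            records[int(match.group(1), 16)] = current
        elif current is not None and stripped:
            if current["name"] is None:
                current["name"] = stripped              # e.g. "Memory Device"
            elif ":" in stripped:
                # Split on the first colon only, so values may contain colons.
                key, _, value = stripped.partition(":")
                current["fields"][key.strip()] = value.strip()
    return records

# Two records trimmed from the output in this post.
sample = """\
Handle 0x002F, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x002B
        Locator: DIMMB2
        Bank Locator: P0_Node0_Channel1_Dimm1
        Asset Tag: DIMMB2_AssetTag (date:18/30)

Handle 0x0039, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x01800000000
        Ending Address: 0x01FFFFFFFFF
        Physical Device Handle: 0x002F
"""
records = parse_dmidecode(sample)
print(records[0x0039]["fields"]["Physical Device Handle"])
```

Type 17 records are the physical devices and type 20 records are the mapped address ranges; the "Physical Device Handle" field is what ties one to the other.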

Physical devices - part 1 of 2. Device with handle 0x002B "contains" 4 x 32GB DIMMS with handles 0x002C-0x002F:

Code:
Handle 0x002B, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 256 GB
        Error Information Handle: Not Provided
        Number Of Devices: 4

Handle 0x002C, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2400 MT/s
        Manufacturer: Samsung
        Serial Number: 0372091A
        Asset Tag: DIMMA1_AssetTag (date:18/30)
        Part Number: M393A4K40CB1-CRC
        Rank: 2
        Configured Memory Speed: 2133 MT/s
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x002D, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMMA2
        Bank Locator: P0_Node0_Channel0_Dimm1
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2400 MT/s
        Manufacturer: Samsung
        Serial Number: 03720ACE
        Asset Tag: DIMMA2_AssetTag (date:18/30)
        Part Number: M393A4K40CB1-CRC
        Rank: 2
        Configured Memory Speed: 2133 MT/s

Handle 0x002E, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMMB1
        Bank Locator: P0_Node0_Channel1_Dimm0
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2400 MT/s
        Manufacturer: Samsung
        Serial Number: 03720914
        Asset Tag: DIMMB1_AssetTag (date:18/30)
        Part Number: M393A4K40CB1-CRC
        Rank: 2
        Configured Memory Speed: 2133 MT/s

Handle 0x002F, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMMB2
        Bank Locator: P0_Node0_Channel1_Dimm1
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2400 MT/s
        Manufacturer: Samsung
        Serial Number: 03720AD9
        Asset Tag: DIMMB2_AssetTag (date:18/30)
        Part Number: M393A4K40CB1-CRC
        Rank: 2
        Configured Memory Speed: 2133 MT/s


Physical devices - part 2 of 2. Device with handle 0x0030 "contains" 4 x 32GB DIMMS with handles 0x0031-0x0034:

Code:
Handle 0x0030, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 256 GB
        Error Information Handle: Not Provided
        Number Of Devices: 4

Handle 0x0031, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0030
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMMC1
        Bank Locator: P0_Node0_Channel2_Dimm0
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2400 MT/s
        Manufacturer: Samsung
        Serial Number: 22966F57
        Asset Tag: DIMMC1_AssetTag (date:18/17)
        Part Number: M393A4K40CB1-CRC
        Rank: 2
        Configured Memory Speed: 2133 MT/s

Handle 0x0032, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0030
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMMC2
        Bank Locator: P0_Node0_Channel2_Dimm1
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2400 MT/s
        Manufacturer: Samsung
        Serial Number: 17FDED6B
        Asset Tag: DIMMC2_AssetTag (date:17/02)
        Part Number: M393A4K40BB1-CRC
        Rank: 2
        Configured Memory Speed: 2133 MT/s

Handle 0x0033, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0030
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMMD1
        Bank Locator: P0_Node0_Channel3_Dimm0
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2400 MT/s
        Manufacturer: Micron
        Serial Number: 0FEE0901
        Asset Tag: DIMMD1_AssetTag (date:15/26)
        Part Number: 36ASF4G72PZ-2G3A1
        Rank: 2
        Configured Memory Speed: 2133 MT/s

Handle 0x0034, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0030
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMMD2
        Bank Locator: P0_Node0_Channel3_Dimm1
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2400 MT/s
        Manufacturer: Micron
        Serial Number: 0FEE08B7
        Asset Tag: DIMMD2_AssetTag (date:15/26)
        Part Number: 36ASF4G72PZ-2G3A1
        Rank: 2
        Configured Memory Speed: 2133 MT/s


Logical memory mapped devices - part 1 of 2. Device with handle 0x0035 (=physical device 0x002B) "contains" 4 x 32GB DIMMS with handles 0x0036-0x0039 (=physical devices 0x002C-0x002F):

Code:
Handle 0x0035, DMI type 19, 31 bytes
Memory Array Mapped Address
        Starting Address: 0x00000000000
        Ending Address: 0x01FFFFFFFFF
        Range Size: 128 GB
        Physical Array Handle: 0x002B
        Partition Width: 4

Handle 0x0036, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x00000000000
        Ending Address: 0x007FFFFFFFF
        Range Size: 32 GB
        Physical Device Handle: 0x002C
        Memory Array Mapped Address Handle: 0x0035
        Partition Row Position: 1

Handle 0x0037, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x00800000000
        Ending Address: 0x00FFFFFFFFF
        Range Size: 32 GB
        Physical Device Handle: 0x002D
        Memory Array Mapped Address Handle: 0x0035
        Partition Row Position: 1

Handle 0x0038, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x01000000000
        Ending Address: 0x017FFFFFFFF
        Range Size: 32 GB
        Physical Device Handle: 0x002E
        Memory Array Mapped Address Handle: 0x0035
        Partition Row Position: 1

Handle 0x0039, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x01800000000
        Ending Address: 0x01FFFFFFFFF
        Range Size: 32 GB
        Physical Device Handle: 0x002F
        Memory Array Mapped Address Handle: 0x0035
        Partition Row Position: 1


Logical memory mapped devices - part 2 of 2. Device with handle 0x003A (=physical device 0x0030) "contains" 4 x 32GB DIMMS with handles 0x003B-0x003E (=physical devices 0x0031-0x0034):

Code:
Handle 0x003A, DMI type 19, 31 bytes
Memory Array Mapped Address
        Starting Address: 0x02000000000
        Ending Address: 0x03FFFFFFFFF
        Range Size: 128 GB
        Physical Array Handle: 0x0030
        Partition Width: 4

Handle 0x003B, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x02000000000
        Ending Address: 0x027FFFFFFFF
        Range Size: 32 GB
        Physical Device Handle: 0x0031
        Memory Array Mapped Address Handle: 0x003A
        Partition Row Position: 1

Handle 0x003C, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x02800000000
        Ending Address: 0x02FFFFFFFFF
        Range Size: 32 GB
        Physical Device Handle: 0x0032
        Memory Array Mapped Address Handle: 0x003A
        Partition Row Position: 1

Handle 0x003D, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x03000000000
        Ending Address: 0x037FFFFFFFF
        Range Size: 32 GB
        Physical Device Handle: 0x0033
        Memory Array Mapped Address Handle: 0x003A
        Partition Row Position: 1

Handle 0x003E, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x03800000000
        Ending Address: 0x03FFFFFFFFF
        Range Size: 32 GB
        Physical Device Handle: 0x0034
        Memory Array Mapped Address Handle: 0x003A
        Partition Row Position: 1


From here, I'd really appreciate a check on whether I understand this correctly:


  • mcelog shows that all errors occur with addresses in the range 0x01A00000000 to 0x01CFFFFFFFF.
  • The 2nd set of dmidecode data, memory mapped devices, suggests this should be mapped to memory mapped device handle 0x0039, which covers addresses 0x01800000000 to 0x01FFFFFFFFF, and corresponds to physical device 0x002F.
  • The 1st set of dmidecode information then finally tells us that physical device 0x002F corresponds to DIMMB2, located at P0_Node0_Channel1_Dimm1, and is a Samsung M393A4K40CB1-CRC serial number 03720AD9.

In theory, this should be the DIMM throwing the errors.
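
The three steps above can be sketched as a simple table lookup. This assumes the type 20 ranges are a literal, linear map of physical addresses to DIMMs, which is exactly the part I'm unsure about: if the memory controller interleaves across channels, the ranges the firmware reports may not reflect where an address really lives. The table is copied from the dmidecode output above; `dimm_for` is a hypothetical helper.

```python
# (start, end, physical device handle), from the DMI type 20 records above.
MAPPED = [
    (0x00000000000, 0x007FFFFFFFF, 0x002C),
    (0x00800000000, 0x00FFFFFFFFF, 0x002D),
    (0x01000000000, 0x017FFFFFFFF, 0x002E),
    (0x01800000000, 0x01FFFFFFFFF, 0x002F),
    (0x02000000000, 0x027FFFFFFFF, 0x0031),
    (0x02800000000, 0x02FFFFFFFFF, 0x0032),
    (0x03000000000, 0x037FFFFFFFF, 0x0033),
    (0x03800000000, 0x03FFFFFFFFF, 0x0034),
]

# Physical device handle -> Locator, from the DMI type 17 records above.
LOCATOR = {
    0x002C: "DIMMA1", 0x002D: "DIMMA2", 0x002E: "DIMMB1", 0x002F: "DIMMB2",
    0x0031: "DIMMC1", 0x0032: "DIMMC2", 0x0033: "DIMMD1", 0x0034: "DIMMD2",
}

def dimm_for(addr: int):
    """Return the Locator whose mapped range contains addr, or None."""
    for start, end, phys_handle in MAPPED:
        if start <= addr <= end:
            return LOCATOR[phys_handle]
    return None

# Lowest and highest ADDR values seen in the mcelog output.
for addr in (0x1A90C08480, 0x1C7F3C8580):
    print(f"0x{addr:011X} -> {dimm_for(addr)}")
```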

The "BANK" (8 or 13) in the mcelog output seems irrelevant or unhelpful; I didn't understand how that ties in, if it does.

Question - is my understanding correct?

My planned action:


Because some mcelog errors aren't confirmed to have been corrected, and because I don't like dodgy equipment anyway, I plan to (1) swap that DIMM with another and check that future errors still come from the same DIMM, (2) also run memtest and see if that DIMM is the one it reports and the errors still follow the DIMM, and if so, (3) chuck and replace.

Does that sound sensible? Anything else I should do?
 