I *think* I now understand what's going on, but it's a bit of a reach, so I wouldn't mind a sanity check. This may also help others, so I'll detail how I got there.
I was guided by Google searches for stuff like "FreeBSD MCA GenuineIntel" (with slight variations for whatever AMD CPUs show up, if needed), which led me to this thread in particular, and also this one.
TL;DR version: mcelog will dump the error messages, but for me at least it wouldn't decode the actual DIMM reference. dmidecode appears to provide the link, but I would really like my understanding checked.
Data output from mcelog and dmidecode:
Neatened mcelog output (removing obvious stuff that wasn't helpful):
Code:
MCE 0 BANK 8 TSC 184c510777e37a MISC 50525286 ADDR 1b93f085c0 STATUS cc00054000010090 MCGSTATUS 0 MCGCAP 7000c16
MCE 1 BANK 8 TSC 184d3c84b20d17 MISC 140484886 ADDR 1c56b48580 STATUS cc000cc000010090 MCGSTATUS 0 MCGCAP 7000c16
MCE 2 BANK 13 TSC 184d3c84b29d53 MISC 1221022022020c8c ADDR 1a90c08480 Corrected patrol scrub error STATUS cc000086000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 3 BANK 13 TSC 184d3c8c6fbc41 MISC 1221200202220c8c ADDR 1a90dc8480 Corrected patrol scrub error STATUS cc000386000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 4 BANK 13 TSC 184d3c9234f67e MISC 1221002022020c8c ADDR 1a91c08580 Corrected patrol scrub error STATUS cc000086000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 5 BANK 13 TSC 184d3c961c0d63 MISC 1221022022020c8c ADDR 1a91dc8580 Corrected patrol scrub error STATUS cc000346000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 6 BANK 13 TSC 184d3ca2089cc6 MISC 1221002022000c8c ADDR 1a92ec8480 Corrected patrol scrub error STATUS cc000206000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 7 BANK 13 TSC 184d3caf8a5dbd MISC 1221200222020c8c ADDR 1a93ec8580 Corrected patrol scrub error STATUS cc000406000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 8 BANK 13 TSC 184d3cc57cecf6 MISC 1221020222020c8c ADDR 1a958c8580 Corrected patrol scrub error STATUS cc000806000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 9 BANK 13 TSC 184d3cf8c4ddd0 MISC 1221000200220c8c ADDR 1a99588580 Corrected patrol scrub error STATUS cc001006000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 10 BANK 13 TSC 184d3d79b4d33c MISC 1221022022000c8c ADDR 1aa2e48480 Corrected patrol scrub error STATUS cc002046000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 11 BANK 13 TSC 184d3e51cec069 MISC 1221022222020c8c ADDR 1ab2e48480 Corrected patrol scrub error STATUS cc004006000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 12 BANK 13 TSC 184d4002010e52 MISC 1221002202020c8c ADDR 1ad2e48480 Corrected patrol scrub error STATUS cc008006000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 13 BANK 13 TSC 184d4362d0b8a9 MISC 1221022222000c8c ADDR 1b12ec8480 Corrected patrol scrub error STATUS cc010046000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 14 BANK 13 TSC 184d4b08c96bc1 MISC 1221202022020c8c ADDR 1ba3e48580 Corrected patrol scrub error STATUS cc020006000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 15 BANK 13 TSC 1855cfebb342be MISC 1221222200220c8c ADDR 1c7f3c8580 Corrected patrol scrub error STATUS cc036146000800c0 MCGSTATUS 0 MCGCAP 7000c16
Most, but not all, are "corrected patrol scrubs", suggesting they were fixed internally.
Also, if ADDR is the memory address of the error, then all errors occur in the range 0x01A00000000 to 0x01CFFFFFFFF.
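A minimal Python sketch of those two checks, using four of the lines above verbatim (MCE 0, 1, 2 and 15, which include the lowest and highest ADDR values). It assumes STATUS is the raw IA32_MCi_STATUS register value and that bit 61 is the UC (uncorrected) flag, which is my reading of Intel's register layout and worth verifying. The regular expression and the names parse, address_range and UC_BIT are only illustrative.

```python
import re

# Four lines copied verbatim from the mcelog output above (MCE 0, 1, 2, 15).
LOG = """\
MCE 0 BANK 8 TSC 184c510777e37a MISC 50525286 ADDR 1b93f085c0 STATUS cc00054000010090 MCGSTATUS 0 MCGCAP 7000c16
MCE 1 BANK 8 TSC 184d3c84b20d17 MISC 140484886 ADDR 1c56b48580 STATUS cc000cc000010090 MCGSTATUS 0 MCGCAP 7000c16
MCE 2 BANK 13 TSC 184d3c84b29d53 MISC 1221022022020c8c ADDR 1a90c08480 Corrected patrol scrub error STATUS cc000086000800c0 MCGSTATUS 0 MCGCAP 7000c16
MCE 15 BANK 13 TSC 1855cfebb342be MISC 1221222200220c8c ADDR 1c7f3c8580 Corrected patrol scrub error STATUS cc036146000800c0 MCGSTATUS 0 MCGCAP 7000c16
"""

# Pull out the record number, bank, ADDR and STATUS fields (hex without 0x).
LINE = re.compile(r"MCE (\d+) BANK (\d+) .*?ADDR ([0-9a-f]+) .*?STATUS ([0-9a-f]+)")

# Assumption: bit 61 of the status word is the "uncorrected error" flag.
UC_BIT = 1 << 61


def parse(log):
    """Return one dict per mcelog line that matches the pattern."""
    records = []
    for line in log.splitlines():
        m = LINE.search(line)
        if m:
            records.append({
                "mce": int(m.group(1)),
                "bank": int(m.group(2)),
                "addr": int(m.group(3), 16),
                "status": int(m.group(4), 16),
            })
    return records


def address_range(records):
    """Lowest and highest ADDR seen."""
    addrs = [r["addr"] for r in records]
    return min(addrs), max(addrs)


records = parse(LOG)
low, high = address_range(records)
uncorrected = [r["mce"] for r in records if r["status"] & UC_BIT]
print(f"ADDR range: {low:#x} .. {high:#x}")
print("entries with UC bit set:", uncorrected)
```

If the UC assumption holds, an empty list here would mean the BANK 8 entries were corrected too, even though mcelog printed no description for them.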
I followed the hint from the post linked above, and used dmidecode to try and identify the exact DIMM. I found that dmidecode lists memory twice: once as purely physical devices, then again as memory mapped devices. It also split them into 2 groups of 4 each time (my board has 8 slots). So here are the relevant 4 chunks.
Physical devices - part 1 of 2. Device with handle 0x002B "contains" 4 x 32GB DIMMS with handles 0x002C-0x002F:
Code:
Handle 0x002B, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 256 GB
Error Information Handle: Not Provided
Number Of Devices: 4
Handle 0x002C, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002B
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMMA1
Bank Locator: P0_Node0_Channel0_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MT/s
Manufacturer: Samsung
Serial Number: 0372091A
Asset Tag: DIMMA1_AssetTag (date:18/30)
Part Number: M393A4K40CB1-CRC
Rank: 2
Configured Memory Speed: 2133 MT/s
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x002D, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002B
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMMA2
Bank Locator: P0_Node0_Channel0_Dimm1
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MT/s
Manufacturer: Samsung
Serial Number: 03720ACE
Asset Tag: DIMMA2_AssetTag (date:18/30)
Part Number: M393A4K40CB1-CRC
Rank: 2
Configured Memory Speed: 2133 MT/s
Handle 0x002E, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002B
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMMB1
Bank Locator: P0_Node0_Channel1_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MT/s
Manufacturer: Samsung
Serial Number: 03720914
Asset Tag: DIMMB1_AssetTag (date:18/30)
Part Number: M393A4K40CB1-CRC
Rank: 2
Configured Memory Speed: 2133 MT/s
Handle 0x002F, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002B
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMMB2
Bank Locator: P0_Node0_Channel1_Dimm1
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MT/s
Manufacturer: Samsung
Serial Number: 03720AD9
Asset Tag: DIMMB2_AssetTag (date:18/30)
Part Number: M393A4K40CB1-CRC
Rank: 2
Configured Memory Speed: 2133 MT/s
Physical devices - part 2 of 2. Device with handle 0x0030 "contains" 4 x 32GB DIMMS with handles 0x0031-0x0034:
Code:
Handle 0x0030, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 256 GB
Error Information Handle: Not Provided
Number Of Devices: 4
Handle 0x0031, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0030
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMMC1
Bank Locator: P0_Node0_Channel2_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MT/s
Manufacturer: Samsung
Serial Number: 22966F57
Asset Tag: DIMMC1_AssetTag (date:18/17)
Part Number: M393A4K40CB1-CRC
Rank: 2
Configured Memory Speed: 2133 MT/s
Handle 0x0032, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0030
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMMC2
Bank Locator: P0_Node0_Channel2_Dimm1
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MT/s
Manufacturer: Samsung
Serial Number: 17FDED6B
Asset Tag: DIMMC2_AssetTag (date:17/02)
Part Number: M393A4K40BB1-CRC
Rank: 2
Configured Memory Speed: 2133 MT/s
Handle 0x0033, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0030
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMMD1
Bank Locator: P0_Node0_Channel3_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MT/s
Manufacturer: Micron
Serial Number: 0FEE0901
Asset Tag: DIMMD1_AssetTag (date:15/26)
Part Number: 36ASF4G72PZ-2G3A1
Rank: 2
Configured Memory Speed: 2133 MT/s
Handle 0x0034, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0030
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMMD2
Bank Locator: P0_Node0_Channel3_Dimm1
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MT/s
Manufacturer: Micron
Serial Number: 0FEE08B7
Asset Tag: DIMMD2_AssetTag (date:15/26)
Part Number: 36ASF4G72PZ-2G3A1
Rank: 2
Configured Memory Speed: 2133 MT/s
Logical memory mapped devices - part 1 of 2. Device with handle 0x0035 (=physical device 0x002B) "contains" 4 x 32GB DIMMS with handles 0x0036-0x0039 (=physical devices 0x002C-0x002F):
Code:
Handle 0x0035, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Array Handle: 0x002B
Partition Width: 4
Handle 0x0036, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x007FFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x002C
Memory Array Mapped Address Handle: 0x0035
Partition Row Position: 1
Handle 0x0037, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00800000000
Ending Address: 0x00FFFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x002D
Memory Array Mapped Address Handle: 0x0035
Partition Row Position: 1
Handle 0x0038, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x01000000000
Ending Address: 0x017FFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x002E
Memory Array Mapped Address Handle: 0x0035
Partition Row Position: 1
Handle 0x0039, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x01800000000
Ending Address: 0x01FFFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x002F
Memory Array Mapped Address Handle: 0x0035
Partition Row Position: 1
Logical memory mapped devices - part 2 of 2. Device with handle 0x003A (=physical device 0x0030) "contains" 4 x 32GB DIMMS with handles 0x003B-0x003E (=physical devices 0x0031-0x0034):
Code:
Handle 0x003A, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x02000000000
Ending Address: 0x03FFFFFFFFF
Range Size: 128 GB
Physical Array Handle: 0x0030
Partition Width: 4
Handle 0x003B, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x02000000000
Ending Address: 0x027FFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x0031
Memory Array Mapped Address Handle: 0x003A
Partition Row Position: 1
Handle 0x003C, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x02800000000
Ending Address: 0x02FFFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x0032
Memory Array Mapped Address Handle: 0x003A
Partition Row Position: 1
Handle 0x003D, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x03000000000
Ending Address: 0x037FFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x0033
Memory Array Mapped Address Handle: 0x003A
Partition Row Position: 1
Handle 0x003E, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x03800000000
Ending Address: 0x03FFFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x0034
Memory Array Mapped Address Handle: 0x003A
Partition Row Position: 1
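The handle cross-references in these chunks can be followed mechanically. Here is a rough Python sketch, assuming the dmidecode text has the "Handle ..." header line followed by "Key: Value" lines as shown above; SAMPLE is a trimmed copy of two of the records, and parse_records and dimm_for_mapping are hypothetical helper names, not anything dmidecode provides.

```python
import re

# Trimmed copy of two records from the dmidecode output above.
SAMPLE = """\
Handle 0x002F, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002B
Size: 32 GB
Locator: DIMMB2
Bank Locator: P0_Node0_Channel1_Dimm1
Serial Number: 03720AD9
Handle 0x0039, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x01800000000
Ending Address: 0x01FFFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x002F
Memory Array Mapped Address Handle: 0x0035
"""

HEADER = re.compile(r"Handle (0x[0-9A-Fa-f]+), DMI type (\d+)")


def parse_records(text):
    """Map each handle (as an int) to a dict of its fields."""
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()  # real dmidecode output indents the fields
        m = HEADER.match(line)
        if m:
            current = {"type": int(m.group(2))}
            records[int(m.group(1), 16)] = current
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return records


def dimm_for_mapping(records, mapped_handle):
    """Follow a type 20 record to the type 17 device it points at."""
    mapped = records[mapped_handle]
    physical = records[int(mapped["Physical Device Handle"], 16)]
    return physical["Locator"], physical["Serial Number"]


records = parse_records(SAMPLE)
print(dimm_for_mapping(records, 0x0039))
```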
From here, I'd really appreciate someone checking whether I understand this correctly:
- mcelog shows that all errors occur with addresses in the range 0x01A00000000 to 0x01CFFFFFFFF.
- The 2nd set of dmidecode data, memory mapped devices, suggests this should be mapped to memory mapped device handle 0x0039, which covers addresses 0x01800000000 to 0x01FFFFFFFFF, and corresponds to physical device 0x002F.
- The 1st set of dmidecode information then finally tells us that physical device 0x002F corresponds to DIMMB2, located at P0_Node0_Channel1_Dimm1, and is a Samsung M393A4K40CB1-CRC, serial number 03720AD9.
In theory, this should be the DIMM throwing the errors.
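The lookup described in the steps above can be sketched in Python like this. The table is copied from the type 20 and type 17 records; the big assumption is that those ranges describe a plain linear mapping, so if the BIOS interleaves memory across channels, a physical address may not belong to a single DIMM the way this suggests. MAPPED, LOCATORS and locate are illustrative names.

```python
# (start, end, mapped-device handle, physical-device handle), from the
# "Memory Device Mapped Address" (DMI type 20) records above.
MAPPED = [
    (0x00000000000, 0x007FFFFFFFF, 0x0036, 0x002C),
    (0x00800000000, 0x00FFFFFFFFF, 0x0037, 0x002D),
    (0x01000000000, 0x017FFFFFFFF, 0x0038, 0x002E),
    (0x01800000000, 0x01FFFFFFFFF, 0x0039, 0x002F),
    (0x02000000000, 0x027FFFFFFFF, 0x003B, 0x0031),
    (0x02800000000, 0x02FFFFFFFFF, 0x003C, 0x0032),
    (0x03000000000, 0x037FFFFFFFF, 0x003D, 0x0033),
    (0x03800000000, 0x03FFFFFFFFF, 0x003E, 0x0034),
]

# Physical-device handle -> Locator, from the "Memory Device" (type 17) records.
LOCATORS = {
    0x002C: "DIMMA1", 0x002D: "DIMMA2", 0x002E: "DIMMB1", 0x002F: "DIMMB2",
    0x0031: "DIMMC1", 0x0032: "DIMMC2", 0x0033: "DIMMD1", 0x0034: "DIMMD2",
}

GIB = 2 ** 30


def locate(addr):
    """Return (mapped handle, physical handle, locator) for an address,
    or None if no listed range contains it."""
    for start, end, mapped_handle, physical_handle in MAPPED:
        if start <= addr <= end:
            return mapped_handle, physical_handle, LOCATORS[physical_handle]
    return None


# Sanity check: every range is the 32 GB that dmidecode reports.
assert all(end - start + 1 == 32 * GIB for start, end, _, _ in MAPPED)

# Lowest and highest ADDR values from the mcelog output.
for addr in (0x1A90C08480, 0x1C7F3C8580):
    mapped_handle, physical_handle, locator = locate(addr)
    print(f"{addr:#x} -> {mapped_handle:#06x} -> {physical_handle:#06x} -> {locator}")
```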
The "BANK" (8 or 13) in mcelog output seems irrelevant or unhelpful; I didn't understand how that ties in, if it does.
Question - is my understanding correct?
My planned action:
Because some mcelog errors aren't confirmed to have been corrected, and because I don't like dodgy equipment anyway, I plan to (1) swap that DIMM with another and check that future errors follow the DIMM to its new slot, (2) also run memtest and see whether it reports the same DIMM, with the errors still following it, and if so, (3) chuck and replace it.
Does that sound sensible? Anything else I should do?