freenas MCA: CPU 0 COR OVER RD channel 1 memory error

mdreed · Dec 6, 2020

Hi folks,

My FreeNAS server has been throwing MCA errors that seem to be related to ECC memory. I've ordered a replacement DIMM, but I'm wondering if it's possible to identify exactly which module I need to replace. My server is an ASRock C2550D4I with 16 GB of ECC memory and 2x5 TB & 2x3 TB WD Red NAS hard drives.

The recent errors are here. An example is:

Code:

Dec  1 23:28:06 freenas MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
Dec  1 23:28:06 freenas MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
Dec  1 23:28:06 freenas MCA: CPU 0 COR OVER RD channel 0 memory error
Dec  1 23:28:06 freenas MCA: Address 0x5bb5c258

Running dmidecode gives this data. What I believe is the relevant part is here:

Code:

Handle 0x0020, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x001E
    Error Information Handle: Not Provided
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 4096 MB
    Form Factor: DIMM
    Set: None
    Locator: DIMM0
    Bank Locator: BANK 0
    Type: DDR3
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 1600 MT/s
    Manufacturer: Micron
    Serial Number: 12231724
    Asset Tag: KBANK 0 DIMM0 AssetTag
    Part Number: 18KSF51272AZ-1G6K
    Rank: 2
    Configured Memory Speed: 1600 MT/s
 
Handle 0x0021, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00000000000
    Ending Address: 0x000FFFFFFFF
    Range Size: 4 GB
    Physical Device Handle: 0x0020
    Memory Array Mapped Address Handle: 0x001F
    Partition Row Position: 1
 
Handle 0x0022, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x001E
    Error Information Handle: Not Provided
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 4096 MB
    Form Factor: DIMM
    Set: None
    Locator: DIMM0
    Bank Locator: BANK 1
    Type: DDR3
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 1600 MT/s
    Manufacturer: Micron
    Serial Number: 13106656
    Asset Tag: KBANK 1 DIMM0 AssetTag
    Part Number: 18KSF51272AZ-1G6K
    Rank: 2
    Configured Memory Speed: 1600 MT/s
 
Handle 0x0023, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00100000000
    Ending Address: 0x001FFFFFFFF
    Range Size: 4 GB
    Physical Device Handle: 0x0022
    Memory Array Mapped Address Handle: 0x001F
    Partition Row Position: 1
 
Handle 0x0024, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x001E
    Error Information Handle: Not Provided
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 4096 MB
    Form Factor: DIMM
    Set: None
    Locator: DIMM1
    Bank Locator: BANK 0
    Type: DDR3
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 1600 MT/s
    Manufacturer: Micron
    Serial Number: 12231069
    Asset Tag: KBANK 0 DIMM1 AssetTag
    Part Number: 18KSF51272AZ-1G6K
    Rank: 2
    Configured Memory Speed: 1600 MT/s
 
Handle 0x0025, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00200000000
    Ending Address: 0x002FFFFFFFF
    Range Size: 4 GB
    Physical Device Handle: 0x0024
    Memory Array Mapped Address Handle: 0x001F
    Partition Row Position: 1
 
Handle 0x0026, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x001E
    Error Information Handle: Not Provided
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 4096 MB
    Form Factor: DIMM
    Set: None
    Locator: DIMM1
    Bank Locator: BANK 1
    Type: DDR3
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 1600 MT/s
    Manufacturer: Micron
    Serial Number: 13106655
    Asset Tag: KBANK 1 DIMM1 AssetTag
    Part Number: 18KSF51272AZ-1G6K
    Rank: 2
    Configured Memory Speed: 1600 MT/s
 
Handle 0x0027, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00300000000
    Ending Address: 0x003FFFFFFFF
    Range Size: 4 GB
    Physical Device Handle: 0x0026
    Memory Array Mapped Address Handle: 0x001F
    Partition Row Position: 1

So a few questions:
1) Is it likely that replacing a single DIMM will solve this problem, or can it be determined from these logs that the problem lies with the motherboard?
2) If it is indeed a single DIMM, which one should I swap out? Or do I need to just guess and check?

Thank you for any help!

mdreed · Dec 6, 2020

If it matters, the bulk of the errors happening on Dec 1 coincided with a scrub of my zfs pool.
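
To see how the errors cluster (for example, whether they really bunch up around the Dec 1 scrub), the MCA lines can be tallied by day and channel. A rough Python sketch, assuming the syslog format shown in the first post; the regular expressions and the `tally` helper are made up for illustration and are not part of FreeNAS:

```python
import re
from collections import Counter

# Sample lines copied from the syslog excerpt in the first post.
LOG = """\
Dec  1 23:28:06 freenas MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
Dec  1 23:28:06 freenas MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
Dec  1 23:28:06 freenas MCA: CPU 0 COR OVER RD channel 0 memory error
Dec  1 23:28:06 freenas MCA: Address 0x5bb5c258
"""

# Hypothetical patterns, written against the log format above.
ERROR_RE = re.compile(
    r"^(\w{3}\s+\d+) [\d:]+ \S+ MCA: CPU \d+ COR .*channel (\d+) memory error"
)
ADDR_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")


def tally(log_text):
    """Count corrected memory errors per (day, channel) and collect addresses."""
    per_day_channel = Counter()
    addresses = []
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            # Collapse the double space syslog uses for single-digit days.
            day = " ".join(match.group(1).split())
            per_day_channel[(day, int(match.group(2)))] += 1
            continue
        match = ADDR_RE.search(line)
        if match:
            addresses.append(int(match.group(1), 16))
    return per_day_channel, addresses


if __name__ == "__main__":
    counts, addrs = tally(LOG)
    for (day, channel), n in sorted(counts.items()):
        print(f"{day} channel {channel}: {n} corrected error(s)")
    print("addresses:", [hex(a) for a in addrs])
```

To run it against the real log, replace `LOG` with the contents of the messages file.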

revengineer · Dec 6, 2020

I recently went through this procedure twice because the used RAM I bought on eBay was crappy. In your case you are using ECC memory and ECC is doing its job correcting the errors, so all is well in the short term. In my case a single address was affected repeatedly (~daily). In your case multiple addresses are affected. However, they seem to be contained within the same module, located in Bank 5.

The most reliable way to find the memory slot on the motherboard, which corresponds to this Bank 5, is to look in the BIOS event log, which you may also be able to access via IPMI if your motherboard has this feature. The BIOS event log will give you the name of the slot, which you can then cross reference with the layout in the manual of your motherboard.

I initially went down the same road you did using dmidecode but did not have much success. dmidecode gives you a list of all the memory modules and there should be an entry for BANK 5. However, I do not see this in your pastebin output. In the list you posted, you see that there is an entry "Bank Locator", which refers to the bank number, and a "Locator", which should refer to the name of the memory slot. I write "should" because in my case this Locator pointed to a different slot. I decided on using the BIOS information and it was the correct choice. If your BIOS does not offer this information and you still get errors after replacing the stick that dmidecode points to, you may need to try replacing each module, which can be tedious if you have lots of modules.
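
To get a quick overview of which "Locator" and "Bank Locator" values dmidecode actually reports, the type 17 entries can be pulled into a small table. This is only a sketch in Python: the `parse_memory_devices` helper is hypothetical, the sample is trimmed from the output in the first post, and as noted above the Locator string may not match the silkscreen label on the board.

```python
# Trimmed sample of `dmidecode --type 17` output from the first post.
SAMPLE = """\
Handle 0x0020, DMI type 17, 34 bytes
Memory Device
    Size: 4096 MB
    Locator: DIMM0
    Bank Locator: BANK 0
    Serial Number: 12231724

Handle 0x0022, DMI type 17, 34 bytes
Memory Device
    Size: 4096 MB
    Locator: DIMM0
    Bank Locator: BANK 1
    Serial Number: 13106656
"""


def parse_memory_devices(text):
    """Return one dict per 'Handle' block, keyed by the dmidecode field names."""
    devices = []
    current = None
    for line in text.splitlines():
        if line.startswith("Handle "):
            # "Handle 0x0020, DMI type 17, 34 bytes" -> "0x0020"
            current = {"Handle": line.split(",")[0].split()[1]}
            devices.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return devices


if __name__ == "__main__":
    for dev in parse_memory_devices(SAMPLE):
        print(dev["Handle"], dev["Locator"], dev["Bank Locator"], dev["Serial Number"])
```

The serial numbers are the useful part here: they are printed on the module labels, so they let you confirm which physical stick an entry refers to even if the slot name is off.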

To determine whether only one module is affected, you should run memtest86. Note that you cannot use memtest86+ because this early fork is not maintained and cannot diagnose ECC errors. Memtest86 will tell you which modules are faulty. Note that the errors can be intermittent, so you should run at least 3 passes to find any errors.

mdreed · Dec 6, 2020

Thank you for the thoughtful reply, revengineer! I should understand "Bank 5" in the errors as referring to a specific DIMM? The memory address reported seems to kind of be all over the place.

My new DIMM is supposed to be arriving tomorrow. I'll report back.

revengineer · Dec 6, 2020

Wrt your question, the answer is "yes". The trick is to find the physical location that is referenced by the numbered bank.

mdreed · Dec 7, 2020

I've just looked at the BIOS event log and unfortunately it's not terribly helpful. It does show the events (labeled "Smbios 0x01" and described as "Single Bit ECC Memory Error"), but it doesn't give any more information beyond the date and time. In particular, it doesn't say anything about which module is causing the problem. Under the column labeled "Severity", which I've seen other interfaces apparently use to disclose the DIMM, it just says "N/A" (image attached).

I've also looked a bit more closely at the memory addresses reported in the MCA errors and cross-referenced them with dmidecode. If the addresses are to be believed, about half the errors are coming from KBANK 0 DIMM0 (addresses 0 to 4294967295) and the other half from KBANK 1 DIMM0 (4294967296 to 8589934591). Is it possible I have *two* DIMMs that need replacing?
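
For reference, the cross-check against the dmidecode ranges can be written down as a short Python sketch. The `RANGES` table is copied from the DMI type 20 entries in the first post and `locate` is a hypothetical helper. It assumes the MCA address can be compared directly against those firmware-reported ranges, which may not hold if the memory controller interleaves channels or remaps memory.

```python
# (start, end, label) from the "Memory Device Mapped Address" entries,
# matched to their modules via "Physical Device Handle".
RANGES = [
    (0x000000000, 0x0FFFFFFFF, "BANK 0 / DIMM0 (serial 12231724)"),
    (0x100000000, 0x1FFFFFFFF, "BANK 1 / DIMM0 (serial 13106656)"),
    (0x200000000, 0x2FFFFFFFF, "BANK 0 / DIMM1 (serial 12231069)"),
    (0x300000000, 0x3FFFFFFFF, "BANK 1 / DIMM1 (serial 13106655)"),
]


def locate(address):
    """Return the label of the DMI range containing `address`, or None."""
    for start, end, label in RANGES:
        if start <= address <= end:
            return label
    return None


if __name__ == "__main__":
    # Address taken from the example MCA error in the first post.
    print(hex(0x5BB5C258), "->", locate(0x5BB5C258))
```

The example address from the first post, 0x5bb5c258, lies below 4 GB and so lands in the first range (BANK 0 / DIMM0) under this assumption.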

no_connection · Dec 8, 2020

That could be side one and two of DIMM0? Could be power issue on the module or slot. Just guessing here.

Herr_Merlin · Dec 8, 2020

On some BIOSes you need to reset the state of the DIMM slot; otherwise it won't check whether the module is healthy again after replacement.

revengineer · Dec 8, 2020

Well, checking the BIOS first was worth a try. Next you can check the procedure linked here to chase down the DIMM using dmidecode commands. For my Supermicro motherboard, this yielded inaccurate results, but you have nothing to lose by trying. If this does not work, you will have to remove them one by one and test the sticks using memtest86. The free version seems to detect ECC errors but you cannot inject them (which you do not need). Memtest86 will also tell you how many modules are affected; I am not going to try to guess this from the addresses.

revengineer · Dec 8, 2020

no_connection said:
That could be side one and two of DIMM0? Could be power issue on the module or slot. Just guessing here.

Until excluded, I would go with the assumption that there is a single bad memory stick that needs replacing. I would not blame it on the slot especially if this worked in the past. When I first encountered the issue, someone stated that the memory controller on my CPU was bad, which it wasn't and which would have been a much more expensive fix. But the OP is at a point where he really just needs to start testing with memtest86 and swapping modules around to check the various failure modes.
