MCA: Bank 5, Status 0xd417ec8000900090

Status
Not open for further replies.

reqlez

Explorer
Joined
Mar 15, 2014
Messages
84
Hi. I got these errors today. It looks like something is wrong with my ECC RAM? I'm wondering how to proceed, based on your experience, and whether the errors can be decoded further. My system is an Avoton C2550 with 8GB of ECC RAM. I'm assuming "COR" means corrected, and this is why you run ECC RAM on a NAS? :)


nas01.local kernel log messages:
MCA: Bank 5, Status 0xd417ec8000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR OVER RD channel 0 memory error
MCA: Address 0x17f871cc0
MCA: Bank 5, Status 0xd41c8cc000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR OVER RD channel 0 memory error
MCA: Address 0x74d4bc98
MCA: Bank 5, Status 0xd41d40c000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR OVER RD channel 0 memory error
MCA: Address 0x200c8f8c0
MCA: Bank 5, Status 0xd4001d0000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 2
MCA: CPU 1 COR OVER RD channel 0 memory error
MCA: Address 0x26d277ac0
MCA: Bank 5, Status 0xd400118000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 4
MCA: CPU 2 COR OVER RD channel 0 memory error
MCA: Address 0x1f2037ee0
MCA: Bank 5, Status 0xd4001ec000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 6
MCA: CPU 3 COR OVER RD channel 0 memory error
MCA: Address 0x13068e8
MCA: Bank 5, Status 0xd415b84000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR OVER RD channel 0 memory error
MCA: Address 0xc8f8c0
MCA: Bank 5, Status 0xd400128000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 2
MCA: CPU 1 COR OVER RD channel 0 memory error
MCA: Address 0x26d277ac0
MCA: Bank 5, Status 0xd4000bc000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 4
MCA: CPU 2 COR OVER RD channel 0 memory error
MCA: Address 0xc8fa40
MCA: Bank 5, Status 0xd4000c0000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 6
MCA: CPU 3 COR OVER RD channel 0 memory error
MCA: Address 0x6d277ac0
MCA: Bank 5, Status 0xd417778000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR OVER RD channel 0 memory error
MCA: Address 0xc8f8c0
MCA: Bank 5, Status 0xd4000f0000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 2
MCA: CPU 1 COR OVER RD channel 0 memory error
MCA: Address 0x26d277ac0
MCA: Bank 5, Status 0xd4000c4000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 4
MCA: CPU 2 COR OVER RD channel 0 memory error
MCA: Address 0xc8fa40
MCA: Bank 5, Status 0xd4000d4000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 6
MCA: CPU 3 COR OVER RD channel 0 memory error
MCA: Address 0x26d277ac0
MCA: Bank 5, Status 0xd417e24000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR OVER RD channel 0 memory error
MCA: Address 0xc8f8c0
MCA: Bank 5, Status 0xd40012c000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 2
MCA: CPU 1 COR OVER RD channel 0 memory error
MCA: Address 0x26d277ac0
MCA: Bank 5, Status 0xd4000ac000900090
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 4
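
On whether the status words can be decoded further: a minimal sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 down to 57, and the low 16 bits as the MCA error code, where the pattern `0000 0000 1MMM CCCC` marks a memory controller error). The function name `decode_mci_status` and the dictionary keys are made up for illustration, and the model-specific bits in between are deliberately left alone.

```python
# Hypothetical helper: decode the architectural parts of an MCi_STATUS value.
# Bit positions are assumed from the Intel SDM machine-check chapter.
FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # uncorrected error; clear means corrected ("COR")
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MISC register valid
    "ADDRV": 58,  # ADDR register valid
    "PCC": 57,    # processor context corrupt
}

MEMORY_OPS = {0: "GEN", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}


def decode_mci_status(status: int) -> dict:
    result = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    code = status & 0xFFFF
    result["mca_error_code"] = code
    # Memory controller errors: 0000 0000 1MMM CCCC
    if code & 0xFF80 == 0x0080:
        result["memory_op"] = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        result["channel"] = None if channel == 0xF else channel
    return result


if __name__ == "__main__":
    for name, value in decode_mci_status(0xD417EC8000900090).items():
        print(name, value)
```

Run against the first status word in the log, this reports VAL, OVER, EN and ADDRV set with UC clear, a read operation, and channel 0, which matches the "COR OVER RD channel 0 memory error" line the kernel printed.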
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah. That's exactly how I'm taking it. It's finding errors and correcting them, then reporting tons of errors to you. So now the fun begins...

So ECC errors can be from RAM, from the RAM slot, from motherboard components failing, or from the PSU not providing clean power. Since *all* of the errors are on channel 0 and nowhere else, it's probably not the PSU.
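
To check the "all on channel 0" observation without eyeballing the log, something like the following could tally channels and repeated addresses. It is only a sketch: the `tally` function is a hypothetical helper, the regular expressions assume the exact message format shown in this thread, and `LOG` is a short excerpt of the lines posted above.

```python
import re
from collections import Counter

# Excerpt of the kernel log posted above (not the full log).
LOG = """\
MCA: CPU 0 COR OVER RD channel 0 memory error
MCA: Address 0x17f871cc0
MCA: CPU 1 COR OVER RD channel 0 memory error
MCA: Address 0x26d277ac0
MCA: CPU 2 COR OVER RD channel 0 memory error
MCA: Address 0xc8fa40
MCA: CPU 1 COR OVER RD channel 0 memory error
MCA: Address 0x26d277ac0
"""


def tally(log_text: str):
    """Count errors per memory channel and per physical address."""
    channels = Counter(
        int(ch) for ch in re.findall(r"channel (\d+) memory error", log_text)
    )
    addresses = Counter(
        int(addr, 16)
        for addr in re.findall(r"MCA: Address (0x[0-9a-fA-F]+)", log_text)
    )
    return channels, addresses


if __name__ == "__main__":
    channels, addresses = tally(LOG)
    print("errors per channel:", dict(channels))
    for addr, count in addresses.most_common():
        print(f"{addr:#x}: {count}")
```

A single channel in the tally, with a few addresses showing up again and again, points at one stick or one slot rather than something system-wide.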

So I'd shut down the server and start testing the RAM sticks one at a time with some kind of LiveCD. While it's statistically almost certainly safe to keep using your server in this condition, I don't exactly recommend it, for obvious reasons. But you need to identify whether it actually is a stick of RAM, then RMA it appropriately if you can.

Our first user to have their butt saved thanks to ECC. At least I can take some comfort that some people listen to me when I tell them to use ECC RAM.
 

reqlez

Explorer
Joined
Mar 15, 2014
Messages
84
Hi.

Thanks for the prompt reply.

Guess what, I'm even using the recommended minimum of RAM... Crazy! lol

No, I'm a big advocate of ECC RAM for mission-critical stuff. I mean, if you are storing porn that you can redownload and you don't care about the stuff on your server, go ahead and use non-ECC, but one day you will crash and burn! So expect to lose your porn ;-)

Yeah, weird, because I did 3 hours of memtest86+ on this RAM before setting up the pool... So much for memory testers! I'll run more memtest and look at the IPMI logs, I guess, but the memory is about a week old, so yes, I can RMA it.
 

reqlez

Explorer
Joined
Mar 15, 2014
Messages
84
Hi. Just wanted to post an update: after I reseated the RAM, I never got those errors again. But I'm still trying to figure out why none of the memory test tools found any errors. I also looked in the ASRock IPMI logs and found nothing there either.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ok, so ECC works at the hardware layer. The hardware, if it's doing its job, will correct the error, and the software test won't know because its reads and writes will be corrected before the software sees them. That's how ECC works. That's also precisely why ECC is recommended for ZFS. If your memory test was failing with ECC RAM you'd be in big trouble, because that would mean that ECC isn't correcting the errors AND it's not halting the system.
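
A toy illustration of why the memory tester sees nothing: the sketch below uses a Hamming(7,4) code, which is far smaller than the wider SECDED codes real ECC DIMMs use, and the function names `encode` and `read` are invented for this example. The point is only that a single flipped bit is repaired inside `read`, so the caller gets the right data back and the only trace is the "corrected" flag, which plays the role of the MCA log entry.

```python
def encode(nibble: int) -> list:
    """Store 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def read(word: list):
    """Return (data, corrected). A single flipped bit is fixed silently."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)  # 1-based position of the bad bit
    corrected = syndrome != 0
    if corrected:
        w[syndrome - 1] ^= 1
    data = w[2] | (w[4] << 1) | (w[5] << 2) | (w[6] << 3)
    return data, corrected


if __name__ == "__main__":
    stored = encode(0b1011)
    stored[4] ^= 1  # simulate one bit flipping in the DIMM
    data, corrected = read(stored)
    print(bin(data), corrected)
```

A memory test in this picture only ever looks at `data`, which comes back intact, so it passes. It takes the hardware reporting `corrected` (here, the MCA messages) to learn that anything happened.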
 