Supermicro H8DME-2 motherboard new memory error

Status
Not open for further replies.

AlphaG

Cadet
Joined
Oct 22, 2014
Messages
8
I have a stable FreeNAS installation that has been running for several months and that I update somewhat regularly. It is a media server with about 3 TB of data on it, but it is infrequently accessed. Over the last couple of weeks I have been getting a new error that seems to be a main memory error. The system has two 6-core AMD Opteron processors, and the 6th core appears to be kicking off a bank 4 memory error. I will try to replace the offending RAM sticks if that is the case. Is this error truly a "bad memory" error? From searches it seems this is a corrected error, so ECC is doing its job? I now get anywhere from a few to many of these notifications daily. Hopefully it is not a CPU memory error, in which case the CPU could be on its way out.

MCA: Bank 4, Status 0xd401c000d6080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Address 0xef24dd750
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd40140004b080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xe2b48d140
MCA: Bank 4, Status 0xd401c000d6080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xe2b48d150
MCA: Bank 4, Status 0xd400c0009d080813
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: CPU 6 COR OVER BUSLG Source RD Memory
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: Address 0xef24dd740
MCA: Bank 4, Status 0xd400c0009d080813
MCA: Address 0xef24dd740
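
Editor's sketch: the flags printed in the log (COR, OVER) can be cross-checked against the raw status words. The snippet below is a minimal sketch, assuming the status values follow the standard x86 machine-check status layout (VAL in bit 63, OVER in bit 62, UC in bit 61, and so on); the function name `decode_status` is made up for illustration, and the model-specific middle bits are not decoded. Note also that "Bank 4" in an MCA message refers to a machine-check register bank, which is not necessarily the same thing as a DIMM slot label.

```python
# Architectural flag bits in the upper part of a 64-bit machine-check status word.
# Assumption: the values in the log use this standard layout.
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten before it was read
    61: "UC",     # uncorrected error (clear means the error was corrected)
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # auxiliary MISC register is valid
    58: "ADDRV",  # address register is valid
    57: "PCC",    # processor context may be corrupt
}

def decode_status(status):
    """Return the set flags, a corrected/uncorrected verdict, and the low 16-bit error code."""
    set_flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": set_flags,
        "corrected": "UC" not in set_flags,
        "error_code": status & 0xFFFF,
    }

# The three distinct status words that appear in the log above.
for word in (0xd401c000d6080813, 0xd400c0009d080813, 0xd40140004b080813):
    info = decode_status(word)
    verdict = "corrected" if info["corrected"] else "UNCORRECTED"
    print(hex(word), info["flags"], verdict, hex(info["error_code"]))
```

Under that assumption, all three words have UC and PCC clear, which agrees with the "COR" in the log: the errors were reported as corrected. The OVER flag means at least one earlier error was overwritten before it was logged, so the real error count may be higher than the number of messages.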


cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Is this error truly a "bad memory" error?

No way to know. There isn't enough information because the percentage of users with AMDs is very small.

From searches it seems this is a corrected error, so ECC is doing its job? I now get anywhere from a few to many of these notifications daily. Hopefully it is not a CPU memory error, in which case the CPU could be on its way out.

Again, no way to know. There isn't enough information because the percentage of users with AMDs is very small.

On Intel platforms, you can simulate an ECC error, and those are logged and reported similarly to what you are seeing. But if the error can't be corrected with ECC, the system is halted to prevent corruption of the OS and data. Nobody here has the experience to confirm or deny that the same thing happens on AMD. One of the big reasons we don't recommend AMD is that there is no validation that any of this really works properly on the AMD platform. You are literally buying into a product that nobody can vouch for. So you may have rampant corruption going on right now and not know it yet. Or you might be getting those errors because ECC is correcting them.

Unfortunately, if you contact Supermicro or AMD, they are likely to tell you what you want to hear ("the data is safe") because they'd hate to admit their product has a problem. So take their answer with a pinch of salt. The only safe bet is to shut the system down while you obtain replacement parts.
 
Joined
Oct 2, 2014
Messages
925
I had a similar issue, where it was throwing all kinds of errors. I first went off what the FreeNAS email told me, which was BANK8. I pulled bank 8 and thought it was fixed, then it happened again. I was frustrated and plagued by it for a few weeks, until I started poking around and found a command in FreeNAS that shows how the banks are ACTUALLY labeled. I opened PuTTY, typed "dmidecode" and pressed Enter, and it returned which slots were populated, with what size DIMMs, how FreeNAS labeled each one (BANKx), and what its actual location was on the board.
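
Editor's sketch: the slot-to-bank mapping described above can be pulled out of `dmidecode -t memory` output with a few lines of Python. This is a minimal sketch, not a definitive tool: the helper name `parse_memory_devices` and the sample text are hypothetical, and the sample only imitates the usual "Memory Device" record layout (indented `Key: Value` lines, blank line between records). On a real system you would feed it the output of `dmidecode -t memory` run as root.

```python
# Hypothetical sample imitating `dmidecode -t memory` output; real labels vary by board.
SAMPLE = """\
Handle 0x0030, DMI type 17, 28 bytes
Memory Device
\tSize: 2048 MB
\tLocator: DIMM1A
\tBank Locator: BANK0

Handle 0x0031, DMI type 17, 28 bytes
Memory Device
\tSize: No Module Installed
\tLocator: DIMM1B
\tBank Locator: BANK1
"""

def parse_memory_devices(text):
    """Collect the Key: Value fields of every 'Memory Device' record into a dict."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}              # start of a new record
            devices.append(current)
        elif not stripped:
            current = None            # a blank line ends the record
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices

# Print bank label -> physical slot label -> module size for each record.
for dev in parse_memory_devices(SAMPLE):
    print(dev.get("Bank Locator"), "->", dev.get("Locator"), "->", dev.get("Size"))
```

The "Locator" field is normally the silkscreen label on the board, which is the one to trust when deciding which stick to pull.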
 