Memory Errors

Status
Not open for further replies.

Bulldog

Dabbler
Joined
Jan 16, 2016
Messages
18
I have been receiving memory errors like the following in the security run output email reports. They show up maybe once a week or every other week, with no consistent trend. I have run memtest86+, memtest86 Pro, etc. for well over 4 passes, multiple times, and never a single error is shown. I have checked the IPMI log as well and found nothing. The errors usually report different memory addresses, and I have linked them to the Micron DIMMs, but as far as memtest is concerned these DIMMs are perfectly fine.

> MCA: Bank 12, Status 0x8c000042000800c3
> MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 0
> MCA: CPU 0 COR (1) MS channel 3 memory error
> MCA: Address 0x8a92b2ec0
> MCA: Misc 0x90000008000848c
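
For reference, the Status value above can be unpacked by hand. This is a minimal Python sketch, assuming the bit layout of the machine-check status register as I understand it from the Intel SDM (valid bit 63, uncorrected bit 61, corrected error count in bits 52:38, memory controller error code `0000 0000 1MMM CCCC` in the low 16 bits). The function and dictionary names are made up for illustration and are not part of FreeBSD or any tool.

```python
# Assumed layout of IA32_MCi_STATUS (from the Intel SDM); names are hypothetical.
MEMORY_TRANSACTION_TYPES = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mca_status(status: int) -> dict:
    """Split a 64-bit MCA status value into the fields discussed above."""
    code = status & 0xFFFF  # MCA error code, bits 15:0
    result = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mca_error_code": code,
    }
    # Memory controller errors are encoded as 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        mmm = (code >> 4) & 0x7
        channel = code & 0xF
        result["transaction"] = MEMORY_TRANSACTION_TYPES.get(mmm, "reserved")
        # A channel field of 0xF means the channel is not specified.
        result["channel"] = None if channel == 0xF else channel
    return result


info = decode_mca_status(0x8C000042000800C3)
print(info)
```

Under those assumptions the value decodes as a valid, corrected error with a count of 1, reported during memory scrubbing on channel 3, which lines up with the "COR (1) MS channel 3 memory error" line in the report.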

Hardware is
FreeNAS-9.10.2-U6
Supermicro X10SRL-F
Xeon E5-1620 v3
2x 32GB ECC Samsung DIMMs M393A4K40BB0-CPB
2x 32GB ECC Micron DIMMs 36ASF4G72PZ-2G3A1
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I would suggest testing one module at a time. It can be difficult to isolate a defective DIMM, and memory testing utilities are sometimes not sensitive enough. I had a system at work (a Dell, under warranty) with a defective DIMM that the BIOS / EFI flagged, but the built-in diagnostics didn't show an error. Dell ended up replacing the DIMM 3 times before we got one that worked without the BIOS throwing an error. They replaced the system board too, because their tier 2 support couldn't believe that they had three bad DIMMs in a row and that the diagnostics didn't pick up on it but the BIOS did. Memory can be a hard thing to get right.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
And as Chris implied, it can be the board too.

Or just dust in a slot.
 

ech1965

Dabbler
Joined
Mar 16, 2018
Messages
17
Chris Moore said:
> I would suggest testing one module at a time. It can be difficult to isolate a defective DIMM, and memory testing utilities are sometimes not sensitive enough. I had a system at work (a Dell, under warranty) with a defective DIMM that the BIOS / EFI flagged, but the built-in diagnostics didn't show an error. Dell ended up replacing the DIMM 3 times before we got one that worked without the BIOS throwing an error. They replaced the system board too, because their tier 2 support couldn't believe that they had three bad DIMMs in a row and that the diagnostics didn't pick up on it but the BIOS did. Memory can be a hard thing to get right.

Another story of mine...

I spent 4 months trying to solve a "MEMORY ISSUE" on one of our DELL 430s:
The BIOS got tired of correcting ECC errors....
Swapped memory modules... still the same slot --> replaced the mainboard

The issue reappeared... Finally, the CPU was the culprit!!
It was hard to get them to acknowledge they needed to replace a CPU because of an ECC issue....
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
I've seen ECC errors from damaged CPU socket pins.

Imagine trying to debug without ECC...
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Stux said:
> Seen ECC errors from damaged cpu socket pins
>
> Imagine trying to debug without ECC...

Easy....
[Image: Iyf8wqG.jpg]
 

Bulldog

Dabbler
Joined
Jan 16, 2016
Messages
18
So at this point, I guess memtest86 does not work anymore for testing memory? With how infrequently I receive a single MCA error in FreeBSD, it's almost like it's FreeBSD's problem. It's odd that it bounces between the 2 Micron DIMMs only. The address never points to the 2 Samsung DIMMs. I wouldn't think having 2 different brands would cause this; they are both the correct DDR4 memory.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Bulldog said:
> So at this point, I guess memtest86 does not work anymore for testing memory? With how infrequently I receive a single MCA error in FreeBSD, it's almost like it's FreeBSD's problem. It's odd that it bounces between the 2 Micron DIMMs only. The address never points to the 2 Samsung DIMMs. I wouldn't think having 2 different brands would cause this; they are both the correct DDR4 memory.

If you get an error at all using the diagnostic, that is a win. Pull the two questionable (you said Micron) modules and get an RMA to have them replaced. Most memory makers have a lifetime warranty.
The only other thing you could do is test each of the suspect DIMMs individually. Just know that isolating a memory error can be very difficult. In another thread, last week I think, I described how it took over a week to get a memory problem corrected in a Dell system under warranty. In the end they sent 4 memory modules before we got one that would pass the system board self-test at boot.
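
When swapping or pulling modules one at a time, it can help to keep a running tally of which channel each reported error lands on. Below is a rough Python sketch that counts errors per CPU and channel from saved report text. It assumes the summary line always looks like the "MCA: CPU 0 COR (1) MS channel 3 memory error" line quoted earlier in this thread, and that uncorrected errors are marked "UNCOR"; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the summary line format quoted in this thread, e.g.
# "MCA: CPU 0 COR (1) MS channel 3 memory error".
# "UNCOR" for uncorrected errors is an assumption about the log format.
MCA_LINE = re.compile(
    r"MCA: CPU (?P<cpu>\d+) (?P<kind>COR|UNCOR)\b.*?channel (?P<channel>\d+) memory error"
)


def tally_by_channel(log_text: str) -> Counter:
    """Count reported memory errors per (cpu, channel) pair."""
    counts = Counter()
    for match in MCA_LINE.finditer(log_text):
        counts[(int(match["cpu"]), int(match["channel"]))] += 1
    return counts


sample = """\
> MCA: CPU 0 COR (1) MS channel 3 memory error
> MCA: CPU 0 COR (1) MS channel 3 memory error
> MCA: CPU 0 COR (1) MS channel 2 memory error
"""
for (cpu, channel), n in sorted(tally_by_channel(sample).items()):
    print(f"CPU {cpu} channel {channel}: {n} error(s)")
```

The sample text is made up for the example. A channel number does not name a physical slot by itself; which slot sits on which channel depends on the board, so check the motherboard manual before pulling a module. If the errors stay on the same channel after the DIMMs are moved, that points at the slot, board, or CPU rather than the module.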
 