Phantom RAM issue

Status
Not open for further replies.
Joined
Jun 6, 2018
Messages
9
Having an issue I can't figure out and wondering if someone can help me track the problem down.

Today I woke up to this security output e-mail from my setup:

Code:
> MCA: Bank 8, Status 0x88000040000200cf

> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000

> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0

> MCA: CPU 0 COR (1) MS channel ?? memory error

> MCA: Misc 0x4670220100081100

-- End of security output --
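
For anyone reading along, the Status value in that report can be decoded by hand. Below is a minimal sketch, assuming the standard Intel IA32_MCi_STATUS bit layout (valid bit 63, uncorrected bit 61, corrected-error count in bits 52:38, MCA error code in bits 15:0) and the memory-controller error-code pattern `000F 0000 1MMM CCCC`. The function name `decode_mci_status` and the dictionary keys are made up for illustration. Note that the "Bank" in an MCA line is the index of a machine-check register bank in the CPU, which is not necessarily the same numbering as the Bank Locator that dmidecode prints.

```python
# Sketch only: decodes a few architectural fields of an Intel MCi_STATUS value.
# Field positions are assumed from the generic IA32_MCi_STATUS layout; anything
# model-specific (bits 31:16, the Misc register) is left undecoded.

MEMORY_OPS = {
    0b000: "generic undefined request",
    0b001: "memory read",
    0b010: "memory write",
    0b011: "address/command",
    0b100: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    mca_code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": mca_code,
        "memory_op": None,
        "channel": None,
    }
    # Memory controller errors look like 000F 0000 1MMM CCCC.
    # Bit 12 (F) is ignored by masking with 0xEF80.
    if (mca_code & 0xEF80) == 0x0080:
        mmm = (mca_code >> 4) & 0b111
        cccc = mca_code & 0b1111
        info["memory_op"] = MEMORY_OPS.get(mmm, "reserved")
        # 1111 means the hardware did not report a channel ("channel ??").
        info["channel"] = None if cccc == 0b1111 else cccc
    return info


if __name__ == "__main__":
    for key, value in decode_mci_status(0x88000040000200CF).items():
        print(f"{key}: {value}")
```

Decoded this way, the value describes a corrected memory-scrubbing error with no channel reported, which agrees with the "COR (1) MS channel ??" line in the e-mail.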


This has happened before. Last time, I did a lot of research, took the system down, and ran memtest, which picked up nothing after a day of testing. I still didn't feel good about it, and more research led me to reseating the RAM. So I used dmidecode to find the physical location, and this is where it gets weird.

[Screenshot: dmidecode output for the bank in question]


The bank in question has no RAM installed. I consulted the motherboard manual to find the physical location, took down the system and confirmed there is no memory in the slot. And to be safe I did reseat all the RAM. After more research I was led to the conclusion that it could be either a NIC or UPS issue. I decided to roll the dice and see what happens.

That was three weeks ago, and today it popped up again. I'm at a loss as to what to do next. I want to replace the hardware that is causing the problem, but I don't know which hardware that is or how to track it down at this point.

Can anyone point me in the right direction on the next step to troubleshoot?
 

Attachments

  • badram.jpg (45.7 KB)
Joined
Dec 29, 2014
Messages
1,135
Can anyone point me in the right direction on the next step to troubleshoot?

I would say it definitely sounds like RAM. Some systems are more persnickety than others. I have had some where they didn't like mixing RAM from different manufacturers even though all the sticks had the same rating (PC3-10600R). You need to make sure you have enough for all the functions that FreeNAS is doing, but you could try removing some sticks and see if the problem goes away. If you can take a downtime hit, you could try an exhaustive memtest to try and identify the problem stick. It could also be the motherboard RAM slot, but that seems a little less likely.
 
Joined
Jun 6, 2018
Messages
9
All memory is the same, so no mixing.

I have plenty of RAM (192GB) but I feel pulling blindly would be an issue. Plus I still have the ability to RMA the memory (if it is that) so it would be better to know what exact stick it is.

Thanks for the reply.
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
I'm skeptical that the Locator or Bank information has anything to do with the silkscreen labels actually on the motherboard. If there were serial numbers shown there, that would be a lot more useful. If you can afford the downtime, testing each stick singly would be worthwhile.
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
Good question. dmidecode ought to do that, but it might be that the vendors don't actually bother to record a machine-readable serial number on the sticks. Sorry, don't know.
 
Joined
Jun 6, 2018
Messages
9
Looking at dmidecode again, I do get serials for my RAM. But for Bank 8 I get nothing, because it's reporting no memory installed in that location.

Thanks again for the reply.
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
Assuming the serial number is actually on the RAM labels, get a list of them from dmidecode and look for their physical locations on the motherboard.
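
A rough sketch of that listing step, assuming the usual `dmidecode -t memory` text layout (a "Memory Device" heading followed by indented "Key: Value" lines with fields such as Size, Locator, Bank Locator, and Serial Number). The helper names and the sample text are hypothetical and only there to make the example self-contained.

```python
from typing import Dict, List, Tuple

# Hypothetical sample in the general shape of `dmidecode -t memory` output.
SAMPLE = """\
Handle 0x0020, DMI type 17, 40 bytes
Memory Device
    Size: 16384 MB
    Locator: DIMM_A1
    Bank Locator: BANK 0
    Serial Number: HYPOTHETICAL-0001

Handle 0x0021, DMI type 17, 40 bytes
Memory Device
    Size: No Module Installed
    Locator: DIMM_A2
    Bank Locator: BANK 8
    Serial Number: NO DIMM
"""


def parse_memory_devices(text: str) -> List[Dict[str, str]]:
    """Collect the Key: Value pairs of every "Memory Device" section."""
    devices: List[Dict[str, str]] = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped or stripped.startswith("Handle "):
            # A blank line or a new handle ends the current section.
            current = None
        elif current is not None and ": " in stripped:
            key, value = stripped.split(": ", 1)
            current[key] = value
    return devices


def installed_modules(devices: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (Locator, Bank Locator, Serial Number) for populated slots only."""
    return [
        (d.get("Locator", "?"), d.get("Bank Locator", "?"), d.get("Serial Number", "?"))
        for d in devices
        if d.get("Size", "No Module Installed") != "No Module Installed"
    ]


if __name__ == "__main__":
    for locator, bank, serial in installed_modules(parse_memory_devices(SAMPLE)):
        print(locator, bank, serial)
```

On the real system, the text would come from running `dmidecode -t memory` as root instead of the sample string. The serials in the resulting list can then be matched against the stickers on the sticks.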
 
Joined
Jun 6, 2018
Messages
9
Update: I contacted the vendor I bought the system from and they are helping me troubleshoot. They think the CPU could have become slightly unseated in transit, causing this error.

I'll update if that turns out to be the issue.

Update Update: Reseating the CPUs went fine. But during system checks I found I had lost RAM (2 sticks). I checked dmidecode, found the location, and reseated the RAM. After a reboot the RAM was still gone. Then, as a last test, I took out the 2 sticks and moved some nearby RAM into those slots, and ended up losing more RAM. I went from 192GB to 160GB and am now sitting at 128GB of RAM. I talked to the vendor and they think there could be hardware issues with the board, so I'm sending it back to them to look at.

The good thing about all this (as frustrating as it is) is that I was still doing burn-in testing, with no crucial data on the system yet. Getting the kinks out now is better than having this kind of problem pop up a year into use.

Thanks everyone for the help! Really appreciate everyone chiming in and offering solutions!
 