Memory problem

Status
Not open for further replies.

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
So I clearly have a bad DIMM in my system (Supermicro X9DRD-7LN4F-JBOD motherboard), because every now and then I see stuff like this in the IPMI event log:
[Attachment: upload_2016-9-16_8-34-55.png, screenshot of the IPMI event log]

ECC is great, in that it can correct the errors, and IPMI is great, in that it tells me they're there (and, I think, where they are), but I'm having a little trouble decoding the location. It says the error is at DIMMD1(CPU1).

I've got 16 DIMMs in the system, and one is bad. But which one? CPU1 tells me it's on CPU #1, the one toward the front (i.e., away from the back panel connectors on the board), which narrows down the options. But which DIMM? Are these designations silkscreened on the board?
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
The manual shows P1-DIMMD1 as the second DIMM from the right, on the left-hand side of CPU1:
[Attachment: dimmd1.jpg, DIMM slot layout from the manual]
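
For anyone scripting this, the event-log label and the manual's label seem to differ only in layout. Here is a minimal sketch of the translation, assuming the log always uses the `DIMM<channel><slot>(CPU<n>)` form seen in this thread and the manual the `P<n>-DIMM<channel><slot>` form. The function name `ipmi_to_manual_label` is made up for illustration, and other boards may label things differently.

```python
import re

# Assumed event-log form, based on the single example in this thread: "DIMMD1(CPU1)".
_IPMI_LABEL = re.compile(r"^DIMM([A-Z])(\d)\(CPU(\d)\)$")


def ipmi_to_manual_label(label: str) -> str:
    """Translate e.g. 'DIMMD1(CPU1)' into the manual-style 'P1-DIMMD1'."""
    match = _IPMI_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized DIMM label: {label!r}")
    channel, slot, cpu = match.groups()
    return f"P{cpu}-DIMM{channel}{slot}"


print(ipmi_to_manual_label("DIMMD1(CPU1)"))
```

It is still worth checking the result against the silkscreen or the manual diagram before pulling a stick.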
 

danb35

Thanks for the find--I looked in the manual, but didn't see this. Now to figure out what DIMMs are in there so I can replace this one with an equivalent...
 

Spearfoot

Thanks for the find--I looked in the manual, but didn't see this.
I know from experience that Supermicro includes pretty good layout information in their manuals -- but it's often too small to notice.
Now to figure out what DIMMs are in there so I can replace this one with an equivalent...
Aye, for that you'll have to crack the box open and take a look!
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
Cynical and jaded people might wonder if what the program calls "D1" corresponds with the silkscreen. Not me, though. Any vendor that calls the CPU fan connector "FAN A" clearly would not make that mistake.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Any vendor that calls the CPU fan connector "FAN A" clearly would not make that mistake.
Ah, but it's not the CPU fan connector, after all, at least on LGA 1150 boards. When quizzed, their support said it was controlled by the "motherboard temperature" (presumably PCH or a nearby thermistor).

So yeah, they need to work on their documentation.
 

Ericloewe

So I clearly have a bad DIMM in my system (Supermicro X9DRD-7LN4F-JBOD motherboard), because every now and then I see stuff like this in the IPMI event log:
View attachment 13682
ECC is great, in that it can correct the errors, and IPMI is great, in that it tells me they're there (and, I think, where they are), but I'm having a little trouble decoding the location. It says the error is at DIMMD1(CPU1).

I've got 16 DIMMs in the system, and one is bad. But which one? CPU1 tells me it's on CPU #1, the one toward the front (i.e., away from the back panel connectors on the board), which narrows down the options. But which DIMM? Are these designations silkscreened on the board?
Whoa, that's not every now and then; that's overflowed the log several times!
 

Spearfoot

Cynical and jaded people might wonder if what the program calls "D1" corresponds with the silkscreen. Not me, though. Any vendor that calls the CPU fan connector "FAN A" clearly would not make that mistake.
Wait! What? Their tech support guys told me that FAN1 is the CPU fan! :)
 

danb35

Whoa, that's not every now and then; that's overflowed the log several times!
Well, yes, but not quite. When the problem comes up (twice that I've noticed), it hits the logs a lot and seriously degrades system performance. The last time, a couple of days ago, it resulted in the system spontaneously rebooting. The last time it came up before then was a few weeks back, and I was able to reboot the system. No issues between the two events.

Certainly needs to be fixed, but it does seem to be somewhat intermittent.
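
The pattern described here, long quiet stretches punctuated by floods of events, is easier to see if the event times are grouped into bursts. A rough sketch, assuming you have already pulled the timestamps out of the event log by whatever means; `group_bursts` and the five-minute gap are arbitrary choices for illustration, and the timestamps below are invented, not taken from the log in this thread.

```python
from datetime import datetime, timedelta


def group_bursts(timestamps, max_gap=timedelta(minutes=5)):
    """Group event times into bursts; a gap longer than max_gap starts a new burst."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)
        else:
            bursts.append([ts])
    return bursts


# Invented timestamps, for illustration only.
events = [
    datetime(2016, 1, 1, 3, 0, 0),
    datetime(2016, 1, 1, 3, 0, 20),
    datetime(2016, 1, 1, 3, 1, 5),
    datetime(2016, 1, 20, 14, 30, 0),
    datetime(2016, 1, 20, 14, 30, 45),
]

for burst in group_bursts(events):
    print(burst[0], "->", burst[-1], f"({len(burst)} events)")
```

Two widely separated bursts with nothing in between would match the "intermittent" behaviour described above.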
 

Spearfoot

Well, yes, but not quite. When the problem comes up (twice that I've noticed), it hits the logs a lot and seriously degrades system performance. The last time, a couple of days ago, it resulted in the system spontaneously rebooting. The last time it came up before then was a few weeks back, and I was able to reboot the system. No issues between the two events.

Certainly needs to be fixed, but it does seem to be somewhat intermittent.
I had the same ECC RAM problem with my older Supermicro X8SIE system several months ago. Re-seating the memory modules seems to have fixed it...
 

Ericloewe

seriously degrades system performance
Judging from the log, that's from an interrupt storm of ECC error detections.

The last time, a couple of days ago, it resulted in the system spontaneously rebooting.
That sounds like it might've panicked due to an uncorrectable error.
 

danb35

I have not had a recurrence of the problem since the 14th, but I had some time this morning to down the server and reseat the DIMMs. I also took note of what the current RAM is. If I see a problem again, I'll order a replacement stick (or two). Thanks again for the help.
 

danb35

Well, that didn't take long--it went down again early this morning. I'd already ordered a couple of replacement sticks which should be here shortly.

Anyone know if @cyberjock is still looking for sticks of bad ECC RAM?
 

Ericloewe

I think he had what he needed.
 

danb35

Guess I'll just toss it, then.

Chalk up another win for IPMI! I could have spent a long time otherwise trying to run down what was causing my system to, apparently randomly, degrade performance and reboot.
 

Spearfoot

Drat!
 

Ericloewe

Guess I'll just toss it, then.

Chalk up another win for IPMI! I could have spent a long time otherwise trying to run down what was causing my system to, apparently randomly, degrade performance and reboot.
Keep it around to test systems. Seems reasonably flaky.

That flaky DIMM I got from the experience earlier this year fixed itself, which is good, I guess... It took forever to show errors, so it would've been painful to use anyway.
 

danb35

Got replacement DIMMs the other day. Had a chance to replace D1 this afternoon. Server went down again earlier this evening, showing the same stuff in the IPMI event log. Is there any other reasonable conclusion than a faulty DIMM socket? That would be annoying, but I'm sure I could live with only 120 GB of RAM rather than 128 GB.
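
The 120 GB figure follows from simple arithmetic, sketched below under the assumption that all 16 sticks are the same size (which is what 128 GB minus 8 GB implies). `remaining_gb` is just an illustrative helper, and it says nothing about which populations the board will actually boot with.

```python
TOTAL_GB = 128   # installed RAM mentioned above
DIMM_COUNT = 16  # number of DIMMs in the system

# Assumes identical sticks, so each one is 128 / 16 = 8 GB.
PER_DIMM_GB = TOTAL_GB // DIMM_COUNT


def remaining_gb(removed: int) -> int:
    """RAM left after pulling `removed` sticks, assuming identical sticks."""
    if not 0 <= removed <= DIMM_COUNT:
        raise ValueError("removed must be between 0 and the DIMM count")
    return (DIMM_COUNT - removed) * PER_DIMM_GB


print(remaining_gb(1))  # one socket left empty
```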
 

Ericloewe

Got replacement DIMMs the other day. Had a chance to replace D1 this afternoon. Server went down again earlier this evening, showing the same stuff in the IPMI event log. Is there any other reasonable conclusion than a faulty DIMM socket? That would be annoying, but I'm sure I could live with only 120 GB of RAM rather than 128 GB.
Bad IMC or bent pin in the socket.
 

danb35

IMC? To me, that means Instrument Meteorological Conditions, but I'm pretty sure that's not what you meant by it.

The plot thickens. I removed the DIMM from D1 this morning, and the system wouldn't boot at all. Hmmm... Looks like I'll need to go back to the manual to see how much RAM I'll have to yank out of the system so that it will boot with that socket empty.
 