Failing DIMM: DIMM location. (Correctable memory component found) (DIMMC1)

Joined
Dec 2, 2015
Messages
730
I rebooted my server to update to FreeNAS 11.3-U3.1, and got an email:

Code:
Subject: FreeNAS Server Critical Alert

See IPMI Event Log for details
IP : XXX.XXX.XXX.XXX
Hostname: BigBertha-IPMI
SEL_TIME: 2020/05/28 10:54:24
SENSOR_NUMBER: ff
SENSOR_TYPE: BIOS OEM
SENSOR_NAME: Memory Error
EVENT_DESCRIPTION: Failing DIMM: DIMM location (Correctable memory component found) (DIMMC1)
EVENT_DIRECTION: Assertion
EVENT SEVERITY:"non-critical"
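
For anyone who wants to pull the slot name out of these alert emails automatically, here is a minimal Python sketch. It assumes the body is a series of `KEY: value` lines like the one quoted above and that the slot appears in parentheses as `DIMM` plus a letter and a number (as in `DIMMC1`). The function names `parse_alert` and `dimm_slot` are made up for illustration; they are not part of any Supermicro or FreeNAS tool.

```python
import re


def parse_alert(body):
    """Turn 'KEY: value' lines of an alert email body into a dict.

    Assumption: each field sits on its own line and the key ends at the
    first colon. Lines without a colon are skipped.
    """
    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue
        # Drop surrounding whitespace and any double quotes around the value.
        fields[key.strip()] = value.strip().strip('"')
    return fields


def dimm_slot(description):
    """Return the slot name found in parentheses, e.g. 'DIMMC1', or None."""
    match = re.search(r"\((DIMM[A-Z]\d+)\)", description)
    return match.group(1) if match else None
```

Calling `dimm_slot(parse_alert(body)["EVENT_DESCRIPTION"])` on the email above should give the slot that the BIOS flagged, which makes it easy to tally alerts per slot over time.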


I logged into IPMI, and found the System Event Log had an entry:
Failing DIMM: DIMM location. (Correctable memory component found) (DIMMC1)

There were six similar errors over the last six months in the log, all for slot C1, but this is the first one for which I received an email.

I shut down the server and swapped the DIMMs between slots C1 and D1 to determine whether the issue is a bad DIMM or a bad slot. It is also possible that reseating the DIMM will correct the problem.

What does this error mean? Did ECC find and correct a one bit error?

Server Details:
Supermicro X10SRH-CF
Xeon E5-1650 v4 CPU
96 GB ECC RAM
8x 4TB WD Red in RAIDZ2 + 8x 4TB WD Red in RAIDZ2 as local backup pool
Norco RPC-4224 chassis.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
It would have helped if you had provided the brand and speed of the memory modules you have in use, but in my experience this is a rank or latency incompatibility between some modules.
But it's nothing to worry about as long as it doesn't turn into uncorrectable errors...
It happens on my system all the time; it usually takes a couple of reboots until it recovers (but my configuration is tough on the board, with 8-rank and 1-rank modules mixed, so I am not complaining ;) )
 
Joined
Dec 2, 2015
Messages
730
I contacted Supermicro Tech Support, seeking clarification on what this message means. My question was:

The IPMI Event Log shows seven errors in the last six months, all exactly: "Failing DIMM: DIMM location. (Correctable memory component found) (DIMMC1)".

Today I swapped the DIMMs in slots C1 and D1, to see if the error follows the DIMM or stays in the same slot (assuming that reseating the DIMMs hasn't cleaned up the issue).
What type of DIMM failures could trigger this log entry? Is this a report that the DIMM has had a correctable memory error? Or something else?
Their response was:
Yes, there has been a correctable memory error.

Can you try updating to the latest BIOS 3.2 and IPMI 3.86 from our website and see if it helps with the issue?

Now I'm pondering the risk/reward ratio of updating the BIOS and IPMI. I had one previous very bad experience with a borked IPMI update on another server that left it with no IPMI.

I'm thinking the BIOS update is more likely to directly affect this issue, if the root cause is not the DIMM itself. I think I'll hang fire until I get another event, as it is possible that reseating the DIMM has resolved it. If another event occurs in the original DIMM slot, I'll try a BIOS update. If the issue follows the DIMM to the new slot, I'll consider the DIMM the most likely cause and keep an eye on it, expecting that I may eventually need to replace it. I have enough RAM in the server that I could run without that DIMM for a while if needed while I'm waiting for a new DIMM to arrive. I won't do a proactive DIMM replacement, as COVID has cratered my income for the foreseeable future and I need to conserve cash.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
An IPMI update indeed won't do anything for that; the BIOS is the one that matters.
But I don't think that will change much either, to be honest. At this point the X10SRH is so old that only minor things get changed in a new BIOS version.
Usually it does not hurt, and you can always go back, but if it's not happening often I wouldn't bother.
 
Joined
Dec 2, 2015
Messages
730
There haven't been any more failing DIMM messages logged via IPMI, but grep 'memory error' /var/log/messages shows roughly 25 correctable memory errors per day on slot 2 before the DIMM swap. The errors moved to slot 3 after the DIMM slot swap.
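
For anyone wanting to get a per-day figure like this without counting by hand, here is a rough Python sketch. It assumes the log uses the traditional syslog timestamp prefix (for example `May 28 10:54:24 hostname ...`) and simply counts lines containing the search text; it does not try to decode the slot or bank, since the exact wording of the kernel message varies. The function name `count_errors_per_day` is hypothetical, and unlike plain `grep` the match here is case-insensitive.

```python
import re
from collections import Counter

# Assumption: traditional syslog prefix, e.g. "May 28 10:54:24 host ...".
# Single-digit days are padded with an extra space, which \s+ tolerates.
SYSLOG_DATE = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+\d{2}:\d{2}:\d{2}\s")


def count_errors_per_day(lines, needle="memory error"):
    """Count lines containing `needle`, grouped by 'Mon D' date.

    Lines that match `needle` but have no recognizable timestamp are
    counted under the key 'unknown'.
    """
    counts = Counter()
    wanted = needle.lower()
    for line in lines:
        if wanted not in line.lower():
            continue
        match = SYSLOG_DATE.match(line)
        if match:
            day = f"{match.group(1)} {int(match.group(2))}"
        else:
            day = "unknown"
        counts[day] += 1
    return counts
```

To use it on the real log, open `/var/log/messages` and pass the file object as `lines`, then print the resulting counts. Comparing the daily totals before and after a slot swap shows whether the error rate is steady or getting worse.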

Conclusions:
  1. This is not a board problem, as the error followed the DIMM when I swapped slots.
  2. This is either a failing DIMM (most likely) or a BIOS that is too sensitive to something strange the DIMM is doing (much less likely).
Plans:
  1. Check the warranty status of this DIMM, and try to get it replaced under warranty if possible.
  2. If the DIMM cannot be replaced under warranty, update the BIOS at a convenient date and time.
  3. If the errors continue, as they likely will, check DIMM prices and order a replacement DIMM.
 

subhuman

Contributor
Joined
Nov 21, 2019
Messages
121
Now I'm pondering the risk/reward ratio of updating the BIOS and IPMI.
I'm with you on that.
"Do a BIOS update while you're getting memory errors."
What could possibly go wrong?
 

Dan Tudora

Patron
Joined
Jul 6, 2017
Messages
276
Hello,
Many bad things can happen when you have a memory error.
To be on the safe side, remove the faulty DIMM and keep an eye on the log/IPMI for a day or two before doing a BIOS update.
Be careful.
Good luck.
 
Joined
Dec 2, 2015
Messages
730
The vendor issued an RMA, so the failing DIMM is in the mail back to them to be replaced by a new one.
 
Joined
Dec 2, 2015
Messages
730
The replacement DIMM arrived today and is working well so far, with no errors reported after 5 hours in service. I have not upgraded the BIOS or IPMI as the risk is much greater than the potential reward.
 