RAM testing results, should I be worried?

MalVeauX

Contributor
Joined
Aug 6, 2020
Messages
110
Hrm,

I'm testing my ECC RAM at the moment. I've been running Memtest86 for a while now just to let it do its thing. So far I'm on pass #5 after 18+ hours and it reports 0 (zero) errors. I had to run an older version (V4) because it's an old board (X8SIL-F) that doesn't support UEFI booting.

Memtest86_Pass_5_08092020.jpg


However, when I checked the motherboard's event log via IPMI, just being nosy, I found an event alert by chance:

Memory_Uncorrectable_Error_09082020.jpg


I tried looking this up, and as far as I can tell this is a real error that was not corrected by ECC. So my understanding is that it would have resulted in corrupted or lost data if that data were not already backed up, correct?

I'm confused about how Memtest didn't report an error while the IPMI log shows one (not corrected by ECC).

DIMM4B: does that mean it's the DIMM 4 slot on the board, to help identify which stick it is?

How would ZFS handle this on a mirror pool of data? Would it have caught the mismatched checksum and healed the data, or would this have been a total bust, where the data was corrupted in RAM and then written out corrupted? I'm not sure how serious this is. It looks like I need different RAM though, and that worries me.

This server is going to be housing media (our family pictures, movies, etc.) to serve to a few client machines in the house. The data will be on mirrors, no parity stuff, strictly 1:1 mirrors. I'm not sure whether the above means I should hold off on putting data on this server until I figure out this error, replace the RAM entirely, or what.

Thoughts?

Very best,
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
This thread may help your understanding: https://www.ixsystems.com/community/threads/memory-issues.60593/page-2#post-430822

You need the paid version of Memtest to specifically test the ECC function, but even then, specifics of your system may mean that it doesn't provide usable results.

Your IPMI log report might suggest that your BIOS isn't inhibiting ECC error reporting though.

For detailed info on your log report and DIMM ID, I suggest you try contacting Supermicro support or maybe the sysadmin group on Reddit if there's nothing forthcoming here (or from a general Google search?...).

I looked at the quick ref guide and manual for your mobo - they don't identify any slot as "4B" unfortunately.
 

MalVeauX

Contributor
Joined
Aug 6, 2020
Messages
110
Thanks,

So far, nothing conclusive has shown up in my searches, just empty leads in multiple directions.

Yea, oddly the IPMI event log refers to something that isn't mentioned in the manual for the board. That's weird to me. You'd think it would be spelled out rather clearly instead of this cloak and dagger stuff. Sigh.

Even if the ECC isn't being tested (which I understand it is not here specifically), shouldn't the error show up in Memtest in some way? Odd that the board reports an uncorrectable error, yet Memtest is OK with it.

I'm trying to figure out if this is nothing to be worried about, i.e., normal, or if I need to be purchasing new RAM like right now.

Very best,
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
A single ECC correction seems innocuous to me. This can be chalked up to a random bit flip from cosmic rays. I'd be more concerned if there were a spurt of corrections within a short time interval.
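
As a rough illustration of what "a spurt within a short time interval" could mean in practice, here is a minimal sketch in Python. The function name, the threshold of 3 events, and the 24-hour window are all arbitrary assumptions for illustration, not values from Supermicro or any standard; the timestamps would have to come from your own event log.

```python
from datetime import datetime, timedelta


def has_error_burst(timestamps, max_events=3, window=timedelta(hours=24)):
    """Return True if more than max_events fall inside any span of `window`.

    max_events and window are hypothetical thresholds chosen for illustration.
    """
    times = sorted(timestamps)
    start = 0
    for end in range(len(times)):
        # Move the left edge forward until the span fits inside the window.
        while times[end] - times[start] > window:
            start += 1
        if end - start + 1 > max_events:
            return True
    return False


# Made-up example data: one isolated event versus five events in five minutes.
isolated = [datetime(2020, 1, 1, 12, 0)]
clustered = [datetime(2020, 1, 1, 12, minute) for minute in range(5)]
print(has_error_burst(isolated), has_error_burst(clustered))
```

With these defaults, one isolated event is not flagged, while several events packed into a few minutes are.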
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
I should have also mentioned the Passmark forum - they were responsive to my questions about the ECC test function on the paid version. Perhaps you should pose the question(s) to them.
 

MalVeauX

Contributor
Joined
Aug 6, 2020
Messages
110
Thanks; so you don't think the fact that it said "uncorrectable" rather than "corrected" is a problem?

Very best,
 

MalVeauX

Contributor
Joined
Aug 6, 2020
Messages
110
Thanks, I'll snoop around. This is older hardware so likely someone else has posted this before. Not sure I'm ready to pay $50 for the Pro version of this software that may or may not even be reliable at showing me anything meaningful here.

Very best,
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
No, I think this is just a case of Chinglish strikes again.
 

MalVeauX

Contributor
Joined
Aug 6, 2020
Messages
110
Ok, so does this imply that "uncorrectable" actually means the ECC corrected the error successfully, and that this is a translation error in the motherboard? And perhaps that's why Memtest didn't note the error?

Very best,
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Yes, that's the scenario that makes the most sense from what you've observed.
 

MalVeauX

Contributor
Joined
Aug 6, 2020
Messages
110
That's very interesting, there's some logic there.

I'll keep the test going as long as I can and see what else comes up, if anything.

If it doesn't repeat or show up again, I think I'd be satisfied to simply not look and just move forward with the server and use it with live data on the new drives once they arrive.

If I see it again, well, I may need to explore new RAM or I'll just not sleep well thinking it's just going to corrupt stuff.

Do you think there's further testing I should be doing to assess this?

Very best,
 

MalVeauX

Contributor
Joined
Aug 6, 2020
Messages
110
Update:

I finally stopped testing the RAM. I'm at over 78 hours of non-stop testing in MemTest86, 19 total passes, with zero (0) errors reported. The only error I saw was the one I posted above from the IPMI alerts, and it has not happened again.

Would you all say this RAM is worth using, or should I still be worried about the error message from IPMI? Or was it merely a translation error, and the error really was corrected, given that MemTest reported no errors?

Memtest86_Pass_19_09112020.jpg


Very best,
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
I’d say you’ve pretty well burned in your RAM, and it’s ready for service.
 

MalVeauX

Contributor
Joined
Aug 6, 2020
Messages
110
Thanks;

I'm still of course worried about why it said "uncorrectable" in my IPMI event log instead of "correctable" or something. Yet MemTest reported zero errors. So I'm agreeing that it's probably a translation thing since it didn't happen again. I'll just keep watch and if it occurs again, I'll get new RAM. So I guess I'm going to go forward for now.
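
For keeping watch, something like the following Python sketch could filter an exported event log down to the memory-related entries. It assumes the log has been saved as pipe-separated text with columns in the order id, date, time, sensor, description (the layout `ipmitool sel list` typically prints; verify against your own output). The function name and the sample lines are made up for illustration and are not from my actual log.

```python
def memory_events(sel_text):
    """Return (date, time, description) for log lines mentioning memory or ECC.

    Assumes pipe-separated columns: id | date | time | sensor | description | ...
    """
    events = []
    for line in sel_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        _, date, time, sensor, description = fields[:5]
        text = f"{sensor} {description}".lower()
        if "memory" in text or "ecc" in text:
            events.append((date, time, description))
    return events


# Hypothetical sample lines, not taken from a real log.
SAMPLE = """\
   1 | 01/01/2020 | 10:00:00 | Memory #0x01 | Uncorrectable ECC | Asserted
   2 | 01/02/2020 | 11:30:00 | Fan #0x02 | Lower Critical going low | Asserted
"""

for event in memory_events(SAMPLE):
    print(event)
```

If a second memory entry ever shows up in the filtered list, that would be my cue to go shopping for RAM.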

Very best,
 