Memory Error on FreeNAS Mini

Status
Not open for further replies.

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Does it actually work?
I'm not sure how I would check. If you have a suggestion, I'll try it.
During testing it periodically reports as shown at the bottom of the screenshot I posted above, but is it a smoke and mirrors job?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Ordinarily, you could just check the IPMI log for ECC errors, but that's slightly more complicated in this case, due to the real errors that are going on.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
you could just check the IPMI log for ECC errors
Thanks - I didn't realize they would be logged there. And there are both correctable and uncorrectable ECC errors logged - but not as many as the memtest display seems to suggest.
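For anyone wanting to tally those entries rather than count them by eye, here is a minimal sketch. It assumes the event log has been exported as text, for example with `ipmitool sel list`, and that each ECC entry carries a field beginning with "Correctable" or "Uncorrectable". The sample lines, dates, and sensor names are hypothetical, and the exact wording varies by BMC firmware, so treat the matching rule as an assumption to adjust.

```python
from collections import Counter

# Hypothetical sample in the general pipe-separated shape of a SEL text dump.
# Real output differs between BMC firmware versions.
SAMPLE_SEL = """\
   1 | 04/02/2016 | 07:41:10 | Memory #0x01 | Uncorrectable ECC | Asserted
   2 | 04/02/2016 | 07:41:12 | Memory #0x01 | Uncorrectable ECC | Asserted
   3 | 04/02/2016 | 09:15:40 | Memory #0x01 | Correctable ECC | Asserted
   4 | 04/02/2016 | 10:02:03 | Temperature #0x30 | Upper Critical going high | Asserted
"""


def count_ecc_events(sel_text: str) -> Counter:
    """Count correctable and uncorrectable ECC entries in a SEL text dump."""
    counts = Counter()
    for line in sel_text.splitlines():
        # Split on the pipe separators and normalise each field.
        fields = [field.strip().lower() for field in line.split("|")]
        for field in fields:
            if "ecc" not in field:
                continue
            # Check "uncorrectable" first: it also contains "correctable".
            if field.startswith("uncorrectable"):
                counts["uncorrectable"] += 1
            elif field.startswith("correctable"):
                counts["correctable"] += 1
            break  # at most one ECC classification per log line
    return counts


ecc_counts = count_ecc_events(SAMPLE_SEL)
print(dict(ecc_counts))
```

Comparing these totals against the error count shown on the memtest screen makes a mismatch like the one described here easy to spot.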
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
My investigations on the beta version, which said it was the professional version, revealed that it worked exactly never. C2750 (someone did me the favor of trying it out on their hardware, since I don't have one), Xeon E5-1650 v3, i3-4330... None of them. Which really sucked away my enthusiasm for paying for the thing.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Understood - that would have been my reaction, too. I will later try repeat tests with the injection turned off and see what the log does or doesn't show. I'll try it on my FS12-TY, too, when the pressure's off. But right now I need to focus on tracking down whatever's going on with this C2750 board/cpu/memory.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
4 passes completed, no errors. Set off another set to run overnight. @Ericloewe, log entries for ECC errors appear to have stopped while the program's interface continues to indicate error injection.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
2 runs of 4 memtest passes completed without error on the stick that had appeared to indicate a problem. This morning I commenced testing again, this time with the "problem stick" back in its original slot and the original priority-slot stick of the 4 back in that slot.
@Ericloewe, immediately on starting the new test this morning the event log showed two uncorrectable ECC errors asserted. Presently the program interface suggests that there have been 11 injections so far but the event log still shows only the original two.
From some past casual observations I have questions about the event log's fidelity, but I've never thought about a serious attempt to test it. And it's not on my radar screen today with this testing attempt in progress.
I had to make several attempts to get the system to POST this morning after rearranging the memory. I'm starting to wonder if this is "just another bad Asrock C2750 board". I'll keep testing the memory though (as long as it will let me) to try to zero in on the problem.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
2 runs of 4 passes each memtest on two sticks as described above - no errors. Next I will try the other two sticks together - test them single afterwards if there are errors (trying to speed things up a bit).
@Ericloewe, immediately after start of 9 hour test run this morning the system event log showed the two errors as mentioned above. No others showed throughout the day...
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
Testing in pairs is smart. If you get a failure, note the address and keep testing a bit longer. You want to get quite a few failures and ensure the address is in the same range. Once that happens, swap only one of the RAM sticks with a different stick and leave the other in the machine. Document this by serial number. If the RAM fails in the same location, then remove the original stick (the one you started with that is still installed) and put back the stick you removed after the initial test failure. If the test passes then you have identified the failing stick.
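As a rough aid for the "same range" check, here is a minimal sketch that tests whether a set of noted failure addresses all fall within one window. The function name, the sample addresses, and the 1 MiB window are all hypothetical choices for illustration. What counts as the same range depends on how the board maps addresses to sticks and channels.

```python
def failures_share_range(addresses, window_bytes):
    """Return True when at least two failures were logged and all of them
    lie within window_bytes of each other."""
    if len(addresses) < 2:
        # A single failure cannot show a repeating pattern yet.
        return False
    return max(addresses) - min(addresses) <= window_bytes


# Hypothetical failure addresses noted from the memtest screen.
noted = [0x123456000, 0x123456040, 0x123457000]
print(failures_share_range(noted, 1 << 20))  # window of 1 MiB
```

Keeping the list per stick serial number, as suggested above, makes it easier to tell whether the failures follow a stick or stay with a slot.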

If the test fails no matter what RAM is installed, provided it is always two sticks, then you have a motherboard issue. You can try a few different things to tweak the motherboard timings and voltage which may make the system stable. Examples are slowing down the RAM clock speed, slowing down the CPU clock speed, or increasing the RAM and/or CPU voltage for the bus. You would be best to go to an overclocking site to do some reading before tweaking. By no means add more voltage to the CPU/RAM etc. other than in the absolute smallest increments, such as .001 VDC. Sometimes lowering a voltage can also lead to stability. Slowing a clock rate down by a few digits can do miracles. I used to do a lot of this many years ago, but since CPUs and bus interfaces are really fast these days, I don't see a need to do it myself unless there is a stability issue. Also, try a different power supply first, and disconnect all your hard drives too so that you run a minimal power load.

Of course, if you do really believe there is a motherboard issue then the best thing to do is replace it. Trying to prove this can take a lot of time.

Best of luck to you.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Still working away testing memory pairs - two sets of 4 passes per pair, 6 of 8 complete so far, no errors yet.
@Ericloewe, in 10.5 hours testing so far today, no entries in System Event Log to indicate memory error recognition despite Memtest86Pro's interface indications...
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
in 10.5 hours testing so far today, no entries in System Event Log to indicate memory error recognition despite Memtest86Pro's interface indications...
You could contact both the manufacturer of MemTest86 Pro and Supermicro, tell them that the log is not recording the induced failures, and see what they have to say. Maybe it doesn't work properly on your motherboard, or doesn't work the way we would all expect?
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Yes, I am intending to do that - it's Asrock, BTW. I'll report.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Yes, I am intending to do that
I put in a posting on the forum that Memtest86+ directs one to. I'm not confident of an answer - I found a similar question on the function of the ECC error injection mode that had gone unanswered.
It's a fun experience to even get to the forum - the memtest86+ site link doesn't work. I had to edit the link result to get to the site (French language) and hunt for the actual forum (which is in English). The most difficult part of registering to be able to post was answering the random question - which was in French even on the English-language version of the registration form, and required the answer to be in French - assuming that one could correctly determine what color the bloody magazine's mascot really is ...
Talk about a way to limit support questions ...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
You're using memtest86, not memtest86+ - they're completely different and the latter is essentially abandoned these days.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Oh, yes, I guess so ... Thank you.
I'll comment to them at Passmark directly when I find out how.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Testing update:

Memtest86 - I have an active dialog ongoing with the Passmark forum moderator, and have distilled my questions down to "Should I have any expectation of any response from either Memtest86 or the system to this error injection? If there is no correction of an injected error, what indication do we have? If there is correction of an injected error, what indication do we have?"

Asrock C2750D4I: all combinations of memory pairs in dual channel configuration (sticks and slots) completed - two test runs of four passes each - no errors.
Tested to find/confirm all bootable combinations of single channel memory configurations (one config only, with two sticks and two slots).
Before making the 12 memtests possible in this single channel mode, and considering the fact that this still leaves one slot out of four untested except for the full four slot, four stick, configuration as my first test (which failed), I decided to run the full configuration test again overnight (takes ~ 12.25 hours) for the hell of it. Wouldn't you know it - no errors! So I ran it two more times - no errors!
All of these latest full configuration tests have the uprated main case fan and the three additional fans over stock supply, and the CPU and main board temps have not approached the Asrock/Intel defined limit of >80C - same apparent conditions as the first test where there were five reported memory errors.
@Ericloewe, at no time in the last three tests was there any indication of a memory error event in the System Event Log (in discussion with Passmark as to expectations ...).
Soooo..., what to do now? Given that the single channel tests will not test the one single slot that was not included in the two channel tests, but all the sticks have been tested multiple times, I'm sorely tempted to pull the HDDs back out of my backup machine, reinstall them in the Mini case, reconfigure the jails, etc., put the Mini back into normal service and monitor it carefully.

It's a very nice day here in NE Ohio so I'm going to spend the rest of the daylight hours working outside while deciding what to do - start single channel tests with two sticks or put back into service.
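A note on the figure of 12 single channel memtests mentioned above: the post doesn't spell out how it was counted, but it is consistent with placing four sticks, two at a time, into the one bootable pair of slots with slot order mattering (4 x 3 = 12). A minimal sketch under that assumption, with hypothetical stick labels:

```python
from itertools import permutations

# Hypothetical labels for the four sticks. In practice, use serial numbers.
STICKS = ["A", "B", "C", "D"]


def single_channel_arrangements(sticks, slot_count=2):
    """List every ordered placement of sticks into slot_count slots.

    Order matters because the same two sticks swapped between the two
    slots counts as a different configuration to test.
    """
    return list(permutations(sticks, slot_count))


arrangements = single_channel_arrangements(STICKS)
print(len(arrangements))
for first_slot, second_slot in arrangements:
    print(f"slot 1: {first_slot}, slot 2: {second_slot}")
```

At roughly half a day per full run, a checklist like this helps weigh the remaining test time against simply returning the box to service.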
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If you've tested the configuration you'll be using, I don't see that much of a reason not to start using it.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
I agree with @Ericloewe that you might as well start using the system again, assuming you plan to retain it. I still have no idea why the CPU temps would have risen in the first place. Maybe you could contact iXsystems and ask them about it. You shouldn't have to add additional fans to cool the system/CPU, but that is just me. If you are comfortable running the system as-is, that is fine too. Honestly, I would likely have installed an additional fan if I had noticed the CPU was getting too warm.

What sucks is that running the purchased MemTest86 does not give you that warm fuzzy feeling that it did anything more than the free version except lighten your wallet.
 