SOLVED The usefulness of ECC (if we can't assess it's working)?

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
If you're willing to settle for doing it on just one or two platforms whose behaviour is known, that's a lot easier.
Yes please. That's an aspect of intent with one of my sub questions regarding known good configurations.
Let's have a list that should work and is easy to test if it actually does on arrival and continues to work over time.
If one wants to take a different route, then they're on their own until enough traction develops to also include that route.

I would like to point out again that iXsystems is promoting WD Red. I buy nothing else these days, so again no snark is intended, but why promote only a specific HDD? It is indeed very important to the setup, but AFAIK other vital aspects, like the ones we are discussing now, are not promoted.

I have seen the LSI 2008 HBA mentioned once in a thread about virtualizing FreeNAS, but that in itself is not really what I meant by a known good configuration.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Now, I do wonder: Could there be a community effort to contribute to the middleware layer for SuperMicro boards
Where do I chime in? I for one would like to help out.
I only hope AMD Ryzen also gets a chance, via community efforts like the one you propose, to make stable server setups affordable for small fish like me.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
The ASRock Rack series sends alerts on ECC error. I am thinking that reasonably, this may not be a FreeNAS middleware layer ask, this may just be a matter of configuring IPMI to send alerts.
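
As a rough illustration of what checking for such alerts could look like from the OS side, here is a minimal Python sketch that scans a text dump of the BMC's System Event Log for ECC-related entries. It assumes the dump is pipe-delimited, one event per line, as produced by something like `ipmitool sel list`. The function name `find_ecc_events` is made up for this example, and whether a given board logs ECC events to the SEL at all is exactly the open question here.

```python
def find_ecc_events(sel_text):
    """Return SEL lines that mention ECC in any pipe-delimited field.

    Assumption: one event per line, fields separated by '|', as in the
    text output of `ipmitool sel list`. Nothing here is board-specific.
    """
    events = []
    for line in sel_text.splitlines():
        # Normalize whitespace around each field before matching.
        fields = [field.strip() for field in line.split("|")]
        if any("ecc" in field.lower() for field in fields):
            events.append(" | ".join(fields))
    return events
```

Feeding it the saved output of `ipmitool sel list` and getting an empty list back only shows that no ECC entry has been logged so far; it cannot show that one would be logged.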

Curiously poking around, I find that the X570 Creator you have there is considerably more expensive than an X470D4U, and doesn't have IPMI. The CPU looks like it was chosen for heavy-duty virtualization. Let's say that throwing more money at it with less manageability strikes me as a curious choice.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
This old ASRock Rack E3C226D2I in my main does not seem to :( But I will admit I am too inexperienced to recognize it after looking through the options.


Firmware Revision: 0.18.0
Firmware Build Time: Jan 21 2015 22:32:24 CST
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
But thx a lot for showing me, I'll be sure to check it out.

However, 0 ECC errors in the logs since 2016. Is that a bad omen for the reporting functionality, or just rock solid RAM?
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
PassMark MemTest86 8.4 RC2 build 1000 is unable to inject errors on my main, so I am kind of nervous.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
However, 0 ecc errors in the logs since 2016. Is that a bad omen for reporting functionality or just rock solid ram?

That's as expected. ECC RAM is carefully "binned" - tested for reliability before being sold. Memory that shows ECC errors tends to do so in the first 12-18 months; most sticks that survive those first 18 months won't die in the coming years. For a single server, you expect not to see ECC errors. For thousands of servers, seeing ECC errors is almost guaranteed.
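
To put rough numbers on the single-server versus thousands-of-servers point, here is a small Python sketch. It assumes every server independently has the same chance `p` of logging at least one ECC error over some period; the value 0.01 is a made-up illustrative figure, not a measured error rate, and `prob_at_least_one` is just a name chosen for this example.

```python
def prob_at_least_one(p_per_server, n_servers):
    """P(at least one of n independent servers sees an ECC error)."""
    # Complement of "no server sees an error": (1 - p) ** n.
    return 1.0 - (1.0 - p_per_server) ** n_servers


# p = 0.01 is purely illustrative, NOT a real-world error rate.
p = 0.01
for n in (1, 100, 1000):
    print(f"{n:>5} servers: {prob_at_least_one(p, n):.4f}")
```

With that assumed `p`, a lone server almost never sees an error, 100 servers see at least one roughly 63% of the time, and 1000 servers are all but certain to.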
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Intel BIOS forbids injection. That's not a reason to be nervous. It's okay to trust that ASRock did the testing in their labs, and that their alerting function works. We don't all need to replicate vendor lab tests. :)

Edit and inevitable car analogy: I don't test that the check engine light will come on just before my oil runs out and the engine block seizes. I trust that the manufacturer has done that test, the car will alert me, and I'll do oil changes because I don't want my engine block to seize.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Ohh my, I realize now more than ever I should continue my quest.
I have double checked the event log sensors in my main's ASRock Rack E3C226D2I. As far as I can tell, no ECC error category is present.

What makes me even more worried is that if Intel BIOS forbids injection, then how trustworthy is the supported CPU family list on PassMark's MemTest86 Pro compare page?

  • AMD Bulldozer (15h)
  • AMD Steamroller (15h)
  • AMD Jaguar (16h)
  • AMD Ryzen (17h) [Note: Injection is disabled in most AMD retail CPUs. To enable, please consult the Processor Programming Reference document]
  • AMD Steppe Eagle SoC
  • AMD Merlin Falcon SoC
  • Intel Nehalem
  • Intel Lynnfield
  • Intel Westmere
  • Intel Xeon E3 family (Sandy Bridge)
  • Intel Xeon E3 v2 family (Ivy Bridge)
  • Intel Xeon E3 v3 family (Haswell)
  • Intel Xeon E3 v4 family (Broadwell)
  • Intel Xeon E3 v5 family (Skylake)
  • Intel Xeon E3 v6 family (Kaby Lake)
  • Intel Atom C2000 SoC
  • Intel Broadwell-H SoC
  • Intel Apollo Lake SoC
I still am not aware of a documented/reported known working configuration.

And again Yorick, I am really thankful you are participating. Maximum respect. But please let me politely reject your car analogy as follows:
When I buy a car I first test drive it. As soon as I hit the brakes for the first time I know if they are working or not.
After having bought it, I assess each time I brake for a stop sign or something whether they still do.
And where I come from there is a yearly check-up, mandatory by law, to make sure the brakes are not about to fail.

This is exactly what I am looking for regarding my data.

I'd like to put it in a more dramatic way, just to spur things on if I can. I would rather die in a crash due to failing car brakes than lose the data I am trying to protect. I am actually quite serious about this. I have many 'backups' spread all around. Nothing really industrial though. But I would like it to be better.
Thus I am looking to have 3 FreeNAS setups spread around the globe, synced, each having 3 x 3TB mirrored.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Let's distinguish between the part that is wearing - RAM - and the alert system for that - ECC/BIOS/IPMI. With an automated alert system, I do not need to do manual testing of the wearing part, as long as I know that the automated alert system exists and functions.

In your brake analogy, the wearing part is the brake system, and the alert system is completely manual. At least on my car it is.

Testing the automated alert system is a fun little side project. I've got some feelers out for a stick of defective ECC. I'll post if I get a bite. Edit: Crucial can't part with defective sticks. Maybe someone on the STH forums feels generous.

If you are really really keen, you can always solder wires to an ECC stick and inject errors, see link further up in this thread.

Edit: More error injection ideas. "Two syringes", wow that sounds scary, see video at https://www.vusec.net/projects/eccploit/ . And more generally rowhammer, which will work if (big IF) the specific modules in that specific board are susceptible to it.

Edit2: If you thought sticking needles into your DIMM socket was scary, have a look at this: http://bluesmoke.sourceforge.net/heat_gun.html

Edit3: Masking a pin. Hmm. http://bluesmoke.sourceforge.net/testing.html . Though maybe Kapton tape instead.

Edit4: Heat lamps! Oh my goodness. https://www.cs.princeton.edu/~appel/papers/memerr.pdf

So, out of all of those, I think the most reasonable things to try, in order, are:
- Boot from a Linux stick and try rowhammer. Slim chance that the memory is actually susceptible to it, but it costs nothing to try.
- Mask a pin with Kapton tape. Introduces error, will allow one to verify the alerting system works.
- ..... yeah no I'm not comfortable with any of the others :). Mayyyybe the gooseneck clip-on lamp with a 50W bulb. But, yeah, not sure I am keen enough to go down that road.

So, @diversity , I can't wait to hear your test results :)
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
lol. Yorick. You called me out on the dramatic part :) and correctly so.

I will sleep on it some nights and probably end up feeling helpless as I dare not do the things you are suggesting.

But one never knows, I'll have a look first over a few days.

Thx again for contributing.
(thumbsup)
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
My FreeBSD servers do inform me of correctable ECC errors:
Code:
Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0


How to trigger alerts from that is left as an exercise to the reader ;)
Can you please share your setup details? We can then add it to the (not yet existent) list of known good configurations.

If we have but a few, then notifications might become a possibility.
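
As a starting point for that exercise, here is a minimal Python sketch that picks corrected memory errors out of log text shaped like the kernel lines quoted above. The regular expression is inferred only from those few lines, so other FreeBSD versions or CPUs may well format MCA messages differently. The name `corrected_memory_errors` is hypothetical, and treating the number in parentheses as an error count is an assumption.

```python
import re

# Pattern inferred solely from the quoted excerpt; treat it as a guess.
MCA_COR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\skernel:\s"
    r"MCA: CPU (?P<cpu>\d+) COR \((?P<count>\d+)\)(?P<detail>.*memory error)"
)


def corrected_memory_errors(log_text):
    """Return one dict per 'COR ... memory error' line found in log_text."""
    found = []
    for line in log_text.splitlines():
        match = MCA_COR.match(line.strip())
        if match:
            found.append({
                "stamp": match.group("stamp"),
                "host": match.group("host"),
                "cpu": int(match.group("cpu")),
                # Assumption: the parenthesized number is a count.
                "count": int(match.group("count")),
                "detail": match.group("detail").strip(),
            })
    return found
```

The returned list could then be handed to whatever sends mail or notifications; a non-empty result on a known machine would also be one concrete data point for a known good configuration.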
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
@Yorick I have to admit I am out of my league :( I quickly scanned the links you offered and found no footing :(

I thank all for their contributions. I respect and accept (even though it just does not feel logical to me) that others do not feel ECC is basically useless if one can't assess that it is working. Perhaps my wording was chosen poorly.

I will open up a new thread about known good configurations with commercial memtesters.

kind regards
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Edit and inevitable car analogy: I don't test that the check engine light will come on just before my oil runs out and the engine block seizes. I trust that the manufacturer has done that test, the car will alert me, and I'll do oil changes because I don't want my engine block to seize.
Unfortunately, the check engine will always come after you have lost too much oil already.
At least for people still using gasoline or diesel cars.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Unfortunately, the check engine will always come after you have lost too much oil already. At least for people still using gasoline or diesel cars.

Sometimes.

True story: two months ago I left Milwaukee on a deployment to the Bay Area. With supplies and gear for three racks, it wasn't cost-effective to ship, and I liked the idea of a cross-country drive. So I piled everything into the SUV and started driving west. For various reasons, I went off the Interstate in order to swing down to I-40 via US-54, and in the process, went through The Middle of Nowhere, MO. So it's a Sunday afternoon, turning dark, and I'm headed down some two lane state highway ... and the battery light comes on. No cell coverage. Oh crap. :smile: So I took to a policy of driving very gingerly, as it was probably an alternator failure, and managed about 15 more miles and got into the smallest little town you ever did see, basically a few houses and businesses around a town square park. It's great driving as you experience progressive electronics failures, starting with the fancy stuff, working its way into the dash, and then having all the panel meters die, then losing power steering, knowing that any stop is going to be the end of it. But, yay, cell signal in town! The tow truck came in from about sixty miles away. Alternator was fried, of course.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
It's great driving as you experience progressive electronics failures
It's even more fun flying that way. Fortunately, the engine doesn't depend on the battery/alternator at all.
 
