SOLVED The usefulness of ECC (if we can't assess it's working)?

diversity · Apr 9, 2020

jgreco said:
If you're willing to settle for doing it on just one or two platforms whose behaviour is known, that's a lot easier.

Yes please. That is part of what I intended with one of my sub-questions about known good configurations.
Let's have a list of configurations that should work and that are easy to test, both on arrival and over time.
If one wants to take a different route, then they're on their own until enough traction develops to include that route as well.

I would like to draw attention again to the fact that iXsystems is promoting WD Red. I buy nothing else these days, so again no snark is intended, but why promote only a specific HDD, which is indeed very important to the setup, while AFAIK other vital aspects like the ones we are discussing now are not promoted?

I have seen the LSI 2008 HBA mentioned once in a thread about virtualizing FreeNAS, but that in itself is not really what I meant by a known good configuration.

diversity · Apr 9, 2020

Yorick said:
Now, I do wonder: Could there be a community effort to contribute to the middleware layer for SuperMicro boards

Where do I chime in? I for one would like to help out.
I only hope AMD Ryzen gets a chance too, via community efforts like the one you propose, so that stable server setups become affordable for small fish like me.

Yorick · Apr 9, 2020

The ASRock Rack series sends alerts on ECC error. I am thinking that reasonably, this may not be a FreeNAS middleware layer ask, this may just be a matter of configuring IPMI to send alerts.

Curiously poking around, I find that the x570 Creator you have there is considerably more expensive than an X470D4U, and doesn't have IPMI. The CPU looks like it was chosen for heavy-duty virtualization. Let's say that throwing more money at it with less manageability strikes me as a curious choice.

diversity · Apr 9, 2020

This old ASRock Rack E3C226D2I in my main system does not seem to :( But I will admit I am too inexperienced to recognize it after looking through the options.

Firmware Revision:	0.18.0
Firmware Build Time:	Jan 21 2015 22:32:24 CST

Yorick · Apr 9, 2020

diversity said:
This old ASRock Rack E3C226D2I in my main system does not seem to :(

Not sure why you say that. Reading https://download.asrock.com/Manual/IPMI/E3C226D2I.pdf , the IPMI can be set up to send SMTP alerts. You'd expect ECC errors to show up in IPMI event log, and alerts to be sent via SMTP once configured.
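To check whether anything ECC-related has actually landed in the IPMI event log, one option is to pull the log with `ipmitool sel elist` and filter it. The snippet below is only a rough sketch: the sample lines are illustrative, not captured from an E3C226D2I, and the exact event wording depends on the BMC firmware, so the plain "ecc" substring match is an assumption.

```python
import subprocess


def ecc_events(sel_text):
    """Return event-log lines that mention ECC (case-insensitive match)."""
    return [line.strip() for line in sel_text.splitlines()
            if "ecc" in line.lower()]


def read_sel():
    """Fetch the System Event Log as text.

    Needs ipmitool installed and access to the BMC (local or remote).
    """
    result = subprocess.run(["ipmitool", "sel", "elist"],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Illustrative sample only; real wording varies between BMC firmwares.
SAMPLE = """\
 1 | 04/09/2020 | 10:15:02 | Temperature #0x30 | Upper Critical going high | Asserted
 2 | 04/09/2020 | 11:42:17 | Memory #0x53 | Correctable ECC | Asserted
"""

if __name__ == "__main__":
    # Swap SAMPLE for read_sel() on a machine with a reachable BMC.
    for event in ecc_events(SAMPLE):
        print(event)
```

An empty result only means no matching entries were logged. It does not by itself prove that the board would log one.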

diversity · Apr 9, 2020

Yorick said:
Not sure why you say that. Reading https://download.asrock.com/Manual/IPMI/E3C226D2I.pdf , the IPMI can be set up to send SMTP alerts. You'd expect ECC errors to show up in IPMI event log, and alerts to be sent via SMTP once configured.

I made a mistake and have edited my post in the meantime. It is inexperience :(

diversity · Apr 9, 2020

But thanks a lot for showing me, I'll be sure to check it out.

However, 0 ecc errors in the logs since 2016. Is that a bad omen for reporting functionality or just rock solid ram?

Yorick · Apr 9, 2020

Inexperience is good, it gives you something to learn! Growth mindset, always :). https://www.youtube.com/watch?v=FpN1yQap_is

diversity · Apr 9, 2020

PassMark MemTest86 8.4 RC2 build 1000 is unable to inject errors on my main system, so I am kind of nervous.

Yorick · Apr 9, 2020

diversity said:
However, 0 ecc errors in the logs since 2016. Is that a bad omen for reporting functionality or just rock solid ram?

That's as expected. ECC RAM is carefully "binned" - tested for reliability before being sold. Memory that is going to fail tends to show ECC errors in its first 12-18 months, and most sticks that survive those first 18 months won't die in the coming years. For a single server, you expect not to see ECC errors. For thousands of servers, seeing ECC errors is almost guaranteed.
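The single-server versus fleet point is just arithmetic. As a minimal sketch, assume each server independently has the same probability p of logging at least one ECC error in a given period; then the chance that at least one of N servers logs an error is 1 - (1 - p)^N. The value of p below is a made-up number for illustration, not a measured error rate.

```python
def p_any_error(p_single, servers):
    """Probability that at least one of `servers` machines logs an ECC error.

    Assumes machines are independent and share the same per-machine
    probability `p_single` (a simplification).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if servers < 0:
        raise ValueError("servers must be non-negative")
    return 1.0 - (1.0 - p_single) ** servers


# Hypothetical per-server probability, chosen only to show the trend.
P_SINGLE = 0.01

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} servers: {p_any_error(P_SINGLE, n):.4f}")
```

With a small p, one machine will most likely show nothing, while the same p across a thousand machines makes at least one error close to certain.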

Yorick · Apr 9, 2020

Intel BIOS forbids injection. That's not a reason to be nervous. It's okay to trust that ASRock did the testing in their labs, and that their alerting function works. We don't all need to replicate vendor lab tests. :)

Edit and inevitable car analogy: I don't test that the check engine light will come on just before my oil runs out and the engine block seizes. I trust that the manufacturer has done that test, the car will alert me, and I'll do oil changes because I don't want my engine block to seize.

diversity · Apr 9, 2020

Ohh my, I realize now more than ever I should continue my quest.
I have double-checked the event log sensors in my main system's ASRock Rack E3C226D2I. As far as I can tell, no ECC error category is present.

What makes me even more worried is that if Intel BIOS forbids injection, then how trustworthy is the supported CPU family list on PassMark's MemTest86 Pro comparison page?

AMD Bulldozer (15h)
AMD Steamroller (15h)
AMD Jaguar (16h)
AMD Ryzen (17h) [Note: Injection is disabled in most AMD retail CPUs. To enable, please consult the Processor Programming Reference document]
AMD Steppe Eagle SoC
AMD Merlin Falcon SoC
Intel Nehalem
Intel Lynnfield
Intel Westmere
Intel Xeon E3 family (Sandy Bridge)
Intel Xeon E3 v2 family (Ivy Bridge)
Intel Xeon E3 v3 family (Haswell)
Intel Xeon E3 v4 family (Broadwell)
Intel Xeon E3 v5 family (Skylake)
Intel Xeon E3 v6 family (Kaby Lake)
Intel Atom C2000 SoC
Intel Broadwell-H SoC
Intel Apollo Lake SoC

I still am not aware of a documented/reported known working configuration.

And again Yorick, I am really thankful you are participating. Maximum respect. But please let me politely reject your car analogy as follows:
When I buy a car, I first test drive it. As soon as I hit the brakes for the first time, I know whether they are working or not.
After having bought it, I assess each time I brake for a stop sign or something whether they still do.
And where I come from there is a mandatory, by government law, yearly check-up to make sure the brakes are not about to fail.

This is exactly what I am looking for regarding my data.

I'd like to put it in a more dramatic way, just to stir things up if I can. I would rather die in a crash due to failing car brakes than lose the data I am trying to protect. I am actually quite serious about this. I have many 'backups' spread all around. Nothing really industrial though. But I would like it to be better.
Thus I am looking to have 3 FreeNAS setups spread around the globe and synced, each having 3 x 3TB mirrored.

Yorick · Apr 9, 2020

Let's distinguish between the part that is wearing - RAM - and the alert system for that - ECC/BIOS/IPMI. With an automated alert system, I do not need to do manual testing of the wearing part, as long as I know that the automated alert system exists and functions.

In your brake analogy, the wearing part is the brake system, and the alert system is completely manual. At least on my car it is.

Testing the automated alert system is a fun little side project. I've got some feelers out for a stick of defective ECC. I'll post if I get a bite. Edit: Crucial can't part with defective sticks. Maybe someone on the STH forums feels generous.

If you are really really keen, you can always solder wires to an ECC stick and inject errors, see link further up in this thread.

Edit: More error injection ideas. "Two syringes", wow that sounds scary, see video at https://www.vusec.net/projects/eccploit/ . And more generally rowhammer, which will work if (big IF) the specific modules in that specific board are susceptible to it.

Edit2: If you thought sticking needles into your DIMM socket was scary, have a look at this: http://bluesmoke.sourceforge.net/heat_gun.html

Edit3: Masking a pin. Hmm. http://bluesmoke.sourceforge.net/testing.html . Though maybe Kapton tape instead.

Edit4: Heat lamps! Oh my goodness. https://www.cs.princeton.edu/~appel/papers/memerr.pdf

So, out of all of those, I think the most reasonable things to try, in order, are:
- Boot from a Linux stick and try rowhammer. There is only a slim chance that the memory is actually susceptible to it, but it costs nothing to try
- Mask a pin with Kapton tape. Introduces error, will allow one to verify the alerting system works.
- ..... yeah no I'm not comfortable with any of the others :). Mayyyybe the gooseneck clip-on lamp with a 50W bulb. But, yeah, not sure I am keen enough to go down that road.
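For the Linux-stick route, the kernel's EDAC subsystem exposes per-memory-controller counters in sysfs (`ce_count` for corrected errors, `ue_count` for uncorrected ones), so one could read them before and after a test run. This is a sketch under assumptions: it only works if an EDAC driver for the platform is loaded, and it says nothing about whether IPMI alerting fires.

```python
from pathlib import Path

# Standard EDAC sysfs location on Linux; present only if an EDAC driver loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} read from EDAC sysfs."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found; is an EDAC driver loaded?")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

If the corrected count goes up after masking a pin or a rowhammer run, the detection path works at least as far as the OS. Whether the BMC also logs and mails it is a separate check.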

So, @diversity , I can't wait to hear your test results :)

diversity · Apr 9, 2020

lol. Yorick. You called me out on the dramatic part :) and correctly so.

I will sleep on it some nights and probably end up feeling helpless as I dare not do the things you are suggesting.

But one never knows, I'll have a look first over a few days.

Thx again for contributing.
(thumbsup)

diversity · Apr 9, 2020

Patrick M. Hausen said:

Can you please share your setup details? We can then add them to the not (yet) existent list of known good configurations.

If we have but a few, then notifications might become a possibility.

diversity · Apr 9, 2020

@Yorick I have to admit I am out of my league :( I quickly scanned the links you offered and found no footing :(

I thank all for their contributions. I respect and accept (even though it just does not feel logical to me) that others do not feel ECC is basically useless when one can't assess that it is working. Perhaps my wording was chosen poorly.

I will open up a new thread about known good configurations with commercial memtesters

kind regards

Apollo · Apr 9, 2020

Yorick said:
Edit and inevitable car analogy: I don't test that the check engine light will come on just before my oil runs out and the engine block seizes. I trust that the manufacturer has done that test, the car will alert me, and I'll do oil changes because I don't want my engine block to seize.

Unfortunately, the check engine light will always come on after you have already lost too much oil.
At least for people still using gasoline or diesel cars.

jgreco · Apr 9, 2020

Apollo said:
Unfortunately, the check engine light will always come on after you have already lost too much oil. At least for people still using gasoline or diesel cars.

Sometimes.

True story: two months ago I left Milwaukee on a deployment to the Bay Area. With supplies and gear for three racks, it wasn't cost-effective to ship, and I liked the idea of a cross-country drive. So I piled everything into the SUV and started driving west. For various reasons, I went off the Interstate in order to swing down to I-40 via US-54, and in the process, went through The Middle of Nowhere, MO. So it's a Sunday afternoon, turning dark, and I'm headed down some two lane state highway ... and the battery light comes on. No cell coverage. Oh crap.

So I took to a policy of driving very gingerly, as it was probably an alternator failure, and managed about 15 more miles and got into the smallest little town you ever did see, basically a few houses and businesses around a town square park. It's great driving as you experience progressive electronics failures, starting with the fancy stuff, working its way into the dash, and then having all the panel meters die, then losing power steering, knowing that any stop is going to be the end of it. But, yay, cell signal in town! The tow truck came in from about sixty miles away. Alternator was fried, of course.

danb35 · Apr 9, 2020

jgreco said:
It's great driving as you experience progressive electronics failures

It's even more fun flying that way. Fortunately, the engine doesn't depend on the battery/alternator at all.

Apollo · Apr 9, 2020

danb35 said:
It's even more fun flying that way. Fortunately, the engine doesn't depend on the battery/alternator at all.

There are now electric planes for transporting passengers.
https://www.harbourair.com/harbour-...of-worlds-first-commercial-electric-airplane/
