messages:Mar 30 23:02:05 tubby MCA: Bank 12, Status 0x8c00004e000800c3
messages-Mar 30 23:02:05 tubby MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
messages-Mar 30 23:02:05 tubby MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 0
messages-Mar 30 23:02:05 tubby MCA: CPU 0 COR (1) MS channel 3 memory error
messages-Mar 30 23:02:05 tubby MCA: Address 0x1f17b5bbc0
messages-Mar 30 23:02:05 tubby MCA: Misc 0x1229402000201c8c
--
messages.1:Jan 22 00:53:48 tubby MCA: Bank 12, Status 0x8c00004e000800c3
messages.1-Jan 22 00:53:48 tubby MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
messages.1-Jan 22 00:53:48 tubby MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 0
messages.1-Jan 22 00:53:48 tubby MCA: CPU 0 COR (1) MS channel 3 memory error
messages.1-Jan 22 00:53:48 tubby MCA: Address 0x1f17a9bbc0
messages.1-Jan 22 00:53:48 tubby MCA: Misc 0x1229402000201c8c
--
messages.6:Aug 23 19:32:07 tubby MCA: Bank 12, Status 0x8c00004e000800c3
messages.6-Aug 23 19:32:07 tubby MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
messages.6-Aug 23 19:32:07 tubby MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 0
messages.6-Aug 23 19:32:07 tubby MCA: CPU 0 COR (1) MS channel 3 memory error
messages.6-Aug 23 19:32:07 tubby MCA: Address 0x200951bbc0
messages.6-Aug 23 19:32:07 tubby MCA: Misc 0x1229402000201c8c
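For anyone who wants to sanity-check what the kernel printed, here is a rough Python sketch that pulls a few fields out of the Bank 12 Status value from the MCA records above. The bit positions are my reading of the machine-check chapter of the Intel SDM (VAL at bit 63, UC at bit 61, corrected error count in bits 52:38, and the memory controller error code pattern 0000 0000 1MMM CCCC in the low 16 bits), so treat them as an assumption and check the manual for your CPU. The function and dictionary names are made up for this example.

```python
# Sketch: decode a few architectural fields of an IA32_MCi_STATUS value.
# Bit layout is assumed from the Intel SDM machine-check chapter; verify for your CPU.

STATUS = 0x8C00004E000800C3  # "Bank 12, Status" value from the log

# MMM field of a memory controller error code (0000 0000 1MMM CCCC)
MEM_OPS = {
    0: "generic",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    """Return selected fields of a 64-bit machine-check status register."""
    code = status & 0xFFFF  # MCA error code, bits 15:0
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        # Only meaningful when the platform reports corrected error counts.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": code,
    }
    # Memory controller errors match the pattern 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        channel = code & 0xF
        fields["memory_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        # 0xF means the channel was not specified.
        fields["channel"] = None if channel == 0xF else channel
    return fields


if __name__ == "__main__":
    for name, value in decode_mci_status(STATUS).items():
        print(f"{name}: {value}")
```

If the layout assumption holds, the decoded fields should line up with the "COR (1) MS channel 3 memory error" line: a corrected error, count of 1, memory scrubbing, channel 3.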
Let's distinguish between the part that is wearing - RAM - and the alert system for that - ECC/BIOS/IPMI. With an automated alert system, I do not need to do manual testing of the wearing part, as long as I know that the automated alert system exists and functions.
In your brake analogy, the wearing part is the brake system, and the alert system is completely manual. At least on my car it is.
Testing the automated alert system is a fun little side project. I've got some feelers out for a stick of defective ECC. I'll post if I get a bite. Edit: Crucial can't part with defective sticks. Maybe someone on the STH forums feels generous.
If you are really really keen, you can always solder wires to an ECC stick and inject errors, see link further up in this thread.
Edit: More error injection ideas. "Two syringes", wow that sounds scary, see video at https://www.vusec.net/projects/eccploit/ . And more generally rowhammer, which will work if (big IF) the specific modules in that specific board are susceptible to it.
Edit2: If you thought sticking needles into your DIMM socket was scary, have a look at this: http://bluesmoke.sourceforge.net/heat_gun.html
Edit3: Masking a pin. Hmm. http://bluesmoke.sourceforge.net/testing.html . Though maybe Kapton tape instead.
Edit4: Heat lamps! Oh my goodness. https://www.cs.princeton.edu/~appel/papers/memerr.pdf
So, out of all of those, I think the most reasonable things to try, in order, are:
- Boot from a Linux stick and try rowhammer. Slim chance that the memory is actually susceptible to it, but it costs nothing to try.
- Mask a pin with Kapton tape. Introduces error, will allow one to verify the alerting system works.
- ..... yeah no I'm not comfortable with any of the others :). Mayyyybe the gooseneck clip-on lamp with a 50W bulb. But, yeah, not sure I am keen enough to go down that road.
So, @diversity , I can't wait to hear your test results :)
Create rowhammer base jail and start it
ssh to FreeNAS
iocage console rowhammer
pkg install gcc git bash
git clone https://github.com/google/rowhammer-test.git
cd rowhammer-test
bash make.sh
./rowhammer-test
1) Asrock did NOT properly test ECC for their Ryzen platforms (like the X470D4U / X470D4U2-2T). I did do this testing and discovered ECC reporting is NOT working on these platforms and they've admitted this. They said they'll remove things like ECC Error Event logs in the IPMI, because it fools people into thinking that it does work.
Huh? No it doesn't do that. I have a supermicro board and there was nothing more in the event log.
Did the middleware send you an email alert when these happened?
More information should be in the event log of your IPMI.
Supermicro SYS-1028R-WTNRT with mainboard X10DRW-NT. FreeBSD 11.3
Can you please share your setup details? We can then add it to the not (yet) existent list of known good configurations.
If we have but a few then notifications might become a possibility.
Huh? No it doesn't do that. I have a supermicro board and there was nothing more in the event log.
Alert received from FreeNAS IPMI
IP : 192.168.2.8
Hostname:
SEL_TIME: 2020/04/16 11:19:47
SENSOR_NUMBER: 53
SENSOR_TYPE: Memory
SENSOR_NAME: OEM
EVENT_DESCRIPTION: Correctable ECC @DIMMS6
EVENT_DIRECTION: Assertion
EVENT SEVERITY:"information"

TrueNAS @ freenas.wuffden.local
New alerts:
* Memory #0x53 Asserted Correctable ECC (@DIMMO6(CPU4)).
Current alerts:
* Memory #0x53 Asserted Correctable ECC (@DIMMO6(CPU4)).
Definitely, padawans MUST be tortured :D
I used to torture some of my padawans with a criticism that "this is a detail-oriented business".
@diversity and I have chatted a bit about this in private messaging. This is essentially the reason I am suggesting to stick with major manufacturers who make servers as their bread and butter, and also to stick with the much more thoroughly deployed and tested Intel Xeon stuff.
I used to torture some of my padawans with a criticism that "this is a detail-oriented business".
When you install FreeBSD, do you install the ISO to HDD? Of course. Do you configure resolv.conf? Almost always. Do you set a root password? Probably. Do you set up NTP? Usually. Do you install an SSL CA pack? Maybe. Do you set up smartd? Possibly. Do you set up powerd, or nscd? Don't lie to me (heh). See, there are all sorts of corners of server software configuration that you COULD touch and maybe SHOULD touch, but don't.
In that same manner, there's lots of stuff that newbie server manufacturers COULD do, but the number of people clamoring for arcane features that require significant expertise to arrange correctly through several highly specialized mainboard subsystems does not really make a compelling case for a low-yield server manufacturer to make such an engineering investment. Heck, I don't even trust Supermicro to get this stuff right on alternative platforms like Ryzen. I'm pretty sure they do have the expertise in-house, which gives them a huge advantage, but call me skeptical until someone demonstrates that it works.