SOLVED The usefulness of ECC (if we can't assess it's working)?

Mastakilla · Apr 23, 2020

Although I'm very happy to hear that Diversity managed to see reporting of ECC errors on the Asrock Rack X470D4U when using a Ryzen 3950 (Zen 2), I'm also puzzled as to what this means for my own failed attempts to achieve the same...

Causing memory errors by shorting pins of memory modules

Firstly I did some research on this "Triggering memory errors using 'needles' or wires"-approach, which Diversity has used. I hadn't looked into this method before, as it seemed too risky and I didn't want to damage or degrade any of my hardware.

I found a good (but complicated) description in the following paper:

https://download.vusec.net/papers/eccploit_sp19.pdf

Error injection with a shunt probe. To reduce noise and cross-talk between high-speed signals, data pins of the DDR DIMM (DQx) are physically placed next to a ground (VSS) signal. As the ground plane (VSS) has a very low impedance compared to the data signal and because the signal driver is (pseudo) open drain, short-circuiting the VSS and DQx signals will pull DQx from its high voltage level to “0”. Depending on the encoding of the high voltage, this short-circuiting results in a 1-to-0 or 0-to-1 bit flip on a given DQx line. Figure 1 displays the locations of the important signals and shows that a DQx signal is always adjacent to a VSS signal. Therefore, to inject a single correctable bit error, while the system exercises the memory by writing and reading all ones, we have to short-circuit a DQx signal with VSS. We can achieve the short-circuiting effect with the help of a custom-built shunt probe using syringe needles (Figure 2a). We insert the probe in the holes of the DIMM socket as shown in Figure 2b. For clarity, we omit the memory module from the picture. We then use tweezers to control when the error is injected by short-circuiting the two needles and thus the targeted DQx and nearby VSS signal. This method, while simple (and cheap), is effective in the case of a memory controller that computes ECCs in a single memory transaction (ECC word size is 64 bits) and can be used instead of expensive ad-hoc equipment [30], [31]. On some systems (e.g., configuration AMD-1) data is retrieved in two memory transactions and then interleaved. Because of the low temporal accuracy of the shunt probe method, an error inserted on memory line DQk (0 ≤ k < 64) that appears on data bit 2*k will also “reflect” on data bit 2*k+1 inside the 128 bit ECC word. In this case the syndrome corresponds to two bit errors and contradicts Proposition 1.
To ensure single bit errors, once the interleaved mechanism is understood, the exercising data can be constructed such that the reflected positions contain only bits that are encoded to low voltage, essentially masking the reflections.

So in short, you're connecting a data-pin with a ground-pin, so that the current on the data-pin "flows away" into the ground-pin and this "flips a bit". When using the correct pins and not accidentally shorting anything else, this should actually be "reasonably safe" to do I think... (please correct me if I'm wrong :) )
Why is this causing single-bit errors and not multi-bit errors? I THINK because every "clock tick" data is pulled from each data-pin of the memory module. So if you change only the "result" of 1 pin, you get a maximum 1 bit flipped per "clock tick", which ECC can then correct. (not sure though)
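To make the "reflection" part of the paper's description a bit more concrete, here is a minimal Python sketch of it. It only models what the quoted text says for the AMD-1 configuration: a short on line DQk hits data bits 2*k and 2*k+1 of the 128 bit ECC word. It assumes that a high voltage encodes a 1, so the short forces both bits to 0. The function names are made up for illustration, and none of this has been checked against Ryzen hardware.

```python
WORD_BITS = 128  # size of the interleaved ECC word described in the paper


def reflected_bits(k):
    """Data-bit positions affected by a short on line DQk (0 <= k < 64)."""
    if not 0 <= k < 64:
        raise ValueError("k must be in the range 0..63")
    return (2 * k, 2 * k + 1)


def short_to_ground(word, k):
    """Model the short: both affected bits are pulled to 0 (assumes high = 1)."""
    for bit in reflected_bits(k):
        word &= ~(1 << bit)
    return word


def masking_pattern(k):
    """All ones, except the reflected position 2*k+1, which is already 0."""
    return ((1 << WORD_BITS) - 1) & ~(1 << (2 * k + 1))


def flipped_bits(before, after):
    """Number of bit positions that differ between two words."""
    return bin(before ^ after).count("1")


all_ones = (1 << WORD_BITS) - 1
print(flipped_bits(all_ones, short_to_ground(all_ones, 5)))  # 2 flips: looks like a double-bit error
masked = masking_pattern(5)
print(flipped_bits(masked, short_to_ground(masked, 5)))      # 1 flip: a single-bit error
```

With an all-ones test pattern the short changes two bits of the word, while the masked pattern leaves only one bit that can flip, which is the trick the paper describes for keeping the injected error correctable.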

This paper is already a bit older and was using DDR3. The 'AMD-1' configuration they are talking about (where interleaving is making things complicated) is an 'AMD Opteron 6376 (Bulldozer (15h))'.
As far as I understand, the extra complexity of "interleaving", that happens on the Opteron system, is only applicable on Ryzen, when using Dual Rank memory. As Diversity was using a single 8GB module, I suppose he only has single rank memory, so wasn't confronted with the "interleaving-complexity". However, if I would try this with my 16GB modules, I would be confronted with this extra complexity, because ?all? 16GB modules are dual rank...
I concluded this from this article (but I could be misinterpreting things!):
https://www.reddit.com/r/Amd/comments/6nzjeb/optimising_ryzen_cpu_memory_clocks_for_better/

RankInterleaving: auto (left on default, untested; should only be toggled with Dual Rank Memory* )

As the paper was for DDR3, you of course need to find the pin layout of DDR4 before you can apply it on Ryzen. The following datasheet has a pretty clear pin layout of an unregistered ECC DDR4 module:

http://www.supertalent.com/datasheets/SuperTalent%20Datasheet-DDR4-ECCUDIMM-F24EA8GS.pdf

On pages 6, 7 and 8 you can see the description per pin and on page 17 you can see a picture of where those pin numbers are on the memory module. I suppose all VSS-pins are ground pins and DQ+number pins are data pins. So if we follow the example of the paper and short DQ0 with a VSS, that corresponds to shorting pin-4 (VSS) with pin-5 (DQ0). But I guess shorting pin-2 (VSS) with pin-3 (DQ4) could work equally well.
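The pin lookup above can be sketched as a small Python helper. The table below contains only the four pins mentioned in this post; the rest of the pinout has to come from the datasheet. The helper assumes that consecutive pin numbers sit next to each other on the module, which should be checked against the picture on page 17 before trusting any pair it prints. The names KNOWN_PINS and shortable_pairs are made up for illustration.

```python
import re

# Only the pins named in this post; fill in the rest from the datasheet.
KNOWN_PINS = {2: "VSS", 3: "DQ4", 4: "VSS", 5: "DQ0"}

# Matches data pins such as DQ0 or DQ63, but not strobe pins such as DQS0_t.
DATA_PIN = re.compile(r"DQ\d+")


def shortable_pairs(pins):
    """List (signal, data pin, ground pin) for every DQ pin next to a VSS pin."""
    pairs = []
    for number in sorted(pins):
        if not DATA_PIN.fullmatch(pins[number]):
            continue
        for neighbour in (number - 1, number + 1):
            if pins.get(neighbour) == "VSS":
                pairs.append((pins[number], number, neighbour))
    return pairs


for signal, dq_pin, vss_pin in shortable_pairs(KNOWN_PINS):
    print(f"{signal}: short pin-{dq_pin} with pin-{vss_pin} (VSS)")
```

For the four pins listed this also reports pin-3 (DQ4) with pin-4 (VSS), which follows from the same table but was not one of the pairs tried in this thread.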

This should help us "understand" a bit better what Diversity has done and how to "safely" reproduce it.

The results from Diversity
I was in contact with Diversity and have some more details on his testing (all pictures in this "chapter" are from Diversity himself). Diversity used the following video: https://www.youtube.com/watch?v=_npcmCxH2Ig. Instead of needles and tweezers, he used a thin wire, as in the picture below.

He was able to trigger "ECC errors" in Memtest86 and "Corrected errors" in Linux (Proxmox).

My testing / experiences
As you know, I also tried triggering reporting of corrected memory errors. I tried this by overclocking / undervolting the memory to the point where it is on the edge of stability. This edge is very "thin", can be hard to reach and can result in the following scenarios in my understanding:
1) Not unstable enough, so no errors at all
2) Only just unstable enough, so that single-bit errors occur only occasionally when stressing the memory enough. These will then be corrected by ECC and will not cause faults or crashes.
3) A little more unstable, so that single-bit errors occur a bit more often and less stress is required on the memory to achieve this. But also (uncorrected) multi-bit errors can occur sometimes, which could cause faults / crashes.
4) Even a little bit more unstable, so that mostly multi-bit errors occur when stressing the memory and single bit errors might be rare. This also makes the system more prone to faults and crashes.
5) Even more unstable, so the multi-bit errors occur even when hardly stressing the memory at all. This makes the system very unstable and probably will not be able to boot into OS all the time.
6) Too unstable, so that it doesn't boot at all.
Both scenario 2) and 3) are "good enough" for testing reporting of corrected memory errors. Perhaps even scenario 4), if you're lucky...
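The scenarios above rely on the difference between single-bit errors (corrected) and multi-bit errors (detected but not corrected). A toy Python sketch of that behaviour follows. It uses an extended Hamming code on only 4 data bits, which is an assumption chosen to keep the example short: it is not the code that a real memory controller uses (ECC UDIMMs carry 8 check bits per 64 data bits), and the names encode and decode are made up for illustration.

```python
def encode(data):
    """Encode 4 data bits into 8 bits: Hamming(7,4) plus one overall parity bit."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p4, d2, d3, d4]  # code positions 1..7
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]


def decode(word):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos - 1]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = 0
    for bit in w:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error and repair it.
        if syndrome:
            w[syndrome - 1] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, cannot repair.
        status = "uncorrectable"
    return status, [w[2], w[4], w[5], w[6]]


word = encode([1, 0, 1, 1])
one_flip = list(word)
one_flip[4] ^= 1
two_flips = list(word)
two_flips[1] ^= 1
two_flips[6] ^= 1
print(decode(one_flip))      # ('corrected', [1, 0, 1, 1])
print(decode(two_flips)[0])  # uncorrectable
```

In scenario 2) every error looks like the first case, so the data stays intact and only a report is generated. From scenario 3) onwards the second case starts to appear, where the error is noticed but the data cannot be repaired.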

During all my testing I tried 100+ possible memory settings, using all kinds of frequencies, timings and voltages, of which 10-30 were potentially in scenario 2) or 3). I "PARTLY" kept track of all testing in the below (incomplete) Excel:

This convinced me that I should have at least once been in scenario 2) or 3), where I should have seen corrected errors (but didn't). That is why I concluded that it didn't work and I contacted Asrock Rack and AMD to work on this.

Conclusions / Questions / Concerns
Now what does all of this mean? Does this mean that I never reached scenario 2) or 3)? Does it mean scenario 2) and 3) are almost impossible to reach using the methods I tried? Or does it mean that Diversity perhaps triggered a different kind of memory error? I'm not sure and I hope someone can clarify...

I know there is error correction and reporting happening on many layers in a modern computer. As far as I know, there are these:
1) Inside the memory module itself (only when you have ECC modules). The memory module then has an extra chip on the module to store checksums. I think this works similarly to RAID5 for HDDs, so that a memory error is detected and corrected in the module itself, even before it exits the memory module.
2) On the way from the memory module to the memory controller on the CPU (databus). Error detection / correction / reporting for these kinds of errors is handled by the memory controller in the CPU, so ECC memory isn't even required to make this work.
3) Inside the CPU, data is also transferred between the L1/L2/L3 caches and the CPU cores. Error detection / correction / reporting is also possible there, I think.

All of these might look confusingly similar when reported to the OS, but I do think they are often reported in a slightly different manner. I've seen reports where the CPU cache (L1/L2/L3) was clearly mentioned when reporting a corrected error for example, but I'm not sure what the exact difference between reports of 1) and 2) would be.
In Proxmox screenshots I do read things like "Unified Memory Controller..." and "Cache level: L3/GEN...", but again, I'm not entirely sure if these mean that the errors are in 2) or 3) instead of 1)...

Diversity is draining the current from a data-pin "outside" of the memory module, but I still see 2 possibilities of what's happening:
1) The drain on the data-pin is pulling the current away also inside the memory module itself, where the memory module detects the error, corrects and reports it.
2) The drain on the data-pin only happens after the memory module has done its checks and the error is only detected on the databus and corrected / reported there (so not by the ECC functionality of the memory module).

To know which scenario is happening, we could:

- try to find someone with knowledge of exactly how each type of error is reported and who can identify exactly what is being reported
- check whether a non-ECC single-rank module also reports the same kind of ECC errors, which would prove it is happening on the databus
- perhaps someone with more electrical engineering knowledge (I don't have any actually) can say something more meaningful than I can?

Diversity did mention that his IPMI log was still empty after getting those errors. So there the motherboard / IPMI is certainly missing some required functionality.

Yorick · Apr 23, 2020

Mastakilla said:
Now what does all of this mean?

It means you need a better method of injecting errors, which you are hard at work towards; shorting pins 4 and 5 sounds like a start.

It may also mean you and AMD have been talking past each other. I note they told you error reporting wasn't supported. In the context of ECC, this might mean:

- We detect and correct 1-bit ECC errors. This can then also be reported to the OS / IPMI.
- We do not detect and report uncorrectable 2-bit ECC errors.

I don't think they meant "we detect and correct 1-bit ECC errors, and this is invisible to the OS and IPMI"

Which means AsRock shouldn't remove the sensors in IPMI for 1-bit correctable errors. It's quite possible that 2-bit errors will not be reported by the chipset / CPU, and that's okay for a desktop CPU - but I could swear I saw an article that claims it works anyway.

This bit here talks about 1-bit vs 2-bit on Ryzen: https://www.reddit.com/r/Amd/comments/bsszwg/ecc_ryzen_and_2bit_errors/

Mastakilla · Apr 23, 2020

But are you sure that the "needle-approach" from Diversity tests the ECC functionality of the memory modules and not of the memory data bus? Because I still have some doubts on that...

What AMD TW was claiming is indeed clearly not accurate. Could be a misunderstanding, could be a lie to get rid of the hard questions. Doesn't really matter actually.
AMD US is working with Asrock TW engineers on the issue as we speak though. I got a confirmation of that today... Still thinking on how to bring them this "new information" from Diversity though...

Thanks for the reddit link btw... I'll try to read through tomorrow...

JaimieV · Apr 24, 2020

Fascinating thread, thank you all. It does make my decision to use old Dell server hardware feel very much like the right one! When setting up initially, one of my ECC DIMMs was bad and the R510 alerted on it, so that was good. Also I've previously looked after a datacentre with over 300 similar (but newer) machines and had two ECC events reported by IPMI during that lifetime. ECC issues that aren't "this DIMM is busted" appear to be a vanishingly rare thing in real life. That is what you would expect, considering how complex the rest of a computer is and how well it all works; it wouldn't work that well if errors were being thrown in the digital signals at any visible rate.

Yorick · Apr 24, 2020

Mastakilla said:
tests the ECC functionality of the memory modules and not of the memory data bus

I don't know what you are referencing when you speak of error correction on the data bus. I use memory, I've never read the DDR4 spec. Can you point to some documentation that speaks to the data bus error correction you are referencing?

Mastakilla · Apr 24, 2020

Good point to question this :)

I'm not sure anymore where I "learned" this, but somehow it is now in my brain as "a fact", so I guess someone knowledgeable must have told me this...?

When googling I couldn't directly find a knowledgeable source that confirms it, but I did find for example this:
https://www.reddit.com/r/Amd/comments/etjkjn/oc_tip_for_ryzen_3000_infinity_fabric_run_linux/
Here non-server-minded overclockers without ECC memory are bumping into corrected errors on their infinity fabric.

It does say pretty clearly "Link Error", where the screenshot from Diversity says "DRAM ECC Error". However on that same line in his screenshot it also says "Unified Memory Controller", which is on the CPU, right?
Both mention the L3 cache on the next line...

I'll let you know if I find a more knowledgeable source again...

Mastakilla · May 7, 2020

I'm happy to report that, after disabling "Platform First Error Handling (PFEH)" in the BIOS, (corrected) single-bit errors are properly reported to the OS, also when overclocking / undervolting! So I'm now getting the same results as Diversity with his memory pin shorting method...

The reason I was failing to detect this earlier was:

- Memtest86 v8.2 reported "unknown" for ECC support. Memtest86 v8.3 reported "enabled" for ECC support. So I assumed that, if it was working, Memtest86 v8.3 should be able to detect them. However, a couple of days ago I noticed that Memtest86 v8.4 beta had Zen2 ECC support in its changelog. After testing I found that Memtest86 v8.3 does NOT support Zen2 ECC, but Memtest86 v8.4 beta DOES support Zen2 ECC.
- I only discovered the BIOS option "Platform First Error Handling (PFEH)" very recently. During all my previous testing, except for the very last couple of short tests, it was set to the default "enabled". I probably did too little testing with Linux / Windows after disabling it.

So in short:

1) (corrected) single-bit memory errors -> motherboard (BIOS) -> OS ==> works 100%
2) (corrected) single-bit memory errors -> motherboard (BIOS) -> OS -> IPMI ==> not sure if the OS properly forwards the error to the IPMI
3) (corrected) single-bit memory errors -> motherboard (BIOS) -> IPMI ==> 100% broken
4) (corrected) single-bit memory errors -> motherboard (BIOS) -> IPMI -> OS ==> 100% broken
5) (uncorrected) multi-bit memory errors -> * ==> I'm not sure if it is broken (or perhaps not even possible on Zen2) or if we just haven't been able to trigger them yet. I've run Memtest86 v8.4 with unstable memory for many hours now. In doing so, I've triggered about 3000 "ECC Correctable Errors" (=single-bit) and about 100 CPU errors, but 0 "ECC Uncorrectable Errors" (=multi-bit). Also using the shorting-method, we haven't achieved any "ECC Uncorrectable Errors" (=multi-bit) yet. We are currently in contact with the persons who wrote the paper (see link above for details) that explained the shorting-method, to see how to trigger multi-bit errors reliably.

So if I understand it correctly

1) = ok
2) I think we can only validate this once 3) is fixed
3) Is actually a bug and should be fixed by Asrock Rack (with help of AMD and perhaps the IPMI-chip manufacturer Aspeed)
4) Can only work / be fixed once 3) gets fixed
5) Suggestions are welcome. Perhaps AMD can confirm if Zen2 properly supports this? But not like AMD TW claimed that “reporting is not supported”, which we now clearly proved to be false

I've sent this information to Asrock Rack + AMD. Asrock Rack, on the same day, confirmed that they, together with AMD, had come to the exact same conclusion (using error injection in Linux) and that they asked AMD for assistance to report these errors in the IPMI as well. So hopefully we'll someday get this important feature on this motherboard!!

In the meantime, Diversity (especially) and I are still trying to trigger (uncorrected) multi-bit errors as well. We're in contact with the interesting folks of ECCploit for this, who have a very profound knowledge of this matter... (Check out Lucian's talk at OffensiveCon19 https://www.youtube.com/watch?v=R2aPo_wwmZw).

Finally some real progress on this matter! Thanks to Diversity for getting my hopes up again, cause I almost gave up on this...

Mastakilla · Jul 8, 2020

Hi everyone,

It's been a while, but I have another update...

I'll start with a summary and then post all "evidence". This is all regarding the platform ASRock Rack X470D4U2-2T + AMD Ryzen 3x00 (Zen 2 cores) + ECC memory (see my signature for more specifics), using the latest stable BIOS. I am using the "overclock the memory until it is barely stable"-method, as described in earlier posts.

Memory Injection on Linux, using mce-inject, as described some posts earlier, does not inject memory errors on a platform level, but only on an OS level. So it is not suitable for testing if the IPMI / BMC properly handles memory error detection. We've discovered this because the "Platform First Error Handling" toggle in the BIOS, has no effect on this method.
ECC correction works!
- Already confirmed / proven earlier in this thread.
- When using default BIOS settings.
(Corrected) single-bit ECC memory error detection by "the OS" works (if correctly implemented)!
- Already confirmed / proven earlier in this thread.
- But only when setting "Platform First Error Handling" to disabled in the BIOS.
- Works on for example
  - Memtest86 v8.4 or higher
  - Linux kernel 5.6 or higher
  - TrueNAS 12.0 beta 1 (not on FreeNAS 11.3)
(Uncorrected) multi-bit ECC memory error detection by "the OS" works (if correctly implemented)!
- This is a new discovery.
- But only when setting "Platform First Error Handling" to disabled in the BIOS.
- Works on for example
  - Memtest86 (unreleased version - fixes will be included in next release)
  - Linux kernel 5.7 (probably also on 5.6, but I didn't try it)
  - Not sure about TrueNAS 12.0 beta 1. I haven't been able to trigger or recognize it yet.
IPMI / BMC is unable to detect any kind of memory error
- Confirmed once more.
- Even when setting "Platform First Error Handling" to enabled in the BIOS.
- Asrock Rack is (hopefully) still working on getting this fixed?

(Uncorrected) multi-bit ECC memory error detection by "the OS"
Memtest86
After I notified Passmark that Linux is able to detect (uncorrected) multi-bit ECC memory errors and Memtest86 v8.4 isn't, they asked me to send the log files. They then provided me with a new version (so far still unreleased) which fixes the issue and can properly detect (uncorrected) multi-bit ECC memory errors!
Sorry, forgot to take a screenshot of this one. I do still have the log file. Here is the summary of the report:

Test Start Time	2020-05-25 08:19:11
Elapsed Time	1:47:46
Memory Range Tested	0x0 - 80F380000 (33011MB)
CPU Selection Mode	Parallel (All CPUs)
ECC Polling	Enabled
# Tests Passed	7/19 (36%)

Lowest Error Address	0x489128C48 (18577MB)
Highest Error Address	0x73D8367A0 (29656MB)
Bits in Error Mask	00000000FDDFFFFF
Bits in Error	30
Max Contiguous Errors	2

ECC Correctable Errors	2689
ECC Uncorrectable Errors	244
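
For anyone who wants to cross-check the numbers in this summary, a small Python sketch of the arithmetic behind them follows. It assumes that Memtest86 converts addresses to MB by dividing by 2^20 and rounding down, and that "Bits in Error" is simply the number of set bits in the error mask; both assumptions reproduce the figures in the report. The helper names are made up for illustration.

```python
MIB = 1 << 20  # bytes per "MB" as used in the report (assumption: binary megabytes)


def addr_to_mb(addr):
    """Convert a physical address to whole megabytes, rounding down."""
    return addr // MIB


def bits_in_mask(mask):
    """Count the set bits in the error mask."""
    return bin(mask).count("1")


print(addr_to_mb(0x80F380000))   # 33011, end of the tested memory range
print(addr_to_mb(0x489128C48))   # 18577, lowest error address
print(addr_to_mb(0x73D8367A0))   # 29656, highest error address
print(bits_in_mask(0x00000000FDDFFFFF))  # 30, matches "Bits in Error"
```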

Linux
Maybe I have triggered these earlier already, but I didn't notice them till recently.

[root@localhost ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 3 Corrected Errors
mc0: csrow3: 1 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 3 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
[root@localhost ~]# ras-mc-ctl --summary
Memory controller events summary:
Corrected on DIMM Label(s): 'mc#0csrow#2channel#1' location: 0:2:1:-1 errors: 3
Corrected on DIMM Label(s): 'mc#0csrow#3channel#0' location: 0:3:0:-1 errors: 3
Fatal on DIMM Label(s): 'mc#0csrow#3channel#0' location: 0:3:0:-1 errors: 1

No PCIe AER errors.

No Extlog errors.

No devlink errors.
Disk errors summary:
0:0 has 17 errors
0:2048 has 147 errors
0:2816 has 4 errors
MCE records summary:
12 Corrected error, no action required. errors
1 Deferred error, no action required. errors
2 Uncorrected, software containable error. errors
[root@localhost ~]#
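
To keep an eye on these counters without reading the output by hand, the `edac-util -v` lines shown above can be summed with a few lines of Python. This is a minimal sketch that assumes the exact line format shown in this post ("<number> Corrected Errors" / "<number> Uncorrected Errors"); other versions of edac-util may print something different. The function name count_edac_errors is made up for illustration.

```python
import re

# Matches e.g. "3 Corrected Errors" or "1 Uncorrected Errors".
LINE_RE = re.compile(r"(\d+) (Corrected|Uncorrected) Errors")


def count_edac_errors(text):
    """Sum the corrected and uncorrected counts over all lines of the output."""
    totals = {"Corrected": 0, "Uncorrected": 0}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


SAMPLE = """\
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 3 Corrected Errors
mc0: csrow3: 1 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 3 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
"""

print(count_edac_errors(SAMPLE))  # {'Corrected': 6, 'Uncorrected': 1}
```

The totals agree with the `ras-mc-ctl --summary` output above: 3 + 3 corrected errors and 1 fatal (uncorrected) error.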

Code:

[root@localhost ~]# cat /var/log/messages
...
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:08:59 localhost kernel: mce_notify_irq: 1 callbacks suppressed

May 20 00:08:59 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:08:59 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:08:59 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:08:59 localhost kernel: [Hardware Error]: Error Addr: 0x00000003080ccb40
May 20 00:08:59 localhost kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xf79c00000b800003
May 20 00:08:59 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:08:59 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x0)
May 20 00:08:59 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

May 20 00:08:59 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:08:59 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:08:59 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:08:59 localhost kernel: [Hardware Error]: Error Addr: 0x00000003095cc100
May 20 00:08:59 localhost kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x510600800a800302
May 20 00:08:59 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:08:59 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x0 offset:0x0 grain:64 syndrome:0x80)
May 20 00:08:59 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:08:59 localhost rasdaemon[995]:           <...>-661   [000]     0.000066: mce_record:           2020-04-01 19:34:33 +0200 Unified Memory Controller (bank=17), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:08:59 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=3, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01a0f7c01000000, addr= 3080ccb40, synd= f79c00000b800003, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:08:59 localhost rasdaemon[995]:           <...>-661   [000]     0.000066: mc_event:             2020-04-01 19:34:33 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#3channel#0 (mc: 0 location: 3:0 grain: 6)
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:08:59 localhost rasdaemon[995]:           <...>-661   [000]     0.000066: mce_record:           2020-04-01 19:34:33 +0200 Unified Memory Controller (bank=18), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:08:59 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=2, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01a01d301000000, addr= 3095cc100, synd= 510600800a800302, ipid= 9600150f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:08:59 localhost rasdaemon[995]:           <...>-661   [000]     0.000066: mc_event:             2020-04-01 19:34:33 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#2channel#1 (mc: 0 location: 2:1 grain: 6 syndrome: 0x00000080)
May 20 00:08:59 localhost abrt-server[1611]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:08:59 localhost abrt-server[1614]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:08:59 localhost abrt-server[1618]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:08:59 localhost systemd[1]: Started dbus-:1.3-org.freedesktop.problems@2.service.
May 20 00:08:59 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@2 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:09:00 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Found oopses: 1
May 20 00:09:00 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Creating problem directories
May 20 00:09:00 localhost abrt-notification[1657]: System encountered a non-fatal error in ??()
May 20 00:09:01 localhost abrt-dump-journal-oops[1036]: Reported 1 kernel oopses to Abrt
May 20 00:11:12 localhost systemd[1]: dbus-:1.3-org.freedesktop.problems@2.service: Succeeded.
May 20 00:11:12 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@2 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:12:15 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:12:15 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:12:15 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:12:15 localhost kernel: [Hardware Error]: Error Addr: 0x0000000301a4ef80
May 20 00:12:15 localhost kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xf79c00000b800003
May 20 00:12:15 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:12:15 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x0)
May 20 00:12:15 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
May 20 00:12:15 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:12:15 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:12:15 localhost rasdaemon[995]:           <...>-661   [000]     0.000086: mce_record:           2020-04-01 19:37:49 +0200 Unified Memory Controller (bank=17), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:12:15 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=3, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 301a4ef80, synd= f79c00000b800003, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:12:15 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:12:15 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:12:15 localhost rasdaemon[995]:           <...>-661   [000]     0.000086: mc_event:             2020-04-01 19:37:49 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#3channel#0 (mc: 0 location: 3:0 grain: 6)
May 20 00:12:15 localhost abrt-server[1674]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:12:15 localhost systemd[1]: Started dbus-:1.3-org.freedesktop.problems@3.service.
May 20 00:12:15 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@3 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:12:17 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Found oopses: 1
May 20 00:12:17 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Creating problem directories
May 20 00:12:17 localhost abrt-notification[1710]: System encountered a non-fatal error in ??()
May 20 00:12:18 localhost abrt-dump-journal-oops[1036]: Reported 1 kernel oopses to Abrt
May 20 00:12:59 localhost systemd[1]: Starting Cleanup of Temporary Directories...
May 20 00:12:59 localhost systemd-tmpfiles[1712]: /usr/lib/tmpfiles.d/BackupPC.conf:1: Line references path below legacy directory /var/run/, updating /var/run/BackupPC → /run/BackupPC; please update the tmpfiles.d/ drop-in file accordingly.
May 20 00:12:59 localhost systemd-tmpfiles[1712]: /etc/tmpfiles.d/tpm2-tss-fapi.conf:3: Line references path below legacy directory /var/run/, updating /var/run/tpm2-tss/eventlog → /run/tpm2-tss/eventlog; please update the tmpfiles.d/ drop-in file accordingly.
May 20 00:12:59 localhost systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
May 20 00:12:59 localhost systemd[1]: Finished Cleanup of Temporary Directories.
May 20 00:12:59 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:12:59 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8

May 20 00:14:26 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:14:26 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:14:26 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:14:26 localhost kernel: [Hardware Error]: Error Addr: 0x0000000395164300
May 20 00:14:26 localhost kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xf79c00000b800003
May 20 00:14:26 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:14:26 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x0)
May 20 00:14:26 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

May 20 00:14:26 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:14:26 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:14:26 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:14:26 localhost kernel: [Hardware Error]: Error Addr: 0x000000030088c100
May 20 00:14:26 localhost kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x510600800a800302
May 20 00:14:26 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:14:26 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x0 offset:0x0 grain:64 syndrome:0x80)
May 20 00:14:26 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:14:26 localhost rasdaemon[995]:           <...>-661   [000]     0.000099: mce_record:           2020-04-01 19:40:01 +0200 Unified Memory Controller (bank=17), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:14:26 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=3, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 395164300, synd= f79c00000b800003, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:14:26 localhost rasdaemon[995]:           <...>-661   [000]     0.000099: mc_event:             2020-04-01 19:40:01 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#3channel#0 (mc: 0 location: 3:0 grain: 6)
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:14:26 localhost rasdaemon[995]:           <...>-661   [000]     0.000099: mce_record:           2020-04-01 19:40:01 +0200 Unified Memory Controller (bank=18), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:14:26 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=2, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01a033c01000000, addr= 30088c100, synd= 510600800a800302, ipid= 9600150f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:14:26 localhost rasdaemon[995]:           <...>-661   [000]     0.000099: mc_event:             2020-04-01 19:40:01 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#2channel#1 (mc: 0 location: 2:1 grain: 6 syndrome: 0x00000080)
May 20 00:14:26 localhost abrt-server[1729]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:14:26 localhost abrt-server[1732]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:14:26 localhost abrt-server[1735]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:14:28 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Found oopses: 1
May 20 00:14:28 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Creating problem directories
May 20 00:14:28 localhost abrt-notification[1772]: System encountered a non-fatal error in ??()
May 20 00:14:28 localhost systemd[1]: dbus-:1.3-org.freedesktop.problems@3.service: Succeeded.
May 20 00:14:28 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@3 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:14:29 localhost abrt-dump-journal-oops[1036]: Reported 1 kernel oopses to Abrt
May 20 00:17:03 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8

May 20 00:17:03 localhost kernel: mce: Uncorrected hardware memory error in user-access at 621211640
May 20 00:17:03 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:17:03 localhost kernel: [Hardware Error]: Uncorrected, software restartable error.
May 20 00:17:03 localhost kernel: [Hardware Error]: CPU:9 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
May 20 00:17:03 localhost kernel: [Hardware Error]: Error Addr: 0x0000000621211640
May 20 00:17:03 localhost kernel: [Hardware Error]: IPID: 0x000000b000000000
May 20 00:17:03 localhost kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
May 20 00:17:03 localhost kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
May 20 00:17:03 localhost kernel: Memory failure: 0x621211: Sending SIGBUS to memtester:1666 due to hardware memory corruption
May 20 00:17:03 localhost kernel: Memory failure: 0x621211: recovery action for dirty LRU page: Recovered
May 20 00:17:03 localhost audit[1666]: ANOM_ABEND auid=0 uid=0 gid=0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=1666 comm="memtester" exe="/usr/bin/memtester" sig=7 res=1
May 20 00:17:03 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:17:03 localhost rasdaemon[995]:           <...>-213   [009]     0.000114: mce_record:           2020-04-01 19:42:37 +0200 Load Store Unit (bank=0), status= bc002800000c0135, Uncorrected, software containable error., mci=UECC Poison consumed, mca= DC data error type 1 (poison consumption).
May 20 00:17:03 localhost rasdaemon[995]: Memory Error 'mem-tx: data read, tx: data, level: L1', cpu_type= AMD Family 17h Zen1, cpu= 9, socketid= 0, ip= 401e81, cs= 33, misc= d01a000000000000, addr= 621211640, ipid= b000000000, mcgstatus=7 RIPV EIPV MCIP, mcgcap= 11c, apicid= 9
May 20 00:17:03 localhost audit: BPF prog-id=44 op=LOAD
May 20 00:17:03 localhost audit: BPF prog-id=45 op=LOAD
May 20 00:17:03 localhost audit: BPF prog-id=46 op=LOAD
May 20 00:17:03 localhost systemd[1]: Started Process Core Dump (PID 1790/UID 0).
May 20 00:17:03 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@1-1790-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:17:03 localhost systemd[1]: Started dbus-:1.3-org.freedesktop.problems@4.service.
May 20 00:17:03 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@4 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:17:04 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Found oopses: 1
May 20 00:17:04 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Creating problem directories
May 20 00:17:05 localhost abrt-dump-journal-oops[1036]: Reported 1 kernel oopses to Abrt
May 20 00:17:06 localhost abrt-notification[1833]: System encountered a non-fatal error in ??()
May 20 00:17:07 localhost systemd-coredump[1792]: Core file was truncated to 2147483648 bytes.
May 20 00:17:08 localhost abrt-dump-journal-core[1035]: Failed to obtain all required information from journald
May 20 00:17:12 localhost systemd-coredump[1792]: Process 1666 (memtester) of user 0 dumped core.#012#012Stack trace of thread 1666:#012#0  0x0000000000401e81 compare_regions (/usr/bin/memtester + 0x1e81)
May 20 00:17:12 localhost systemd[1]: systemd-coredump@1-1790-0.service: Succeeded.
May 20 00:17:12 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@1-1790-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:17:12 localhost systemd[1]: systemd-coredump@1-1790-0.service: Consumed 1.976s CPU time.
May 20 00:17:12 localhost audit: BPF prog-id=46 op=UNLOAD
May 20 00:17:12 localhost audit: BPF prog-id=45 op=UNLOAD
May 20 00:17:12 localhost audit: BPF prog-id=44 op=UNLOAD
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:17:04-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:14:28-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:12:17-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:09:00-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:03:33-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:03:31-995'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:17:03-995'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:12:15-995'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:14:26-995'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:08:59-995'
May 20 00:17:17 localhost abrt-server[1844]: Error: No segments found in coredump './coredump'
May 20 00:17:17 localhost abrt-server[1844]: Can't open file 'core_backtrace' for reading: No such file or directory
May 20 00:17:17 localhost abrt-notification[1889]: Process 1666 (memtester) crashed in ??()
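
For anyone who wants to read the raw status values in the log above without relying on the kernel's decoding, here is a minimal Python sketch. The bit positions are an assumption: they are inferred from the flag names the kernel prints next to each status value (Over, UE/CE, MiscV, AddrV, SyndV, CECC, UECC, Poison) and were only checked against the three status values that appear in this thread. The function names are made up for illustration and are not part of rasdaemon or any other tool.

```python
# Sketch only: bit positions are assumed from the kernel's own decoding of
# the status values shown in the log above, not taken from a datasheet.
STATUS_FLAGS = {
    63: "Val",     # register holds a valid error
    62: "Over",    # overflow: at least one earlier error was overwritten
    61: "UC",      # uncorrected error (clear means corrected, "CE")
    59: "MiscV",   # MISC register is valid
    58: "AddrV",   # error address is valid
    53: "SyndV",   # syndrome register is valid
    46: "CECC",    # corrected by ECC
    45: "UECC",    # detected by ECC but not correctable
    43: "Poison",  # poisoned data was consumed
}


def decode_status(status: int) -> dict:
    """Split an MCA status value into the fields visible in the log."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "corrected": not (status >> 61) & 1,
        # Matches the "Ext. Error Code" values the kernel printed (0 and 12).
        "ext_error_code": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }


def page_frame(addr: int, page_shift: int = 12) -> int:
    """Page frame number as used in the 'Memory failure: 0x...' lines (4 KiB pages)."""
    return addr >> page_shift


if __name__ == "__main__":
    for value in (0xDC2040000000011B, 0xBC002800000C0135):
        print(hex(value), decode_status(value))
    print(hex(page_frame(0x621211640)))
```

With this, the two memory controller entries (banks 17 and 18) decode as corrected ECC errors with the overflow bit set, and the bank 0 entry decodes as an uncorrected error with poison consumption, which is the one that made the kernel send SIGBUS to memtester.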

Mastakilla · Jul 8, 2020

FreeNAS / TrueNAS testing
I also did some brief testing in FreeNAS / TrueNAS. For this I created a Fedora 32 virtual machine inside FreeNAS / TrueNAS, allocated 20GB of the 32GB of RAM to the VM, and then ran "memtester 18gb" in the Fedora VM to stress the memory. Below are the results:

FreeNAS 11.3 U3.2 (and probably earlier versions as well) does not detect anything at all. It just crashes after a while (probably when an uncorrected error occurs). I couldn't find anything in the logs.
TrueNAS 12.0 beta 1 properly detects the corrected errors and shows the following on the console and in /var/log/messages:

Code:

Jul  7 13:08:50 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:08:50 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:08:50 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:08:50 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:08:50 data MCA: Address 0x400000326059a00
Jul  7 13:08:50 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:08:50 data MCA: Bank 18, Status 0x9c2040000000011b
Jul  7 13:08:50 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:08:50 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:08:50 data MCA: CPU 0 COR GCACHE LG RD error
Jul  7 13:08:50 data MCA: Address 0x40000031dc09ae0
Jul  7 13:08:50 data MCA: Misc 0xd01a0ffc01000000
Jul  7 13:08:54 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:08:54 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:08:54 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:08:54 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:08:54 data MCA: Address 0x40000032772a880
Jul  7 13:08:54 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:08:54 data MCA: Bank 18, Status 0x9c2040000000011b
Jul  7 13:08:54 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:08:54 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:08:54 data MCA: CPU 0 COR GCACHE LG RD error
Jul  7 13:08:54 data MCA: Address 0x400000323044240
Jul  7 13:08:54 data MCA: Misc 0xd01a0ffc01000000
Jul  7 13:08:56 data kernel: ix1: link state changed to UP
Jul  7 13:09:51 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:09:51 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:09:51 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:09:51 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:09:51 data MCA: Address 0x4000003254bf4c0
Jul  7 13:09:51 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:09:51 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:09:51 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:09:51 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:09:51 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:09:51 data MCA: Address 0x4000003254b8240
Jul  7 13:09:51 data MCA: Misc 0xd01a0ffd01000000
Jul  7 13:12:52 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:12:52 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:12:52 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:12:52 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:12:52 data MCA: Address 0x4000003242494c0
Jul  7 13:12:52 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:12:52 data MCA: Bank 18, Status 0x9c2040000000011b
Jul  7 13:12:52 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:12:52 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:12:52 data MCA: CPU 0 COR GCACHE LG RD error
Jul  7 13:12:52 data MCA: Address 0x40000031dc09ac0
Jul  7 13:12:52 data MCA: Misc 0xd01a0ffc01000000
Jul  7 13:13:03 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:13:03 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:13:03 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:13:03 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:13:03 data MCA: Address 0x400000275f39e00
Jul  7 13:13:03 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:13:20 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:13:20 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:13:20 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:13:20 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:13:20 data MCA: Address 0x40000026edd1e00
Jul  7 13:13:20 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:13:20 data MCA: Bank 18, Status 0x9c2040000000011b
Jul  7 13:13:20 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:13:20 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:13:20 data MCA: CPU 0 COR GCACHE LG RD error
Jul  7 13:13:20 data MCA: Address 0x4000002c7adc4c0
Jul  7 13:13:20 data MCA: Misc 0xd01a0ffc01000000
Jul  7 13:14:17 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:14:17 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:14:17 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:14:17 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:14:17 data MCA: Address 0x4000002b303e9c0
Jul  7 13:14:17 data MCA: Misc 0xd01a0fac01000000
Jul  7 13:14:17 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:14:17 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:14:17 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:14:17 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:14:17 data MCA: Address 0x4000002c2c024c0
Jul  7 13:14:17 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:14:44 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:14:44 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:14:44 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:14:44 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:14:44 data MCA: Address 0x400000293281500
Jul  7 13:14:44 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:14:44 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:14:44 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:14:44 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:14:44 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:14:44 data MCA: Address 0x400000327290240
Jul  7 13:14:44 data MCA: Misc 0xd01a0ffb01000000
Jul  7 13:14:56 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:14:56 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:14:56 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:14:56 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:14:56 data MCA: Address 0x40000023a430300
Jul  7 13:14:56 data MCA: Misc 0xd01a0f3a01000000
Jul  7 13:14:56 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:14:56 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:14:56 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:14:56 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:14:56 data MCA: Address 0x4000002ab5afb00
Jul  7 13:14:56 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:15:08 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:15:08 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:15:08 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:15:08 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:15:08 data MCA: Address 0x400000217793440
Jul  7 13:15:08 data MCA: Misc 0xd01a0f9301000000
Jul  7 13:15:08 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:15:08 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:15:08 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:15:08 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:15:08 data MCA: Address 0x40000029ef7f880
Jul  7 13:15:08 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:15:23 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:15:23 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:15:23 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:15:23 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:15:23 data MCA: Address 0x4000002186c7440
Jul  7 13:15:23 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:15:23 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:15:23 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:15:23 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:15:23 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:15:23 data MCA: Address 0x400000327290240
Jul  7 13:15:23 data MCA: Misc 0xd01a0ff701000000
Jul  7 13:15:38 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:15:38 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:15:38 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:15:38 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:15:38 data MCA: Address 0x400000264398f40
Jul  7 13:15:38 data MCA: Misc 0xd01a0e2301000000
Jul  7 13:15:38 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:15:38 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:15:38 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:15:38 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:15:38 data MCA: Address 0x40000029aba7880
Jul  7 13:15:38 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:16:05 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:16:05 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:16:05 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:16:05 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:16:05 data MCA: Address 0x400000218d41440
Jul  7 13:16:05 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:16:05 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:16:05 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:16:05 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:16:05 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:16:05 data MCA: Address 0x4000002c11e64c0
Jul  7 13:16:05 data MCA: Misc 0xd01a0fed01000000
Jul  7 13:24:40 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:24:40 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:24:40 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:24:40 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:24:40 data MCA: Address 0x4000002beace500
Jul  7 13:24:40 data MCA: Misc 0xd01a085001000000
Jul  7 13:24:40 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:24:40 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:24:40 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:24:40 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:24:40 data MCA: Address 0x4000002c43a04c0
Jul  7 13:24:40 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:36:00 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:36:00 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:36:00 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:36:00 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:36:00 data MCA: Address 0x4000003242494e0
Jul  7 13:36:00 data MCA: Misc 0xd01a08ab01000000
Jul  7 13:36:00 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:36:00 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:36:00 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:36:00 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:36:00 data MCA: Address 0x4000002cd0a6200
Jul  7 13:36:00 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:38:35 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:38:35 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:38:35 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:38:35 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:38:35 data MCA: Address 0x4000002b50ea9c0
Jul  7 13:38:35 data MCA: Misc 0xd01a0e2301000000
Jul  7 13:38:35 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:38:35 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:38:35 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:38:35 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:38:35 data MCA: Address 0x40000032517d6c0
Jul  7 13:38:35 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:39:18 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:39:18 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:39:18 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:39:18 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:39:18 data MCA: Address 0x4000002c4bfd800
Jul  7 13:39:18 data MCA: Misc 0xd01a0c5201000000
Jul  7 13:39:18 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:39:18 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:39:18 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:39:18 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:39:18 data MCA: Address 0x4000002c16784c0
Jul  7 13:39:18 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:40:35 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:40:35 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:40:35 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:40:35 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:40:35 data MCA: Address 0x400000292f99500
Jul  7 13:40:35 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:40:35 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:40:35 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:40:35 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:40:35 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:40:35 data MCA: Address 0x4000002c6f2c4c0
Jul  7 13:40:35 data MCA: Misc 0xd01a0fe401000000
Jul  7 13:41:45 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:41:45 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:41:45 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:41:45 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:41:45 data MCA: Address 0x40000027d3c2cc0
Jul  7 13:41:45 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:41:45 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:41:45 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:41:45 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:41:45 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:41:45 data MCA: Address 0x4000002c535a4c0
Jul  7 13:41:45 data MCA: Misc 0xd01a0fd801000000
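
Since the log above is long, a small script can summarize it. The following is a rough Python sketch that tallies the "MCA: Bank N, Status 0x..." lines per bank. It assumes the line format shown above, and it assumes that bit 61 of the status value marks an uncorrected error and bit 62 marks overflow, which is consistent with the "COR" and "OVER" words TrueNAS prints next to these values. The function name and the file path in the usage note are only examples.

```python
import re
from collections import Counter

# Matches the first line of each record in the TrueNAS log above.
BANK_RE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")


def tally_mca(lines):
    """Count records per (bank, kind, overflow) from an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = BANK_RE.search(line)
        if match is None:
            continue  # Address, Misc, and unrelated kernel lines are skipped
        bank = int(match.group(1))
        status = int(match.group(2), 16)
        # Assumption: bit 61 = uncorrected, bit 62 = overflow.
        kind = "uncorrected" if (status >> 61) & 1 else "corrected"
        overflow = bool((status >> 62) & 1)
        counts[(bank, kind, overflow)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jul  7 13:08:50 data MCA: Bank 17, Status 0xdc2040000000011b",
        "Jul  7 13:08:50 data MCA: Bank 18, Status 0x9c2040000000011b",
        "Jul  7 13:08:56 data kernel: ix1: link state changed to UP",
    ]
    for key, count in sorted(tally_mca(sample).items()):
        print(key, count)
```

To run it against the real log, pass an open file instead of the sample list, for example `tally_mca(open("/var/log/messages"))`.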

I tried to decode them using mcelog, but it seems this CPU isn't supported yet. I am in contact with the maintainer of the mcelog code to get this fixed...

Code:

mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 17 TSC 5be1c86174
MISC d01b0fff01000000 ADDR 4000003085c6b40
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS 9c2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 17 TSC 5bf3eabd20
MISC d01b0fff01000000 ADDR 40000031f3bab00
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 2
CPU 0 BANK 17 TSC 5c1e4b68ec
MISC d01b0fff01000000 ADDR 400000305340b40
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 3
CPU 0 BANK 17 TSC 5d1e06bb8c
MISC d01a0ffd01000000 ADDR 40000032212fe80
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 4
CPU 0 BANK 18 TSC 5d1e06fb40
MISC d01b0fff01000000 ADDR 400000321c6e8c0
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS 9c2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 5
CPU 0 BANK 17 TSC 5db84458c8
MISC d01a0ffa01000000 ADDR 400000321cb62c0
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 6
CPU 0 BANK 18 TSC 5db8448fc4
MISC d01b0fff01000000 ADDR 400000326096240
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 7
CPU 0 BANK 17 TSC 6525f2fee0
MISC d01b0fff01000000 ADDR 400000326059a00
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 8
CPU 0 BANK 18 TSC 6525f33840
MISC d01a0ffc01000000 ADDR 40000031dc09ae0
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS 9c2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 9
CPU 0 BANK 17 TSC 68c46a10b8
MISC d01b0fff01000000 ADDR 40000032772a880
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 10
CPU 0 BANK 18 TSC 68c46a4670
MISC d01a0ffc01000000 ADDR 400000323044240
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS 9c2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 11
CPU 0 BANK 17 TSC 97f381ab20
MISC d01b0fff01000000 ADDR 4000003254bf4c0
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 12
CPU 0 BANK 18 TSC 97f381e0fc
MISC d01a0ffd01000000 ADDR 4000003254b8240
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 13
CPU 0 BANK 17 TSC 12f2bbd5708
MISC d01b0fff01000000 ADDR 4000003242494c0
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 14
CPU 0 BANK 18 TSC 12f2bbd8dbc
MISC d01a0ffc01000000 ADDR 40000031dc09ac0
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS 9c2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 15
CPU 0 BANK 17 TSC 138ab97bb58
MISC d01b0fff01000000 ADDR 400000275f39e00
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 16
CPU 0 BANK 17 TSC 1476695ee60
MISC d01b0fff01000000 ADDR 40000026edd1e00
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 17
CPU 0 BANK 18 TSC 147669621d8
MISC d01a0ffc01000000 ADDR 4000002c7adc4c0
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS 9c2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 18
CPU 0 BANK 17 TSC 17684f22960
MISC d01a0fac01000000 ADDR 4000002b303e9c0
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 19
CPU 0 BANK 18 TSC 17684f25bdc
MISC d01b0fff01000000 ADDR 4000002c2c024c0
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 20
CPU 0 BANK 17 TSC 18d341a1658
MISC d01b0fff01000000 ADDR 400000293281500
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 21
CPU 0 BANK 18 TSC 18d341a4fdc
MISC d01a0ffb01000000 ADDR 400000327290240
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 22
CPU 0 BANK 17 TSC 196d9d34680
MISC d01a0f3a01000000 ADDR 40000023a430300
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 23
CPU 0 BANK 18 TSC 196d9d389b8
MISC d01b0fff01000000 ADDR 4000002ab5afb00
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 24
CPU 0 BANK 17 TSC 1a11a4a49c4
MISC d01a0f9301000000 ADDR 400000217793440
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 25
CPU 0 BANK 18 TSC 1a11a4a8348
MISC d01b0fff01000000 ADDR 40000029ef7f880
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 26
CPU 0 BANK 17 TSC 1ae230113e8
MISC d01b0fff01000000 ADDR 4000002186c7440
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 27
CPU 0 BANK 18 TSC 1ae23014dd8
MISC d01a0ff701000000 ADDR 400000327290240
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 28
CPU 0 BANK 17 TSC 1b9faf273e0
MISC d01a0e2301000000 ADDR 400000264398f40
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 29
CPU 0 BANK 18 TSC 1b9faf2b034
MISC d01b0fff01000000 ADDR 40000029aba7880
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 30
CPU 0 BANK 17 TSC 1d0fa35ac30
MISC d01b0fff01000000 ADDR 400000218d41440
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 31
CPU 0 BANK 18 TSC 1d0fa35e83c
MISC d01a0fed01000000 ADDR 4000002c11e64c0
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 32
CPU 0 BANK 17 TSC 3802329c0a8
MISC d01a085001000000 ADDR 4000002beace500
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
mcelog: Unknown CPU type vendor 2 family 23 model 1
Hardware event. This is not a software error.
MCE 33
CPU 0 BANK 18 TSC 3802329f588
MISC d01b0fff01000000 ADDR 4000002c43a04c0
TIME 1594121663 Tue Jul  7 13:34:23 2020
STATUS dc2040000000011b MCGSTATUS 0
MCGCAP 11c APICID 0 SOCKETID 0
CPUID Vendor AMD Family 23 Model 1 Step 0
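
For anyone who wants to summarize a dump like the one above rather than read it record by record, here is a minimal Python sketch. It assumes the text layout shown in this log and nothing more. The function names (`parse_mcelog`, `summarize`) are made up for illustration, and the sample holds only three shortened records copied from the log (MCE 21, 22 and 27).

```python
import re
from collections import Counter

# Field patterns for one mcelog record, based on the layout of the log above.
# \b keeps STATUS from matching inside MCGSTATUS.
BANK_RE = re.compile(r"\bBANK\s+(\d+)")
STATUS_RE = re.compile(r"\bSTATUS\s+([0-9a-fA-F]+)")
ADDR_RE = re.compile(r"\bADDR\s+([0-9a-fA-F]+)")
MARKER = "Hardware event. This is not a software error."

def parse_mcelog(text):
    """Split raw mcelog text into records and pull out bank, status and address."""
    records = []
    for chunk in text.split(MARKER)[1:]:
        bank = BANK_RE.search(chunk)
        status = STATUS_RE.search(chunk)
        addr = ADDR_RE.search(chunk)
        if not (bank and status and addr):
            continue  # skip truncated records
        records.append({
            "bank": int(bank.group(1)),
            "status": status.group(1).lower(),
            "addr": addr.group(1).lower(),
        })
    return records

def summarize(records):
    """Count records per bank and list addresses reported more than once."""
    per_bank = Counter(r["bank"] for r in records)
    per_addr = Counter(r["addr"] for r in records)
    repeated = {addr: n for addr, n in per_addr.items() if n > 1}
    return per_bank, repeated

# Three shortened records copied from the log above (MCE 21, 22, 27).
SAMPLE = """\
Hardware event. This is not a software error.
MCE 21
CPU 0 BANK 18 TSC 18d341a4fdc
MISC d01a0ffb01000000 ADDR 400000327290240
STATUS dc2040000000011b MCGSTATUS 0
Hardware event. This is not a software error.
MCE 22
CPU 0 BANK 17 TSC 196d9d34680
MISC d01a0f3a01000000 ADDR 40000023a430300
STATUS dc2040000000011b MCGSTATUS 0
Hardware event. This is not a software error.
MCE 27
CPU 0 BANK 18 TSC 1ae23014dd8
MISC d01a0ff701000000 ADDR 400000327290240
STATUS dc2040000000011b MCGSTATUS 0
"""

if __name__ == "__main__":
    banks, repeats = summarize(parse_mcelog(SAMPLE))
    print("records per bank:", dict(banks))
    print("addresses seen more than once:", repeats)
```

An address that keeps coming back (as `400000327290240` does in MCE 21 and MCE 27) is the kind of thing a tally like this makes easy to spot.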

I don't think I've managed to trigger uncorrected errors on TrueNAS 12.0 beta 1 yet. The corrected errors have the same status code on FreeBSD as on Linux, and I couldn't find any error in FreeBSD with the status code of the uncorrected errors in Linux. I suspect that the uncorrected errors occur when the system crashes after a while, but I couldn't find anything about this in the log files. I'm not sure whether this is a bug. Please let me know if I need to submit this...
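
Since the comparison here hinges on status codes, a small sketch of how a STATUS value breaks down into flags may help. The bit positions are an assumption taken from the Linux kernel's `MCI_STATUS_*` definitions and AMD's MCA documentation (bits 63 to 57 are architectural, while SYNDV, CECC, UECC, DEFERRED and POISON are AMD-specific); check them against the documentation for your CPU. `decode_status` and `is_corrected` are hypothetical helper names.

```python
# Assumed bit positions in the MCA STATUS register (see the note above).
STATUS_BITS = {
    "VAL": 63,       # register contains a valid error
    "OVER": 62,      # an earlier error was overwritten or lost
    "UC": 61,        # error was not corrected by hardware
    "EN": 60,        # reporting for this error was enabled
    "MISCV": 59,     # MISC register is valid
    "ADDRV": 58,     # ADDR register is valid
    "PCC": 57,       # processor context corrupt
    "SYNDV": 53,     # AMD: syndrome register is valid
    "CECC": 46,      # AMD: corrected by ECC
    "UECC": 45,      # AMD: detected by ECC but not corrected
    "DEFERRED": 44,  # AMD: deferred error
    "POISON": 43,    # AMD: poisoned data consumed
}

def decode_status(status_hex):
    """Return the set of flag names that are set in a hex STATUS string."""
    value = int(status_hex, 16)
    return {name for name, bit in STATUS_BITS.items() if (value >> bit) & 1}

def is_corrected(status_hex):
    """True when the flags say ECC corrected the error and UC is clear."""
    flags = decode_status(status_hex)
    return "CECC" in flags and "UC" not in flags

if __name__ == "__main__":
    # The two STATUS values that appear in the log above.
    for status in ("dc2040000000011b", "9c2040000000011b"):
        print(status, sorted(decode_status(status)), is_corrected(status))
```

Under these assumptions both values from the log decode to a corrected ECC error (CECC set, UC clear). The only difference is the OVER flag, which is set in `dc20...` and clear in `9c20...`. An uncorrected error would be expected to show UC and UECC instead.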
TrueNAS 12.0 beta 1 doesn't seem to send any email when MCA errors occur. If I remember correctly, this is a "known bug" (please correct me if I'm wrong). This is very troublesome and I hope it gets fixed soon!

Yorick · Jul 11, 2020

Thanks for keeping after this! Has AsRock said anything about fixing their IPMI issues? One of the things I love about my Supermicro board is that I don’t need to rely on OS error reporting. The IPMI will let me know when ECC has issues, and the OS gets a shot at it, as well.

Mastakilla · Jul 11, 2020

A couple of days ago I asked Asrock Rack for an update on fixing the IPMI ECC reporting, and they responded:

We check ASPEED but they said ECC function need AMD BIOS support.
We then check with TW AMD if they can provide BIOS code? But there is no such code.

I've just replied to them that I find this strange: operating systems and tools such as Linux, Windows, TrueNAS and Memtest86 can all report ECC memory errors perfectly fine with the BIOS as it is, so I don't really understand why "a piece" would be missing in the BIOS. I've asked them to ask ASPEED for more details on what exactly is missing and why they can't use the same methods as the OS...

Mastakilla · Jul 12, 2020

Regarding TrueNAS not alerting for ECC errors:
I found this bug that is planned to be fixed in TrueNAS 12 beta 2:

https://jira.ixsystems.com/browse/NAS-105287

However, it does mention a dependency on mcelog, which currently does not support any Ryzen CPU at all. One of the people working on the FreeBSD mcelog told me:

After some debugging, my initial analysis was incorrect. The k8 CPU's line was the last supported AMD CPU supported by mcelog. AMD needs to supply patches to mcelog in order to gain this support in mcelog.

Mcelog is an intel-driven project, the devs that contribute are intel employees so it is possible amd doesn't want to be associated with the project but I'm not really positive. There is an upstream bug[1] report that may make things a bit more clear.

[1] https://github.com/andikleen/mcelog/issues/62

Hope this helps,

Richard Gallamore

So support probably won't come soon...

So I hope TrueNAS can switch from mcelog to rasdaemon (which already supports AMD) for decoding the MCA messages.
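
As a side note for Linux users: independently of which decoder is used, the kernel's EDAC subsystem keeps per-memory-controller counters of corrected and uncorrected errors in sysfs. The sketch below reads them. It is Linux-only, assumes an EDAC driver is loaded (for example `amd64_edac` on Ryzen), and does not apply to the FreeBSD-based TrueNAS 12. `read_edac_counts` is a hypothetical helper name.

```python
from pathlib import Path

# Standard location of the EDAC memory-controller counters on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} for every mc* directory."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not Linux
    for mc in sorted(root.glob("mc[0-9]*")):
        ce_file = mc / "ce_count"
        ue_file = mc / "ue_count"
        if not (ce_file.is_file() and ue_file.is_file()):
            continue  # controller directory without counters
        counts[mc.name] = (
            int(ce_file.read_text().strip()),
            int(ue_file.read_text().strip()),
        )
    return counts

if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

An empty result most likely means the EDAC driver did not bind, which is itself a hint that the OS does not see ECC as active.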

NAS___ · Jul 15, 2020

Have you seen this:

Will ZFS and non-ECC RAM kill your data? – JRS Systems: the blog

Is testing for bad RAM for a few days using memtest86 enough?
I'm planning to upgrade my 4-year-old FreeNAS box without ECC to a new install, again without ECC, as I can't find any 35 W AM4 CPU that supports it (except the 220g pro, which is too hard to find here).

danb35 · Jul 15, 2020

NAS___ said:
Is testing for bad RAM for a few days using memtest86 enough?

It is if you believe that RAM never fails over time.

NAS___ · Jul 15, 2020

Do you know if the 220ge or 3000g supports ECC memory?
According to the ASRock website, their motherboards support ECC, but not with this CPU family.
Their tech support said multiple users have told them it's working...

MalVeauX · Sep 9, 2020

I'm testing my ECC RAM at the moment. I've been running Memtest86 for a while now just to let it do its thing. So far I'm on pass #5 after 18+ hours or so and it reports 0 (zero) errors. I had to run an older version (V4) because it's an old board (X8SIL-F) that doesn't support UEFI booting.

However, when I went to my event log on my motherboard via IPMI I found an event alert by chance just being nosy:

I tried looking this up, but basically, this is a total real error that was not corrected by ECC. So my understanding is, this would have resulted in corruption or lost some data if it were not already backed up, correct?

Can't be good to see this error in such a small period of time (less than a day) right? I'm still confused how Memtest doesn't see an error but the IPMI event log did?

DIMM4B, does that mean it's the DIMM 4 slot on the board, to help identify which stick it is?

How would ZFS handle this on a mirror pool of data? Would it have caught it having a different checksum and heal, or would this have been a total bust where it was corrupted in the RAM and was written corrupted? I'm not sure how serious this is. It looks like I need different RAM though and that worries me.
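
A toy model of the two cases being asked about, for illustration only. It is not ZFS code: SHA-256 stands in for whatever checksum the pool uses, and all names (`read_from_mirror`, `flip_bit`) are made up. The idea it shows is that a checksum stored at write time lets a mirror recover from corruption that happens on one disk afterwards, but it cannot flag data that was already wrong in RAM when the checksum was computed.

```python
import hashlib

def checksum(block):
    """Stand-in checksum for this toy model."""
    return hashlib.sha256(block).hexdigest()

def flip_bit(block, bit_index):
    """Return a copy of block with one bit inverted."""
    data = bytearray(block)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)

def read_from_mirror(copies, expected_sum):
    """Return the first copy whose checksum matches, or None if none does."""
    for copy in copies:
        if checksum(copy) == expected_sum:
            return copy
    return None

original = b"family photo bytes"

# Case 1: the write was fine, then one side of the mirror went bad on disk.
stored_sum = checksum(original)
disk_a = flip_bit(original, 5)   # damaged copy
disk_b = original                # intact copy
healed = read_from_mirror([disk_a, disk_b], stored_sum)

# Case 2: the bit flipped in RAM before the checksum was computed,
# so the checksum and both mirror copies agree on the wrong data.
in_ram = flip_bit(original, 5)
bad_sum = checksum(in_ram)
silently_wrong = read_from_mirror([in_ram, in_ram], bad_sum)

if __name__ == "__main__":
    print("case 1 recovered original:", healed == original)
    print("case 2 returned original:", silently_wrong == original)
```

In the second case every check passes and the damaged block is returned as if it were good, which is the scenario ECC is meant to prevent.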

This server is going to be housing media (our family pictures, movies, etc) to serve to a few client machines in the house. The data will be on mirrors, no parity stuff, strictly 1:1 mirrors. I'm not sure if the above means I shouldn't put data on this server yet, and figure out this error above, replace it entirely or what.

Thoughts?

Very best,

NAS___ · Sep 10, 2020

Hi
I don't have ECC memory in my system so I can't help regarding detection. If I were you, I would change the faulty module even if the error was corrected.
To my knowledge and understanding, with memory errors, you need some really evil RAM to get into trouble, but for sure it can always happen.
You can read this to get more details: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data

MalVeauX · Sep 10, 2020

NAS___ said:
Hi
I don't have ECC memory in my system so I can't help regarding detection. If I were you, I would change the faulty module even if the error was corrected.
To my knowledge and understanding, with memory errors, you need some really evil RAM to get into troubles but for sure it can always happen.
You can read this to get more details: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data

Thanks,

I'm curious though, if something is corrected, is it faulty?

Very best,

fosaq · Sep 21, 2020

Hi,
I'm trying to get ECC running on the new board with a Zen 2 CPU but I'm not sure if it works. ASRockRack X470D4U with Ryzen 3 3100.

Memtest 8.4 free says:
ECC Enabled: Yes (ECC correction)
ECC Polling: Disabled
ECC Injection: Disabled (even if I change the ECC error injection option)

BUT:
In FreeNAS (11.3) "dmidecode -t memory" says the data width is still 64 bit but it should be 72.
I'm not able to install the beta version 12.0 of freenas.

Before, in another test system (Intel, very unstable but ECC was running), it was 72 with the same RAM.
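
A note on reading that output: `dmidecode -t memory` prints two fields per module, "Total Width" and "Data Width". On a module running with ECC, Total Width is typically 72 bits while Data Width stays at 64, so Total Width is the value to compare. The sketch below pulls both fields out of the text. The sample is a made-up fragment in dmidecode's "Memory Device" layout, not output from the board discussed here, and `memory_widths` is a hypothetical helper name.

```python
import re

TOTAL_RE = re.compile(r"Total Width:\s*(\d+) bits")
DATA_RE = re.compile(r"Data Width:\s*(\d+) bits")
# Anchored at line start so "Bank Locator:" is not picked up.
LOCATOR_RE = re.compile(r"^[ \t]*Locator:[ \t]*(.+)$", re.MULTILINE)

def memory_widths(dmidecode_text):
    """Return one dict per populated module with its total and data width."""
    modules = []
    for section in dmidecode_text.split("Memory Device")[1:]:
        total = TOTAL_RE.search(section)
        data = DATA_RE.search(section)
        locator = LOCATOR_RE.search(section)
        if not (total and data):
            continue  # empty slots report "Unknown" instead of a bit count
        total_bits = int(total.group(1))
        data_bits = int(data.group(1))
        modules.append({
            "locator": locator.group(1).strip() if locator else "?",
            "total_bits": total_bits,
            "data_bits": data_bits,
            # 8 extra bits per 64 data bits is the usual ECC layout.
            "extra_bits": total_bits - data_bits,
        })
    return modules

# Made-up sample in dmidecode's layout (not from a real system).
SAMPLE = """\
Memory Device
\tTotal Width: 72 bits
\tData Width: 64 bits
\tLocator: DIMM 0
\tBank Locator: P0 CHANNEL A
Memory Device
\tTotal Width: 64 bits
\tData Width: 64 bits
\tLocator: DIMM 1
\tBank Locator: P0 CHANNEL B
Memory Device
\tTotal Width: Unknown
\tData Width: Unknown
\tLocator: DIMM 2
"""

if __name__ == "__main__":
    for module in memory_widths(SAMPLE):
        print(module)
```

Keep in mind that these values come from the SMBIOS tables the BIOS fills in, so they show what the firmware reports rather than proving that error correction is actually active.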

Do I need another CPU for ECC? The website does not say which CPUs support ECC; this isn't listed even for the Epycs.
I tried resetting the BIOS to default values, deactivating or activating PFEH. Nothing changes.

Any suggestions? Should I go with Intel at the moment?

Yorick · Sep 25, 2020

Mastakilla said:
A couple days ago I've asked Asrock Rack for an update on fixing the IPMI ECC reporting and they responded:

@Mastakilla, do you have any further feedback from AsRock Rack on where they are at? Are they stalled, or still working on it?
