FreeNAS Build with 10GbE and Ryzen

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
This is very interesting, as MemTest86 Pro (not the Free version) supports ECC Injection:
Some of us looked into that a while back. A bunch of Supermicro Intel systems didn't successfully inject any errors. I think we tried Haswell, Skylake and Avoton boards.

It's really interesting that your system can inject errors but not detect them. I wouldn't be surprised if the BIOS was somehow misconfiguring the memory controller during boot.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
Bryan Cantrill mentions in one of his BSDnow interviews (they total several hours, so I can't tell you which one) that an unspecified vendor told him they just hide ECC errors and that, when asked, the reasoning was something like the error rate was too insane to warn on.
Do you mean "hide from the IPMI event log"?

Earlier this year I got two ECC corrected warning errors in the same day, separated by a few hours, on a Supermicro X9SCL-F. I never got one before and none since, and I have multiple similar boards.

localhost *Warning* Memory - Correctable memory error ; OEM Event Data3 code = 00h

Maybe one of those cosmic ray things?
Where did you see this warning?

Some of us looked into that a while back. A bunch of Supermicro Intel systems didn't successfully inject any errors. I think we tried Haswell, Skylake and Avoton boards.

It's really interesting that your system can inject errors but not detect them. I wouldn't be surprised if the BIOS was somehow misconfiguring the memory controller during boot.
Interesting. When you say "didn't successfully inject any errors" do you mean that you didn't see the messages with "[ECC Inject]...", like in my screenshot?
Did SuperMicro have the option in the BIOS for enabling / disabling memory injection?

I suspect that maybe Asrock Rack hasn't properly implemented Ryzen 3000 support yet. Maybe all of this does work with a Ryzen 1000/2000 CPU...
 
Joined
May 10, 2017
Messages
838
Where did you see this warning?

I got them by email; they are from the system event log and are still there:
1577007220968.png


This is how they appear in the BIOS:
1577007255443.png
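
In case it helps anyone else: the same entries can also be pulled from the OS side through the SEL. A minimal sketch, assuming ipmitool is installed and can reach the BMC (the filter strings are just examples; adjust them to whatever your board logs):

Code:
/* Sketch: list IPMI SEL entries that look like memory events.
 * Assumes ipmitool is installed and can talk to the local BMC;
 * "sel elist" is ipmitool's extended SEL listing. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *p = popen("ipmitool sel elist", "r");
    char line[512];

    if (p == NULL) {
        perror("popen");
        return 2;
    }
    while (fgets(line, sizeof(line), p) != NULL) {
        /* Keep only entries that look like memory/ECC events. */
        if (strstr(line, "Memory") != NULL || strstr(line, "Correctable") != NULL)
            fputs(line, stdout);
    }
    return pclose(p) == -1 ? 2 : 0;
}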
 

rvassar

Guru
Joined
May 2, 2018
Messages
971
I've always been amused by this notion of a photon crossing the entire universe, tunneling through our atmosphere and coming to a halt in the one cell of DRAM in my system that will cause it to crash. Yes, it can happen, but... the statistics are stacked pretty heavily against it. Think more local. 40+% of your lifetime exposure to radiation comes from the potassium within your own body. One of the driving forces behind the push to RoHS lead-free chips was that all lead is the decay byproduct of uranium. Hence all lead has traces of uranium in it. If you use it inside chip carriers, you're placing an alpha and beta decay source right next to the silicon. In our homes there are other sources. Granite... both under our buildings and used as kitchen countertops. The light pink/grey/white crystals are made up of high-temperature silica minerals. The black specks are often lower-temperature accessory minerals that tend to form from whatever's left in the melt. These are often sources of radon gas.

So... Winter... Snow on the ground, we close up our houses to keep warm, and there's a spike in local radiation sources. There's probably an interesting paper in there somewhere, but not enough people run ECC memory at home to have noticed.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
Do you mean "hide from the IPMI event log"?
Well, I assume the answer is "pretty much everywhere", meaning IPMI and OS.

Interesting. When you say "didn't successfully inject any errors" do you mean that you didn't see the messages with "[ECC Inject]...", like in my screenshot?
Did SuperMicro have the option in the BIOS for enabling / disabling memory injection?
Right, no messages and no ECC errors logged. And that included one system that had previously detected ECC errors, so ECC was definitely working on that one. Supermicro does not expose any option to enable this in any X10 or X11 BIOS I've looked at. Note that I haven't gone through all versions for my boards, so it's possible that AMI and/or Supermicro sneaked that in there.
On a tangential note, why does AMI still exist? And where is Phoenix getting their customers to stay in business? I have literally not seen a Phoenix BIOS in years. The Desktop/Server market seems to be exclusively AMI, the laptop market is very Insyde-heavy and then some stuff uses TianoCore or derivatives.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
Bad news I’m afraid… I’ve received a response from Asrock Rack, with an "official statement" from AMD on this, regarding ECC on this mobo (and AM4 in general):
Dear Mastakilla,

So many thanks for you detail experience.
We will share this information to RD 

However we got AMD official respond today

* AM4 support ECC function
* AM4 does not support ECC error reporting function

Here is the conclusion:
AM4 platform CPU (Ryzen 1000,2000,3000 series) can all support ECC correction, but not ECC report function

Best regards,
Kevin Hsiueh
Asrock Rack Incorporation
To which I responded:
Hi Kevin,

Thanks for getting back to me!

That is very unfortunate news…

Does this mean that the sensors for “DRAM ECC Error A1/A2/B1/B2” in the IPMI Event Log are unused and always will remain empty, even if memory errors do occur?
Do you know why these sensors then exist on this board? Were they simply copied over from an existing Intel /TR4 / Epyc Board, without testing them? Or were they added explicitly, but weren’t you aware of this missing feature (and also didn’t test it)?

Kind regards,

Mastakilla
And their response:
Dear Mastakilla,

According to AMD, X470 is desktop MB, and our QT won’t test ECC report function on desktop MB.
We follow AMD POR to writes specification.
In order to prevent misunderstanding, we will also remove ”DRAM ECC Error A1/A2/B1/B2” in the IPMI Event Log”.
Thanks for doing so many test and kind remind, and we will pay more attention on similar case in the future.

Best regards,
Kevin Hsiueh
Asrock Rack Incorporation
So no ECC reporting is supported…

Not entirely sure of this, but doesn’t this mean that:
  • there is no way to know for sure ECC is actually doing something or to validate that it actually works (even for Asrock Rack or AMD themselves).
  • there is no way to know if your memory is stable or not (ECC might be correcting errors all the time without you knowing about it). This is especially relevant if you want to overclock it.
I’m also not entirely sure all of this is true. Wendell told me he knew about people who reported logged error corrections on Ryzen. Perhaps AMD / Asrock Rack told me this to stop asking annoying questions about it? I certainly hope so ;) (please prove me wrong)

I'm also a bit confused on the importance of this...

On the one hand, jgreco has stressed the importance of a proper implementation of ECC, like having working reporting / alerting.
On the other hand, Ericloewe just said that some vendors "hide ECC errors" anyway. I suppose SuperMicro is not one of those, then? Or do you mean that ECC reporting isn't very reliable in general anyway?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
On the other hand, @Ericloewe just said that some vendors "hide ECC errors" anyway. I suppose SuperMicro is not one of those, then?
As far as I've seen, anyway.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
The issue is twofold:
  • The whole point of ZFS is preserving data integrity. Hence the importance of ECC-RAM, hopefully ensuring that randomly-damaged content is not faithfully written to disk, scrubbed on occasion, etc.
  • Verifying ECC operation is tricky. I seem to recall having issues intentionally injecting memory errors with memtest.
For me, it comes down to buying a verified ECC-supporting system from a trusted board maker, fitting it with ECC-RAM, and then letting the software/hardware do the rest.
 

VolumeTank

Dabbler
Joined
Dec 23, 2018
Messages
38
I actually ran two different tests on two different Ryzen processors, a Ryzen 5 2400G and a Ryzen 5 3600; only the 3600 shows settings for ECC configuration. I also purchased the MemTest86 Pro version and did some testing. It turns out it's not that the errors aren't being recognized, but rather that errors are not being injected at all. ECC is enabled on the CPU, but error injection is disabled.

The motherboard has an option to enable it, but in my case, on ASRock BIOS 3.50, setting the error injection disable option to "false" meant it would not boot at all. As of today ASRock has released another BIOS update, 3.70 with AGESA 1.0.0.4B. I gave it another try and set the error injection disable option to "false", and this time it boots, but errors are still not being injected.

I contacted PassMark Support and sent them my MemTest86.log to see what they know about the issue. Their response was:

Code:
On Wed, Jan 1, 2020 at 9:42 PM PassMark Support <help@passmark.com> wrote:
It is very hard to generate just small number of errors in RAM.
We have had attempts with EMI and hot air guns. But generally the state of the system oscillates between no errors & heap of errors and and crash situation.
> > ECC seems to be enabled.
> >  find_mem_controller - AMD Ryzen (70h-7fh) ECC mode: detect: yes, correct:
> > yes, scrub: no, chipkill: no
> >  ECC polling enabled
> >
> > But injection isn't
> >  MtSupportRunAllTests - Injecting ECC error
> >  inject_ryzen - UMC error injection configuration writes are disabled
> > (MISCCFG[0] = 00000117)
 

Attachments

  • IMG_5147.jpeg (262.1 KB)
  • IMG_5150.jpeg (274.1 KB)

poldi

Dabbler
Joined
Jun 7, 2019
Messages
42
Bad news I’m afraid… I’ve received a response from Asrock Rack, with an "official statement" from AMD on this, regarding ECC on this mobo (and AM4 in general): [...]
Man this is a well-researched thread with some very valuable information.
Maybe I can share some first-hand experience. It turns out the best way to test the ECC capabilities of the X470D4U (I don't have the 10G variant, but I reckon this doesn't matter here) is if you have some faulty memory at hand.
I was 'lucky' enough to get sent two Kingston KSM26ED8/16ME modules which caused me massive headaches until I discovered, using memtest86, that they are indeed both bad. The headaches were frequent restarts, at least once every other day.
Once I got around to running memtest86, it turned out that each of the modules was generating hundreds of correctable errors within one pass (roughly 1.5 hours).
I then RMA'd the RAM modules, swapped them for modules that passed memtest86, and since then my stability issues have been resolved.

My reasoning now is that, with such a high error rate, the chance of getting an uncorrectable double-bit error within a reasonable amount of time (let's say every other day) is not at all unlikely. If I recall correctly, the correct system response to a double-bit error caught by ECC would be to throw a panic and initiate a restart. I don't have any direct proof, but based on my experience with the faulty RAM I would conclude that ECC is implemented (at least in part) correctly on the X470D4U and the system will respond correctly when faced with an uncorrectable ECC error.
I can also confirm that none of the above errors, correctable or uncorrectable, ever showed up in the IPMI event log. So the answer @Mastakilla received from ASRock, that reporting is not implemented on AM4, confirms the suspicion I had.

So far so good. I have one interesting tidbit to share, though. I noticed that my console/dmesg contained frequent events from the Machine Check Architecture, like so:

Code:
Oct 18 21:30:06 tempest MCA: Bank 16, Status 0xd42040000000011b
Oct 18 21:30:06 tempest MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
Oct 18 21:30:06 tempest MCA: Vendor "AuthenticAMD", ID 0x800f82, APIC ID 0
Oct 18 21:30:06 tempest MCA: CPU 0 COR OVER GCACHE LG RD error
Oct 18 21:30:06 tempest MCA: Address 0x4000000e89dfd40


This indicates that a correctable read error occurred in some cache.
Likewise, my restarts/crashes typically left this as the last output in /data/crash:

Code:
MCA: Bank 0, Status 0xb4002800000c0135
MCA: Global Cap 0x0000000000000117, Status 0x0000000000000007
MCA: Vendor "AuthenticAMD", ID 0x800f82, APIC ID 10
MCA: CPU 8 UNCOR DCACHE L1 DRD error
MCA: Address 0x1000004dae13400
panic: Unrecoverable machine check exception
cpuid = 8


This says that, this time, an uncorrectable error occurred.
Needless to say, all of these have disappeared since I swapped the RAM.

It appears, though, that although the IPMI log is unable to show anything, FreeNAS is made aware via the MCA that an ECC error has occurred.
It is not ideal, but if these events (especially the correctable ones) start to show up in the console or in the daily email from your FreeNAS box, it might indicate that one of the RAM modules is starting to go bad.
Hope this helps someone. It cost me a lot of hair pulling and frustration so I thought I should share.
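
If you would rather check for this proactively than wait for the daily email, a minimal sketch along these lines can be run from cron. It assumes FreeBSD exposes the machine-check record count as the hw.mca.count sysctl (it does on the versions I have looked at; adjust the name if yours differs):

Code:
/* Sketch: warn if the kernel has logged any machine-check records.
 * Assumes the MCA driver exposes hw.mca.count as an integer sysctl.
 * Build with: cc -o mcacheck mcacheck.c */
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void)
{
    int count = 0;
    size_t len = sizeof(count);

    if (sysctlbyname("hw.mca.count", &count, &len, NULL, 0) != 0) {
        perror("sysctlbyname(hw.mca.count)");
        return 2;
    }
    if (count > 0) {
        /* The details show up as "MCA:" lines in dmesg / the console. */
        printf("WARNING: %d machine check record(s) logged, check dmesg\n", count);
        return 1;
    }
    printf("No machine check records logged\n");
    return 0;
}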
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
I actually ran two different tests on two different Ryzen processors, a Ryzen 5 2400G and a Ryzen 5 3600; only the 3600 shows settings for ECC configuration. I also purchased the MemTest86 Pro version and did some testing. It turns out it's not that the errors aren't being recognized, but rather that errors are not being injected at all. ECC is enabled on the CPU, but error injection is disabled.

The motherboard has an option to enable it, but in my case, on ASRock BIOS 3.50, setting the error injection disable option to "false" meant it would not boot at all. As of today ASRock has released another BIOS update, 3.70 with AGESA 1.0.0.4B. I gave it another try and set the error injection disable option to "false", and this time it boots, but errors are still not being injected.

I contacted PassMark Support and sent them my MemTest86.log to see what they know about the issue. Their response was:

Code:
On Wed, Jan 1, 2020 at 9:42 PM PassMark Support <help@passmark.com> wrote:
It is very hard to generate just small number of errors in RAM.
We have had attempts with EMI and hot air guns. But generally the state of the system oscillates between no errors & heap of errors and and crash situation.
> > ECC seems to be enabled.
> >  find_mem_controller - AMD Ryzen (70h-7fh) ECC mode: detect: yes, correct:
> > yes, scrub: no, chipkill: no
> >  ECC polling enabled
> >
> > But injection isn't
> >  MtSupportRunAllTests - Injecting ECC error
> >  inject_ryzen - UMC error injection configuration writes are disabled
> > (MISCCFG[0] = 00000117)
Hi VolumeTank, thanks for doing some testing of your own!!

I've just checked my MemTest86 log file and it is slightly different from yours...
This could be because you're using a different motherboard (ASRock Fatal1ty X470) from a different manufacturer (Asrock Rack is not necessarily the same as Asrock).

The first part is the same
2019-12-20 06:23:16 - find_mem_controller - AMD Ryzen (70h-7fh) ECC mode: detect: yes, correct: yes, scrub: no, chipkill: no
2019-12-20 06:23:17 - ECC polling enabled

But the second part is different for me
2019-12-20 06:25:05 - MtSupportRunAllTests - Injecting ECC error
2019-12-20 06:25:05 - inject_ryzen - writing UMC_ECCERRINJ[0] = 00010001
2019-12-20 06:25:05 - inject_ryzen - writing UMC_ECCERRINJCTRL[0] = FFFFFFFA
2019-12-20 06:25:06 - inject_ryzen - writing UMC_ECCERRINJ[1] = 00010001
2019-12-20 06:25:06 - inject_ryzen - writing UMC_ECCERRINJCTRL[1] = FFFFFFFA

The "UMC error injection configuration writes are disabled" I only get if I leave “Disable Memory Error Injection” set to enabled in the BIOS, so it seems like this function is not working in your BIOS (and it is working in mine).

So you can try reporting this "bug" to Asrock, but I'm not sure if they'll fix it ;)
 

VolumeTank

Dabbler
Joined
Dec 23, 2018
Messages
38
Hi VolumeTank, thanks for doing some testing of your own!!

I've just checked my MemTest86 log file and it is slightly different from yours...
This could be because you're using a different motherboard (ASRock Fatal1ty X470) from a different manufacturer (Asrock Rack is not necessarily the same as Asrock).

The first part is the same


But the second part is different for me


The "UMC error injection configuration writes are disabled" I only get if I leave “Disable Memory Error Injection” set to enabled in the BIOS, so it seems like this function is not working in your BIOS (and it is working in mine).

So you can try reporting this "bug" to Asrock, but I'm not sure if they'll fix it ;)

Thanks so much for pointing this out; I have spent so many hours trying to figure this out. I think I'll give it another try and also email ASRock. You are definitely getting different output than mine. I also read that MemTest86 added some kind of feature that detects whether the motherboard enables error injection or not. I can't remember where I read it, but as soon as I find it again I'll post it.

I'm still waiting for ASRock to answer me about a motherboard I accidentally bricked, 4 days and counting. Let's see how long it takes them to answer me on this matter. Once again, thanks!
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
Man this is a well-researched thread with some very valuable information. Maybe I can share some first-hand experience. [...]
Thanks for your input as well!

I tried looking into your MCA errors, but couldn't really find understandable (for me at least) documentation on it. I did find the following:
https://github.com/freebsd/freebsd/blob/master/sys/x86/x86/mca.c

Code:
/* Cache error. */
if ((mca_error & 0xef00) == 0x0100) {
    printf("%sCACHE %s %s error",
        mca_error_ttype(mca_error),
        mca_error_level(mca_error),
        mca_error_request(mca_error));
    break;
}
...
mca_error_ttype(uint16_t mca_error)
{

    switch ((mca_error & 0x000c) >> 2) {
    case 0:
        return ("I");
    case 1:
        return ("D");
    case 2:
        return ("G");
    }
    return ("?");
}
...
mca_error_level(uint16_t mca_error)
{

    switch (mca_error & 0x0003) {
    case 0:
        return ("L0");
    case 1:
        return ("L1");
    case 2:
        return ("L2");
    case 3:
        return ("LG");
    }
    return ("L?");
}
...
mca_error_request(uint16_t mca_error)
{

    switch ((mca_error & 0x00f0) >> 4) {
    case 0x0:
        return ("ERR");
    case 0x1:
        return ("RD");
    case 0x2:
        return ("WR");
    case 0x3:
        return ("DRD");
    case 0x4:
        return ("DWR");
    case 0x5:
        return ("IRD");
    case 0x6:
        return ("PREFETCH");
    case 0x7:
        return ("EVICT");
    case 0x8:
        return ("SNOOP");
    }
    return ("???");
}

Although I'm not at all sure, it seems to me like both of your errors are detected (and corrected in one instance) in the CPU cache and not in the memory itself (perhaps they were caused by the faulty memory though). I conclude that because the CPU core number is mentioned, and also because "L1" sounds like "L1 cache" to me. So I'm not sure if this tells us anything about the DIMM ECC capabilities / functionality.
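
To double-check that reading, here is a small standalone sketch that re-implements the three helper functions quoted above and feeds them the two status words from your logs (the COR/UNCOR/OVER flags are just the standard top bits of the 64-bit status register). It only reproduces how the text was generated, so it doesn't prove where the error actually originated:

Code:
/* Sketch: decode the low 16 bits of the two MCA status values above,
 * re-implementing the ttype/level/request helpers from mca.c.
 * VAL/OVER/UC are the usual bits 63/62/61 of the status word. */
#include <stdio.h>
#include <stdint.h>

static const char *ttype(uint16_t e)
{
    static const char *t[] = { "I", "D", "G" };
    return ((e & 0x000c) >> 2) < 3 ? t[(e & 0x000c) >> 2] : "?";
}

static const char *level(uint16_t e)
{
    static const char *l[] = { "L0", "L1", "L2", "LG" };
    return l[e & 0x0003];
}

static const char *request(uint16_t e)
{
    static const char *r[] = { "ERR", "RD", "WR", "DRD", "DWR",
        "IRD", "PREFETCH", "EVICT", "SNOOP" };
    return ((e & 0x00f0) >> 4) < 9 ? r[(e & 0x00f0) >> 4] : "???";
}

static void decode(uint64_t status)
{
    uint16_t e = status & 0xffff;

    printf("0x%016llx: %s%s", (unsigned long long)status,
        (status >> 61) & 1 ? "UNCOR " : "COR ",
        (status >> 62) & 1 ? "OVER " : "");
    if ((e & 0xef00) == 0x0100)     /* cache error pattern, as in mca.c */
        printf("%sCACHE %s %s error\n", ttype(e), level(e), request(e));
    else
        printf("error code 0x%04x (not a cache pattern)\n", e);
}

int main(void)
{
    decode(0xd42040000000011bULL);  /* expect: COR OVER GCACHE LG RD error */
    decode(0xb4002800000c0135ULL);  /* expect: UNCOR DCACHE L1 DRD error   */
    return 0;
}


Running it reproduces the same "COR OVER GCACHE LG RD error" and "UNCOR DCACHE L1 DRD error" strings as your logs, which is why I read both of them as cache-level events.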

As far as I understand, there are many "levels" of error correction possible on AMD architectures:
  1. In the memory itself: For this, ECC memory and support for it are required
  2. In the infinity fabric (the path from memory to the CPU): This has built-in error detection and correction on all Zen AMD CPUs (Wendell from L1Tech explained this to me)
  3. In all the different levels of CPU cache: This has built-in error detection and correction on all Zen AMD CPUs (not entirely sure of this one)
So I suspect that the errors you're seeing come maybe from 2) and 3), but not from 1). But again, I'm not sure at all. Hopefully someone more knowledgeable can clarify...

If they are indeed not from 1), then it could be that some memory errors get caught by the CPU error detection mechanisms, but I'm afraid it doesn't mean that all of them would be detected. Again, I'm no expert on this at all, these are just suspicions. Hopefully someone can clarify...

On a different note, just to be clear: With all this ECC blablabla, my main purpose is actually (and hopefully) to confirm that Ryzen can be a valid platform for FreeNAS. So I really hope that you're right and that someone can confirm that the ECC functionality on our motherboard is sufficient for a proper FreeNAS build...
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
After some more googling, I was able to find some confirmation of my above suspicions.

"MCA: CPU 8 UNCOR DCACHE L1 DRD error"
Seems to be an error related to the L1 CPU Cache (maybe caused by the faulty memory)

I googled for "L1 DRD error" and found, for example:
http://freebsd.1045724.x6.nabble.co...achine-check-exception-on-APU2-td6095428.html

"MCA: CPU 0 COR OVER GCACHE LG RD error"
Not sure about this one... 'LG' could be the 'L3' CPU cache, as this level doesn't seem to be explicitly defined in the code. Not sure what the difference is between DRD and RD, though.
Also, on Google I didn't really find anything conclusive for "LG RD error".

However, I did find some posts regarding actual memory errors being detected, and they all seem to contain "memory error" at the end:
https://www.ixsystems.com/community/threads/mca-error-decoding.73214/
https://lists.freebsd.org/pipermail/freebsd-current/2011-July/025945.html
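
For what it's worth, in the legacy compound error code layout that mca.c decodes, "memory error" only comes out of the memory-hierarchy and memory-controller patterns, while our codes match the cache pattern. Here is a rough classification sketch based on my reading of those encodings (bit 12 is a filtering flag and is ignored, and AMD's newer SMCA banks document their error codes differently, so treat this as a rough guide only):

Code:
/* Sketch: classify the low 16 bits of an MCA status word into the
 * legacy compound-error-code families used by the quoted mca.c.
 * My own summary of the bit patterns; not authoritative for SMCA. */
#include <stdio.h>
#include <stdint.h>

static const char *classify(uint16_t code)
{
    if ((code & 0xeffc) == 0x000c)
        return "memory hierarchy error (printed with \"memory error\")";
    if ((code & 0xeff0) == 0x0010)
        return "TLB error";
    if ((code & 0xef80) == 0x0080)
        return "memory controller error (also printed with \"memory error\")";
    if ((code & 0xef00) == 0x0100)
        return "cache hierarchy error (printed as \"...CACHE ... error\")";
    return "simple or other error code";
}

int main(void)
{
    /* The two codes from the earlier logs both land in the cache family: */
    printf("0x011b: %s\n", classify(0x011b));
    printf("0x0135: %s\n", classify(0x0135));
    return 0;
}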
 

poldi

Dabbler
Joined
Jun 7, 2019
Messages
42
Although I'm not at all sure, it seems to me like both of your errors are detected (and corrected in one instance) in the CPU cache and not in the memory itself (perhaps they were caused by the faulty memory though). [...]
Completely agree with you; at first I also did not connect this at all. I also could not find any kind of documentation (there has got to be some for the MCA from AMD, but it does not appear to be public).
Actually, I thought I had a faulty CPU or some issue with the BIOS. I did replace the CPU because of this and also applied multiple BIOS upgrades, but the MCA output was unchanged, at a rate of around one per hour (the CORrectable ones, that is). However, the moment I replaced the RAM the MCA errors disappeared. At that error rate I could tell quite quickly, and replacing the RAM was the only change I introduced. So to me there is a direct dependency, but I also don't quite understand how.
Over in the memtest86 forum there is another report of a user getting MCA errors with defective RAM (here) on FreeNAS. The details are scarce. Also, in this case it is the Intel MCA implementation, and the error message is much clearer. So it appears MCA should/can report it. Maybe the AMD implementation is a bit shoddy here and is masking the real error, but without any documentation from AMD we cannot say anything with certainty.
What I actually meant to say with my original post is that if one gets these MCA errors, this can point to faulty RAM and one should follow this up with memtest86 to validate.
 
Joined
Dec 29, 2014
Messages
1,135
I have seen some of these kinds of errors in the past, sometimes around CPU upgrades and/or changes in the amount of RAM in the system. Knocking furiously on wood, I haven't seen them in a while. What I had to do with my servers was take RAM out and then add it back in between boots. I don't know if the reseating was what cleared it or if it was something more nebulous than that. Assuming you don't mind a little downtime, that might be worth doing. If you add it back in a bit at a time, it might also help you identify either a dodgy RAM stick or perhaps a bad slot on the motherboard.
 

Evertb1

Guru
Joined
May 31, 2016
Messages
700
I am about to build a FreeNAS storage server for an adult retraining charity (storage for the administration and course material). We offer courses that make it easier to re-enter the labor market after illness or job loss, etc. As you can imagine, we are not exactly swimming in money. And of course I need to do something about a good backup strategy as well. So this will swallow up most (if not all) of our budget for 2020. I thought about buying used hardware to keep costs within our means, but I decided to go mostly with new hardware.

I have been reading this thread with interest because AMD seems to be at a better price point at the moment. However, after reading this thread, I believe there is still much uncertainty about AMD when it comes to the use of ECC memory. So I think I'll stay on the Intel platform and buy a Supermicro X10.../X11... motherboard with an Intel CPU. A long-proven combination.

Still, I'd like to compliment all the forum members for the way they have investigated this and shared the results of their research and experiments on the forum. Thumbs up, everybody.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
Someone @ Level1Techs just reported that he had a broken DIMM and saw "corrected memory errors" in MemTest86 using my motherboard and a Ryzen 2600 CPU. He also said that ESXi logged the following MCA error:
Code:
2019-09-27T15:06:53.331Z cpu0:2097725)MCA: 136: CE Poll G0 Bf Sd42040000000011b A40000011966b480 Mc000000000000000 P242cd6880/10 Memory error, read

As you can see, this clearly states "Memory error"... I'm not sure whether this could also be a "memory error from the infinity fabric" (which doesn't even need ECC memory in order to be properly detected and reported). But if it is not from the infinity fabric, then this seems to indicate that ECC error reporting on my motherboard does work for the Ryzen 2000 series (contrary to what Asrock Rack claims AMD has told them).

https://forum.level1techs.com/t/asrock-rack-x470d4u2-2t/147588/74?u=mastakilla
 

VolumeTank

Dabbler
Joined
Dec 23, 2018
Messages
38
Someone @ Level1Techs just reported that he had a broken DIMM and saw "corrected memory errors" in MemTest86 using my motherboard and a Ryzen 2600 CPU. He also said that ESXi logged the following MCA error:
Code:
2019-09-27T15:06:53.331Z cpu0:2097725)MCA: 136: CE Poll G0 Bf Sd42040000000011b A40000011966b480 Mc000000000000000 P242cd6880/10 Memory error, read

As you can see, this clearly states "Memory error"... I'm not sure whether this could also be a "memory error from the infinity fabric" (which doesn't even need ECC memory in order to be properly detected and reported). But if it is not from the infinity fabric, then this seems to indicate that ECC error reporting on my motherboard does work for the Ryzen 2000 series (contrary to what Asrock Rack claims AMD has told them).

https://forum.level1techs.com/t/asrock-rack-x470d4u2-2t/147588/74?u=mastakilla

This actually got my attention; it lines up with what I was thinking it could be:

  1. AM4 support ECC function
  2. AM4 does not support ECC error reporting function

The reason I say this is that, when I did my tests, on the Ryzen 2400G no ECC settings were showing on the motherboard besides enable, disable, or auto. Now, when I use the 3600, multiple settings appear on the motherboard. Meaning that the motherboard detects the ECC functionality of the new CPU.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
The Ryzen 2400G has first-gen 14nm Zen cores ("Ryzen 1000 series"), so it is different from the Ryzen 2600 (Zen+ cores, "Ryzen 2000 series") and very different from the Ryzen 3600 (Zen 2 cores, "Ryzen 3000 series"). That could explain the different options in the BIOS.
It indeed wouldn't surprise me if the ECC code can't simply be ported between the different Ryzen series and needs (partial) re-coding.

But that is something different from saying AM4 simply doesn't support ECC reporting.
 