FreeNAS Build with 10GbE and Ryzen

poldi

Dabbler
Joined
Jun 7, 2019
Messages
42
I am about to build a FreeNAS storage server for an adult retraining charity (storage for the administration and course material). We offer courses that make it easier to re-enter the labor market after illness, job loss, etc. As you can imagine, we are not exactly swimming in money. And of course I need a good backup strategy as well. So this will swallow up most (if not all) of our budget for 2020. I thought about buying used hardware to keep costs within our means, but I decided to go mostly with new hardware.

I have been reading this thread with interest because AMD seems to be at a better price point at the moment. However, after reading this thread, I believe there is still much uncertainty about AMD when it comes to the use of ECC memory. So I think I will stay on the Intel platform and buy a Supermicro X10.../X11... motherboard with an Intel CPU. A long-proven combination.

Still, I'd like to compliment all the forum members for the way they have investigated this and shared the results of their research and experiments on the forum. Thumbs up, everybody.
Well, we have to be specific here. The results Mastakilla reported on are not about ECC support on AMD in general but very specifically about ECC support with Ryzen on the X470D4U (and variants). Ryzen is priced and marketed as a consumer CPU. AMD's stance here is that they don't outright deactivate ECC within the CPU but leave the implementation to the motherboard manufacturers. The X470D4U happens to be one of the first AM4 boards to get an ECC stamp from ASRock and as such is a very interesting offering.
We come to the conclusion that the error-correcting side of ECC seems to be correctly implemented by ASRock; however, the reporting side is not available, which ASRock explains with the lack of support in the AM4 socket.
As such, if your main worry is that memory bit flips are corrected for your FreeNAS system, you are fine. If you also want to know if a RAM DIMM is going bad, you are plumb out of luck. For production hardware this is not going to cut it but for the enthusiast home server, I am fine with that.
If you are looking for server-grade CPUs from AMD, you have to move to EPYC and not look at Ryzen. Actually, there is the more workstation-oriented Threadripper in between, but whether ECC support is any better there I cannot say. Guess that is a topic for another thread.
 

Evertb1

Guru
Joined
May 31, 2016
Messages
700
For production hardware this is not going to cut it but for the enthusiast home server, I am fine with that.
That we are talking about Ryzen is well understood, I think. To me, every consumer-grade AMD CPU before that doesn't even come into the picture. That Ryzen is marketed as consumer hardware is not necessarily a bad thing, as long as it can do the job and is durable enough. I used an Intel Core i3 CPU for years in my own server (Windows Home Server and Windows Server 2012 Essentials) before I made the switch to Xeon. That CPU still runs fine after seven years or so. But that is where the doubts about Ryzen start.

I look at the project for the charity as if it were production for a business. This is not in the realm of a home server. The staff and other people must be able to fully depend on it. We help our clients get recognized certifications, so our administration and course support are very important. Luckily, my project is well on its way. One of our sponsors will pay for a professional Amazon cloud backup service for the coming five years. Another will provide me (us) with all the hard drives I need. I will even get spares. And this includes drives for the local backup system. Wow, I can't begin to tell you how much of a relief that is for our budget.

OK, enough of this before we get too far off-topic.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Well, we have to be specific here. The results Mastakilla reported on are not about ECC support on AMD in general but very specifically about ECC support with Ryzen on the X470D4U (and variants). Ryzen is priced and marketed as a consumer CPU. AMD's stance here is that they don't outright deactivate ECC within the CPU but leave the implementation to the motherboard manufacturers. The X470D4U happens to be one of the first AM4 boards to get an ECC stamp from ASRock and as such is a very interesting offering.

Although my testing is indeed 100% specific to my exact hardware (mainly the CPU / mobo combination), the claims from Asrock Rack about what AMD told them are not. They were very clearly about all AM4. Now, I don't know the exact words that AMD told Asrock Rack or how Asrock Rack might have "slightly changed" the message. Nor do I know how capable / reliable this person at AMD was to tell these kinds of things. But if it is true... I hope not... Currently I'm waiting for a person @ Level1Techs to try MemTest86 Error Injection with his Ryzen 2600. He saw MCA memory errors in ESXi with a broken DIMM, so I have slight hope that injecting memory errors will also give some ECC error reporting in MemTest86 (he no longer has the broken DIMM for testing). If that happens, I'll confront Asrock Rack, explaining to them that the AMD statement is false and asking them to fix it in both the IPMI Event Log and for the Ryzen 3000 series in general...
Also, Wendell from Level1Techs is sure he once saw memory error reports on an AM4 system. But I'm hoping to get some actual proof before going back to Asrock Rack to confront them...

We come to the conclusion that the error-correcting side of ECC seems to be correctly implemented by ASRock; however, the reporting side is not available, which ASRock explains with the lack of support in the AM4 socket.
Unfortunately I'm not entirely sure that the error correcting side is correctly implemented, as without the reporting, there is no way to check this :( All we can do is hope that it is working
 

edge-case

Dabbler
Joined
Nov 2, 2019
Messages
28
Man this is a well researched thread with some very valuable information.
Maybe I can share some first-hand experience. It turns out the best way to test the ECC capabilities of the X470D4U (I don't have the 10G variant but I reckon this doesn't matter here) is to have some faulty memory at hand.
I was 'lucky' enough to get sent 2 Kingston KSM26ED8/16ME modules, which caused me massive headaches until I discovered, using memtest86, that they were indeed both bad. The headaches were frequent restarts, at least once every other day.
Once I got around to running memtest86, it turned out that each of the modules was generating 100s of correctable errors within one pass (roughly 1.5 hours duration).
I then RMA'd the RAM modules and swapped them against modules that passed memtest86 and since then my stability issues are resolved.

Hope this helps someone. It cost me a lot of hair pulling and frustration so I thought I should share.

Ouch. I can't believe I've missed this thread until now. :eek:
It looks like I have exactly the same situation as @poldi; same MB [X470D4U], same Kingston ECC RAM, same MCE logs, same memtest86 errors, same system instability [random system freezes every 24 to 48 hours], same frustration and hair pulling!...

Only difference is the CPU - a Ryzen 7 2700 - that I actually RMAd to Newegg [guessing/hoping it was the problem], only to get the replacement CPU back last week and have all the same problems....
I even started a thread about it because I couldn't find any useful info - guess I suck at using the forum search... :oops:
My memtest86 images and MCE logs are in that thread if it helps anyone.


I've less than a week left before I'm unable to return the MBoard and RAM [for refunds], and I was on the verge of giving up with the system, sending it all back for refunds and just switching to an ASRock B450 Pro and [non-ECC] RAM..... now based on your post, I'll be returning the Kingston RAM sticks, buying Crucial or Samsung ECC instead, and hoping the replacements are error free and lead to a stable system..

Many thanks to @Mastakilla for starting the thread and all the hard work & testing, and @poldi for this and your other posts... I'll add more info once the new RAM arrives.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Hi Marshy,

As you still have the faulty DIMMs, can you please make screenshots of Memtest86 reporting corrected ECC memory errors? And copy / post the (parts of the) log files with the MCA memory errors?

I have reported back to Asrock Rack that I've heard from people that ECC memory errors are being reported for Zen+ cores (like your Ryzen 2700), but it would be nice to have some more proof to back this claim (I was actually just looking into buying a 2nd-hand Ryzen 2600 myself, but don't see any interesting buys atm).

Anyway... We'll probably have to wait some time for a response, as it is Chinese New Year atm...
 

edge-case

Dabbler
Joined
Nov 2, 2019
Messages
28
I can add the logs tonight / this weekend. There's a memtest86 image and FreeNAS mcelog output attached to the thread I linked earlier, and I just added another image to this post [from a test I ran last night].
 

Attachments

  • 4E3ED8E7-6274-408F-8EF1-5BA9D827D6A3.jpeg

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Thanks a lot!

Did you ever see "corrected errors" in Memtest86? I think Memtest86 can also tell if a memory error was corrected, and I think your screenshot is showing "uncorrected errors" (could be multi-bit errors)...
I know... lots of "I think", as I'm not sure of much yet ;)

The MCA errors also seem to indicate CPU errors again (like with Poldi), which could of course be caused by defective memory.
Inverness @ Level1Techs has also seen "real memory errors" by MCA in ESXi.

Also strange that you and Poldi both had such issues with this memory. I have to run mine (the same memory) !33%! faster before I start getting stability issues...
 

edge-case

Dabbler
Joined
Nov 2, 2019
Messages
28
Thanks a lot!

Did you ever see "corrected errors" in Memtest86? I think Memtest86 can also tell if a memory error was corrected, and I think your screenshot is showing "uncorrected errors" (could be multi-bit errors)...
I know... lots of "I think", as I'm not sure of much yet ;)
I believe they were all corrected. The end of test report [for the test associated with the above screenshot] showed:
ECC Correctable errors: 1525
ECC Uncorrectable Errors: 0
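
For anyone collecting several of these end-of-test summaries, here is a minimal Python sketch that pulls the two counts out of text shaped like the lines above. The function name and the regular expression are my own illustration; it only assumes the summary lines look like the two quoted here and is not based on any documented MemTest86 report format.

```python
import re

# Matches lines such as "ECC Correctable errors: 1525" (case-insensitive).
LINE_RE = re.compile(
    r"ECC\s+(Correctable|Uncorrectable)\s+errors:\s*(\d+)", re.IGNORECASE
)

def parse_ecc_summary(text):
    """Return {'correctable': n, 'uncorrectable': m} for the counts found in text."""
    totals = {}
    for kind, count in LINE_RE.findall(text):
        totals[kind.lower()] = int(count)
    return totals

# The two summary lines quoted in this post.
report = """ECC Correctable errors: 1525
ECC Uncorrectable Errors: 0"""

totals = parse_ecc_summary(report)
print(totals)
```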

I'll run another test tonight, and post images of the errors, final summary and log.

I've previously run both sticks on their own, and from memory I think both sticks had errors, but I'll also try that again over the weekend and share the results.

The MCA errors also seem to indicate CPU errors again (like with Poldi), which could of course be caused by defective memory.
Inverness @ Level1Techs has also seen "real memory errors" by MCA in ESXi.

Also strange that you and Poldi both had such issues with this memory. I have to run mine (the same memory) !33%! faster before I start getting stability issues...
Yes, strange. Hopefully the RAM is the problem in my case. [This is meant to be my main server, so I really want it stable].

I wonder did Poldi and I both get bad RAM sticks, or is it an incompatibility between this particular Kingston RAM and the motherboard [even though it's on ASRock's QVL list]? I wasn't able to tell from his posts if @poldi received the same Kingston RAM as replacement or switched to a different type...
In order to mix things up, I've ordered a different brand of RAM as a replacement [from Nemix, as the Samsung and Crucial were both out of stock at Amazon and Newegg, and I want it fairly soon - it's supposedly same as the Samsung 16GB ECC modules that are on the QVL list].

Thanks for your prompt replies.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
I have exactly the same RAM and no issues at all, BUT as RAM compatibility on Ryzen is mainly handled by the CPU (and just a little bit the mobo), it could be that this RAM works great with Zen2 (like my Ryzen 3600) and works crap with Zen+ (like your Ryzen 2700)

But the QVL list was most likely created with a Zen or Zen+ CPU, as this motherboard is from before Zen2 was released... So it should work...
 

edge-case

Dabbler
Joined
Nov 2, 2019
Messages
28
[Mostly for my own benefit,] here's a summary of testing so far. [I'm collating it all in a matrix and will upload at the end of the process].

Note: for the Kingston memory on the X470D4U, I had reset the CMOS recently, and memory was running at 2400 MHz for all the tests listed below.
I've now switched it back to 2666 MHz.
I'm running BIOS v3.3.0 and BMC 1.90.0

X470D4U : 2 Kingston ECC DIMMs
- Test #1: ECC Correctable: 1525 ECC Uncorrectable: Zero
- Test #2: ECC Correctable: 1564 ECC Uncorrectable: Zero
- Test #3:
X470D4U : Single Kingston ECC DIMM - DIMM#1
- Test #1: ECC Correctable: 74 ECC Uncorrectable: Zero [Incomplete; I stopped test 40% through]
X470D4U : Single Kingston ECC DIMM - DIMM#2
- Test #1: ECC Correctable: 32 ECC Uncorrectable: Zero
- Test #2: NO Errors

With a different mboard, CPU [Ryzen 5 1600] and PSU - "ECC" enabled in BIOS. I had the memory set to 2666 MHz in BIOS.
ASRock B450 Pro : Single Kingston ECC DIMM - DIMM#1
- Test #1: NO Errors
ASRock B450 Pro : Single Kingston ECC DIMM - DIMM#2
- Test #1: NO Errors

Observations:
- An anomaly with the X470D4U and stick #2 having an error-free run, but I'll chalk that up to test-to-test variation.
- No errors of any type on the B450 Pro / Ryzen 5 1600 combo, but I have no confirmation that ECC support really works on this combination, so I can't really make any conclusions [but it was this in previous testing that led me to RMA my Ryzen 2700 CPU in the hope it was creating the errors].
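
As a rough illustration of the matrix mentioned above, the results listed so far can be collated per board and DIMM configuration with a few lines of Python. This is only a sketch: the tuple layout and the `collate` helper are made up for this example, "NO Errors" is assumed to mean zero correctable and zero uncorrectable errors, and the empty Test #3 is left out.

```python
from collections import defaultdict

# (board, DIMM configuration, test number, correctable, uncorrectable)
# Numbers copied from the list above; "NO Errors" is entered as 0 / 0 (assumption).
RESULTS = [
    ("X470D4U", "both DIMMs", 1, 1525, 0),
    ("X470D4U", "both DIMMs", 2, 1564, 0),
    ("X470D4U", "DIMM#1", 1, 74, 0),   # incomplete run, stopped 40% through
    ("X470D4U", "DIMM#2", 1, 32, 0),
    ("X470D4U", "DIMM#2", 2, 0, 0),
    ("B450 Pro", "DIMM#1", 1, 0, 0),
    ("B450 Pro", "DIMM#2", 1, 0, 0),
]

def collate(results):
    """Sum runs and error counts per (board, DIMM configuration)."""
    summary = defaultdict(lambda: {"runs": 0, "correctable": 0, "uncorrectable": 0})
    for board, config, _test, ce, ue in results:
        row = summary[(board, config)]
        row["runs"] += 1
        row["correctable"] += ce
        row["uncorrectable"] += ue
    return dict(summary)

summary = collate(RESULTS)
print(f"{'Board':<10}{'DIMMs':<12}{'Runs':>5}{'CE':>7}{'UE':>5}")
for (board, config), row in summary.items():
    print(f"{board:<10}{config:<12}{row['runs']:>5}"
          f"{row['correctable']:>7}{row['uncorrectable']:>5}")
```

Laid out like this, it is easy to see that every correctable error so far was logged on the X470D4U, which is why the B450 Pro results cannot settle anything until ECC reporting is confirmed on that board.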


I really need[ed] a second set of ECC RAM to use for comparative testing purposes; in hindsight I should have bought that sooner [instead of RMA'ing the CPU]. [ I'll try the B450 Pro / Ryzen 1600 with memtest86 with the Nemix ECC RAM when it arrives...]

Next steps:
I now have both Kingston sticks back in the X470D4U, and it's currently running memtest86, and it's spitting out lots of the ECC correctable errors again...
Once that's complete, I'll transfer both Kingston DIMMs to the B450 Pro motherboard and see if that reports any errors...

And then, once the new ECC RAM arrives on Tuesday evening, I'll obviously try that in the ASRock X470D4U [and have everything crossed hoping it's an issue with the Kingston RAM and not a problem with my motherboard].

--------------------------------------------------------------------------------------------------------------------------------
Update: Jan 28th

Things are never straightforward....
With the original Kingston ECC RAM, I now have run for 48 hours straight with no freezing, and no ECC errors reported in FreeNAS.
In those 48 hours, I've completed over 2 TBs of file transfers - from multiple computers, and been testing a Plex Server plugin [running in a jail] heavily, and have had zero issues... [ Maybe the Kingston memory is feeling threatened as it knows a replacement is heading in the mail... ? ]

[I broke the cardinal rule of changing multiple variables at the same time, so not sure what contributed to the newfound error-free stability.]

Things changed:
- set the memory to 2666 MHz [instead of 2400]
Followed the tips in this Level1Techs forum thread for better Ryzen system stability:
- disabled global C states
- set Power Supply Idle Control to Typical

I'll shut down the FreeNAS server tonight and run memtest86 again [at 2666, 2400 and 2933 MHz] to see what I get...
 

edge-case

Dabbler
Joined
Nov 2, 2019
Messages
28
Update to previous post:

@2666 MHz:
- FreeNAS server ran for over 72 hours with zero glitches, zero MCE errors, and without any other issues.
- I ran a full memtest86 overnight, and the test completed all 4 passes [about 7 hours in total] with ZERO errors.

@2400 MHz:
Early this morning, I set it back to 2400 MHz [no other changes] and initiated another memtest86 run...
- it started spitting out ECC errors almost immediately...

I'll see the full test results this evening, and then start a run @2933 MHz...

So, maybe I've been a dufus all along - I missed the fact that at some point the BIOS memory speed inadvertently reverted to @2400 - and did all the other things without addressing the memory speed, and perhaps it was the entire problem.
But, I never imagined underclocking memory would result in ECC errors...

New ECC RAM arrived in the mail last night and remains unopened (so I can return it if I decide to); I'm not sure what to do now...
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
I recently got an official response from AMD stating that ECC correction and reporting are supported on the Ryzen 9 3950X and Threadripper 3990X.

Setups tried
CPU: Ryzen Threadripper 3990X
Mobo: Gigabyte TRX40 Aorus Pro WiFi (with latest BIOS and ECC settings enabled, including ECC error injection)
Memory: 1 x 16GB Crucial CT16G4WFD8266 ECC UDIMM

CPU: Ryzen 9 3950X
Mobo: Biostar X570GTA (with latest BIOS and ECC settings enabled)
and
Mobo: ASRock Rack X470D4U (with latest BIOS and ECC settings enabled, including ECC error injection)
and
Mobo: ASRock X570 Creator (with latest BIOS and ECC settings enabled)
Memory: 1 x 16GB Crucial CT16G4WFD8266 ECC UDIMM

With all setups (except the X570 Creator, which I have not tested), Memtest86 Pro v8.4 RC2 does not report corrected errors after injection, most likely because error injection is not supported by the BIOS, the chipset or the CPU. PassMark is pointing at 'Ryzen', and I am still waiting to hear what they mean by that.

given this article:
I believe ECC correction works on all the setups I tried, but that alone is simply not enough for a safe and stable system. We need to be informed of errors so that we can replace memory modules if they become faulty.

I am pressing ASRock Rack at the moment to come clean about their official support.

@Mastakilla Below is one of the conversations with AMD tech support email account in reverse (tech*dot?*amdsupport*at?*customercare)
I have asked and gotten their permission to publish the conversation online.

I really hope to get to the bottom of this.

------------------------------
Kyle (AMD)

6 Mar, 13:20 CET

Thank you for contacting AMD.

I'll be happy to answer to clear your doubts,

We can confirm both ECC error reporting are supported by the Ryzen platform and the EDAC command should work just fine on your setup:

{some text omitted}

****************
My Question in between
Can I interpret your answer as follows:?

That the AMD AM4 socket supports both ECC and ECC error reporting.
And that the EDAC command should work just fine on my setup:
Mobo: ASrock x570 creator which has an AMD AM4 socket
CPU: AMD Ryzen 9 3950X
Memory: 4 x 16 GB ECC ((Kingston KSM26ED8/16ME) from the ASrock x570 creator QVL)
*****************

Isabelle (AMD)

6 Mar, 10:46 CET

{some text omitted}

Correctable and un-correctable ECC errors are both reported in the Windows Event Log under the WHEA category or in Linux through the EDAC command. So indeed, both are supported

-----------------------
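
The email does not say which "EDAC command" is meant. As a sketch of one way to check on Linux (not on FreeNAS itself, which is FreeBSD-based): when an EDAC driver is loaded, the kernel exposes per-memory-controller counters as `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/`. The helper below simply reads those files; the function name is made up for this example, and an empty result only means that no EDAC memory controller was registered, not that ECC is inactive.

```python
from pathlib import Path

# Standard sysfs location of the Linux EDAC memory-controller counters.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read_edac_counts(root=EDAC_ROOT):
    """Return {'mc0': {'ce': n, 'ue': m}, ...} for each EDAC memory controller."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())  # corrected errors
            ue = int((mc / "ue_count").read_text().strip())  # uncorrected errors
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts

if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found (driver not loaded or unsupported).")
    for name, c in counts.items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

If the counters exist but stay at zero while error injection is running, that would match what is described in this thread: correction may be happening, but nothing is being reported to the OS.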

@ixsystems Is there still no user-friendly way to inject errors so that testing becomes actually doable?
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Hi Diversity,

Thanks for investigating this from your side as well!

Some updates from my side:
  • Thanks to the below post from MasterPhi on Level1Techs I learned that someone did get error reporting to work on Zen2 cores (Ryzen 3900x) using an Asus x570 motherboard
https://forum.level1techs.com/t/asr...-server-boards-x470d4u-x470d4u2-2t/139490/998
  • He had worked with AMD, Linux kernel and EDAC engineers on this, but in the end it probably was Asus who have (although not involved in the discussion) fixed it using a BIOS update
  • I did find those knowledgeable contacts from the discussion very interesting and contacted them, asking for support with our issues as well. The AMD engineer brought someone from AMD US Support in contact with Asrock Rack Taiwan support to work on this issue. I did not hear back yet regarding any progress though...
  • Asrock Rack was already in contact a couple of times with AMD Taiwan about this earlier, but AMD Taiwan kept on claiming that it was not possible to make ECC reporting work (even though there had been proof of this being false).
Regarding ECC injection:
  • The Asrock Rack BIOS does have all the settings required to have ECC injection working. But currently it indeed does not seem to work, as the BIOS is not catching ECC error reports at all (not to the IPMI, not to Linux, not to Windows and not to Memtest86 when injecting). One additional setting that I overlooked earlier is "Platform First Error Handling", which I think is best disabled if you want Memtest86, EDAC or WHEA to catch the errors. Also remember to set "Disable Memory Error Injection" to false.
    1585781636853.png
    1585781649755.png
  • You may need to try and find similar settings on your other motherboards to properly test memory injection.
  • As I personally actually never got any error injection working on any motherboard in my life, I may actually still overlook other important settings that I don't know of. Although I'm trying to get it working, I am far from an expert on any of this ;)
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Thx Mastakilla

I had the correct BIOS settings. Platform first = disabled. Memory injection = enabled. And obviously the ECC settings as well.
I have been racking my brain over this for too long now to miss these pitfalls.

Although I'm not sure, since we still can't test at a user-friendly level, my gut feeling tells me ECC correction works across all the Ryzen CPUs.
However, that buys us NOTHING if we can't assess for ourselves that it actually does.



Anyway, can we all please agree on a very simple premise?

If one can't inject memory errors in a user-friendly way (without going full mental on hardware), then there is no way to know if the ECC capability of one's setup actually works.

Please share your thoughts.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
@ixsystems Is there still no user-friendly way to inject errors so that testing becomes actually doable?
Fundamentally, there is little interest because it's CPU-specific and doesn't work unless the stars align - and by that I mean the system firmware. Positive results would be useful, but negative results might cause confusion.

Disclaimer: not iX, but I've looked into ECC error injection in the past and have since gained experience trying to get buy-in for features.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Thx Ericloewe,

Can you (and others, preferably also ixsystems) please tell me your stance on whether ECC is only useful if we can easily test that it works?

I have made a new thread about this hopefully attracting more attention.

Also can you please elaborate what you meant by 'trying to get buy-in for features'?
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Fundamentally, there is little interest because it's CPU-specific

Understood! How about we make a list of known-good configurations then? I mean, I have read countless articles about hardware suggestions, but I do not remember ever reading about a specific setup that actually works. I respect that use cases are wildly different from each other. But does that mean there is no way of having a list of known working configurations?

for the WD red it seems to work out

Kind regards
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
@Ericloewe Please see my other threads and give your stance, as you are clearly a 'higher up'. I'd like to break open the discussion on the importance of ECC when we can't even know whether it's for real.

I really expect to be corrected and be put in my place and I will thank one for doing so. But perhaps that does not happen and then what does that mean for the whole ECC aspect?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
ECC does work. And some very relevant platforms are known to work correctly, apart from error injection.
My Haswell system had a bunch of correctable ECC errors once and they showed up in the IPMI log. Probably showed up in the OS, but a lot happened simultaneously and I never paid attention.

That's part of the importance of common hardware: someone out there can probably confirm that it's working correctly.

If you're in a destructive mood, I bet wiggling a DIMM during memory accesses can induce failures and provide some interesting data. Not something I'd recommend, though.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972