FreeNAS Build with 10GbE and Ryzen

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
There's hardware purpose-built for this. But if you have to ask... you can't afford it.

https://teledynelecroy.com/ddr/
What LeCroy provides is hardware-level characterization, mostly looking at the eye diagram.
What injection testing/stressing does is create localized signal/data degradation that induces data discrepancies. That is not something LeCroy will be able to do, as far as I am aware.

In a nutshell, LeCroy validates that your DDR links are in spec, nothing else.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
OK, that's too bad. Not that I can afford such a specialist tool, but I intended to ask them what it would cost just out of curiosity. I guess I won't be bothering then :(
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Understood! How about we make a list of known-good configurations then? I mean, I have read countless articles about hardware suggestions, but I do not remember ever reading about a specific setup that actually works. I respect that use cases are wildly different from each other. But does that mean there is no way to have a list of known-working configurations?

For the WD Red it seems to work out.

Kind regards

I actually meant known working configurations with any CPU and Mobo. Not AMD Ryzen particularly. Hopefully this will spark interest.
My apologies for not being clear at first.
 

cl0wnf!sh

Cadet
Joined
Apr 13, 2020
Messages
2
An update from my side; I registered only for this topic :)

So far I have had quite some... adventures with the X470D4U. I experience freezes every 48h; IPMI keeps working, but the system stops taking SSH/keyboard input, either in the KVM console or directly over USB.

My setups so far were:

  • X470D4U, Ryzen 7 1700, 4x Kingston KSM26ED8/16ME
  • X470D4U, Ryzen 7 1700, 2x HyperX FURY schwarz 8GB DDR4-2666 DIMM CL15-17-17-26
I may or may not have been a bit frustrated by the server and this whole corona epidemic, so I bought a bit more RAM (actually 4x Samsung 32GB M391A4G43MB1-CTD) - so my next test setups - as soon as this arrives - will be:
  • X470D4U, Ryzen 7 1700, 4x Samsung 32GB M391A4G43MB1-CTD
  • X470D4U, Ryzen 5 3600, 4x Samsung 32GB M391A4G43MB1-CTD
  • X470D4U, Ryzen 5 3600, 4x Kingston KSM26ED8/16ME
  • X470D4U, Ryzen 5 3600, 2x HyperX FURY schwarz 8GB DDR4-2666 DIMM CL15-17-17-26
The Ryzen 5 3600 is from my gaming rig (as well as the HyperX RAM), but I'm currently out of thermal compound and have to wait until it arrives. I'm currently going with two wild guesses: either the board hates Kingston RAM, or the board hates 1st/2nd Gen Ryzen CPUs. Both CPUs worked flawlessly before in my gaming rig - an MSI B350 Mortar - and neither Kingston RAM setup reported any errors in memtest.

I'll keep you updated on this whole mess. Does somebody know if the X470D4U2-2T or X570D4I-2T have similar issues? ASRock has so far not replied to any of my mails. What I already love: they removed 1st Gen Ryzen from the CPU QVL list, but still have the Kingston RAM in the Memory QVL. I'm 100% sure 1st Gen Ryzen was part of the CPU QVL list before (together with 2nd Gen), especially as 3rd Gen was not even available at launch. What kind of bullshit is ASRock pulling on us here?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
X470D4U2-2T
It's been a bit of a mixed bag on the forum. Reports range from "all okay here" to "oddly unstable if not exactly correctly configured".
 

cl0wnf!sh

Cadet
Joined
Apr 13, 2020
Messages
2
As I can't edit my previous post:
  • X470D4U, Ryzen 1700, memtest86 of Kingston RAM - ECC errors, corrected
  • B350 Mortar, Ryzen 3600, memtest86 of Kingston RAM - no errors at all
  • X470D4U, Ryzen 1700, memtest86 of HyperX Fury RAM - not tested yet

It's been a bit of a mixed bag on the forum. Reports range from "all okay here" to "oddly unstable if not exactly correctly configured".
Looks like ASRock has quite some issues with the Rack Grade Ryzens :/
 

edge-case

Dabbler
Joined
Nov 2, 2019
Messages
28
Looks like ASRock has quite some issues with the Rack Grade Ryzens :/

I had instability issues at first [which I'll put down to user error in accidentally underclocking the Kingston ECC memory after a CMOS reset], but after getting the memory timing correct I'm now at 79 days of continuous uptime, so I'm very happy with it.
That's on a system that serves files to 5 computers, including a full, multi-TB Steam Library, and also runs a Plex [jailed] media server [~7 TB] 24/7 that supplies video to numerous laptops, AppleTVs and iPads....
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
I am happy to report (EDIT: single-bit) ECC reporting functional on at least 2 mobos. (EDIT: Still not found a way to trigger multi-bit errors)
Setups tested so far
* AMD Ryzen 9 3950x & ASUS Prime X570-P
* AMD Ryzen 9 3950x & Asrock rack X470D4U
For me the X470D4U was the most important so I don't think I will continue checking the other boards.

The method I used was to take the inner wire of some electrical cable and stick it into the memory slot holding an 8GB ECC UDIMM. If I understood correctly, FreeNAS does not support ECC error reporting, so I tried using Proxmox instead:
root@pve:~# pveversion
pve-manager/6.1-8/806edfe1 (running kernel: 5.4.24-1-pve)
I saw the corrected errors being reported. Of the two boards, only the X470D4U has IPMI as far as I can tell, but the IPMI log is not showing any ECC errors :( which is a major oversight by the manufacturer if you ask me. Let's hope they can fix that in an update.
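For anyone who wants to reproduce the check on the Proxmox/Linux side, a minimal sketch, assuming the kernel's EDAC driver is loaded for the memory controller: the corrected-error counters can be read straight from sysfs and the individual events show up in the kernel log (counter values and message wording will of course depend on the system):
root@pve:~# grep . /sys/devices/system/edac/mc/mc*/ce_count
root@pve:~# dmesg | grep -i edac
A ce_count greater than 0 means corrected errors have been counted since boot.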
 
Last edited:

b3081a

Cadet
Joined
Apr 15, 2020
Messages
5
I am happy to report ECC reporting functional on at least 2 mobos.
Setups tested so far
* AMD Ryzen 9 3950x & ASUS Prime X570-P
* AMD Ryzen 9 3950x & Asrock rack X470D4U
For me the X470D4U was the most important so I don't think I will continue checking the other boards.

The method I used was to take the inner wire of some electrical cable and stick it into the memory slot holding an 8GB ECC UDIMM. If I understood correctly, FreeNAS does not support ECC error reporting, so I tried using Proxmox instead:
root@pve:~# pveversion
pve-manager/6.1-8/806edfe1 (running kernel: 5.4.24-1-pve)
I saw the corrected errors being reported. Of the two boards, only the X470D4U has IPMI as far as I can tell, but the IPMI log is not showing any ECC errors :( which is a major oversight by the manufacturer if you ask me. Let's hope they can fix that in an update.

So either memtest86 is not injecting memory errors correctly, or AMD's memory error injection implementation is broken? That's really good news.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
I still have not received a response from Passmark. A few weeks ago I sent them a few debug logs for MemTest86 Pro version 8.4 rc2 build 1001. Earlier logs I had sent them were responded to within 2 business days.
I am also still waiting for an answer from them on known-working configurations, even though I have reminded them a few times and explained that I am willing to buy different (also non-AMD) components. I am not sure why they would not want to share a known-working configuration on which ECC error injection works.

So I am afraid I have no additional clues that could indicate whether it's AMD or MemTest86 Pro.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
So either memtest86 is not injecting memory errors correctly, or AMD's memory error injection implementation is broken? That's really good news.
I suspect this is because of the CPU. For memory error injection to fully work, you need the memory, motherboard (BIOS) and CPU to all support it. I've never seen it work on Ryzen, although the motherboard (BIOS) and memory should be OK. Also, I don't think you should see this as a "broken feature". It was probably left out intentionally by AMD, as it could potentially cause security issues (if I remember correctly) and it isn't required for the normal functioning of the computer.

I am happy to report ECC reporting functional on at least 2 mobos.
Setups tested so far
* AMD Ryzen 9 3950x & ASUS Prime X570-P
* AMD Ryzen 9 3950x & Asrock rack X470D4U
For me the X470D4U was the most important so I don't think I will continue checking the other boards.

The method I used was to take the inner wire of some electrical cable and stick it into the memory slot holding an 8GB ECC UDIMM. If I understood correctly, FreeNAS does not support ECC error reporting, so I tried using Proxmox instead:
root@pve:~# pveversion
pve-manager/6.1-8/806edfe1 (running kernel: 5.4.24-1-pve)
I saw the corrected errors being reported. Of the two boards, only the X470D4U has IPMI as far as I can tell, but the IPMI log is not showing any ECC errors :( which is a major oversight by the manufacturer if you ask me. Let's hope they can fix that in an update.
Although I'm very happy to hear that Diversity managed to see reporting of ECC errors on the ASRock Rack X470D4U when using a Ryzen 3950X (Zen 2), I'm also confused and puzzled about what this means for my failed attempts to achieve the same...

Causing memory errors by shorting pins of memory modules

First, I did some research on this "triggering memory errors using needles or wires" approach, which Diversity used. I hadn't looked into this method before, as it seemed too risky and I didn't want to damage or degrade any of my hardware.

I found a good (but complicated) description in the following paper:
[attachment: screenshot from the paper]

Error injection with a shunt probe. To reduce noise and cross-talk between high-speed signals, data pins of the DDR DIMM (DQx) are physically placed next to a ground (VSS) signal. As the ground plane (VSS) has a very low impedance compared to the data signal and because the signal driver is (pseudo) open drain, short-circuiting the VSS and DQx signals will pull DQx from its high voltage level to "0". Depending on the encoding of the high voltage, this short-circuiting results in a 1-to-0 or 0-to-1 bit flip on a given DQx line. Figure 1 displays the locations of the important signals and shows that a DQx signal is always adjacent to a VSS signal. Therefore, to inject a single correctable bit error, while the system exercises the memory by writing and reading all ones, we have to short-circuit a DQx signal with VSS. We can achieve the short-circuiting effect with the help of a custom-built shunt probe using syringe needles (Figure 2a). We insert the probe in the holes of the DIMM socket as shown in Figure 2b. For clarity, we omit the memory module from the picture. We then use tweezers to control when the error is injected by short-circuiting the two needles and thus the targeted DQx and nearby VSS signal. This method, while simple (and cheap), is effective in the case of a memory controller that computes ECCs in a single memory transaction (ECC word size is 64 bits) and can be used instead of expensive ad-hoc equipment [30], [31]. On some systems (e.g., configuration AMD-1) data is retrieved in two memory transactions and then interleaved. Because of the low temporal accuracy of the shunt probe method, an error inserted on memory line DQk (0 ≤ k < 64) that appears on data bit 2*k will also "reflect" on data bit 2*k+1 inside the 128-bit ECC word. In this case the syndrome corresponds to two bit errors and contradicts Proposition 1. To ensure single bit errors, once the interleaving mechanism is understood, the exercising data can be constructed such that the reflected positions contain only bits that are encoded to low voltage, essentially masking the reflections.
So in short, you're connecting a data pin with a ground pin, so that the current on the data pin "flows away" into the ground pin and this "flips a bit". When using the correct pins and not accidentally shorting anything wrong, this should actually be "reasonably safe" to do, I think... (please correct me if I'm wrong :) )
Why is this causing single-bit errors and not multi-bit errors? I THINK because on every "clock tick" data is pulled from each data pin of the memory module. So if you change only the "result" of one pin, you get at most one flipped bit per "clock tick", which ECC can then correct. (not sure though)

This paper is already a bit older and was using DDR3. The 'AMD-1' configuration they are talking about (where interleaving is making things complicated) is an 'AMD Opteron 6376 (Bulldozer (15h))'.
As far as I understand, the extra complexity of "interleaving" that happens on the Opteron system only applies on Ryzen when using dual-rank memory. As Diversity was using a single 8GB module, I suppose he only has single-rank memory, so he wasn't confronted with the "interleaving complexity". However, if I tried this with my 16GB modules, I would be confronted with this extra complexity, because ?all? 16GB modules are dual rank...
I concluded this from this article (but I could be misinterpreting things!):
https://www.reddit.com/r/Amd/comments/6nzjeb/optimising_ryzen_cpu_memory_clocks_for_better/
RankInterleaving: auto (left on default, untested; should only be toggled with Dual Rank Memory* )

As the paper was written for DDR3, you of course need to find the pin layout of DDR4 before you can apply this on Ryzen. The following datasheet has a pretty clear pin layout of an unregistered ECC DDR4 module:
[attachment: screenshot of the DDR4 ECC UDIMM datasheet]

On pages 6, 7 and 8 you can see the description per pin, and on page 17 you can see a picture of where those pin numbers are on the memory module. I suppose all VSS pins are ground pins and the DQ+number pins are data pins. So if we follow the example of the paper and short DQ0 with a VSS, that corresponds to shorting pin 4 (VSS) with pin 5 (DQ0). But I guess shorting pin 2 (VSS) with pin 3 (DQ4) should work equally well.

This should help us "understand" a bit better what Diversity has done and how to "safely" reproduce it.

The results from Diversity
I was in contact with Diversity and have some more details on his testing (all pictures in this "chapter" are from Diversity himself). Diversity used the following video: https://www.youtube.com/watch?v=_npcmCxH2Ig. Instead of needles and tweezers, he used a thin wire, as in the picture below.
[attachment: photo of the wire inserted into the DIMM slot]

He was able to trigger "ECC errors" in Memtest86 and "Corrected errors" in Linux (Proxmox).
[attachments: screenshots of the ECC errors reported in Memtest86 and the corrected errors reported in Proxmox]


My testing / experiences
As you know, I also tried triggering reporting of corrected memory errors. I tried this by overclocking / undervolting the memory to the point where it is on the edge of stability. This edge is very "thin", can be hard to reach and can result in the following scenarios in my understanding:
1) Not unstable enough, so no errors at all
2) Only just unstable enough, so that single-bit error occurs only sometimes when stressing the memory enough. These will then be corrected by ECC and will not cause faults or crashes.
3) A little more unstable, so that single-bit errors occur a bit more often and less stress is required on the memory to achieve this. But also (uncorrected) multi-bit errors can occur sometimes, which could cause faults / crashes.
4) Even a little bit more unstable, so that mostly multi-bit errors occur when stressing the memory and single bit errors might be rare. This also makes the system more prone to faults and crashes.
5) Even more unstable, so the multi-bit errors occur even when hardly stressing the memory at all. This makes the system very unstable and probably will not be able to boot into OS all the time.
6) Too unstable, so that it doesn't boot at all.
Both scenario 2) and 3) are "good enough" for testing reporting of corrected memory errors. Perhaps even scenario 4), if you're lucky...

During all my testing I tried 100+ possible memory settings, using all kinds of frequencies, timings and voltages, of which 10-30 were potentially in scenario 2) or 3). I "PARTLY" kept track of all testing in the (incomplete) Excel sheet below:
[attachment: screenshot of the (incomplete) testing spreadsheet]

This convinced me that I should have at least once been in scenario 2) or 3), where I should have seen corrected errors (but didn't). That is why I concluded that it didn't work and I contacted Asrock Rack and AMD to work on this.

Conclusions / Questions / Concerns
Now what does all of this mean? Does it mean that I never reached scenario 2) or 3)? Does it mean scenarios 2) and 3) are almost impossible to reach using the methods I tried? Or does it mean that Diversity perhaps triggered a different kind of memory error? I'm not sure and I hope someone can clarify...

I know there is error correction and reporting happening on many layers in a modern computer. As far as I know, these are:
1) Inside the memory module itself (only when you have ECC modules). The memory module then has an extra chip on the module to store checksums. I think this works similarly to RAID5 for HDDs, so that a memory error is detected and corrected in the module itself, even before it leaves the memory module.
2) On the way from the memory module to the memory controller on the CPU (data bus). Error detection / correction / reporting for these kinds of errors is handled by the memory controller in the CPU, so ECC memory isn't even required to make this work.
3) Inside the CPU, data is also transferred between the L1/L2/L3 caches and the CPU cores. Error detection / correction / reporting is also possible there, I think.

All of these might look confusingly similar when reported to the OS, but I do think they are often reported in a slightly different manner. I've seen reports where the CPU cache (L1/L2/L3) was clearly mentioned when reporting a corrected error for example, but I'm not sure what the exact difference between reports of 1) and 2) would be.
In Proxmox screenshots I do read things like "Unified Memory Controller..." and "Cache level: L3/GEN...", but again, I'm not entirely sure if these mean that the errors are in 2) or 3) instead of 1)...
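One practical way to help separate these layers, as a sketch assuming a Debian-based system like the Proxmox host used above: rasdaemon records EDAC memory-controller events and machine-check (CPU/cache) events separately, so looking at its output hints at whether an error was counted against a DIMM/channel or against a cache level:
root@pve:~# apt install rasdaemon
root@pve:~# systemctl enable --now rasdaemon
root@pve:~# ras-mc-ctl --summary
root@pve:~# ras-mc-ctl --errors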

Diversity is draining the current from a data-pin "outside" of the memory module, but I still see 2 possibilities of what's happening:
1) The drain on the data-pin is pulling the current away also inside the memory module itself, where the memory module detects the error, corrects and reports it.
2) The drain on the data-pin only happens after the memory module has done its checks and the error is only detected on the databus and corrected / reported there (so not by the ECC functionality of the memory module).

To know which scenario is happening, we could:
  • try to find someone with knowledge on exactly how each type of error is reported and who can exactly identify what is being reported
  • perhaps using a non-ECC single-rank module also reports the same kind of ECC errors, which would prove it is happening on the data bus.
  • perhaps someone with more electrical engineering knowledge (I don't have any, actually) can say something more meaningful than I can?
Diversity did mention that his IPMI log was still empty after getting those errors. So there the motherboard / IPMI is certainly missing some required functionality.
 
Last edited:

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Why is this causing single-bit errors and not multi-bit errors?
DRAM in PCs is by convention effectively a 64/72-bit bus per memory channel (the extra 8 bits are for ECC). If you short out a single data line, you can only ever affect a single bit. Unless you start destroying things, of course.
 

b3081a

Cadet
Joined
Apr 15, 2020
Messages
5
One quick question about the X470D4U: is it possible to split the bottom x8 PCIe slot into two M.2 NVMe x4 slots using a bifurcation card without a PLX switch, while keeping the top PCIe slot running at x8? My plan is to build a system with the following PCIe devices:

M2_1 and M2_2 - Small and slow NVMe drive for OS (mirror)
x8 - HBA connecting to SAS HDDs
x4 - dual port 10GbE Intel NIC
x8 - 2 small & fast NVMe drives for SLOG (mirrored)

PLX cards are quite expensive, and the PLX itself could become a single point of failure, so I'm trying to avoid that.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Have you checked the manual? If it doesn't help, I'd try asking ASRock directly.

Let us know if you hear back.
 

j4ys0n

Cadet
Joined
Apr 25, 2020
Messages
1
@Mastakilla what'd you end up deciding on for the SLOG device?

I'm doing a similar build. I ended up going with 2x 500GB WD SN750 NVMe drives. The plan is to partition each and set up a mirrored 20GB SLOG and a 400GB L2ARC (the rest is left for the drive to manage its blocks).

The motherboard has 2x M.2 NVMe slots, and the WD drives' specs say they can handle the IOPS for read and write cache.

I also thought about 2x 16GB Optane M.2 drives in a mirror for the SLOG, and 2x 1TB SSDs in a mirror for the L2ARC.

If you're interested, the rest of my specs are:
ASRock X570M Pro4 motherboard
Ryzen 3200G
32GB non-ECC memory
LSI 9211-8i HBA (flashed to IT mode)
8x 5TB Seagate 2.5" HDDs
10G NIC

On the NICs, I'm deciding between an Intel X550 and a TRENDnet TEG-10GECTX, which is really a Tehuti TN40xx. I have both; the TN40xx needed a module to be added to FreeNAS and then a command on startup (I can provide more info if anyone wants). I haven't tested the X550, but I'm thinking it probably just works.
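For reference, that partitioning plan would look roughly like this on a FreeBSD-based FreeNAS install (a sketch only, untested; the pool name tank, the nvd0/nvd1 device names and the GPT labels are placeholders, and note that ZFS cache devices are always striped, never mirrored):
gpart create -s gpt nvd0
gpart create -s gpt nvd1
gpart add -t freebsd-zfs -s 20G -l slog0 nvd0
gpart add -t freebsd-zfs -s 20G -l slog1 nvd1
gpart add -t freebsd-zfs -s 400G -l l2arc0 nvd0
gpart add -t freebsd-zfs -s 400G -l l2arc1 nvd1
zpool add tank log mirror gpt/slog0 gpt/slog1
zpool add tank cache gpt/l2arc0 gpt/l2arc1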
 

snake

Cadet
Joined
Aug 13, 2017
Messages
3
One quick question about the X470D4U: is it possible to split the bottom x8 PCIe slot into two M.2 NVMe x4 slots using a bifurcation card without a PLX switch, while keeping the top PCIe slot running at x8?

From the PCIe point of view, looking at the manual, yes. When both the physical 16x slots are in use, they both work at 8x speed. Otherwise the top works at 16x and the bottom one should be unoccupied.
No idea about bifurcation cards.
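If anyone does try a passive bifurcation riser in that slot, a quick way to check the result (a sketch, assuming a FreeBSD-based FreeNAS install) is to confirm both NVMe drives enumerate and to look at the negotiated link widths:
nvmecontrol devlist
pciconf -lc | grep -B1 'PCI-Express'
The "link xN" field in each PCI-Express capability line shows the negotiated width per device.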
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
8x 5TB Seagate 2.5" HDDs

I am betting those are DM-SMR; the only 2.5" 5TB drives I know of are the ST5000LM000.

So you've got drives that are slow at random writes, a single vdev limiting you to the IOPS of one drive, and a 10Gbit/s NIC. I am... feeling some design tension. Not to mention the split SLOG/L2ARC design with a mere 32GiB of RAM - what are you solving for? What's doing sync writes? How big is your read dataset, and have you determined that the reduction in ARC is worth the addition of the L2ARC?


can handle the IOPS for read and write cache

SLOG isn't a write cache. It's there to speed up sync writes to where they can catch back up to non-sync writes. Sync writes are a thing with VMware over NFS / iSCSI, for example. Also with macOS over SMB, but arguably that use case doesn't need sync, and it's often better to just turn off sync on the SMB side.
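If turning off sync is the route taken, it's a per-dataset ZFS property; a minimal sketch, with tank/mac-share standing in for whatever dataset backs the SMB share:
zfs set sync=disabled tank/mac-share
zfs get sync tank/mac-share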

I am asking a bunch of pointed questions to help you out with the design, and to provide valuable build information to others who will read your answers. Imagine me looking friendly while I ask these. I'm saying that because pointed questions can come across as a little aggressive in text.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
@Mastakilla what'd you end up deciding on for the SLOG device?

I'm doing a similar build. I ended up going with 2x 500GB WD SN750 NVMe drives. The plan is to partition each and set up a mirrored 20GB SLOG and a 400GB L2ARC (the rest is left for the drive to manage its blocks).

The motherboard has 2x M.2 NVMe slots, and the WD drives' specs say they can handle the IOPS for read and write cache.

I also thought about 2x 16GB Optane M.2 drives in a mirror for the SLOG, and 2x 1TB SSDs in a mirror for the L2ARC.

If you're interested, the rest of my specs are:
ASRock X570M Pro4 motherboard
Ryzen 3200G
32GB non-ECC memory
LSI 9211-8i HBA (flashed to IT mode)
8x 5TB Seagate 2.5" HDDs
10G NIC

On the NICs, I'm deciding between an Intel X550 and a TRENDnet TEG-10GECTX, which is really a Tehuti TN40xx. I have both; the TN40xx needed a module to be added to FreeNAS and then a command on startup (I can provide more info if anyone wants). I haven't tested the X550, but I'm thinking it probably just works.
I am currently skipping the SLOG / L2ARC. I was quite happy with my initial performance testing (without SLOG / L2ARC) so far (500-800MB/sec over LAN). I may still change my mind later ;)
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
I am currently skipping the SLOG / L2ARC.

Wise. With the exception of some very particular use cases - FreeNAS as storage for VMware, for example - SLOG gains you nothing.

L2ARC can be useful if the read dataset doesn't fit into ARC, but particularly with these home builds with 32GiB of RAM to start out with, I'd bet crullers to donuts that in 90-some percent of cases, boosting RAM to 64GiB will be more than sufficient to fit the dataset, and do more than an L2ARC - and if you still need an L2ARC, at least you'll have the ARC to burn for it.
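To sanity-check whether ARC is already covering the read workload before buying anything, the ARC counters can be read on the running FreeNAS box (a sketch; compare hits vs. misses over a period of normal use):
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses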

Side thought: For those that absopositively HAVE TO HAVE (I am feeling you, sib) a speedup option in their pool: Consider a special allocation mirror vdev, stick metadata and small files on there, for some value of "small" - could be 16kB to 64kB, depending on size of SSD and expected number of files. Just keep in mind you can't change your mind on that without blowing away the pool, a special allocation vdev is just as permanent as any other data holding vdev.
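Roughly what that looks like, as a sketch (requires an OpenZFS version with special allocation classes, which older FreeNAS releases don't have; the pool name tank, the device labels and the 32K threshold are placeholders):
zpool add tank special mirror gpt/special0 gpt/special1
zfs set special_small_blocks=32K tank
zpool status tank
Metadata, plus files with blocks at or below special_small_blocks, then land on the SSD mirror; everything else stays on the spinning vdevs.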
 
Last edited:

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
I'm happy to report that, after disabling "Platform First Error Handling (PFEH)" in the BIOS, (corrected) single-bit errors are properly reported to the OS, also when overclocking / undervolting! So I'm now getting the same results as Diversity with his memory-pin-shorting method...

The reason I was failing to detect this earlier was:
  • Memtest86 v8.2 reported "unknown" for ECC support. Memtest86 v8.3 reported "enabled" for ECC support. So I assumed that, if it was working, Memtest86 v8.3 should be able to detect the errors. However, a couple of days ago I noticed that Memtest86 v8.4 beta lists Zen 2 ECC support in its changelog. After testing, it turned out that Memtest86 v8.3 does NOT support Zen 2 ECC, but Memtest86 v8.4 beta DOES.
  • I only discovered the BIOS option "Platform First Error Handling (PFEH)" very recently. During all my previous testing, except for the very last couple of short tests, it was set to the default "enabled". I probably did too little testing in Linux / Windows after disabling it.
So in short:
  1. (corrected) single-bit memory errors -> motherboard (BIOS) -> OS ==> works 100%
  2. (corrected) single-bit memory errors -> motherboard (BIOS) -> OS -> IPMI ==> not sure if the OS properly forwards the error to the IPMI
  3. (corrected) single-bit memory errors -> motherboard (BIOS) -> IPMI ==> 100% broken
  4. (corrected) single-bit memory errors -> motherboard (BIOS) -> IPMI -> OS ==> 100% broken
  5. (uncorrected) multi-bit memory errors -> * ==> I'm not sure if it is broken (or perhaps not even possible on Zen 2) or if we just haven't been able to trigger them yet. I've run Memtest86 v8.4 with unstable memory for many hours now. In doing so, I've triggered about 3000 "ECC Correctable Errors" (= single-bit) and about 100 CPU errors, but 0 "ECC Uncorrectable Errors" (= multi-bit). Also using the shorting method, we haven't achieved any "ECC Uncorrectable Errors" (= multi-bit) yet. We are currently in contact with the people who wrote the paper (see link above for details) that explains the shorting method, to see how to trigger multi-bit errors reliably.
So if I understand it correctly
  1. = ok
  2. I think we can only validate this once 3) is fixed
  3. Is actually a bug and should be fixed by Asrock Rack (with help of AMD and perhaps the IPMI-chip manufacturer Aspeed)
  4. Can only work / be fixed once 3) gets fixed
  5. Suggestions are welcome. Perhaps AMD can confirm whether Zen 2 properly supports this? But not the way AMD TW claimed that "reporting is not supported", which we have now clearly proven to be false.
I've sent this information to ASRock Rack + AMD. ASRock Rack, on the same day, confirmed that they, together with AMD, had come to the exact same conclusion (using error injection in Linux) and that they have asked AMD for assistance to get these errors reported in the IPMI as well. So hopefully we'll someday get this important feature on this motherboard!!
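For anyone who wants to check the IPMI side (paths 3 and 4) on their own board: the BMC's event log can be read from the running OS with ipmitool, assuming the in-band IPMI driver is loaded (ipmi_si/ipmi_devintf on Linux, ipmi(4) on FreeBSD); a corrected-memory event would normally show up as a Memory sensor entry in the SEL:
ipmitool sel elist
ipmitool sdr type Memory
On the X470D4U the SEL stayed empty after the injected errors, which is exactly the gap described above.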

In the meantime, Diversity (especially) and I are still trying to trigger (uncorrected) multi-bit errors as well. We're in contact with the folks behind ECCploit for this, who have very profound knowledge of this matter... (Check out Lucian's talk at OffensiveCon19: https://www.youtube.com/watch?v=R2aPo_wwmZw).

Finally some real progress on this matter! Thanks to Diversity for getting my hopes up again, cause I almost gave up on this...
 
Last edited: