Scale Hardware Error / Memory failure dmesg, unscheduled restarts

radomirpolach

Explorer
Joined
Feb 13, 2022
Messages
71
I am trying to run TrueNAS SCALE on my NAS.

The hardware is an AsRock B450M-F board, a Ryzen 3600, and 4x 32GB Mustang 2666MHz ECC UDIMM memory.

I am getting random restarts, and I can reproduce them immediately with:
Code:
stress-ng -c 18 -i 8 -m 8 --vm-bytes 1G -t 3600s

Nothing shows up in the logs.

When I lowered the memory speed to 2400MHz, stress-ng no longer triggered restarts, but I now get this in dmesg:
Code:
[   82.052153] mce: Uncorrected hardware memory error in user-access at 151137d40
[   82.052157] mce: Uncorrected hardware memory error in user-access at 41ef01040
[   82.052586] Memory failure: 0x151137: Sending SIGBUS to stress-ng:10254 due to hardware memory corruption
[   82.059802] mce: [Hardware Error]: Machine check events logged
[   82.067151] Memory failure: 0x151137: recovery action for dirty LRU page: Recovered
[   82.090998] [Hardware Error]: Uncorrected, software restartable error.
[   82.097795] [Hardware Error]: CPU:9 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[   82.108047] Memory failure: 0x41ef01: Sending SIGBUS to stress-ng:10249 due to hardware memory corruption
[   82.108780] [Hardware Error]: Error Addr: 0x000000041ef01040
[   82.118511] Memory failure: 0x41ef01: recovery action for dirty LRU page: Recovered
[   82.124371] [Hardware Error]: IPID: 0x000000b000000000
[   82.124373] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[   82.124374] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[   82.124939] mce: [Hardware Error]: Machine check events logged
[   82.160714] [Hardware Error]: Uncorrected, software restartable error.
[   82.167478] [Hardware Error]: CPU:11 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[   82.178510] [Hardware Error]: Error Addr: 0x0000000151137d40
[   82.184445] [Hardware Error]: IPID: 0x000000b000000000
[   82.189866] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[   82.200315] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[   82.207173] [Hardware Error]: Deferred error, no action required.
[   82.213506] [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[   82.225004] [Hardware Error]: Error Addr: 0x00000001ef780840
[   82.230958] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xe0b1f5aa0b800000
[   82.238979] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[   82.247996] EDAC MC0: 1 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:0x7fde02 offset:0x140 grain:64)
[   82.258033] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[   83.547317] mce: Uncorrected hardware memory error in user-access at 40b12a980
[   83.547326] [Hardware Error]: Uncorrected, software restartable error.
[   83.561611] [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[   83.572672] [Hardware Error]: Error Addr: 0x000000040b12a980
[   83.578592] [Hardware Error]: IPID: 0x000000b000000000
[   83.583972] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[   83.594384] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[   83.597563] mce: Uncorrected hardware memory error in user-access at 2f0ea8e80
[   83.599476] Memory failure: 0x40b12a: Sending SIGBUS to stress-ng:10251 due to hardware memory corruption
[   83.599481] Memory failure: 0x40b12a: recovery action for dirty LRU page: Recovered
[   83.601388] [Hardware Error]: Deferred error, no action required.
[   83.612297] Memory failure: 0x2f0ea8: Sending SIGBUS to stress-ng:25155 due to hardware memory corruption
[   83.618318] [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[-|-|MiscV|AddrV|-
[   83.626218] Memory failure: 0x2f0ea8: recovery action for dirty LRU page: Recovered
[   83.657525] |SyndV|UECC|Deferred|-|-]: 0x9c2030000000011b
[   83.663214] [Hardware Error]: Error Addr: 0x00000001e58954a0
[   83.669143] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x8b99f5aa0b800000
[   83.677180] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[   83.686548] EDAC MC0: 1 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:0x7d6255 offset:0x1a0 grain:64)
[   83.696546] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[   83.703826] [Hardware Error]: Uncorrected, software restartable error.
[   83.710647] [Hardware Error]: CPU:8 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[   83.721471] [Hardware Error]: Error Addr: 0x00000002f0ea8e80
[   83.727396] [Hardware Error]: IPID: 0x000000b000000000
[   83.732811] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[   83.743253] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[   87.023582] IPv6: ADDRCONF(NETDEV_CHANGE): veth6aaff4b8: link becomes ready
[   87.031830] kube-bridge: port 12(veth6aaff4b8) entered blocking state
[   87.038601] kube-bridge: port 12(veth6aaff4b8) entered disabled state
[   87.050522] device veth6aaff4b8 entered promiscuous mode
[   87.058045] kube-bridge: port 12(veth6aaff4b8) entered blocking state
[   87.066342] kube-bridge: port 12(veth6aaff4b8) entered forwarding state
[  101.856294] kube-bridge: port 2(veth4c74b712) entered disabled state
[  101.874914] device veth4c74b712 left promiscuous mode
[  101.881788] kube-bridge: port 2(veth4c74b712) entered disabled state
[  102.976424] kube-bridge: port 1(veth3cf7860f) entered disabled state
[  103.007006] device veth3cf7860f left promiscuous mode
[  103.013891] kube-bridge: port 1(veth3cf7860f) entered disabled state
[  165.861289] kube-bridge: port 5(vethef0f9ce0) entered disabled state
[  165.874955] device vethef0f9ce0 left promiscuous mode
[  165.882089] kube-bridge: port 5(vethef0f9ce0) entered disabled state
[  242.571209] mce: Uncorrected hardware memory error in user-access at 15f9c7e40
[  242.571211] mce: Uncorrected hardware memory error in user-access at 1ea993940
[  242.571221] mce_notify_irq: 4 callbacks suppressed
[  242.571221] mce: [Hardware Error]: Machine check events logged
[  242.598398] Memory failure: 0x1ea993: Sending SIGBUS to stress-ng:25967 due to hardware memory corruption
[  242.598399] Memory failure: 0x15f9c7: Sending SIGBUS to stress-ng:10257 due to hardware memory corruption
[  242.598404] Memory failure: 0x15f9c7: recovery action for dirty LRU page: Recovered
[  242.604219] [Hardware Error]: Uncorrected, software restartable error.
[  242.615694] Memory failure: 0x1ea993: recovery action for dirty LRU page: Recovered
[  242.654703] [Hardware Error]: CPU:2 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[  242.667325] [Hardware Error]: Error Addr: 0x00000001ea993940
[  242.675049] [Hardware Error]: IPID: 0x000000b000000000
[  242.682147] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[  242.694297] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[  242.702693] mce: [Hardware Error]: Machine check events logged
[  242.710564] [Hardware Error]: Uncorrected, software restartable error.
[  242.719131] [Hardware Error]: CPU:1 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[  242.731776] [Hardware Error]: Error Addr: 0x000000015f9c7e40
[  242.739497] [Hardware Error]: IPID: 0x000000b000000000
[  242.746665] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[  242.758860] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[  242.767367] [Hardware Error]: Corrected error, no action required.
[  242.775689] [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
[  242.788463] [Hardware Error]: Error Addr: 0x0000000171463380
[  242.796176] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xad6105060a800800
[  242.806065] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  242.816551] EDAC MC0: 1 CE on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x60518c offset:0xc80 grain:64 syndrome:0x506)
[  242.829732] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[  242.838320] [Hardware Error]: Deferred error, no action required.
[  242.846446] [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[  242.859792] [Hardware Error]: Error Addr: 0x000000008fce3f40
[  242.867520] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x5063f5aa0b800000
[  242.877326] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  242.888975] EDAC MC0: 1 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:0x27f38f offset:0xd40 grain:64)
[  242.900771] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[  247.764236] mce: Uncorrected hardware memory error in user-access at 293075c40
[  247.764248] [Hardware Error]: Uncorrected, software restartable error.
[  247.781845] [Hardware Error]: CPU:2 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[  247.794203] [Hardware Error]: Error Addr: 0x0000000293075c40
[  247.801828] [Hardware Error]: IPID: 0x000000b000000000
[  247.808792] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[  247.820753] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[  247.829108] Memory failure: 0x293075: Sending SIGBUS to stress-ng:32947 due to hardware memory corruption
[  247.840508] Memory failure: 0x293075: recovery action for dirty LRU page: Recovered
[  249.584335] mce: Uncorrected hardware memory error in user-access at 398a6e6c0
[  249.584345] [Hardware Error]: Uncorrected, software restartable error.
[  249.601945] [Hardware Error]: CPU:2 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[  249.614617] [Hardware Error]: Error Addr: 0x0000000398a6e6c0
[  249.622085] [Hardware Error]: IPID: 0x000000b000000000
[  249.629118] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[  249.641310] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[  249.641584] Memory failure: 0x398a6e: Sending SIGBUS to stress-ng:32865 due to hardware memory corruption
[  249.660976] Memory failure: 0x398a6e: recovery action for dirty LRU page: Recovered
[  518.655854] mce: Uncorrected hardware memory error in user-access at 29f23f880
[  518.655857] mce: Uncorrected hardware memory error in user-access at 322ca7640
[  518.655859] mce: Uncorrected hardware memory error in user-access at 63014e600
[  518.655867] mce_notify_irq: 4 callbacks suppressed
[  518.655868] mce: [Hardware Error]: Machine check events logged
[  518.697362] [Hardware Error]: Uncorrected, software restartable error.
[  518.705720] [Hardware Error]: CPU:6 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[  518.718320] [Hardware Error]: Error Addr: 0x000000063014e600
[  518.725866] [Hardware Error]: IPID: 0x000000b000000000
[  518.732812] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[  518.744826] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[  518.753227] mce: [Hardware Error]: Machine check events logged
[  518.760910] [Hardware Error]: Uncorrected, software restartable error.
[  518.769344] [Hardware Error]: CPU:2 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[  518.781859] [Hardware Error]: Error Addr: 0x000000029f23f880
[  518.789402] [Hardware Error]: IPID: 0x000000b000000000
[  518.796363] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[  518.808418] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[  518.816867] [Hardware Error]: Uncorrected, software restartable error.
[  518.825259] [Hardware Error]: CPU:5 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[  518.837852] [Hardware Error]: Error Addr: 0x0000000322ca7640
[  518.845415] [Hardware Error]: IPID: 0x000000b000000000
[  518.852418] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[  518.864457] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[  518.872839] [Hardware Error]: Deferred error, no action required.
[  518.880760] [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[  518.893964] [Hardware Error]: Error Addr: 0x0000000171653b40
[  518.901529] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xfdd9f5750b800000
[  518.911160] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  518.921748] EDAC MC0: 1 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:0x60594e offset:0xd40 grain:64)
[  518.933451] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[  518.933811] Memory failure: 0x322ca7: Sending SIGBUS to stress-ng:10258 due to hardware memory corruption
[  518.942081] Memory failure: 0x29f23f: Sending SIGBUS to stress-ng:10253 due to hardware memory corruption
[  518.942227] Memory failure: 0x63014e: Sending SIGBUS to stress-ng:32972 due to hardware memory corruption
[  518.942233] Memory failure: 0x63014e: recovery action for dirty LRU page: Recovered
[  518.953644] Memory failure: 0x322ca7: recovery action for dirty LRU page: Recovered
[  518.997232] Memory failure: 0x29f23f: recovery action for dirty LRU page: Recovered
[  524.044278] mce: Uncorrected hardware memory error in user-access at 1e366bc00
[  524.044289] [Hardware Error]: Uncorrected, software restartable error.
[  524.061777] [Hardware Error]: CPU:3 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[  524.074187] [Hardware Error]: Error Addr: 0x00000001e366bc00
[  524.081692] [Hardware Error]: IPID: 0x000000b000000000
[  524.088702] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[  524.100659] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[  524.109863] Memory failure: 0x1e366b: Sending SIGBUS to stress-ng:36521 due to hardware memory corruption
[  524.121603] Memory failure: 0x1e366b: recovery action for dirty LRU page: Recovered



I would blame the memory, but I just ran memtest for 2 hours at 2666MHz and not a single issue occurred, even though stress-ng can make the machine restart within a few seconds. No ECC errors were reported either.

What is going on?

The PSU is a quality Seasonic unit, and I had no issues with it in the previous build.
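
For anyone skimming a log like the one above, here is a minimal Python sketch that tallies the EDAC summary lines per memory controller, csrow and channel, and counts the SIGBUS deliveries per process. The function name `summarize_dmesg` and both regular expressions are my own assumptions, based only on the line formats visible in this post; other kernels may print these lines differently.

```python
import re
from collections import Counter

# Matches EDAC summary lines as they appear in the log above, e.g.
#   EDAC MC0: 1 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:...)
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) on \S+ "
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+)"
)
# Matches "Memory failure: 0x...: Sending SIGBUS to <process>:<pid>"
SIGBUS_LINE = re.compile(
    r"Memory failure: 0x[0-9a-f]+: Sending SIGBUS to (\S+):\d+"
)


def summarize_dmesg(text):
    """Return (edac, sigbus) counters built from dmesg text.

    edac is keyed by (mc, csrow, channel, "CE" or "UE");
    sigbus is keyed by process name.
    """
    edac = Counter()
    sigbus = Counter()
    for line in text.splitlines():
        m = EDAC_LINE.search(line)
        if m:
            key = (int(m["mc"]), int(m["csrow"]), int(m["channel"]), m["kind"])
            edac[key] += int(m["count"])
            continue
        m = SIGBUS_LINE.search(line)
        if m:
            sigbus[m.group(1)] += 1
    return edac, sigbus


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    edac, sigbus = summarize_dmesg(sys.stdin.read())
    for (mc, csrow, channel, kind), n in sorted(edac.items()):
        print(f"mc{mc} csrow{csrow} channel{channel}: {n} {kind}")
    for proc, n in sorted(sigbus.items()):
        print(f"{proc}: {n} SIGBUS")
```

In the excerpt posted here, every UE summary line points at csrow 0, channel 1, and the single CE line points at csrow 0, channel 0, which is the kind of pattern this tally makes easy to see.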
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Are you overclocking? TrueNAS does not respond well to overclocking. Your experience suggests there's an automatic 10% overclock in effect, and backing off to 2400MHz actually achieves 2666MHz.
 

radomirpolach

Explorer
Joined
Feb 13, 2022
Messages
71
Samuel Tai said:
Are you overclocking? TrueNAS does not respond well to overclocking. Your experience suggests there's an automatic 10% overclock in effect, and backing off to 2400MHz actually achieves 2666MHz.

No, overclocking is not enabled. Also, the restarts stop at 2400MHz, but I get errors in dmesg instead. Should I underclock the CPU?
 

radomirpolach

Explorer
Joined
Feb 13, 2022
Messages
71
Code:
$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 5 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 1 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 1 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
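
To pull only the non-zero counters out of `edac-util -v` output like the above, a small Python sketch could look like this. The helper name `nonzero_counters` and the regular expression are assumptions of mine, written against the exact line format shown in this post; they are not part of edac-utils.

```python
import re

# One counter per line, e.g.
#   mc0: csrow0: mc#0csrow#0channel#1: 1 Corrected Errors
ROW = re.compile(
    r"^(?P<where>.+?): (?P<count>\d+) (?P<kind>Corrected|Uncorrected) Errors"
)


def nonzero_counters(output):
    """Return a list of (location, kind, count) for counters above zero."""
    hits = []
    for line in output.splitlines():
        m = ROW.match(line.strip())
        if m and int(m["count"]) > 0:
            hits.append((m["where"], m["kind"], int(m["count"])))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: edac-util -v | python3 this_script.py
    for where, kind, count in nonzero_counters(sys.stdin.read()):
        print(f"{where}: {count} {kind}")
```

Applied to the output above, only the three csrow0 lines remain, so everything reported so far sits on csrow 0.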
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
radomirpolach said:
I would blame the memory, but I just ran memtest for 2 hours at 2666MHz and not a single issue occurred, even though stress-ng can make the machine restart within a few seconds. No ECC errors were reported either.

It's strongly recommended that you run at least a partial burn-in anytime you change hardware or experience hardware problems.


Note that two hours is *nothing* in the grand scheme of things, especially with 128GB RAM. Two days or even two weeks is more indicative.

Alternatively, you can try isolating it one DIMM at a time, seeing if the removal of any particular DIMM results in smoother sailing. Note that the DIMM itself may not be faulty; CPU pins and DIMM socket pins routinely go bad, and the DIMM may simply need reseating. You can tell whether the fault lies in the DIMM or elsewhere by swapping DIMMs and watching whether the problem moves with the DIMM.
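
The swap test described above boils down to a small decision rule, sketched here in Python. The function name `interpret_swap` and the slot labels are hypothetical; in practice you would first have to map the csrow and channel that EDAC reports onto physical slots for your board.

```python
def interpret_swap(slot_before, slot_after, swapped):
    """Interpret a DIMM swap test.

    slot_before: slot where errors were reported before the swap
    slot_after:  slot where errors are reported after the swap
    swapped:     pair of slots whose DIMMs were exchanged
    """
    a, b = swapped
    if slot_before not in (a, b):
        # The failing slot was not part of the swap, so nothing was tested.
        return "inconclusive"
    other = b if slot_before == a else a
    if slot_after == other:
        # The errors followed the module into its new slot.
        return "dimm"
    if slot_after == slot_before:
        # The errors stayed put although the module changed.
        return "slot, board, or CPU"
    return "inconclusive"


# Hypothetical slot labels for illustration only.
print(interpret_swap("A1", "B1", ("A1", "B1")))
print(interpret_swap("A1", "A1", ("A1", "B1")))
```

This only holds if the errors are reproducible enough to be confident about where they show up after the swap.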
 

radomirpolach

Explorer
Joined
Feb 13, 2022
Messages
71
Yes, I can imagine. But it is very strange that in TrueNAS I can reproduce the errors within seconds, while memtest finds nothing.

I just swapped two DIMMs and I will rerun the test.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Do we know that whichever memtest variant you've chosen to use supports whatever error detection is in use on this Ryzen platform? Memory error detection is not a standardized feature on the PC platform, and it looks like the error you're getting is coming over the Machine Check Architecture subsystem. I'm not sure if either of the memtest suites picks up on that; I'd have to go read the docs.
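
As a rough illustration of what those Machine Check lines encode, here is a minimal Python sketch that decodes the MCx_STATUS values from the dmesg output earlier in the thread. The bit positions below are my reading of the AMD scalable MCA layout as the Linux kernel decodes it, and they reproduce the flag lists the kernel printed in this log; treat them as an assumption and check the AMD documentation for your CPU family before relying on them. The names `decode_status` and `page_frame` are made up for this example.

```python
# Assumed bit positions in MCx_STATUS (AMD scalable MCA).
FLAGS = {
    "Val": 63,       # register contains a valid error
    "Over": 62,      # an earlier error was overwritten
    "UC": 61,        # uncorrected error
    "EN": 60,        # error reporting enabled
    "MiscV": 59,     # MISC register valid
    "AddrV": 58,     # ADDR register valid
    "PCC": 57,       # processor context corrupt
    "SyndV": 53,     # syndrome register valid
    "CECC": 46,      # corrected by ECC
    "UECC": 45,      # not correctable by ECC
    "Deferred": 44,  # deferred error
    "Poison": 43,    # poisoned data was consumed
}


def decode_status(status):
    """Return (set of flag names, extended error code) for a status value."""
    flags = {name for name, bit in FLAGS.items() if (status >> bit) & 1}
    ext_code = (status >> 16) & 0x3F  # extended error code, bits 21:16
    return flags, ext_code


def page_frame(addr, page_shift=12):
    """Physical address to page frame number, assuming 4 KiB pages."""
    return addr >> page_shift


if __name__ == "__main__":
    for value in (0xBC002800000C0135, 0xDC2030000000011B, 0xDC2040000000011B):
        flags, ext = decode_status(value)
        print(f"{value:#018x}: ext code {ext}, flags {sorted(flags)}")
    print(hex(page_frame(0x151137D40)))
```

The first value is the MC0 (Load Store Unit) status with extended code 12, UECC and Poison set; the second is the MC18 memory controller status with UECC and Deferred set; the third is the MC17 status with CECC set. The last line shows how the address 151137d40 from the mce line maps to the 0x151137 page reported by the "Memory failure" line.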
 

radomirpolach

Explorer
Joined
Feb 13, 2022
Messages
71
I tested ECC with memtest and it worked fine. I deliberately caused errors by overclocking the memory to see whether they would be corrected, and they were.
 

radomirpolach

Explorer
Joined
Feb 13, 2022
Messages
71
I solved all the memory issues, but the reboots don't seem to have stopped, so they were probably unrelated. Now I get a random reboot roughly every 10 minutes for some reason.
 

radomirpolach

Explorer
Joined
Feb 13, 2022
Messages
71
I replaced the CPU with a Ryzen 4650 PRO, reduced the memory to 2x32GB, and replaced the motherboard with another AsRock B450M-F. I still have random reboots.

I ran memtest for 2 hours and stress-ng for an hour from a live Ubuntu session. No issues.

Could this be a software issue in TrueNAS?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's not impossible, but it's unlikely. Lots of people use TrueNAS.
 

radomirpolach

Explorer
Joined
Feb 13, 2022
Messages
71
Any suggestions? I just got two restarts in a row, almost at idle.
I will try replacing the power supply again. I would blame the PSU or the RAM, but the RAM passed memtest for 2 hours with no issues, and I have already changed the PSU once.
 