radomirpolach (Explorer, joined Feb 13, 2022, 71 messages)
I am trying to use Scale on my NAS.
I am using an ASRock B450M-F, a Ryzen 3600, and 4x 32GB Mustang 2666 MHz ECC UDIMM memory.
I am getting random restarts, which I was able to reproduce immediately with:
Code:
stress-ng -c 18 -i 8 -m 8 --vm-bytes 1G -t 3600s
Nothing shows up in the logs.
When I lowered the memory speed to 2400 MHz, the restarts under stress-ng disappeared, but I am now getting this in the log:
Code:
[ 82.052153] mce: Uncorrected hardware memory error in user-access at 151137d40
[ 82.052157] mce: Uncorrected hardware memory error in user-access at 41ef01040
[ 82.052586] Memory failure: 0x151137: Sending SIGBUS to stress-ng:10254 due to hardware memory corruption
[ 82.059802] mce: [Hardware Error]: Machine check events logged
[ 82.067151] Memory failure: 0x151137: recovery action for dirty LRU page: Recovered
[ 82.090998] [Hardware Error]: Uncorrected, software restartable error.
[ 82.097795] [Hardware Error]: CPU:9 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[ 82.108047] Memory failure: 0x41ef01: Sending SIGBUS to stress-ng:10249 due to hardware memory corruption
[ 82.108780] [Hardware Error]: Error Addr: 0x000000041ef01040
[ 82.118511] Memory failure: 0x41ef01: recovery action for dirty LRU page: Recovered
[ 82.124371] [Hardware Error]: IPID: 0x000000b000000000
[ 82.124373] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[ 82.124374] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[ 82.124939] mce: [Hardware Error]: Machine check events logged
[ 82.160714] [Hardware Error]: Uncorrected, software restartable error.
[ 82.167478] [Hardware Error]: CPU:11 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[ 82.178510] [Hardware Error]: Error Addr: 0x0000000151137d40
[ 82.184445] [Hardware Error]: IPID: 0x000000b000000000
[ 82.189866] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[ 82.200315] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[ 82.207173] [Hardware Error]: Deferred error, no action required.
[ 82.213506] [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[ 82.225004] [Hardware Error]: Error Addr: 0x00000001ef780840
[ 82.230958] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xe0b1f5aa0b800000
[ 82.238979] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 82.247996] EDAC MC0: 1 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:0x7fde02 offset:0x140 grain:64)
[ 82.258033] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[ 83.547317] mce: Uncorrected hardware memory error in user-access at 40b12a980
[ 83.547326] [Hardware Error]: Uncorrected, software restartable error.
[ 83.561611] [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[ 83.572672] [Hardware Error]: Error Addr: 0x000000040b12a980
[ 83.578592] [Hardware Error]: IPID: 0x000000b000000000
[ 83.583972] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[ 83.594384] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[ 83.597563] mce: Uncorrected hardware memory error in user-access at 2f0ea8e80
[ 83.599476] Memory failure: 0x40b12a: Sending SIGBUS to stress-ng:10251 due to hardware memory corruption
[ 83.599481] Memory failure: 0x40b12a: recovery action for dirty LRU page: Recovered
[ 83.601388] [Hardware Error]: Deferred error, no action required.
[ 83.612297] Memory failure: 0x2f0ea8: Sending SIGBUS to stress-ng:25155 due to hardware memory corruption
[ 83.618318] [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[-|-|MiscV|AddrV|-
[ 83.626218] Memory failure: 0x2f0ea8: recovery action for dirty LRU page: Recovered
[ 83.657525] |SyndV|UECC|Deferred|-|-]: 0x9c2030000000011b
[ 83.663214] [Hardware Error]: Error Addr: 0x00000001e58954a0
[ 83.669143] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x8b99f5aa0b800000
[ 83.677180] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 83.686548] EDAC MC0: 1 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:0x7d6255 offset:0x1a0 grain:64)
[ 83.696546] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[ 83.703826] [Hardware Error]: Uncorrected, software restartable error.
[ 83.710647] [Hardware Error]: CPU:8 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[ 83.721471] [Hardware Error]: Error Addr: 0x00000002f0ea8e80
[ 83.727396] [Hardware Error]: IPID: 0x000000b000000000
[ 83.732811] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[ 83.743253] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[ 87.023582] IPv6: ADDRCONF(NETDEV_CHANGE): veth6aaff4b8: link becomes ready
[ 87.031830] kube-bridge: port 12(veth6aaff4b8) entered blocking state
[ 87.038601] kube-bridge: port 12(veth6aaff4b8) entered disabled state
[ 87.050522] device veth6aaff4b8 entered promiscuous mode
[ 87.058045] kube-bridge: port 12(veth6aaff4b8) entered blocking state
[ 87.066342] kube-bridge: port 12(veth6aaff4b8) entered forwarding state
[ 101.856294] kube-bridge: port 2(veth4c74b712) entered disabled state
[ 101.874914] device veth4c74b712 left promiscuous mode
[ 101.881788] kube-bridge: port 2(veth4c74b712) entered disabled state
[ 102.976424] kube-bridge: port 1(veth3cf7860f) entered disabled state
[ 103.007006] device veth3cf7860f left promiscuous mode
[ 103.013891] kube-bridge: port 1(veth3cf7860f) entered disabled state
[ 165.861289] kube-bridge: port 5(vethef0f9ce0) entered disabled state
[ 165.874955] device vethef0f9ce0 left promiscuous mode
[ 165.882089] kube-bridge: port 5(vethef0f9ce0) entered disabled state
[ 242.571209] mce: Uncorrected hardware memory error in user-access at 15f9c7e40
[ 242.571211] mce: Uncorrected hardware memory error in user-access at 1ea993940
[ 242.571221] mce_notify_irq: 4 callbacks suppressed
[ 242.571221] mce: [Hardware Error]: Machine check events logged
[ 242.598398] Memory failure: 0x1ea993: Sending SIGBUS to stress-ng:25967 due to hardware memory corruption
[ 242.598399] Memory failure: 0x15f9c7: Sending SIGBUS to stress-ng:10257 due to hardware memory corruption
[ 242.598404] Memory failure: 0x15f9c7: recovery action for dirty LRU page: Recovered
[ 242.604219] [Hardware Error]: Uncorrected, software restartable error.
[ 242.615694] Memory failure: 0x1ea993: recovery action for dirty LRU page: Recovered
[ 242.654703] [Hardware Error]: CPU:2 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[ 242.667325] [Hardware Error]: Error Addr: 0x00000001ea993940
[ 242.675049] [Hardware Error]: IPID: 0x000000b000000000
[ 242.682147] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[ 242.694297] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[ 242.702693] mce: [Hardware Error]: Machine check events logged
[ 242.710564] [Hardware Error]: Uncorrected, software restartable error.
[ 242.719131] [Hardware Error]: CPU:1 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[ 242.731776] [Hardware Error]: Error Addr: 0x000000015f9c7e40
[ 242.739497] [Hardware Error]: IPID: 0x000000b000000000
[ 242.746665] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[ 242.758860] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[ 242.767367] [Hardware Error]: Corrected error, no action required.
[ 242.775689] [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
[ 242.788463] [Hardware Error]: Error Addr: 0x0000000171463380
[ 242.796176] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xad6105060a800800
[ 242.806065] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 242.816551] EDAC MC0: 1 CE on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x60518c offset:0xc80 grain:64 syndrome:0x506)
[ 242.829732] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[ 242.838320] [Hardware Error]: Deferred error, no action required.
[ 242.846446] [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[ 242.859792] [Hardware Error]: Error Addr: 0x000000008fce3f40
[ 242.867520] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x5063f5aa0b800000
[ 242.877326] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 242.888975] EDAC MC0: 1 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:0x27f38f offset:0xd40 grain:64)
[ 242.900771] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[ 247.764236] mce: Uncorrected hardware memory error in user-access at 293075c40
[ 247.764248] [Hardware Error]: Uncorrected, software restartable error.
[ 247.781845] [Hardware Error]: CPU:2 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[ 247.794203] [Hardware Error]: Error Addr: 0x0000000293075c40
[ 247.801828] [Hardware Error]: IPID: 0x000000b000000000
[ 247.808792] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[ 247.820753] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[ 247.829108] Memory failure: 0x293075: Sending SIGBUS to stress-ng:32947 due to hardware memory corruption
[ 247.840508] Memory failure: 0x293075: recovery action for dirty LRU page: Recovered
[ 249.584335] mce: Uncorrected hardware memory error in user-access at 398a6e6c0
[ 249.584345] [Hardware Error]: Uncorrected, software restartable error.
[ 249.601945] [Hardware Error]: CPU:2 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[ 249.614617] [Hardware Error]: Error Addr: 0x0000000398a6e6c0
[ 249.622085] [Hardware Error]: IPID: 0x000000b000000000
[ 249.629118] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[ 249.641310] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[ 249.641584] Memory failure: 0x398a6e: Sending SIGBUS to stress-ng:32865 due to hardware memory corruption
[ 249.660976] Memory failure: 0x398a6e: recovery action for dirty LRU page: Recovered
[ 518.655854] mce: Uncorrected hardware memory error in user-access at 29f23f880
[ 518.655857] mce: Uncorrected hardware memory error in user-access at 322ca7640
[ 518.655859] mce: Uncorrected hardware memory error in user-access at 63014e600
[ 518.655867] mce_notify_irq: 4 callbacks suppressed
[ 518.655868] mce: [Hardware Error]: Machine check events logged
[ 518.697362] [Hardware Error]: Uncorrected, software restartable error.
[ 518.705720] [Hardware Error]: CPU:6 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[ 518.718320] [Hardware Error]: Error Addr: 0x000000063014e600
[ 518.725866] [Hardware Error]: IPID: 0x000000b000000000
[ 518.732812] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[ 518.744826] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[ 518.753227] mce: [Hardware Error]: Machine check events logged
[ 518.760910] [Hardware Error]: Uncorrected, software restartable error.
[ 518.769344] [Hardware Error]: CPU:2 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[ 518.781859] [Hardware Error]: Error Addr: 0x000000029f23f880
[ 518.789402] [Hardware Error]: IPID: 0x000000b000000000
[ 518.796363] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[ 518.808418] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[ 518.816867] [Hardware Error]: Uncorrected, software restartable error.
[ 518.825259] [Hardware Error]: CPU:5 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[ 518.837852] [Hardware Error]: Error Addr: 0x0000000322ca7640
[ 518.845415] [Hardware Error]: IPID: 0x000000b000000000
[ 518.852418] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[ 518.864457] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[ 518.872839] [Hardware Error]: Deferred error, no action required.
[ 518.880760] [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[ 518.893964] [Hardware Error]: Error Addr: 0x0000000171653b40
[ 518.901529] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xfdd9f5750b800000
[ 518.911160] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 518.921748] EDAC MC0: 1 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:0x60594e offset:0xd40 grain:64)
[ 518.933451] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[ 518.933811] Memory failure: 0x322ca7: Sending SIGBUS to stress-ng:10258 due to hardware memory corruption
[ 518.942081] Memory failure: 0x29f23f: Sending SIGBUS to stress-ng:10253 due to hardware memory corruption
[ 518.942227] Memory failure: 0x63014e: Sending SIGBUS to stress-ng:32972 due to hardware memory corruption
[ 518.942233] Memory failure: 0x63014e: recovery action for dirty LRU page: Recovered
[ 518.953644] Memory failure: 0x322ca7: recovery action for dirty LRU page: Recovered
[ 518.997232] Memory failure: 0x29f23f: recovery action for dirty LRU page: Recovered
[ 524.044278] mce: Uncorrected hardware memory error in user-access at 1e366bc00
[ 524.044289] [Hardware Error]: Uncorrected, software restartable error.
[ 524.061777] [Hardware Error]: CPU:3 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[ 524.074187] [Hardware Error]: Error Addr: 0x00000001e366bc00
[ 524.081692] [Hardware Error]: IPID: 0x000000b000000000
[ 524.088702] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[ 524.100659] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[ 524.109863] Memory failure: 0x1e366b: Sending SIGBUS to stress-ng:36521 due to hardware memory corruption
[ 524.121603] Memory failure: 0x1e366b: recovery action for dirty LRU page: Recovered
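For anyone who wants to cross-check the flag strings and EDAC lines in the log above, here is a minimal Python sketch. It decodes an MCx_STATUS value into flag names and tallies EDAC events per memory channel. The bit positions in `STATUS_BITS` are an assumption taken from the Linux kernel's x86 MCE definitions as I understand them (they do match the flags printed for the status values in this log), and the names `decode_status` and `tally_edac` are made up for illustration.

```python
import re
from collections import Counter

# Assumed bit positions for MCx_STATUS flags (check your kernel's
# arch/x86/include/asm/mce.h before relying on them).
STATUS_BITS = {
    "Over": 62, "UE": 61, "MiscV": 59, "AddrV": 58, "PCC": 57,
    "SyndV": 53, "CECC": 46, "UECC": 45, "Deferred": 44, "Poison": 43,
}

# Matches lines such as:
#   EDAC MC0: 1 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:...)
EDAC_RE = re.compile(
    r"EDAC MC(\d+): (\d+) (CE|UE) on \S+ \(csrow:\d+ channel:(\d+)"
)


def decode_status(value):
    """Return the names of the flags that are set in an MCx_STATUS value."""
    return [name for name, bit in STATUS_BITS.items() if (value >> bit) & 1]


def tally_edac(log_text):
    """Count EDAC events keyed by (controller, channel, kind)."""
    counts = Counter()
    for mc, n, kind, channel in EDAC_RE.findall(log_text):
        counts[(int(mc), int(channel), kind)] += int(n)
    return counts


# Two lines copied from the log above, used as sample input.
SAMPLE = (
    "[ 82.247996] EDAC MC0: 1 UE on mc#0csrow#0channel#1 "
    "(csrow:0 channel:1 page:0x7fde02 offset:0x140 grain:64)\n"
    "[ 242.816551] EDAC MC0: 1 CE on mc#0csrow#0channel#0 "
    "(csrow:0 channel:0 page:0x60518c offset:0xc80 grain:64 syndrome:0x506)\n"
)

if __name__ == "__main__":
    print(decode_status(0xBC002800000C0135))
    for key, n in sorted(tally_edac(SAMPLE).items()):
        print(key, n)
```

Fed the whole log instead of `SAMPLE`, the tally should come out as four UE events on channel 1 and one CE event on channel 0, all on MC0.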
I would blame the memory, but I just ran memtest for 2 hours at 2666 MHz without a single error, even though stress-ng can trigger a restart within a few seconds at that speed. No ECC issues occurred either.
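One way to see whether Linux itself has counted any ECC events is to read the EDAC counters from sysfs. The sketch below assumes the usual EDAC layout under `/sys/devices/system/edac/mc` with `ce_count` and `ue_count` files per memory controller, and `read_edac_counts` is a hypothetical helper name. These counters only cover errors seen by the running kernel since boot, so they say nothing about a standalone memtest run.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} from EDAC sysfs.

    Returns an empty dict if the EDAC driver is not loaded or the
    directory does not exist.
    """
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc_dir / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

An empty result most likely means the EDAC driver is not loaded, in which case corrected errors may go unreported even when ECC is active.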
What is going on?
The PSU is a quality Seasonic unit, and I haven't had any issues with it in the previous build.