Multiple failures since update

Edexiel

Cadet
Joined
Jan 11, 2023
Messages
4
Hi,

My system:
CPU: Intel Pentium G4560
Motherboard: P10S-M WS
RAM: 4x8 GB Crucial DDR4 2400 MHz ECC
HDD: 1x RAIDZ1, 5x4 TB
4x WDC_WD40EZRZ
1x ST4000VN008

Since the update, I have had random shutdowns/reboots, each followed by an alert from mdadm indicating a failure:


Code:
[   60.778788] md/raid1:md127: not clean -- starting background reconstruction
[   60.785911] md/raid1:md127: active with 2 out of 2 mirrors
[   60.791524] md127: detected capacity change from 0 to 4188160
[   60.797437] md: resync of RAID array md127
[   61.330877] md/raid1:md126: not clean -- starting background reconstruction
[   61.337927] md/raid1:md126: active with 2 out of 2 mirrors
[   61.343583] md126: detected capacity change from 0 to 4188160
[   61.349555] md: resync of RAID array md126
[   62.154979] Adding 2094076k swap on /dev/mapper/md127.  Priority:-3 extents:1 across:2094076k FS
[   63.350202] Adding 2094076k swap on /dev/mapper/md126.  Priority:-4 extents:1 across:2094076k FS
[   63.535512] md/raid1:md126: Disk failure on sdd1, disabling device.
               md/raid1:md126: Operation continuing on 1 devices.
[   63.535526] md: md126: resync interrupted.
[   63.659522] md: resync of RAID array md126
[   63.663915] md: md126: resync done.
[   63.839877] md126: detected capacity change from 4188160 to 0
[   63.845749] md: md126 stopped.
[   65.607439] md/raid1:md127: Disk failure on sdb1, disabling device.
               md/raid1:md127: Operation continuing on 1 devices.
[   65.607477] md: md127: resync interrupted.
[   65.707521] md: resync of RAID array md127
[   65.711910] md: md127: resync done.
[   65.861259] md127: detected capacity change from 4188160 to 0

Code:
This is an automatically generated mail message from mdadm
running on truenas

A Fail event had been detected on md device /dev/md127.

It could be related to component device /dev/sdb1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid1 sda1[1] sdb1[0](F)
2094080 blocks super 1.2 [2/1] [_U]
[======>..............] resync = 32.8% (688960/2094080) finish=0.1min speed=172240K/sec

unused devices: <none>

Code:
This is an automatically generated mail message from mdadm
running on truenas

A Fail event had been detected on md device /dev/md126.

It could be related to component device /dev/sdd1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md126 : active raid1 sdd1[1](F) sdc1[0]
2094080 blocks super 1.2 [2/1] [U_]
[===>.................] resync = 15.1% (317184/2094080) finish=0.1min speed=158592K/sec

md127 : active raid1 sda1[1] sdb1[0]
2094080 blocks super 1.2 [2/2] [UU]
[===>.................] resync = 17.7% (371968/2094080) finish=0.1min speed=185984K/sec

unused devices: <none>


SMART offline tests on the affected disks do not show any errors.


dmesg is flooded with ECC hardware error messages:

Code:
[10877.749344] {81}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[10877.757969] {81}[Hardware Error]: It has been corrected by h/w and requires no further action
[10877.766720] {81}[Hardware Error]: event severity: corrected
[10877.772429] {81}[Hardware Error]:  Error 0, type: corrected
[10877.778157] {81}[Hardware Error]:  fru_text: CorrectedErr
[10877.783705] {81}[Hardware Error]:   section_type: memory error
[10877.789694] {81}[Hardware Error]:   node: 1 device: 1

Is this an indication of a failing DIMM?

If you have any idea what is happening, I would greatly appreciate it!
Thank you!
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
memory errors mean you should run memtest86. memory errors can cause all kinds of problems.

on SCALE, mdadm is used for your swap partitions. swap is integrated with RAM.

also, raidz1 with >2TB HDDs is strongly discouraged.
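To see which swap mirror members mdadm has marked as failed, the /proc/mdstat text from the alert mails can be parsed. The following is a minimal Python sketch, assuming the line format shown in those mails; the function name `failed_members` and the sample text are illustrative, not part of TrueNAS or mdadm.

```python
import re

# Matches an array line such as "md127 : active raid1 sda1[1] sdb1[0](F)".
ARRAY_LINE = re.compile(r"^(md\d+)\s*:\s*(\S+)\s+(\S+)\s+(.*)$")
# Matches one member such as "sdb1[0](F)"; the "(F)" suffix marks a failed device.
MEMBER = re.compile(r"(\w+)\[\d+\](\(F\))?")


def failed_members(mdstat_text):
    """Return {array_name: [failed member devices]} for arrays with failures."""
    failures = {}
    for line in mdstat_text.splitlines():
        match = ARRAY_LINE.match(line.strip())
        if not match:
            continue  # not an active array line (header, block counts, progress bar)
        name, members = match.group(1), match.group(4)
        failed = [dev for dev, flag in MEMBER.findall(members) if flag]
        if failed:
            failures[name] = failed
    return failures


# Sample text copied from the second alert mail in the first post.
sample = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md126 : active raid1 sdd1[1](F) sdc1[0]
2094080 blocks super 1.2 [2/1] [U_]

md127 : active raid1 sda1[1] sdb1[0]
2094080 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

print(failed_members(sample))
```

On a live system the same function could be given the contents of /proc/mdstat instead of the sample string.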
 

Edexiel

Cadet
Joined
Jan 11, 2023
Messages
4
Thank you for your reply. I will run memtest86 as soon as I can.
Won't ECC make a memtest run hard to interpret since, as I read in dmesg, most errors are corrected?

Oh I see, I was a little bit confused and thought I had made a configuration error.

I'm a little bit budget constrained at the moment and I do not store essential data, but it is in my upgrade path.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
as I read in dmesg most errors are corrected
memtest86 is a bit messy. some versions properly support ECC. I believe that is the UEFI version, which is the 7+ version released by PassMark. this would be the best version, if your system can boot UEFI.
ECC, however, is most useful for correcting the random soft errors that occur. consistent errors indicate deeper problems. it is those deeper problems we are looking for, which even the regular memtest should have a chance of finding.
 

Edexiel

Cadet
Joined
Jan 11, 2023
Messages
4
Ran memtest86 for a while and it generated 200+ corrected ECC errors, so the test "passed" anyway.
It would be nice if TrueNAS could monitor ECC errors, like SMART, to prevent this kind of situation from happening.
I pulled the RAM, and I'll see whether any more reboots occur.
But it's weird, it happened directly after the Bluefin update.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It would be nice if TrueNAS could monitor ECC errors
It should. Aren't your logs full of ECC correctable error warnings?
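As a rough way to quantify those warnings, the hardware error events in saved dmesg output can be counted by severity. This is a minimal Python sketch, assuming the APEI line format shown in the first post; `count_hardware_errors` is a hypothetical helper, not a TrueNAS feature.

```python
import re
from collections import Counter

# Matches lines such as:
# [10877.766720] {81}[Hardware Error]: event severity: corrected
SEVERITY = re.compile(r"\[Hardware Error\]:\s*event severity:\s*(\w+)")


def count_hardware_errors(dmesg_text):
    """Count hardware error events in dmesg text, grouped by severity."""
    return Counter(SEVERITY.findall(dmesg_text))


# Sample lines copied from the dmesg excerpt in the first post.
sample = """\
[10877.749344] {81}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[10877.757969] {81}[Hardware Error]: It has been corrected by h/w and requires no further action
[10877.766720] {81}[Hardware Error]: event severity: corrected
[10877.772429] {81}[Hardware Error]:  Error 0, type: corrected
[10877.778157] {81}[Hardware Error]:  fru_text: CorrectedErr
[10877.783705] {81}[Hardware Error]:   section_type: memory error
[10877.789694] {81}[Hardware Error]:   node: 1 device: 1
"""

print(count_hardware_errors(sample))
```

Each event prints one "event severity" line, so the count per severity approximates the number of events; a steadily growing "corrected" count points at a consistent problem rather than a random soft error.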
 