Multiple failures since update

Edexiel · Jan 11, 2023

Hi,

My system:
CPU: Intel Pentium G4560
Motherboard: P10S-M WS
RAM: 4x 8 GB Crucial DDR4 2400 MHz ECC
HDD: 1x RAIDZ1, 5x 4 TB
4x WDC_WD40EZRZ
1x ST4000VN008

Since the update, I have had random shutdowns/reboots, each followed by an alert from mdadm indicating a failure:

Code:

[   60.778788] md/raid1:md127: not clean -- starting background reconstruction
[   60.785911] md/raid1:md127: active with 2 out of 2 mirrors
[   60.791524] md127: detected capacity change from 0 to 4188160
[   60.797437] md: resync of RAID array md127
[   61.330877] md/raid1:md126: not clean -- starting background reconstruction
[   61.337927] md/raid1:md126: active with 2 out of 2 mirrors
[   61.343583] md126: detected capacity change from 0 to 4188160
[   61.349555] md: resync of RAID array md126
[   62.154979] Adding 2094076k swap on /dev/mapper/md127.  Priority:-3 extents:1 across:2094076k FS
[   63.350202] Adding 2094076k swap on /dev/mapper/md126.  Priority:-4 extents:1 across:2094076k FS
[   63.535512] md/raid1:md126: Disk failure on sdd1, disabling device.
               md/raid1:md126: Operation continuing on 1 devices.
[   63.535526] md: md126: resync interrupted.
[   63.659522] md: resync of RAID array md126
[   63.663915] md: md126: resync done.
[   63.839877] md126: detected capacity change from 4188160 to 0
[   63.845749] md: md126 stopped.
[   65.607439] md/raid1:md127: Disk failure on sdb1, disabling device.
               md/raid1:md127: Operation continuing on 1 devices.
[   65.607477] md: md127: resync interrupted.
[   65.707521] md: resync of RAID array md127
[   65.711910] md: md127: resync done.
[   65.861259] md127: detected capacity change from 4188160 to 0

Code:

This is an automatically generated mail message from mdadm
running on truenas

A Fail event had been detected on md device /dev/md127.

It could be related to component device /dev/sdb1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid1 sda1[1] sdb1[0](F)
2094080 blocks super 1.2 [2/1] [_U]
[======>..............] resync = 32.8% (688960/2094080) finish=0.1min speed=172240K/sec

unused devices: <none>

Code:

This is an automatically generated mail message from mdadm
running on truenas

A Fail event had been detected on md device /dev/md126.

It could be related to component device /dev/sdd1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md126 : active raid1 sdd1[1](F) sdc1[0]
2094080 blocks super 1.2 [2/1] [U_]
[===>.................] resync = 15.1% (317184/2094080) finish=0.1min speed=158592K/sec

md127 : active raid1 sda1[1] sdb1[0]
2094080 blocks super 1.2 [2/2] [UU]
[===>.................] resync = 17.7% (371968/2094080) finish=0.1min speed=185984K/sec

unused devices: <none>

SMART offline tests on the affected disks do not show any errors.

dmesg is flooded with ECC hardware errors:

Code:

[10877.749344] {81}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[10877.757969] {81}[Hardware Error]: It has been corrected by h/w and requires no further action
[10877.766720] {81}[Hardware Error]: event severity: corrected
[10877.772429] {81}[Hardware Error]:  Error 0, type: corrected
[10877.778157] {81}[Hardware Error]:  fru_text: CorrectedErr
[10877.783705] {81}[Hardware Error]:   section_type: memory error
[10877.789694] {81}[Hardware Error]:   node: 1 device: 1

Is this an indication of a failing DIMM?

If you have any idea what is happening, I would greatly appreciate it!
Thank you!

artlessknave · Jan 11, 2023

memory errors mean you should run memtest86. memory errors can cause all kinds of problems.

on SCALE, mdadm is used for your swap partitions. swap is integrated with RAM.

also, raidz1 with >2TB HDDs is strongly discouraged.
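
For reference, the `(F)` flags in the /proc/mdstat output quoted in the first post mark the members that mdadm dropped from each swap mirror. Below is a minimal Python sketch that pulls those out of mdstat text. The function name `failed_members` and the regular expressions are illustrative assumptions, not anything TrueNAS ships, and they only cover the simple `mdN : active raid1 ...` line shape seen in this thread.

```python
import re

# Sample text copied from the second mdadm mail in the first post
# (resync progress lines omitted).
MDSTAT = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md126 : active raid1 sdd1[1](F) sdc1[0]
2094080 blocks super 1.2 [2/1] [U_]

md127 : active raid1 sda1[1] sdb1[0]
2094080 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

# Assumed line shape: "<array> : <state> <level> <member list>"
ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.+)$")
# A member looks like "sdd1[1]" and is optionally followed by "(F)" when failed.
MEMBER_RE = re.compile(r"(\w+)\[\d+\](\(F\))?")


def failed_members(mdstat_text):
    """Return {array_name: [failed member devices]} for every array found."""
    result = {}
    for line in mdstat_text.splitlines():
        match = ARRAY_RE.match(line.strip())
        if not match:
            continue  # skip Personalities, block counts, progress bars, etc.
        name, members = match.group(1), match.group(4)
        result[name] = [dev for dev, flag in MEMBER_RE.findall(members) if flag]
    return result


if __name__ == "__main__":
    print(failed_members(MDSTAT))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of the sample string. Other states or flags that mdstat can print are not handled here.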

Edexiel · Jan 12, 2023

Thank you for your reply, I will run memtest86 as soon as I can.
Won't ECC make testing with memtest difficult since, as I read in dmesg most errors are corrected?

Oh I see, I was a little bit confused and thought I made a configuration error.

I'm a little bit budget constrained at the moment and I do not store essential data, but it is in my upgrade path.

artlessknave · Jan 12, 2023

Edexiel said:
as I read in dmesg most errors are corrected

memtest86 is a bit messy. some versions properly support ECC. I believe that is the UEFI version, which is the 7+ version released by PassMark. this would be the best version, if your system can boot UEFI.
ECC, however, is most useful for correcting the random soft errors that occur. consistent errors indicate deeper problems. it is those deeper problems we are looking for, which even the regular memtest should have a chance of finding.

Edexiel · Jan 14, 2023

Ran memtest86 for a while and generated 200+ corrected ECC errors, so the test "passed" anyway.
It would be nice if TrueNAS could monitor ECC errors, as it does with SMART, to prevent this kind of situation from happening.
I pulled the RAM, and I'll see whether the reboots continue.
But it's weird: it started directly after the Bluefin update.

Ericloewe · Jan 14, 2023

Edexiel said:
It would be nice if TrueNAS could monitor ECC errors

It should. Aren't your logs full of ECC correctable error warnings?

Edexiel · Jan 14, 2023

My dmesg is full of errors, but there are no alerts in the TrueNAS GUI.
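
Until the GUI raises such alerts, one rough way to keep an eye on this is to count the corrected memory errors in dmesg yourself. The Python sketch below assumes the log lines look exactly like the APEI excerpt in the first post; the function name `count_corrected_memory_errors` and the grouping by the `{NN}` record number are illustrative choices, not a TrueNAS feature.

```python
import re
from collections import Counter

# Sample text copied from the dmesg excerpt in the first post.
SAMPLE = """\
[10877.749344] {81}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[10877.757969] {81}[Hardware Error]: It has been corrected by h/w and requires no further action
[10877.766720] {81}[Hardware Error]: event severity: corrected
[10877.772429] {81}[Hardware Error]:  Error 0, type: corrected
[10877.778157] {81}[Hardware Error]:  fru_text: CorrectedErr
[10877.783705] {81}[Hardware Error]:   section_type: memory error
[10877.789694] {81}[Hardware Error]:   node: 1 device: 1
"""

# Captures the record number in braces and the message after the tag.
LINE_RE = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s*(.*)$")
LOC_RE = re.compile(r"node: (\d+) device: (\d+)")


def count_corrected_memory_errors(dmesg_text):
    """Count corrected memory error records per (node, device) pair."""
    records = {}
    for line in dmesg_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # not a hardware error line
        seq, body = match.group(1), match.group(2)
        rec = records.setdefault(
            seq, {"corrected": False, "memory": False, "loc": None}
        )
        if body == "event severity: corrected":
            rec["corrected"] = True
        elif body == "section_type: memory error":
            rec["memory"] = True
        else:
            loc = LOC_RE.search(body)
            if loc:
                rec["loc"] = (int(loc.group(1)), int(loc.group(2)))

    counts = Counter()
    for rec in records.values():
        if rec["corrected"] and rec["memory"]:
            counts[rec["loc"]] += 1
    return dict(counts)


if __name__ == "__main__":
    print(count_corrected_memory_errors(SAMPLE))
```

Feeding it the full dmesg output gives a per-location tally, which makes it easier to see whether the errors concentrate on one DIMM. The sketch assumes record numbers do not repeat within the text it is given.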
