Hi,
My system:
CPU: Intel Pentium G4560
Motherboard: ASUS P10S-M WS
RAM: 4x 8GB Crucial DDR4-2400 ECC
HDD: 1x RAIDZ1 of 5x 4TB drives
4x WDC WD40EZRZ
1x ST4000VN008
Since the update, I have been getting random shutdowns/reboots, followed by an alert from mdadm indicating a failure:
Code:
[   60.778788] md/raid1:md127: not clean -- starting background reconstruction
[   60.785911] md/raid1:md127: active with 2 out of 2 mirrors
[   60.791524] md127: detected capacity change from 0 to 4188160
[   60.797437] md: resync of RAID array md127
[   61.330877] md/raid1:md126: not clean -- starting background reconstruction
[   61.337927] md/raid1:md126: active with 2 out of 2 mirrors
[   61.343583] md126: detected capacity change from 0 to 4188160
[   61.349555] md: resync of RAID array md126
[   62.154979] Adding 2094076k swap on /dev/mapper/md127.  Priority:-3 extents:1 across:2094076k FS
[   63.350202] Adding 2094076k swap on /dev/mapper/md126.  Priority:-4 extents:1 across:2094076k FS
[   63.535512] md/raid1:md126: Disk failure on sdd1, disabling device.
               md/raid1:md126: Operation continuing on 1 devices.
[   63.535526] md: md126: resync interrupted.
[   63.659522] md: resync of RAID array md126
[   63.663915] md: md126: resync done.
[   63.839877] md126: detected capacity change from 4188160 to 0
[   63.845749] md: md126 stopped.
[   65.607439] md/raid1:md127: Disk failure on sdb1, disabling device.
               md/raid1:md127: Operation continuing on 1 devices.
[   65.607477] md: md127: resync interrupted.
[   65.707521] md: resync of RAID array md127
[   65.711910] md: md127: resync done.
[   65.861259] md127: detected capacity change from 4188160 to 0
Code:
This is an automatically generated mail message from mdadm
running on truenas

A Fail event had been detected on md device /dev/md127.

It could be related to component device /dev/sdb1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid1 sda1[1] sdb1[0](F)
      2094080 blocks super 1.2 [2/1] [_U]
      [======>..............]  resync = 32.8% (688960/2094080) finish=0.1min speed=172240K/sec

unused devices: <none>
Code:
This is an automatically generated mail message from mdadm
running on truenas

A Fail event had been detected on md device /dev/md126.

It could be related to component device /dev/sdd1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md126 : active raid1 sdd1[1](F) sdc1[0]
      2094080 blocks super 1.2 [2/1] [U_]
      [===>.................]  resync = 15.1% (317184/2094080) finish=0.1min speed=158592K/sec

md127 : active raid1 sda1[1] sdb1[0]
      2094080 blocks super 1.2 [2/2] [UU]
      [===>.................]  resync = 17.7% (371968/2094080) finish=0.1min speed=185984K/sec

unused devices: <none>
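For context, the degraded mirrors can be inspected like this (a sketch; the device and array names are taken from the mdadm mails above, and the re-add is commented out since it should only be run once the disk is believed healthy):

```shell
# Show the detailed state of the degraded mirror (which member failed, when)
mdadm --detail /dev/md127
# Quick overview of all md arrays and their resync progress
cat /proc/mdstat
# If the underlying disk checks out, the kicked member can be re-added:
# mdadm /dev/md127 --re-add /dev/sdb1
```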
SMART offline tests on the affected disks do not show any errors.
dmesg is spammed with ECC hardware errors:
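For reference, the SMART checks were done along these lines (a sketch; /dev/sdb stands in for each flagged disk):

```shell
# Kick off the immediate offline data collection / test on the disk
smartctl -t offline /dev/sdb
# Afterwards, dump all SMART info: attributes, error log, self-test log
smartctl -a /dev/sdb
```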
Code:
[10877.749344] {81}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[10877.757969] {81}[Hardware Error]: It has been corrected by h/w and requires no further action
[10877.766720] {81}[Hardware Error]: event severity: corrected
[10877.772429] {81}[Hardware Error]:  Error 0, type: corrected
[10877.778157] {81}[Hardware Error]:  fru_text: CorrectedErr
[10877.783705] {81}[Hardware Error]:   section_type: memory error
[10877.789694] {81}[Hardware Error]:   node: 1 device: 1
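To see which DIMM the corrected errors map to, the kernel's EDAC counters can be read from sysfs (a sketch; the exact layout, dimm* vs csrow*, depends on the EDAC driver in use):

```shell
# Per-DIMM corrected-error counters, if the EDAC driver exposes them
grep -H . /sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count 2>/dev/null
# If rasdaemon is installed, it can summarize the same events per DIMM label
ras-mc-ctl --error-count 2>/dev/null
```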
Is this an indication of a failing DIMM?
If you have any idea of what is happening, it would be greatly appreciated!
Thank you!