Hi all,
Firstly, my server specifications:
CPU: Intel Pentium G4560
Motherboard: Supermicro X11SSL-F
Memory: 2x 8GB Crucial DDR4 server memory, PC4-17000 (2133 MHz); 2x 8GB Kingston Server Premier DDR4, 2400 MHz
Boot drives: 1x 512GB Sandisk Ultra II SSD
Data configuration: 1 pool consisting of two vdevs: vdev1 w/ 3x WD Red 4TB in RAIDZ1 configuration; vdev2 w/ 3x WD Red 6TB in RAIDZ1 configuration
PSU: 350W
About a month ago, I was having issues with my boot partition, which was hosted on USB keys. Following advice from this forum, I changed my boot drive to an SSD, which resolved the issue.
Following this, the first full scrub I performed on the data pool flagged one disk as FAULTED, with 12 write errors reported, and a second disk as DEGRADED, with approximately 50 write errors - both in vdev2. I was a little surprised, as the disks in vdev1 are several years older than those in vdev2, so I would have expected those to fail first. Nevertheless, I took the result at face value and ordered a replacement WD Red 6TB for the faulted drive, intending to replace the degraded drive in the near future.
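(For anyone who prefers the CLI view, the scrub and its results amount to roughly the following - `tank` is a placeholder for my actual pool name.)

```
# Start a scrub of the pool, then check progress and per-device error counts.
# "tank" is a placeholder pool name.
zpool scrub tank
zpool status -v tank
```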
Once the replacement disk arrived, I offlined the faulted disk, powered down the system, swapped the drive, powered back up, and used the UI to replace the removed disk with the new one, at which point resilvering began. This took approximately 36 hours, and on completion the new drive immediately went into a FAULTED state, with the TrueNAS UI indicating that the replacement had not completed. As an aside, `zpool status -v` suggested that the resilver itself had completed without error. The new disk was reporting 12 write errors. Since it seemed unlikely that a brand-new drive would fail immediately, I presumed, perhaps naively, that it had somehow inherited the error count from the disk it replaced, so I ran `zpool clear` against the new drive. This triggered another resilver, again taking approximately 36 hours.
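(I performed all of these steps through the TrueNAS UI, but in CLI terms they are roughly equivalent to the sketch below - `tank`, `da4`, and `da6` are placeholder pool/device names, not my actual ones.)

```
# Take the faulted disk offline before physically swapping it out.
zpool offline tank da4

# After installing the new drive, replace the offlined disk with it;
# this is what kicks off the resilver.
zpool replace tank da4 da6

# Monitor resilver progress and per-device read/write/checksum error counts.
zpool status -v tank

# Clear the error counters against the new device - in my case this is
# what triggered the second resilver.
zpool clear tank da6
```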
This second resilver completed this morning with the same result: the new disk faulted with 12 write errors reported, yet no errors were reported for the resilver itself by `zpool status -v`. Additionally, the degraded disk had accumulated further write errors, and a warning was raised that a file had become corrupted and unrecoverable (the file in question belongs to a snapshot, so it is not too concerning).
I powered down the system and replaced the SATA cable connecting the new drive to the motherboard, in case the cable was somehow responsible for the fault. However, when I powered the server back up, the boot sequence failed in GRUB with the error "alloc magic is broken". Some research suggested this can be caused by GRUB tripping over problematic partitions on the data drives, rather than necessarily by an issue with the boot drive itself. After some experimentation, I found that the system would boot if the two unreplaced vdev2 data disks were unplugged - though, predictably, this rendered the data pool unavailable.
At this stage, I'm unsure what to do. While it's not impossible for four disks, one of them brand new, to fail at roughly the same time, it would be a remarkable stroke of bad luck if there were no underlying cause. It is also surprising that none of this has apparently affected the disks in vdev1, which, as noted, are around three years older than those in vdev2.
With regard to the earlier boot issue mentioned above: when installing the SSD as a boot drive, I had neither a spare SATA port on the motherboard nor a spare SATA power connector from the PSU. I therefore bought a PCIe-to-SATA adapter to connect the SSD, along with a 4-pin Molex-to-SATA adapter to power it. As the disks in vdev2 did not begin to fail until this SSD was installed, I don't want to rule out a connection. The two disks that now have to be disconnected for the system to boot are themselves powered via 4-pin Molex-to-SATA adapters. Is it possible that adding the SSD has overloaded the 350W power supply, and that these errors are the manifestation of the disks receiving insufficient or unstable power?
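As a very rough back-of-the-envelope estimate (assumed typical figures, not measurements): if each 3.5" drive draws somewhere around 20-25 W during spin-up and only a few watts at idle, then seven spinning disks plus the SSD, the CPU (roughly 54 W TDP), board, and RAM could momentarily demand somewhere in the region of 150-200 W at power-on, mostly on the 12V rail. On paper that should be within a healthy 350W unit's capability, but I appreciate I may be misjudging the figures, and the Molex-to-SATA adapters add another potential weak point in the power path.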
For now, the server is powered down, as I don't want to risk any further damage to data integrity while there is still hope of bringing the pool back online and migrating away from any faulty disks via standard resilvers, rather than having to resort to full disaster recovery.
Any suggestions or insights would be appreciated.