Multiple disks having issues, possibly related to change in boot drive?

Xeonian

Dabbler
Joined
May 15, 2019
Messages
10
Hi all,

Firstly, my server specifications:

CPU: Intel Pentium G4560
Motherboard: Supermicro X11SSL-F
Memory: 2x 8GB Crucial DDR4 Server Memory, PC4-17000 (2133); 2x 8GB Kingston Server Premier DDR4, 2400MHz
Boot drive: 1x 512GB SanDisk Ultra II SSD
Data configuration: 1 pool consisting of two vdevs: vdev1 w/ 3x WD Red 4TB in RAIDZ1 configuration; vdev2 w/ 3x WD Red 6TB in RAIDZ1 configuration
PSU: 350W

About a month ago, I was having issues with my boot partition, which was hosted on USB keys. Following advice from this forum, I changed my boot drive to an SSD, which resolved the issue.

Following this, the first full scrub I performed on the data pool flagged a disk as FAULTED, with 12 write errors reported, and a second disk as DEGRADED with approx 50 write errors, both in vdev2. I was a little surprised, as the disks in vdev1 are several years older than those in vdev2, so I would have anticipated those to fail first. Nevertheless, I took this result at its word and ordered a replacement WD Red 6TB to replace the faulted drive, with an expectation to replace the degraded drive in the near future.

Once the replacement disk arrived, I offlined the faulted disk, powered down the system, swapped the drive, then powered back up and used the UI to replace the removed disk with the new one, at which point resilvering began. This process took approx 36 hours and, on completion, the new drive immediately went into FAULTED state, with the TrueNAS UI indicating that the replacement was not complete. As an aside, `zpool status -v` implied that the resilver had completed without error. The new disk was reporting 12 write errors, so, perhaps naively, I presumed that, since it was unlikely a brand new drive would fail immediately, it had somehow inherited the errors from the disk being replaced. I therefore applied `zpool clear` against the new drive, which triggered another resilver operation, again taking approx 36 hours.
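For reference, the rough CLI equivalents of what I did (the UI handles partitioning and swap on top of this, and the pool/gptid names below are placeholders rather than my actual ones):

```
# Take the faulted disk out of service before physically swapping it
zpool offline tank gptid/OLD-DISK-GUID

# ...power down, swap the drive, power back up...

# Resilver onto the new disk (roughly what the UI "Replace" action does)
zpool replace tank gptid/OLD-DISK-GUID gptid/NEW-DISK-GUID

# Watch resilver progress and per-device error counters
zpool status -v tank

# Clearing the error counters is what kicked off the second resilver
zpool clear tank
```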

This second resilver completed this morning, with the same result - the new disk faulted, with 12 write errors reported, but no errors reported in the resilvering process by `zpool status -v`. Additionally, the degraded disk had now reported more write errors, and a warning was raised that a file had become corrupted and unrecoverable (the specific file was part of a snapshot, so not too concerning).

I powered down the system and replaced the SATA cable connecting the new drive to the motherboard, in case the cable was somehow responsible for the fault. However, when I attempted to power the server up again afterwards, the boot sequence failed in GRUB with the error "alloc magic is broken". Some research indicated that this could be caused by GRUB having trouble reading partitions on the data drives, rather than necessarily being an issue with the boot drive itself. After some experimentation, I found that the system would boot if the two unreplaced data disks of vdev2 were unplugged - though, predictably, this then rendered the data pool unavailable.

At this stage, I'm unsure what to do. While it's not impossible that four disks, one of which was brand new, should fail simultaneously, it seems to be a terrible stroke of luck if there is not some underlying cause. It is also surprising that none of this has apparently impacted the disks in vdev1, which as noted are three years or so older than those of vdev2.

With regard to the previous boot-drive fault mentioned above: when installing the SSD as a boot drive, I had no spare SATA ports on the motherboard and no spare SATA power connectors from the PSU. I therefore bought a PCIe-to-SATA adapter card to connect the SSD, as well as a 4-pin Molex to SATA power adapter to power it. As the disks in vdev2 did not begin to fail until this SSD was installed, I do not want to rule out that the two may be related. The two physical disks that now require disconnection for the system to boot are themselves powered by 4-pin Molex to SATA adapters. Is it possible that the installation of the SSD has overloaded the 350W power supply, and these errors are the manifestation of the disks receiving insufficient power?
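For what it's worth, a rough back-of-envelope using typical datasheet figures rather than my exact models (so treat these numbers as assumptions): six 3.5" WD Reds at roughly 6-9 W each during read/write is ~40-55 W, the SSD is only a few watts, and the G4560 is a 54 W TDP part, so even allowing ~30-40 W for the board, fans and PSU inefficiency the steady-state draw should sit comfortably under 350 W. The bigger unknowns to me are the spin-up surge (typically 1.5-2 A on the 12 V rail per drive) and whether the Molex-to-SATA adapters themselves are delivering clean power.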

At this point, the server is powered down, as I don't want to risk causing any further damage to the data in case there is any hope of bringing the pool back online and migrating away from any faulty disks through standard resilvers, rather than having to resort to full disaster recovery.

Any suggestions or insights would be appreciated.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I don't think your problems are related to the boot pool SSD nor the SATA controller for it.

I think it's much more likely that your WD Reds are not the right type and you have SMR disks, which would certainly be a potential cause for all disks seeing issues at the same time during a resilver, large copy or scrub operation.

What model exactly are your WD Reds?
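If it's easier than pulling the drive labels, the model string can be read from the shell - something like this (the device name is just an example; depending on your setup the disks will show up as ada*/da* or sd*):

```
# EFRX in the model number = CMR, EFAX = SMR
smartctl -i /dev/ada1 | grep -i 'device model'
```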

 

Xeonian

Dabbler
Joined
May 15, 2019
Messages
10
Just checked. The drives in vdev1 are EFRX (non-SMR), while both the original and new vdev2 drives are EFAX (SMR). Guess that explains the odd pattern of failures.

While it's clearly in my best interest to replace the vdev2 drives with non-SMR models, is there anything I could try to bring the pool back online in the interim, or am I hosed? Again, would prefer to migrate the pool by parts via resilvering if possible, rather than having to recreate the pool from scratch with a new array of drives.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
You can try exporting it and importing as read-only to get data off it, you're pretty safe if you're just reading (and it can import).
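Something along these lines from the shell, assuming a pool named `tank` (substitute your actual pool name; the export can also be done from the UI via Export/Disconnect, without ticking the destroy-data option):

```
# Cleanly export the pool
zpool export tank

# Re-import it read-only so nothing can be written while you copy data off
zpool import -o readonly=on -R /mnt tank
```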
 

Xeonian

Dabbler
Joined
May 15, 2019
Messages
10
Presumably while in a read-only state it won't perform any resilvering, so I assume that once I've got my copy I'd destroy the existing pool, swap out the problematic drives, create a new pool, then copy the data back?

I still haven't quite figured out the boot sequence problems, however. Could it be related to the OS attempting to automatically mount the drive with its current settings? Could it be bypassed by booting with the problematic disks disconnected, then reconnecting the disks after boot and then importing from there? Would the OS be happy with disks connected at runtime?

I also still have the original drive that I was attempting to replace. Would it be of any value to swap that back in temporarily while producing my copy? There have been writes since I attempted to start the replacement, but would the system still recognise it as a partial source of redundancy, or would it just be considered an unrelated drive now that the replacement process was started?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Presumably while in a read-only state it won't perform any resilvering, so I assume that once I've got my copy I'd destroy the existing pool, swap out the problematic drives, create a new pool, then copy the data back?
With no writes going on, resilvering shouldn't be triggered... I'm not sure if there will still be some "owed" to some disks when you come back online, but if that's the case, it may still do some.

That plan of action certainly sounds like the right one... get a copy, then rework things to the right shape and put it back.
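If the copy destination is another ZFS pool (a spare disk or a second box), replication preserves snapshots and properties; for anything else a plain file copy is fine. Just as a sketch, with made-up dataset/pool names - and note that with the pool imported read-only you can't take new snapshots, so you'd send the most recent snapshot you already have:

```
# ZFS-to-ZFS: send an existing snapshot tree to a backup pool
zfs send -R tank/dataset@auto-latest | zfs receive -F backup/dataset

# Or a straight file-level copy to any destination
rsync -avh /mnt/tank/dataset/ /mnt/backup/dataset/
```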

Could it be related to the OS attempting to automatically mount the drive with its current settings? Could it be bypassed by booting with the problematic disks disconnected, then reconnecting the disks after boot and then importing from there? Would the OS be happy with disks connected at runtime?
I'm not sure what exactly is going on there, but having a pool set to import on boot that's not healthy won't be helping things... exporting it before a reboot would probably help with that.

I also still have the original drive that I was attempting to replace. Would it be of any value to swap that back in temporarily while producing my copy?
Once resilvering has started with that drive offline, it's not great to have it come back online (although possibly not a total disaster). I would leave it out.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
In theory, replacing an SMR drive with a CMR drive, in a pool with other SMR drives, should not be too bad. It will likely still take longer than in a pool full of CMR disks, (due to the extra SMR head seeks on the source disks). So this is an option.

As more and more SMR drives are replaced with CMR drives, the disk replacements will become faster and more normal.
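The in-place route is just the normal one-at-a-time replace cycle, repeated for each SMR disk once the pool is healthy again (names below are placeholders):

```
# Replace one SMR member with the new CMR disk and let it resilver
zpool replace tank gptid/OLD-SMR-DISK gptid/NEW-CMR-DISK

# Wait for "resilver completed" in the status output before doing the next one
zpool status -v tank
```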

Of course, if you can make a backup first, that would be very helpful for your plan to destroy and re-work the pool configuration with all CMR drives.


I think Western Digital should replace all your SMR drives for free, in warranty or not. This excrement storm they CREATED has bitten far too many people. The other vendors are not blameless, but they never put SMR drives in NAS-targeted disks, (as far as I know). Nor did they have a firmware bug for IDNF, (sector not found??? Really, can't find a sector on a brand new disk???)
 

Xeonian

Dabbler
Joined
May 15, 2019
Messages
10
Thanks for all your help so far. I've managed to get the system back online with all disks attached - I think my boot priority was messed up and it was attempting to boot from one of the data drives - and I'm making sure my backups are up to date before moving on. (I do have cloud sync tasks targeting S3, but the schedule was a bit too infrequent and had been missing some datasets; correcting that now!) If I encounter anything else unexpected I'll update the thread, but hopefully it all goes well and I can resolve it.

I genuinely hadn't encountered the difference between SMR and CMR drives prior to today. I did most of my research back in late 2017/early 2018 when I first put the server together, and I'm not sure whether it just wasn't a problem then or it simply never came to my attention. I then just defaulted to buying more WD Reds without paying much mind to the model number when it came time to upgrade. Guess this is a learning experience for me.

When you say WD should replace the drives, is there precedent for that, or is it just something that would be so in a just world? If there's a chance they would, then I'll get in contact. The returns policy of the retailer who sold me the recent replacement drive is not to accept drives once they've been installed or written to, so unless I get into an argument with their customer support I'm already out £125 for a drive I can't use, let alone the outlay for eventually replacing the entire vdev.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Sorry - the old WD Red SMRs had a 3-year warranty, so that has probably long since expired. You can still try... Check any newer WD Red SMRs, and if they are under the 3-year warranty, try to get WD Red Plus drives in their place directly from WD. Mention ZFS & TrueNAS; they are almost certainly going to recognize those.

Note: WD Red SMR drives have an initial firmware bug that, once the drive has been used enough, generally should not re-occur. But timeouts and other silliness caused by WD Red SMR drives can occur at any point during the life of the drive. The more SMR drives in a ZFS pool, the more chances there are for problems to show up.
 