All drives alerted "Device is causing slow I/O"

chiknbake

Cadet
Joined
Jan 19, 2020
Messages
8
Note: I'm not physically able to access this hardware right now due to COVID, so my ability to try different things is limited. I'm just trying to confirm my suspicion that this is the worst case: physical hardware degradation.

I've recently been trying to diagnose really poor dd and SMB performance on a remote server I have. dd zero writes have been atrocious at 25-30 MB/s, and Samba is the same. While diagnosing, I decided to try scrubbing my drives, and near the middle of the scrub I got a chain of emails within a 12-minute span alerting me that all 4 drives were causing slow I/O on the pool:
New alerts:
* Device /dev/gptid/90518e3a-f397-11e8-9303-0cc47ac2d7cc.eli is causing slow I/O on pool vol1.

Current alerts:
* Device /dev/gptid/926767f0-f397-11e8-9303-0cc47ac2d7cc.eli is causing slow I/O on pool vol1.
* Device /dev/gptid/8e3610d3-f397-11e8-9303-0cc47ac2d7cc.eli is causing slow I/O on pool vol1.
* Device /dev/gptid/90518e3a-f397-11e8-9303-0cc47ac2d7cc.eli is causing slow I/O on pool vol1.
I've seen this error in the past, usually on only one drive every 2 months or so, and I originally thought it was due to the drives I used being the recently discovered SMR drives. I did not suspect this error was causing the write performance issues, as performance over the past 2 years has been very good at 110+ MB/s over SMB. Now I'm very suspicious that this was the root cause all along, but I can't find many posts about this error and its typical root cause.

All my long SMART tests show healthy, and my pool has had no issues now or in the past. I fear this may be the backplane going out on my device, which would be the worst case since I can't do anything about that while I'm away from this server. I just wanted to get the help of the community to see if anyone has any other thoughts and things I can try remotely before I give up until I can travel back to fix the physical issue.
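
For context, the rough commands I've been running over SSH look like this (the file path and device names are just examples, and dataset compression can inflate zero-write numbers):

# Rough write-throughput test; the path is illustrative, ~10 GB of zeroes.
dd if=/dev/zero of=/mnt/vol1/ddtest.bin bs=1m count=10000

# Long SMART self-test results and overall health (device names are examples).
smartctl -l selftest /dev/ada0
smartctl -a /dev/ada0

# Pool health and scrub status.
zpool status -v vol1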


Setup:
U-NAS NSC-400 Enclosure
X10SDV-6C+-TLN4F-O (Xeon D-1528)
64GB Registered ECC RAM
4x Seagate Barracuda 8TB SMR drives (ST8000DM004) in RAIDZ2
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Unless you can figure out the commands to physically eject an SMR drive and install a CMR one instead - no, there's nothing that can be done remotely here.

RAIDZ2 is going to exacerbate the situation (they'd hobble along slightly faster as mirrors) but unless you have the ability to offload all of the data and rebuild the pool there's no way around that.
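
Purely to illustrate the layout (disk names are placeholders, and on FreeNAS you'd normally build the pool from the UI after offloading the data), the same four disks as striped mirrors would conceptually look like this:

# Conceptual only: two striped 2-way mirrors instead of one RAIDZ2 vdev.
# ada0-ada3 are placeholder device names; the existing pool and its data
# would have to be offloaded and destroyed first.
zpool create vol1 mirror ada0 ada1 mirror ada2 ada3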
 

chiknbake

Cadet
Joined
Jan 19, 2020
Messages
8
Unless you can figure out the commands to physically eject an SMR drive and install a CMR one instead - no, there's nothing that can be done remotely here.

RAIDZ2 is going to exacerbate the situation (they'd hobble along slightly faster as mirrors) but unless you have the ability to offload all of the data and rebuild the pool there's no way around that.
So "device is causing slow I/O" is indeed SMR drives root cause and not a faulty backplane? A bit confused by that since these drives were originally performing well and all of a sudden all drives issue an alert within a 12min time span instead of the typical alert every 2 months.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
So "device is causing slow I/O" is indeed SMR drives root cause and not a faulty backplane? A bit confused by that since these drives were originally performing well and all of a sudden all drives issue an alert within a 12min time span instead of the typical alert every 2 months.

SMR drives do just fine when writing to initially blank space; it's the read-modify-write cycle of overwriting sectors in place that causes them to hurl. I imagine for the first couple years of life, the drives were brand new and able to happily write into SMR zones that were considered "empty" so no shingling was necessary. Now the time has finally come for ZFS's copy-on-write to strike, and that first zone now needs to be updated with some fresh data.

You're running a scrub so it's pushing a lot more I/O to the disks, which will cause that much more thrashing between the drive's internal CMR "cache ring" and the SMR zones.

Backplane issues would normally manifest as CCB/CAM errors rather than slow I/O.
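
If you want to double-check that remotely, the kernel log is where those errors show up on FreeBSD-based systems; something along these lines (output will vary by system):

# CAM/CCB errors from a flaky backplane or cabling usually land in the kernel log.
dmesg | grep -iE 'cam status|ccb|timeout'
# The persisted system log is worth a look too.
grep -iE 'cam status|ccb' /var/log/messages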
 

chiknbake

Cadet
Joined
Jan 19, 2020
Messages
8
SMR drives do just fine when writing to initially blank space; it's the read-modify-write cycle of overwriting sectors in place that causes them to hurl. I imagine for the first couple years of life, the drives were brand new and able to happily write into SMR zones that were considered "empty" so no shingling was necessary. Now the time has finally come for ZFS's copy-on-write to strike, and that first zone now needs to be updated with some fresh data.

You're running a scrub so it's pushing a lot more I/O to the disks, which will cause that much more thrashing between the drive's internal CMR "cache ring" and the SMR zones.

Backplane issues would normally manifest as CCB/CAM errors rather than slow I/O.

Interesting. Would this SMR behavior also explain why pool performance in dd & SMB has tanked as well? One other question: is this detectable in iostat? I've screenshotted my results below (sorry, remote limitations on copy/paste right now):
[Attached: two screenshots of iostat output]
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Interesting. Would this SMR behavior also explain why pool performance in dd & SMB has tanked as well? One other question: is this detectable in iostat? I've screenshotted my results below (sorry, remote limitations on copy/paste right now):

Yes. The nature of how SMR disks work makes them a very poor match for RAID in general and especially bad for copy-on-write filesystems like ZFS.

Regarding your iostat pastes: yes, the disk_wait column is where you'll want to look, and you're seeing operations with multiple seconds of delay there.
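
If you want to keep an eye on it live from the shell, the same latency breakdown is available from zpool iostat with the -l flag (pool name per your system):

# Per-vdev latency every 5 seconds; the disk_wait columns show how long the
# drives themselves take to service reads and writes.
zpool iostat -v -l vol1 5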
 

chiknbake

Cadet
Joined
Jan 19, 2020
Messages
8
Yes. The nature of how SMR disks work makes them a very poor match for RAID in general and especially bad for copy-on-write filesystems like ZFS.

Regarding your iostat pastes: yes, the disk_wait column is where you'll want to look, and you're seeing operations with multiple seconds of delay there.
Really appreciate the help; I can stop pulling my hair out over the performance issues now.

All the initial news about SMR was so focused on rebuild hangs, and said so little about healthy pool operation, that I assumed these drives could hold out for just a bit longer with my manual backups. Guess EOL came much earlier than I would have liked.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
SMR drives do have a "lifespan" in that sense - once they run out of empty space to write to, performance nosedives if you ever exceed their internal caching capabilities.

They're great for non-RAID filesystems that can produce redundancy at a higher level (object storage) and may well be viable later in their HA/HM (Host Aware/Host Managed) flavors once the necessary thousands of hours have been invested into ZFS coding.

But for now, and for any of the DM (Drive Managed) versions, they're unfortunately best avoided. A single drive in a non-RAID external USB enclosure, holding sequential files (your movie collection, or big archive tarballs? sure!), is the only place where I'd use them.
 