Make kernel constants such as BLK_MIN_SG_TIMEOUT configurable for slow disks?

Hoozemans · Aug 9, 2023

Good morning!

I have a bunch of SMR disks lying around, and I figured I'd try and do something with them. Since SMR is becoming more prevalent, we'll run into enterprise solutions employing the technology for storage sooner or later, so NAS software should be figuring out ways to deal with it as well.

Predictably, I keep running into errors like this:

Code:

Aug  9 17:00:10 nas kernel: sd 8:0:0:0: attempting task abort!scmd(0x00000000c4e14ac5), outstanding for 1420 ms & timeout 1000 ms
Aug  9 17:00:10 nas kernel: sd 8:0:0:0: tag#35 CDB: ATA command pass through(16) 85 08 0e 00 d5 00 01 00 e0 00 4f 00 c2 00 b0 00
Aug  9 17:00:10 nas kernel: scsi target8:0:0: handle(0x000a), sas_address(0x5001e677b7d18fe8), phy(8)
Aug  9 17:00:10 nas kernel: scsi target8:0:0: enclosure logical id(0x5001e677b7d18fff), slot(8)
Aug  9 17:00:10 nas kernel: sd 8:0:0:0: device_block, handle(0x000a)
Aug  9 17:00:10 nas kernel: sd 8:0:0:0: task abort: SUCCESS scmd(0x00000000c4e14ac5)
Aug  9 17:00:10 nas kernel: sd 8:0:0:0: device_unblock and setting to running, handle(0x000a)
Aug  9 17:00:11 nas kernel: sd 8:0:0:0: Power-on or device reset occurred

Surprisingly, there's nothing much that the base configuration of TrueNAS Scale allows me to do about this. The timeout in question appears to be hardcoded in a kernel driver somewhere. I suspect it's related to

Code:

/usr/src/linux-headers-5.10.142+truenas/include/linux/blkdev.h:BLK_MIN_SG_TIMEOUT

Now I'm no wizkid, and recompiling the kernel to see if increasing this value might make SMR disks useable is something I'd rather not try - also because I'm not sure what'll happen next time I update the TrueNAS Scale box I'm performing this experiment on.

So I have a couple of questions:

Is there way to change this timeout other than recompile the kernel?
Might future releases have such a feature? Is that workable, should I post a feature request (https://ixsystems.atlassian.net/jira/software/c/projects/NAS/issues)?
What else could I try to make SMR disks play well with ZFS?

Thanks!

sretalla · Aug 9, 2023

You're not going to get a lot of love with this topic...

Those timeouts are set with performance in mind and messing with them will have flow-on effects even if they do help to not get SMR disks kicked out of your pools.

I recommend using the SMR disks with a different filesystem connected to a different OS.

Hoozemans · Aug 9, 2023

sretalla said:
You're not going to get a lot of love with this topic...

It's okay; I have girlfriend: I do her laundry, she laughs at my jokes.

sretalla said:
Those timeouts are set with performance in mind and messing with them will have flow-on effects even if they do help to not get SMR disks kicked out of your pools.

I should think that different hardware requires different parameters for maximum performance. It can't be true that a single set of numbers is the best possible solution for all conceivable combinations of hardware and firmware out there.

sretalla said:
I recommend using the SMR disks with a different filesystem connected to a different OS.

I assure you, I wasn't going to recompile your box, or anyone else's -- just my own ;)

Hoozemans · Aug 9, 2023

sretalla said:
Those timeouts are set with performance in mind and messing with them will have flow-on effects even if they do help to not get SMR disks kicked out of your pools.

Also (sorry, it appears editing is not an option here), tweaking the TrueNAS install to keep disks from being kicked is easy enough -- but the constant resets are killing performance. I'd think that you could get better performance by increasing the timeout to prevent disks from being reset.

Arwen · Aug 9, 2023

Perhaps changing the timeouts would help for SMR disks. Ideally a user would be able to set it on a disk by disk basis so the new, longer timeout can be applied only to SMR disks.

However, that is not the only problem with SMR disks & TrueNAS. The TrueNAS software only uses ZFS for data pools. This means that some of the ZFS methodology can impact SMR disks in addition to the time outs. Like Scrubs or Re-silvers, both of which can take many times longer than CMR disks. SMR disks are basically not suited for RAID configurations, and really not suited for ZFS.

Even non-SMR disks can have problems with ZFS. For example;

Desktop disks with long, (aka >1 minute), time limit for error recovery, (TLER)
Aggressive head parking

Both those can cause problems. First causing disks to be considered failed, when they may only have 1 bad block. The second can cause time outs, and excessive wear on the head parking mechanism. In essence, desktop disks are not generally suitable for NAS or ZFS.

Hoozemans · Aug 12, 2023

Arwen said:
SMR disks are basically not suited for RAID configurations, and really not suited for ZFS.

True. However, I would suggest that the demand for high density storage is on the rise, and rising faster than the price for SSD GiB/€ is dropping. I am predicting there will be more attempts to make NASses (using RAID/ZFS) work with SMR disks, and I'm curious how that will pan out, and what modifications will be made by both software engineers coding storage solutions and disk manufacturers.

Thanks!

Important Announcement for the TrueNAS Community.

Make kernel constants such as BLK_MIN_SG_TIMEOUT configurable for slow disks?

Hoozemans

Cadet

sretalla

Powered by Neutrality

Hoozemans

Cadet

Hoozemans

Cadet

Arwen

MVP

Hoozemans

Cadet

Similar threads

Important Announcement for the TrueNAS Community.

Make kernel constants such as BLK_MIN_SG_TIMEOUT configurable for slow disks?

Hoozemans

Cadet

sretalla

Powered by Neutrality

Hoozemans

Cadet

Hoozemans

Cadet

Arwen

MVP

Hoozemans

Cadet

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Make kernel constants such as BLK_MIN_SG_TIMEOUT configurable for slow disks?"

Similar threads