NVMe ashift and performance

Jadan1213

Dabbler
Joined
Apr 18, 2018
Messages
17
I've been searching for a while for some definitive information, but either no one has really addressed it, or my google-fu isn't very good.

I have two NVME drives. They are NOT mirrored, they are separate pools.

Drive A is 512 GB with a 512-byte logical sector size and a 16 KB flash page size. Currently ashift=12.
Drive B is 2 TB with a 4096-byte (4K) logical sector size (set with nvme format) and a 16 KB flash page size. Currently ashift=12.

Both drives have a page size of 16K. Does it make sense, or is there any benefit, to match the ashift to the page size? In this case, ashift=14 would match the page size. I have zvols for VMs on drive A and may do an iSCSI zvol on drive B. Is it best to match the 16K size there too (e.g., ashift=14, zvol block size of 16K, filesystem cluster size of 16K)?
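For anyone wanting to experiment with this, a rough sketch of how one might set it up from a shell (the device path, pool name, and zvol name here are made up for illustration; zpool create is destructive and needs root, so this is a sketch, not a recipe):

```shell
# Show the LBA formats the drive reports (nvme-cli):
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"

# Create a test pool, forcing 16K allocation units:
zpool create -o ashift=14 nvtest /dev/nvme0n1

# Confirm the ashift the vdev actually got:
zdb -C nvtest | grep ashift

# A 16K-block zvol on that pool for VM or iSCSI use:
zfs create -V 100G -o volblocksize=16K nvtest/vm1
```

From there an A/B comparison against an ashift=12 pool with fio or similar would show whether it matters on a given drive.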
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
The general rule of thumb is that an ashift that is too big will have less of a performance impact than one that is too small.
I have seen much ado about this topic over the years, and generally people don't talk much about ashift anymore and leave it at the default 12.
That being said, some old wisdom was to use ashift=13 with SSDs. In your case, the logic to try ashift=14 makes sense.

I don't have any resources or information to point you to beyond that, and I would be interested to see your test results.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
Also, you might find that the boot-pool is irritatingly unbootable under GRUB with ashift > 13.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
No, just use the default, which is 12. It's the default for a reason.
ZFS is 512(e)/4K aware and will write to the disks appropriately in mixed pools, AFAIK. Basically, it writes to 4K-native disks in 4K blocks and to 512e disks in 4K blocks as well (since a 512e disk has to read the 4K physical sector, change 512 bytes, and write back 4K anyway, a single aligned 4K write is better).
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
Defaults change. It used to be 9; 12 is a welcome upgrade but was slow in coming.
GRUB will support ashift=13, but not larger.
If the disk device has a natural block size of n, then I prefer to use n. If I have an SSD with a native page/block size of 8K or 16K, I want to use that by preference. I will put up with 4K for a boot pool to allow GRUB to boot.
Mixed pools with vdevs of different ashift will get written to differently, but if the vdev is ashift=12, then the I/O will be in ashift=12 units irrespective of the disk's reported block size.
There are plenty of disks that lie about block size too, so ZFS can't write to disks appropriately when they claim 512. It also stores up problems for later: if you have an ashift=9 vdev and want to replace a disk with a 4K-native one, you're stuck. This is a case where the updated default of 12 is a very good thing, but it doesn't help with older pools.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Defaults change "when needed"; if the default doesn't need to change, why change it?
12 isn't just "a welcome upgrade": 9 literally doesn't work with 4K drives. You get something like month-long write times, or something similarly silly, from simply adding a single 4K drive to such a pool.
To my knowledge, mixed pools of disk block types will work just fine as long as ashift is 12; it's specifically ashift=9 that's a problem.

This is, admittedly, not a field I know really well. However, I know someone who very likely does: iXsystems, who have most definitely looked at this, as they have to be planning ahead for such drives (probably have been since 4K was introduced), and they have made no changes to the default.

Trying to set this per drive will suck, and every time manufacturers increase the block size you would be rebuilding the pool; ashift sizes would be a constantly moving goalpost nobody can catch.

Additionally, SSDs are orders of magnitude faster than HDDs, and NVMe SSDs orders of magnitude faster again, so I'm pretty sure the whole thing is utterly irrelevant. The 4K problems date from basically before SSDs were a thing: a read-modify-write on an HDD potentially means seconds of delay; on even a cheap SSD I doubt you could tell.
 

Jadan1213

Dabbler
Joined
Apr 18, 2018
Messages
17
Since an SSD's page size is the minimum write size, in this case 16K, wouldn't it make sense for the ashift to be 14 so that writes happen in 16K chunks, especially given write amplification?

I'm sure at some point available namespace block sizes will increase from 4K to 8K or 16K, just as current drives can be formatted to either 512B or 4K.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
You are forgetting about write amplification, which is a problem on SSDs as well as HDDs, and SSDs often have a larger-than-4K page size, often even more poorly documented than HDD block size. You don't want to be burning through flash write cycles just because you have a poorly tuned ashift.

I've been using ashift=12 on all vdevs I set up for years, without waiting for iX to become converted. I've also been using 1 MB partition alignment for years before iX decided 20K was a bad idea. 20K was a very bad idea for 8K-page-size SSDs, like common Samsungs: permanent write amplification.
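As a back-of-the-envelope illustration of the write-amplification worry: in the simplest worst-case model (no coalescing by the controller, no SLC front-end cache — real drives do better than this), any write smaller than the flash page still costs one whole page program, so the amplification factor is page_size / io_size:

```shell
# Toy worst-case model: amplification = page size / write size.
# Real controllers coalesce writes, so treat this as an upper bound.
page=16384
for io in 512 4096 8192 16384; do
    echo "${io}-byte write on a 16K page: $(( page / io ))x amplification"
done
# 512 -> 32x, 4096 -> 4x, 8192 -> 2x, 16384 -> 1x
```

By this (pessimistic) model, 4K writes onto 16K pages cost at most 4x the flash wear of 16K-aligned writes.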
 

Jadan1213

Dabbler
Joined
Apr 18, 2018
Messages
17
This is precisely why I'm looking into this avenue. With NAND having limited P/E cycles, it makes sense to me that write amplification should be as close to nonexistent as possible. Given that the 16K page size is the smallest write the drives can do, it just makes sense in my head to match it with ashift, block size, cluster size, etc.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
Since an SSD's page size is the minimum write size, in this case 16K, wouldn't it make sense for the ashift to be 14 so that the writes are in 16K chunks? I'm sure at some point, available namespace block sizes will increase from 4k to 8k or 16k just as current drives can do either 512b or 4k.
Exactly what I try to set up if I can get the data. However, GRUB won't boot a pool with ashift >= 14; I have tried and failed with ashift=14, and succeeded with ashift=13. So for the boot-pool you are limited to ashift=12 or 13. For non-boot pools you could use whatever ashift seems reasonable, although convincing the web UI when adding a vdev might be interesting; that is an experiment I haven't run. It is certainly possible from the command line.

If the SSD has a front-end SLC cache, that might ameliorate the issue somewhat, not sure, but it will always be a trade-off vs. small file sizes, although some small blocks can be packed directly into the pool metadata IIRC (not sure if this requires dnodesize=auto).
 

Jadan1213

Dabbler
Joined
Apr 18, 2018
Messages
17
My TrueNAS boot pool is on a pair of mirrored 128 GB SSDs with ashift=12, and those drives are dirt cheap, so I don't need to worry about the boot issue; I'll just replace one if it dies.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Well, then put in a bug/feature request for the devs to review. That's the best way.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Why are you worried about the boot drive's ashift? I think we are splitting hairs here when servers take up to 10 minutes just to finish POST.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
TN isn't the only game in town. I agree that for TN there is no point worrying about ashift 12 vs. 13, but it's still better to do things properly and have minimal impact on the medium, which I may well redeploy later. So partition alignment can matter, and I prefer to reduce write amplification where reasonably possible. Current defaults are ashift=12 / alignment=1M, much better than the previous ashift=9 / alignment=20K. I can always hack the installer around line 410 to get ashift=13 etc. if I want.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
I can always hack the installer around line 410 to get ashift=13 etc if I want.
Why would you hack it? If you must have a different ashift for a pool, just make it via the CLI first, export it, then import it in the web UI. It will work like any other...
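A sketch of that create-then-export-then-import dance (the pool and device names here are invented; zpool create wipes the devices and needs root):

```shell
# Build the pool with the non-default ashift from a shell:
zpool create -o ashift=14 fastpool mirror /dev/nvme0n1 /dev/nvme1n1

# Export it so the middleware can take it over:
zpool export fastpool

# Then use Storage -> Import Pool in the TrueNAS web UI
# (or, from the CLI: zpool import fastpool).
```

Since ashift is fixed per vdev at creation time, the web UI then manages the pool normally and the chosen ashift sticks.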
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Wait. You are obsessing over the *boot pool* ashift value? Dear internet gods, why?!?
It barely writes to the boot pool; the SSDs will basically last forever and are DIRT CHEAP at the sizes you should be buying. This is even less of a concern than I originally thought... that being none whatsoever.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
I know how to deal with the situation if I want to. If the system dataset is pushed back to the boot pool, then there is regular I/O, which I prefer not to have on HDDs, and I prefer to minimize write amplification. You don't have to do any of that, and it sounds like you don't, which is also perfectly fine.
 

Jadan1213

Dabbler
Joined
Apr 18, 2018
Messages
17
this is even less of a concern than I originally thought....that being none whatsoever.
Not for me... I'm doing this on a data pool; not really sure why the focus has been on the boot pool in this thread. I agree defaults are fine for a mostly read-only boot pool. I care about write amplification and performance in my data pool...
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Thank you for seeing reason here. I'm curious to know if you have any A/B comparisons between ashift values.
 