Misaligned Pools and Lost Space...


Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Index

So, it took me reading more than a dozen web pages, a very informative conversation with @jgreco, 3 posts* from @SirMaster and months (well, a couple of dozen hours actually, but the posts were spread months apart...) to understand how all of this works (ZFS isn't that simple sometimes...), so I want to share it in the clearest and most understandable way I can ;)

Many thanks to @jgreco and @SirMaster for the help :)

The following only applies to RAID-Zx vdevs.

  • The problem
    • The ashift value
    • The recordsize value
    • The problem
  • The solution
    • Reducing the ashift value
    • Increasing the recordsize value
  • Miscellaneous
    • What is block auto-sizing
    • Why ZFS wants the number of sectors in each allocation to be a multiple of the RAID-Z level + 1
    • Why the 2^n + p rule is broken
  • Real case example
    • Reserved.
    • Reserved.
    • Reserved.

* the 3 posts I talk about are this one, this one and this one.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Reserved.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
The problem


The ashift value

The ashift is a pool property that reflects the size of a drive sector.

Old drives use 512-byte sectors whereas recent drives use 4k sectors.

Usually we use ashift=9 or ashift=12. The first one stands for 2^9 or 512 bytes and the second one for 2^12 or 4096 (4k) bytes.

NB: you can use an ashift value that doesn't match the real sector size of the drives, but it's not something you want because you will run into performance issues.


The recordsize value

The recordsize is a dataset property that sets the maximum size of a ZFS block.

It can be any power of two between the size of a sector (usually 512 or 4k) as determined by the ashift value, and 1M. The default recordsize is set to 128k.


The problem

Ok, let's go for some maths :)

ZFS needs the number of sectors used by a data block plus its parity to be an integer and a multiple of the RAID-Z level (1, 2 or 3) plus one.

If the number of sectors isn't such a multiple then it is increased until it follows this rule.

For example, for a RAID-Z2 of 8 drives with ashift=12 and recordsize=128k, we have 128k / 4k = 32 sectors of data + 10.667 sectors of parity (32 sectors * 2 parity drives / 6 data drives), so the total is 42.667 sectors.

This number isn't a multiple of 3 (RAID-Z level 2 + 1); actually it's not even an integer, so ZFS will round up to the next multiple of 3, which is 45 in this example.

The problem is that this padding is pure lost space. In the example we lose (45 - 42.667) / 42.667 = 5.47 %.
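
If you want to play with the numbers yourself, here is a minimal Python sketch of the calculation above. It uses the same simplified model as this post (fractional parity, no compression, no metadata), so it's an illustration and not the exact ZFS allocator; the layout in the example is the 8-drive RAID-Z2 from above.

Code:
import math

def raidz_block_allocation(recordsize, sector_size, drives, parity):
    """Sectors allocated for one full block, simplified model (fractional parity)."""
    data = recordsize // sector_size                             # data sectors per block
    parity_sectors = data * parity / (drives - parity)           # parity, as computed above
    total = data + parity_sectors
    allocated = math.ceil(total / (parity + 1)) * (parity + 1)   # round up to a multiple of p + 1
    return allocated, (allocated - total) / total

# 8-drive RAID-Z2, ashift=12 (4k sectors), recordsize=128k
allocated, lost = raidz_block_allocation(128 * 1024, 4096, 8, 2)
print(allocated, f"{lost:.2%}")   # 45 sectors and ~5.47 % lost space, as above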
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Reserved.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
The solution


Reducing the ashift value

By reducing the ashift value there's a better chance that the number of sectors will be an integer and a multiple, because for the same block size there are more sectors per block; and even when it isn't a multiple, there are fewer padding sectors relative to the total number of sectors.

If we take the same example as above but with ashift=9, then we have 256 + 85.333 = 341.333 sectors and the next multiple of 3 is 342, so the lost space is only 0.20 % instead of 5.47 % :)

There's just one small problem: all the recent drives have 4k sectors, and using ashift=9 with 4k sectors isn't exactly a good idea for performance (each time ZFS writes a 512-byte sector the drive actually has to rewrite the whole 4k physical sector...).
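
For the curious, here is a tiny sketch (same simplified model as in the post above, not the real allocator) that compares the two ashift values for a 128k block on this 8-drive RAID-Z2:

Code:
import math

def lost_space(data_sectors, parity=2, data_drives=6):
    total = data_sectors + data_sectors * parity / data_drives
    allocated = math.ceil(total / (parity + 1)) * (parity + 1)  # round up to a multiple of p + 1
    return (allocated - total) / total

print(f"ashift=12: {lost_space(128 * 1024 // 4096):.2%}")  # 32 data sectors, ~5.47 % lost
print(f"ashift=9:  {lost_space(128 * 1024 // 512):.2%}")   # 256 data sectors, ~0.20 % lost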


Increasing the recordsize value

By increasing the recordsize value we get essentially the same effect as when we reduce the ashift value, because we again end up with more sectors per block.

If we increase the recordsize value to 1M instead of 128k and take the same example again, then we have 256 + 85.333 = 341.333 sectors and the next multiple of 3 is 342, so the lost space is only 0.20 % instead of 5.47 % (yes, it's the same result as with ashift=9, but that's just a coincidence).

There are basically no downsides (if we ignore VMs and DBs, which we can, as we are talking about RAID-Zx vdevs only) to increasing the recordsize value, so we have our solution :)
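
As a quick illustration, here is a sketch (again the same simplified model, so ballpark figures only) that sweeps the recordsize from 128k to 1M for the 8-drive RAID-Z2 with ashift=12; the lost space shrinks from 5.47 % to about 0.20 %:

Code:
import math

def lost_space(recordsize, sector=4096, drives=8, parity=2):
    data = recordsize // sector
    total = data + data * parity / (drives - parity)
    allocated = math.ceil(total / (parity + 1)) * (parity + 1)  # round up to a multiple of p + 1
    return (allocated - total) / total

for kib in (128, 256, 512, 1024):
    print(f"recordsize={kib:>4}k: {lost_space(kib * 1024):.2%} lost")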
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Reserved.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Miscellaneous


What is block auto-sizing

Remember when I talked about the recordsize value? Well, the block size can be any power of two between the size of a sector (usually 512 or 4k), as determined by the ashift value, and the recordsize value.

For example, if you have a 7 KB file then ZFS will choose an 8 KB block size. Fortunately, almost all of the files we usually have are bigger than 1 MB, so increasing the recordsize value to 1M does work to reduce the misalignment overhead.
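
To make the auto-sizing concrete, here is a hypothetical little helper that mirrors the description above (the smallest power-of-two block, between one sector and the recordsize, that holds the file); the real ZFS logic is more involved, so treat it as an illustration only.

Code:
def auto_block_size(file_size, sector=4096, recordsize=1024 * 1024):
    """Smallest power-of-two block (>= one sector, <= recordsize) that holds the file."""
    block = sector
    while block < min(file_size, recordsize):
        block *= 2
    return min(block, recordsize)

print(auto_block_size(7 * 1024))         # 8192: the 7 KB file gets an 8 KB block
print(auto_block_size(3 * 1024 * 1024))  # 1048576: big files use full 1M blocks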


Why ZFS wants the number of sectors in each allocation to be a multiple of the RAID-Z level + 1

Let's take the simplest example, a RAID-Z1 pool. The smallest allocation you can do is 1 sector for data + 1 sector for parity, or 2 sectors total. Now let's say you need 2 data sectors + 1 parity sector; no problem, ZFS can allocate that. Next you delete the data, so there are now 3 free sectors. ZFS then reallocates 2 of those 3 sectors because you've decided to put some new data there; well, now there's a single free sector that can't be used until some adjacent sectors are freed, and that's not a good thing.

By following the RAID-Z level + 1 rule you avoid cases like that where you end up with single free sectors. If we take the same example but follow the rule we now have 2 data sectors + 1 parity sector + 1 padding sector. If we free them and reallocate them there are only 3 possibilities:
  • 3 data sectors + 1 parity sector
  • 2 data sectors + 1 parity sector + 1 padding sector
  • 2x (1 data sector + 1 parity sector)
There's no single free sector in any of these possibilities.
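
Here is a tiny brute-force check of that argument for the RAID-Z1 case above (purely illustrative): every allocation is rounded up to a multiple of p + 1 = 2 sectors, so carving allocations out of the freed 4-sector hole can never leave a lone unusable sector behind.

Code:
from itertools import product

UNIT = 2  # RAID-Z1: the smallest allocation is 1 data + 1 parity sector

def round_up(sectors):
    return -(-sectors // UNIT) * UNIT  # round up to a multiple of p + 1

for requests in product(range(1, 5), repeat=2):  # any two allocations of 1 to 4 sectors (data + parity)
    hole = 4                                     # the freed 4-sector hole from the example
    for r in requests:
        if round_up(r) <= hole:
            hole -= round_up(r)
    assert hole % UNIT == 0, requests            # the leftover free space is never a lone sector

print("ok: no stranded single sector")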


Why the 2^n + p rule is broken

If we take the example of a 10-drive RAID-Z2, which follows the 2^n + p rule, we have a 5 % overhead (32 data + 8 parity = 40 sectors --> 40 / 3 = 13.333, so we need 2 padding sectors), so the rule is obviously broken. We can't do anything about that; this rule is just an (over)simplification, so it's not exact.
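
A quick check with the same simplified model as in the first post shows that widths following the 2^n + p rule are not automatically padding-free (128k records, 4k sectors): the 4-drive and 10-drive RAID-Z2 and the 19-drive RAID-Z3 still lose a few percent to padding, while the RAID-Z1 widths happen to come out clean at this recordsize.

Code:
import math

def lost_space(drives, parity, data_sectors=32):  # 32 data sectors = 128k / 4k
    total = data_sectors + data_sectors * parity / (drives - parity)
    allocated = math.ceil(total / (parity + 1)) * (parity + 1)
    return (allocated - total) / total

for parity, widths in ((1, (3, 5, 9, 17)), (2, (4, 6, 10, 18)), (3, (5, 7, 11, 19))):
    report = ", ".join(f"{w} drives: {lost_space(w, parity):.1%}" for w in widths)
    print(f"RAID-Z{parity}: {report}")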
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Reserved.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Reserved.
 

snicke

Explorer
Joined
May 5, 2015
Messages
74
Great article @Bidule0hm! Thank you!

How and when do you set the ashift and recordsize values? I searched for ashift in the FreeNAS 9.3.a documentation but couldn't find anything. Do you have any RTFM-links maybe? ;)

What is the default ashift value?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
You're welcome ;)

The default ashift value is 12 (4k sectors) and you don't want to change it because all current drives have 4k sectors.

The recordsize value is settable (only in Advanced Mode) when you create a dataset, look at the options table here: http://doc.freenas.org/9.3/freenas_storage.html#create-dataset ;)
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Fortunately, almost all of the files we usually have are bigger than 1 MB, so increasing the recordsize value to 1M does work to reduce the misalignment overhead.

The question is: how much of the data needs to be bigger than 1 MB?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
@Dice I haven't had the time to put up the real case example but in short, even on my user documents (no big media files, mostly source files, images, pdf, txt, ...) I reclaimed nearly all of the lost space by changing from 128k to 1M ;)

You can always do a test: create a dataset with a 128k recordsize, another with a 1M recordsize, and copy the exact same data to both ;)

@Mlovelace Yep, it's a bit hard to understand, but as I haven't had the time to write the part about that, it's a good read in the meantime.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
You can always do a test: create a dataset with a 128k recordsize, another with 1M recordsize and copy the exact same data to both
Ahh.. Once I get my stuff running I will do that and hopefully find this thread to report back.
 
Joined
Apr 9, 2015
Messages
1,258
Ok, I need just a little help processing and figuring for something. Sorry that I have to ask but the percocet is turning my brain to mush right now.

If I have seven 4TB drives (4k native sectors) in a RAID-Z3 with mostly larger files but a smattering of small ones as well, would I be better off letting the system manage the ashift value, or would it be better to set it manually, and what would the target be? I will hopefully be picking up the last two drives to build out my pool in the coming weeks. I can also see about testing a couple of different setups, since I will be copying everything off of a single 3TB drive, so a couple of configurations to try out would be good.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The solution


There's basically no downside to increasing the recordsize value so we have our solution :)

Of course there are downsides to increasing the recordsize value. Among them are:

- a tendency to consume more RAM to accomplish the same things (or a reduction in the number of things the ARC can manage),
- the fact that lots of data fits into blocks well below 1MB, meaning that you haven't actually found a magic bullet, just an optimization that works for large files,
- that slinging around 1MB blocks increases latency, especially considering that the blocks may be compressed, meaning a larger amount of data must be inflated,
- that you are less likely to find a contiguous 1MB allocation on a fragmented pool than you are a contiguous 128K allocation,
- that some data you would never want to use a large recordsize for (VM virtual disks on NFS),
etc.

Whether any of these downsides are applicable to any given scenario is, of course, a different matter entirely.
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Of course there are downsides to increasing the recordsize value. Among them are:

- a tendency to consume more RAM to accomplish the same things (or a reduction in the number of things the ARC can manage),
- the fact that lots of data fits into blocks well below 1MB, meaning that you haven't actually found a magic bullet, just an optimization that works for large files,
- that slinging around 1MB blocks increases latency, especially considering that the blocks may be compressed, meaning a larger amount of data must be inflated,
- that you are less likely to find a contiguous 1MB allocation on a fragmented pool than you are a contiguous 128K allocation,
- that some data you would never want to use a large recordsize for (VM virtual disks on NFS),
etc.

Whether any of these downsides are applicable to any given scenario is, of course, a different matter entirely.
The sign of a good sysadmin is the ability to find the downside to everything. :p
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The sign of a good sysadmin is the ability to find the downside to everything. :p

I come from the old days of the ISP business, where you had to be (at least semi-) expert in lots of things. I'm not so much a "good sysadmin" as I am "someone who's worked with it all" or maybe "been crapped on by it all."

Think of it more as I've gotten good at understanding the downsides of things since I don't like getting crapped on. ;-)
 