Misaligned Pools and Lost Space...


Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Index

So, it took me reading more than a dozen web pages, a very informative conversation with @jgreco, 3 posts* from @SirMaster and months (well, a couple of dozen hours actually, but the posts were spread months apart...) to understand how all of this works (ZFS isn't that simple sometimes...), so I want to share it in the clearest and most understandable way I can ;)

Many thanks to @jgreco and @SirMaster for the help :)

The following only applies to RAID-Zx vdevs.

  • The problem
    • The ashift value
    • The recordsize value
    • The problem
  • The solution
    • Reducing the ashift value
    • Increasing the recordsize value
  • Miscellaneous
    • What is block auto-sizing
    • Why ZFS wants the number of sectors in each allocation to be a multiple of the RAID-Z level + 1
    • Why the 2^n + p rule is broken
  • Real case example
    • Reserved.
    • Reserved.
    • Reserved.

* the 3 posts I talk about are this one, this one and this one.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Reserved.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
The problem


The ashift value

The ashift is a pool property that reflects the size of a drive sector.

Old drives use 512-byte sectors whereas recent drives use 4k sectors.

Usually we use ashift=9 or ashift=12. The first one stands for 2^9 or 512 bytes and the second one for 2^12 or 4096 (4k) bytes.

NB: you can use an ashift value that doesn't match the real sector size of the drives, but it's not something you want because you will run into performance issues.


The recordsize value

The recordsize is a dataset property that sets the maximum size of a ZFS block.

It can be any power of two between the size of a sector (usually 512 or 4k) as determined by the ashift value, and 1M. The default recordsize is set to 128k.


The problem

Ok, let's go for some maths :)

ZFS needs the number of sectors used by a data block plus its parity to be an integer and a multiple of the RAID-Z level (1, 2 or 3) plus one.

If the number of sectors isn't such a multiple then it is increased until it follows this rule.

For example, for a RAID-Z2 of 8 drives with ashift=12 and recordsize=128k, we have 128k / 4k = 32 sectors of data + 10.667 sectors of parity (32 sectors * 2 parity drives / 6 data drives), so the total is 42.667 sectors.

This number isn't a multiple of 3 (RAID-Z level 2 + 1); actually it's not even an integer, so ZFS will round up to the next multiple of 3, which is 45 in this example.

The problem is that this padding is pure lost space. In the example we lose (45 - 42.667) / 42.667 = 5.47 %.
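
If you want to play with the numbers yourself, here is a minimal Python sketch of the calculation above. It uses the same simplified model as this post (fractional parity, no compression, no metadata), so it's an illustration and not the exact ZFS allocator; the layout in the example is the 8-drive RAID-Z2 from above.

Code:
import math

def raidz_block_allocation(recordsize, sector_size, drives, parity):
    """Sectors allocated for one full block, simplified model (fractional parity)."""
    data = recordsize // sector_size                             # data sectors per block
    parity_sectors = data * parity / (drives - parity)           # parity, as computed above
    total = data + parity_sectors
    allocated = math.ceil(total / (parity + 1)) * (parity + 1)   # round up to a multiple of p + 1
    return allocated, (allocated - total) / total

# 8-drive RAID-Z2, ashift=12 (4k sectors), recordsize=128k
allocated, lost = raidz_block_allocation(128 * 1024, 4096, 8, 2)
print(allocated, f"{lost:.2%}")   # 45 sectors and ~5.47 % lost space, as above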
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Reserved.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
The solution


Reducing the ashift value

By reducing the ashift value there's a better chance that the number of sectors will be an integer and a multiple, because for the same block size there are more sectors per block; and even when it isn't a multiple, there are fewer padding sectors relative to the total number of sectors.

If we take the same example as above but with ashift=9, then we have 256 + 85.333 = 341.333 sectors and the next multiple of 3 is 342, so the lost space is only 0.20 % instead of 5.47 % :)

There's just one small problem: all the recent drives have 4k sectors, and using ashift=9 with 4k sectors isn't exactly a good idea for performance (each time ZFS writes a 512-byte sector the drive actually has to rewrite the whole 4k physical sector...).
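
For the curious, here is a tiny sketch (same simplified model as in the post above, not the real allocator) that compares the two ashift values for a 128k block on this 8-drive RAID-Z2:

Code:
import math

def lost_space(data_sectors, parity=2, data_drives=6):
    total = data_sectors + data_sectors * parity / data_drives
    allocated = math.ceil(total / (parity + 1)) * (parity + 1)  # round up to a multiple of p + 1
    return (allocated - total) / total

print(f"ashift=12: {lost_space(128 * 1024 // 4096):.2%}")  # 32 data sectors, ~5.47 % lost
print(f"ashift=9:  {lost_space(128 * 1024 // 512):.2%}")   # 256 data sectors, ~0.20 % lost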


Increasing the recordsize value

By increasing the recordsize value we get essentially the same effect as when we reduce the ashift value, because we again end up with more sectors per block.

If we increase the recordsize value to 1M instead of 128k and take the same example again, then we have 256 + 85.333 = 341.333 sectors and the next multiple of 3 is 342, so the lost space is only 0.20 % instead of 5.47 % (yes, it's the same result as with ashift=9, but that's just a coincidence).

There are basically no downsides (if we ignore VMs and DBs, which we can, as we are talking about RAID-Zx vdevs only) to increasing the recordsize value, so we have our solution :)
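
As a quick illustration, here is a sketch (again the same simplified model, so ballpark figures only) that sweeps the recordsize from 128k to 1M for the 8-drive RAID-Z2 with ashift=12; the lost space shrinks from 5.47 % to about 0.20 %:

Code:
import math

def lost_space(recordsize, sector=4096, drives=8, parity=2):
    data = recordsize // sector
    total = data + data * parity / (drives - parity)
    allocated = math.ceil(total / (parity + 1)) * (parity + 1)  # round up to a multiple of p + 1
    return (allocated - total) / total

for kib in (128, 256, 512, 1024):
    print(f"recordsize={kib:>4}k: {lost_space(kib * 1024):.2%} lost")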
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Reserved.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Miscellaneous


What is block auto-sizing

Remember when I talked about the recordsize value? Well, the block size can be any power of two between the size of a sector (usually 512 or 4k), as determined by the ashift value, and the recordsize value.

For example, if you have a 7 KB file then ZFS will choose an 8 KB block size. Fortunately, almost all of the files we usually have are bigger than 1 MB, so increasing the recordsize value to 1M does work to reduce the misalignment overhead.
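
To make the auto-sizing concrete, here is a hypothetical little helper that mirrors the description above (the smallest power-of-two block, between one sector and the recordsize, that holds the file); the real ZFS logic is more involved, so treat it as an illustration only.

Code:
def auto_block_size(file_size, sector=4096, recordsize=1024 * 1024):
    """Smallest power-of-two block (>= one sector, <= recordsize) that holds the file."""
    block = sector
    while block < min(file_size, recordsize):
        block *= 2
    return min(block, recordsize)

print(auto_block_size(7 * 1024))         # 8192: the 7 KB file gets an 8 KB block
print(auto_block_size(3 * 1024 * 1024))  # 1048576: big files use full 1M blocks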


Why ZFS wants the number of sectors in each allocation to be a multiple of the RAID-Z level + 1

Let's take the simplest example, a RAID-Z1 pool. The smallest allocation you can do is 1 sector for data + 1 sector for parity, or 2 sectors total. Now let's say you need 2 data sectors + 1 parity sector; no problem, ZFS can allocate that. Next you delete the data, so there are now 3 free sectors. ZFS then reallocates 2 of those 3 sectors because you've decided to put some new data there; well, now there's a single free sector that can't be used until some adjacent sectors are freed, and that's not a good thing.

By following the RAID-Z level + 1 rule you avoid cases like that where you end up with single free sectors. If we take the same example but follow the rule we now have 2 data sectors + 1 parity sector + 1 padding sector. If we free them and reallocate them there are only 3 possibilities:
  • 3 data sectors + 1 parity sector
  • 2 data sectors + 1 parity sector + 1 padding sector
  • 2x (1 data sector + 1 parity sector)
There's no single free sector in any of these possibilities.
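
Here is a tiny brute-force check of that argument for the RAID-Z1 case above (purely illustrative): every allocation is rounded up to a multiple of p + 1 = 2 sectors, so carving allocations out of the freed 4-sector hole can never leave a lone unusable sector behind.

Code:
from itertools import product

UNIT = 2  # RAID-Z1: the smallest allocation is 1 data + 1 parity sector

def round_up(sectors):
    return -(-sectors // UNIT) * UNIT  # round up to a multiple of p + 1

for requests in product(range(1, 5), repeat=2):  # any two allocations of 1 to 4 sectors (data + parity)
    hole = 4                                     # the freed 4-sector hole from the example
    for r in requests:
        if round_up(r) <= hole:
            hole -= round_up(r)
    assert hole % UNIT == 0, requests            # the leftover free space is never a lone sector

print("ok: no stranded single sector")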


Why the 2^n + p rule is broken

If we take the example of a 10-drive RAID-Z2, which follows the 2^n + p rule, we have a 5 % overhead (32 data + 8 parity = 40 sectors --> 40 / 3 = 13.333, so we need 2 padding sectors), so the rule is obviously broken. We can't do anything about that; this rule is just an (over)simplification, so it's not exact.
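
A quick check with the same simplified model as in the first post shows that widths following the 2^n + p rule are not automatically padding-free (128k records, 4k sectors): the 4-drive and 10-drive RAID-Z2 and the 19-drive RAID-Z3 still lose a few percent to padding, while the RAID-Z1 widths happen to come out clean at this recordsize.

Code:
import math

def lost_space(drives, parity, data_sectors=32):  # 32 data sectors = 128k / 4k
    total = data_sectors + data_sectors * parity / (drives - parity)
    allocated = math.ceil(total / (parity + 1)) * (parity + 1)
    return (allocated - total) / total

for parity, widths in ((1, (3, 5, 9, 17)), (2, (4, 6, 10, 18)), (3, (5, 7, 11, 19))):
    report = ", ".join(f"{w} drives: {lost_space(w, parity):.1%}" for w in widths)
    print(f"RAID-Z{parity}: {report}")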
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Reserved.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Reserved.
 

snicke

Explorer
Joined
May 5, 2015
Messages
74
Great article @Bidule0hm! Thank you!

How and when do you set the ashift and recordsize values? I searched for ashift in the FreeNAS 9.3.a documentation but couldn't find anything. Do you have any RTFM-links maybe? ;)

What is the default ashift value?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
You're welcome ;)

The default ashift value is 12 (4k sectors) and you don't want to change it because all current drives have 4k sectors.

The recordsize value is settable (only in Advanced Mode) when you create a dataset, look at the options table here: http://doc.freenas.org/9.3/freenas_storage.html#create-dataset ;)
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Fortunately, almost all of the files we usually have are bigger than 1 MB, so increasing the recordsize value to 1M does work to reduce the misalignment overhead.

The question is: how much of the data needs to be bigger than 1 MB?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
@Dice I haven't had the time to put up the real case example but in short, even on my user documents (no big media files, mostly source files, images, pdf, txt, ...) I reclaimed nearly all of the lost space by changing from 128k to 1M ;)

You can always do a test: create a dataset with a 128k recordsize, another with a 1M recordsize, and copy the exact same data to both ;)

@Mlovelace Yep, it's a bit hard to understand, but as I haven't had the time to write the part about that, it's a good read in the meantime.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
You can always do a test: create a dataset with a 128k recordsize, another with 1M recordsize and copy the exact same data to both
Ahh.. Once I get my stuff running I will do that and hopefully find this thread to report back.
 
Joined
Apr 9, 2015
Messages
1,258
Ok, I need just a little help processing and figuring for something. Sorry that I have to ask but the percocet is turning my brain to mush right now.

If I have seven 4TB drives (4k native sectors) in a RAID-Z3 with mostly larger files but a smattering of small ones as well, would I be better off letting the system manage the ashift value, or would it be better to set it manually, and what would the target be? I will hopefully be picking up the last two drives to build out my pool in the coming weeks. I can also see about testing a couple of different setups, since I will be copying everything off of a single 3TB drive, so a couple of configurations to try out would be good.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The solution


There's basically no downside to increasing the recordsize value so we have our solution :)

Of course there are downsides to increasing the recordsize value. Among them are:

- a tendency to consume more RAM to accomplish the same things (or a reduction in the number of things the ARC can manage),
- the fact that lots of data fits into blocks well below 1MB, meaning that you haven't actually found a magic bullet, just an optimization that works for large files,
- that slinging around 1MB blocks increases latency, especially considering that the blocks may be compressed, meaning a larger amount of data must be inflated,
- that you are less likely to find a contiguous 1MB allocation on a fragmented pool than you are a contiguous 128K allocation,
- that some data you would never want to use a large recordsize for (VM virtual disks on NFS),
etc.

Whether any of these downsides are applicable to any given scenario is, of course, a different matter entirely.
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Of course there are downsides to increasing the recordsize value. Among them are:

- a tendency to consume more RAM to accomplish the same things (or a reduction in the number of things the ARC can manage),
- the fact that lots of data fits into blocks well below 1MB, meaning that you haven't actually found a magic bullet, just an optimization that works for large files,
- that slinging around 1MB blocks increases latency, especially considering that the blocks may be compressed, meaning a larger amount of data must be inflated,
- that you are less likely to find a contiguous 1MB allocation on a fragmented pool than you are a contiguous 128K allocation,
- that some data you would never want to use a large recordsize for (VM virtual disks on NFS),
etc.

Whether any of these downsides are applicable to any given scenario is, of course, a different matter entirely.
The sign of a good sysadmin is the ability to find the downside to everything. :p
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The sign of a good sysadmin is the ability to find the downside to everything. :p

I come from the old days of the ISP business, where you had to be (at least semi-) expert in lots of things. I'm not so much a "good sysadmin" as I am "someone who's worked with it all" or maybe "been crapped on by it all."

Think of it more as I've gotten good at understanding the downsides of things since I don't like getting crapped on. ;-)
 