Total usable space


Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
I don't think it's a supported thing and I don't think I'd do it myself (and I'm not afraid to do this kind of thing usually...), but I could be wrong.

And if you can change it on the fly, it means it doesn't work on already-written data. Plus, if you have a lot of small files you'll lose some space, because a file will be smaller than a block.
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Bidule0hm said:
I don't think it's a supported thing and I don't think I'd do it myself (and I'm not afraid to do this kind of thing usually...), but I could be wrong.

And if you can change it on the fly, it means it doesn't work on already-written data. Plus, if you have a lot of small files you'll lose some space, because a file will be smaller than a block.

It's absolutely supported. Why wouldn't it be supported? FreeNAS has an option right in the GUI to select the recordsize. It's the last option.

[Screenshot: FreeNAS "Create ZFS Dataset" dialog, with Record Size as the last option]
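If you'd rather do it from the command line, something along these lines should work (tank/media is just an example dataset name):

Code:
# Create a new dataset with a 1M recordsize (only affects data written from now on)
zfs create -o recordsize=1M tank/media

# Or raise it on an existing dataset
zfs set recordsize=1M tank/media

# Check what it is currently set to
zfs get recordsize tank/media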


Correct, it doesn't work on already-written data. But you could simply make a new dataset and then just copy all your data from the old 128KiB dataset to the new 1M dataset. Your data will then take up less space on the new dataset.
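For example, something like this (the paths are hypothetical, and rsync is just one way to do the copy; plain cp works too):

Code:
# Copying re-writes the files, so they get stored as 1M records on the new dataset
rsync -a --progress /mnt/tank/media_old/ /mnt/tank/media/

# Once you've verified the copy, destroy the old dataset to reclaim the space
zfs destroy tank/media_old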

You can lose some space if you have a LOT of small files and store them on the big-recordsize dataset. But don't forget that recordsize is a dataset property, so just store your big files on the 1MiB dataset(s) and keep your small files on dataset(s) with a 128KiB recordsize (or whatever you like).

However, I find that many people are using ZFS for their home media storage, with large files making up the vast majority of their data, so they could stand to gain a lot of space by creating 1M-recordsize datasets and copying all their large files to them.

Even if you stored all your data at a 1MiB record size, unless you have a very, very large number of really small files, the space gained back from the padding issue will, in most configurations, far outweigh the space lost to small files.

Large recordsizes can also help performance, for those interested. For instance, scrubs on my 12x4TB WD Red RAIDZ2 pool run at 1.4GB/s with a 1M recordsize, and it takes only 5 hours and 41 minutes to scrub all 24.8TiB of data, which seems pretty fast to me. Large recordsizes also mean that operations on the 1M-recordsize datasets use up fewer IOPS on your disks, which can leave more for other simultaneous work. And fragmentation, along with the performance impact associated with it, is much less of a problem when most of your blocks are 1M in size.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Ok, my bad, definitely interesting then. I think I'll try it just because I hate wasted resources and I have a lot of big files and not that many small files :)

Why didn't we talk about the recordsize parameter before? Very useful parameter and easy to use (for once...) :D
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
I don't know why people don't talk about it. I've been using it for a while now (although on Linux) and I've mentioned it a couple of times on this forum before, but I don't post a lot.

I can definitely confirm from lots of testing that it does increase usable space.

However, note that it won't change the total capacity that you see in "zfs list" or "zpool list". Those commands always use an assumed 128KiB recordsize to do free-space and capacity calculations.

The difference presents itself like this:

Say you have a 12x4TB RAIDZ2, which has close to 10% overhead with ashift=12 vs. ashift=9. If you make a 128KiB dataset and write a 1 GB file to it, the USED property of the pool will increase by 1 GB, as you would expect.

Now if you make a 1M-recordsize dataset on this same high-overhead pool and write the 1 GB file to that dataset, the USED property of the dataset will only show about 900 M used. So the files you write will "appear" smaller than they actually are, by the amount of overhead your ashift=12 RAIDZ suffered from (note that this has nothing to do with compression; it works on completely incompressible data too). This effectively means you can store that much more data on the dataset, and thus the pool, which negates the overhead issue.

A final note is that you *do* need to have some form of compression enabled on the dataset in order for this all to work, even if you are storing largely incompressible files. LZ4 works great for this and you should not notice any performance impact.
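If you want to see it for yourself, here is a rough sketch of the kind of test you could run (the pool name and file size are just examples; /dev/random gives you incompressible data):

Code:
# Two test datasets that differ only in recordsize, with lz4 on both
zfs create -o recordsize=128K -o compression=lz4 tank/test128k
zfs create -o recordsize=1M   -o compression=lz4 tank/test1m

# Write the same 1 GiB of incompressible data to each
dd if=/dev/random of=/mnt/tank/test128k/file.bin bs=1M count=1024
dd if=/dev/random of=/mnt/tank/test1m/file.bin bs=1M count=1024

# Compare how much space each dataset reports as used
zfs list -o name,used,refer tank/test128k tank/test1m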
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Yep, in short, you won't see an increase in the free space but a decrease in the used space :)

Yes, just keep LZ4 compression enabled on all the datasets, even with already-compressed data; it's very light on the CPU ;)
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
SirMaster said:
[...] Now if you make a 1M-recordsize dataset on this same high-overhead pool and write the 1 GB file to that dataset, the USED property of the dataset will only show about 900 M used. So the files you write will "appear" smaller than they actually are, by the amount of overhead your ashift=12 RAIDZ suffered from (note that this has nothing to do with compression; it works on completely incompressible data too). This effectively means you can store that much more data on the dataset, and thus the pool, which negates the overhead issue.

A final note is that you *do* need to have some form of compression enabled on the dataset in order for this all to work, even if you are storing largely incompressible files. LZ4 works great for this and you should not notice any performance impact.
So why does the compression have to be turned on? I.e., what changes with incompressible files and LZ4?

Thank you in advance!
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
solarisguy said:
So why does the compression have to be turned on? I.e., what changes with incompressible files and LZ4?

Thank you in advance!

It's able to completely compress away the space otherwise used to pad the alignment of the RAIDZ2 records after the blocks are split across the disks.

Also, with 1M blocks, if you have compression off you potentially waste half a MB per file (on average) from the padding of the last block, since your files will pretty much never be exact multiples of 1MB in size. With compression on, this padding can also be completely eliminated.
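If I have the allocation rules right, the back-of-the-envelope math for a 12-wide RAIDZ2 with ashift=12 looks something like this (4 KiB sectors, 10 data disks, and allocations rounded up to a multiple of parity + 1 = 3 sectors):

Code:
# 128 KiB record = 32 data sectors
#   parity  = 2 * ceil(32 / 10) = 8 sectors
#   32 + 8  = 40 sectors, rounded up to 42 (next multiple of 3)
#   ideal   = 32 * 12/10 = 38.4 sectors   ->  42 / 38.4   ~ 9% overhead
#
# 1 MiB record = 256 data sectors
#   parity   = 2 * ceil(256 / 10) = 52 sectors
#   256 + 52 = 308 sectors, rounded up to 309 (next multiple of 3)
#   ideal    = 256 * 12/10 = 307.2 sectors ->  309 / 307.2 ~ 0.6% overhead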

In all my testing, writing files to the 1M dataset with compression off took just as much space as with 128K blocks. Enabling LZ4 compression and then writing the file resulted in a 10% smaller file (on my 12x4TB RAIDZ2) for my test files (H.264 HD video). I tried putting the file inside a maximum-compression solid 7zip archive (a significantly better and slower compression algorithm than ZFS uses) and it only compressed it by a few KB, so the data is otherwise "incompressible", yet it shows up as 10% smaller in the USED property.

I also tested completely filling smaller 1M recordsize test pools (again with 12 disk RAIDZ2) with compression on and off and confirmed that it let me fit about 10% more incompressible data on the pool.
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
@SirMaster, thank you!

On FreeNAS, I also have copies of some programs with data that have tens of thousands of small (1 kB or less) files. I might need to follow your suggestion of having two datasets with different recordsizes.

And still, possibly moving from 8 to 9 disks first...

Anybody with 9 disks of 6 TB each willing to test? :D
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Are the small files in use, or is this just backup? Because if it's backup, you can archive the files in a tar file so ZFS will only see one big file ;)
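Something along these lines, for example (the paths are made up):

Code:
# Pack the directory of small files into one big archive that ZFS can store as large records
tar -cf /mnt/tank/backups/programs.tar -C /mnt/tank/backups programs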
 

AMiGAmann

Contributor
Joined
Jun 4, 2015
Messages
106
@SirMaster: Thanks for the information about the record size!

According to the manual, ZFS changes the record size dynamically if not configured to a fixed value like 1M, if I understand it correctly:
"while ZFS automatically adapts the record size dynamically to adapt to data, if the data has a fixed size (e.g. a database), matching that size may result in better performance"

And a different question from me, starting with a quotation of SirMaster:
However, note that it won't change the total capacity that you see in "zfs list" or "zpool list". Those commands always use an assumed 128KiB recordsize to do free-space and capacity calculations.
...
So the files you write will "appear" smaller than they actually are, by the amount of overhead your ashift=12 RAIDZ suffered from
This is kind of irritating. I understand that it may be hard to calculate the free space, but why can the occupied space not be calculated and displayed exactly?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
It can, I guess; it's just not implemented.
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
AMiGAmann said:
According to the manual, ZFS changes the record size dynamically if not configured to a fixed value like 1M, if I understand it correctly:

ZFS does use a dynamic record size, but it can only go up to the recordsize property you have set. So by default it can't go higher than 128KiB unless you raise the recordsize property. FWIW, it can also only be set to powers of 2.

AMiGAmann said:
This is kind of irritating. I understand that it may be hard to calculate the free space, but why can the occupied space not be calculated and displayed exactly?

Bidule0hm said:
It can, I guess; it's just not implemented.

Well, is the occupied space really meaningful if the capacity and free space can't be calculated?

The capacity and free space are not just difficult to calculate; they are impossible to calculate, since the amount of capacity you have depends on how much data you end up writing at the different record sizes. You can't predict ahead of time how much data you are going to write to each recordsize dataset.

So remember, when I write a 10GiB file, for instance, to my 12-wide RAIDZ2, the zfs USED property shows 9GiB for that file. du (disk usage) also shows 9GiB.

But remember, the reported total capacity of the pool is too low, because it subtracts the overhead assuming 128KiB records. So if ZFS showed how much disk space was actually being used by the file(s), we would get to the point where zfs says we are using more disk space than the capacity of the pool, and that's just confusing.

ZFS also can't really assume 1MiB records to calculate the capacity, because it can't predict that you might fill it with 128KiB records; if you did that, the pool would become completely full even though it says you still have hundreds of GiB, or even more than 1TiB, free. It's better to underestimate than to overestimate.

So, to try to reiterate: you can see the actual (apparent) size of a file with du -A. You can also see the amount of disk space the filesystem thinks it's using, but that number needs to relate to the total capacity of the filesystem or else everything gets screwy. So since the reported pool capacity is smaller than it really is, the 1MiB-recordsize files need to show up smaller than they really are, and the 128KiB-recordsize files need to show up exactly as large as they are.
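For example, for a hypothetical 10GiB file on a 1M-recordsize dataset (the numbers are illustrative, assuming a 12-wide RAIDZ2 with ashift=12 and lz4):

Code:
du -h  /mnt/tank/media/test.bin    # ~9.0G  -> disk space the filesystem reports as used
du -Ah /mnt/tank/media/test.bin    #  10G   -> apparent (actual) size of the file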

Say you have a 10TiB "capacity" pool. The best case is you write 10TiB of 128KiB-recordsize files and you fit exactly 10TiB of files on the pool. The worst case is you write 10TiB of 1MiB-recordsize files to the pool and you are left with 1TiB free. Nobody is gonna complain about that extra free space.

The other option is to instead state that the pool "capacity" is 11TiB. The best case is you write 11TiB of 1MiB-recordsize files, as you expect, and it fills up the pool. The worst case is you queue up 10.5TiB of 128KiB-recordsize files to this 11TiB-capacity pool, but it fails after writing only 10TiB, as the pool ends up full because the 128KiB records can't be stored as efficiently on that number of disks in the RAIDZ2 and it had to add padding.

This is obviously problematic, as the admin needs to know ahead of time that the data he plans to write will fit, without having to do extra math himself.
 