Fragmentation

Status
Not open for further replies.

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
So, to change the recordsize on my datasets I copied the dataset content to another temporary dataset so I could destroy the existing one, recreate it with the correct recordsize and move the data back.
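In case it helps, the dance was basically something like this (pool/dataset names and the recordsize value are placeholders, not my exact commands):

Code:
# recordsize only applies to blocks written after the dataset is created/changed,
# hence the copy out and back ("tank" and the dataset names are placeholders)
zfs create tank/tmpcopy
rsync -a /mnt/tank/mydata/ /mnt/tank/tmpcopy/
zfs destroy tank/mydata
zfs create -o recordsize=1M tank/mydata
rsync -a /mnt/tank/tmpcopy/ /mnt/tank/mydata/
zfs destroy tank/tmpcopy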

My pool is an 8x 3 TB RAID-Z3, so about 12 TiB of data space. 3.0 TiB (25 %) was used. I copied a dataset of 1.8 TiB (16 %) and the fragmentation rose from 4 % to 7 %.

The question is: why did it rise?
 
Last edited:

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Because you created a "hole" in the pool where the data used to reside when you moved it. Unless you move it to a new pool in one long continuous write operation it will continue to do the same thing.

I'm sure someone will be along shortly though to explain it in a much more technical and accurate manner. :p

ETA: I'd be happy with 7 %; mine is currently sitting at 65 % capacity and 19 % fragmentation.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
No, I copied it, and in one go with no other write to other datasets in the meantime :)

It's not that I'm not happy with the number (well, maybe a little...), it's curiosity: from what I understand it shouldn't rise when you do a plain copy like that, with plenty of space to put the copied data without fragmentation.
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
No, I copied it, and in one go with no other write to other datasets in the meantime :)

from what I understand it shouldn't rise when you do a plain copy like that, with plenty of space to put the copied data without fragmentation.

You misunderstand the situation. Doing a plain single copy can certainly cause the FRAG property to rise.

In ZFS the FRAG property is not a measurement of the fragmentation of your data. It's a measurement of the fragmentation of the free space. The goal of the FRAG property is to measure what percentage of free blocks (holes) in the metaslabs are smaller than a certain size. It measures this because as long as there are lots of very large free holes (i.e. a small percentage of small holes, hence a low "FRAG"), ZFS can allocate writes quickly. Performance only becomes slow once FRAG is high, meaning that most of your free space is scattered into many small holes and it takes ZFS longer to fit the data into all of them.
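If you want to check that number yourself, the pool-wide property is literally called "fragmentation" ("tank" below is a placeholder for your pool name):

Code:
# FRAG here is the free-space fragmentation described above, not data fragmentation
zpool get fragmentation tank
zpool list -o name,size,allocated,free,fragmentation,capacity tank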

Some suggestions:
  1. Use larger recordsizes (you are doing that now, and fragmentation shouldn't ever be a problem, especially with a 1 MiB recordsize; there's a quick sketch of these commands after this list)
  2. Improve how the ZIL works for these kinds of writes by setting "logbias" for your dataset to "throughput".
  3. Extend your transaction group interval, based on how much data you're willing to hold in the in-RAM TXG and in the on-disk mirror of the TXG that we call the "ZIL". On a typical home system with, say, 16GB of RAM and throughput of perhaps only a few hundred megabytes per second, you could maybe make it 10 or 15 seconds instead of 5, so it syncs less often and aggregates more linear writes.
  4. Realize that fragmentation is not a particularly big deal in ZFS. With very large files like you are using, your metaslabs are going to be fairly teeny and can load and unload dynamically, while with FAT, NTFS, and other legacy filesystems a big part of dealing with a fragmented filesystem was gigantic growth of your space maps that had to be kept in-memory; this resulted in profound impact not only to seek times but to overall system resource utilization. That cost doesn't become profound on ZFS until your space maps accommodate billions++ of files...
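As a rough sketch of suggestions 1-3 on FreeNAS/FreeBSD (the dataset name and the exact values are placeholders, not a recommendation for your specific pool):

Code:
zfs set recordsize=1M tank/mydata        # 1. only affects blocks written after the change
zfs set logbias=throughput tank/mydata   # 2. bias the log toward throughput instead of latency
sysctl vfs.zfs.txg.timeout=10            # 3. TXG sync interval in seconds (default is 5)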
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
In ZFS the FRAG property is not a measurement of the fragmentation of your data. It's a measurement of the fragmentation of the free space.
Ahhh ok, now it's clear :D thanks ;)

The frag value is now at 5 %, so lower than right after the copy but still a bit higher than before (I'll make a thread on why and how to change the recordsize, the overhead from a misaligned pool, etc... and will use my case as an example with all the numbers, so if you want more data you'll have it :)). It's not a big deal at all, I was just curious to know why it rises on a plain copy.

Now the question is: how to know the fragmentation of the data since frag is the fragmentation of the free space?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Now the question is: how to know the fragmentation of the data since frag is the fragmentation of the free space?

Reboot your filer. Time how long it takes to read the data sequentially. The longer it takes, the more seeking is going on. :smile:
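Something as crude as this will do the job (the path is a placeholder; reboot first so the ARC is cold, then look at the throughput dd reports when it finishes):

Code:
# sequential read of a large file, discarding the output; slower = more seeking
dd if=/mnt/tank/mydata/some_big_file of=/dev/null bs=1M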
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Yeah but it's totally subjective :) I want numbers :p

But in theory, the data frag should be roughly lower than or equal to the free space frag, because one day or another the free space will be used, no?
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Now the question is: how to know the fragmentation of the data since frag is the fragmentation of the free space?

This one is not really possible to know. But ZFS is designed such that the data fragmentation shouldn't make much of a difference, which is why there is no way to report on it.

The FRAG property was added not much more than a year ago. Its purpose was to create a map of where the free space in ZFS is located so that the ZFS slab allocator can more quickly and efficiently pick the best metaslab to load to perform the write.

Previously ZFS would always choose the most empty metaslab and would swiss-cheese it up nearly right away when it wrote new blocks to it. But now, with the FRAG property, all the metaslabs keep a record of how fragmented their individual free space is.
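If you're curious, I believe zdb can dump the per-metaslab space maps so you can see this for yourself (read-only debug output; "tank" is a placeholder, and the exact fields vary between ZFS versions):

Code:
# walks the pool's metaslabs and prints their space maps / free-space statistics
zdb -mm tank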

So now the ZFS slab allocator can instantly know what the write performance of a given slab will be like. If ZFS sees a very heavy write load coming into the system it can choose a high-performance metaslab that has very un-fragmented free space. But late at night, for instance, when the system is under a much lighter write load, it can spend more time searching through metaslabs with more fragmented free space to find better fits for the incoming writes. This way ZFS can preserve more high-performance metaslabs for future heavy write loads.

Basically, why "waste" a pristine, high-performance metaslab and swiss-cheese it all up when the incoming write load is so small and the system has plenty of time to do more searching? Let's find better fits for the writes when the write load is small and save the pristine metaslabs for when the system is very busy and doesn't have time to find better fits.

Overall this improves the write performance in ZFS as the filesystem fills up.

This is the real place where ZFS performance has suffered, as it fills up. But it is not nearly as bad as it used to be.

As far as fragmented data for reads, it hasn't proven to be a real problem in terms of normal data reading performance. Fragmentation is inevitable in a CoW filesystem, but all the technologies built into ZFS (transaction groups, ZIL, ARC) allow it to maintain acceptable read performance throughout. Fragmented data does have an impact on scrub and resilver performance. However this is currently being solved in OpenZFS as someone at Delphix is working on adding sequential scrub and resilver to OpenZFS so that scrubs and resilvers are performed sequentially on the disks no matter the fragmentation of the data at the user level. Oracle already has this feature in their codebase, but we will be getting it ourselves soon.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yeah but it's totally subjective :) I want numbers :p

But in theory, the data frag should be roughly lower than or equal to the free space frag, because one day or another the free space will be used, no?

No, because if you fill 10% of the pool with a contiguous file, then proceed to fill the remaining 90% with random writes to large files, and then remove the first file, you have a single large contiguous region of free space but loads of fragmentation of the data on the pool.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Ok, I see, thanks for the long and precise answer ;)

Edit: @jgreco yeah... it was just a very quick educated guess... I guess it's not a good idea to spend only 30 seconds thinking about subjects like this one... :D
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
As far as fragmented data for reads, it hasn't proven to be a real problem in terms of normal data reading performance.

For some unique definition of "hasn't proven to be a real problem", I guess I'd have to agree. Maybe if you're using it for little files or something. Those of us using it for block storage know the gory truth to be otherwise.

Fragmentation is inevitable in a CoW filesystem, but all the technologies built into ZFS (transaction groups, ZIL, ARC) allow it to maintain acceptable read performance throughout. Fragmented data does have an impact on scrub and resilver performance.

And a hell of a hit on VM virtual disk sequential read performance, such as when backing them up. Throwing lots of free space at the pool helps a lot (by reducing fragmentation), as does having lots of ARC and L2ARC... but most people are not provisioning their filers with gobs of RAM and 1/8th the used data size in L2ARC.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Edit: @jgreco yeah... it was just a very quick educated guess... I guess it's not a good idea to spend only 30 seconds thinking about subjects like this one... :D

No biggie, much of my professional life has been spent suffering the edge cases, and mitigating them too. I came up with that answer in like 5 seconds ;-) ;-)
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Well, the niche where people see performance problems is smaller home servers using ZFS for VMs and other things that are highly sensitive to fragmentation.

Enterprises using this Enterprise filesystem are indeed using gobs and gobs of RAM and this helps their performance immensely.

What I meant is that the people ZFS was really designed for, the people who hold a stake in ZFS's development of new features and improvements, are not seeing as many performance issues caused by fragmentation.

If they were, they would be paying developers like Delphix, Nexenta, Joyent, etc. to implement features to combat the problems. Which is exactly what they did with the FRAG property, the slab allocator enhancements, and the forthcoming sequential scrub patch.

Data fragmentation on a CoW filesystem is a very difficult problem and is essentially unavoidable. Look at what the BTRFS devs tell you to do for VM disk images stored on BTRFS: they tell you to disable CoW for the disk image entirely! That isn't even possible in ZFS, nor should it ever be, as it defeats many of the purposes and features of the filesystem itself.
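For comparison, on BTRFS that advice usually boils down to something like this (the path is just an example), which marks a directory so files created in it skip CoW:

Code:
chattr +C /var/lib/libvirt/images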

Defragmentation is not really an option because it would require block pointer rewrite, which is not something anyone is willing to pay to have implemented. Minimizing data fragmentation through TXGs and the ZIL, and caching frequent reads in the ARC, was deemed the better way to maintain ZFS's performance. The FRAG property and the slab allocator changes were the answer to improving the performance of pools as they become more full. All of these things were not super difficult, dangerous, or error-prone like BPR would be, and all could be done with minimal detrimental effects on the existing system.

Design and development choices like this are not taken lightly and are exactly why ZFS is still so reliable and high performance.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'd have to say that there's a huge component to the "are not seeing as many performance issues" in that the game has totally changed in the last five years or so... even just a year or two ago I was seeing enterprise deployments where pushing for more than 32GB of RAM was met with significant resistance. Today, though, with 128GB worth of 32GB DDR4-2133 modules going for sub-$1000 and SSD prices in freefall, what I'm seeing (including in our shop) is that it is a LOT easier to justify throwing massive resources at it. An E5-1650 and mainboard, $850; 128GB RAM, $950; 2 x 512GB Samsung 950 PRO for L2ARC, $600; that works out to less than $2500. Unimaginable a few years back. I suspect that people actually throwing resources at the issue to make it better also plays at least as large a role in the reduction of apparent "performance issues."
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Holy cow, I just learned more about ZFS and fragmentation. I really like this forum.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
I'm sure he's got one that will do the trick. (j/k) After all those drive spin up measurements, I'm sure he'll come up with something.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
An oscilloscope isn't the answer to everything, but almost... :D
 