Ahrens: ZFS performance is fine above 80%

Yorick · Mar 28, 2021

Quoth they:

I've seen warnings that you should always stay below 50% and definitely below 80% usage

Click to expand...

I don't agree with those warnings. Lots of people successfully use ZFS all the way up to the 97% utilization limit (which is being raised so that you can get even closer to 100% utilization, see #11023). At work we have some of the most demanding workloads in terms of fragmentation (relational databases with recordsize=8k, compression=on, ashift=9, lots of sync (ZIL) writes with no log device) and we see reasonable performance up to 85-95% capacity.

Interesting.

jgreco · Mar 28, 2021

Not really. It depends on what's acceptable to you. For example, if you can get your working set into ARC+L2ARC, your fragmentation won't affect reads unless you are getting something that isn't in cache. Writes may also seem fast as long as they're not high in volume. I can totally believe that there are places where you could be at 90%+ capacity and feel performance is "reasonable."

But this hinges on the definition of "reasonable." I once had a discussion with one of the iX developers who wanted to equate FreeNAS HDD performance with being "on par with SSD." You can definitely get something that strongly resembles that when you have low occupancy rates and a working set that remains in ARC+L2ARC, and will remain that way regardless of workload. Unfortunately it gets slower once you start nibbling away at those supporting factors.

Ahrens is the same fellow who said "There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem."

That is strictly true but leave out important context. In the ECC comment, Ahrens failed to mention that ZFS doesn't have automatic filesystem repair tools, that the design of the system assumes bad data will not get into a pool, or that having a petabyte ZFS filesystem fail is much more catastrophic than having a UFS, btrfs, NTFS or EXT3 filesystem (none of which can practically handle such large amounts of data) fail. Your recourse for a corrupted ZFS pool is to dump it to a holding area and then restore it (or restore it from backup), but that's onerous when you have a petabyte or more. By way of comparison, you just run fsck on UFS (etc). But of course you're not holding a petabyte of data there.

You can absolutely fill a fresh ZFS pool to 97% if it is being used as an archival (single-time-write) pool and it will have lovely performance for both writes and reads. This is neither shocking nor even news -- your pool will read and write at peak speeds. It's what happens once you start removing files that can be problematic, because once you start to rewrite, you are now needing to search for open space on the pool (instead of it being in a big glob) and fragmentation effects begin.

This is basic computer science stuff. ZFS trades resource riches to gain performance. This is fairly common in compsci.

So this all hinges on the definition of "reasonable." I don't think any one definition is suitable across the board. If you want SSD-like performance that iX has talked about, listening to Ahrens is going to lead you to disappointment, because a ZFS pool that's living in the 90% occupancy zone and doing lots of rewrites gets slower and slower as time goes on. Many who have stored VM data on ZFS have discovered this the hard way.

ornias · Mar 28, 2021

Lowering Ahrens of all people to a "fellow" doesn't do him, his work or his statement justice.

That being said: Fragmentation is also somewhat related to the total storage capacity.
When you have 10% left of a 1TB array, you have 100GB of fragmented free-space.
When you have 10% left of a 100TB array, you have 10TB of fragmented free-space.

The chance of writes fitting nicely into available free space, is a lot more likely in the second scenario.

winnielinnie · Mar 28, 2021

Would these concerns about fragmentation and remaining free space be moot points on a ZFS pool that consists solely of solid-state drives? (No spinning platters, no head seek, etc)

I never understood the warnings about "over-filling" a ZFS pool that makes it unique to ZFS, as opposed to other file-systems. I can understand the impact on performance when using spinning HDDs, but this also applies to NTFS, UFS, ext4, XFS, and anything that performs better when there is more fragment-free available space.

Does ZFS do some under-the-hood defragmentation if it senses the pool is made of spinning HDDs, and thus relocates the physical locations of records so that free space is laid out in large continuous chunks?

ADDENDUM: I'm not advocating to "fill up as much as you can", since there are also other benefits of having substantial unused space. For example, you can copy large amounts of data back-and-forth to force existing data to be re-written with a newly applied compression method. A new, temporary "holding" dataset can be used for this, which is only possible with enough free space. Once you are done, you destroy the temporary dataset.

jgreco · Mar 28, 2021

ornias said:
Lowering Ahrens of all people to a "fellow" doesn't do him, his work or his statement justice.

So, you've got him up on a pedestal. I've seen lots of people who have been put up on pedestals. Generally, they've simply been in a unique position to do something interesting or unusual, but it does not warrant granting them the status of Emperor of Computing or anything like that.

RMS has been shown to be a real ... winner. Hans Reiser was the author of a very promising Linux filesystem that would have replaced EXT. Linus (or Linux fame) is famously quite the arse, although I'm pretty sure Theo de Raadt outdoes him. Don't get me started on McAfee. I could go on for a dozen more entries. Being smart and having done good things does not have any relationship to being right or being a decent human being. (Yes I deliberately picked a bag of jerks.)

By way of comparison, Ahrens is apparently a good guy, but I'm choosing to consider some of the things he says to be slanted, in the way much stuff is slanted by our own biases. I am sure that he's gotten tired of hearing about ECC or 80% over the years, and feels it necessary to defend ZFS. That doesn't make him right, it doesn't make him wrong, but I feel fine with correcting what I view to be a potentially misleading assertion. I explained why, and it turns out that I'm demonstrably correct, "the best kind of correct." ;-) He's a smart fellow but he may be too close to his project to be objective about it. That's fine.

That being said: Fragmentation is also somewhat related to the total storage capacity.
When you have 10% left of a 1TB array, you have 100GB of fragmented free-space.
When you have 10% left of a 100TB array, you have 10TB of fragmented free-space.

The chance of writes fitting nicely into available free space, is a lot more likely in the second scenario.

Not particularly true. It's actually related to the size of blocks that are freed. Whether you have 100GB or 10TB of fragmented free space, if the free space has been created 128KB at a time, and that's randomly spread around the pool, it isn't magically easier to find that free space on the 100TB pool. In fact, it may be more expensive to find the space because if you are trying to allocate a large amount of space, you may have less of the pool's overall metadata cached in ARC (think of it like a "per capita" issue) and you may have to work harder to find the space because you need to inspect more metadata to find free space.

This is sort-of related to the seek sustainability issue that has plagued hard drives as they've gotten larger. It used to be viable to fill a 100MByte HDD seeking randomly to fill it one sector at a time (takes about 3 hrs at 20 IOPS), while it now takes something more like 20000 days to do that same thing on an 18TB drive, which is why hard drives are largely relegated to storage duties with a significant sequential component.

It's possible to contrive scenarios that favor one thing or another, but your described pools suffer from the fixed maximum size of ZFS blocks, so it only works out like you describe in some cases.

Yorick · Mar 28, 2021

winnielinnie said:
I never understood the warnings about "over-filling" a ZFS pool that makes it unique to ZFS, as opposed to other file-systems.

Little shaky on this and I think it’s because of CoW. Writes copy the data, and that means more seeks as the pool gets fragmented.

SSD do better at it, but not perfect. Optane does surprisingly well but doesn’t really do capacity. There’s a post here on the forums about performance curves on different types of SSDs.

Yorick · Mar 28, 2021

jgreco said:
In the ECC comment, Ahrens failed to mention that ZFS doesn't have automatic filesystem repair tools, that the design of the system assumes bad data will not get into a pool, or that having a petabyte ZFS filesystem fail

The context I see there is that anyone who would not run ECC is by definition running a mini system with just a few TB of data. Quite possibly under 50 TB. Now that’s still annoying to restore and I agree ECC is quite desirable even for hobby / home setups.

Someone storing PBs better be using server class hardware, lest their career be cut short. And servers use ECC.

The fervor with which ECC was advocated on these here forums a few years ago did reach a fever pitch, from what I am told by old timers ... though that was one particular person and they’ve moved on.

ornias · Mar 28, 2021

jgreco said:
So, you've got him up on a pedestal.

No, I actually disagree with him on a fair number of subjects. I just don't think it's fair to demote someone who is one of they key filesystem developers of the era to "a fellow". As a mater of fact: he is the specialist and you're, for all intends and purposes, just the "fellow on a forum".

jgreco said:
and it turns out that I'm demonstrably correct

I had a good laugh and hope everyone reading this is going to skim your posts here looking for your bias... wait... that doesn't really require much looking-for actually :')

I'll leave it at that.

winnielinnie · Mar 28, 2021

Yorick said:
Little shaky on this and I think it’s because of CoW. Writes copy the data, and that means more seeks as the pool gets fragmented.

Good point. I was implying new data being written, rather than the context of modifying existing data. I guess it depends on the main usage of your NAS server: if you're using it primarily to make backups and archives, then the overwhelming majority of writes will be new files (rather than modifying existing files.)

Jailer · Mar 28, 2021

Patrick M. Hausen · Mar 28, 2021

Jailer said:
View attachment 46201

Didn't you get your fill with the pfSense discussion, already?

jgreco · Mar 28, 2021

ornias said:
As a mater of fact: he is the specialist and you're, for all intends and purposes, just the "fellow on a forum".

Well, I know we live in a fact-free era, so, I'll just say, "that's nice" and file your opinion where it belongs.

ornias · Mar 28, 2021

Patrick M. Hausen said:
Didn't you get your fill with the pfSense discussion, already?

I think it's more fun to disagree with jgreco though... At least there is something to discuss when you disagree with him. In stark contrast with PFSense, as every sane individual understands the folks at PFSense are absolute asshats.

Newfoundland.Republic · Mar 28, 2021

@joeschmuck - Maybe you should loan @jgreco your avatar to keep @ornias safe?

Spearfoot · Mar 28, 2021

I'll make an analogy here. From studying physics I understand that matter is essentially empty space with a few particles here and there. I can grasp this intellectually. Nevertheless, here in real life I won't be dropping a rock on my toe any time soon. Rocks may indeed be mostly empty space, but it's still going to hurt if I drop one on my toe.

Similarly, though Matt Ahren's statement that ZFS pool utilization can safely exceed 50% is doubtlessly true, or at least true in certain circumstances, nevertheless I will try not to exceed 50% utilization on my own ZFS-based systems, for the same reason that I don't drop rocks on my toes.

jgreco · Mar 28, 2021

ornias said:
I think it's more fun to disagree with jgreco though... At least there is something to discuss when you disagree with him.

Why, thank you, that's appreciated. I like to surround myself with competent and interesting people, and learn as much as I can from them. I also enjoy trying to pass knowledge on, which makes the forums a naturally attractive place to be. I try pretty hard not to get too fossilized in my thinking, because the world does actually change, so the day when there's an interesting forum debate is often the best time to learn something new. You might notice that there are some things I simply don't touch with a ten foot pole, like Windows permissions or AD crap, and that's because I know less of them than many posters here. I find that keeping your mouth shut on topics you're ignorant of is a winning strategy

Plus it probably makes you look smarter than you actually are. I do like to think I have some idea of what I'm talking about when I talk though.

In stark contrast with PFSense, as every sane individual understands the folks at PFSense are absolute asshats.

And the thing about that is that it really ticks me off that I keep forgetting that.

I don't generally make use of pfSense because it's not hard to create packet infrastructure directly out of FreeBSD, but, then again, that's a quarter of a century experience doing that, since one of the reasons I took up 386BSD and FreeBSD was to make cheap high speed SLIP and PPP servers out of 16550's (SUN ALM sucked!!) But the downside is that it has been around so long I keep forgetting about OPNsense.

ornias · Mar 28, 2021

@jgreco That is at least some things we do agree on indeed... brr... Windows AD and group policy hell :O

danb35 · Mar 28, 2021

ornias said:
every sane individual understands the folks at PFSense are absolute asshats

That does explain a lot...

ChrisRJ · Mar 28, 2021

jgreco said:
[..] make cheap high speed SLIP and PPP servers out of 16550's [..]

I'll never forget how my first 14k4 modem failed on my multi I/O card, because the latter only had a 16450. Had to remove and replace it with a 16550, although I was scared enough to solder in a socket only and not the chip. I was still in high school then and these things were rather expensive for me. But god was I proud when it finally worked

jgreco · Mar 29, 2021

ChrisRJ said:
I'll never forget how my first 14k4 modem failed on my multi I/O card, because the latter only had a 16450. Had to remove and replace it with a 16550, although I was scared enough to solder in a socket only and not the chip. I was still in high school then and these things were rather expensive for me. But god was I proud when it finally worked

It's all a lot of fun to look back on, because in those days, a 25-33 MHz machine was considered fairly decent. We had two of these Sun 3/260's, but there was just no good way to attach a small bank of high speed (9600-19200 baud) modems to them. There wasn't a whole lot of buy-in to the very expensive Telebit Trailblazers around here, so when AT&T Teledyne offered their DataPort 14.4's for $222 to "sysops" (instead of $555 list), I ended up adding like half a dozen to the mix to replace slower USR Courier 2400's, but this ended up doing very poorly because serial performance tanked when several UUCP peers would connect. I've still got stuff like one of the BocaBoard 2016's we had from that era floating around here, sixteen 16550's in an external can connected to the host via a cable that carried ISA bus signals...

Important Announcement for the TrueNAS Community.

Ahrens: ZFS performance is fine above 80%

Wizard

Resident Grinch

Wizard

MVP

Resident Grinch

Wizard

Wizard

Wizard

MVP

Not strong, but bad

Hall of Famer

Resident Grinch

Wizard

Guru

He of the long foot

Resident Grinch

Wizard

Hall of Famer

Wizard

Resident Grinch

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Ahrens: ZFS performance is fine above 80%"

Similar threads