ZFS "ARC" doesn't seem that smart...

winnielinnie · Apr 10, 2022

Without yet resorting to adding an L2ARC, is there a "tuneable" that I can test which instructs the ARC to prioritize metadata?

I upgraded from 16GB to 32GB ECC RAM.

Yet there is zero change in this behavior.

This keeps happening:

justified-l2arc-or-try-something-else-png.53200

I run regular rsync tasks from a few local clients, which is very metadata heavy (not so much actual data involved). I assumed that over time ZFS would "intelligently" adjust the ARC to prevent the metadata in question from being evicted every day. Yes, it's a lot of metadata (many files/folders on the dataset), but it's accessed every day, and I would argue that the metadata of the filesystem gets more "hits" than some random data itself.

I've even been through a couple weeks of barely touching any files on the NAS server, and thus the only real usage of the NAS is to list the directory tree with rsync.

Is there some tweak or tuneable to instruct the ARC to prioritize metadata? Honestly, the NAS server would perform better (in my usage) if it evicted large caches of data blocks to make room for metadata instead.

Patrick M. Hausen · Apr 10, 2022

winnielinnie said:
Is there some tweak or tuneable to instruct the ARC to prioritize metadata?

zfs set primarycache=metadata <dataset>

indy · Apr 10, 2022

One thing to keep in mind is that the "most frequently used" denomination is a bit of a misnomer.
The ARC has 2 queues: "hit once" and "hit more than once".
So depending on your other workloads even your daily accessed (meta-)data might simply get evicted.

You could try changing the following tunables and see if it helps:
zfs-arc-meta-limit-percent (default is already at 75% though)
zfs_arc_dnode_limit_percent (default is 10%)

What is the behavior when you run rsync with a cold arc? Does all the metadata fit?

winnielinnie · Apr 10, 2022

Patrick M. Hausen said:
zfs set primarycache=metadata <dataset>

Well color me stoked! I'm going to try this right now. I hadn't even realized it was a per-dataset property.

EDIT: Might be tentative on second thought; but I'll still test this out. This is an "all or nothing" approach. Meaning that if I set it to "metadata", it will not cache any user data in the ARC, at all.

winnielinnie · Apr 10, 2022

indy said:
What is the behavior when you run rsync with a cold arc? Does all the metadata fit?

That would require a reboot, no? I can test this when I get the chance.

indy said:
You could try changing the following tunables and see if it helps:
zfs-arc-meta-limit-percent (default is already at 75% though)
zfs_arc_dnode_limit_percent (default is 10%)

I'll look into those! Is the assumption that increasing the "arc dnode limit" will yield enough "breathing room" to comfortably fit all the metadata into the ARC without exceeding this (low) default limit?

winnielinnie · Apr 10, 2022

Appears to be a perennial issue with ZFS, even back in 2018 on Solaris:

ARC metadata is evicting too early on Solaris ZFS

I have a Solaris 11.2 server running ZFS with the following configuration; 6x 4TB HDDs in raidz2 (approx 14TB usable) 16GB RAM (ECC) E5-2670 (16 cores) No ARC or L2ARC No zfs settings tweaks ...

stackoverflow.com

One of the comments practically describes the very same thing I'm witnessing; the only difference is they're using "find" and I'm using "rsync". However, "find", "rsync", directory crawls, etc, are all metadata intensive, even if no user data is touched at all.

sleepycal said:
Okay so if I run the find ./ command twice, the first one takes >5 minutes, but the second one executes in a few seconds. But if I wait a few hours, it takes >5 minutes again. I can also see misses incrementing on the >5 minute run. So my best guess is that metadata is being evicted from the ARC. Is there any way to force metadata to take priority over all other cache data?

Seems like ZFS is eager to evict metadata from the ARC (by default), which I find odd.

I'll play around with some of the mentioned tunables.

I'd prefer not to have to rely on a persistent L2ARC, and as it was mentioned in the other thread: Having an L2ARC requires RAM to keep track of what is in the L2ARC (i.e, an "index").

This discussion on the OpenZFS GitHub confirms the issue:

Data throughput causing apparent (directory) metadata eviction with metadata_size << arc_meta_min · Issue #10508 · openzfs/zfs

System information Type Version/Name Distribution Name Proxmox VE (some configs below were repeated on Ubuntu 20.04 with the same result) Distribution Version 6.2 and 6.4-1 Linux Kernel (tested acr...

github.com

devZer0 said:
it's frustrating to see that this problem exists for so long and has no priority for getting fixed.

storing millions of files on zfs and using rsync or other tools (which repeatedly walk down on the whole file tree) is not too exotic use case

UPDATE: The reason this seems to be a ZFS-exclusive issue is because no such behavior occurs on a laptop with less RAM (16GB) running desktop Linux. Many hours can pass between the same rsync dry-run, and each time it only takes 7 seconds to complete. There is no aggressive eviction of metadata from RAM. All it takes is one crawl of the entire filesystem (even a cheap portable external HDD), and it appears to remain cache'd in RAM for very long periods of time; which makes everything snappier and more efficient.

With ZFS, however, not even 30 minutes will pass and it already evicts the same metadata, slowing everything down.

In regards to rsync, directory listings, and metadata operations:
On a Linux laptop with only 16GB of RAM, with a cheap portable WD passport external HDD (5400 RPM) connected via USB, it yields snappier performance than TrueNAS with a mirrored vdev of Red Plus drives and 32GB of RAM.

Think about that...

HoneyBadger · Apr 10, 2022

Try cranking up arc_meta_min to a reasonable (and then maybe an unreasonable?) amount to tell ZFS not to eject metadata unless it's greater than that value.

winnielinnie · Apr 10, 2022

HoneyBadger said:
Try cranking up arc_meta_min to a reasonable (and then maybe an unreasonable?) amount to tell ZFS not to eject metadata unless it's greater than that value.

As a sysctl tunable, do I enter it as arc_meta_min or zfs_arc_meta_min

EDIT: It needs to be entered exactly as this: vfs.zfs.arc.meta_min

winnielinnie · Apr 11, 2022

Well, well, well, this seems to be the magic trick.

I added a sysctl tuneable with the exact variable name of vfs.zfs.arc.meta_min with a value of 4294967296 bytes (which equals 4 GiB).

From three different clients, I ran rsync dry-runs (no user data involved, pure metadata of many files/folders.)

After the first time, each subsequent pass took only seconds, and my ARC hit ratio doesn't suddenly drop. I waited over night, and same results. The speed at which the rsync dry-run completes is within seconds, alluding to the metadata remaining in ARC and not being evicted.

I could try lowering the value from 4 GiB to 3 GiB, and do more tests, but 4 GiB seems like a comfortable amount with enough breathing room for future uses.

HoneyBadger · Apr 11, 2022

Glad that worked. Sorry about the ambiguous tunable name, they differ now between CORE and SCALE.

If you're interested in knowing how much actual metadata is in your ARC live feel free to query the sysctl kstat.zfs.misc.arcstats.metadata_size

winnielinnie · Apr 11, 2022

HoneyBadger said:
If you're interested in knowing how much actual metadata is in your ARC live feel free to query the sysctl kstat.zfs.misc.arcstats.metadata_size

Between the three clients, and while the rsync dry-runs still finish within mere seconds: 2.1 GiB

I'll leave the setting to 4 GiB, since technically it's not like this value "hard reserves" ARC for metadata only, but rather uses it as a threshold of when to start evicting from the ARC, correct?

So of my 32 GiB of RAM, this entire time it only needed just above 2 GiB to hold all of my metadata in ARC, yet it was super eager to evict the metadata for whatever reason. Perusing other forums and mailing lists, it appears I'm not the only one who suffers from this; and others are equally confused why ZFS still behaves in this way.

HoneyBadger said:
Glad that worked. Sorry about the ambiguous tunable name, they differ now between CORE and SCALE.

Hey, but it worked!

So big thanks!

I'm still not satisfied overall: I assumed that ZFS / ARC was more "intelligent" and would literally "adapt" over time to make the best use of RAM and cache automatically. The fact that I, as an end-user, have to override its behavior with a tuneable to force it do to something sane is disappointing. It appears this issue is specific to "aggressive metadata eviction", and the ZFS developers haven't yet fixed it (nor see it as a real problem), and thus it remains this way for years and years.

winnielinnie · Apr 11, 2022

As an added bonus rant:

You know what else rubs me the wrong way? All it took was a single tuneable (vfs.zfs.arc.meta_min), and now I have no need for an L2ARC vdev.

Had I not been able to resolve this performance issue, I would have gone a more convoluted route of adding an L2ARC to be used only for metadata.

This entire time, a Linux laptop with 16 GiB or RAM (and another with 12 GiB), keep their filesystem metadata cache'd for long periods of time, and were never the bottleneck: it was always TrueNAS/ZFS as the bottleneck.

Kris Moore · Apr 11, 2022

FYI - This was a pretty informative thread. I've passed it on to our performance team internally, we will be doing some investigation to see if we should make some adjustments to the out of box metadata caching defaults out of box.

winnielinnie · Apr 11, 2022

Kris Moore said:
FYI - This was a pretty informative thread. I've passed it on to our performance team internally, we will be doing some investigation to see if we should make some adjustments to the out of box metadata caching defaults out of box.

I appreciate that, @Kris Moore!

The one caveat is that this is an upstream issue (ZFS / OpenZFS) which has existed for several years, at least.

What I did (based on @HoneyBadger's suggestion) was to manually set a "high enough" threshold of an absolute value of 4 GiB. It works for my system (32 GiB of RAM, with mostly metadata intensive operations). Now I get consistently good performance, and not just with rsync, but also other tools like "find" and directory crawls/listings, etc.

The problem is that my manual intervention of using an "absolute" value in a sense bypasses the innate "intelligence" of the ARC. I basically told the ZFS ARC: "You're not that smart. I've been using the NAS in the same way, even for months between reboots, and you still haven't figured out that you keep missing hits with the same metadata of the same datasets? Why do you think it's a good idea to evict this same metadata, over and over, when only 30 minutes have passed? Fine then, I'll just use a tuneable to stop you from doing this! Unless my metadata exceeds 4 GiB in the ARC, you must leave it alone and stop with your aggressive eviction!"

So to go back to your comment, Kris: The workaround I used was not really an intelligent "priority" tuneable (which is what I initially searched for, and hope to eventually find if it's possible), but rather a "best guess" minimum reservation.

Ideally, the best tuneable would be one that works with the ARC in a more automatic / intelligent fashion. Something like a "priority" scale, in which you can give higher priority to metadata so that it is less often (and perhaps never) outright evicted in its entirety after only 30 minutes and/or after reading large amounts of user data.

In my case, 4 GiB (and perhaps even 3 GiB) is a sweet spot and works smoothly. However, it might differ based on other users' workloads, patterns, NAS-usage, and total RAM.

winnielinnie · Apr 11, 2022

Another small update: Browsing the SMB shares with a file manager in Linux and Windows (of the same datasets on TrueNAS) is snappier and more consistent. Even a directory with over 14,000 files loads in an instant; every single time. (No delay, no "populating the view", etc.) Prior to adding this tuneable, there would be a noticeable delay in displaying the contents of the directory if I hadn't browsed/navigated the SMB share in the past hour or so.

HoneyBadger · Apr 11, 2022

winnielinnie said:
I'll leave the setting to 4 GiB, since technically it's not like this value "hard reserves" ARC for metadata only, but rather uses it as a threshold of when to start evicting from the ARC, correct?

Correct. If you don't current have more than arc_meta_min bytes of data, any request to evict metadata gets denied, but it won't prevent ARC from using that space if it's free. It'll just have the effect of squeezing the "data" portion of ARC over time if you keep adding metadata.

winnielinnie said:
I'm still not satisfied overall: I assumed that ZFS / ARC was more "intelligent" and would literally "adapt" over time to make the best use of RAM and cache automatically. The fact that I, as an end-user, have to override its behavior with a tuneable to force it do to something sane is disappointing. It appears this issue is specific to "aggressive metadata eviction", and the ZFS developers haven't yet fixed it (nor see it as a real problem), and thus it remains this way for years and years.

The out-of-the-box tunables on ZFS/OpenZFS are often set very conservatively to avoid breaking things. Getting under the hood can yield some very big improvements, but it's crucial that they aren't applied blindly as the necessary tuning varying wildly depending on workload/hardware.

Kris Moore said:
FYI - This was a pretty informative thread. I've passed it on to our performance team internally, we will be doing some investigation to see if we should make some adjustments to the out of box metadata caching defaults out of box.

The shift to OpenZFS 2.0 brought a good amount of changed defaults. It's worth reviewing the sysctls to see what else might need to be overridden with whatever iX thinks is best. ~~For the record my vfs.zfs.arc_meta_min on a FreeNAS 11.x box is non-zero with no tunables or autotune, so perhaps there is an inherent adjustment in the BSD ZFS that didn't get ported over.~~ Disregard that, I was looking at the kstat, there doesn't appear to be a minimum floor in BSD ZFS.

Kris Moore · Apr 11, 2022

Thats great to hear its performing that much better for you. I suspect we'll be taking a look at both potential types of tunings / improvements, either some hard coded thresholds of tunables, or perhaps adjusting the internal mechanisms as well.

indy · Apr 11, 2022

winnielinnie said:
I'm still not satisfied overall: I assumed that ZFS / ARC was more "intelligent" and would literally "adapt" over time to make the best use of RAM and cache automatically. The fact that I, as an end-user, have to override its behavior with a tuneable to force it do to something sane is disappointing. It appears this issue is specific to "aggressive metadata eviction", and the ZFS developers haven't yet fixed it (nor see it as a real problem), and thus it remains this way for years and years.

This is probably just a case where the ARC algorithm does not work favorably for your use case, but without an actual defect.
If it was tuned differently: "The stupid ARC keeps the midnight rsync task cached forever and drops data that users frequently request"

winnielinnie · Apr 11, 2022

indy said:
This is probably just a case where the ARC algorithm does not work favorably for your use case, but without an actual defect.
If it was tuned differently: "The stupid ARC keeps the midnight rsync task cached forever and drops data that users frequently request"

That's hard to say for certain, since if metadata vs user data was treated the same (in regards to retention/eviction in the ARC), then it makes very little sense based on normal usage. (What I mean is that if "data is data", then it matters not what kind of data it is. The purpose of the ARC is to reduce and correct repeated "misses" and increase "hits", which means better overall performance, especially if the pool is comprised of spinning HDDs. Whether these misses/hits are metadata or user data, the reads/usage should dictate what is retained in the ARC.)

As I mentioned earlier, there could be long periods where only metadata was ever read, and it will still be evicted in less than an hour to make room for what? What else desperately needs that 2 GiB (or even less) in the ARC? Especially considering that ARC is not meant for the "one-and-done" rarely accessed random data, but that which is repeatedly read. If one would assume that after so many "misses" of the same user data will "promote" that very user data in the ARC (which become "hits" now), why not the same algorithm for repeated "misses" of the same metadata?

You can see from the other discussions I linked to, it was demonstrated that different rsync tasks (using different metadata) would be enough to evict the former metadata from the ARC.

It's as if the ZFS developers in their design of how the ARC adapts gives very little preference for metadata: a very small floor.

Just a 2 GiB floor is enough to house the metadata for 3 different datasets, of which many, many files are involved, with still plenty of ARC remaining for literally everything else. Because I made the floor 4 GiB, that leaves 22-24 GiB of ARC to do as it pleases, without evicting any metadata at all.

What makes it even more peculiar is that by doubling my RAM, it made no difference whatsoever. No one would argue that you wouldn't see any performance improvements in regards to the ARC and your user data if you doubled your system memory. Yet when it comes to metadata, it still is trigger-happy to evict it as soon as possible? Even 30 or 60 minutes later?

So going from 16GB to 32GB, the extra room for the ARC to use is still very aggressive at evicting metadata. My ARC itself is now larger than the total capacity of physical memory I had before upgrading the hardware, but not an iota of difference in terms of allowing metadata to stay in the ARC for more than an hour's time (even after "miss" after "miss" after "miss").

What finally solved this was to use a tuneable that essentially gives the metadata in the ARC a reasonable floor before it starts being evicted when the pressure for user data increases.

Dice · May 25, 2022

winnielinnie said:
What finally solved this was to use a tuneable that essentially gives the metadata in the ARC a reasonable floor before it starts being evicted when the pressure for user data increases.

What was your final 'reasonable floor'? 4GB? or +16GB?

My data structure is similar to yours, and I'm interested in finding ways to tweak responsiveness and effectiveness of TN too.
I went along and tried your tunable;
I added the

Code:

vfs.zfs.arc.meta_min

to 4GB.

Verifying the tunable is in place:

Code:

sysctl -a | grep vfs.zfs.arc.meta_min
vfs.zfs.arc.meta_min: 4294967298

I traversed my directory tree and watched the size of arc metadata.
Yet I don't seem to find it growing particularly much. I felt the NFS share directory traversal to become snappier, but expected to find arcstats.metadata_size to have at least approached the new floor of 4GB. But it does not, it stays at 1.35GB.

Is that to be expected..?

Code:

sysctl -a | grep kstat.zfs.misc.arcstats.metadata_size
kstat.zfs.misc.arcstats.metadata_size: 1354146816

From here I went on to check the deep waters.
The default dnode limit of 10% of ARC may have part in the eviction "willingness" on systems with many man files.

Module Parameters — OpenZFS documentation

openzfs.github.io

I'd have a gamble and try to increase this value, to see if there is a measurable difference.

Additionally, I stumbled upon this, which might be of interest to your testing too:

Module Parameters — OpenZFS documentation

openzfs.github.io

Code:

zfs_arc_dnode_limit_percent

from default 10 to maybe 20?

Cheers, Dice

Important Announcement for the TrueNAS Community.

ZFS "ARC" doesn't seem that smart...

MVP

Hall of Famer

Patron

MVP

MVP

MVP

actually does care

MVP

MVP

actually does care

MVP

MVP

SVP of Engineering

MVP

MVP

actually does care

SVP of Engineering

Patron

MVP

Wizard

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "ZFS "ARC" doesn't seem that smart..."

Similar threads