ZFS "ARC" doesn't seem that smart...

Joined
Oct 22, 2019
Messages
3,641
Without yet resorting to adding an L2ARC, is there a "tuneable" that I can test which instructs the ARC to prioritize metadata?

I upgraded from 16GB to 32GB ECC RAM.

Yet there is zero change in this behavior.

This keeps happening:
[Attached image: justified-l2arc-or-try-something-else.png]


I run regular rsync tasks from a few local clients, which are very metadata-heavy (not much actual data involved). I assumed that over time ZFS would "intelligently" adjust the ARC to prevent the metadata in question from being evicted every day. Yes, it's a lot of metadata (many files/folders on the dataset), but it's accessed every day, and I would argue that the filesystem's metadata gets more "hits" than some of the random data itself.

I've even been through a couple of weeks of barely touching any files on the NAS server, so the only real usage of the NAS was to list the directory tree with rsync.

Is there some tweak or tuneable to instruct the ARC to prioritize metadata?
Honestly, the NAS server would perform better (in my usage) if it evicted large caches of data blocks to make room for metadata instead.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776

indy

Patron
Joined
Dec 28, 2013
Messages
287
One thing to keep in mind is that the "most frequently used" label is a bit of a misnomer.
The ARC has two queues: "hit once" (MRU) and "hit more than once" (MFU).
So depending on your other workloads, even your daily-accessed (meta-)data might simply get evicted.

You could try changing the following tunables and see if it helps:
zfs_arc_meta_limit_percent (default is already at 75% though)
zfs_arc_dnode_limit_percent (default is 10%)
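If you're not sure of the exact names on your platform, something along these lines should locate them (sysctl names and module parameters can shift between OpenZFS releases, so treat this as a starting point; the Linux paths assume SCALE):
Code:
# CORE/FreeBSD: the module parameters are exposed as sysctls
sysctl -a | grep -E 'arc\.(meta|dnode)'

# SCALE/Linux: the same knobs live under the zfs module parameters
grep -H . /sys/module/zfs/parameters/zfs_arc_meta_limit_percent \
          /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent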

What is the behavior when you run rsync with a cold arc? Does all the metadata fit?
 
Joined
Oct 22, 2019
Messages
3,641
zfs set primarycache=metadata <dataset>

Well color me stoked! I'm going to try this right now. I hadn't even realized it was a per-dataset property.

EDIT: I'm more tentative on second thought, but I'll still test this out. This is an "all or nothing" approach: if I set it to "metadata", it will not cache any user data for that dataset in the ARC at all.
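For anyone following along, it's set and checked per dataset roughly like this (the dataset name is just an example):
Code:
# cache only metadata for this dataset; user data blocks won't be kept in ARC at all
zfs set primarycache=metadata tank/backups

# verify, and revert with primarycache=all if it hurts more than it helps
zfs get primarycache tank/backups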
 
Last edited:
Joined
Oct 22, 2019
Messages
3,641
What is the behavior when you run rsync with a cold arc? Does all the metadata fit?
That would require a reboot, no? I can test this when I get the chance.

You could try changing the following tunables and see if it helps:
zfs_arc_meta_limit_percent (default is already at 75% though)
zfs_arc_dnode_limit_percent (default is 10%)
I'll look into those! Is the assumption that increasing the "arc dnode limit" will yield enough "breathing room" to comfortably fit all the metadata into the ARC without exceeding this (low) default limit?
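In the meantime, I'll check how close I already am to those limits. On CORE the raw numbers should be visible through the kstats, something like this (assuming these kstat names exist on my release):
Code:
# current ARC metadata usage vs. the resolved 75% metadata limit (bytes)
sysctl kstat.zfs.misc.arcstats.arc_meta_used
sysctl kstat.zfs.misc.arcstats.arc_meta_limit

# dnodes currently held in ARC vs. the resolved 10% dnode limit (bytes)
sysctl kstat.zfs.misc.arcstats.dnode_size
sysctl kstat.zfs.misc.arcstats.arc_dnode_limit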
 
Joined
Oct 22, 2019
Messages
3,641
Appears to be a perennial issue with ZFS, even back in 2018 on Solaris:


One of the comments practically describes the very same thing I'm witnessing; the only difference is they're using "find" and I'm using "rsync". However, "find", "rsync", directory crawls, etc., are all metadata-intensive, even if no user data is touched at all.
sleepycal said:
Okay so if I run the find ./ command twice, the first one takes >5 minutes, but the second one executes in a few seconds. But if I wait a few hours, it takes >5 minutes again. I can also see misses incrementing on the >5 minute run. So my best guess is that metadata is being evicted from the ARC. Is there any way to force metadata to take priority over all other cache data?

Seems like ZFS is eager to evict metadata from the ARC (by default), which I find odd. :confused:

I'll play around with some of the mentioned tunables.
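If anyone wants to reproduce that test here, it's essentially just timing two consecutive crawls and checking the ARC counters in between (the path is a placeholder for one of my datasets):
Code:
# first pass: cold metadata, expect a long runtime and climbing ARC misses
time find /mnt/tank/mydata -type f > /dev/null

# second pass: should be mostly ARC hits and finish in seconds
time find /mnt/tank/mydata -type f > /dev/null

# raw hit/miss counters, checked before and after each pass
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses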

I'd prefer not to have to rely on a persistent L2ARC, and as was mentioned in the other thread: having an L2ARC requires RAM to keep track of what is in the L2ARC (i.e., an "index").

This discussion on the OpenZFS GitHub confirms the issue:


devZer0 said:
it's frustrating to see that this problem exists for so long and has no priority for getting fixed.

storing millions of files on zfs and using rsync or other tools (which repeatedly walk down on the whole file tree) is not too exotic use case


UPDATE: The reason this seems to be a ZFS-specific issue is that no such behavior occurs on a laptop with less RAM (16GB) running desktop Linux. Many hours can pass between the same rsync dry-runs, and each time it takes only 7 seconds to complete. There is no aggressive eviction of metadata from RAM. All it takes is one crawl of the entire filesystem (even on a cheap portable external HDD), and the metadata appears to remain cached in RAM for very long periods of time, which makes everything snappier and more efficient.

With ZFS, however, not even 30 minutes pass before it has already evicted the same metadata, slowing everything down.

In regards to rsync, directory listings, and metadata operations: a Linux laptop with only 16GB of RAM and a cheap portable WD Passport external HDD (5400 RPM) connected via USB yields snappier performance than TrueNAS with a mirrored vdev of Red Plus drives and 32GB of RAM.
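For reference, the dry-run I'm timing on both machines is nothing exotic, roughly along these lines (hostname and paths are placeholders):
Code:
# dry-run: walks the entire tree and compares metadata, transfers no file data
time rsync -avn --delete /home/user/data/ backupuser@truenas:/mnt/tank/backups/data/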


Think about that...
 
Last edited:
Joined
Oct 22, 2019
Messages
3,641
Try cranking up arc_meta_min to a reasonable (and then maybe an unreasonable?) amount to tell ZFS not to eject metadata unless it's greater than that value.
As a sysctl tunable, do I enter it as arc_meta_min or zfs_arc_meta_min?

EDIT: It needs to be entered exactly like this: vfs.zfs.arc.meta_min
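To test it live on CORE before making it a permanent sysctl tunable in the GUI, this should work (assuming the OID accepts runtime changes on this release):
Code:
# 4 GiB floor for ARC metadata (value is in bytes); a runtime change like this is lost on reboot
sysctl vfs.zfs.arc.meta_min=4294967296

# confirm it took
sysctl vfs.zfs.arc.meta_min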
 
Last edited:
Joined
Oct 22, 2019
Messages
3,641
Well, well, well, this seems to be the magic trick.

[Attached screenshot: vfs.zfs.arc.meta_min.png]


I added a sysctl tuneable with the exact variable name vfs.zfs.arc.meta_min and a value of 4294967296 bytes (which equals 4 GiB).

From three different clients, I ran rsync dry-runs (no user data involved, pure metadata of many files/folders).

After the first pass, each subsequent pass took only seconds, and my ARC hit ratio no longer suddenly drops. I waited overnight: same results. The rsync dry-runs still complete within seconds, which suggests the metadata remains in the ARC and is not being evicted.
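For the curious, the hit ratio I keep an eye on is just derived from the raw ARC counters, roughly like this:
Code:
# approximate overall ARC hit ratio since boot
sysctl -n kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses \
    | paste - - | awk '{printf "ARC hit ratio: %.1f%%\n", $1 * 100 / ($1 + $2)}'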

I could try lowering the value from 4 GiB to 3 GiB and do more tests, but 4 GiB seems like a comfortable amount with enough breathing room for future use.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Glad that worked. Sorry about the ambiguous tunable name; they differ now between CORE and SCALE.

If you're interested in knowing how much actual metadata is in your ARC live, feel free to query the sysctl kstat.zfs.misc.arcstats.metadata_size
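Something along these lines converts it to GiB (the raw sysctl output is in bytes):
Code:
# metadata currently resident in ARC, converted from bytes to GiB
sysctl -n kstat.zfs.misc.arcstats.metadata_size | awk '{printf "%.2f GiB\n", $1 / (1024 ^ 3)}'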
 
Joined
Oct 22, 2019
Messages
3,641
If you're interested in knowing how much actual metadata is in your ARC live, feel free to query the sysctl kstat.zfs.misc.arcstats.metadata_size
Between the three clients, and while the rsync dry-runs still finish within mere seconds: 2.1 GiB

I'll leave the setting at 4 GiB, since technically this value doesn't "hard reserve" ARC for metadata only, but rather acts as a floor below which metadata won't be evicted from the ARC, correct?

So of my 32 GiB of RAM, this entire time it only needed just over 2 GiB to hold all of my metadata in the ARC, yet it was super eager to evict that metadata for whatever reason. Perusing other forums and mailing lists, it appears I'm not the only one who suffers from this, and others are equally confused about why ZFS still behaves this way.

Glad that worked. Sorry about the ambiguous tunable name; they differ now between CORE and SCALE.
Hey, but it worked! :smile: So big thanks!

I'm still not satisfied overall: I assumed that ZFS / ARC was more "intelligent" and would literally "adapt" over time to make the best use of RAM and cache automatically. The fact that I, as an end-user, have to override its behavior with a tuneable to force it to do something sane is disappointing. It appears this issue is specific to "aggressive metadata eviction", and the ZFS developers haven't fixed it yet (nor do they see it as a real problem), so it has remained this way for years and years. :frown:
 
Joined
Oct 22, 2019
Messages
3,641
As an added bonus rant:

You know what else rubs me the wrong way? All it took was a single tuneable (vfs.zfs.arc.meta_min), and now I have no need for an L2ARC vdev.

Had I not been able to resolve this performance issue, I would have gone a more convoluted route of adding an L2ARC to be used only for metadata.

This entire time, a Linux laptop with 16 GiB of RAM (and another with 12 GiB) kept its filesystem metadata cached for long periods of time and was never the bottleneck: TrueNAS/ZFS was always the bottleneck.
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
FYI - This was a pretty informative thread. I've passed it on to our performance team internally; we will be doing some investigation to see if we should make some adjustments to the out-of-box metadata caching defaults.
 
Joined
Oct 22, 2019
Messages
3,641
FYI - This was a pretty informative thread. I've passed it on to our performance team internally; we will be doing some investigation to see if we should make some adjustments to the out-of-box metadata caching defaults.
I appreciate that, @Kris Moore!

The one caveat is that this is an upstream issue (ZFS / OpenZFS) which has existed for several years, at least.

What I did (based on @HoneyBadger's suggestion) was to manually set a "high enough" threshold as an absolute value of 4 GiB. It works for my system (32 GiB of RAM, with mostly metadata-intensive operations). Now I get consistently good performance, and not just with rsync, but also with other tools like "find", directory crawls/listings, etc.

The problem is that my manual intervention of using an "absolute" value in a sense bypasses the innate "intelligence" of the ARC. I basically told the ZFS ARC: "You're not that smart. I've been using the NAS in the same way, even for months between reboots, and you still haven't figured out that you keep missing hits with the same metadata of the same datasets? Why do you think it's a good idea to evict this same metadata, over and over, when only 30 minutes have passed? Fine then, I'll just use a tuneable to stop you from doing this! Unless my metadata exceeds 4 GiB in the ARC, you must leave it alone and stop with your aggressive eviction!"

So to go back to your comment, Kris: The workaround I used was not really an intelligent "priority" tuneable (which is what I initially searched for, and hope to eventually find if it's possible), but rather a "best guess" minimum reservation.

Ideally, the best tuneable would be one that works with the ARC in a more automatic / intelligent fashion. Something like a "priority" scale, in which you can give higher priority to metadata so that it is less often (and perhaps never) outright evicted in its entirety after only 30 minutes and/or after reading large amounts of user data.

In my case, 4 GiB (and perhaps even 3 GiB) is a sweet spot and works smoothly. However, it might differ based on other users' workloads, patterns, NAS-usage, and total RAM.
 
Joined
Oct 22, 2019
Messages
3,641
Another small update: browsing the SMB shares (of the same datasets on TrueNAS) with a file manager on Linux and Windows is snappier and more consistent. Even a directory with over 14,000 files loads in an instant, every single time. (No delay, no "populating the view", etc.) Prior to adding this tuneable, there would be a noticeable delay in displaying the contents of a directory if I hadn't browsed/navigated the SMB share in the past hour or so.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'll leave the setting at 4 GiB, since technically this value doesn't "hard reserve" ARC for metadata only, but rather acts as a floor below which metadata won't be evicted from the ARC, correct?
Correct. If you don't currently have more than arc_meta_min bytes of metadata, any request to evict metadata gets denied, but it won't prevent the ARC from using that space for data if it's free. It'll just have the effect of squeezing the "data" portion of the ARC over time if you keep adding metadata.
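A quick way to see which side of that floor you're on at any given moment; this is a rough sketch, and the real eviction logic in the ARC is more involved:
Code:
# compare current ARC metadata against the configured floor (both in bytes)
meta=$(sysctl -n kstat.zfs.misc.arcstats.metadata_size)
floor=$(sysctl -n vfs.zfs.arc.meta_min)
if [ "$meta" -le "$floor" ]; then
    echo "metadata ($meta) is under the floor ($floor): metadata eviction requests are denied"
else
    echo "metadata ($meta) is over the floor ($floor): the excess can be evicted under pressure"
fi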

I'm still not satisfied overall: I assumed that ZFS / ARC was more "intelligent" and would literally "adapt" over time to make the best use of RAM and cache automatically. The fact that I, as an end-user, have to override its behavior with a tuneable to force it to do something sane is disappointing. It appears this issue is specific to "aggressive metadata eviction", and the ZFS developers haven't fixed it yet (nor do they see it as a real problem), so it has remained this way for years and years. :frown:
The out-of-the-box tunables in ZFS/OpenZFS are often set very conservatively to avoid breaking things. Getting under the hood can yield some very big improvements, but it's crucial that they aren't applied blindly, as the necessary tuning varies wildly depending on workload/hardware.

FYI - This was a pretty informative thread. I've passed it on to our performance team internally; we will be doing some investigation to see if we should make some adjustments to the out-of-box metadata caching defaults.
The shift to OpenZFS 2.0 brought a good number of changed defaults. It's worth reviewing the sysctls to see what else might need to be overridden with whatever iX thinks is best. For the record, my vfs.zfs.arc_meta_min on a FreeNAS 11.x box is non-zero with no tunables or autotune, so perhaps there is an inherent adjustment in the BSD ZFS that didn't get ported over. Disregard that: I was looking at the kstat; there doesn't appear to be a minimum floor in BSD ZFS.
 
Last edited:

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
That's great to hear it's performing that much better for you. I suspect we'll be taking a look at both potential types of tunings/improvements: either some hard-coded thresholds for the tunables, or perhaps adjusting the internal mechanisms as well.
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
I'm still not satisfied overall: I assumed that ZFS / ARC was more "intelligent" and would literally "adapt" over time to make the best use of RAM and cache automatically. The fact that I, as an end-user, have to override its behavior with a tuneable to force it to do something sane is disappointing. It appears this issue is specific to "aggressive metadata eviction", and the ZFS developers haven't fixed it yet (nor do they see it as a real problem), so it has remained this way for years and years.

This is probably just a case where the ARC algorithm does not work favorably for your use case, without there being an actual defect.
If it were tuned differently, the complaint would be: "The stupid ARC keeps the midnight rsync task cached forever and drops data that users frequently request."
 
Joined
Oct 22, 2019
Messages
3,641
This is probably just a case where the ARC algorithm does not work favorably for your use case, without there being an actual defect.
If it were tuned differently, the complaint would be: "The stupid ARC keeps the midnight rsync task cached forever and drops data that users frequently request."
That's hard to say for certain: if metadata and user data really were treated the same (in regards to retention/eviction in the ARC), the behavior I'm seeing makes very little sense for normal usage. (What I mean is that if "data is data", then it matters not what kind of data it is. The purpose of the ARC is to reduce repeated "misses" and increase "hits", which means better overall performance, especially if the pool is comprised of spinning HDDs. Whether those misses/hits are for metadata or user data, the reads/usage should dictate what is retained in the ARC.)

As I mentioned earlier, there could be long periods where only metadata was ever read, and it would still be evicted in less than an hour. To make room for what? What else desperately needs that 2 GiB (or even less) in the ARC? Especially considering that the ARC is not meant for "one-and-done", rarely accessed random data, but for that which is repeatedly read. If repeated "misses" of the same user data are enough to "promote" that user data in the ARC (so they become "hits"), why doesn't the same logic apply to repeated "misses" of the same metadata?

You can see from the other discussions I linked to that different rsync tasks (touching different metadata) were enough to evict the earlier metadata from the ARC.

It's as if the ZFS developers, in their design of how the ARC adapts, gave very little preference to metadata: a very small floor.

Just a 2 GiB floor is enough to house the metadata for 3 different datasets involving many, many files, with plenty of ARC remaining for literally everything else. Because I made the floor 4 GiB, that still leaves 22-24 GiB of ARC to do as it pleases, without evicting any metadata at all.

What makes it even more peculiar is that doubling my RAM made no difference whatsoever. No one would dispute that doubling your system memory should improve ARC performance for your user data. Yet when it comes to metadata, it's still trigger-happy to evict it as soon as possible, even 30 or 60 minutes later?

So even after going from 16GB to 32GB, the ARC, with all that extra room, is still very aggressive at evicting metadata. My ARC itself is now larger than the total physical memory I had before upgrading the hardware, yet there is not an iota of difference in how long metadata is allowed to stay in the ARC (even after "miss" after "miss" after "miss").

What finally solved this was to use a tuneable that essentially gives the metadata in the ARC a reasonable floor before it starts being evicted when the pressure for user data increases.
 
Last edited:

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
What finally solved this was to use a tuneable that essentially gives the metadata in the ARC a reasonable floor before it starts being evicted when the pressure for user data increases.
What was your final 'reasonable floor'? 4GB? or +16GB?

My data structure is similar to yours, and I'm interested in finding ways to tweak responsiveness and effectiveness of TN too.
I went along and tried your tunable; I set
Code:
vfs.zfs.arc.meta_min
to 4 GB.

Verifying the tunable is in place:
Code:
sysctl -a | grep vfs.zfs.arc.meta_min
vfs.zfs.arc.meta_min: 4294967298


I traversed my directory tree and watched the size of the ARC metadata.
Yet I don't see it growing much. The NFS share directory traversal does feel snappier, but I expected arcstats.metadata_size to at least approach the new floor of 4GB. It does not; it stays at 1.35GB.

Is that to be expected..?

Code:
sysctl -a | grep kstat.zfs.misc.arcstats.metadata_size
kstat.zfs.misc.arcstats.metadata_size: 1354146816
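To see whether it moves at all during a traversal, I also just looped the kstat while crawling the tree (crude, but it does the job):
Code:
# sample ARC metadata size every 5 seconds while the directory crawl runs
while true; do sysctl -n kstat.zfs.misc.arcstats.metadata_size; sleep 5; done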


From here I went on to probe the deeper waters.
The default dnode limit of 10% of the ARC may play a part in the eviction "willingness" on systems with many files.
I'll have a gamble and try increasing this value, to see if there is a measurable difference.

Additionally, I stumbled upon this, which might be of interest to your testing too:
Code:
zfs_arc_dnode_limit_percent
from default 10 to maybe 20?
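Before raising it, I'll check how close the dnodes already are to the resolved limit; something like this should show it (assuming the kstat names match on this release):
Code:
# dnode bytes in ARC vs. the byte value the 10% default resolves to
dnode=$(sysctl -n kstat.zfs.misc.arcstats.dnode_size)
limit=$(sysctl -n kstat.zfs.misc.arcstats.arc_dnode_limit)
echo "$dnode $limit" | awk '{printf "dnodes: %.2f GiB of a %.2f GiB limit (%.1f%%)\n", $1/2^30, $2/2^30, $1*100/$2}'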

Cheers, Dice
 