ZFS "ARC" doesn't seem that smart...

Joined
Oct 22, 2019
Messages
3,579
I find that the number did surpass 4GiB, but nowhere close to 8GiB.

I wonder if it reports a different value with...

arc_summary | grep "Metadata cache size (current)"

...during your tests?

If I strictly use full rsync crawls (directory tree crawls for the entire datasets) for three different datasets (used by three different clients), and navigation via the SMB shares, then I appear to never exceed the following values:

Code:
kstat.zfs.misc.arcstats.metadata_size: 2296452608 (2.14 GiB)

Metadata cache size (current): 3.2 GiB


At no point do I witness/experience any aggressive metadata eviction, and my tests continue to be snappy (rsync crawls, directory tree listings, SMB browsing). All the while my arc.meta_min is still set to 4294967296 (4.0 GiB).

It's "working as advertised", in that until metadata exceeds 4.0 GiB in the ZFS ARC, there is no aggressive metadata eviction by userdata.

This is why I haven't yet tried increasing the threshold, since I haven't needed to.

Technically, I could just go ahead and set it to 8 GiB, under the assumption it will never be reached, while always giving metadata higher priority over userdata in the ARC.
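Since the thresholds discussed here are raw byte values, here's a quick sketch of the conversion (the sysctl name shown is the CORE one mentioned later in this thread):

```shell
# GiB -> bytes for the meta_min threshold
echo $((4 * 1024 * 1024 * 1024))   # 4 GiB = 4294967296
echo $((8 * 1024 * 1024 * 1024))   # 8 GiB = 8589934592
# Applied on TrueNAS CORE, this would look like:
# sysctl vfs.zfs.arc.meta_min=8589934592
```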

(Your recent test is interesting! That's why I'm wondering what would happen if you retry it, but this time monitor the other value in the command shared.)

arc_summary | grep "Metadata cache size (current)"


ran the same "check files numbers/size" over NFS.
I'm seriously wondering if doing this (whether over SMB/NFS) requires extra metadata to be read into memory, above and beyond what is required for rsync tasks, directory crawls, and browsing.
 
Last edited:

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Sure,
I've tested the 'filechecking' until I found the cache size started to shrink.
Also did a
Code:
time rsync -r --info=progress2 --dry-run <path> <path>
to confirm the times did not improve beyond 67 sec. Maybe because the server has not been restarted in a while - these tests have effectively primed the system far better than I realized.

Here's some stats for you to compare:
Code:
while true; do echo "===== `date`"; sysctl -a | grep kstat.zfs.misc.arcstats.metadata_size; (arc_summary | grep "Metadata cache size (current)") | tr -s ' ' && sleep 10; done


Code:
A somewhat edited timeline; for the first 20% scanned.
===== Sat May 28 13:05:12 CEST 2022                                                                                                                        
kstat.zfs.misc.arcstats.metadata_size: 4141617664                                                                                                          
 Metadata cache size (current): 21.9 % 5.1 GiB                                                
===== Sat May 28 13:05:53 CEST 2022                                                                                                                        
kstat.zfs.misc.arcstats.metadata_size: 4249685504                                                                                                          
 Metadata cache size (current): 22.5 % 5.2 GiB                                                                                                             
===== Sat May 28 13:06:04 CEST 2022                                                                                                                        
kstat.zfs.misc.arcstats.metadata_size: 4300497408                                                                                                          
 Metadata cache size (current): 22.7 % 5.3 GiB                                                             
===== Sat May 28 13:06:24 CEST 2022

kstat.zfs.misc.arcstats.metadata_size: 4404800000                                                                                                          
 Metadata cache size (current): 23.2 % 5.4 GiB                                                                                                             
===== Sat May 28 13:06:34 CEST 2022                                                                                                                        
kstat.zfs.misc.arcstats.metadata_size: 4491397632                                                                                                          
 Metadata cache size (current): 23.7 % 5.5 GiB                                                                                                             
===== Sat May 28 13:06:45 CEST 2022                                                                                                                        
kstat.zfs.misc.arcstats.metadata_size: 4557889024                                                                                                          
 Metadata cache size (current): 24.4 % 5.7 GiB                                                                                                             
===== Sat May 28 13:06:55 CEST 2022                                                                                                                        
kstat.zfs.misc.arcstats.metadata_size: 4621641728                                                                                                          
 Metadata cache size (current): 25.2 % 5.8 GiB                                                                                                             
===== Sat May 28 13:07:05 CEST 2022                                                                                                                        
kstat.zfs.misc.arcstats.metadata_size: 4653271040                                                                                                          
 Metadata cache size (current): 25.5 % 5.9 GiB
 


Here's the next part, where I increased the interval to 30sec, and let the scan continue for a while longer.
My hopes here were to see something closer to 8GiB, or at the least, not see it shrink before being completed.
Clearly there is some cache thrashing going on, as the second run from the beginning is far slower than the initial run. An estimation would say it'll take the entire day to complete, if not more at this pace.

The statistics remain approximately the same:
Code:
===== Sat May 28 13:24:58 CEST 2022
kstat.zfs.misc.arcstats.metadata_size: 4899728384
 Metadata cache size (current): 26.8 % 6.2 GiB
===== Sat May 28 13:28:30 CEST 2022
kstat.zfs.misc.arcstats.metadata_size: 4842187264
 Metadata cache size (current): 27.0 % 6.3 GiB
===== Sat May 28 13:29:00 CEST 2022
kstat.zfs.misc.arcstats.metadata_size: 4844483072
 Metadata cache size (current): 27.1 % 6.3 GiB


I decided to quit waiting and stop thrashing on the drives.
Next thing I'll do is play around with an L2ARC, see what other statistics I can unlock, and find out what's going on - either with the L2ARC in normal mode, or with secondarycache=metadata.

It'll have to wait a few weeks, parts for my new build are still due.
Plus, I'm getting convinced to play with L2ARC.

edit:
arc_summary:
ARC status: HEALTHY
Memory throttle count: 0

ARC size (current): 74.6 % 23.1 GiB
Target size (adaptive): 77.4 % 23.9 GiB
Min size (hard limit): 3.2 % 1021.6 MiB
Max size (high water): 30:1 30.9 GiB
Most Frequently Used (MFU) cache size: 34.9 % 7.4 GiB
Most Recently Used (MRU) cache size: 65.1 % 13.8 GiB
Metadata cache size (hard limit): 75.0 % 23.2 GiB
Metadata cache size (current): 27.5 % 6.4 GiB
Dnode cache size (hard limit): 34.5 % 8.0 GiB
Dnode cache size (current): 10.5 % 862.9 MiB

ARC hash breakdown:
Elements max: 1.7M
Elements current: 89.7 % 1.6M
Collisions: 14.4M
Chain max: 7
Chains: 228.0k

ARC misc:
Deleted: 36.4M
Mutex misses: 380
Eviction skips: 1.7M

ARC total accesses (hits + misses): 5.6G
Cache hit ratio: 99.7 % 5.6G
Cache miss ratio: 0.3 % 19.7M
Actual hit ratio (MFU + MRU hits): 99.6 % 5.6G
Data demand efficiency: 93.7 % 22.9M
Data prefetch efficiency: 4.7 % 10.6M

Cache hits by cache type:
Most frequently used (MFU): 99.0 % 5.6G
Most recently used (MRU): 0.9 % 52.9M
Most frequently used (MFU) ghost: < 0.1 % 2.4M
Most recently used (MRU) ghost: < 0.1 % 813.4k

Cache hits by data type:
Demand data: 0.4 % 21.4M
Demand prefetch data: < 0.1 % 495.6k
Demand metadata: 98.3 % 5.5G
Demand prefetch metadata: 1.3 % 72.3M

Cache misses by data type:
Demand data: 7.4 % 1.4M
Demand prefetch data: 51.3 % 10.1M
Demand metadata: 17.5 % 3.5M
Demand prefetch metadata: 23.8 % 4.7M

DMU prefetch efficiency: 12.8M
Hit ratio: 77.2 % 9.9M
Miss ratio: 22.8 % 2.9M

L2ARC not detected, skipping section
 
Joined
Oct 22, 2019
Messages
3,579
My hopes here were to see something closer to 8GiB, or at the least, not see it shrink before being completed.
Clearly there is some cache thrashing going on, as the second run from the beginning is far slower than the initial run. An estimation would say it'll take the entire day to complete, if not more at this pace.

Something about the "filechecking" must be causing such thrashing and strange results.

I'm curious if you start fresh, but this time only run repeated rsync dry-runs on entire filesystem trees (which will crank up metadata in ARC, as expected), do you notice the following:
  • The initial runs are slow/delayed (expected, since there's nothing in RAM yet)
  • The subsequent runs are faster (reading metadata purely from ARC in RAM) [1]
  • The "Metadata cache size" caps off at a certain amount (approximately), no matter how many dry-runs complete

As to why metadata appears to be aggressively evicted before reaching 8 GiB, I'm not sure.

I'm not able to reproduce your results over an SMB share.

What command are you using against your NFS share for the "filechecking"? Is it via a GUI browser or a command-line?

[1] This depends on your client as well. If your client doesn't keep all the metadata cached in RAM, then the client is the bottleneck, not the TrueNAS server. (Depends on client's OS, RAM, and configuration.)
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
What command are you using against your NFS share for the "filechecking"? Is it via a GUI browser or a command-line?
GUI: mark all folders of the 'major dataset', right-click, and select Properties; size and file/folder counts then start getting calculated.

I'll see when I get around to make a 'cold test'.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
The results are in for your 'standard test'. Freshly rebooted server.
Code:
time rsync -r --info=progress2 --dry-run <path_big_dataset> <path_somewhere>


Run #   Time
1       93m34s
2       1m17s
3       1m16s
4       1m17s   (after 10h cool-down period)

I forgot about logging during the run. Here's what they look like the day after:
Code:
===== Sun May 29 07:57:03 CEST 2022
kstat.zfs.misc.arcstats.metadata_size: 3046209536
 Metadata cache size (current): 15.8 % 3.7 GiB


This clearly works wonders with rsync and the tunable set to 8GB. Judging from the size, I'd probably be just as fine at 4GB.
 
Last edited:
Joined
Oct 22, 2019
Messages
3,579
This clearly works wonders with rsync and the tunable set to 8GB. Judging from the size, I'd probably be just as fine at 4GB.
Same amazing results I'm seeing! :cool: Same immensely improved behavior.

It's not just the benefit from Run #1 to Run #2 (in your table above.) The real consistent benefit is evidenced in Run #4. (Ten-hour cooldown period, yet still in effect!)


This is why I think 4 GiB is a nice sweet spot.

But upon thinking about it, maybe I should change my opinion to:

"Setting the threshold to 4 GiB should be the minimum starting point if you aren't already using the tuneable. Otherwise, based on your RAM and preferences, you can bump it upwards in 2 GiB increments until you settle on what you believe is your own sweet spot."

The truth is, you could technically set the value to 20 GiB, and it will still behave the same as if you set it to 4 GiB: in neither scenario will the totality of metadata in the ARC exceed the threshold [1], so no aggressive eviction will take place. Meanwhile, the "unused" portion of the threshold can still be used for userdata (normal ARC behavior).

[1] Unless of course, you slam your system by checking the properties of an entire humongous NFS share over a Linux client's file browser tool. :tongue:
 
Last edited:

awasb

Patron
Joined
Jan 11, 2021
Messages
399
Just out of curiosity, do you have an L2ARC? (Used additionally ...)
 
Joined
Oct 22, 2019
Messages
3,579
Just out of curiosity, do you have an L2ARC? (Used additionally ...)
Me? Nope. No L2ARC whatsoever.

But if you read through this thread and the other one (referenced in the first post), you’ll see that an L2ARC (even only for metadata) is not likely to resolve this, and might introduce more issues. Especially for systems with under 64 GB of RAM.

For the longest time, and still to this day, the default behavior for ZFS is to aggressively and swiftly evict metadata from the ARC, regardless of usage or system RAM.

The tuneable used in our tests is the magic. :cool: Like releasing the cork of a wine bottle!

I believe from my own tests and others like @Dice, that 4 GiB is a good starting point, and from there it can be adjusted until you find a good threshold (if needed).

The idea of resorting to using a dedicated L2ARC was a "last-ditch effort" to at least load metadata from a faster device, rather than from the spinning HDDs (even if only 30 minutes have passed since it was last used, yet still evicted.) I figured that if it's just going to keep "missing", I might as well have it "miss" but then load from an NVMe rather than spinning HDD. (I figured, if the ARC isn't that smart, I'll just have to come to grips with it and unfortunately resort to adding an L2ARC.)

But this all becomes moot, because the tuneable [1] basically stops ZFS from aggressively and swiftly evicting metadata.

[1] The name of the tuneable is a sysctl variable named vfs.zfs.arc.meta_min for TrueNAS Core. (Still not sure what it's named under SCALE.)
 

MrTP7

Cadet
Joined
May 19, 2022
Messages
3
I would like to set the zfs.arc.meta_min variable in SCALE; what is the correct name there? A post in this thread implied it differed between CORE and SCALE.

Is there a reference available where you can look up tunables for Scale?
System Settings -> Advanced -> Init/Shutdown Scripts

Type: Command
Command: echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_meta_min
When: Pre Init
 

neofusion

Contributor
Joined
Apr 2, 2022
Messages
159
System Settings -> Advanced -> Init/Shutdown Scripts

Type: Command
Command: echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_meta_min
When: Pre Init
Thanks for the suggestion; that appears to be the way to do it for now. It's essentially what I posted earlier, but automatically reapplied after a reboot.
The UI implies it should be possible to set this in the normal tunables window, so maybe it can be seen as a bug - either in the implementation or in the help text.

Oops! I forgot the -w

sysctl -w

Maybe it’s an issue with SCALE’s GUI then? I don’t have SCALE to test this out.
I also see I missed replying to this:
Code:
# sysctl -w zfs.arc.meta_min=4294967296
sysctl: cannot stat /proc/sys/zfs/arc/meta_min: No such file or directory


When trying the official tunable window you get no feedback in the UI but when examining middlewared.log I found this:
Code:
[2022/05/28 22:35:48] (ERROR) middlewared.sysctl_configuration():12 - Failed to get default value of 'zfs.arc.meta_min' : sysctl: cannot stat /proc/sys/zfs/arc/meta_min: No such file or directory
[2022/05/28 22:35:48] (ERROR) middlewared.sysctl_configuration():22 - Failed to set sysctl 'zfs.arc.meta_min' -> '4294967296' : sysctl: cannot stat /proc/sys/zfs/arc/meta_min: No such file or directory

So the same problem.
The SCALE tunables appear unable to set this type of value, or maybe an undocumented prefix is needed.
 

StevenD

Cadet
Joined
Jul 7, 2022
Messages
9
In my case I am using the system as an SMB share. Would it be possible to run a script on boot that traverses the shared directories to get the metadata into ARC automatically? If it matters, I am using SCALE and not CORE at the moment.

I just want to make sure directory traversal/listing and file searches from windows desktops is as fast as possible.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,737
I just want to make sure directory traversal/listing and file searches from windows desktops is as fast as possible.
A metadata special vdev works wonders for this use case. Mirror of two or three SSDs ...
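For illustration, adding one might look like this (pool and device names are placeholders, not from this thread; note that a special vdev attached to a pool containing raidz vdevs cannot be removed later):

```shell
# Hypothetical names: pool "tank", devices ada1/ada2.
# Mirrored special vdev holding the pool's metadata:
zpool add tank special mirror /dev/ada1 /dev/ada2
# Optionally also store small file blocks on it (set per dataset):
# zfs set special_small_blocks=32K tank/dataset
```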
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
or L2ARC, metadata only. It's also "safer" (non-pool-critical) and you don't need mirrors. If the disk fails then ZFS sort of shrugs its shoulders and goes back to using the metadata on the data pool
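A sketch of that setup (names are placeholders); unlike a special vdev, the cache device can be removed again at any time:

```shell
# Hypothetical names: pool "tank", NVMe device nvd0.
zpool add tank cache /dev/nvd0          # add the L2ARC device
zfs set secondarycache=metadata tank    # cache only metadata from this pool
# Reversal, with no impact on pool data:
# zpool remove tank /dev/nvd0
```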
 

awasb

Patron
Joined
Jan 11, 2021
Messages
399
@StevenD: If You want to do it "in software" ... add a cron-job with the following command:

Code:
/bin/ls -lahR /mnt/ > /dev/null &


Execute hourly. Should suffice. (Over here it does with L2ARC.)

Edit: You could even run it every 5min, since execution time will be within seconds.
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
or L2ARC, metadata only. It's also "safer" (non-pool-critical) and you don't need mirrors. If the disk fails then ZFS sort of shrugs its shoulders and goes back to using the metadata on the data pool

This would be my suggested approach over special/metadata vdevs for the reasons of redundancy and easy reversal. If you attach a special device to a pool with a raidz vdev, it's there for good due to restrictions on device removal. Special vdevs are good for when you need to write metadata quickly, but have some limitations - most of this thread is about ensuring the metadata is read quickly, and that can be done with less permanent impact.

@StevenD: If You want to do it "in software" ... add a cron-job with the following command:

Code:
/bin/ls -lahR /mnt/ > /dev/null &


Execute hourly. Should suffice. (Over here it does with L2ARC.)

Edit: You could even run it every 5min, since execution time will be within seconds.

I don't know that it needs to be as frequent as that, with the introduction of the metadata minimum tunable - once it's been warmed into cache, it shouldn't be ejected from it. I'd start with it as an on-boot script, and then look to see if the metadata amount is shrinking.
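A minimal sketch of such a check on SCALE, assuming the standard Linux OpenZFS arcstats path (the same metadata_size kstat monitored earlier in this thread):

```shell
#!/bin/sh
# Print a timestamped ARC metadata size in bytes; run from cron (or a loop)
# after the on-boot warm-up script to watch whether the number shrinks.
ARCSTATS=/proc/spl/kstat/zfs/arcstats
SIZE=$(awk '$1 == "metadata_size" {print $3}' "$ARCSTATS")
echo "$(date '+%F %T') metadata_size: $SIZE"
```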
 

NateroniPizza

Dabbler
Joined
Dec 19, 2022
Messages
14
or L2ARC, metadata only. It's also "safer" (non-pool-critical) and you don't need mirrors. If the disk fails then ZFS sort of shrugs its shoulders and goes back to using the metadata on the data pool
This would be my suggested approach over special/metadata vdevs for the reasons of redundancy and easy reversal. If you attach a special device to a pool with a raidz vdev, it's there for good due to restrictions on device removal. Special vdevs are good for when you need to write metadata quickly, but have some limitations - most of this thread is about ensuring the metadata is read quickly, and that can be done with less permanent impact.



I don't know that it needs to be as frequent as that, with the introduction of the metadata minimum tunable - once it's been warmed into cache, it shouldn't be ejected from it. I'd start with it as an on-boot script, and then look to see if the metadata amount is shrinking.

I keep seeing metadata-only L2ARC being offhandedly thrown around as a safe replacement for metadata VDEVs, but I've never seen anyone actually post comparisons about their performance or behavior. I've tried both over the last week or so (switching back and forth a few times, rebuilding the pool as necessary), and I've not been able to get L2ARC to work as an actual replacement for a metadata special VDEV. Note that I've got a pair of Optane P1600X drives I had purchased specifically for the purpose of a metadata special VDEV mirror, but given the advantages I've more recently read (safer + ability to remove it), I'd really like to get it working as metadata-only L2ARC if at all possible.

When I do a "/bin/ls -lahR /mnt/", I get a sub-2 minute list time when I have a metadata special VDEV set up. However, regardless of how I configure it, I cannot get the persistent L2ARC to get under 18 minutes on a fresh boot (subsequent runs before the next reboot are extremely fast, of course).

I've tried both secondarycache=all and secondarycache=metadata.

I have both l2arc_noprefetch=0 and l2arc_headroom=0. Also, as I am running TrueNAS Scale, l2arc_rebuild_enabled is enabled by default.

I've verified after a reboot that these settings took, using arc_summary and zfs get secondarycache {poolname}.

How does one get this so-often-recommended metadata special VDEV-alternative to actually work as a metadata special VDEV alternative? It's very possible I'm doing something wrong, but given how often this is mentioned without any caveats, I would have thought it would work without any additional configuration.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,903
When I do a "/bin/ls -lahR /mnt/", I get a sub-2 minute list time when I have a metadata special VDEV set up. However, regardless of how I configure it, I cannot get the persistent L2ARC to get under 18 minutes on a fresh boot (subsequent runs before the next reboot are extremely fast, of course).
I just stumbled over this thread and have therefore only read the posting from the last page. So please ignore this, if it has already been addressed ...

Is it possible that somehow the persistent L2ARC is not that persistent? At least that is what your description indicates to me.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I keep seeing metadata-only L2ARC being offhandedly thrown around as a safe replacement for metadata VDEVs, but I've never seen anyone actually post comparisons about their performance or behavior. I've tried both over the last week or so (switching back and forth a few times, rebuilding the pool as necessary), and I've not been able to get L2ARC to work as an actual replacement for a metadata special VDEV.

I would note that a metadata-only L2ARC is still first and foremost an L2ARC device, and metadata-only is just a tweak.

One of the common mistakes made with L2ARC is a failure to tune the L2ARC eviction rates. By default, these are a very conservative 8MBytes/period (which may be as often as 5x/second), but you don't really want to be thinking "ok my SSD can write at 560MBytes/sec so I'll jigger it for that!" In practice, high quality L2ARC evictions don't happen at a torrential rate unless you have massive ARC, a massive pool, and insane network. This is going to be even MORE true for metadata-only L2ARC; what you want to do is to be certain that you are tuning such that this stuff does a good job of landing in ARC, and then spool THAT out to L2ARC when ARC is too full of it.

Wait. I've written this L2ARC tuning stuff before.. let me dig it up. Oh! Here we go. This is from my conversation with Cyberjock years ago, when he was new to ZFS. It's also nearly a perfect ten years ago. :smile: Trimmed down to relevant part.

https://www.truenas.com/community/threads/zfs-and-ssd-cache-size-log-zil-and-l2arc.6345/post-49449


So you are limited to certain write speeds to the L2ARC(we're talking a few MB/sec if I remember correctly) ZFS doesn't do much 'read ahead' so even if you start watching a streaming movie don't expect the movie to be dumped into the L2ARC and then the drives to spin down from being idle.

This is controlled by several variables.

vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 1
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_write_max: 8388608

norw: if this is set to 1, it suppresses reads from the L2ARC device if it is being written to.

noprefetch: if this is set to 1, it suppresses L2ARC caching of prefetch buffers

headroom: the number of buffers worth of headroom the L2ARC tries to maintain. If the ARC is under pressure and there's insufficient headroom, the L2ARC may not get some stuff that it would have been good to get.

The rest of this is complicated and works together.

write_max is the maximum size of an L2ARC write. Typically this happens every feed_secs seconds. Do NOT set write_max to a very large number without understanding all of the rest of this.

When the L2ARC is cold and no reads have yet happened, write_max is augmented by write_boost. The theory is that if nothing's being read, it's not disruptive to write at a higher rate.

If feed_again is set to 1, ZFS may actually write to L2ARC as frequently as feed_min_ms; for the default value of 200, that means 5x per second.

So now, as an administrator, you have to use your head and figure this all out. So here's the thing. The 8MB write_max is very conservative. But you can't just say "oh yeah my SSD can write at 475MB/sec! I'll set it to THAT!" An L2ARC is only useful if it's offloading a lot of read activities from your main pool. So an easy call is that it would make no sense to be using more than half its bandwidth for writing. But further, ZFS already allows for automatic bumping up of write speed when the L2ARC is cold through the write_boost mechanism. Also, the feed_again mechanism works to allow multiple feeds per second if there is sufficient demand, so with 200ms, you only need one fifth. So you can safely set this to 1/2 of 1/5 of what your SSD can write at and still have it all work very well; so for a 475MB/sec SSD, you can go for 47.5MB/sec. Probably best to pick a power of two, though, so pick 32MB or 64MB. More does NOT make sense.
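That rule of thumb as arithmetic (475 MB/sec being the example SSD write speed from the post; shell integer math rounds 47.5 down):

```shell
# 1/2 (leave bandwidth for L2ARC reads) of 1/5 (feed_again can fire up to
# 5x/sec at feed_min_ms=200) of the SSD's sequential write speed:
SSD_WRITE_MBPS=475
echo $((SSD_WRITE_MBPS / 2 / 5))   # 47 MB/s -> round to a power of two: 32M or 64M
```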
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
the persistent L2ARC is not that persistent?

Persistent L2ARC is not the same thing as a guarantee that a given object will be in the ARC. A special VDEV for metadata, on the other hand, is a GUARANTEED it-is-going-to-be-here-on-this-vdev sort of thing.

So if you set up a persistent L2ARC but have crap for eviction policy, see my immediately preceding message about l2arc_write_max and friends, you may not get the sort of behaviour out of ARC/L2ARC that you are expecting. Let's say you have a million files on your pool, and you're hoping to use ARC/L2ARC to "speed it up". There's several different cases this devolves down to, and that is also impacted by ARC metadata limits.

Doing this from memory, so excellent opportunities to catch me making a blunder:

kstat.zfs.misc.arcstats.arc_meta_min: 16777216
kstat.zfs.misc.arcstats.arc_meta_max: 19411601120
kstat.zfs.misc.arcstats.arc_meta_limit: 23106580896
kstat.zfs.misc.arcstats.arc_meta_used: 1088806008

So this looks happy-ish, don't mess with meta_min. meta_max is the maximum value that meta_used has been observed to be, and meta_limit is the advisory/soft limit that the ARC system will try not to exceed. So this system isn't bumping up against the limit, and if you watch it for awhile, you can also see the meta_used yo-yo up and down as it periodically runs rsync operations.

kstat.zfs.misc.arcstats.mfu_evictable_metadata: 144566784
kstat.zfs.misc.arcstats.mru_evictable_metadata: 241664
kstat.zfs.misc.arcstats.anon_evictable_metadata: 0

So the evictable stuff is stuff that could qualify to be evicted from the ARC as additional ARC (or memory) is needed, and ideally represents what you want the system to be eyeballing for eviction TO the L2ARC, rather than just freeing the memory and going on its merry way.

kstat.zfs.misc.arcstats.metadata_size: 430560768
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 312012036
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 1254975955
kstat.zfs.misc.arcstats.demand_metadata_misses: 354913690
kstat.zfs.misc.arcstats.demand_metadata_hits: 63093503323

You also want a good ratio of demand hits to misses. Prefetch is a bit hit-or-miss. (boo hiss bad joke).

vfs.zfs.l2arc.rebuild_enabled: 1

Anyways, with this setup, this is able to scan ~7500 directories containing about 181,000 files in about a second or two.

Code:
# /usr/bin/time find ????? -type d -print | wc -l
        0.48 real         0.07 user         0.40 sys
    7466
# /usr/bin/time find ????? -type f -print | wc -l
        1.05 real         0.08 user         0.94 sys
  181215
# /usr/bin/time find ????? -type f -ls | wc -l
        1.66 real         0.34 user         1.28 sys
  181215
#


Admittedly this is a well-sized, well-tuned and also warmed-up setup. But persistent L2ARC rocks when done like this.
 

NateroniPizza

Dabbler
Joined
Dec 19, 2022
Messages
14
I just stumbled over this thread and have therefore only read the posting from the last page. So please ignore this, if it has already been addressed ...

Is it possible that somehow the persistent L2ARC is not that persistent? At least that is what your description indicates to me.
When I boot the system back up, L2ARC's size is the same, if not larger, than before the shutdown. So it at least appears to be remaining in there. I suspect it's just not actually caching what I want.
I would note that a metadata-only L2ARC is still first and foremost an L2ARC device, and metadata-only is just a tweak.

One of the common mistakes made with L2ARC is a failure to tune the L2ARC eviction rates. By default, these are a very conservative 8MBytes/period (which may be as often as 5x/second), but you don't really want to be thinking "ok my SSD can write at 560MBytes/sec so I'll jigger it for that!" In practice, high quality L2ARC evictions don't happen at a torrential rate unless you have massive ARC, a massive pool, and insane network. This is going to be even MORE true for metadata-only L2ARC; what you want to do is to be certain that you are tuning such that this stuff does a good job of landing in ARC, and then spool THAT out to L2ARC when ARC is too full of it.

Wait. I've written this L2ARC tuning stuff before.. let me dig it up. Oh! Here we go. This is from my conversation with Cyberjock years ago, when he was new to ZFS. It's also nearly a perfect ten years ago. :smile: Trimmed down to relevant part.

https://www.truenas.com/community/threads/zfs-and-ssd-cache-size-log-zil-and-l2arc.6345/post-49449




This is controlled by several variables.

vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 1
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_write_max: 8388608

norw: if this is set to 1, it suppresses reads from the L2ARC device if it is being written to.

noprefetch: if this is set to 1, it suppresses L2ARC caching of prefetch buffers

headroom: the number of buffers worth of headroom the L2ARC tries to maintain. If the ARC is under pressure and there's insufficient headroom, the L2ARC may not get some stuff that it would have been good to get.

The rest of this is complicated and works together.

write_max is the maximum size of an L2ARC write. Typically this happens every feed_secs seconds. Do NOT set write_max to a very large number without understanding all of the rest of this.

When the L2ARC is cold and no reads have yet happened, write_max is augmented by write_boost. The theory is that if nothing's being read, it's not disruptive to write at a higher rate.

If feed_again is set to 1, ZFS may actually write to L2ARC as frequently as feed_min_ms; for the default value of 200, that means 5x per second.

So now, as an administrator, you have to use your head and figure this all out. So here's the thing. The 8MB write_max is very conservative. But you can't just say "oh yeah my SSD can write at 475MB/sec! I'll set it to THAT!" An L2ARC is only useful if it's offloading a lot of read activities from your main pool. So an easy call is that it would make no sense to be using more than half its bandwidth for writing. But further, ZFS already allows for automatic bumping up of write speed when the L2ARC is cold through the write_boost mechanism. Also, the feed_again mechanism works to allow multiple feeds per second if there is sufficient demand, so with 200ms, you only need one fifth. So you can safely set this to 1/2 of 1/5 of what your SSD can write at and still have it all work very well; so for a 475MB/sec SSD, you can go for 47.5MB/sec. Probably best to pick a power of two, though, so pick 32MB or 64MB. More does NOT make sense.
I've read through this a couple of times today, and I've not been able to figure out how those will help in this situation (other than the ones I'd already applied). This is currently an entirely unused system (aside from the tests I've been running), so anything relating to caching speed/performance wouldn't come into play here, unless I am mistaken. For example, it's not even being used enough to populate more than 1-2GB of ARC between reboots (even though the last reboot, up until the one I'd just done while writing this reply, had been 18 hours prior). There should be more than ample time for all the data in the ARC to transfer over to L2ARC many times over.

The following are all related to tuning how quickly L2ARC is populated on a busy server, correct?
vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_write_max: 8388608

The two properties you mentioned that affect what data is eligible to go into L2ARC are what I've changed. Here are those two items, and what I understand them to do (do correct me if my understanding of them is wrong):
l2arc_headroom=0
From what I've read on this, this setting expands the portion of ARC eligible for L2ARC to copy information from, to the entirety of the ARC (rather than a small portion at the "tail" of ARC). It should be able to start caching data the instant it shows up in ARC with the value set to 0.

l2arc_noprefetch=0
This setting will make all data in ARC potentially eligible for caching to L2ARC, rather than only marking ARC data that gets used as eligible.

Even with secondarycache=all (and also when it was set to secondarycache=metadata), and the L2ARC having grown larger than ARC is able to grow with the limited usage it has been getting during testing, it is still obviously not caching metadata - upon reboot, a "/bin/ls -lahR /mnt/" takes just as long as if there was no cache whatsoever.


EDIT: Realized this thread is for ARC - starting my own thread to discuss this: https://www.truenas.com/community/t...e-metadata-special-device-replacement.107327/
 
Last edited: