Impact of sVDEV vs. L2ARC on rsync performance

Constantin · Vampire Pig · Joined May 19, 2017 · Messages: 1,829
Good evening,

A few years ago, I looked into the impact of L2ARC on rsync performance with FreeNAS. Back then (i.e. FreeNAS 11.x and earlier), the L2ARC was not persistent, and for my use case (metadata only) it usually took about three passes before the L2ARC got "hot". (As of TrueNAS 12, the L2ARC can be made persistent - see here.)
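For anyone who wants to verify that on their own system, persistence is governed by an OpenZFS tunable. The sysctl below is what I'd expect on TrueNAS CORE / FreeBSD with OpenZFS 2.x; the exact name can vary by version, and on Linux the equivalent knob lives under /sys/module/zfs/parameters/l2arc_rebuild_enabled:
Code:
# Check whether persistent L2ARC (rebuild of the cache device on pool import) is on
sysctl vfs.zfs.l2arc.rebuild_enabled
# 1 = L2ARC contents survive a reboot and are re-indexed on import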

A metadata-only L2ARC is very beneficial for directory traversals and similar operations: completion times for rsync runs over unchanged directories improved by a factor of up to 12x from the first to the fourth run. For example:
  • iTunes Folder (1.36TB of music files, iOS backups) 4th Run: 5 minutes, 4 seconds
  • Pictures Folder (1.74TB of family pics) 4th Run: 42 minutes, 37 seconds
  • Time Machine Folder (1.16TB of sparse bundles) 4th Run: 3 minutes, 29 seconds
As of TrueNAS 12.x, sVDEVs allow metadata to be stored on SSDs by default; for this I use three 1.6TB Intel S3610s in a three-way mirror. My L2ARC is no longer reserved for metadata only, but it likely has little to no impact on day-to-day operations since few of the files here are read repeatedly. The numbers below suggest that sVDEVs have a significant benefit for my use case. For full disclosure, the results below were taken over SMB rather than AFP, but since the two protocols perform fairly similarly, I doubt that was a factor.
  • iTunes Folder (now 1.1TB of music files since I removed iOS backups): 1 minute, 42 seconds
  • Pictures Folder (now 2.3TB of pictures): 34 minutes, 34 seconds
So, based on my testing, using a sVDEV vs. L2ARC is advantageous, even for largely dormant data. For data that is more dynamic, the sVDEV benefit would likely be even more pronounced on a day-to-day basis, since the "hit" rate for metadata in the sVDEV is 100%, whereas with an L2ARC a block has to be missed once or more before the default tunables commit it to the L2ARC. Yes, the rate at which the L2ARC is filled can be hand-tuned, etc., but that is some work. Plus, there always has to be an L2ARC "miss" before there can be a "hit".

Whether or not a sVDEV is the right answer for your particular use case is a different question, however. Remember, the L2ARC is not essential to the operation of the NAS: it can fail, and the NAS will simply read from the pool instead. For performance and stability, stick to the recommendations here re: having at least 32GB of RAM before adding an L2ARC, to avoid putting pressure on the ARC, especially if you're running jails or VMs.

Also: if your sVDEV fails, the pool goes with it. So use a multi-drive mirror to host the sVDEV. Ideally, use high-grade SSDs from the likes of Intel, rated for multiple drive writes per day (DWPD)... especially if your NAS is hosting databases with many small files that are written and read constantly.
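For reference, here is roughly what adding a mirrored sVDEV looks like from the CLI; the pool and device names are made up, and the small-block threshold is only an example:
Code:
# Add a three-way mirrored special (metadata) vdev to pool "tank"
# (device names da1/da2/da3 are placeholders - use your own)
zpool add tank special mirror da1 da2 da3

# Optionally let small file blocks land on the sVDEV too, per dataset;
# blocks at or below this size are written to the special vdev
zfs set special_small_blocks=32K tank/databases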

Lastly, regardless of how many resources you throw at the NAS, offsite backups are essential if you value the data.
 
Joined Oct 22, 2019 · Messages: 3,641
I've read your older thread (from 2019) as well as this updated one.

I'm very interested in using an existing m.2 NVMe drive as an L2ARC for metadata only (to speed up directory listings and subsequent rsync operations that are heavy on reading metadata before/during transfer.)

But I'm confused on some points. Why would adding an L2ARC vdev increase the system's RAM requirements? I assumed it would be agnostic. Rather than reading from the spinning HDDs for metadata not already cached in RAM, the L2ARC would provide a faster alternative from which it would be loaded into RAM. Either way, shouldn't RAM usage remain approximately the same, regardless?

Just to reaffirm: the L2ARC for metadata doesn't need to be redundant, and in fact any failure of the vdev (or corruption detected on the vdev) should not put the pool's data at risk, even in a live system? (I understand redundancy is better, but I just want to make sure.)

Also, to be clear, I'm in no way referring to sVDEV (even though you compare the two in your post). I'm specifically referring to a metadata-only L2ARC, which is not tethered to the pool's integrity and safety.

UPDATE: I do not run any VMs, but I do run a few jails, Plex being the most active.

UPDATE 2: I'm interested in a metadata-only L2ARC, so I assume its requirements are vastly different from a standard L2ARC that caches all reads (user data + metadata).
 
Constantin · Vampire Pig · Joined May 19, 2017 · Messages: 1,829
The L2ARC index is kept in RAM, so the more L2ARC in the system, the more RAM is used by the index, potentially reducing the RAM available for ARC.

Yes, the L2ARC contents are fully redundant with the pool: every block it caches also lives in the pool, so a failed L2ARC device does not put the pool's data at risk.

How the RAM requirements for indexing differ between a metadata-only and a regular L2ARC is a good question. I do not know the answer, but I suspect that a metadata-only L2ARC likely needs more RAM for indexing per gigabyte of cache than a "standard" L2ARC, since metadata blocks are small and therefore more numerous.
 
Joined Oct 22, 2019 · Messages: 3,641
How the RAM requirements for indexing differ between a metadata-only and a regular L2ARC is a good question. I do not know the answer, but I suspect that a metadata-only L2ARC likely needs more RAM for indexing per gigabyte of cache than a "standard" L2ARC, since metadata blocks are small and therefore more numerous.
So theoretically, if I'm understanding this correctly: if I use an L2ARC vdev, disable it for all datasets, but then enable secondarycache=metadata ("metadata-only") for a specific dataset (e.g., bigtank/rsyncbackups), the most additional RAM it can require is whatever is needed to index that dataset's metadata (directory tree and file listings, attributes, timestamps, etc.), which surely can't be that large, right?
 
Constantin · Vampire Pig · Joined May 19, 2017 · Messages: 1,829
Not sure whether it can be done that granularly. But if so, I’d like to think you’re right by virtue of the metadata-only L2ARC size being limited by your dataset fencing.

How much RAM does your machine have now?
 
Joined Oct 22, 2019 · Messages: 3,641
16 GB ECC (would increase it, but I've been very happy thus far.)

Not sure whether it can be done that granularly.
I've read elsewhere of other users setting "secondarycache=" (via ZFS commands) at the per-dataset level, whether to disable it or specify metadata-only for a particular dataset within the pool; something along the lines of the sketch below.
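Roughly like this, I believe (the pool, dataset, and device names here are made up for illustration):
Code:
# Add the NVMe drive as an L2ARC (cache) device to the pool
# (nvd0 is a placeholder device name on TrueNAS CORE / FreeBSD)
zpool add bigtank cache nvd0

# Turn L2ARC off for everything by default (children inherit this)...
zfs set secondarycache=none bigtank
# ...then allow only metadata from the dataset you actually care about
zfs set secondarycache=metadata bigtank/rsyncbackups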

I’d like to think you’re right by virtue of the metadata-only L2ARC size being limited by your dataset fencing.
This is what I'm hoping too. Not expecting anything in the way of pure data reads from cache, but rather snappier directory/file listings (especially for regular rsync tasks.) In fact, the bulk of time consumed by rsync is when it figures out what to send. The actual data transfer itself is minimal.
 

jgreco · Resident Grinch · Joined May 29, 2011 · Messages: 18,680
But I'm confused on some points. Why would adding an L2ARC vdev increase the system's RAM requirements? I assumed it would be agnostic. Rather than reading from the spinning HDDs for metadata not already cached in RAM, the L2ARC would provide a faster alternative from which it would be loaded into RAM. Either way, shouldn't RAM usage remain approximately the same, regardless?

I found @Constantin's explanation to be possibly a bit deficient, though technically correct.

The point is that the L2ARC is only useful if the system knows what is out there in the L2ARC. To do this, SOMETHING has to be in the ARC that lets the system rapidly tell that the L2ARC holds the answer to a query.

If you need to read block #12345670, and it is in the ARC, that's trivially fulfilled, and it doesn't really matter even if you had some vaguely inefficient data structure such as a hash table plus linked list that required you to traverse several records to establish that fact, because that'd happen at the speed of RAM, which is quite fast.

However, if you need to read that block #12345670 and you had to grovel around in the L2ARC to find it, that would be a problem, because L2ARC is only really useful if it is fast. If you had to trawl around and do half a dozen I/O's just to tell if the L2ARC contained a block, then your entire system's read capacity would be limited to some fraction of the L2ARC's I/O speed.

Instead, system memory holds a record of what is available in the L2ARC. That way, if a block is held in L2ARC, system memory tells where it is, and only one I/O to the flash is required to retrieve it; if not, no I/O's to the flash, and instead, I/O straight from the pool.

Unfortunately, maintaining such an index in memory means you need to dedicate some memory to that. This means less space for ARC blocks. It should also make it clear why you can't just have a 16GB RAM system and add 2TB of L2ARC to it to "make it zippy".
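A rough back-of-envelope illustration of that last point; the ~80 bytes of ARC per cached L2ARC buffer is an approximation that varies by OpenZFS version, and the 16 KiB average block size is just an assumption for a metadata-heavy cache:
Code:
# Estimate the RAM eaten by L2ARC headers (bash arithmetic; all numbers illustrative)
L2ARC_BYTES=$((2 * 1024**4))   # a 2 TiB cache device
AVG_BLOCK=$((16 * 1024))       # assume 16 KiB average cached buffer
HDR_BYTES=80                   # approx. RAM per cached buffer, varies by version
echo "$(( L2ARC_BYTES / AVG_BLOCK * HDR_BYTES / 1024**2 )) MiB of ARC spent on headers"
# -> about 10 GiB, which is most of a 16GB system before it caches anything useful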
 

sretalla · Powered by Neutrality · Moderator · Joined Jan 1, 2016 · Messages: 9,703
OK, but if we're setting the L2ARC as metadata-only for all datasets in the pool it's assigned to, the only thing that can go into L2ARC is metadata (or am I not understanding it?).

That would mean... run big rsync task (or something else that does deep tree enumeration)... all metadata now in ARC.

Run some other tasks for a day (like normal use of the NAS for file shares)... the ARC gets overwritten many times over because the volume of different data going through is much greater than the ARC, hence the metadata gets kicked out into L2ARC (but nothing else). Actually, it may already have happened during the rsync, if the task itself handled enough data to push the metadata (or most of it) out to L2ARC.

Re-run your rsync and have it go faster, since the metadata is still in L2ARC and doesn't need to be re-read from the pool disks.

Did I miss something?

Would that also be true of a system with relatively little RAM?
 

jgreco · Resident Grinch · Joined May 29, 2011 · Messages: 18,680
Would that also be true of a system with relatively little RAM?

I suspect that it would take several runs for L2ARC to be populated, and of course you're also robbing the system of some ARC. It's tradeoffs.
 

sretalla · Powered by Neutrality · Moderator · Joined Jan 1, 2016 · Messages: 9,703
I suspect that it would take several runs for L2ARC to be populated
What about setting the persistent L2ARC flag so that's not an issue?

of course you're also robbing the system of some ARC. It's tradeoffs.
Agreed. I guess it's just another way of getting a similar result to a metadata VDEV, but without the downside of pool loss if your VDEV fails.

Also, I guess you could consider a really small (maybe 30GB?) L2ARC if all it needs to handle is metadata, which would presumably keep the stealing of ARC to a minimum too. Maybe even partitioning a good-sized NVMe or SSD and handing out little bits to each of your pools would work pretty well (for the use case discussed in the thread above); something like the sketch below.
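Purely hypothetical device names, labels, and sizes follow; also note the TrueNAS GUI generally expects whole disks, so hand-partitioned cache devices are a do-it-yourself affair:
Code:
# Split one NVMe into two 30G slices and hand one to each pool as L2ARC
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -s 30G -l l2arc-pool1 nvd0
gpart add -t freebsd-zfs -s 30G -l l2arc-pool2 nvd0
zpool add pool1 cache gpt/l2arc-pool1
zpool add pool2 cache gpt/l2arc-pool2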
 

jgreco · Resident Grinch · Joined May 29, 2011 · Messages: 18,680
What about setting the persistent L2ARC flag so that's not an issue?

Well, the words I said were "to be populated". Persistent L2ARC doesn't solve the issue of getting the data populated in the first place. Either way, my gut feeling is that you still want to have more memory than 16GB. Part of this will depend on just how much metadata we're talking about. My guess is that on the first passes thru, we're going to see lots of reading of both metadata and block data (for the rsync use case) so there will be competition in the ARC for metadata storage space. I believe ARC is capped by default to 75% metadata, so if you are ALSO reading lots of block data and caching that temporarily while working with it, that means that the larger amount of non-metadata is applying more pressure on the ARC during the initial runs, making MFU metadata a bit harder to characterize.
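For reference, the cap I'm thinking of is exposed as a tunable in the OpenZFS 2.0/2.1 era that TrueNAS 12 ships; names differ between versions and platforms, and newer releases have reworked ARC metadata accounting, so treat this as a rough pointer rather than gospel:
Code:
# FreeBSD / TrueNAS CORE sysctl names for the ARC metadata cap
sysctl vfs.zfs.arc.meta_limit_percent   # defaults to 75 (percent of ARC)
sysctl vfs.zfs.arc.meta_limit           # absolute byte limit; 0 = use the percentage
# On Linux the matching knobs are /sys/module/zfs/parameters/zfs_arc_meta_limit*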

It may be that inaccurate tail evictions are not a serious problem for metadata in most cases though. For normal ZFS blocks, this is quite important as you want to ideally evict stuff that has been accessed more than once, but just barely more than once. This is how you get to that point of L2ARC'ing "useful" blocks in the working set. Really that's sorta true for metadata too, if you had a total crapton of it, but you probably don't, so it's probably fine to be flushing it out to L2ARC with less opportunity to characterize its value.
 

sretalla · Powered by Neutrality · Moderator · Joined Jan 1, 2016 · Messages: 9,703
If I take a theoretical (even a business) use case of a Media company with Photographers/Videographers and editors, they might have a scenario where both content creators and editors need different things from a system (which may have a lot of RAM).

Millions of Photos (presumably relatively small, but lots of metadata) and thousands of videos (presumably very large, but less metadata as there are fewer files).

If the Photographers want to upload, they may run routines that are similar to rsync, so metadata heavy with some copy on top.

If the Videographers want to upload, they may run routines that are similar to rsync, but not as metadata heavy with a lot of copy on top.

The editors will need to lay down new large files and access existing ones quickly, but when they edit photos, they may need a lot of metadata to browse through the large collection.

In that scenario (which many of us can probably identify our own "home" setups in), the metadata will constantly be ejected from ARC due to the re-reading of large files and if there's nothing to keep it in L2ARC either, all the metadata heavy operations will be slow.

Assuming that ARC is big enough to hold a day's worth of working set for the editors (or somewhere close to it)... maybe 1TB... then you may be able to get away with L2ARC holding only metadata to help the workflows that use it, and the other workflows can then fight over ARC based on what's in use.

I have no idea if that makes sense to anyone else, but it seems logical to me.

Once you have run your library enumeration through enough times to get it into L2ARC, persistence would keep it there across reboots (with the first few minutes after a reboot to re-warm), and presumably changes would be added over time.
 
Joined Oct 22, 2019 · Messages: 3,641
Persistent L2ARC doesn't solve the issue of getting the data populated in the first place. Either way, my gut feeling is that you still want to have more memory than 16GB. Part of this will depend on just how much metadata we're talking about. My guess is that on the first passes thru, we're going to see lots of reading of both metadata and block data (for the rsync use case) so there will be competition in the ARC for metadata storage space.
Really that's sorta true for metadata too, if you had a total crapton of it, but you probably don't, so it's probably fine to be flushing it out to L2ARC with less opportunity to characterize its value.
Millions of Photos (presumably relatively small, but lots of metadata) and thousands of videos (presumably very large, but less metadata as there are fewer files).
In that scenario (which many of us can probably identify our own "home" setups in), the metadata will constantly be ejected from ARC due to the re-reading of large files and if there's nothing to keep it in L2ARC either, all the metadata heavy operations will be slow.

I feel that for many NAS users (including myself) whose NASes are primarily used for one-way operations (rsyncs and backups from client PCs), it's overwhelmingly metadata, rather than actual file/user data, that gets read.

To bring it back to your scenario, it would be favoring the many operations on small files (not their data; just comparing sizes and timestamps) rather than the few large files being read for their actual data (such as videos).

If a metadata-only L2ARC for a particular dataset[1] requires additional RAM (for indexing) at the expense of ARC being used for other purposes (i.e., large files being cached), then it's a "win" in my case, since I'd rather speed up regular rsync operations that spend way too long crawling the directory/file tree looking at timestamps and file sizes, yet end up transferring only a small amount of real data each time, since there are only a few differences between the source and destination filesystems.[2]

I suppose it might "compete" with Plex streaming in the ARC, but Plex isn't used nearly as often as rsync. (UPDATE: But the worst-case scenario, unless I'm misunderstanding, is that even if all this metadata is pushed out of the ARC, my "metadata-only" L2ARC is persistent, so the metadata can be read back into RAM much more quickly from the m.2 NVMe L2ARC than from the spinning HDDs when rsync or other metadata-heavy tasks are summoned.)

Outside of that, very little in the way of large files is read over the network, and any browsing done over the share is metadata-heavy more than anything else.

Basically, I'd fit more in the camp of "I don't mind slow reads of large files that need to be pulled from the pool, with even less room in the ARC for such things, so long as metadata operations (e.g, rsync, listing, etc) can be sped up without having to rely on the storage pool itself."

[1] Referring to the fact that, using "zfs" commands, you can set options per dataset: the "secondarycache" property can be set to "none" for all datasets in the pool except for the specific dataset(s) that are set to "metadata".

[2] As it stands now, if I do a "dry run" of rsync for the first time in a while, it takes a notable amount of time to complete, even if only to tell me "nothing has changed". However, if I immediately do another dry run, it finishes the same crawl within seconds. I use the rsync daemon / module service, so it's not over SSH.
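For the curious, this is the kind of dry run I mean; the host, module, and paths below are made up:
Code:
# Itemized dry run against an rsync daemon module (no SSH involved)
rsync -ain --delete /mnt/data/photos/ rsync://truenas/rsyncbackups/photos/
# -a archive, -i itemize changes, -n dry run: walks the whole tree but copies nothing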



UPDATE: Maybe the "million-dollar question" is: will a metadata-only L2ARC still require a massive amount of memory if the filesystem contains a "crapton" of files and folders? (In order to index what's in the L2ARC once that metadata gets pushed out of the ARC.)

Maybe I'm misunderstanding how ZFS treats filesystem metadata. Does each and every file's attributes (timestamp, size, permissions, etc) get its own "record"? If that's the case, then a dataset with many, many small files may in fact suck up a good portion of RAM just to index what lives in the L2ARC.
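One way to get a feel for how much metadata the pool actually carries is zdb's block statistics; it's read-only but can take a long time on a big pool, and the flags/output categories vary a bit between OpenZFS versions (the pool name is an example):
Code:
# Per-type block breakdown (compare the metadata rows against "ZFS plain file" data)
zdb -Lbbbs tank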

I could add two additional sticks. These same sticks are in my system currently, and would double it to 32GB ECC total (4 x 8GB):

Timetec Hynix IC 16GB KIT (2x8GB) DDR4 2400MHz PC4-19200 Unbuffered ECC



The payoff doesn't seem as appealing the more I think about it. :confused:
 
Constantin · Vampire Pig · Joined May 19, 2017 · Messages: 1,829
Consider putting that crapton of files into disk images so that contiguous blocks can be written. I did so for two reasons: 1) it limits the metadata the system has to keep track of, and 2) rsyncs are faster by virtue of not having to cycle through as many files. This is especially relevant should you run something like Carbon Copy Cloner (CCC) with the health check option on, where every file is read in its entirety and compared to the backup source.

It's a lot faster to do so with something like a disk image, where all the bazillions of little files are crammed into larger bands (if using an Apple sparsebundle, for example), than going through a system file tree on a Mac with 300k+ files. Of course, there are tradeoffs associated with disk images vs. having raw files, but for many general-purpose backups they work great on a WORM basis. For larger files (photo images or video), disk images make no sense since the files are large, generally incompressible, and usually not nearly as plentiful as all the localization files in an OS.

I significantly reduced my metadata requirements before I nuked my pool and rebuilt it completely with the sVDEVs. I hope that I set my small-file thresholds sufficiently low, since sVDEVs are currently woefully under-supported re: diagnostics in the GUI. There are zero clues in the GUI as to how full the sVDEVs are or what the ratio of metadata to small files is, nor can the sVDEV be easily dedicated solely to metadata.
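From the command line you can at least see how full the special vdev is, even if the GUI won't show it (pool name is an example):
Code:
# Per-vdev capacity, including the special mirror's ALLOC/FREE columns
zpool list -v tank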
 
Joined Oct 22, 2019 · Messages: 3,641
As an update to this, I added more ECC RAM (same brand/vendor) to bump it up from 16GB to 32GB.

The pattern of "delays" for an rsync task dealing with only pure metadata ("dry run", no files being transferred) corresponds to a drop in the ARC hit ratio. It doesn't matter whether the entire directory/file listing is cached in RAM on the client PC, since the metadata apparently drops out of the ARC on the TrueNAS server after some time has passed. This makes me wonder if a metadata-only L2ARC might reduce some of the delay, since the metadata would then live on a fast NVMe (rather than the current setup, where it is pulled from spinning HDDs).

The same behaviour happened on 16GB RAM and now with 32GB RAM. If I initiate the same "dry run" immediately after the first one completes, then the ARC hit ratio remains above 99.9% and the rsync task completes within 15 seconds.
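For anyone who wants to watch this happen live, something like the following should do it; the arcstat field names can differ between OpenZFS versions, and arcstat -v lists what your build supports:
Code:
# Live ARC/L2ARC hit rates every 10 seconds while the dry run is going
arcstat -f time,read,hit%,mh%,l2read,l2hit% 10

# Or a one-off snapshot of the L2ARC counters
arc_summary | grep -A 4 "L2ARC breakdown"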

[Attached screenshot: justified-l2arc-or-try-something-else.png]
 

Constantin · Vampire Pig · Joined May 19, 2017 · Messages: 1,829
... suggests that all your metadata currently may fit into the ARC. How quickly that will get displaced or not is a different question.

I don't run multiple rsync operations in a row, so I don't expect to see a big benefit from the ARC retaining the metadata from the last run, at least for some time. Pool size will also influence this type of testing: bigger pools are unlikely to fit their metadata in its entirety into the ARC.

Whereas if you have a persistent, metadata-only L2ARC, then the performance will likely remain consistently fast between reboots, pool scrubs, and so on. Maybe not 15s fast, but fast enough for real-world use, especially if the data is largely dormant.
 
Joined Oct 22, 2019 · Messages: 3,641
... suggests that all your metadata currently may fit into the ARC. How quickly that will get displaced or not is a different question.
Practically every 30 minutes or so, and sometimes whenever I touch any appreciable amount of user data, apparently.

For any lurkers bumping into this thread from a web search, this topic branches off into a related discussion (with a solution that works for me). I recommend reading through the comments and the links provided in some of the posts (to read older discussions of the same issue that affects even non-TrueNAS users of ZFS).
 

sretalla · Powered by Neutrality · Moderator · Joined Jan 1, 2016 · Messages: 9,703
An update from the trenches...

After finally deciding to test out some of my hypotheticals, I added an L2ARC to my largest pool (about 70TB of data, just over 50% full). That server has 128GB of RAM, but I assign some of it to a RAMdisk, so in reality about 96GB, of which it seems to want to assign less than 30GB to ARC, often leaving 30GB or more as Free. (I have a tunable in place to keep it below 58GB... not sure why it stays much lower than that... anyway, on with the story.)

That pool gets new media daily, takes a weekly backup of about 1TB from a Windows 11 PC, and data both new and old is accessed from it daily.

The first thing I would expect after more than a week is that we're close to having pushed 3 or 4 TB of data in and out, which potentially should have put pressure on the ARC to grow... which it didn't.

That in turn tells me that the ARC shouldn't really be evicting anything, since it has the chance to grow but isn't doing so (maybe I'm missing a bit where content ultimately times out rather than being pushed out by other content), yet I do see some metadata moving to L2ARC.

The amount of metadata in L2ARC seems pitifully small compared to the over-50% metadata hit ratio... it's all very confusing compared to what I was expecting to see.

Code:
Cache hits by data type:
        Demand data:                                   48.2 %      26.6G
        Demand prefetch data:                         < 0.1 %       2.5M
        Demand metadata:                               51.7 %      28.6G
        Demand prefetch metadata:                       0.1 %      71.7M

Cache misses by data type:
        Demand data:                                   39.0 %      31.7M
        Demand prefetch data:                          28.3 %      23.1M
        Demand metadata:                               16.0 %      13.0M
        Demand prefetch metadata:                      16.7 %      13.6M

L2ARC size (adaptive):                                          10.0 GiB
        Compressed:                                    19.2 %    1.9 GiB
        Header size:                                    0.1 %   14.9 MiB

L2ARC breakdown:                                                   11.0M
        Hit ratio:                                      6.0 %     660.5k
        Miss ratio:                                    94.0 %      10.3M
        Feeds:                                                    659.7k

L2ARC writes:
        Writes sent:                                    100 %      18.0k


Maybe somebody who knows better can explain it.
 

Constantin · Vampire Pig · Joined May 19, 2017 · Messages: 1,829
I'd suggest setting the L2ARC up as metadata-only, then adjusting some of the tunables, followed by full directory traversals with something like rsync. My testing showed that it took three rsync passes before the metadata was fully committed to the L2ARC.
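The tunables I have in mind are the L2ARC feed knobs; the values below are examples only, and the sysctl names are the OpenZFS 2.x ones on FreeBSD (Linux exposes the matching l2arc_* parameters under /sys/module/zfs/parameters, and in TrueNAS you would normally add these as tunables in the GUI so they survive a reboot):
Code:
sysctl vfs.zfs.l2arc.write_max=67108864     # allow up to 64 MiB per feed interval
sysctl vfs.zfs.l2arc.write_boost=134217728  # fill faster while the ARC is still warming up
sysctl vfs.zfs.l2arc.noprefetch=1           # don't cache prefetched streaming reads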
 

sretalla · Powered by Neutrality · Moderator · Joined Jan 1, 2016 · Messages: 9,703
I'd suggest setting the L2ARC up as metadata-only
Already the case.

I think I was actually reading the arc_summary incorrectly...
Code:
L2ARC breakdown:                                                   12.0M
        Hit ratio:                                      6.3 %     754.3k
        Miss ratio:                                    93.7 %      11.2M
        Feeds:                                                    765.4k


Those are counts, not bytes... since it seems when it's talking in bytes, it looks like this:
Code:
L2ARC size (adaptive):                                          10.2 GiB
        Compressed:                                    19.1 %    1.9 GiB
        Header size:                                    0.1 %   14.7 MiB


So if I'm following it right this time, that means I've had 3/4 of a million hits on L2ARC, saving me reads from the pool... actually pretty good (even if I was hoping for better and still don't quite understand how it can happen if ARC is under its maximum size).

All at the cost of 15MB of ARC and just over 10GB of L2ARC SSD space.
 