How large of a cache drive

aufalien

Patron
Joined
Jul 25, 2013
Messages
374
Maximizing the performance of the pool by disabling sync also opens you up to corruption during data transfers/writes should the power fail, etc., while they're happening.

I prefer allowing the OS to decide what requires sync and what does not. But it all comes down to use case. My pool is primarily for dormant data, not scratch space, etc. Hence the Z3, and so on.

Ah yes I see what you mean.

So back in the FreeNAS 9.x days I had implemented async and would on occasion yank power to various JBODs (I have between 4 and 8 on any given server), server head units, etc. Keep in mind I have ~300 NFS clients, which is very modest, but I wanted to see what would happen.

I then power cycled everything cleanly, and the worst that happened was a corrupt file that was in transit. The file system itself and the data on it were fine.

I attribute this to ZFS being copy-on-write, which is marvelous.

I also have no SLOG (separate ZIL device) or L2ARC.

The thing I realized about ZFS, which I don't think can ever really be remedied, is that when Sun developed it, their goals for a modern file system were:

Robust
Secure
Scalable

Nowhere was speed/performance in the top 3.

Like you said, it's use-case dependent, so if one has large writes in transit or is hosting a database, then sync all the way, as async during a hardware failure would be a PITA.

But SATA sucks. I would prefer SAS, but those drives aren't high-density enough, and using SSDs for frequent writes in my environment isn't good either.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Async (sync=disabled) does allow loss of data if the system crashes / power goes out and there is write data in RAM that has not yet been committed to the Pool.

TrueNAS batches writes in memory (for up to about 5 seconds, the transaction group interval) and then writes them all to the Pool in one go. With sync disabled, the client OS is told that the data has been written when in fact it may not have been yet. You get faster response (as fast as TrueNAS can deliver) in exchange for a small(?) risk of losing those in-flight writes.

This is what a SLOG is for (where relevant, mostly NFS or iSCSI): you set sync=always, which forces the system to acknowledge writes only after they have been safely stored. You then use something like an Optane as the log device (very low latency, fast, and no loss of data if power goes out). ZFS writes the data to the Optane first, tells the client the data has been written, and then writes the data to the Pool later. If power goes out, ZFS replays the SLOG for any missed writes. This, however, is not as fast as async.
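For reference, here is a minimal sketch of the commands involved, wrapped in Python so it can run as a script. The pool/dataset name ("tank/nfs_share") and the device path are made-up examples, so adjust before trying anything like this.

```python
# Minimal sketch of the relevant ZFS commands, wrapped in Python.
# The pool/dataset name and device path below are hypothetical examples.
import subprocess

def run(*cmd):
    """Run a command (as root) and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Check how a dataset currently handles sync requests (standard/always/disabled).
print(run("zfs", "get", "sync", "tank/nfs_share"))

# Honour every write synchronously -- pair this with a fast SLOG device.
run("zfs", "set", "sync=always", "tank/nfs_share")

# Attach a dedicated log (SLOG) device, e.g. an Optane drive, to the pool.
run("zpool", "add", "tank", "log", "/dev/nvme0n1")
```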

Hope I got that right
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I found a metadata only L2ARC beneficial even if just for browsing more quickly through directories. With 48GB of RAM, you have more than enough to entertain a 512GB L2ARC. The usual minimum is 32GB of RAM.

No, the usual minimum is 64GB, but it is workload-dependent. The basic problem is that you really need a lot of ARC to identify meaningfully-cacheable stuff in your typical workload -- but the trick is this: it has to be meaningfully cacheable but not used THAT often. This is really hard to identify with small amounts of RAM.
 

aufalien

Patron
Joined
Jul 25, 2013
Messages
374
No, the usual minimum is 64GB, but it is workload-dependent. The basic problem is that you really need a lot of ARC to identify meaningfully-cacheable stuff in your typical workload -- but the trick is this: it has to be meaningfully cacheable but not used THAT often. This is really hard to identify with small amounts of RAM.

Yes, this is why I focus on maximizing the number of vdevs and the amount of RAM.

As I understand it, each vdev performs roughly like a single drive for random I/O, so the pool scales with the number of vdevs rather than with how many drives are in each vdev.

I'll never use L2ARC, but I may use a SLOG device and set sync if running a database or if my NFS use case requires it.

The larger drives also pose a challenge during rebuilds but they have larger and faster internal caches.

They also support 4K alignment, whereas the older, smaller drives have been hard for me to deal with in this regard.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Thank you for the reply. I can't go above 48GB.. honestly out of money to spend, but the m/b has 8 slots and that is what I have. I could increase to 128.. but man.. if 48TBs and 48GB of RAM is not enough.. then I am way over my head.
Threadripper and 48 GB RAM is certainly overkill to serve files, but ZFS was designed as an enterprise filesystem, and its requirements fit the expected resources of a corporate Sun customer rather than the wallet of a home user.
Just don't feel compelled to implement any cool-looking feature you may find in the documentation. (There are too many anyway :wink: ) Definitely no SLOG for your use case, and no general-purpose L2ARC. You may consider a metadata-only persistent L2ARC to speed up browsing, but keep it small.
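For reference, the setup itself is only a couple of commands. A minimal sketch follows (the pool name "tank" and the device path are hypothetical; on recent OpenZFS the L2ARC contents survive reboots by default):

```python
# Sketch of adding a small metadata-only L2ARC; pool name and device path
# are hypothetical examples.
import subprocess

def run(*cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Add the SSD/NVMe device as a cache (L2ARC) vdev -- it is expendable and
# can be removed again at any time with "zpool remove".
run("zpool", "add", "tank", "cache", "/dev/nvme0n1")

# Restrict what may be fed into the L2ARC to metadata only.
run("zfs", "set", "secondarycache=metadata", "tank")

# Verify: the property is inherited by child datasets unless overridden.
print(run("zfs", "get", "-r", "secondarycache", "tank"))
```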

No, the usual minimum is 64GB, but it is workload-dependent. The basic problem is that you really need a lot of ARC to identify meaningfully-cacheable stuff in your typical workload -- but the trick is this: it has to be meaningfully cacheable but not used THAT often. This is really hard to identify with small amounts of RAM.
That makes sense for meaningful-but-not-so-frequent data. Does this consideration apply also to a metadata-only L2ARC? One would expect that, in this case, ZFS just caches any metadata it comes across until the L2ARC is full, but ZFS rarely behaves as one would expect.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Does this consideration apply also to a metadata-only L2ARC?

No, because metadata-only is a specialized workload. :cool:

If you want "normal" L2ARC to work correctly, ZFS needs to be able to count the number of accesses to ARC entries and have that be meaningful. For example, your root directory metadata is in ARC and has tons of accesses to it, but the file you just accessed the one time is in ARC but has only been accessed once.

When ARC fills, it needs to have contents removed to free up space. So the ARC pruning code inspects the ARC to look for the blocks that have been least used, and my recollection is that it uses some decision-making to send the best candidates out to L2ARC.

If almost all of your entries are "this has been used once", what you send to L2ARC is little better than random chance as to whether it will be read again in the near future. But if you have to evict a block that was accessed several times while in ARC, that past activity is a reasonable predictor of future accesses. Therefore, if you want efficient L2ARC evictions, you need enough ARC that you are not just churning stuff in and out of ARC with an access count of 1 for almost all of it. There has to be enough ARC so that, when a workload spike happens, the stuff being evicted that is genuinely worth caching can be differentiated from the stuff that has only been accessed once and wouldn't be meaningful to cache, and that useful stuff is sent out to L2ARC with priority, so that it is readily on hand to be brought back in.
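A toy model of that trade-off (this is an illustration only, not ZFS's actual ARC/L2ARC code; the entries and counts are invented):

```python
# Toy illustration only -- not the real ARC/L2ARC algorithm.
from collections import namedtuple

Entry = namedtuple("Entry", ["name", "last_used", "hits"])

# Pretend ARC contents: (name, last-used timestamp, access count while cached).
arc = [
    Entry("root dir metadata", last_used=100, hits=5000),
    Entry("project dir metadata", last_used=99, hits=40),
    Entry("file A read twice", last_used=60, hits=2),
    Entry("file B read once", last_used=61, hits=1),
    Entry("file C read once", last_used=62, hits=1),
]

# Under memory pressure we must evict the coldest entries (the "end of ARC").
eviction_candidates = sorted(arc, key=lambda e: e.last_used)[:3]

# Feeding L2ARC only makes sense for candidates with some reuse history;
# single-hit entries are essentially random guesses about future reads.
worth_caching = [e for e in eviction_candidates if e.hits > 1]

print("evicting:", [e.name for e in eviction_candidates])
print("worth sending to L2ARC:", [e.name for e in worth_caching])
```

With a small ARC under a heavy workload, nearly every entry ends up with a hit count of 1, and the "worth sending" list becomes little better than chance.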
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
FWIW, on my measly Mini XL (32GB RAM, 90TB Pool, 512GB L2ARC) I never experienced issues with ARC, etc. but the addition of the metadata L2ARC made a huge difference for rsync backups. However, this comes down to use case, pool layout, etc.

Back then, the L2ARC was not persistent (FreeNAS 11 didn't have that option). Persistent metadata-only L2ARCs are arguably a better solution than sVDEVs, since an L2ARC is 100% expendable AND sVDEVs still don't have GUI visibility into how full they are, so from the GUI we still can't tell whether the sVDEV is spilling into the main pool. A small L2ARC is also easy to upgrade; sVDEVs, not so much.

That's not to say that sVDEVs don't have their use cases, they certainly do, but for a small SOHO setup without databases or VMs a metadata L2ARC may be preferable.
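For what it's worth, you can at least eyeball sVDEV and cache-device fill from the shell even though the GUI doesn't break it out. A quick sketch, with a hypothetical pool name:

```python
# Sketch: show per-vdev capacity, including any special (sVDEV) and cache
# devices, since the GUI doesn't break this out. Pool name is hypothetical.
import subprocess

# "zpool list -v" prints SIZE/ALLOC/FREE for every vdev and leaf device.
out = subprocess.run(["zpool", "list", "-v", "tank"],
                     check=True, capture_output=True, text=True).stdout
print(out)
```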
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
the ARC pruning code inspects the ARC to look for the blocks that have been least used, and my recollection is that it uses some decision-making to send the best candidates out to L2ARC

ARC pruning is technically independent of the L2ARC feed - when the ARC needs to free RAM due to outside pressure (e.g. from the kernel) it needs to do it "right now", and as such it will just drop the lowest-value (from an MRU/MFU perspective) records and doesn't care where they wind up.

The L2ARC feed thread tries to stay ahead of that by proactively scanning records that are "candidates for eviction", deciding which ones are the most valuable (MRU/MFU again weighing in here), and pulling them into L2ARC. There's a whole raft of tunables you can adjust here (the corresponding OpenZFS module parameters are sketched after this list):
  • how deep into the "end of ARC" to scan
  • how frequently you want to scan the "end of ARC"
  • whether you want to exclude MRU data entirely in an attempt to prevent churn
  • whether you want to allow "prefetched but never accessed" data to hit L2ARC
  • how much capacity in writes per scan to write to L2ARC
But as you mentioned, there still needs to be enough primary ARC to find meaning from the madness. Most users won't benefit from a data L2ARC but as @Constantin points out a metadata L2ARC can make a massive improvement for certain workflows.
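On TrueNAS SCALE (Linux) those knobs surface as OpenZFS module parameters, and the sketch below just reads the ones that map onto that list. The parameter names are from OpenZFS 2.x; on TrueNAS CORE/FreeBSD the same knobs live under the vfs.zfs.l2arc sysctl tree, so treat this as a starting point rather than gospel.

```python
# Sketch: read the L2ARC feed tunables on a Linux/OpenZFS system (TrueNAS SCALE).
# On TrueNAS CORE/FreeBSD the equivalents are sysctls under vfs.zfs.l2arc.*.
from pathlib import Path

PARAMS = {
    "l2arc_headroom":    "how deep past the 'end of ARC' the feed thread scans",
    "l2arc_feed_secs":   "how often the feed thread wakes up",
    "l2arc_feed_min_ms": "minimum interval between feeds under load",
    "l2arc_mfuonly":     "exclude MRU data entirely (feed MFU only)",
    "l2arc_noprefetch":  "skip 'prefetched but never accessed' data",
    "l2arc_write_max":   "max bytes written to L2ARC per feed interval",
    "l2arc_write_boost": "extra write allowance while the ARC is still warming up",
}

base = Path("/sys/module/zfs/parameters")
for name, meaning in PARAMS.items():
    path = base / name
    value = path.read_text().strip() if path.exists() else "n/a"
    print(f"{name:20} = {value:>12}   # {meaning}")
```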
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
The general case for worry-free persistent metadata-only L2ARC over data-critical sVDEV is clear. I suppose that the use case for sVDEV is strictly for workloads where metadata is frequently updated.
But this is drifting away from the OP's concerns.

No, because metadata-only is a specialized workload. :cool:
The pruning mechanism may explain a guideline to size a general purpose L2ARC as "N times RAM"—N=5 being then the default recommendation.
Now, for the "specialised workload" of a metadata-only L2ARC:
  • Do we get another recommendation as "M times RAM" (with M > N) ?
  • Or is it safe to assume that ZFS just hoards metadata with very limited RAM cost and that, contrary to a "normal" L2ARC, a metadata-only L2ARC can be of any size without harmful side-effects?
Either way, the answer could be of interest to the OP… and to many other users.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The pruning mechanism may explain a guideline to size a general purpose L2ARC as "N times RAM"—N=5 being then the default recommendation.

That's basically old advice, and it was predicated in part on the size of L2ARC pointer records of older ZFS versions.

Now, for the "specialised workload" of a metadata-only L2ARC:
  • Do we get another recommendation as "M times RAM" (with M > N) ?
  • Or is it safe to assume that ZFS just hoards metadata with very limited RAM cost and that, contrary to a "normal" L2ARC, a metadata-only L2ARC can be of any size without harmful side-effects?
Either way, the answer could be of interest to the OP… and to many other users.

No, L2ARC can never be of just "any size" without harmful side effects. You can still end up starving the ARC, but there are more variables to the whole issue today than there were ten years ago, so it is certainly possible to do some rational things today that were unthinkable in the past. If you want to try out L2ARC on systems with less than 32GB of RAM for things like metadata-only, I'd advise cautiously watching your ARC statistics (a quick way to do that is sketched below) and trying intensive workloads to see if it is actually doing what you want without also hurting normal ARC needs. The upside to L2ARC is, of course, that you can disconnect it and try different variations, or tune it differently.
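As a sketch of that watching: the field names below are from the OpenZFS arcstats kstat on Linux (TrueNAS SCALE) and may differ slightly between versions; on CORE the same counters are exposed as kstat.zfs.misc.arcstats sysctls.

```python
# Sketch: pull a few ARC/L2ARC numbers from the kstat file on Linux (SCALE).
# On TrueNAS CORE the same counters are sysctls under kstat.zfs.misc.arcstats.
# Field names are from OpenZFS 2.x and may differ slightly between versions.

def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:          # skip the kstat header lines
            name, _type, value = line.split()
            stats[name] = int(value)
    return stats

s = read_arcstats()
gib = 1024 ** 3
print(f"ARC size      : {s['size'] / gib:6.1f} GiB (max {s['c_max'] / gib:.1f} GiB)")
print(f"L2ARC data    : {s['l2_size'] / gib:6.1f} GiB")
print(f"L2ARC headers : {s['l2_hdr_size'] / (1024**2):6.1f} MiB of RAM")
```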
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Fine…
So the old rule of thumb "L2ARC≤5*RAM" only applies to data and is obsolete, but there is no update.
A (strictly) metadata-only L2ARC, to speed up directory browsing, could be possible with relatively low RAM (32 GB or less) but this is to be investigated (including how beneficial this is), and there is no guidance yet on how to size it.

(I'd like to conclude more positively; I hesitate between Socrates—"ἕν οἶδα ὅτι οὐδὲν οἶδα"—and Monty Python—"and now… THE L(2)ARC(H)!".)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This quickly becomes a discussion of system tuning, for which the correct answer is generally that you need to become familiar with the subsystem, characterize your workload, and then watch the statistics as you torture-test the edge cases.

Metadata, generally being small data blocks, has the potential to create a crapton of L2ARC pointers, because your average modern flash device is really huge (256GB+) compared to the size of system RAM we are discussing (16GB? 24GB?). You rapidly start trading one thing for another here, and that doesn't lend itself too well to generalizations.
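To put rough numbers on that, here is a back-of-the-envelope sketch. The per-record header cost is an assumption (on the order of 70-100 bytes in current OpenZFS; older releases needed several times more), so treat the results as ballpark only:

```python
# Ballpark calculator for the RAM cost of L2ARC headers.
# HEADER_BYTES is an assumption (~70-100 B per record on current OpenZFS;
# older ZFS versions used considerably more), so treat results as rough.
HEADER_BYTES = 96

def l2arc_header_ram(l2arc_bytes, avg_record_bytes, header_bytes=HEADER_BYTES):
    """RAM consumed by headers for an L2ARC of the given size and record size."""
    records = l2arc_bytes / avg_record_bytes
    return records * header_bytes

GiB = 1024 ** 3
for avg in (8 * 1024, 16 * 1024, 128 * 1024):   # metadata-ish vs. large records
    ram = l2arc_header_ram(256 * GiB, avg)
    print(f"256 GiB L2ARC, {avg // 1024:3d} KiB records -> {ram / GiB:5.2f} GiB of RAM")
```

Small records are exactly what metadata is, which is why a big flash device on a small-RAM box can eat a surprising share of the ARC just in headers.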
 

beowu!f

Dabbler
Joined
Oct 3, 2021
Messages
22
Hmm... I think I may have confused myself and this post, now that I have the initial TrueNAS installed and running. I think I was thinking of using the 2nd SSD for plugins, like the installation for Plex, etc., as a 2nd pool. Is THAT good to do?
If so, can I add that later and move (or reinstall if need be)? Right now I am selecting the single pool to install things to.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
You can always add further pools later. A single SSD for the plugin jail is fine if you don't care about the jails and are ready to recreate them anew in case of failure (no redundancy); otherwise, set up two SSDs as a mirror pool for jails.
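For reference, that mirror pool boils down to a single command; a sketch with hypothetical device paths (on TrueNAS you would normally create the pool through the GUI so the middleware tracks it):

```python
# Sketch: what a two-SSD mirror pool for jails boils down to.
# Device paths and the pool name are hypothetical; on TrueNAS, create the
# pool through the GUI so the middleware knows about it.
import subprocess

subprocess.run(
    ["zpool", "create", "jails", "mirror", "/dev/nvme1n1", "/dev/nvme2n1"],
    check=True,
)
```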
 

beowu!f

Dabbler
Joined
Oct 3, 2021
Messages
22
You can always add further pools later. A single SSD for the plugin jail is fine if you don't care about the jails and are ready to recreate them anew in case of failure (no redundancy); otherwise, set up two SSDs as a mirror pool for jails.
That's a good point. If I can find two same size 500GB NVMe SSDs I may just add those and do as you said.
 