Budget hardware recommendation with 10 GBit/s

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
@HoneyBadger thanks for the nuance. I like nuance. I find that’s often where “truth” lies rather than in absolutes, particularly if they don’t move with the times and evolving circumstances.

I saw the SNIA presentation, thanks. The gist of it seems to be that with modern NVMe drives, L2ARC is more broadly useful, including for streaming/sequential workloads (where it was previously at risk of becoming a bottleneck vs arrays of spinning disks). Useful tuning advice in there also.

Look, here are the major points I’m trying to get across:
  1. SSD prices per byte have come down much more quickly than RAM prices. Compared to 10 years ago, RAM prices are roughly 1/4 and quite volatile, while SSD prices are about 1/20 and much more stable (https://jcmit.net/memoryprice.htm, https://jcmit.net/flashprice.htm)
  2. Meanwhile SSDs are (or at least can be) faster by a factor of 10-20x over the same period, thanks to a combination of improved chip and controller design and protocol changes (SATA -> NVMe).
  3. ZFS has evolved too, with less memory overhead from L2ARC, everything else equal (@jgreco I know you don’t agree, but I think you’re wrong, see below [1]), and some smart design choices too (as alluded to by @HoneyBadger ).

All this combined strongly suggests that the efficient frontier in terms of “best bang for the buck”, given a finite budget, also needs to move with the times.

For what it’s worth, my own experience:
  1. Memory pressure from L2ARC is significantly less than one would expect from reading other threads similar to this one. That suggests many on this forum have preconceived and inaccurate views of the trade-off between the benefits of L2ARC and its negative effect on ARC.
  2. Some of the L2ARC parameters, perhaps most notably vfs.zfs.l2arc_write_max, default to values that look ridiculous in 2023 (8MB) and need to be adjusted upwards for L2ARC to be effective.
  3. HOWEVER, in doing so, more pressure will be put on the L2ARC SSD(s) – make sure your L2ARC doesn’t actually become the bottleneck. My starting point was that “any NVMe must be faster than my spinning RAIDZ2”, but I found that to be wrong: the SSD actually pegged at 100% utilization and rather slowed things down under certain workloads. So I upgraded from a cheap generic NVMe drive to a Samsung EVO and had a much better experience all round. (I have since moved on to Optanes and a much beefier setup overall.)

In my view, the recommended advice should be:
  1. Check and know your stats – ARC and L2ARC hit rates (a few example commands are sketched below this list).
  2. Keep an eye on the RAM usage for L2ARC headers (l2_hdr_size). Manage this trade-off against ARC for your workload (based on hit rates). If you know what you’re doing, you can adjust the fraction of ARC allowed to be used for this (/sys/module/zfs/parameters/zfs_arc_meta_limit_percent).
  3. Tune vfs.zfs.l2arc_write_max, vfs.zfs.l2arc_write_boost and possibly l2arc_headroom (how far through the ARC lists to search for L2ARC-cacheable content, expressed as a multiplier of l2arc_write_max, default 2; a higher value means smarter evictions from ARC to L2ARC, at the cost of CPU).
  4. If you tune (3) upwards, be careful that your L2ARC doesn’t become the bottleneck: check drive utilization and realise that NVMe is just a protocol – suppliers can still build crappy drives on top of it. 32GB NVMe Optane sticks are cheap and quite fast; Samsung EVO generally seems a good compromise too.
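
For anyone who wants to check these numbers on their own box, here is a rough sketch (command and tunable names differ between CORE/FreeBSD and SCALE/Linux and across OpenZFS versions, so treat these as starting points rather than gospel):

arc_summary | grep -i l2                                        # ARC/L2ARC summary incl. hit rates (arc_summary ships with TrueNAS)
sysctl kstat.zfs.misc.arcstats.l2_hdr_size                      # CORE: RAM currently used by L2ARC headers
awk '/^l2_hdr_size/ {print $3}' /proc/spl/kstat/zfs/arcstats    # SCALE: the same counter
sysctl vfs.zfs.l2arc_write_max                                  # CORE: current feed cap (vfs.zfs.l2arc.write_max on newer builds)
cat /sys/module/zfs/parameters/l2arc_write_max                  # SCALE: the same tunable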

[1] @jgreco I’m not going to go through the ZFS release notes and look for the changes myself, but there are compelling internet sources that suggest you are wrong, including Jim Salter who maintains the OpenZFS Development Roadmap (so he should know), and who also writes on e.g. Ars Technica – including “[...] The issue of indexing L2ARC consuming too much system RAM was largely mitigated several years ago, when the L2ARC header (the part for each cached record that must be stored in RAM) was reduced from 180 bytes to 70 bytes.” (https://arstechnica.com/gadgets/202...get-a-persistent-ssd-read-cache-feature-soon/). But you seem convinced that this has never changed - what are your sources…?

 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Not to put words into the Grinch's mouth, but I don't think the claim is so much "the L2ARC header requirements have never changed, ever" as the inference that the footprint hasn't been reduced to the point where it's trivial, that the variable block size of ZFS makes it challenging to predict with exact precision (unless your data is very well understood), and that throwing a large L2ARC device in without considering the impact on ARC will still end poorly for your overall performance.
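
To put a rough number on that, using the ~96 bytes of RAM per L2ARC record mentioned earlier in the thread (and ignoring any other ARC bookkeeping): a 1 TiB L2ARC filled with 128 KiB records is about 8.4 million records, or roughly 0.75 GiB of headers held in ARC; the same device filled with 16 KiB records is about 67 million records, or roughly 6 GiB – a 16x swing driven purely by record size, which is exactly why you need to understand your data before you can predict the footprint.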

In my view, the recommended advice should be:
  1. Check and know your stats – ARC and L2ARC hit rates.
  2. Keep an eye on the RAM usage for L2ARC headers (l2_hdr_size). Manage this trade-off against ARC for your workload (based on hit rates). If you know what you’re doing, you can adjust the fraction of ARC allowed to be used for this (/sys/module/zfs/parameters/zfs_arc_meta_limit_percent).
  3. Tune vfs.zfs.l2arc_write_max, vfs.zfs.l2arc_write_boost and possibly l2arc_headroom (how far through the ARC lists to search for L2ARC-cacheable content, expressed as a multiplier of l2arc_write_max, default 2; a higher value means smarter evictions from ARC to L2ARC, at the cost of CPU).
  4. If you tune (3) upwards, be careful that your L2ARC doesn’t become the bottleneck: check drive utilization and realise that NVMe is just a protocol – suppliers can still build crappy drives on top of it. 32GB NVMe Optane sticks are cheap and quite fast; Samsung EVO generally seems a good compromise too.

#1, I would suggest, is probably the most critical measure; not just knowing your hit rates, but also knowing the size of your active dataset. If you're serving your hottest data almost exclusively from 32G of ARC with a 98% hit rate, adding a 1T L2ARC device and filling it to capacity in order to cover some portion of that missing 2% seems a bit silly. Similarly, if you're accessing data randomly across a 16T pool, you may just end up churning through your L2ARC SSD's write endurance as it tries in vain to hold something relevant, but ultimately fails to do any better than a 1/16 chance.
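
As rough back-of-the-envelope math on those two examples: in the first case, even if the L2ARC eventually caught every one of those misses, you've only moved the overall hit rate from 98% to at most 100%, so the device can never serve more than ~2% of reads. In the second case, with roughly uniform random access across 16T, a full 1T L2ARC can only ever hold about 1/16 of the data, so its steady-state hit rate tops out around 6% while the feed thread keeps burning write endurance trying to chase it.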

Ultimately, general recommendations tend to remain vague, and sized on the conservative end of the scale. Users with more experience and ability can push the boundaries out, but recalling the slide from the L2ARC presentation, the design intent was to "do no harm" so our advice tries to align with that.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I’m not going to go through the ZFS release notes and look for the changes myself, but there are compelling internet sources that suggest you are wrong

Such as?

including Jim Salter who maintains the OpenZFS Development Roadmap (so he should know)

What makes this "so he should know"? I maintain some dozens of ZFS filers and have helped hundreds, maybe thousands, of forum users over the years, and have written on this extensively. How is Jim Salter more of an "expert" than me or other ZFS admins who are familiar with the topic?

The issue of indexing L2ARC consuming too much system RAM was largely mitigated several years ago, when the L2ARC header (the part for each cached record that must be stored in RAM) was reduced from 180 bytes to 70 bytes.” (https://arstechnica.com/gadgets/202...get-a-persistent-ssd-read-cache-feature-soon/).

Yes, we all know that. I even said 70 bytes, even though Salter's count is apparently wrong. It was at one time about twice as large as it is now, and it probably doesn't matter whether that's 70 or 96 bytes. HoneyBadger doesn't usually get facts wrong while I am more interested in the big picture.

But you seem convinced that this has never changed - what are your sources…?

I don't seem convinced of that. I just don't think it's a huge deal. I don't think there's a lot of value in worrying about the exact number, since we're not talking about an order of magnitude or anything like that.

The behaviour we often see is that users come in with something like 16GB or maybe 32GB of RAM and then they want to throw in a terabyte or two of L2ARC, not understanding the underlying mechanism and the importance of generating meaningful MFU/MRU statistics to allow the eviction routines to pick solidly useful candidates for the L2ARC. The situation I'm outlining here is relatively common and results in L2ARC thrashing as the quality of eviction choices made when ARC is insufficient is a problem. It's clear to those of us who have actually been interactively helping large numbers of people that it is capital-D Difficult for newbies to ZFS to get their heads around how the ARC and L2ARC interact with each other; MY observation, along with that of a number of other senior forum members here, is that less than 64GB of ARC is generally not going to result in good L2ARC evictions. Is that always true? Of course not, you can certainly devise a situation where it isn't true, but it's a useful guideline anyway.

Source: A variety of people, myself included, who have been providing free ZFS help to thousands of users for more than a decade. Can you cite any better?
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
What makes this "so he should know"? I maintain some dozens of ZFS filers and have helped hundreds, maybe thousands, of forum users over the years, and have written on this extensively. How is Jim Salter more of an "expert" than me or other ZFS admins who are familiar with the topic?
[...]
Yes, we all know that. I even said 70 bytes, even though Salter's count is apparently wrong. It was at one time about twice as large as it is now, and it probably doesn't matter whether that's 70 or 96 bytes. HoneyBadger doesn't usually get facts wrong while I am more interested in the big picture.

Well, if you really want to nitpick, @HoneyBadger actually said "Current OpenZFS needs 96 bytes of RAM per record in L2ARC" but suggests it may have been something else prior; Jim's article says the "L2ARC header" is now 70 bytes, down from a prior 180. These are not necessarily contradictory, as the L2ARC header is a subset of the total amount of RAM needed per record. If you really want to dig into that level of detail, that structure is documented in the actual source code here https://github.com/openzfs/zfs/blob/master/module/zfs/arc.c (see l2arc_buf_hdr_t) for everyone to see.

Meanwhile, your exact comment was "ZFS has never 'required much more RAM overhead in ZFS' for L2ARC", which is in direct contradiction (if you accept half/double as "much more") with Jim's rather specific point on this matter, and it's still not clear to me whether you agree with him or not?

BUT IN ANY CASE, all this is already above for anyone to read, and whether the total overhead is 70 or 80 or 96 bytes in the ZFS version compiled into TrueNAS-13.0-U5.3 is also not the core of the matter here. In all those cases, that still yields a RAM overhead of <2% in a theoretical worst-case scenario. And we can just settle on the fact that there is conflicting information out there on how this compares vs 10 years ago (again, feel free to explain if you actually know).

The behaviour we often see is that users come in with something like 16GB or maybe 32GB of RAM and then they want to throw in a terabyte or two of L2ARC, not understanding the underlying mechanism and the importance of generating meaningful MFU/MRU statistics to allow the eviction routines to pick solidly useful candidates for the L2ARC.

Yes and this would also go outside of iX's recommendations.

The situation I'm outlining here is relatively common and results in L2ARC thrashing as the quality of eviction choices made when ARC is insufficient is a problem.

It would be interesting to understand how you conclude that "quality of eviction choices" is the "problem", and at what point they start and stop being one. In those typical scenarios, what is the actual L2ARC memory overhead that leads to this conclusion?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, if you really want to nitpick, @HoneyBadger actually said "Current OpenZFS needs 96 bytes of RAM per record in L2ARC" but suggests it may have been something else prior,

Fine, let's nitpick. I clearly know it was something else earlier. I was aware that it was substantially larger and that it had been "fixed". But I'm way too lazy to bother knowing the exact number; just like errnos or syscalls, the place for this information is in system headers. So 96 or 70 or 80 or whatever. It's shrunk, but not really by a hell of a lot.

Meanwhile, your exact comment was "ZFS has never 'required much more RAM overhead in ZFS' for L2ARC", which is in direct contradiction (if you accept half/double as "much more") with Jim's rather specific point on this matter, and it's still not clear to me whether you agree with him or not?

I do not consider double to be "much more". But it kinda depends on your conceptualization of "much". Some of us started out computing on systems with 256 bytes of program memory but now think of 256GB as "small". Jim is wrong that cutting memory consumption down by approximately half has somehow "fixed" the L2ARC index issue, because if you look at this from a usability perspective, the problem has actually gotten worse. L2ARC is not a random-access cache pool (which would arguably be a more useful design), so you have to mitigate by keeping the ARC much larger than general computer folks would expect to need. What you want is for the "less valuable" blocks to get pushed out to SSD, but what actually happens with too little RAM is that random blocks get pushed out to the SSD.

If you really want to dig into that level of detail, that structure is documented in the actual source code here https://github.com/openzfs/zfs/blob/master/module/zfs/arc.c (see l2arc_buf_hdr_t) for everyone to see.

I don't have time to dig into the code tonight and I also really don't care, but I believe that structure may not include the ARC structures necessary to maintain it.

Yes and this would also go outside of iX's recommendations.

But users do not read the documentation and do not see the recommendations. They come right to the forums, often having already built the badness, and then we have to give them a bit of a rude awakening. This is unfortunate but it happens on a nearly daily basis.

It would be interesting to understand how you conclude that "quality of eviction choices" is the "problem", and at what point they start and stop being one. In those typical scenarios, what is the actual L2ARC memory overhead that leads to this conclusion?

Quality of eviction choices has almost nothing to do with L2ARC memory overhead. The factors are the type of data on the pool, the access patterns, and whether or not the ARC is sufficiently large to gather statistics on the access patterns. Having excessive L2ARC will rob the system of ARC, meaning you end up with poorer eviction choices. It almost sounds to me like you're overly focused on the caching aspect rather than on the issue of making good choices about what to cache. You don't just want to cache everything blindly. You want to make intelligent choices. This allows you to read back in efficiently. Consider this on one of my servers (which isn't even really particularly good):

kstat.zfs.misc.arcstats.l2_write_bytes: 426708296192
kstat.zfs.misc.arcstats.l2_read_bytes: 3832722060800

We're talking about almost a 10:1 read:write ratio there, and that's not anything too terrible. Could be better, though. But my point is, consider this fellow from a random Internet post:

kstat.zfs.misc.arcstats.l2_read_bytes: 1152794672128
kstat.zfs.misc.arcstats.l2_write_bytes: 27096209368064

This guy has a read:write ratio worse than 1:20, which means that there's just a flood of useless writes being sent out to SSD. This post is from 2014, so that's likely an MLC SSD that's just being burned through, writing useless data out there. This is an example of "quality of eviction choices" being extremely poor. There are a few other weird and worrying things in his stats, but this one should be obvious to almost anyone, I hope.
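
If you want to check your own ratio, a quick one-liner does it (CORE sysctl names shown; on SCALE the same counters live in /proc/spl/kstat/zfs/arcstats):

sysctl -n kstat.zfs.misc.arcstats.l2_read_bytes kstat.zfs.misc.arcstats.l2_write_bytes | \
    paste - - | awk '{ printf "L2ARC read:write ratio = %.1f : 1\n", $1 / $2 }'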
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
FWIW, here's my setup, without jumping too much into the tit-for-tat.
I have ~150TB of space, an Optane pool (~1TB) for my VMs, and I am using ~70TB of space on my HDD pool.
Most of this is movies/TV shows/pictures/ISOs/personal documents, etc. Typical homelab stuff.

I have several scripts running, moving files around, adding metadata to mp4 files, re-encoding movies. A lot of mixed sequential read and write traffic, like 3 or 4 media files at a time. Plus Internet Archive torrents, Storj (10TB) and a bunch of other random workloads.

I am explaining all of this for perspective. I think I am much more of a "power user" when it comes to "home lab use". I have 256GB of RAM, but I am using ~100GiB of it for VMs and the like (KVM on SCALE)... So I thought adding an L2ARC with some Optane P1600Xs would help me. I bought two of those 118GB deals that were going for like $70 a while back.
My L2ARC hit rates are between 1% and 20% at best, with some random spikes to ~40% for very short periods of time. Quite frankly, unless you can prove with some data otherwise, L2ARC is kinda a waste of time and money for most workloads people here in the community forums will need.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
unless you can prove with some data otherwise, L2ARC is kinda a waste of time and money for most workloads people here in the community forums will need.

That's the short way of saying it.

If you understand the mechanism (most users do not) and have workloads that would benefit, as demonstrated by stats or realistic theory, then that's great. A lot of the time, though, when you look at ARC stats, you either have too much or too little ARC. If you're getting, for example, 99.5% hit rates in ARC, you probably have enough ARC, and whatever dribbles out to the L2ARC is probably going to result in a low hit rate. Sure, this might do something for you, but for most users it would be more efficient just to re-fetch the missed content from the pool. If you're truly really busy and need that last little bit, then maybe L2ARC wins there.

I've said it often enough: L2ARC is really a thing for when you need to solve some performance problem. If you have a pool that is flatlining at 100% busy and is performing poorly, that could well be a winning case for a properly sized ARC coupled with L2ARC. Most forum users do not have this, or any of the several other similar resource-starvation scenarios that would benefit from L2ARC. But everyone's situation is different. Trust the statistics to help you know.
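
To put a number on that example: at a 99.5% ARC hit rate, only 5 reads in 1,000 ever even reach the L2ARC or the pool. Even if the L2ARC caught half of those, you've gone from 5 pool reads per 1,000 to 2.5 – a rounding error for most workloads, which is why the statistics, not the mere presence of the device, should drive the decision.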
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
@jgreco well, if after all that wrangling we're left arguing about whether "double" constitutes "much more", then frankly life is too short. ;-) But I'm glad we seem to agree on the numbers, and that the overhead was larger in previous versions of ZFS. And to stay with the example you chose (from another forum, from 2014 (?)), you will have noticed as well that the L2ARC RAM overhead in that example is 56 megabytes. So even from that example it would be tempting to conclude that L2ARC memory overhead is unlikely to be a real-world factor vs all the other parameters in play – certainly when staying within the iXsystems recommendation of L2ARC at most 5-10x RAM (ARC?).

In other words, I think it would make sense to steer focus away from the seemingly anecdotal "you must max out your RAM / have at least 64GB before considering L2ARC" towards a more data-driven conversation around working sets and how to read and work with both the metrics and the parameterization. I mentioned a couple in post #21, and I have myself seen those having a dramatic effect on L2ARC performance. I don't see anyone talking about them here though. And as mentioned, the default values are unlikely to deliver full potential, including l2arc_headroom, which very much should influence the "quality" of blocks evicted into L2ARC, as you guys are on to here.

With respect to usage patterns, it also makes sense to consider the secondarycache setting per dataset. As an example, it may make sense to lean towards metadata (only) for datasets dominated by sequential reads (e.g. media content) – this works well, and you can sort of achieve a "special vdev light" this way (for reads, but not writes), which can speed up certain operations where sparsely accessed metadata would otherwise have been evicted from ARC and fetched from the spinning array, orders of magnitude more slowly.
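
For example (pool/dataset names here are just placeholders):

zfs get secondarycache tank/media            # current value: all (default), metadata or none
zfs set secondarycache=metadata tank/media   # only metadata from this dataset is eligible for L2ARC
zfs set secondarycache=all tank/vms          # leave hot, random-access datasets at the default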

There are many other usage patterns as well of course as we all know.

 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
In other words, I think it would make sense to steer focus away from the seemingly anecdotal "you must max out your RAM / have at least 64GB before considering L2ARC" towards a more data-driven conversation around working sets and how to read and work with both the metrics and the parameterization.

It's not "anecdotal". It's a "practical guideline" for new users who do not have an understanding of how the subsystem works.

Once you know the subsystem well enough to have a "data driven conversation around working sets" then you are set up to figure out your local requirements based on actual ground truth yourself. Since this is different for virtually every TrueNAS instance, it is very difficult to drag new users through a class on a highly complex topic to teach them what they need to know. If you are volunteering to be the person that is going to do that, then by all means, stop babbling about this in this thread and instead write some detailed documentation that is accessible to beginners but also sufficiently detailed to drag them towards the finish line, or actually invest time and energy in appropriate threads to help individual users get up to speed. What you're writing suggests that you feel that the guidance we provide isn't "good enough". The way you fix that on a forum is to do it better. No one here is interested in performing to the standards that you define as appropriate. We all participate because we want to. When someone hires participants here to provide support, then they can define where the focus should be and what sort of "data driven conversations" should look like. Until then, the heck with that.
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
"Not good enough" is very loaded and entirely subjective. Who am I to judge? I am suggesting some changes to the "practical guidelines" and have laid out the reasoning around it and some concrete suggestions too. But it seems difficult to discuss the technology without slipping back to "who knows best" and prickly questions of authority. In any case I think the case has been made, and outside of that I am by no means questioning your or anyone else's commitment (18,474 messages speaks to that...). Have a nice day all.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I am suggesting some changes to the "practical guidelines" and have laid out the reasoning around it and some concrete suggestions too.

I haven't seen anything practical from you. The first problem is generally disabusing new users of their varying misconceptions about how they think a cache should work, because most of them expect that they can make up for a lack of ARC by adding SSD cache. Lots of stuff works like that, but ZFS doesn't. If you have practical suggestions to get users to rapidly understand how L2ARC works with and interacts with ARC, that's great.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
In defense of @jgreco, here's some real evidence in favor of his argument: my homelab environment. Sample size: 2 days. Relatively light workload. I'll kick it up to 11 for a few days and report back.

Both my ARC and L2ARC are populated nicely..

[screenshot: ARC and L2ARC sizes]


My ARC hits are fantastic..
[screenshot: ARC hit rates]


My L2ARC hits? Abysmal.
[screenshot: L2ARC hit rates]
 


rungekutta

Contributor
Joined
May 11, 2016
Messages
146
Thanks. Which argument would that be? Your ARC is way larger than 64GB, so if anything this shows that that is no silver bullet, but rather that L2ARC performance depends on many other factors. Your hit rate is indeed intriguingly low… what type of data goes into it? What are your settings for l2arc_write_max and l2arc_headroom?
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
The argument I see is that L2ARC is not a magical extension of the ARC, and that you can't improve on perfection. It is dubious that tweaking settings would make the L2ARC useful if the ARC already does all the work.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Thanks. Which argument would that be? Your ARC is way larger than 64GB, so if anything this shows that that is no silver bullet, but rather that L2ARC performance depends on many other factors. Your hit rate is indeed intriguingly low… what type of data goes into it? What are your settings for l2arc_write_max and l2arc_headroom?
The argument that L2ARC is pointless for homelab users, because hit rates are generally going to be low and not going to help performance... :wink:

I didn't want to hear or accept that argument either. That's why I have an L2ARC.
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
It would seem logical that with enough memory, a massive ARC and a very high hit rate, L2ARC can't add much. So, almost conversely, the main benefit may be when you don't have enough ARC to go around. You probably still need to bump l2arc_write_max and l2arc_write_boost to something more meaningful (at least a factor of 10 from the default, depending on the speed of your SSD – be careful that the SSD doesn't become the new bottleneck), and to maximise the chances of L2ARC being fed with relevant stuff, set l2arc_headroom=0. This enables the entire ARC to be scanned for the most relevant data to evict to L2ARC, but it eats CPU, so keep an eye on that. The default value is 2, which means only the tail of l2arc_write_max*2 of ARC is scanned on each pass.
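
As a minimal sketch on SCALE (the values are illustrative starting points only, they do not persist across reboots unless added as init/sysctl entries, and CORE exposes the same knobs as vfs.zfs sysctls):

echo 67108864  > /sys/module/zfs/parameters/l2arc_write_max    # 64 MiB per feed pass instead of the 8 MiB default
echo 134217728 > /sys/module/zfs/parameters/l2arc_write_boost  # allow a faster warm-up while the L2ARC is still cold
echo 0         > /sys/module/zfs/parameters/l2arc_headroom     # scan the whole ARC tail for candidates (costs CPU)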
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
This is not a criticism, nor meant to be snarky, just an observation. The irony here is that you are arguing against generalization when it comes to the general advice on ARC-to-L2ARC sizing.
Now you are citing general wisdom about L2ARC tuning and when and why you should turn knobs and tunables.

The worst generalization you cite is that a high ARC hit rate will yield a low L2ARC hit rate.

It’s actually not even quite that simple. You can have a low ARC hit rate in the 70s or something, and yes in that situation an l2arc will probably help.

But you can also have a 99% hit rate and still see benefits from the L2ARC. There are situations where you won't, as you've said, but that is not universal. That's sorta the crux of my counter-point here.

When your system is busy enough, a 99% ARC hit rate can still coexist with high L2ARC hit rates. This is generally achieved through multiple parallel clients accessing the system. But that's also not universal.

Home labs are not that.
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
Mostly agree, but the main point I've been trying to get across is that it isn't as simple as "with less than 64GB ARC, L2 will be pointless", and to add some nuance to this along with some practical input as to what knobs to turn and look at, as a starting point, depending on the scenario (along with what happens if you turn the knobs too far). I don't see how one or several clients has anything to do with it; surely it depends on usage patterns, and one client could behave like the aggregate of 20 – particularly if that one client is a hypervisor with many running VMs (not unusual in homelabs).

Anyway, I think we may have beaten this topic to death by now…
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
it isn't as simple as "with less than 64GB ARC, L2 will be pointless"
This is NOT the general advice that is typically passed around. Said general advice is that "with less than 64 GB RAM, L2ARC could be harmful" (by evicting ARC). The same applies to an L2ARC that is oversized in proportion to ARC.

Tweaking tunables to "make it work" in some specific circumstances which fall outside the general recommendations is a very nice idea… if you have a good grasp of your workload and of what these tunables do. Otherwise it could be a dangerous idea.

The general advice "L2ARC = 5 to 10 * ARC (RAM), with a minimum of 64 GB RAM" is at least considered safe for users to follow, without tweaking, though possibly unhelpful if the ARC is already large enough.
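
Back-of-the-envelope, assuming something like the 96 bytes of header per L2ARC record discussed earlier: 10 x 64 GB = 640 GB of L2ARC holds roughly 5 million 128 KiB records, i.e. about 0.5 GB of headers, under 1% of RAM; with 16 KiB records it is about 8 times that, a few GB and around 6% of RAM – noticeable, but still bounded. That is largely why the 5-10x cap is considered safe even without any tweaking.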
 