vfs.zfs.arc_shrink_shift in 8.3

Status
Not open for further replies.

datnus

Contributor
Joined
Jan 25, 2013
Messages
102
Hi all,
Is there vfs.zfs.arc_shrink_shift in 8.3?
I'm unable to see it with sysctl -a.

Tks so much.
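A quick way to check this on any given box (FreeBSD shell commands; no output from the grep means no ARC tunables are exposed on that release):

```shell
# List every ARC-related ZFS tunable this kernel exposes.
sysctl -a 2>/dev/null | grep 'vfs\.zfs\.arc'
# Query the specific knob; on 8.3 this is expected to fail with an
# "unknown oid" error, matching what the thread reports.
sysctl vfs.zfs.arc_shrink_shift
```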
 

datnus

Then, is there any equivalent sysctl to control how much of the ARC is reclaimed each time?
Thanks
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
No, I don't think so. That's why they added arc_shrink_shift ....
 

datnus

And FreeNAS doesn't have arc_shrink_shift?
Is there any workaround using another sysctl?
 

jgreco

No, I think you pretty much need to use arc_shrink_shift, but it isn't on 8.3. I don't recall whether it is exposed on 9 but I'm pretty sure it is in there somewhere. If you don't see it exposed, feel free to submit a bug report since it probably ought to be.
 

datnus

I manually ran "zfs set zfs:zfs_arc_shrink_shift=10 Data123", but how can I tell whether it has any effect?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Well, zfs get all will list all of the ZFS properties available to a pool. On the latest FreeNAS version that property doesn't exist, so either it isn't in FreeNAS or it's not a ZFS property. I will say that I'm not aware of any ZFS property having a colon in its name, so it's probably a sysctl. And jgreco covered that earlier when he said the sysctl doesn't exist.
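The reasoning above (colons never appear in ZFS dataset property names, so a colon suggests an OS-level tunable) can be sketched as a one-liner. This is just an illustration of the heuristic, not an official naming rule:

```shell
# Heuristic from the post above: ZFS dataset properties don't contain
# colons, so a colon suggests a sysctl-style tunable, not a property.
name="zfs:zfs_arc_shrink_shift"
case "$name" in
  *:*) echo "$name looks like an OS tunable, not a zfs property" ;;
  *)   echo "$name might be a property; try: zfs get $name <pool>" ;;
esac
```

On the name datnus used, this prints the "OS tunable" branch, which matches cyberjock's conclusion.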
 

cyberjock

Strange, it used to be in 8.0.1.

I'm 99% sure it didn't exist as a ZFS property, just because properties were never formatted like what you listed. As for a sysctl, it could be something that was deprecated for various reasons. There are *tons* of sysctls that ZFS used to use back in previous versions that no longer exist because they've been superseded by other functions.

A good example of that is the old vdev caching. You used to specify caching on a per-disk basis, but when the advanced prefetch caching system came about, the old vdev caching system was disabled entirely. I even tried enabling it on my server to see if it matters today. I could never find a situation where it mattered, because prefetch is just so good. However, there are situations where you might not want prefetch, so there is real demand for disabling prefetch and using the old vdev caching. I'm really wondering if some behavioral change to ZFS left no use whatsoever for the sysctl you are mentioning, so it's just not around at all.

One of the biggest problems with tuning ZFS is that there's so much outdated, inaccurate, or flat-out inappropriate information on how to tune it that there's no go-to guide anywhere. There are also tons of tunables that only work on one particular fork of ZFS; there's stuff for Nexenta that will never be in FreeBSD because of technological differences between the OSes. I've spent many hours researching functions only to find out later they were disabled or superseded by something new. When I found the vdev caching I thought I'd hit a potential performance boost for systems with lots of RAM. Then, after hours of reading, I found one small website explaining how vdev caching went from being the "big thing" to being almost useless in just short of 100% of situations.

Just a short google of "zfs_arc_shrink_shift" pulls up almost exclusively Nexenta stuff, with a little bit of illumos.

The first result I got was http://www.cupfighter.net/index.php/2013/03/default-nexenta-zfs-settings-you-want-to-change-part-2/, part two of "default Nexenta/ZFS settings you want to change", where he discusses settings that "need to be changed as soon as you install or receive your system". Well, the reality is that FreeBSD's ZFS is designed to be extremely dynamic and should require only minimal tuning (if any) on all but the most purpose-built ZFS installations. So I tend to think that Nexenta doesn't provide a good all-around dynamic experience with their ZFS implementation while FreeBSD does. But don't quote me, because I don't try to learn much about ZFS beyond the FreeBSD and Linux implementations. I do end up touching on Oracle/Sun and Nexenta, but I don't research them heavily.
 

cyberjock

Just for sh*ts and giggles I rebooted my Mini to wipe the ARC. I then loaded up 3 ssh screens: one had top, another had zpool iostat 1, and in the third I ran dd if=/dev/zero of=/mnt/tank/testfile bs=4m count=5000.

The box has 16GB of RAM, just for reference.

When this started the ARC was just 2400MB. It grew at a rate that was 1:1 with the amount of data written to the pool. This was expected, as the ARC grew because it had new data to cache: the ARC could grow, had no reason not to, and caching the newly written data carries basically no performance penalty.
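For anyone reproducing this, the ARC size can be watched directly. The kstat name below is the standard FreeBSD sysctl for ARC size, though it's worth verifying on your own release:

```shell
# Print the current ARC size (in bytes) once a second while the dd
# write (or the later delete) runs in another terminal.
while :; do
  sysctl -n kstat.zfs.misc.arcstats.size
  sleep 1
done
```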

Then I deleted the file. If the ARC freed memory the way the website linked above describes, it would shrink gradually. Instead, the ARC total immediately shrank from 10G to 244M, so clearly there is a function releasing the deleted data. That was a roughly 99% decrease in ARC size in literally 1 second. Now, back to that website.

So far so good... kind of. That webpage claims the ARC will evict up to 1/32nd per second. Well, that 99% drop proves FreeBSD doesn't follow any limitation like that. But that's okay.

Next I changed the blocksize from the default (128KB) to 2KB. The whole reason for the change is that more blocks will be stored in RAM, and with far more blocks to free, the penalty for releasing that much ARC in one shot should be borderline debilitating.

I noticed 2 things during the write. The dd write throughput decreased significantly: where the first write test took 92 seconds (226MB/sec), the second took 295 seconds (70MB/sec). Pure suck, but expected. CPU usage also skyrocketed because it has to do checksums for a much larger number of blocks.
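The throughput figures check out from the dd parameters alone (5000 writes of 4 MiB, reported here in decimal MB/s, which lands within a MB/s of the numbers quoted):

```shell
# Sanity-check the quoted throughput: dd bs=4m count=5000.
BYTES=$((5000 * 4 * 1024 * 1024))               # 20971520000 bytes
echo "test 1: $((BYTES / 92 / 1000000)) MB/s over 92 s"
echo "test 2: $((BYTES / 295 / 1000000)) MB/s over 295 s"
```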

Next I deleted the file again. If this ARC-freeing "penalty" is as serious as that webpage makes it out to be, I'd expect something pretty nasty. Well, it took almost 40 seconds to delete, and the ARC slowly crept down as the delete proceeded. So clearly FreeBSD is potentially susceptible to the problem. I'm just not sure whether FreeBSD's implementation has deemed this a non-problem, since the default is 128KB and shouldn't be changed unless you understand all of these nuances. Noteworthy: the 2KB block size resulted in the same 20GB file actually consuming almost 119GB of disk space. Haha.
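A back-of-the-envelope calculation shows why the small blocksize hurts so much: the same file becomes 64x as many blocks, each carrying its own checksum and metadata (pure arithmetic, not measured on the box above):

```shell
# Block-count explosion for the 20000 MiB test file (dd bs=4m count=5000).
FILE_MIB=$((5000 * 4))
echo "@128K blocksize: $((FILE_MIB * 1024 / 128)) blocks"
echo "@2K blocksize:   $((FILE_MIB * 1024 / 2)) blocks"
```

Freeing ten million blocks from the ARC one by one, rather than 160 thousand, is where the 40-second delete comes from.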

Interesting test, but not sure I expected anything less. It proved that there *may* be problems in store for someone making a single bad ZFS property change. But of course I keep telling people not to touch shit they don't understand, so if you want to be dumb and play with a knob you don't understand you'll really hate the consequences.

As for the OP, I think you'll find the test results interesting, but they're somewhat meaningless since that sysctl doesn't exist anyway. :P I just wanted to see if the behavior that webpage describes for Nexenta is actually a potential problem with FreeBSD's ZFS implementation. Turns out.. it is.
 

datnus

Then, can I conclude that the ARC really affects write performance, and that arc_shrink_shift should help but isn't available?

I wonder how version 9 deals with this problem?
 

cyberjock

arc_shrink_shift could help for some situations. But those situations are the result of self-inflicted wounds. I'll explain...

I don't think you can argue either way with the arc_shrink_shift article and/or the data I provided. The real limitation appears to be the ability to delete data from the pool itself. With a properly configured blocksize I don't see it being a problem. The only time I *think* it would be useful is if you were completely off your rocker and had extremely large files in a pool with an inappropriately small blocksize. Since someone tuning ZFS is supposed to be taking everything into consideration (which is a feat in and of itself, even for a ZFS expert), I think it's a fool's errand to consider the setting useful in any way. It might be useful if you're being irresponsible with tuning, but if you're already being irresponsible with tuning, the ARC's ability to vacate deleted data is going to be the least of your worries (as I demonstrated with the first test).

It's also very possible that the feature was removed because CPU processing power has grown so much since ZFS v1 that forcing a per-second limit on how much the ARC frees became not only unnecessary but actively hurt ARC performance overall.

Looking at his first post, he seems to focus on making numbers look good rather than on real empirical evidence that performance is gained.

Example 1:

He recommends the zfs:l2arc_write_boost setting be changed from 8388608 to something 10x that size, the reason being that it helps fill the L2ARC faster. Well, sure, filling the L2ARC faster is great... for numbers. It makes things *look* better. But that's not what the setting is for.

When you boot up the system the L2ARC is empty. The "boost" setting lets you do an initial fill of the L2ARC at a higher rate than when it's actually full. When data is being vacated from the ARC, ZFS makes a choice: either it keeps the data in the L2ARC because it's met that set of requirements and tunables, or it vacates it from the ARC without pushing it to the L2ARC. On bootup the usage of the server is almost always unique (think VMs spinning up one by one). That load resembles nothing like the *real* daily workload, and the data that ends up in the L2ARC will almost certainly not be the data you *want* there. So you've potentially begun to fill the L2ARC with data that you won't use again.

Now things get tricky.

Once the L2ARC is full, the "boost" setting no longer serves any function; you rely on the standard L2ARC tunables. Now that actual regularly requested data is in RAM and potentially being vacated from the ARC, a more difficult choice has to be made: ZFS will either vacate the data from the ARC without sending it to the L2ARC, or decide it needs to go into the L2ARC. BUT.. once it has decided it wants to store data in the L2ARC, there's a whole set of tunables determining at what threshold you actually *remove* data from the L2ARC to make room for this "allegedly" more important data. That threshold is set somewhat high, for 3 reasons:

1) You don't want the L2ARC constantly churning data, as any write to the L2ARC has a performance penalty and hurts the L2ARC's ability to do its job.
2) You don't want to wear out the L2ARC prematurely with constant writes.
3) If you've done your job as the ZFS professional you're supposed to be, your L2ARC should be an appropriate size to store enough data to keep your file server cruising along at full speed.

So now, thanks to being aggressive with filling the L2ARC, you've actually hurt your server's performance. ZFS must weigh the penalty of evicting data from the L2ARC to add the new data against simply discarding the data completely, and in some situations you'll end up with a server that starts off fast, ramps down in speed, then has to wait for the L2ARC to fill with *actual* useful information. A full L2ARC of unimportant data is actually more costly to the performance of your pool than not having an L2ARC at all. If he had left the defaults alone he would have potentially done 3 things: 1) not filled his L2ARC with useless data, 2) not had to deal with the performance penalty from #1 that will likely take hours (or days) to resolve, and 3) not added a bunch of writes to his L2ARC that didn't need to happen in the first place.
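For a sense of scale, here is the fill-time arithmetic behind the "boost" tradeoff. The 120 GB device size is hypothetical, the 8 MB/s figure is the 8388608-byte default quoted above, and this simplification treats write_boost as the whole cold-fill rate (on FreeBSD it is actually added on top of l2arc_write_max during the cold fill):

```shell
# Time to fill a hypothetical 120 GB L2ARC at the default write rate
# versus the 10x value the article recommends.
L2ARC_MB=$((120 * 1024))
echo "default 8 MB/s : $((L2ARC_MB / 8 / 3600)) hours"
echo "boosted 80 MB/s: $((L2ARC_MB / 80 / 60)) minutes"
```

Which is exactly the appeal, and exactly the trap: the boosted device fills in under half an hour, with whatever happens to be flowing through the ARC right after boot.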

This is a simple fool's errand: if you watch the L2ARC fill faster, you pat yourself on the back for a job well done because you've got these cool L2ARC fill% numbers that make you look like a hero. But secretly you've hurt your server's overall performance. But don't tell him that.

Example 2:

He discusses making resilver performance more aggressive than actual server load. Well, in ZFS you're always trading one thing for another; if it weren't a tradeoff, it would be the default, because there would be no downside. He recommends setting resilver_delay to 0. Normally a server will delay resilver I/O when a read or write comes in, because it anticipates that you will probably issue another read or write within a short period. Setting this to 0 means ZFS will simply queue up your reads and writes and service them mixed in with resilver requests. This will hurt user performance. In some situations it may be unnoticeable; in others it would be absolutely detrimental. So it's not something someone should recommend willy-nilly in 2 whole sentences.
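For reference, these are the knobs being discussed as they were exposed on FreeBSD at the time. The defaults shown are the historical stock values and are worth double-checking against your own release before touching anything:

```shell
# /etc/sysctl.conf fragment showing the STOCK defaults (not the
# article's recommendation; resilver_delay=0 is what he proposes,
# while the default deliberately yields to user I/O).
vfs.zfs.resilver_delay=2   # ticks resilver I/O is delayed per user I/O
vfs.zfs.scrub_delay=4      # same idea, for scrubs
```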

Then he uses the phrase that I find is the tell-tale of a fool. He says "in almost all cases". Well, if it were so great that it was optimal for "almost all cases", it would be the damn default. It's not, and there are some damn good reasons it's not. Not the least of which is that... gasp.. it's not optimal for "almost all cases". The guy seems pretty wanton about just doing shit and justifying it with a blog post, and then unsuspecting people like yourself who actually want help get burned because you made a few changes he recommended and now your server is a POS when you want to use it. But in your eyes you followed his "expert" opinion and things are now worse, so clearly more tuning is necessary.

Another thing to consider is how hard you want to work your drives when you're already down on redundancy. Some people would prefer to work the other disks in the vdev at 100% of their maximum; others would prefer something less taxing that still gets the job done. To me, taxing your remaining disks at 100% is just stupid, asking to fail another disk at precisely the time when another failed disk is what you don't want. (We all know the statistics about failing a second drive due to the load placed on the remaining drives during resilvers.) So why would you *deliberately* subject your vdev to that kind of load, potentially for a day or more, all in the name of restoring redundancy faster? That's literally exactly what he is doing. If you are *that* concerned about resilvering, there's a solution: smart ZFS admins would have created the vdev with additional redundancy to start with. RAIDZ3 exists for a very, very good reason. There is also a very good reason why ZFS has "mirrors" (meaning you can mirror the original disk as many times as you want, as opposed to a RAID1 that is limited to 2 disks).

Overall, I don't think he really has thought through his choices. He's done exactly what he wanted to do.. made resilvers go faster and made the L2ARC fill faster. Those are things that look great on the surface, but probably aren't the best choices when you stop and consider the marvelous complexity of what you are doing and where you are trying to go.

What I do know is that he definitely hasn't explained to the reader why he was doing what he was doing (just "this is awesome and you should do it now!"), that there are tradeoffs, or how *you*, the reader, can determine *if* you should play with a tunable or not. He just tells you BS like "almost all cases", and noobs have no choice but to eat that stuff up.

Sorry it's a long wall. I didn't want to write something short like "he's an idiot" or "don't do that, do this instead". That's exactly what that guy did, and it makes me sick to know that people like you, datnus, do a Google search and end up having to read that kind of garbage.
 

jgreco

Not to be a negative nelly, but it's been well known for a while now that the L2ARC fill rate default is somewhat dated (i.e. too small) for pretty much every reasonable use case. This is a function of several variables, including the relative explosion of system memory sizes making monster ARC/L2ARC feasible, the increase in performance of SSD devices, etc. I'm pretty sure jpaetzel suggested tuning it on FreeNAS at one point. However, you're spot on about not just randomly twiddling stuff.
 