ARC not used efficiently

Status
Not open for further replies.

n1ko

Dabbler
Joined
Aug 27, 2013
Messages
10
With FreeNAS this has been kind of a constant woe for me. A year ago I posted this: http://forums.freenas.org/index.php?threads/freenas-not-using-all-of-the-ram.14576/#post-70620 (32GB of RAM not used for ARC).

I upgraded my rig last month to a proper Xeon + Supermicro mobo + 128GB setup, but the ARC still won't function like it should.

The ARC size seems to be way under what it should be. I'm only running storage services (NFS + iSCSI) on the box, so I would like to use the RAM for ARC. This is what I'm seeing at the moment:

ARC Size: 76.64% 91.97 GiB
Target Size: (Adaptive) 91.67% 110.00 GiB
Min Size (Hard Limit): 91.67% 110.00 GiB
Max Size (High Water): 1:1 120.00 GiB

And this is what especially worries me:
CACHE HITS BY CACHE LIST:
Anonymously Used: 7.84% 5.94m
Most Recently Used: 23.33% 17.69m
Most Frequently Used: 65.29% 49.51m
Most Recently Used Ghost: 0.63% 481.14k
Most Frequently Used Ghost: 2.91% 2.21m

Over 3% ghost hits, even when there's plenty of RAM.

When I boot up the box I usually see the MFU warmed first to around 15-20GB. After a day or two of usage it goes down and stays down. This is what I have now:

last pid: 70813; load averages: 0.06, 0.05, 0.06 up 4+03:02:18 12:02:42
50 processes: 1 running, 49 sleeping
CPU: 0.1% user, 0.0% nice, 2.0% system, 0.2% interrupt, 97.7% idle
Mem: 655M Active, 10G Inact, 95G Wired, 32K Cache, 88M Buf, 19G Free
ARC: 92G Total, 693M MFU, 90G MRU, 16K Anon, 449M Header, 494M Other

So, any ideas what could be wrong? According to Oracle ( https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/ ), 7/8 of RAM should be used for ARC by default. I'm not close to hitting that.
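(For reference, this is roughly how I'm comparing the configured limits against the live ARC size; it's just plain sysctl, so nothing FreeNAS-specific, and the kstat names may differ slightly between versions:)

# compare the configured ARC limits with what the ARC is actually using
sysctl vfs.zfs.arc_min vfs.zfs.arc_max
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c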

And is there any way to prioritise the MFU over MRU?

Would adding a 512GB or larger L2ARC help me better utilise the RAM?


Full specs are:
Xeon 2609v2
Supermicro X9SRL-F
4x 32GB of LRDIMM ECC
3x IBM m1015
20x ST4000DM000 in 2x 10-disk raidz2

Tunables:
ZFS Tunable (sysctl):
kern.maxusers 384
vm.kmem_size 130996502528
vm.kmem_size_scale 1
vm.kmem_size_min 0
vm.kmem_size_max 130996502528
vfs.zfs.l2c_only_size 0
vfs.zfs.mfu_ghost_data_lsize 96466924032
vfs.zfs.mfu_ghost_metadata_lsize 267013632
vfs.zfs.mfu_ghost_size 96733937664
vfs.zfs.mfu_data_lsize 606307840
vfs.zfs.mfu_metadata_lsize 17950208
vfs.zfs.mfu_size 727090176
vfs.zfs.mru_ghost_data_lsize 19736062976
vfs.zfs.mru_ghost_metadata_lsize 1330995200
vfs.zfs.mru_ghost_size 21067058176
vfs.zfs.mru_data_lsize 96687334912
vfs.zfs.mru_metadata_lsize 81828864
vfs.zfs.mru_size 97041267200
vfs.zfs.anon_data_lsize 0
vfs.zfs.anon_metadata_lsize 0
vfs.zfs.anon_size 16384
vfs.zfs.l2arc_norw 0
vfs.zfs.l2arc_feed_again 1
vfs.zfs.l2arc_noprefetch 0
vfs.zfs.l2arc_feed_min_ms 200
vfs.zfs.l2arc_feed_secs 1
vfs.zfs.l2arc_headroom 2
vfs.zfs.l2arc_write_boost 400000000
vfs.zfs.l2arc_write_max 400000000
vfs.zfs.arc_meta_limit 32212254720
vfs.zfs.arc_meta_used 1462661512
vfs.zfs.arc_min 118111600640
vfs.zfs.arc_max 128849018880
vfs.zfs.dedup.prefetch 1
vfs.zfs.mdcomp_disable 0
vfs.zfs.nopwrite_enabled 1
vfs.zfs.zfetch.array_rd_sz 1048576
vfs.zfs.zfetch.block_cap 256
vfs.zfs.zfetch.min_sec_reap 2
vfs.zfs.zfetch.max_streams 8
vfs.zfs.prefetch_disable 0
vfs.zfs.no_scrub_prefetch 0
vfs.zfs.no_scrub_io 0
vfs.zfs.resilver_min_time_ms 3000
vfs.zfs.free_min_time_ms 1000
vfs.zfs.scan_min_time_ms 1000
vfs.zfs.scan_idle 50
vfs.zfs.scrub_delay 4
vfs.zfs.resilver_delay 2
vfs.zfs.top_maxinflight 32
vfs.zfs.write_to_degraded 0
vfs.zfs.mg_noalloc_threshold 0
vfs.zfs.mg_alloc_failures 8
vfs.zfs.condense_pct 200
vfs.zfs.metaslab.weight_factor_enable 0
vfs.zfs.metaslab.preload_enabled 1
vfs.zfs.metaslab.preload_limit 3
vfs.zfs.metaslab.unload_delay 8
vfs.zfs.metaslab.load_pct 50
vfs.zfs.metaslab.min_alloc_size 10485760
vfs.zfs.metaslab.df_free_pct 4
vfs.zfs.metaslab.df_alloc_threshold 131072
vfs.zfs.metaslab.debug_unload 0
vfs.zfs.metaslab.debug_load 0
vfs.zfs.metaslab.gang_bang 131073
vfs.zfs.ccw_retry_interval 300
vfs.zfs.check_hostid 1
vfs.zfs.deadman_enabled 1
vfs.zfs.deadman_checktime_ms 5000
vfs.zfs.deadman_synctime_ms 1000000
vfs.zfs.recover 0
vfs.zfs.txg.timeout 5
vfs.zfs.max_auto_ashift 13
vfs.zfs.vdev.cache.bshift 16
vfs.zfs.vdev.cache.size 0
vfs.zfs.vdev.cache.max 16384
vfs.zfs.vdev.trim_on_init 1
vfs.zfs.vdev.mirror.non_rotating_seek_inc 1
vfs.zfs.vdev.mirror.non_rotating_inc 0
vfs.zfs.vdev.mirror.rotating_seek_offset 1048576
vfs.zfs.vdev.mirror.rotating_seek_inc 5
vfs.zfs.vdev.mirror.rotating_inc 0
vfs.zfs.vdev.write_gap_limit 4096
vfs.zfs.vdev.read_gap_limit 32768
vfs.zfs.vdev.aggregation_limit 131072
vfs.zfs.vdev.scrub_max_active 2
vfs.zfs.vdev.scrub_min_active 1
vfs.zfs.vdev.async_write_max_active 10
vfs.zfs.vdev.async_write_min_active 1
vfs.zfs.vdev.async_read_max_active 3
vfs.zfs.vdev.async_read_min_active 1
vfs.zfs.vdev.sync_write_max_active 10
vfs.zfs.vdev.sync_write_min_active 10
vfs.zfs.vdev.sync_read_max_active 10
vfs.zfs.vdev.sync_read_min_active 10
vfs.zfs.vdev.max_active 1000
vfs.zfs.vdev.larger_ashift_minimal 0
vfs.zfs.vdev.bio_delete_disable 0
vfs.zfs.vdev.bio_flush_disable 0
vfs.zfs.vdev.trim_max_pending 64
vfs.zfs.vdev.trim_max_bytes 2147483648
vfs.zfs.cache_flush_disable 0
vfs.zfs.zil_replay_disable 0
vfs.zfs.sync_pass_rewrite 2
vfs.zfs.sync_pass_dont_compress 5
vfs.zfs.sync_pass_deferred_free 2
vfs.zfs.zio.use_uma 1
vfs.zfs.snapshot_list_prefetch 0
vfs.zfs.version.ioctl 3
vfs.zfs.version.zpl 5
vfs.zfs.version.spa 5000
vfs.zfs.version.acl 1
vfs.zfs.debug 0
vfs.zfs.super_owner 0
vfs.zfs.vol.mode 2
vfs.zfs.trim.enabled 1
vfs.zfs.trim.max_interval 1
vfs.zfs.trim.timeout 30
vfs.zfs.trim.txg_delay 32
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
So your concern is ... what? That you're merely using most of the ARC instead of every single byte?

For NFS/iSCSI storage services, what's your workload like that you wouldn't expect MFU to sink over time? What happens with those numbers is going to be dependent on the workload; booting a VM might cause a block to be accessed, cached, and all that ... which might meet the definition of "most frequently" during the time when the filer has not warmed up and has seen very little other traffic. In a VM environment, after a while I'd expect it primarily refers to the blocks that are being accessed most frequently, like perhaps the executables being run within the VMs.

http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/

It looks to me like your ARC is reasonably filled. MFU is seeing 65% cache hits even though the SIZE of MFU is only 693M. That seems awesome.
 

n1ko

Dabbler
Joined
Aug 27, 2013
Messages
10
My concern is that it should use all the RAM that's available, and that it would benefit from doing so (looking at the ghost stats). Also, MFU is clearly useful but seems to get replaced by MRU when the cache warms up.

Somehow I think the high hit rate in MFU is due to it still being used but replaced really fast.

Random reads from the box aren't superfast due to raidz2 + slow disks, so I'm trying to compensate with caching. Do you think L2ARC would be helpful?

Oh yeah, and about the disk space: I only have like 1-2TB of VMs and physical machines on it. Around 40TB of the rest is media files. So in this context the MFU for those boxes should be favoured, not the media files that are accessed very rarely.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
My concern is that it should use all the RAM that's available, and that it would benefit from doing so (looking at the ghost stats). Also, MFU is clearly useful but seems to get replaced by MRU when the cache warms up.

That's exactly what ought to happen, I'd think. You end up with a small working set that's accessed frequently, which is marked for retention, and then also lots of recently accessed stuff.

Somehow I think the high hit rate in MFU is due to it still being used but replaced really fast.

It is a combined cache that adapts to the workload. A block that isn't sufficiently interesting to be retained will be dumped if there's ARC pressure in favor of something that may be more useful. So what may come as a shock is that if you have a bunch of shared blocks used by ten thousand VMs which all boot at once, yes, they'll end up in MRU and MFU at various points, but a day after all those VMs boot and those blocks haven't been accessed at all in 23+ hours, they may no longer be in ARC. Expected and sensible behaviour.

Random reads from the box aren't superfast due to raidz2 + slow disks, so I'm trying to compensate with caching. Do you think L2ARC would be helpful?

Well, at least you understand it has been designed to be slow. And yes, in your case, you ought to be able to safely add 256GB of L2ARC, perhaps as much as 512GB, but be aware that L2ARC robs space from the ARC, so you can actually hurt yourself if you get too aggressive on the sizing. You'd likely find a sweet spot around 384GB, but that's just a guess on my part.
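As a back-of-the-envelope sketch of why oversizing hurts (the ~180 bytes of ARC header per cached L2ARC block is an assumption that varies by ZFS version, and the two block sizes are just illustrative):

# rough RAM cost of ARC headers for a hypothetical 512GB L2ARC
l2arc=$((512 * 1024 * 1024 * 1024))
for bs in 131072 8192; do   # 128KiB media records vs. 8KiB zvol blocks
  echo "blocksize $bs: ~$(( l2arc / bs * 180 / 1048576 )) MiB of ARC spent on headers"
done

With big media records the header overhead is trivial, but small zvol blocks can eat a meaningful chunk of the ARC you were trying to help.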

Oh yeah, and about the disk space: I only have like 1-2TB of VMs and physical machines on it. Around 40TB of the rest is media files. So in this context the MFU for those boxes should be favoured, not the media files that are accessed very rarely.

The ARC/L2ARC will not favor sequentially read data unless you specifically tune it to do so. I don't think that's changed lately, but you might want to double-check me on that, simply because I've not been playing with FreeNAS and OpenZFS much in the last year.
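If you want to bias the cache toward the VM data, the per-dataset primarycache/secondarycache properties are one knob that does exist; something like this (the dataset name is a placeholder, and whether it's worthwhile depends on how often the media actually gets read):

zfs set primarycache=metadata tank/media   # cache only metadata for the rarely-read media dataset
zfs set secondarycache=none tank/media     # and keep its data blocks out of any future L2ARC
zfs get primarycache,secondarycache tank/media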

The one tweak I think you could do, aside from adding L2ARC, would be to increase your ARC max size. The 7/8ths rule is inappropriate past perhaps 32GB, and you can actually move to something more aggressive.
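If you do raise it, as far as I know that's a loader tunable (set via the FreeNAS Tunables screen rather than edited by hand); the value below is purely illustrative, and you want to leave a comfortable margin below physical RAM for the OS and the NFS/iSCSI buffers:

# hypothetical example: raise the ARC cap to ~124 GiB (value is in bytes)
vfs.zfs.arc_max="133143986176"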
 

n1ko

Dabbler
Joined
Aug 27, 2013
Messages
10
Is the L2ARC bookkeeping space actually taken from ARC? I always thought that it was not actually related to ARC, rather just "you can't have as much ARC since you are using memory on that". So basically, if the end result is that my ARC gets even smaller while I still have 20GB+ of free RAM, it's not really worth it...

vfs.zfs.arc_max 128849018880

This is 120GB; I bumped it to 130GB now, which is more than I actually have in RAM. Will report back how it goes.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
That seems like a spectacularly bad thing to do; let us know how recoverable it is after the kernel panics...
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That seems like a spectacularly bad thing to do; let us know how recoverable it is after the kernel panics...

LOL


@n1ko - That documentation is for Oracle. It does NOT apply to anything except Oracle stuff. FreeBSD's ZFS isn't based on Oracle's, so what you read doesn't necessarily apply. To be honest, I see nothing particularly wrong with your setup at all. Those numbers look fine. You are welcome to fight ZFS all you want, but it's not designed to take up a bunch of RAM "because it can". Take up too much and you'll find yourself crashing.
 

n1ko

Dabbler
Joined
Aug 27, 2013
Messages
10
@cyberjock, thanks for the input. I'll probably continue to "fight" it, just to understand how it actually works. This is not black magic; everything has a reason and can be tuned, or at least traced back to the source where decisions like the 7/8 rule are made.

I'm pretty sure that if I were to throw more RAM at it, the absolute size of the ARC would rise, and that's why I would like to understand why it's limiting itself in my case. If you have 4-16GB of RAM and you waste 15% of it, it doesn't matter so much, but having 20GB of free RAM sitting there is just ridiculous.
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
What is your swap size?

Can you try to manually disable all the swap partitions one by one (while the system is running) and see what happens?
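Something along these lines should do it (the device name is only an example; swapinfo will show the real ones on your box):

swapinfo -h               # list the active swap devices
swapoff /dev/ada0p1.eli   # example device; disable one swap partition at a time
swapinfo -h               # confirm it is gone before moving on to the next one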
 

n1ko

Dabbler
Joined
Aug 27, 2013
Messages
10
The swap size is the default 2GB per disk (=40GB total). I tried changing that to 0 (to no avail; it seems that it doesn't do anything afterwards). I also manually turned swap off, but that didn't affect anything either.
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
Thank you for checking.

P.S. You know that you do not need 40GB of swap... 1GB per disk (=20GB total) would be more than enough...
 

n1ko

Dabbler
Joined
Aug 27, 2013
Messages
10
Probably even 10GB would be more than enough. The 2GB per disk is the default, and since I didn't tune that when installing the system, that's what I got :)

One of the silly defaults, I guess.
 

n1ko

Dabbler
Joined
Aug 27, 2013
Messages
10
Just an update to this: after 16 days of uptime I've observed that the ARC stabilizes at 94G and free memory hovers around 10-12GB. Not ideal, but I guess it's the best it can do. The actual hit ratio is almost 90%, so it's working pretty well.
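(For anyone curious, the ~90% figure is just hits versus hits + misses from the arcstats counters, roughly like this:)

# rough overall ARC hit ratio from the kstat counters
hits=$(sysctl -n kstat.zfs.misc.arcstats.hits)
misses=$(sysctl -n kstat.zfs.misc.arcstats.misses)
echo "ARC hit ratio: $(( hits * 100 / (hits + misses) ))%"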
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
90% is basically "the green band" for good ZFS performance. So I would just repeat what I said before... everything is working just fine. :)
 