ZFS and SSD cache size (log (ZIL) and L2ARC)


paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
Based on this thread I see no reason to use an entire SSD for a ZIL.
Wear-leveling/write endurance and speed, of course. A cheaper SSD with a half-way decent GC algorithm will benefit from partitioning part of the drive and not ever using the rest of it. As long as you are not regularly writing to the drive faster than it can erase free space, which is likely the case for a lot/most/all? home users, it isn't as important, but it is still beneficial. Not to mention the IOPS contention between writes & reads to the SSD. Though that only matters if it's being pushed that hard; there is a good chance it won't be, and it's still likely faster than without it.

Even with ZFS v28 it is still best practice to mirror the SLOG. I've posted as much a few times, but I'm too lazy to find it again now.

My hunch would be that the system would be overall slower than if you just allocated the entire drive to L2ARC.
Depends on the workload, no? If you are doing a lot of sync writes, which is what the ZIL & SLOG are used for, then it would likely be faster.
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
Depends on the workload, no? If you are doing a lot of sync writes, which is what the ZIL & SLOG are used for, then it would likely be faster.

I think you're right, but with the ZIL and L2ARC sharing bandwidth, you could quickly get into a situation where performance tanks at least some of the time.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Interesting thread. Right now I'm looking into whether it's possible to take a single SSD, say 250GB, and create two partitions on it: 8GB for a ZIL and the remainder for an L2ARC. Based on this thread I see no reason to use an entire SSD for a ZIL. I'm only doing this to play around and see what kind of performance gains I could obtain, if any, for my use. So I'm in search of the commands to add the cache and ZIL manually, since the GUI only accepts the entire drive. Time to continue my search.

EDIT: Found the commands I needed with an easy Google search (the pool is called pool, the SSD is da5):

zpool add pool log da5p1
zpool add pool cache da5p2
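For completeness: the two partitions themselves have to exist first. Assuming a GPT scheme on da5, something like the following would carve out an 8GB slice and leave the rest for cache; the sizes and the slog/l2arc labels here are only an example:

gpart create -s gpt da5
gpart add -t freebsd-zfs -s 8G -l slog da5
gpart add -t freebsd-zfs -l l2arc da5

That yields da5p1 and da5p2, matching the zpool commands above.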

I set up a dual-purpose RAM drive (an ANS-9010 - it's a 5.25" device with 8 RAM slots and SATA ports) and used it as a ZIL and L2ARC simultaneously. Mine was only 16GB, so that may make my results worthless. The L2ARC should be big (40GB+) and the ZIL should be small (4-8GB generally). The ZIL seemed to get "priority" over the L2ARC. Writing was much improved. Reading not so much. My drive was a little small for reading (only 8GB), but it seemed to really help in a RAM-limited scenario (I forced my system to 4GB for the test, otherwise the L2ARC wasn't really being utilized) where stuff isn't in RAM and you aren't writing anything. If you were writing at a high speed, then all bets were off for good read performance.

The ANS-9010 also supports a dual-drive mode (basically it lets you do a RAID0 across the 2 SATA ports for double the performance). As soon as I split them up and used them as separate drives instead of a RAID0, the writes and reads didn't seem to be limited.

The ANS-9010 is limited to about 200MB/sec or so, I believe, for reads and writes, with almost no latency. So it's fast, but might not be "fast enough" since some SSDs can do more than twice that speed.

YMMV.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It is tricky to "force" use of L2ARC. There are numerous tuning factors, and if you have bludgeoned the size of your memory down to nothing, there's less to work with as well.

My impression is that L2ARC works best if just left on and forgotten about for a long period of time. Because FreeNAS sets a very low vfs.zfs.l2arc_write_max, flushing of data from ARC to L2ARC happens kind of slowly. A busy system will be losing lots of stuff out of ARC without writing it to L2ARC once the write_max is exceeded. Lowering the memory increases pressure on the ARC, and basically your system won't behave the way you want. You're better off letting the system use all its RAM and then increasing the write_max to something closer to what your device can tolerate. Remember also that L2ARC consumes space in the ARC.
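For anyone wanting to check what their box is currently doing, the limit and the basic L2ARC counters are readable at runtime; these are the sysctl names I'm aware of on FreeBSD 8/9 ZFS, so treat them as an example and check sysctl kstat.zfs.misc.arcstats for the full list:

sysctl vfs.zfs.l2arc_write_max
sysctl kstat.zfs.misc.arcstats.l2_size kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses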

I thought that the conventional wisdom for ZFS was something more like two times the amount of data pushed out per transaction group, which really ought to only work out to a gigabyte or two in the average case.
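As rough arithmetic (assuming gigabit wire speed of about 125MB/sec and a transaction group every 5-6 seconds): 125MB/sec x 5 sec is roughly 625MB per txg, so two txgs puts you on the order of 1.25GB, consistent with "a gigabyte or two".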
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
It is very tricky. I'm glad that the drive has an activity LED for each of the SATA ports. That was the only way I could know for sure what it was using or not using.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
All good words for sure. In my situation I really doubt I'll see a real benefit from the L2ARC, as I primarily use the NAS for backups and streaming video, nothing with repetitive data reading. The ZIL would benefit me, and once I pop this SSD into my FreeNAS box I'll test it out, hopefully with some consistent results. Oh, and I changed my mind on the size of the SSD: I have a 128GB drive which I can use without impacting the drives in my main machine, so that will be the test unit for me. I still plan to use an 8GB ZIL, with the remainder for an L2ARC.

I didn't think about GC until it was mentioned above. The device I plan to use is old enough that automatic GC really isn't implemented well, but TRIM is, and I don't know if FreeNAS supports TRIM (haven't looked into it yet). The throughput should still be fast enough not to be a factor in maximizing my single Gb connection. I/O is not terribly fast compared to my other two SSDs, but again it's much faster than my hard drives could ever dream of being.

So once I've played with a VM for a few more days and can verify to myself that I can insert and remove the ZIL and L2ARC without risk of data loss/pool corruption, then I'll run the test on my FreeNAS computer.
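On ZFS v28 both devices can be detached again without harming the pool; assuming the same partition layout as above, that would be something like:

zpool remove pool da5p2
zpool remove pool da5p1

where da5p2 is the cache device and da5p1 the log device.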

Thanks for all the comments.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
FreeBSD 8 ZFS does not support TRIM. We may see it in FreeBSD 9.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
FreeBSD 8 ZFS does not support TRIM. We may see it in FreeBSD 9.
Thanks for saving me the work of looking this up.
 

Shastada

Cadet
Joined
Sep 14, 2012
Messages
7
Interesting thread. Right now I'm looking into whether it's possible to take a single SSD, say 250GB, and create two partitions on it: 8GB for a ZIL and the remainder for an L2ARC. Based on this thread I see no reason to use an entire SSD for a ZIL. I'm only doing this to play around and see what kind of performance gains I could obtain, if any, for my use. So I'm in search of the commands to add the cache and ZIL manually, since the GUI only accepts the entire drive. Time to continue my search.

EDIT: Found the commands I needed with an easy Google search (the pool is called pool, the SSD is da5):

zpool add pool log da5p1
zpool add pool cache da5p2

I'm interested in doing this, have you tried it out? Can you let us know how it went?

My use case is this: I have two 3TB drives I want to have cached by a 256GB SSD. This will be my datastore for many VMs running over a 10G connection to several hosts (ESXi 5.1). So I figure the 256GB SSD should cover most of my active VMs and hopefully allow most of them to sit in L2ARC, but I would like a little space for a ZIL without having to purchase a second SSD. I'm not worried about the lifespan of the SSD; if it dies in 1-2 years, SSDs will likely be that much bigger and cheaper by then.

Currently running on the same FreeNAS (8.3) box, however, are 5x2TB drives holding all my CIFS shares, used for backups and video streams. So I'm concerned that adding the datastore to the FreeNAS box will slow both down. I'm hoping the caching of the datastore will keep it performing well.

My system is a 2-core AMD with 16GB RAM and 5x2TB drives on onboard SATA II ports. I'm planning on adding a PCIe x1 card with 4 SATA II ports to add the 2x3TB drives and the 1x256GB SSD.
 

mattlach

Patron
Joined
Oct 14, 2012
Messages
280
Well, for what it's worth I decided against adding either an L2ARC or ZIL to my FreeNAS server.

With six 2TB WD Greens in RAIDz2 and 22GB of RAM, I find access to the system is always instantaneous, and sustained transfer speeds are bottlenecked by my gigabit ethernet, so I have no reason to improve performance on it.

That being said, I'm just using it as a straight-up file server, with no VMware images or anything like that running off of it.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If you need a reliable ZIL, you probably want a dedicated SSD for that. The characteristics of a good L2ARC are a bit different from those of a ZIL. For L2ARC, you want a big, inexpensive SSD to hold as much as possible; the closer it is to your working set size, the faster your VMs. For ZIL, though, you really want something with a supercapacitor (or, like the Intel 320, an array of large capacitors) so that in the event of a power failure, the writes your VMs have committed towards disk actually get a chance to be written somewhere.

We just covered some of this in another recent thread about ZIL; the short form is that if you don't care that much about your data, then it may be easier to just set sync=disabled and not worry about it. This opens up a slightly larger window for potential data loss in the event of a crash or power loss. Mixing ZIL and L2ARC on the same SSD device is generally seen as very bad across a wide range of ZFS experiences.
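If you do go the sync=disabled route, it's a per-dataset (or per-zvol) ZFS property. Something like the following, where pool/vmds is only an example dataset name; sync=standard puts it back to the default behavior:

zfs set sync=disabled pool/vmds
zfs set sync=standard pool/vmds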
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I'm interested in doing this, have you tried it out? Can you let us know how it went?

My use case is this: I have two 3TB drives I want to have cached by a 256GB SSD. This will be my datastore for many VMs running over a 10G connection to several hosts (ESXi 5.1). So I figure the 256GB SSD should cover most of my active VMs and hopefully allow most of them to sit in L2ARC, but I would like a little space for a ZIL without having to purchase a second SSD. I'm not worried about the lifespan of the SSD; if it dies in 1-2 years, SSDs will likely be that much bigger and cheaper by then.

Currently running on the same FreeNAS (8.3) box, however, are 5x2TB drives holding all my CIFS shares, used for backups and video streams. So I'm concerned that adding the datastore to the FreeNAS box will slow both down. I'm hoping the caching of the datastore will keep it performing well.

My system is a 2-core AMD with 16GB RAM and 5x2TB drives on onboard SATA II ports. I'm planning on adding a PCIe x1 card with 4 SATA II ports to add the 2x3TB drives and the 1x256GB SSD.
I did do it and it's still in my system for now, but there were no benefits to my system for what I use it for. I posted lengthy test results for this experiment. Here is the link: http://forums.freenas.org/showthread.php?10325-Intel-NIC-vs-RealTek-NIC-Performance-Testing. It was more a test of NIC cards, but you will see at the end that I did include a ZIL and then an L2ARC.

A ZIL and L2ARC can give great performance gains, but I gather they are really best suited to high I/O usage.
 

Shastada

Cadet
Joined
Sep 14, 2012
Messages
7
If you need a reliable ZIL, you probably want a dedicated SSD for that. The characteristics for a good L2ARC are a bit different than ZIL. L2ARC, you want big, inexpensive SSD to hold as much as possible; the closer to your working set size, the faster your VM's. For ZIL, though, you really want something with a supercapacitor (or like the Intel 320, an array of large capacitors) so that in the event of a power fail, the writes your VM's have committed towards disk actually get a chance to be written somewhere.

We just covered some of this in another recent thread about ZIL; the short form is that if you don't care that much about your data, then it may be easier to just set sync=disabled and not worry about it. This opens up a slightly larger window for potential data loss in the event of a crash or power loss. Mixing ZIL and L2ARC is generally seen as very bad across a wide range of ZFS experiences.

Since I don't have an SSD that fits that bill, would I be better off using the SSD as L2ARC only and skipping the ZIL until I can get hold of a good Intel SLC SSD? I have a 40GB Intel MLC SSD (X25-V) I could use; would that be sufficient?
 

Shastada

Cadet
Joined
Sep 14, 2012
Messages
7
I ended up building a second FreeNAS box (i3 w/16GB RAM) with 2x3TB drives in a ZFS mirror, with the Crucial M4 256GB as L2ARC and an Intel 320 160GB as ZIL; this box has two 10G cards to connect to the ESXi boxes. I should have everything up and running in the next couple of days and I'll try to benchmark the VMs to see how it looks.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, it's far more complicated than that. The reason why there is so much ambiguity around the ZIL and L2ARC is that they aren't dumb caches like on Windows.

For example, the ZIL appears to cache only writes smaller than 64kbytes. So if you copy a 10GB file to the server the ZIL will not be used at all. But if you copy a 32kbyte PDF that might go into the ZIL. But in the big picture, if a 32kbyte PDF took 2 seconds instead of 1 second, would you really care? The ZIL seems to be nothing but a RAID-1 of the RAM write cache. So data isn't written to the ZIL and then dumped from RAM; it will also be stored in RAM until written. So basically anything bigger than about a couple of GB is just absurdly huge and a waste of space. Add in the fact that ZFS commits all writes in roughly 6-second increments, and with Gb LAN (about 125MB/sec) you are talking a MAX of around 750MB for a ZIL. Anything bigger than that cannot ever be used at all unless you are doing stuff locally from the command line. But remember that you'd need ~750MB of writes that are all smaller than 64kbytes (ROFL.. not likely). Also, you are talking about "saving" the time it takes for a file to be written to the ZIL versus written to the zpool. Unless the zpool is already doing tons of other reads/writes, you save absolutely nothing.

As for the L2ARC, it isn't overly useful unless you are reading the same data over and over and it remains in the L2ARC. But there are several obstacles to overcome to get that info into the L2ARC. The system doesn't just fill the L2ARC as fast as it can; it has throttled rates for how fast data can be put into the L2ARC. Because it must also anticipate that it may suddenly need to read data from the L2ARC. So you are limited to certain write speeds to the L2ARC (we're talking a few MB/sec, if I remember correctly). ZFS doesn't do much 'read ahead', so even if you start watching a streaming movie, don't expect the movie to be dumped into the L2ARC and the drives to spin down from being idle. Also, the more L2ARC space you have, the more ARC space is dedicated to keeping tabs on the data in the L2ARC, and stealing from your L1ARC can hurt system performance too. The rule of thumb I keep seeing is don't use an L2ARC that is greater than 10x your total RAM, so for 16GB you shouldn't build an L2ARC any bigger than 160GB. As the ARC fills up with entries that aren't helping the current load, the ARC begins to churn faster, which hurts system performance. Each L2ARC entry uses about 200 bytes of ARC.
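To put a rough number on that last point (the average record size here is only an assumption for illustration): a 160GB L2ARC full of 128KB records is about 1.3 million entries, and at ~200 bytes each that is roughly 260MB of ARC spent just tracking the L2ARC. Smaller average record sizes drive that overhead up quickly.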

At the end of the day, your system workload AND the type of workload determine 99.9999% of the total benefit you will see. I read somewhere last year that unless you are in an environment where your zpool is basically a giant transactional database, you are wasting your money on a ZIL or L2ARC at all.

Despite all of that stuff that is probably making you go cross-eyed: if your zpool isn't actually busy, it could have just read the data from the pool and taken the seek penalty of about 5-15ms. Big freakin' whoop. I'm pretty sure that when you double-click on that PDF, movie, spreadsheet, etc. you have approximately zero chance of noticing the 5-15ms you potentially could have saved.

Everything I've just discussed is why you don't see people saying "ZOMG buy a ZIL and see a 50% increase". They instead say "I saw a 50% increase". The reason: their hardware, ZFS configuration, and loading cannot be easily extrapolated to your situation. There are commands you can run after everything is built, if you are unhappy with the performance, to see if a ZIL or L2ARC would help, but I won't go into those.

There's a thread somewhere here where one of our forum regulars did some research on it. The performance increases were far less than amazing. They pretty much confirmed what a lot of other people have posted here complaining about: they buy the ZIL and/or L2ARC and see only a modest performance improvement.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
For example, the ZIL appears to cache only writes smaller than 64kbytes. So if you copy a 10GB file to the server the ZIL will not be used at all. But if you copy a 32kbyte PDF that might go into the ZIL.

Okay, sorry, I've read past this three or four times trying to convince myself that this isn't confusing the issue. You go on to a better explanation, but we should probably not use the word "cache" in conjunction with "ZIL", but rather "log". And I'm not sure where the 64KB idea comes from. And if you copy a 10GB file, the ZIL will probably still be used, but probably only for metadata updates, which are small and relatively trivial; but if you ask for that copy as a sync write, then the ZIL is definitely involved.

Key point: The ZIL is NOT IN THE DATA PATH. In general, nothing ever gets read out of the ZIL during normal operation. Only written.

POSIX provides ways for an application (or anything else) to guarantee that data is committed to stable storage. This is called a synchronous write. What it means is that if I call the system's write call with the sync flag set, and the power fails at any point after the write call returns, even just a microsecond later, that data is supposed to be guaranteed to be retrievable when the system comes back up.

The problem is, disk is super slow, and if you're asking for a lot of stuff to be written sync, performance tanks. Some things - particularly ESXi - have no clue what they're reading or writing because the I/O is caused by VMs, so they generally ask for EVERYTHING to be written sync.

So ZFS honors sync requests. But it does it cleverly. The POSIX mandate is basically that sync-written data not be lost. Without a dedicated ZIL, ZFS has a small part of your pool set aside as the ZIL. It puts the data there as fast as it can (and then basically ignores it), at which point it can return success to the calling process. But the data is also queued up to be written to the actual pool. Now, since it has already fulfilled its requirement to commit the data to stable storage, ZFS is free to let that pool write happen "a bit later". But the in-pool ZIL write still incurs a penalty; moving the ZIL to a dedicated device fixes that.
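For reference, adding a dedicated log device uses the same zpool syntax shown earlier in this thread; mirrored, as recommended above, it would be something like the following, where the device names are only an example:

zpool add pool log mirror da5 da6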

Now, in the event of a crash or unclean reboot, the pool import is when the ZIL is actually read. ZFS has to make sure that the data that was promised to be written to the pool actually gets written to the pool. So the ZIL is rewound, read back, and then ZFS makes sure that those changes are reflected in the pool.

With that having been said...

Because it must also anticipate that it may suddenly need to read data from the L2ARC.

I think you meant "you must also anticipate" because I don't see any logic to do that in ZFS.

So you are limited to certain write speeds to the L2ARC (we're talking a few MB/sec, if I remember correctly). ZFS doesn't do much 'read ahead', so even if you start watching a streaming movie, don't expect the movie to be dumped into the L2ARC and the drives to spin down from being idle.

This is controlled by several variables.

vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 1
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_write_max: 8388608

norw: if this is set to 1, it suppresses reads from the L2ARC device if it is being written to.

noprefetch: if this is set to 1, it suppresses L2ARC caching of prefetch buffers.

headroom: the number of buffers worth of headroom the L2ARC tries to maintain. If the ARC is under pressure and there's insufficient headroom, the L2ARC may not get some stuff that it would have been good to get.

The rest of this is complicated and works together.

write_max is the maximum size of an L2ARC write. Typically this happens every feed_secs seconds. Do NOT set write_max to a very large number without understanding all of the rest of this.

When the L2ARC is cold and no reads have yet happened, write_max is augmented by write_boost. The theory is that if nothing's being read, it's not disruptive to write at a higher rate.

If feed_again is set to 1, ZFS may actually write to the L2ARC as frequently as every feed_min_ms milliseconds; for the default value of 200, that means up to 5x per second.

So now, as an administrator, you have to use your head and figure this all out. Here's the thing: the 8MB write_max is very conservative, but you can't just say "oh yeah, my SSD can write at 475MB/sec! I'll set it to THAT!" An L2ARC is only useful if it's offloading a lot of read activity from your main pool, so an easy call is that it makes no sense to use more than half its bandwidth for writing. Further, ZFS already allows the write speed to be bumped up automatically while the L2ARC is cold, through the write_boost mechanism. Also, the feed_again mechanism allows multiple feeds per second if there is sufficient demand, so with 200ms you only need one fifth of your per-second target per feed. So you can safely set write_max to 1/2 of 1/5 of what your SSD can write and still have it all work very well; for a 475MB/sec SSD, that works out to 47.5MB/sec. Probably best to pick a power of two, though, so pick 32MB or 64MB. More does NOT make sense.
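As a concrete example only (32MB expressed in bytes, roughly matching the SSD discussed above), the knobs can be changed with sysctl, assuming they are writable at runtime on your build; the same values can be added as sysctls in the FreeNAS GUI so they persist across reboots:

sysctl vfs.zfs.l2arc_write_max=33554432
sysctl vfs.zfs.l2arc_write_boost=33554432

Substitute 67108864 if you want 64MB instead.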
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I like reading these explanations; they really make it sink in what the ZIL and L2ARC are useful for and how to adjust/tune them.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
And I'm not sure where the 64KB idea comes from.
zfs_log.c - zfs_immediate_write_sz, which is/was a tunable under Solaris. I halfway remember reading that it was 1MB at one point, but I could be imagining that. I know it used to be 64KB, then was reduced to 32KB, which, I believe, is where it is currently.
 