What is the ZIL?
POSIX provides a facility for the system or an application to make sure that data requested to be written is actually committed to stable storage: a synchronous write request. Upon completion of a sync write request, the underlying filesystem is supposed to guarantee that a subsequent read will return that data, regardless of system crashes, reboots, power loss, etc. Sync writes have numerous uses in computing, from the trivial to the mandatory.
Some RAID controllers contain a battery-backed write cache and put all writes through it in order to provide this sort of functionality. This works, but it is somewhat limiting in the way a system can be designed. A ZFS system has a potentially much larger write cache, but it is in system RAM, so it is volatile. The ZIL provides a mechanism to record sync writes in nonvolatile storage. However, it is not a write cache! It is an intent log. The ZIL is not in the data path; it sits alongside the data path. ZFS can safely commit sync writes to the ZIL while simultaneously treating the data as async and aggregating the writes in the normal ZFS transaction group write process. If a crash, reboot, power loss, or other catastrophic event occurs before the transaction group is committed to the pool, the ZIL allows ZFS to read back the intent log, reconstruct what was supposed to happen, and commit that to the pool. This is the only important time that the ZIL is read by ZFS, and when that happens, the data is used to update the pool as appropriate. Under normal usage, data is written to the ZIL and then discarded as soon as the associated transaction group is safely committed.
What is SLOG?
The hardware RAID controller with battery backed write cache has a very fast place to store writes: its onboard cache, and usually all writes (not just sync writes) get pushed through the write cache. A ZFS system isn't guaranteed to have such a place, so by default ZFS stores the ZIL in the only persistent storage available to it: the ZFS pool. However, the pool is often busy and may be limited by seek speeds, so a ZIL stored in the pool is typically pretty slow. ZFS can honor the sync write request, but it'll get slower and slower as the pool activity increases.
The solution to this is to move the ZIL writes to something faster. This is called creating a Separate intent LOG, or SLOG. This is often incorrectly referred to as a "ZIL", especially here on the FreeNAS forums. Without a dedicated SLOG device, the ZIL still exists; it is just stored on the ZFS pool vdevs. ZIL traffic on the pool vdevs substantially hurts pool performance, so if a large volume of sync writes is expected, a separate SLOG device of some sort is the way to go. ZIL traffic on the pool must also be written using whatever data protection strategy exists for the pool, so an in-pool ZIL write to an 11-drive RAIDZ3 will be hitting multiple disks for every block written.
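As a quick sketch of the mechanics (the pool and device names here are purely examples), a SLOG is attached and detached with ordinary zpool commands:

# add a single SLOG device to the pool "tank"
zpool add tank log gpt/slog0
# or add a mirrored SLOG if you don't want to depend on a single log device
zpool add tank log mirror gpt/slog0 gpt/slog1
# the log vdev appears under its own "logs" heading
zpool status tank
# sync write traffic then shows up against the log vdev rather than the data vdevs
zpool iostat -v tank 1
# log vdevs can also be removed later if needed
zpool remove tank gpt/slog0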
What is a good choice for a SLOG device?
A SLOG device must be able to ensure that data requested to be written is actually stored, even across a power loss, a guarantee that many SSDs do not provide. Supercapacitor-based SSDs are often selected as the way to guarantee that data can be written. This is not something that should be considered optional. If you're not going to get a suitable device for SLOG, then you are putting your pool at risk, and you might as well just turn on sync=disabled and be done with it. Lightning speed without all the pesky ZIL concerns. Of course, I already said that you are putting your pool at risk! You don't want to do that if you value your data and your pool...
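For reference (the dataset names are hypothetical), sync behavior is controlled per dataset or zvol with the sync property:

# see how a dataset currently treats sync write requests: standard, always, or disabled
zfs get sync tank/vms
# honor client sync requests (the default)
zfs set sync=standard tank/vms
# the "lightning speed, pesky ZIL be damned" setting; recent writes can be lost on a crash
zfs set sync=disabled tank/scratch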
Since a SLOG device could conceivably be writing massive amounts of data, endurance is an important consideration. If your pool is constantly writing large amounts of sync data, please remember that all of that will be written to SLOG. This is of particular concern for certain applications where all (or nearly all) pool writes are made in sync mode. If you have such an application, please pay particular attention to the endurance descriptions below.
The other thing a SLOG should have is low latency. When a sync write request comes in, everything comes to a halt. ZFS has to put the write request out to the SLOG, then wait for confirmation. For a SATA SSD on an HBA port, this involves invoking a device driver, which talks over PCIe to the HBA, which serializes the request onto a SATA cable, which is deserialized by the SSD's controller, which then has to Do Something, and then responds back over the SATA cable, up through the HBA, and back through the device driver, at which point ZFS is finally aware that the data has been committed to stable storage and can process the next block. By way of comparison, an NVMe SSD eliminates a bunch of those steps and reduces latency. Device driver -> PCIe -> SSD controller is a much shorter path.
A SLOG device could be as simple as a separate hard disk drive. This has the benefits of not needing to seek constantly, since the SLOG writes will generally be sequential, but still involves rotational latency, so a hard drive based SLOG may not be all that fast. However, it has incredible endurance.
An MLC-based SSD such as the Intel 320 is a good low-cost choice. It doesn't have a supercapacitor, but it does have an array of capacitors that performs just as well (bonus: without the slight risk of explosion). The 40GB version of the 320 appears to be good for at least 20MB/sec worth of ZIL writes. The downside is that MLC flash has limited endurance. Buying a larger MLC flash device may yield more speed, as larger devices typically have higher write rates due to the additional banks of flash. MLC devices should be underprovisioned in order to allow their internal wear leveling and page allocation functions to operate as efficiently as possible.
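One common way to underprovision on FreeBSD/FreeNAS (the device name, label, and size below are only illustrative) is to hand ZFS a small partition and leave the rest of the flash untouched, ideally after a secure erase so the controller knows the remaining space is free:

# partition a fresh SSD and expose only a small slice of it as the SLOG
gpart create -s gpt ada4
gpart add -t freebsd-zfs -a 1m -s 8g -l slog0 ada4
zpool add tank log gpt/slog0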
An SLC-based SSD has much greater write endurance than MLC, but it is also usually much more expensive. These are a great choice for heavy, write-intensive environments. Again, a supercapacitor or equivalent is required.
Several SATA-based storage devices involving DDR memory and a battery backup are available, many of which have very good performance and endurance characteristics. It is, however, getting more difficult to find 5.25" bays to stick these devices into on the large storage servers that might benefit from them.
An interesting but unorthodox alternative for SLOG is to use a RAID controller with battery backed write cache, along with conventional hard disks. Normally RAID controllers are frowned upon with ZFS, but here is an opportunity to take advantage of the capabilities: Since the cache absorbs sync writes and writes them as the disk allows, rotational latency becomes a nonissue, and you gain a SLOG device that can operate at the speed the drives are capable of writing at. In the case of a LSI 2208 with 1GB cache, and a pair of slowish ~50MB/sec 2.5" hard drives, it was interesting to note that a burst of ZIL writes could be absorbed by the cache at lightning speed, and then ZIL writes would slow down to the 50MB/sec that the drives were capable of sustaining. With the nearly unlimited endurance of battery-backed RAM and conventional hard drives, this is a very promising technique.
About the best SAS-based SSD you can get for SLOG use is probably the STEC ZeusRAM. However, it is still at the end of an SAS channel, and so there is some latency involved. I can't tell you if this is a good idea or not, as I haven't been able to justify the cash to try one.
About the best all-around SSD you can get for SLOG is the Intel DC P3700, an NVMe based SSD with great endurance characteristics. A consumer-grade version, the Intel 750, retains the power loss protection and general speed and low latency while offering reduced endurance.
All of these devices have some amount of latency involved, however, so typically sync writes will be somewhat slower than normal writes.
Why sync writes matter, especially on a fileserver
In the old days, when the VAX would crash and take your changes with it (because you hadn't yet saved to disk), you knew what work you had lost.
To a large extent, people I've talked to feel that this is still the case for a fileserver. Either the file is there or it is lost. In many cases, this is even true.
However, these days, we are using storage for much more complicated purposes. In the best cases, sync writes may already happen by default. In other cases, sync writes must be explicitly enabled.
Consider a VMware virtual disk device. Here is a massive disk file with lots of content. It will probably exist for the lifetime of the associated VM. So one day in the VM's life, it is writing some minor update to the "hard disk". ESXi transmits that to the NAS, asking it to commit it to the "hard disk" (vmdk file). The NAS responds that it has been written, but then crashes before it actually commits the write to the actual disk. The NAS reboots. Now the running VM has data that it thinks has been written to the "hard disk" but that is no longer actually represented in the data served by the NAS. That might merely mean a corrupted file on the VM's disk, but the lost write could also have been metadata crucial to the integrity of the guest filesystem. Failing to handle such a write in sync mode translates directly into danger for the VM.
Unfortunately, the hypervisor is not in a good position to judge the relative importance of a given write request, so the hypervisor simply marks all of the writes to be written sync. This is conservative and safe to do ... and it is dangerous to second-guess this, at least if you care about the integrity of your VM.
So the "good" news is that ESXi issues all NFS requests in sync mode, but the bad news is that ZFS performance handling large numbers of sync requests is poor without a SLOG device. This leaves a lot of ESXi users wanting to "fix" the "awful" performance of their FreeNAS ZFS. The choices basically come down to disabling the ZIL or disabling sync somehow, or providing a proper SLOG. The former options are dangerous to the integrity of the VM, and the latter is typically expense (for the SLOG) and learning (what you're reading right now, etc).
However, the same problem exists with iSCSI - and by default, iSCSI writes are async. The same risks to VM integrity exist. So for an iSCSI setup with VMs, setting "sync=always" is the correct way to go.
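Concretely (the zvol name is hypothetical), that is just:

# force every write to the zvol backing the iSCSI extent to be treated as sync
zfs set sync=always tank/vmware-zvol
zfs get sync tank/vmware-zvol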
The ZFS write cache. (No, it's not the ZIL.)
ZFS has a clever write caching system that aggregates writes into a "transaction group". It accumulates a transaction group in your system's main memory until either a predefined size is reached or a given amount of time has passed. In FreeNAS, the default size is 1/8th of your system's memory, and the default time is 5 seconds. Given that the hardware recommendation for ZFS is a minimum of 8GB of RAM, this means that the ZFS write cache can easily be 1GB, even on the smallest system!
Every time the transaction group fills, or whenever the transaction group time limit expires, the transaction group begins flushing out to disk, and another transaction group begins building up. If the new transaction group fills before the previous one finishes writing, ZFS pauses I/O to allow the pool to catch up.
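You can actually watch this happen: under a steady stream of async writes, zpool iostat shows the pool sitting mostly quiet and then bursting as each transaction group flushes (the pool name and one-second interval are just an example):

# watch the pool's write column spike roughly every five seconds under async load
zpool iostat tank 1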
Sync writes involve immediately writing the update to the ZIL (whether in-pool or SLOG), and then proceeding as though the write was async, inserting the write into the transaction group along with all the other writes. Since the act of writing a transaction group to disk makes the written data persistent, the ZIL is only protecting the data that is in the current transaction group and any transaction group that is currently flushing to disk. Any ZIL data for already-committed transaction groups is effectively stale and not necessary.
This gives us some guidance as to what the size of a SLOG device should be, to avoid performance penalties: it must be large enough to hold a minimum of two transaction groups.
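As a back-of-the-envelope example, assuming the network link is what limits how much sync write data can arrive during a transaction group:

gigabit ethernet:   ~115 MB/sec x 5 sec/txg x 2 txgs =~ 1.2 GB
ten gigabit:       ~1150 MB/sec x 5 sec/txg x 2 txgs =~ 11.5 GB

So a SLOG of a few GB is plenty on gigabit, while 10GbE workloads want something more like 8-16GB plus headroom.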
Laaaaaaaaatency. Low is better.
The SLOG is all about latency. The "sync" part of "sync write" means that the client is requesting that the current data block be confirmed as written to disk before the write() system call returns to the client. Without sync writes, a client is free to just stack up a bunch of write requests, send them over a slowish channel, and let them arrive when they can. Look at the layers:
Client initiates a write syscall
Client filesystem processes request
Filesystem hands this off to the network stack as NFS or iSCSI
Network stack hands this packet off to network silicon
Silicon transmits to switch
Switch transmits to NAS network silicon
NAS network silicon throws an interrupt
NAS network stack processes packet
Kernel identifies this as an NFS or iSCSI request and passes it to the appropriate kernel thread
Kernel thread passes request off to ZFS
ZFS sees "sync request", sees an available SLOG device
ZFS pushes the request to the SAS device driver
Device driver pushes to LSI SAS silicon
LSI SAS chipset serializes the request and passes it over the SAS topology
SAS or SATA SSD deserializes the request
SSD controller processes the request and queues for commit to flash
SSD controller confirms request
SSD serializes the response and passes it back over the SAS topology
LSI SAS chipset receives the response and throws an interrupt
SAS device driver gets the acknowledgment and passes it up to ZFS
ZFS passes acknowledgement back to kernel NFS/iSCSI thread
NFS/iSCSI thread generates an acknowledgement packet and passes it to the network silicon
NAS network silicon transmits to switch
Switch transmits to client network silicon
Client network silicon throws an interrupt
Client network stack receives acknowledgement packet and hands it off to filesystem
Filesystem says "yay, finally, what took you so long" and releases the syscall, allowing the client program to move on.
That's what happens for EACH sync write request. So on a NAS there's not a hell of a lot you can do to make this all better. However, you CAN do things like substituting low-latency NVMe in place of SAS, and upgrading from gigabit to ten gigabit ethernet.
Tangent: Dangers of overly large ZFS write cache.
FreeBSD's ZFS implementation defaults to allowing up to 1/8th of your system's memory for a transaction group. As explained above, "if the new transaction group fills before the previous one finishes writing, ZFS pauses I/O". This is a critical factor to consider.
If you are expecting your system to remain responsive under heavy load, it is important to consider the pool's ability to sustain a given I/O load. A few notes:
1) Heavy use of in-pool ZIL has a substantial negative effect on the pool's overall IOPS capability. Hopefully you've already decided SLOG is the way to go, but if not, be aware that you can actually cause everything to crawl at a snail's pace through poor design...
2) A small pool on a system with a lot of memory, such as one where a designer has included lots of ARC for maximum responsiveness, can counter-intuitively perform very poorly due to the increased default size for transaction groups. For example, a system with four disks, each capable of writing at 150MB/sec, can sustain about 600MB/sec to the pool. That still doesn't fit well with 32GB of RAM, because the default allows up to 4GB per txg, which is more than the roughly 3GB the pool can flush during the 5-second txg interval.
As a result, tuning the size of a transaction group to be appropriate to a pool is advised, and since that maximum size is directly related to SLOG sizing, it is all tied together.
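The exact tunables depend on your ZFS version, so treat the names below as a sketch to verify against your own system; on reasonably recent FreeBSD/FreeNAS the relevant knobs are the txg timeout and the dirty data (write cache) limit:

# how often a txg is forced out, in seconds (default 5)
sysctl vfs.zfs.txg.timeout
# upper bound on dirty (uncommitted) write data on newer OpenZFS-based releases
sysctl vfs.zfs.dirty_data_max
# older releases used different write-limit tunables; persistent values go in /boot/loader.conf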
Another tangent: ZFS logbias property
Added 11/2014. This is pulled, slightly out of context, from a private conversation.
I don't think logbias is necessarily useful for general purpose use, including iSCSI. It is trying to solve a problem I'm very familiar with, from my USENET software design days, which is essentially how to optimize for massive concurrency.
In the USENET scenario: suppose you have a ton of drives, let's say 24, each with a normal UNIX filesystem (not ZFS) and a heap of articles on it, and a few hundred simultaneous requests coming in to be filled. Each request gets mapped to the drive that holds the answer to the query. That isn't a random mapping, but for most intents and purposes it is essentially random, because what is being requested is hard to predict and articles are spread evenly across the 24 drives making up the spool. Some requests will be filled a little faster and some a little slower, kind of like when you're at the grocery store with 24 checkout lines and you pick a line: maybe yours goes faster or slower. But all requests tend to keep the I/O subsystem very busy fulfilling requests, which is a good thing.
For an Oracle database, the problem can be similar: if you have lots of simultaneous requests, the challenge becomes to fill the requests *ON AVERAGE* as quickly as possible.
My impression is that logbias was designed to prevent an Oracle database from thrashing the hell out of a SLOG device with massively parallel, less important traffic that must still be sync. With logbias=throughput, the data blocks are written directly (and of course more slowly) to the main pool outside of the txg process, and only the associated metadata updates go to the SLOG device. That means less writing to the SLOG, but the data blocks still have to be written to the pool synchronously. From a single request's point of view, this effectively puts you in a similar situation to not having a SLOG device at all. :-( But if you have dozens or hundreds of simultaneous requests, the SLOG device is no longer as much of a bottleneck, because you're not logging *all* writes to it first while it still accelerates things like metadata, so it's an overall win.
But if you've got an application that's only a few threads (or just a single thread), this won't be helping you (much) because the data gets written out to the pool sync. It also seems like you might significantly worsen fragmentation. One of the things I haven't looked at but would worry about is what happens when a VM is writing contiguous data blocks. My concern would be this: if your application writes a 512-byte sector sync, then you need to flush that to stable storage. For the normal SLOG process, that just gets written to the SLOG sync and then thrown into the transaction group for later commit to the pool. With logbias=throughput, that instead gets flushed immediately to the pool. But with a ZFS zvol blocksize of 8K and logbias=throughput, that seems like it might imply that you'd be allocating and writing a new block sixteen times to the pool in order to write the sixteen 512-byte sectors. Normally with logbias=latency that'd be mitigated through the transaction group process, which effectively coalesces the writes. It seems like there must be some sort of mitigation but I can't think of what it would be.
So in the end, my impression winds up being that I think logbias is mostly useful for specialized applications, and that the design intent was to avoid crushing a SLOG device. Since a SLOG device has to service an entire pool, if one portion of the pool is doing massively parallel sync writes and soaking up SLOG bandwidth, then latency of sync writes for other areas in the pool will increase. That's a good target dataset for logbias.
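For completeness (the dataset name is hypothetical), logbias is just another per-dataset property:

# latency (the default, which uses the SLOG for data) or throughput (data goes straight to the pool)
zfs get logbias tank/oracle-data
zfs set logbias=throughput tank/oracle-data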
<not yet complete>