What is the ZIL?
POSIX provides a facility for the system or an application to make sure that data requested to be written is actually committed to stable storage: a synchronous write request. Upon completion of a sync write request, the underlying filesystem is supposed to guarantee that a subsequent read will return that data, regardless of system crashes, reboots, power loss, etc. Sync writes have numerous uses in computing, from the trivial to the mandatory.
Some RAID controllers contain a battery-backed write cache and put all writes through it in order to provide this sort of functionality. This works, but it is somewhat limiting in the way a system can be designed. A ZFS system has a potentially much larger write cache, but it is in system RAM, so it is volatile. The ZIL provides a mechanism to record sync writes in nonvolatile storage. However, it is not a write cache! It is an intent log. The ZIL is not in the data path; it sits alongside the data path. ZFS can safely commit sync writes to the ZIL while simultaneously treating the data as async and aggregating the writes in the normal ZFS transaction group write process. If a crash, reboot, power loss, or other catastrophic event occurs before the transaction group is committed to the pool, the ZIL allows ZFS to read back the intent log, reconstruct what was supposed to happen, and commit that to the pool. This is the only important time that the ZIL is read by ZFS, and when that happens, the data is used to update the pool as appropriate. Under normal usage, data is written to the ZIL and then discarded as soon as the associated transaction group is safely committed.
What is SLOG?
The hardware RAID controller with battery backed write cache has a very fast place to store writes: its onboard cache, and usually all writes (not just sync writes) get pushed through the write cache. A ZFS system isn't guaranteed to have such a place, so by default ZFS stores the ZIL in the only persistent storage available to it: the ZFS pool. However, the pool is often busy and may be limited by seek speeds, so a ZIL stored in the pool is typically pretty slow. ZFS can honor the sync write request, but it'll get slower and slower as the pool activity increases.
The solution to this is to move the ZIL writes to something faster. This is called creating a Separate intent LOG, or SLOG. This is often incorrectly referred to as a "ZIL", especially here on the FreeNAS forums. Without a dedicated SLOG device, the ZIL still exists; it is just stored on the ZFS pool vdevs. ZIL traffic on the pool vdevs substantially hurts pool performance, so if a large volume of sync writes is expected, a separate SLOG device of some sort is the way to go. ZIL traffic on the pool must also be written using whatever data protection strategy exists for the pool, so an in-pool ZIL write to an 11-drive RAIDZ3 will be hitting multiple disks for every block written.
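As a quick sketch of the mechanics (the pool and device names here are purely examples), a SLOG is attached and detached with ordinary zpool commands:

# add a single SLOG device to the pool "tank"
zpool add tank log gpt/slog0
# or add a mirrored SLOG if you don't want to depend on a single log device
zpool add tank log mirror gpt/slog0 gpt/slog1
# the log vdev appears under its own "logs" heading
zpool status tank
# sync write traffic then shows up against the log vdev rather than the data vdevs
zpool iostat -v tank 1
# log vdevs can also be removed later if needed
zpool remove tank gpt/slog0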
What is a good choice for a SLOG device?
A SLOG device must be able to ensure that data requested to be written is actually stored, even across a power loss, a guarantee that many SSDs do not provide. Supercapacitor-based SSDs are often selected as the way to guarantee that data can be written. This is not something that should be considered optional. If you're not going to get a suitable device for SLOG, then you are putting your pool at risk, and you might as well just turn on sync=disabled and be done with it. Lightning speed without all the pesky ZIL concerns. Of course, I already said that you are putting your pool at risk! You don't want to do that if you value your data and your pool...
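For reference (the dataset names are hypothetical), sync behavior is controlled per dataset or zvol with the sync property:

# see how a dataset currently treats sync write requests: standard, always, or disabled
zfs get sync tank/vms
# honor client sync requests (the default)
zfs set sync=standard tank/vms
# the "lightning speed, pesky ZIL be damned" setting; recent writes can be lost on a crash
zfs set sync=disabled tank/scratch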
Since a SLOG device could conceivably be writing massive amounts of data, endurance is an important consideration. If your pool is constantly writing large amounts of sync data, please remember that all of that will be written to SLOG. This is of particular concern for certain applications where all (or nearly all) pool writes are made in sync mode. If you have such an application, please pay particular attention to the endurance descriptions below.
The other thing a SLOG should have is low latency. When a sync write request comes in, everything comes to a halt. ZFS has to put the write request out to the SLOG, then wait for confirmation. For a SATA SSD on an HBA port, this involves invoking a device driver, which talks over PCIe to the HBA, which serializes the request onto a SATA cable, which is deserialized by the SSD's controller, which then has to Do Something, and then responds back over the SATA cable, up through the HBA, and back through the device driver, at which point ZFS is finally aware that the data has been committed to stable storage and can process the next block. By way of comparison, an NVMe SSD eliminates a bunch of those steps and reduces latency. Device driver -> PCIe -> SSD controller is a much shorter path.
A SLOG device could be as simple as a separate hard disk drive. This has the benefits of not needing to seek constantly, since the SLOG writes will generally be sequential, but still involves rotational latency, so a hard drive based SLOG may not be all that fast. However, it has incredible endurance.
An MLC-based SSD such as the Intel 320 is a good low-cost choice. It doesn't have a supercapacitor, but it does have an array of capacitors that performs just as well (bonus: without the slight risk of explosion). The 40GB version of the 320 appears to be good for at least 20MB/sec worth of ZIL writes. The downside is that MLC flash has limited endurance. Buying a larger MLC flash device may yield more speed, as larger devices typically have higher write rates due to the additional banks of flash. MLC devices should be underprovisioned in order to allow their internal wear leveling and page allocation functions to operate as efficiently as possible.
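One common way to underprovision on FreeBSD/FreeNAS (the device name, label, and size below are only illustrative) is to hand ZFS a small partition and leave the rest of the flash untouched, ideally after a secure erase so the controller knows the remaining space is free:

# partition a fresh SSD and expose only a small slice of it as the SLOG
gpart create -s gpt ada4
gpart add -t freebsd-zfs -a 1m -s 8g -l slog0 ada4
zpool add tank log gpt/slog0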
An SLC-based SSD has much greater write endurance than MLC, but it is also usually much more expensive. These are a great choice for heavy, write-intensive environments. Again, a supercapacitor or equivalent is required.
Several SATA-based storage devices involving DDR memory and a battery backup are available, many of which have very good performance and endurance characteristics. It is, however, getting more difficult to find 5.25" bays to stick these devices into on the large storage servers that might benefit from them.
An interesting but unorthodox alternative for SLOG is to use a RAID controller with battery backed write cache, along with conventional hard disks. Normally RAID controllers are frowned upon with ZFS, but here is an opportunity to take advantage of the capabilities: Since the cache absorbs sync writes and writes them as the disk allows, rotational latency becomes a nonissue, and you gain a SLOG device that can operate at the speed the drives are capable of writing at. In the case of a LSI 2208 with 1GB cache, and a pair of slowish ~50MB/sec 2.5" hard drives, it was interesting to note that a burst of ZIL writes could be absorbed by the cache at lightning speed, and then ZIL writes would slow down to the 50MB/sec that the drives were capable of sustaining. With the nearly unlimited endurance of battery-backed RAM and conventional hard drives, this is a very promising technique.
About the best SAS-based SSD you can get for SLOG use is probably the STEC ZeusRAM. However, it is still at the end of an SAS channel, and so there is some latency involved. I can't tell you if this is a good idea or not, as I haven't been able to justify the cash to try one.
About the best all-around SSD you can get for SLOG is the Intel DC P3700, an NVMe based SSD with great endurance characteristics. A consumer-grade version, the Intel 750, retains the power loss protection and general speed and low latency while offering reduced endurance.
All of these devices have some amount of latency involved, however, so typically sync writes will be somewhat slower than normal writes.
Why sync writes matter, especially on a fileserver
In the old days, when the VAX would crash and take your changes with it (because you hadn't yet saved to disk), you knew what work you had lost.
To a large extent, people I've talked to feel that this is still the case for a fileserver. Either the file is there or it is lost. In many cases, this is even true.
However, these days, we are using storage for much more complicated purposes. In the best cases, sync writes may already happen by default. In other cases, sync writes must be explicitly enabled.
Consider a VMware virtual disk device. Here is a massive disk file with lots of content. It will probably exist for the lifetime of the associated VM. So one day in the VM's life, it is writing some minor update to the "hard disk". ESXi transmits that to the NAS, asking it to commit it to the "hard disk" (vmdk file). The NAS responds that it has been written, but then crashes before it actually commits the write to the actual disk. The NAS reboots. Now the running VM has data that it thinks has been written to the "hard disk" but that is no longer actually represented in the data served by the NAS. That might merely mean a corrupted file on the VM's disk, but the lost write could also have been metadata crucial to the integrity of the guest filesystem. Failing to handle such a write in sync mode translates directly into danger for the VM.
Unfortunately, the hypervisor is not in a good position to judge the relative importance of a given write request, so the hypervisor simply marks all of the writes to be written sync. This is conservative and safe to do ... and it is dangerous to second-guess this, at least if you care about the integrity of your VM.
So the "good" news is that ESXi issues all NFS requests in sync mode, but the bad news is that ZFS performance handling large numbers of sync requests is poor without a SLOG device. This leaves a lot of ESXi users wanting to "fix" the "awful" performance of their FreeNAS ZFS. The choices basically come down to disabling the ZIL or disabling sync somehow, or providing a proper SLOG. The former options are dangerous to the integrity of the VM, and the latter is typically expense (for the SLOG) and learning (what you're reading right now, etc).
However, the same problem exists with iSCSI - and by default, iSCSI writes are async. The same risks to VM integrity exist. So for an iSCSI setup with VMs, setting "sync=always" is the correct way to go.
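Concretely (the zvol name is hypothetical), that is just:

# force every write to the zvol backing the iSCSI extent to be treated as sync
zfs set sync=always tank/vmware-zvol
zfs get sync tank/vmware-zvol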
The ZFS write cache. (No, it's not the ZIL.)
ZFS has a clever write caching system that aggregates writes into a "transaction group". It accumulates a transaction group in your system's main memory until either a predefined size is reached or a given amount of time has passed. In FreeNAS, the default size is 1/8th of your system's memory, and the default time is 5 seconds. Given that the hardware recommendation for ZFS is a minimum of 8GB of RAM, this means that the ZFS write cache can easily be 1GB, even on the smallest system!
Every time the transaction group fills, or whenever the transaction group time limit expires, the transaction group begins flushing out to disk, and another transaction group begins building up. If the new transaction group fills before the previous one finishes writing, ZFS pauses I/O to allow the pool to catch up.
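You can actually watch this happen: under a steady stream of async writes, zpool iostat shows the pool sitting mostly quiet and then bursting as each transaction group flushes (the pool name and one-second interval are just an example):

# watch the pool's write column spike roughly every five seconds under async load
zpool iostat tank 1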
Sync writes involve immediately writing the update to the ZIL (whether in-pool or SLOG), and then proceeding as though the write was async, inserting the write into the transaction group along with all the other writes. Since the act of writing a transaction group to disk makes the written data persistent, the ZIL is only protecting the data that is in the current transaction group and any transaction group that is currently flushing to disk. Any ZIL data for already-committed transaction groups is effectively stale and not necessary.
This gives us some guidance as to what the size of a SLOG device should be, to avoid performance penalties: it must be large enough to hold a minimum of two transaction groups.
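As a back-of-the-envelope example, assuming the network link is what limits how much sync write data can arrive during a transaction group:

gigabit ethernet:   ~115 MB/sec x 5 sec/txg x 2 txgs =~ 1.2 GB
ten gigabit:       ~1150 MB/sec x 5 sec/txg x 2 txgs =~ 11.5 GB

So a SLOG of a few GB is plenty on gigabit, while 10GbE workloads want something more like 8-16GB plus headroom.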
Laaaaaaaaatency. Low is better.
The SLOG is all about latency. The "sync" part of "sync write" means that the client is requesting that the current data block be confirmed as written to disk before the write() system call returns to the client. Without sync writes, a client is free to just stack up a bunch of write requests, send them over a slowish channel, and let them arrive when they can. Look at the layers:
Client initiates a write syscall
Client filesystem processes request
Filesystem hands this off to the network stack as NFS or iSCSI
Network stack hands this packet off to network silicon
Silicon transmits to switch
Switch transmits to NAS network silicon
NAS network silicon throws an interrupt
NAS network stack processes packet
Kernel identifies this as an NFS or iSCSI request and passes it to the appropriate kernel thread
Kernel thread passes request off to ZFS
ZFS sees "sync request", sees an available SLOG device
ZFS pushes the request to the SAS device driver
Device driver pushes to LSI SAS silicon
LSI SAS chipset serializes the request and passes it over the SAS topology
SAS or SATA SSD deserializes the request
SSD controller processes the request and queues for commit to flash
SSD controller confirms request
SSD serializes the response and passes it back over the SAS topology
LSI SAS chipset receives the response and throws an interrupt
SAS device driver gets the acknowledgment and passes it up to ZFS
ZFS passes acknowledgement back to kernel NFS/iSCSI thread
NFS/iSCSI thread generates an acknowledgement packet and passes it to the network silicon
NAS network silicon transmits to switch
Switch transmits to client network silicon
Client network silicon throws an interrupt
Client network stack receives acknowledgement packet and hands it off to filesystem
Filesystem says "yay, finally, what took you so long" and releases the syscall, allowing the client program to move on.
That's what happens for EACH sync write request. So on a NAS there's not a hell of a lot you can do to make this all better. However, you CAN do things like substituting low-latency NVMe in place of SAS, and upgrading from gigabit to ten gigabit ethernet.
Tangent: Dangers of overly large ZFS write cache.
FreeBSD's ZFS implementation defaults to allowing up to 1/8th of your system's memory for a transaction group. As explained above, "if the new transaction group fills before the previous one finishes writing, ZFS pauses I/O". This is a critical factor to consider.
If you are expecting your system to remain responsive under heavy load, it is important to consider the pool's ability to sustain a given I/O load. A few notes:
1) Heavy use of in-pool ZIL has a substantial negative effect on the pool's overall IOPS capability. Hopefully you've already decided SLOG is the way to go, but if not, be aware that you can actually cause everything to crawl at a snail's pace through poor design...
2) A small pool on a system with a lot of memory, such as one where a designer has included lots of ARC for maximum responsiveness, can counter-intuitively perform very poorly due to the increased default size for transaction groups. For example, a system with four disks, each capable of writing at 150MB/sec, can sustain about 600MB/sec to the pool. That still doesn't fit well with 32GB of RAM, because the default allows up to 4GB per txg, which is more than the roughly 3GB the pool can flush during the 5-second txg interval.
As a result, tuning the size of a transaction group to be appropriate to a pool is advised, and since that maximum size is directly related to SLOG sizing, it is all tied together.
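The exact tunables depend on your ZFS version, so treat the names below as a sketch to verify against your own system; on reasonably recent FreeBSD/FreeNAS the relevant knobs are the txg timeout and the dirty data (write cache) limit:

# how often a txg is forced out, in seconds (default 5)
sysctl vfs.zfs.txg.timeout
# upper bound on dirty (uncommitted) write data on newer OpenZFS-based releases
sysctl vfs.zfs.dirty_data_max
# older releases used different write-limit tunables; persistent values go in /boot/loader.conf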
Another tangent: ZFS logbias property
Added 11/2014. This is pulled, slightly out of context, from a private conversation.
I don't think logbias is necessarily useful for general purpose use, including iSCSI. It is trying to solve a problem I'm very familiar with, from my USENET software design days, which is essentially how to optimize for massive concurrency.
In the USENET scenario: suppose you have a ton of drives, let's say 24, each with a normal UNIX filesystem (not ZFS) and a heap of articles on it, and a few hundred simultaneous requests coming in to be filled. Each request gets mapped to the drive that holds the answer to the query. That isn't a random mapping, but for most intents and purposes it is essentially random, because what is being requested is hard to predict and articles are spread evenly across the 24 drives making up the spool. Some requests will be filled a little faster and some a little slower, kind of like when you're at the grocery store with 24 checkout lines and you pick a line: maybe yours goes faster or slower. But all requests tend to keep the I/O subsystem very busy fulfilling requests, which is a good thing.
For an Oracle database, the problem can be similar: if you have lots of simultaneous requests, the challenge becomes to fill the requests *ON AVERAGE* as quickly as possible.
My impression is that logbias was designed to prevent an Oracle database from thrashing the hell out of a SLOG device with massively parallel, less important traffic that must still be sync. With logbias=throughput, the data blocks are written directly (and of course more slowly) to the main pool outside of the txg process, and only the associated metadata updates go to the SLOG device. That means less writing to the SLOG, but the data blocks still have to be written to the pool synchronously. From a single request's point of view, this effectively puts you in a similar situation to not having a SLOG device at all. :-( But if you have dozens or hundreds of simultaneous requests, the SLOG device is no longer as much of a bottleneck, because you're not logging *all* writes to it first while it still accelerates things like metadata, so it's an overall win.
But if you've got an application that's only a few threads (or just a single thread), this won't be helping you (much) because the data gets written out to the pool sync. It also seems like you might significantly worsen fragmentation. One of the things I haven't looked at but would worry about is what happens when a VM is writing contiguous data blocks. My concern would be this: if your application writes a 512-byte sector sync, then you need to flush that to stable storage. For the normal SLOG process, that just gets written to the SLOG sync and then thrown into the transaction group for later commit to the pool. With logbias=throughput, that instead gets flushed immediately to the pool. But with a ZFS zvol blocksize of 8K and logbias=throughput, that seems like it might imply that you'd be allocating and writing a new block sixteen times to the pool in order to write the sixteen 512-byte sectors. Normally with logbias=latency that'd be mitigated through the transaction group process, which effectively coalesces the writes. It seems like there must be some sort of mitigation but I can't think of what it would be.
So in the end, my impression winds up being that I think logbias is mostly useful for specialized applications, and that the design intent was to avoid crushing a SLOG device. Since a SLOG device has to service an entire pool, if one portion of the pool is doing massively parallel sync writes and soaking up SLOG bandwidth, then latency of sync writes for other areas in the pool will increase. That's a good target dataset for logbias.
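For completeness (the dataset name is hypothetical), logbias is just another per-dataset property:

# latency (the default, which uses the SLOG for data) or throughput (data goes straight to the pool)
zfs get logbias tank/oracle-data
zfs set logbias=throughput tank/oracle-data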
<not yet complete>