Some insights into SLOG/ZIL with ZFS on FreeNAS

James Snell

Explorer
Joined
Jul 25, 2013
Messages
50
Technically, you did. But it won't be the first time someone has made that exact mistake, and I promise you it won't be the last; surely within a week someone else won't want to read and search, and will post this exact same question all over again.

I hear the pain in your words, Cyberjock. There are deep wounds there. The thing is, I did want to read, and I did. I just wasn't confident my grasp was on par. What would be nice is if I could now remove my comments in some way so as not to distract from the original post. It sort of touches on a design flaw in discussion forums.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I think it's better to leave them there. Hopefully someone else will learn from your mistake. Isn't that what the forums are all about?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I read the whole thing carefully several times and then later caught myself calling my SLOG a ZIL, just like you said is commonly done on these forums.

No, the problem is the other way around. You always have a ZIL. But people often refer to their SLOG devices as a ZIL. So then they get confused when you talk about the ZIL, because they think "but I don't have one of those." (They do, it's just part of the pool and therefore slow.) Or you discuss modifying the behaviour of the ZIL, and they go "but I don't have one." etc. A SLOG device *is* "a ZIL" - just a (hopefully) much faster one.
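
(If it helps to see it operationally: a rough sketch, with a hypothetical pool "tank" and device "/dev/ada3", of what "adding a SLOG" actually means. The ZIL exists either way; the "log" vdev just moves it onto a hopefully much faster device.)

    # Sketch only; "tank" and "/dev/ada3" are placeholder names.
    import subprocess

    # Attach a dedicated log (SLOG) vdev to the pool. The intent log that used
    # to live inside the pool now lands on this device instead.
    subprocess.run(["zpool", "add", "tank", "log", "/dev/ada3"], check=True)

    # The pool status now shows a separate "logs" section listing the device.
    subprocess.run(["zpool", "status", "tank"], check=True)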
 

James Snell

Explorer
Joined
Jul 25, 2013
Messages
50
No, the problem is the other way around. You always have a ZIL. But people often refer to their SLOG devices as a ZIL. So then they get confused when you talk about the ZIL, because they think "but I don't have one of those." (They do, it's just part of the pool and therefore slow.) Or you discuss modifying the behaviour of the ZIL, and they go "but I don't have one." etc. A SLOG device *is* "a ZIL" - just a (hopefully) much faster one.

Lol, human communication is so funny... I'll just call my external ZIL a SLOG rather than a ZIL, which is what I've been calling it.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
Unfortunately, the hypervisor is not in a good position to judge the relative importance of a given write request, so the hypervisor simply marks all of the writes to be written sync. This is conservative and safe to do ... and it is dangerous to second-guess this, at least if you care about the integrity of your VM.

I'm trying to understand how this is true. If the hypervisor is emulating a SATA/SAS controller (or implementing its own PV device), surely it is receiving the commands from the guest OS asking for synchronous writes (the guest OS being the ONLY entity that is in a position to dictate which writes are important), so these ought to be easily translated to the appropriate NFS/iSCSI sync write commands. From looking at Wireshark traces of ESXi <-> NFS/iSCSI commands, it seems like ESXi is just being lazy and either setting everything to fsync (NFS) or nothing (iSCSI). Am I missing something here? This seems to be exclusively the fault of ESXi.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I think that the whole "sync write this" and "don't sync write that" business happens at a higher level on the food chain, so the SATA/SAS controller only knows to "write this data" to the disk and not to the write cache.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
I think that the whole "sync write this" and "don't sync write that" business happens at a higher level on the food chain, so the SATA/SAS controller only knows to "write this data" to the disk and not to the write cache.

I don't see how that could be possible; it is the device interface that is responsible for writing to disk. Do you have an example of what the higher-food-chain device terminating the sync commands would be?
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
I think that the whole "sync write this" and "don't sync write that" business happens at a higher level on the food chain, so the SATA/SAS controller only knows to "write this data" to the disk and not to the write cache.

And the team that did the NFS code went the safe route and does a sync on every write, which is smart because you can't assume things like UPS backup and monitoring. The iSCSI team decided to go for performance instead of maximum safety, or they assumed UPS backup and monitoring. To some degree, folks deploying iSCSI for ESXi are probably more likely to have that sort of stuff in place vs. folks who buy a little Netgear two-bay NAS box and share it to their ESXi server. But even with UPS backup, one still needs to weigh performance vs. safety of the data before doing things like sync=disabled or sync=always.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
What I mean is that when writes are being assembled to go to disk, there comes a point where the system determines whether it should cache the write in RAM or send it to disk. If it's a sync write it goes to disk (obviously) and the system waits for confirmation that the write is complete. If it's a non-sync write it goes to the write cache in RAM. At some later point, when the write cache in RAM flushes to disk, you'll get another write (with the system waiting for the disk to respond that the write is complete). Then the data in the write cache in RAM is cleared.

The disk just knows that it's got to write some data to certain blocks and executes that. The disk has no way to ascertain whether a given write is a sync write or not. To a disk, a write is a write. Period. The actual OS calls make the choice to buffer the write in RAM or to write it to the disk immediately.

ESXi blurs the lines because it gets a disk write and all it knows is "I've got data to write to these blocks." Since that's all it knows, and ESXi errs on the side of caution, it makes every single write a sync write.

Let's assume for a minute that I'm completely out to lunch and everything the forum knows on the very detailed technical issues is totally incorrect. We do know for a certainty that the issue is with sync writes, without needing much information or many assumptions to validate that. It's also very well documented for systems that aren't FreeNAS, and the penalty is staggering. As much as the NFS penalty kills performance, if it were something they could fix (for example, similar to what you are thinking in your post with the SATA/SAS controller), I'd think they'd have fixed it long ago. Bugs with that kind of performance penalty get fixed because of the sheer number of complaints. Considering that the VMware community has lots of comments and discussions on how to get around the issue by improving sync write performance (or just ignoring sync writes altogether), it seems pretty obvious that the solution isn't simple. Yes?
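
To make the sync-vs-async distinction concrete, here's a minimal userland sketch (file names are made up) of the two paths described above: a buffered write that can sit in the OS write cache, and a sync write that can't return until the data is on stable storage.

    # Sketch only; file names are placeholders. POSIX systems.
    import os

    # Async path: write() returns once the data is in the kernel's write cache;
    # it reaches the disk whenever the kernel later flushes it.
    fd = os.open("async_example.dat", os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b"buffered data")
    os.close(fd)          # the data may still only be in RAM at this point

    # Sync path: O_SYNC makes write() block until stable storage has the data.
    fd = os.open("sync_example.dat", os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
    os.write(fd, b"committed data")
    os.close(fd)

    # Equivalent: a buffered write followed by an explicit fsync() to force it out.
    fd = os.open("fsync_example.dat", os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b"flushed data")
    os.fsync(fd)          # blocks until the kernel has pushed it to the device
    os.close(fd)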
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
I don't see how that could be possible; it is the device interface that is responsible for writing to disk. Do you have an example of what the higher-food-chain device terminating the sync commands would be?

That would be the OS kernel or file system: it can cache the data (async) and issue the write command when it makes sense, vs. writing it now (sync).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm trying to find proof that istgt even has sync writes in any form, but all I'm seeing is comments about asynchronous writes. I'm trying to find something that explicitly mentions that sync writes do not exist for istgt. Google, don't fail me now!

Edit: Just to clarify, I'm doing this because if you read up on how to make iSCSI do sync writes, the answer is actually to make the dataset or pool do exclusively sync writes by setting sync=always. That's not an iSCSI setting, which tends to support all the other information suggesting that iSCSI just doesn't have a "sync write" in any form... or at least not in the version that's in FreeNAS.
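
For reference, that workaround is just a ZFS property on the dataset or zvol backing the extent; a sketch with a hypothetical zvol name ("tank/istgt-extent"):

    # Sketch only; "tank/istgt-extent" is a placeholder. Since istgt gives no
    # per-write sync control, force it at the ZFS layer instead.
    import subprocess

    subprocess.run(["zfs", "set", "sync=always", "tank/istgt-extent"], check=True)
    subprocess.run(["zfs", "get", "sync", "tank/istgt-extent"], check=True)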
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I'm trying to understand how this is true. If the hypervisor is emulating a SATA/SAS controller (or implementing its own PV device), surely it is receiving the commands from the guest OS asking for synchronous writes (the guest OS being the ONLY entity that is in a position to dictate which writes are important), so these ought to be easily translated to the appropriate NFS/iSCSI sync write commands. From looking at Wireshark traces of ESXi <-> NFS/iSCSI commands, it seems like ESXi is just being lazy and either setting everything to fsync (NFS) or nothing (iSCSI). Am I missing something here? This seems to be exclusively the fault of ESXi.

In a paravirtualized environment, this could potentially be made to happen, given sufficient effort and support up into the guest device driver and filesystem framework.

In a hardware virtualized environment, it won't fly. You are actually emulating hardware. You don't have any way to reliably determine the reason that an operating system has issued a particular set of operations in a given order. So a client throws a million SCSI WRITE operations and then a SYNCHRONIZE_CACHE. What portion of the million blocks were intended to be a sync write, and which of them were merely filesystem writes being pushed out? Without an understanding of intention by the writer, it's all voodoo.

As a result, ESXi tags all writes as sync. It "doesn't" for iSCSI only because SCSI doesn't work that way. You'd have to issue a SCSI SYNCHRONIZE_CACHE after each iSCSI write, which would just ruin many large SAN arrays which would actually honor the request. Administrators are expected to be intelligent enough to use iSCSI that is handled properly.

So your understanding that the guest OS "is the ONLY entity that is in a position to dictate which writes are important" is correct and the key point. There's just nothing to signal that knowledge down to the hardware layer in a useful manner.
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
I'm trying to find proof that istgt even has sync writes in any form, but all I'm seeing is comments about asynchronous writes. I'm trying to find something that explicitly mentions that sync writes do not exist for istgt. Google, don't fail me now!

I found it first ;) The answer is the Force Unit Access (FUA) bit in a SCSI write command. FUA tells the target to send the data straight to the media surface and not to buffer it through a cache. And it appears that istgt doesn't do anything with that bit, based on other things I turned up. It looks like FreeBSD is aiming to replace istgt with a new in-kernel target in FreeBSD 10: https://wiki.freebsd.org/Native%20iSCSI%20target

Edit: The bigger question for ESXi is whether VMware is using the FUA bit or not. It won't do any good to have an iSCSI target that supports it if ESXi doesn't set the bit.
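
For the curious, FUA really is just a single bit in the WRITE(10) CDB. An illustrative sketch (it only builds the 10-byte command and never touches a device), assuming the usual SBC layout where byte 1, bit 3 is FUA:

    # Illustration only: construct a SCSI WRITE(10) CDB with the FUA bit set.
    import struct

    def build_write10_cdb(lba, num_blocks, fua=False):
        opcode = 0x2A                    # WRITE(10)
        flags = 0x08 if fua else 0x00    # byte 1, bit 3 = Force Unit Access
        group = 0x00
        control = 0x00
        return struct.pack(">BBIBHB", opcode, flags, lba, group, num_blocks, control)

    cdb = build_write10_cdb(lba=2048, num_blocks=8, fua=True)
    print(cdb.hex())                     # -> "2a080000080000000800"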
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I found it first ;) The answer is the Force Unit Access (FUA) bit in a SCSI write command. FUA tells the target to send the data straight to the media surface and not to buffer it through a cache. And it appears that istgt doesn't do anything with that bit, based on other things I turned up. It looks like FreeBSD is aiming to replace istgt with a new in-kernel target in FreeBSD 10: https://wiki.freebsd.org/Native%20iSCSI%20target

Edit: The bigger question for ESXi is whether VMware is using the FUA bit or not. It won't do any good to have an iSCSI target that supports it if ESXi doesn't set the bit.

Ooo. Good job! I figured it was out there. I did know that FreeBSD 10 changes the landscape for iSCSI (and therefore FreeNAS), but I lost the link to back that up.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Handling FUA is complicated, and isn't certain to get you the right behaviour. Some filesystems, such as NTFS, will only set FUA when a device advertises that a write cache is enabled. It isn't clear to me how a VMware virtual disk appears to an NTFS system; on a FreeBSD system, "camcontrol identify daXX" would give info about the disk (including write cache), but on a VMware virtual disk it won't give anything.

It is even less clear what'll happen under Windows and a virtual IDE disk; vSphere 4 at least defaults to creating IDE disks. Bleh!

So ESXi is just doing a conservative thing, which coincidentally helps encourage people to buy expensive storage systems that can quickly commit sync writes. Hah.
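
If you want to check what a given FreeBSD box sees, something along these lines (FreeBSD only; "da0" is a placeholder, and "camcontrol identify" only works for ATA devices behind CAM) pulls the write-cache line out of the output:

    # FreeBSD-only sketch; "da0" is a placeholder device name.
    import subprocess

    out = subprocess.run(["camcontrol", "identify", "da0"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if "write cache" in line.lower():
            print(line.strip())   # shows whether the drive advertises a write cache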
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
I'm trying to find proof that istgt even has sync writes in any form, but all I'm seeing is comments about asynchronous writes. I'm trying to find something that explicitly mentions that sync writes do not exist for istgt. Google, don't fail me now!

Edit: Just to clarify, I'm doing this because if you read up on how to make iSCSI do sync writes, the answer is actually to make the dataset or pool do exclusively sync writes by setting sync=always. That's not an iSCSI setting, which tends to support all the other information suggesting that iSCSI just doesn't have a "sync write" in any form... or at least not in the version that's in FreeNAS.
I found it first ;) The answer is the Force Unit Access (FUA) bit in a SCSI write command. FUA tells the target to send the data straight to the media surface and not to buffer it through a cache. And it appears that istgt doesn't do anything with that bit, based on other things I turned up. It looks like FreeBSD is aiming to replace istgt with a new in-kernel target in FreeBSD 10: https://wiki.freebsd.org/Native%20iSCSI%20target

Edit: The bigger question for ESXi is whether VMware is using the FUA bit or not. It won't do any good to have an iSCSI target that supports it if ESXi doesn't set the bit.

It's not; I've looked at the Wireshark traces, which is why I think it is the bad actor here. It isn't just the FUA bit either: it could also use WRITE AND VERIFY or SYNCHRONIZE CACHE SCSI commands, neither of which is used.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
What I mean is that when writes are being assembled to go to disk, there comes a point where the system determines whether it should cache the write in RAM or send it to disk. If it's a sync write it goes to disk (obviously) and the system waits for confirmation that the write is complete. If it's a non-sync write it goes to the write cache in RAM. At some later point, when the write cache in RAM flushes to disk, you'll get another write (with the system waiting for the disk to respond that the write is complete). Then the data in the write cache in RAM is cleared.

The disk just knows that it's got to write some data to certain blocks and executes that. The disk has no way to ascertain whether a given write is a sync write or not. To a disk, a write is a write. Period. The actual OS calls make the choice to buffer the write in RAM or to write it to the disk immediately.

ESXi blurs the lines because it gets a disk write and all it knows is "I've got data to write to these blocks." Since that's all it knows, and ESXi errs on the side of caution, it makes every single write a sync write.

No, this is incorrect. The filesystem gets a sync write request from the application and passes it to the block device driver in the OS, which is either a real driver or an emulated one. This call contains flags from the file system indicating whether the write needs to be synchronous or not. The driver then issues the appropriate writes to the actual physical disk, PASSING these flags along as needed (SCSI disks use FUA, as you saw, and the other options I mentioned; SATA uses different ones), because in the event of an fsync write it must wait for the disk to fully acknowledge the data is on the platter before returning. In a virtualized system the guest OS is either talking to an emulated SCSI/parallel/SATA controller card or running a paravirtualized driver for efficiency. In either case the emulated device is most definitely receiving the correct flags.

For Linux, the controllers being emulated/PV'd are block device drivers; here is the header file:
https://git.kernel.org/cgit/linux/k...e/include/linux/blkdev.h?id=refs/tags/v3.11.7

You can even look in Linux's PV API for block device drivers and see a mention of the cache flush command and write barriers:
https://git.kernel.org/cgit/linux/k.../uapi/linux/virtio_blk.h?id=refs/tags/v3.11.7
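
On the guest side you can see what the (possibly emulated) block device ends up advertising; a sketch for a Linux guest on a reasonably recent kernel, with "vda" as a placeholder virtio disk:

    # Linux-guest sketch; "vda" is a placeholder. Newer kernels expose the
    # device's cache mode, which decides whether flush/FUA requests are issued.
    from pathlib import Path

    mode = Path("/sys/block/vda/queue/write_cache").read_text().strip()
    print("vda cache mode:", mode)   # "write back" => flushes/FUA are needed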
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
In a paravirtualized environment, this could potentially be made to happen, given sufficient effort and support up into the guest device driver and filesystem framework.

In a hardware virtualized environment, it won't fly. You are actually emulating hardware. You don't have any way to reliably determine the reason that an operating system has issued a particular set of operations in a given order. So a client throws a million SCSI WRITE operations and then a SYNCHRONIZE_CACHE. What portion of the million blocks were intended to be a sync write, and which of them were merely filesystem writes being pushed out? Without an understanding of intention by the writer, it's all voodoo.

You realize this is nonsense. If the driver issues a SYNCHRONIZE_CACHE command to the emulated hardware, then the hypervisor should dutifully pass it along and follow the SCSI semantics for flushing the cache. Otherwise it is incorrectly emulating the hardware's guarantees.

As a result, ESXi tags all writes as sync. It "doesn't" for iSCSI only because SCSI doesn't work that way. You'd have to issue a SCSI SYNCHRONIZE_CACHE after each iSCSI write, which would just ruin many large SAN arrays which would actually honor the request. Administrators are expected to be intelligent enough to use iSCSI that is handled properly.

This also makes no sense. Let's assume for a minute that ESXi is emulating a SCSI card and also mounting the disk image under the covers using iSCSI. In this case there is quite literally a 1:1 relationship between the commands the guest OS's driver issues to the emulated SCSI card and the commands that should then be passed to the underlying iSCSI connection. Granted, this assumes this VM is the only one on this iSCSI mount; otherwise SYNCHRONIZE_CACHE commands will have detrimental performance effects on other VMs. But in practice I would suspect that the FUA bit is used far more often than SYNCHRONIZE_CACHE, which then should not cause an issue.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Yes, and there's good reason for each choice made. Look, VMware's stuff has to actually WORK. ESXi is not a "bad actor" for having made pragmatic choices about being paranoid with VM data. They've made the safest reasonable choices that can be generally implemented across a variety of hardware - specifically including NON-SCSI hardware. VMware sits in between a VM that might-or-might-not have virtual hardware that vaguely resembles SCSI or might implement something like IDE, and then data storage through a variety of technologies including FC, iSCSI, NFS, SAS, and others. It has to all WORK. This is the sucky real world. Nothing prohibits an admin who dislikes VMware's pragmatic and conservative choices from overriding them. But I think we can at least respect VMware for trying to make sure that the storage system does the right thing.
 