Some insights into SLOG/ZIL with ZFS on FreeNAS

James Snell

Explorer
Joined
Jul 25, 2013
Messages
50
Technically, you did. But it won't be the first time someone has made that exact mistake. And I promise you it won't be the last; surely within a week someone else won't want to read and search, and will post this exact same question all over again.

I hear the pain in your words, Cyberjock. There are deep wounds there. The thing is, I did want to read, and I did. I was not confident my grasp was on par. What would be nice is if I could now remove my comments in some way so as not to distract from the original post. It sort of touches on a design flaw in discussion forums.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I think it's better to leave them there. Hopefully someone else will learn from your mistake. Isn't that what the forums are all about?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I read the whole thing carefully several times and then later caught myself calling my SLOG a ZIL, just like you said is commonly done on these forums.

No, the problem is the other way around. You always have a ZIL. But people often refer to their SLOG devices as a ZIL. So then they get confused when you talk about the ZIL, because they think "but I don't have one of those." (They do, it's just part of the pool and therefore slow.) Or you discuss modifying the behaviour of the ZIL, and they go "but I don't have one." etc. A SLOG device *is* "a ZIL" - just a (hopefully) much faster one.
 

James Snell

Explorer
Joined
Jul 25, 2013
Messages
50
No, the problem is the other way around. You always have a ZIL. But people often refer to their SLOG devices as a ZIL. So then they get confused when you talk about the ZIL, because they think "but I don't have one of those." (They do, it's just part of the pool and therefore slow.) Or you discuss modifying the behaviour of the ZIL, and they go "but I don't have one." etc. A SLOG device *is* "a ZIL" - just a (hopefully) much faster one.

Lol, human communication is so funny... I'll just call my external ZIL an SLOG rather than a ZIL, which I've previously called it.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
Unfortunately, the hypervisor is not in a good position to judge the relative importance of a given write request, so the hypervisor simply marks all of the writes to be written sync. This is conservative and safe to do ... and it is dangerous to second-guess this, at least if you care about the integrity of your VM.

I'm trying to understand how this is true. If the hypervisor is emulating a SATA/SAS controller (or implementing its own PV device), surely it is receiving the commands from the guest OS asking for synchronous writes (and the guest OS is the ONLY entity that is in a position to dictate which writes are important), so these ought to be easily translated to the appropriate NFS/iSCSI sync write commands. From looking at Wireshark traces of ESXi <-> NFS/iSCSI commands, it seems like ESXi is just being lazy and either setting everything to fsync (NFS), or nothing (iSCSI). Am I missing something here? This seems to be exclusively the fault of ESXi.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I think that the whole "sync write this" and "don't sync write that" is at a higher level on the food chain, thus the SATA/SAS controller only knows to "write this data" to the disk and not to the write cache.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
I think that the whole "sync write this" and "don't sync write that" is at a higher level on the food chain, thus the SATA/SAS controller only knows to "write this data" to the disk and not to the write cache.

I don't see how that could be possible; it is the device interface responsible for writing to disk. Do you have an example of what the higher food chain device terminating the sync commands would be?
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
I think that the whole "sync write this" and "don't sync write that" is at a higher level on the food chain, thus the SATA/SAS controller only knows to "write this data" to the disk and not to the write cache.

And the team that did the NFS code went the safe route and does sync on every write. Which is smart, because you can't assume things like UPS backup and monitoring. The iSCSI team decided to go for performance instead of max safety, or they assumed UPS backup and monitoring. To some degree folks deploying iSCSI for ESXi are probably more likely to have that sort of stuff in place vs. folks who buy a little Netgear 2 bay NAS box and share it to their ESXi server. But even with UPS backups one still needs to weigh performance vs. safety of the data before doing things like sync=disabled or sync=always.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
What I mean is that when writes are being assembled for write to disk, it comes to a point where the OS determines if it should cache the write in RAM or go to disk. If it's a sync write it goes to disk (obviously) and waits for the confirmation that the write is complete. If it's a non-sync write it goes to the write cache in RAM. At some later point when the write cache in RAM flushes to disk you'll get another write (along with the system waiting for the disk response that the disk write is complete). Then the data in the write cache in RAM is cleared.

The disk just knows that it's got to write some data to certain blocks on the disk and executes that. The disk has no way to ascertain if a given write is a sync write or not. To a disk a write is a write. Period. The actual OS calls make the choice to buffer the write to RAM or to immediately write it to the disk.
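As a rough illustration of the distinction being described (not code from any poster), here is a minimal Python sketch of the two paths as an application sees them on a POSIX-style OS. The function names `buffered_write` and `sync_write` are made up for this example; `os.write` and `os.fsync` are the standard library calls.

```python
import os
import tempfile


def buffered_write(path, data):
    """Non-sync write: returns once the OS has accepted the data.

    The data may still be sitting in the OS write cache in RAM.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        return os.write(fd, data)
    finally:
        os.close(fd)


def sync_write(path, data):
    """Sync write: does not return until the OS reports the data as
    flushed to the storage device (via fsync)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = os.write(fd, data)
        os.fsync(fd)  # block until the OS has pushed the data to the device
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        buffered_write(os.path.join(tmp, "async.bin"), b"may still be in RAM")
        sync_write(os.path.join(tmp, "sync.bin"), b"acknowledged as stable")
```

Both calls hand the same bytes to the OS; the only difference is whether the caller waits for the flush before carrying on.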

ESXi blurs the lines because it gets a disk write and all it knows is "I got data to write to disk to these blocks". Since that's all it knows, and ESXi errs on the side of caution, it makes every single write a sync write.

Let's assume for a minute that I'm completely out to lunch and everything the forums know about the very detailed technical issues is totally incorrect. We do know for certain that the issue is with sync writes, without requiring much information or many assumptions to validate that. It's very well documented for other systems that aren't FreeNAS, and the penalty is staggering. As much as the NFS penalty kills performance, I'd have to assume that if it were something they could fix (for example, similar to what you are thinking in your post with the SATA/SAS controller), they'd have fixed it long ago. Bugs that have that kind of a performance penalty would be fixed due to so many complaints. Considering that the VMware community has lots of comments and discussions on how to help get around the issue by improving sync write performance (or just ignoring sync writes altogether), it seems pretty obvious that the solution isn't simple. Yes?
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
I don't see how that could be possible; it is the device interface responsible for writing to disk. Do you have an example of what the higher food chain device terminating the sync commands would be?

That would be the OS kernel or file system; it can cache the data (async) and issue the write command when it makes sense, vs. write it now (sync).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm trying to find proof that istgt even has sync writes in any form but all I'm seeing is comments about asynchronous writes. I'm trying to find something that explicitly mentions that sync writes do not exist for istgt. Google don't fail me now!

Edit: Just to clarify I'm doing this because if you read up on how you make iSCSI have sync writes you actually make the dataset or pool have exclusively sync writes by setting sync=always. That's not an iSCSI setting, which tends to support all the other information that iSCSI just doesn't have a "sync write" in any form... or at least not in the version that's in FreeNAS.
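For reference, a small sketch of what setting that property looks like when driven from Python. The `sync` dataset property and its three values (`standard`, `always`, `disabled`) are real ZFS settings; the helper names and the dataset name `tank/vmstore` are hypothetical, chosen only for illustration.

```python
import subprocess

# The three values ZFS accepts for the "sync" dataset property.
VALID_SYNC_MODES = ("standard", "always", "disabled")


def zfs_set_sync_cmd(dataset, mode):
    """Build the argv for `zfs set sync=<mode> <dataset>` without running it."""
    if mode not in VALID_SYNC_MODES:
        raise ValueError("sync must be one of %s" % (VALID_SYNC_MODES,))
    return ["zfs", "set", "sync=%s" % mode, dataset]


def apply_sync_mode(dataset, mode):
    """Run the command; needs the zfs CLI and sufficient privileges."""
    subprocess.run(zfs_set_sync_cmd(dataset, mode), check=True)


if __name__ == "__main__":
    # "tank/vmstore" is a made-up dataset name for this example.
    print(" ".join(zfs_set_sync_cmd("tank/vmstore", "always")))
```

Note that this acts on the dataset (or zvol) backing the iSCSI extent, not on the iSCSI target itself, which is the point being made above.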
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'm trying to understand how this is true. If the hypervisor is emulating a SATA/SAS controller (or implementing its own PV device), surely it is receiving the commands from the guest OS asking for synchronous writes (and the guest OS is the ONLY entity that is in a position to dictate which writes are important), so these ought to be easily translated to the appropriate NFS/iSCSI sync write commands. From looking at Wireshark traces of ESXi <-> NFS/iSCSI commands, it seems like ESXi is just being lazy and either setting everything to fsync (NFS), or nothing (iSCSI). Am I missing something here? This seems to be exclusively the fault of ESXi.

In a paravirtualized environment, this could potentially be made to happen, given sufficient effort and support up into the guest device driver and filesystem framework.

In a hardware virtualized environment, it won't fly. You are actually emulating hardware. You don't have any way to reliably determine the reason that an operating system has issued a particular set of operations in a given order. So a client throws a million SCSI WRITE operations and then a SYNCHRONIZE_CACHE. What portion of the million blocks were intended to be a sync write, and which of them were merely filesystem writes being pushed out? Without an understanding of intention by the writer, it's all voodoo.

As a result, ESXi tags all writes as sync. It "doesn't" for iSCSI only because SCSI doesn't work that way. You'd have to issue a SCSI SYNCHRONIZE_CACHE after each iSCSI write, which would just ruin many large SAN arrays which would actually honor the request. Administrators are expected to be intelligent enough to use iSCSI that is handled properly.

So your understanding that the guest OS "is the ONLY entity that is in a position to dictate which writes are important" is correct and the key point. There's just nothing to signal that knowledge down to the hardware layer in a useful manner.
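The "million WRITEs, then one SYNCHRONIZE_CACHE" scenario can be pictured with a toy model. This is a deliberately simplified sketch, not how any real disk or hypervisor is implemented; the `ToyDisk` class and its methods are hypothetical names for illustration.

```python
class ToyDisk:
    """Toy model of a disk with a volatile write cache."""

    def __init__(self):
        self.cache = {}  # LBA -> data, lost on power failure
        self.media = {}  # LBA -> data, survives power failure

    def write(self, lba, data):
        # A plain WRITE carries no hint about why the guest issued it.
        self.cache[lba] = data

    def synchronize_cache(self):
        # The flush applies to everything pending; it cannot say which
        # of the earlier writes the guest actually cared about.
        flushed = len(self.cache)
        self.media.update(self.cache)
        self.cache.clear()
        return flushed


if __name__ == "__main__":
    disk = ToyDisk()
    for lba in range(1000):
        disk.write(lba, b"x")
    print(disk.synchronize_cache(), "blocks flushed, intent unknown")
```

From the device side, all pending blocks look the same at flush time, which is the ambiguity described above.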
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
I'm trying to find proof that istgt even has sync writes in any form but all I'm seeing is comments about asynchronous writes. I'm trying to find something that explicitly mentions that sync writes do not exist for istgt. Google don't fail me now!

I found it first ;) The answer is the Force Unit Access (FUA) bit in a SCSI write command. FUA tells the target to immediately send the data to the media surface and to not buffer it through a cache. And it appears that istgt doesn't do anything with that bit, based on other things I turned up. Looks like FreeBSD is aiming to replace istgt with a new in-kernel version for FreeBSD 10. https://wiki.freebsd.org/Native%20iSCSI%20target

Edit: The bigger question for ESXi is whether VMware is using the FUA bit or not. It won't do any good to have an iSCSI target that supports it if ESXi doesn't set the bit.
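To make the FUA bit concrete, here is a minimal sketch that builds a SCSI WRITE(10) command descriptor block and checks the bit, assuming the standard 10-byte layout (opcode 0x2A, FUA as bit 3 of byte 1). The helper names are hypothetical; the same check can be applied by eye to the CDB bytes shown in a packet capture.

```python
import struct

WRITE_10 = 0x2A  # WRITE(10) operation code
FUA_BIT = 0x08   # bit 3 of CDB byte 1


def build_write10_cdb(lba, num_blocks, fua=False):
    """Build a 10-byte WRITE(10) CDB, optionally with FUA set."""
    if not 0 <= lba <= 0xFFFFFFFF:
        raise ValueError("LBA must fit in 32 bits for WRITE(10)")
    if not 0 <= num_blocks <= 0xFFFF:
        raise ValueError("transfer length must fit in 16 bits")
    flags = FUA_BIT if fua else 0
    # opcode, flags, LBA (big-endian u32), group number,
    # transfer length (big-endian u16), control byte
    return struct.pack(">BBIBHB", WRITE_10, flags, lba, 0, num_blocks, 0)


def has_fua(cdb):
    """Return True if this is a WRITE(10) CDB with the FUA bit set."""
    return len(cdb) == 10 and cdb[0] == WRITE_10 and bool(cdb[1] & FUA_BIT)


if __name__ == "__main__":
    cdb = build_write10_cdb(lba=0x1000, num_blocks=8, fua=True)
    print(cdb.hex(), has_fua(cdb))
```

A target that ignores byte 1 entirely would treat the FUA and non-FUA variants identically, which is the behaviour attributed to istgt above.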
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I found it first ;) The answer is the Force Unit Access (FUA) bit in a SCSI write command. FUA tells the target to immediately send the data to the media surface and to not buffer it through a cache. And it appears that istgt doesn't do anything with that bit, based on other things I turned up. Looks like FreeBSD is aiming to replace istgt with a new in-kernel version for FreeBSD 10. https://wiki.freebsd.org/Native%20iSCSI%20target

Edit: The bigger question for ESXi is whether VMware is using the FUA bit or not. It won't do any good to have an iSCSI target that supports it if ESXi doesn't set the bit.

Ooo. Good job! I figured it was out there. I did know that FreeBSD 10 changes the landscape for iSCSI (and therefore FreeNAS) but I lost the link to back that up.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Handling FUA is complicated, and isn't certain to get you the right behaviour. Some filesystems such as NTFS will only toggle FUA when a device advertises that there's a write cache enabled. It isn't clear to me how a VMware virtual disk appears on an NTFS system; on a FreeBSD system, "camcontrol identify daXX" would give info about the disk (including write cache), but on a VMware disk it won't give anything.

It is even less clear what'll happen under Windows and a virtual IDE disk; vSphere 4 at least defaults to creating IDE disks. Bleh!

So ESXi is just doing a conservative thing, which coincidentally helps encourage people to buy expensive storage systems that can quickly commit sync writes. Hah.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
I'm trying to find proof that istgt even has sync writes in any form but all I'm seeing is comments about asynchronous writes. I'm trying to find something that explicitly mentions that sync writes do not exist for istgt. Google don't fail me now!

Edit: Just to clarify I'm doing this because if you read up on how you make iSCSI have sync writes you actually make the dataset or pool have exclusively sync writes by setting sync=always. That's not an iSCSI setting, which tends to support all the other information that iSCSI just doesn't have a "sync write" in any form... or at least not in the version that's in FreeNAS.

I found it first ;) The answer is the Force Unit Access (FUA) bit in a SCSI write command. FUA tells the target to immediately send the data to the media surface and to not buffer it through a cache. And it appears that istgt doesn't do anything with that bit, based on other things I turned up. Looks like FreeBSD is aiming to replace istgt with a new in-kernel version for FreeBSD 10. https://wiki.freebsd.org/Native%20iSCSI%20target

Edit: The bigger question for ESXi is whether VMware is using the FUA bit or not. It won't do any good to have an iSCSI target that supports it if ESXi doesn't set the bit.

It's not. I've looked at the Wireshark traces, which is why I think ESXi is the bad actor here. It isn't just the FUA bit either; you could use WRITE AND VERIFY SCSI commands, or SYNCHRONIZE CACHE commands, neither of which are used.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
What I mean is that when writes are being assembled for write to disk, it comes to a point where the OS determines if it should cache the write in RAM or go to disk. If it's a sync write it goes to disk (obviously) and waits for the confirmation that the write is complete. If it's a non-sync write it goes to the write cache in RAM. At some later point when the write cache in RAM flushes to disk you'll get another write (along with the system waiting for the disk response that the disk write is complete). Then the data in the write cache in RAM is cleared.

The disk just knows that it's got to write some data to certain blocks on the disk and executes that. The disk has no way to ascertain if a given write is a sync write or not. To a disk a write is a write. Period. The actual OS calls make the choice to buffer the write to RAM or to immediately write it to the disk.

ESXi blurs the lines because it gets a disk write and all it knows is "I got data to write to disk to these blocks". Since that's all it knows, and ESXi errs on the side of caution, it makes every single write a sync write.

No, this is incorrect. The filesystem gets a sync write request from the application and passes it to the block device driver in the OS, which is either a real driver or an emulated one. This call contains flags from the file system indicating whether the write needs to be synchronous or not. The driver then issues the appropriate writes to the actual physical disk, PASSING these flags as needed (SCSI disks use FUA as you saw, plus the other options I mentioned; SATA uses different ones), because in the event of an fsync write it must wait for the disk to fully acknowledge the data is on the platter before returning. In a virtualized system the hypervisor is either emulating a SCSI/Parallel/SATA controller card for the guest OS or backing a paravirtualized driver for efficiency. In either case it is most definitely receiving the correct flags.

For Linux, the controllers being emulated/PV'd are block device drivers, here is the header file:
https://git.kernel.org/cgit/linux/k...e/include/linux/blkdev.h?id=refs/tags/v3.11.7

You can even look in Linux's PV API for block device drivers and see a mention of the cache flush command and write barriers:
https://git.kernel.org/cgit/linux/k.../uapi/linux/virtio_blk.h?id=refs/tags/v3.11.7
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
In a paravirtualized environment, this could potentially be made to happen, given sufficient effort and support up into the guest device driver and filesystem framework.

In a hardware virtualized environment, it won't fly. You are actually emulating hardware. You don't have any way to reliably determine the reason that an operating system has issued a particular set of operations in a given order. So a client throws a million SCSI WRITE operations and then a SYNCHRONIZE_CACHE. What portion of the million blocks were intended to be a sync write, and which of them were merely filesystem writes being pushed out? Without an understanding of intention by the writer, it's all voodoo.

You realize this is nonsense; if the driver issued a SYNCHRONIZE_CACHE API call to the emulated hardware, then the hypervisor should dutifully pass this along and follow the SCSI semantics for flushing the cache. Otherwise it is incorrectly emulating the guarantees.

As a result, ESXi tags all writes as sync. It "doesn't" for iSCSI only because SCSI doesn't work that way. You'd have to issue a SCSI SYNCHRONIZE_CACHE after each iSCSI write, which would just ruin many large SAN arrays which would actually honor the request. Administrators are expected to be intelligent enough to use iSCSI that is handled properly.

This also makes no sense. Let's for a minute assume that ESXi is emulating a SCSI card and also mounting the disk image under the covers using iSCSI. In this case there is quite literally a 1:1 relationship between the commands that the guest OS's driver is issuing to the emulated SCSI card and the commands that should then be passed to the underlying iSCSI connection. Now granted, this assumes this VM is the only one on this iSCSI mount, otherwise SYNCHRONIZE_CACHE commands will have detrimental performance effects for other VMs. But in practice I would suspect that the FUA bit is used way more often than SYNCHRONIZE_CACHE, which then should not cause an issue.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yes, and there's good reason for each choice made. Look, VMware's stuff has to actually WORK. ESXi is not a "bad actor" for having made pragmatic choices about being paranoid with VM data. They've made the safest reasonable choices that can be generally implemented across a variety of hardware - specifically including NON-SCSI hardware. VMware sits in between a VM that might-or-might-not have virtual hardware that vaguely resembles SCSI or might implement something like IDE, and then data storage through a variety of technologies including FC, iSCSI, NFS, SAS, and others. It has to all WORK. This is the sucky real world. Nothing prohibits an admin who dislikes VMware's pragmatic and conservative choices from overriding them. But I think we can at least respect VMware for trying to make sure that the storage system does the right thing.
 