New experimental iSCSI and sync=always


someone1

Dabbler
Joined
Jun 17, 2013
Messages
37
Would the general recommendation of needing sync=always no longer hold now that the experimental iSCSI provider lives in kernel space? It seems this bug fixes a known issue with sync writes over iSCSI, and members of FreeNAS' core team even suggest sync=always isn't necessary.

That is, sync=always was used to remediate the fact that istgt would not appropriately handle sync writes. However, we can now expect the new experimental iSCSI provider to do so. One would assume applications issue sync writes when the data actually needs to reach stable storage (e.g. databases, VMs, etc.) and issue/allow asynchronous writes for the rest. I know assumptions are not a great basis for decisions, but light research has led me to believe that most hypervisors always issue sync writes to the underlying storage layer of their VMs, and so do SQL databases.

In particular, is the statement "iSCSI by default does not implement sync writes" in Option #3 given here still accurate? If not, Option #4 would not be required as using sync=standard should guarantee any sync write requests are properly handled.
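
For reference, here's a rough sketch of how I'd toggle and check the property under test (this assumes the zfs CLI on the FreeNAS box; "tank/iscsi-zvol" is just a placeholder dataset name, not a real recommendation):

```python
# Rough sketch, not a polished tool: assumes the `zfs` CLI is on PATH on the
# FreeNAS/FreeBSD host and that "tank/iscsi-zvol" (a placeholder name) is the
# zvol backing the iSCSI extent.
import subprocess

def get_sync(dataset):
    """Return the current value of the ZFS 'sync' property (standard/always/disabled)."""
    out = subprocess.check_output(["zfs", "get", "-H", "-o", "value", "sync", dataset])
    return out.decode().strip()

def set_sync(dataset, mode):
    """Set the ZFS 'sync' property; only the three documented values are allowed."""
    if mode not in ("standard", "always", "disabled"):
        raise ValueError("mode must be standard, always, or disabled")
    subprocess.check_call(["zfs", "set", "sync=" + mode, dataset])

if __name__ == "__main__":
    zvol = "tank/iscsi-zvol"   # placeholder
    print("current:", get_sync(zvol))
    # set_sync(zvol, "standard")  # uncomment to switch the property under test
```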

I'd like to hear what other people's opinions on the matter are!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
To be honest, I don't want to talk about the experimental iSCSI, as it is still very much experimental and not much about the new kernel iSCSI has been disseminated down to people like me.
 

someone1

Dabbler
Joined
Jun 17, 2013
Messages
37
Well, if it's the same one used in FreeBSD 10, it is new but it has been available for ~6 months.

I'm more concerned about the recommendation of sync=always vs. sync=standard than about the stability of the code. The information will be useful once the experimental status is removed and the bugs have been worked out. I think the goal is to make this the standard/default in a future release, since the code can be readily updated and maintained, unlike the black box that istgt is.

The bug tracker seems to indicate that the faults of istgt are fixed with the experimental iSCSI. If that is true, would the general best practice be updated to remove the recommendation of sync=always?
 

someone1

Dabbler
Joined
Jun 17, 2013
Messages
37
I hate to bring up an old topic, but now that the new iSCSI target is the default option, is there more insight into the question I asked? Doing some tests in my production environment, switching from sync=standard to sync=always results in a ~30-50% decrease in ops reported from zilstat, leading me to believe that the new iSCSI target properly listens for and responds to sync writes.

This is for a zvol used for Hyper-V storage.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
There are more solid answers because there is less speculation, but the answers are the same. sync=always is a good idea if data security is the top priority. Basically, you'd better have an SLOG device that is actually designed to be a good SLOG device, and things will be fine.

I really have no clue how you are coming to the conclusion that the new iSCSI target properly listens for and responds to sync writes, and you provided no data to support the claim. Performance *always* tanks if you do sync=always, and if you don't have a lot of sync writes already, when you set sync=standard you will *always* see performance improve. And when I say always, I mean always. As in, even with NFS, CIFS, or just good ol' dd, your performance will drop pretty badly with sync=always. In fact, unless I'm mistaken, the iSCSI spec actually provides no provision for iSCSI writes to be labeled as sync writes, which would completely debunk your claim, if I am correct.

So I don't know why you are asking, but you seem to have the whole idea of sync writes upside-down in your head. Did you need some coffee before you posted?
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
In fact, unless I'm mistaken, the iSCSI spec actually provides no provision for iscsi writes to be labeled as a sync write

That is not true. SCSI block device specifications provide three ways to control sync writes:
- write caching can be disabled globally on a per-LUN basis via the Caching Mode Page (this is almost the same as sync=always, but controlled by the client);
- each individual write operation may have the FUA (Force Unit Access) bit set so that it is done synchronously (if set for all commands, this is the same as the previous method, but more flexible);
- previously cached writes can be forcibly flushed to the media with the SYNCHRONIZE CACHE command.
The new iSCSI target in FreeNAS supports all three methods.
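
As a rough illustration from the initiator/application side (this is an assumption about how a typical POSIX-ish OS maps things, not something the target controls): an fsync() or an O_DSYNC write on the initiator is what usually ends up on the wire as a SYNCHRONIZE CACHE or a FUA write. A minimal sketch:

```python
# Minimal sketch, assuming a POSIX-ish initiator whose block layer translates
# fsync()/O_DSYNC into SCSI SYNCHRONIZE CACHE and/or FUA writes; whether that
# actually happens depends on the OS, filesystem, and driver stack.
import os

path = "/mnt/iscsi-lun/testfile"  # placeholder path on a filesystem backed by the iSCSI LUN

# Asynchronous write: data may sit in the initiator's caches indefinitely.
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"x" * 4096)

# Explicit flush: typically surfaces as SYNCHRONIZE CACHE at the SCSI layer.
os.fsync(fd)
os.close(fd)

# Synchronous writes: O_DSYNC asks that each write reach stable storage before
# returning, which on many stacks is implemented with the FUA bit (or a flush).
fd = os.open(path, os.O_WRONLY | os.O_DSYNC)
os.write(fd, b"y" * 4096)
os.close(fd)
```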

The other question is whether a specific initiator supports them and whether it does so correctly. AFAIK, according to their docs, up to some point Microsoft supported/used FUA bits in their NTFS implementation, but at some point they considered it too slow and inefficient and switched to the SYNCHRONIZE CACHE approach. Whether any of that is supported by Hyper-V I have no idea -- ask Microsoft.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
In fact, unless I'm mistaken, the iSCSI spec actually provides no provision for iscsi writes to be labeled as a sync write, which would completely debunk your claim, if I am correct.

You're mistaken if we're being pedantic, but correct in practice, and you probably got that from me.

While @mav@ is technically correct about the implementation, the practical reality is that it cannot be done reliably in a generalized manner, which is why ESXi doesn't try. FUA isn't implemented (or isn't implemented correctly) on many arrays. SYNC CACHE is even worse - on some arrays it will cause the array to flush its battery-backed cache to disk, which is a performance disaster. I wanna say Hitachi on that one... anyway, the short form is that there's no generalized strategy that works correctly (or at least harmlessly) out of the box except not to try at all. Things might be better now that iSCSI's a little more mature, but since most of my interest in iSCSI these days is to use FreeNAS as a target, I've mostly not paid much attention to the current state of affairs, since setting sync=always or sync=standard will handle the two use cases we need here.
 

someone1

Dabbler
Joined
Jun 17, 2013
Messages
37
I misspoke in my response: I switched TO sync=standard FROM sync=always and saw a 30-50% decrease in ops reported by zilstat. Apologies for the misleading statement.

My thought is this: with the old iSCSI target in FreeNAS, sync=standard behaved the same as sync=disabled, so I can see why sync=always was recommended. However, with the new iSCSI target, sync=standard does seem to work. In my very non-scientific tests, sync=standard did not drop to 0 ops reported by zilstat like it would have with the old target; instead, it reported a ~30-50% reduction in ops compared to sync=always. Performance here is not my main concern, but it is an added bonus. I was merely asking about the correctness of the new target, and from @mav@'s reply it seems the new target is indeed better behaved. And as @cyberjock stated, sync=always is best used when you don't want to risk losing any information (though I don't see why the reply couldn't have been more courteous and less brazen and provocative). Let's assume that I rely on the software to be written such that if a sync write is required, it is issued, and I don't want to override the decisions of the software developers.

@mav@ brought up a good point: even if the target supports correct sync write handling, who's to say the initiator does? As a software developer myself, I would hope that if a developer wants a sync write, one is issued, and how the underlying systems handle it is of no concern to the developer. That is, so what if the hardware underneath tanks in performance? The software needs to be written for correctness; blame for performance can be placed on the hardware or the hardware vendor's firmware. There's a case to be made that developers do need to worry about the hardware underneath, but that's no excuse not to write for correctness.

Well, hoping for the ideal situation is great and all, but how about some concrete tests? I don't mind whipping up some code to issue sync writes and seeing how the initiators handle them.
  • Test #1: Connect directly to FreeNAS' iSCSI target, run tests with sync writes, and monitor zilstat output. Mix in non-sync writes and compare the output for sync=standard and sync=always (the latter should report more ops in this case) - this should show whether the given initiator passes sync write requests through.
  • Test #2: Run the same tests, but this time have the hypervisor provide a virtual disk to run the tests on - if this test passes, then the hypervisor properly passes sync writes through.
Now I plan on running the tests for my use case (Hyper-V), but I don't mind making the code publicly available for others to improve upon and use to test their own respective systems.

I think the code will be written such that it does the following:
  • Send a few megabytes worth of writes asynchronously
  • Wait 15 seconds
  • Send a few megabytes worth of writes, 50% asynchronously, 50% synchronously
  • Wait 15 seconds
  • Send a few megabytes worth of writes synchronously
With the intention that zilstat will be running as: zilstat 5 -p <pool/dataset to monitor>
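
Something like this, as a very rough sketch (the file path is a placeholder, and it assumes fsync() after a write is what ultimately surfaces as a sync request at the target):

```python
# Rough sketch of the planned test, not a finished tool. Assumes it runs on the
# initiator against a file on the iSCSI-backed disk (the path is a placeholder),
# and that fsync() is what ultimately reaches the target as a sync write.
# Run zilstat on the FreeNAS side while this executes.
import os
import time

PATH = "testfile-on-iscsi-disk"   # placeholder: put this on the LUN under test
CHUNK = 1024 * 1024               # 1 MiB per write
COUNT = 8                         # a few megabytes per phase

def write_phase(sync_ratio):
    """Write COUNT chunks; fsync() after a chunk with roughly sync_ratio frequency."""
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
    for i in range(COUNT):
        os.write(fd, os.urandom(CHUNK))
        if sync_ratio and (i % int(1 / sync_ratio) == 0):
            os.fsync(fd)          # the "synchronous" writes
    os.close(fd)

if __name__ == "__main__":
    write_phase(0.0)              # phase 1: all asynchronous
    time.sleep(15)
    write_phase(0.5)              # phase 2: 50% synchronous
    time.sleep(15)
    write_phase(1.0)              # phase 3: all synchronous
```

If everything passes sync requests through, then under sync=standard phase 1 should show near-zero zilstat ops, phase 3 the most, and sync=always should inflate all three phases roughly equally.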

Does anybody think this approach would be helpful? Is it flawed? I am willing to collaborate to make something useful we can all share/use.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
(though I don't see why the reply couldn't have been more courteous and less brazen and provocative).

New here, huh. :smile:

Let's assume that I rely on the software to be written such that if a sync write is required, it is issued, and I don't want to override the decisions of the software developers.

@mav@ brought up a good point: even if the target supports correct sync write handling, who's to say the initiator does? As a software developer myself, I would hope that if a developer wants a sync write, one is issued, and how the underlying systems handle it is of no concern to the developer. That is, so what if the hardware underneath tanks in performance? The software needs to be written for correctness; blame for performance can be placed on the hardware or the hardware vendor's firmware. There's a case to be made that developers do need to worry about the hardware underneath, but that's no excuse not to write for correctness.

Welcome to the real world, where people have been solving problems pragmatically for a long time. Developers always need to be aware of the realities of the underlying hardware. In the old days, we used to optimize for *bytes*, because the underlying hardware might only have 1KB of RAM. This is now lost on developers, most of whom have never coded in assembly (or, better yet, with a direct machine-level debugger) and many of whom can't identify what sorts of actual instructions and structures are being crapped out by their high-level languages. "Time complexity? What's that?" Big O is not an anime reference!

It is worse with something like iSCSI, where there's enough complexity going on throughout several layers of subsystems that the interactions get extremely hard to understand. Further, a developer often doesn't get to choose the hardware, but is nevertheless tasked with making an application work, and work well.

You're of course welcome to write any tests you feel are appropriate, and, unlike with most products, there are actually people here who are interested in making this work correctly, and in making it work as needed to implement proper data protection, so discussion is available.

So, then, consider: while in theory an ESXi host could propagate both sync and non-sync writes through from a VM with those flags, in practice this doesn't happen, and in order to get the best chance of data protection, data must be committed in the order it is sent by the ESXi initiator, else there's a greater chance of VM scramble. In such a case, it isn't sufficient for FreeNAS to merely implement FUA/SC, because that's not being sent by the initiator. So we can probably agree that's not "correct" FSVO correct, but there you have it, the example we see almost daily here. But not everyone needs that data protection guarantee, either.

So, while you can argue that the software developer should get to determine when a sync write is required, there's an equally compelling argument that the sysadmin should be aware of when this matters and when it doesn't. As a sysadmin, if I've got an application that thinks its bytes are all precious, but I know that the app means nearly nothing to me in the grander scheme of things, I'm probably not going to let that app grind my storage unnecessarily.
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
@someone1, while @jgreco is right that this area of syncing is a total mess and full of bugs/misfeatures (for example, FreeBSD UFS, unlike ZFS, still does not implement cache control), I think it would still be great to know which initiators are at least trying to do things right and which don't even try. For example, if Windows and NTFS themselves do syncing well enough to keep their metadata consistent, that could be enough for a significant number of people, who may not care much about the data of a specific application but would still prefer not to reinstall Windows after every crash. On the other hand, in cases like ESXi, if the initiator does not even try to be correct, setting sync=always may be the only reliable choice, and it would be good to know what to expect in advance.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
On the other hand, in cases like ESXi, if the initiator does not even try to be correct,

Just to be nitpicky, here, I think that's an unfair characterization. The problem is that in the VM environment, it's almost impossible to know what writes need to be sync for a given VM. Perhaps with PV drivers...
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
Just to be nitpicky, here, I think that's an unfair characterization. The problem is that in the VM environment, it's almost impossible to know what writes need to be sync for a given VM. Perhaps with PV drivers...

While the VM is indeed a complication here, I would say that this is in large part a problem of ESXi and its very minimalistic SCSI device emulation. I don't know what Hyper-V does in this area, and I would definitely like to know, but I know that even bhyve in FreeBSD stable/10 does some things better now -- it reports proper device geometry to the guest, it supports ATA TRIM, and it supports ATA FLUSH CACHE...
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
Thinking about it again, NFS synchronization primitives are more optimized for network storage than SCSI's. SCSI synchronization is more optimized for internal disks, which in most cases lose data only on host power-down; in that case proper synchronization allows proper filesystem recovery after a crash. Even a reset or a firmware crash is officially not a good reason for a SCSI disk to lose its write cache. NFS, on the other hand, explicitly defines at the RFC level what a client should do in case of a NAS crash with potential cache loss. I don't know why ESXi running over NFS marks all requests as synchronous and does not use the delayed-commit technique. I agree that a proper implementation of it means additional client complication and memory consumption, but if done properly it might not be so dependent on a fast SLOG. Theoretically nothing prevents iSCSI clients from implementing the same replay techniques, but it is not required, and I have a feeling that nobody really bothers with it.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
My estimation of why VMware does what it does with NFS and iSCSI is this:

1) Do the thing most likely to work without breaking on stupid common implementations,

2) in a data-safe manner,

3) ideally requiring expensive hardware from favored partners in order to perform well.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
While the VM is indeed a complication here, I would say that this is in large part a problem of ESXi and its very minimalistic SCSI device emulation. I don't know what Hyper-V does in this area, and I would definitely like to know, but I know that even bhyve in FreeBSD stable/10 does some things better now -- it reports proper device geometry to the guest, it supports ATA TRIM, and it supports ATA FLUSH CACHE...

Missed that message somehow.

I thought bhyve only supported PV drivers, which seems like it'd make intercepting a lot of stuff like this a hell of a lot easier ... basically not needing to read tea leaves to reverse-engineer whatever things some random hardware driver is toggling at the hardware. Many hardware drivers support multiple similar hardware platforms, and of course each OS can potentially have device drivers authored by different parties. Interception and emulation of modern hardware devices has got to be a bit rough.
 