Untangling sync vs async options with NFS and ZFS

SuperWhisk

Dabbler
Joined
Jan 14, 2022
Messages
19
My current understanding is that ZFS's default sync setting of "standard" allows the "client application" to determine if a write should be synchronous or not (and 99% of the time, this is exactly how it should stay). Async writes are cached in memory until a transaction group is filled, or a timeout is reached. Sync writes are immediately written to ZIL (possibly disrupting the writing of a previous transaction group), then added to the transaction group in memory like Async writes.
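For reference, the knob I mean is the per-dataset sync property; roughly like this (pool/dataset name is just a placeholder):

zfs get sync tank/share              # show the current setting
zfs set sync=standard tank/share     # default: honour whatever the client/application requests
zfs set sync=always tank/share       # treat every write as synchronous
zfs set sync=disabled tank/share     # never wait on the ZIL, even if sync is requested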

Then there's NFS (I'm using NFSv4). You can mount NFS on the client machine as either sync or async, and this affects (among other things) whether the client will cache writes locally before sending them to the NFS server in a batch - conceptually similar to ZFS transaction groups, but probably very very different in implementation.
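Concretely, I mean the difference between client mount entries like these (server address and paths are made up for illustration):

# /etc/fstab on the NFS client
192.168.1.10:/mnt/tank/share  /nfs/share  nfs4  rw,async  0 0   # client may cache and batch writes before sending them
192.168.1.10:/mnt/tank/share  /nfs/share  nfs4  rw,sync   0 0   # every write is pushed to the server before the call returns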

The part I am much less sure about:
From what I can gather, it seems that if I use the "sync" mount option on the NFS client, all writes will be synchronous, regardless of what the application running on the client machine wanted. This in turn affects how ZFS handles these writes, forcing it to immediately write everything to ZIL before continuing with "regular" operation.

Assuming all of the above is correct (if not, please help me set the record straight), is there a way to have NFSv4 not cache writes locally on the client, but also not force ZFS to immediately flush all writes to disk?
Basically I want writes to be immediately flushed to the server and acknowledged to the client as complete, but still allow the server to decide when it wants to commit those to disk.

Now you might say "well that seems dangerous, why would you want that? Just do everything sync" and certainly you might have a point, but if I am going to allow asynchronous writing for performance reasons, I'd rather have writes cached and batched in a single location, not two.

As far as data loss concerns go, the solitary NFS client in this case is a VM running on the TrueNAS host that is exporting the NFS share. They are connected over a virtual network bridge. If the server goes down, there won't be a client left running with a different idea of what the data should be anyway. This is also all protected from power loss with a UPS, and there are hourly, daily, and monthly snapshots, with the latter two being replicated to a separate TrueNAS machine daily. In the case of unexpected shutdown or crash, I would likely lose no more than an hour, or up to a day in a really bad situation, which is more than adequate for my needs.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Sync writes are immediately written to ZIL (possibly disrupting the writing of a previous transaction group), then added to the transaction group in memory like Async writes.
close but not quite. the RAM component is basically the same for sync and async; the difference is that for a sync write it ALSO goes to the ZIL/SLOG as soon as possible, independent of waiting for the transaction group. the transaction group is then written as normal, and the ZIL/SLOG data is considered stale once the write was successful.

From what I can gather, it seems that if I use the "sync" mount option on the NFS client, all writes will be synchronous, regardless of what the application running on the client machine wanted. This in turn affects how ZFS handles these writes, forcing it to immediately write everything to ZIL before continuing with "regular" operation.
yup. you can also set sync at the dataset level (which is the only way to get sync with SMB)
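e.g. (dataset name is just an example):

zfs set sync=always tank/smbshare
zfs get sync tank/smbshare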

as to your question, I would ask "well that seems dangerous, why would you want that? Just do everything sync" :tongue:

As far as data loss concerns go, the solitary NFS client in this case is a VM running on the TrueNAS host that is exporting the NFS share.

what do you mean by this? it sounds like you are sharing via NFS from TrueNAS to a VM (running on TrueNAS?) to reshare... via NFS? this sounds really strange to me.

I would likely lose no more than an hour, or up to a day in a really bad situation, which is more than adequate for my needs.
power loss with a UPS,
one of the problems with the way VMs write is that they do not write in complete parts, so an interrupted write can corrupt the whole VM, not just some of its data; if the lost write was to the kernel, the OS would be unable to boot, and there is no way to tell from the hypervisor. this is why ESXi always does sync writes to NFS datastores.


from what I see here, even if what you are asking is possible, there doesn't appear to be a reason to do so.
 

SuperWhisk

Dabbler
Joined
Jan 14, 2022
Messages
19
close but not quite. the RAM component is basically the same for sync and async; the difference is that for a sync write it ALSO goes to the ZIL/SLOG as soon as possible, independent of waiting for the transaction group. the transaction group is then written as normal, and the ZIL/SLOG data is considered stale once the write was successful.
Makes sense, that does follow from what I read in jgreco's post about ZIL writes being out of band. What I meant by "possibly interrupting" was mostly related to disk I/O capacity: if ZFS is actively writing a transaction group to the pool and a sync write comes along that needs to go into the ZIL on the same pool, the two will be competing for disk I/O. Good to be reminded about the out-of-band nature of the ZIL though.

what do you mean by this? it sounds like you are sharing via NFS from TrueNAS to a VM (running on TrueNAS?) to reshare... via NFS? this sounds really strange to me.
I apologise if I was not clear. It is less strange than it sounds (I think).
I am running TrueNAS SCALE on a single machine. I have a VM (Ubuntu Server) running under TrueNAS with a zvol-based virtual disk connected via virtio, as you normally would. I am also exporting two datasets (from different pools) over NFS to this VM, because zvols aren't as flexible on size as datasets.
I just want to have writes always flushed out of the VM and into Truenas immediately, but I don't mind if they sit in RAM on the Truenas side until the transaction group is ready to be written. Client software in the VM should still be able to explicitly request sync writes as needed though.

I would switch this setup to mounting the datasets through the hypervisor with virtiofs immediately if that became supported in a future update.

one of the problems with the way VMs write is that they do not write in complete parts, so an interrupted write can corrupt the whole VM, not just some of its data; if the lost write was to the kernel, the OS would be unable to boot, and there is no way to tell from the hypervisor. this is why ESXi always does sync writes to NFS datastores.
This is great to keep in the back of my head, but since the VM's virtual disk is a zvol connected directly through the hypervisor, that shouldn't come into play here. The NFS shares are mounted under /nfs, not over system directories.

from what I see here, even if what you are asking is possible, there doesn't appear to be a reason to do so.
This is still possibly valid. Given it's all running on the same machine, and the NFS "network" traffic goes over a virtual network bridge and never leaves the machine, perhaps the double caching isn't such a big concern. The VM is the only client on these NFS shares, so there is no concern about concurrency. The VM will always have a "good enough" view of the real data, and if it all goes up in flames I can roll it all back to a snapshot where it was still working (zvol included).
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
"possibly interrupting" was
definitely interrupting; this is why sync writes are basically unusable on spinners and basically require a SLOG.
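if you do end up needing sync writes, adding a SLOG to an existing pool looks roughly like this (pool and device names are placeholders):

zpool add tank log /dev/disk/by-id/nvme-example-slog
zpool status tank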

since you didn't post your hardware, I don't know if you have SSDs or HDDs. if you have SSDs, this whole thing is pointless; the SSDs will likely write faster than NFS can feed them.
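you could sanity-check that with something like the below (paths are placeholders; note that /dev/zero compresses to nothing if the dataset has compression on, so a file of random data is more representative):

# on truenas, straight to the pool
dd if=/dev/zero of=/mnt/ssdpool/test.bin bs=1M count=4096 conv=fsync status=progress
# inside the VM, over the NFS mount
dd if=/dev/zero of=/nfs/share/test.bin bs=1M count=4096 conv=fsync status=progress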
The VM is the only client
what is it writing? answering that with any accuracy would require knowing the actual workload. a database would be very different from something like a VM doing HandBrake encodes.
 

SuperWhisk

Dabbler
Joined
Jan 14, 2022
Messages
19
since you didn't post your hardware, I don't know if you have SSDs or HDDs. if you have SSDs, this whole thing is pointless; the SSDs will likely write faster than NFS can feed them.
what is it writing? answering that with any accuracy would require knowing the actual workload. a database would be very different from something like a VM doing HandBrake encodes.
I have added detailed hardware to my signature, but in short, I have an SSD pool and an HDD pool. There will be various workloads writing to the SSD pool, including the VM zvol and one of the NFS shares. That NFS share will contain Docker volumes for various containers, some of which are indeed database containers, which will probably be doing plenty of sync writes regardless of my mount options.
The HDD pool will be used for general bulk file storage. Nextcloud is the current plan (Nextcloud app files and database will be backed by the SSD pool).
Other apps I plan to run: Home Assistant, Unifi Controller, and others I can't think of right now. All orchestrated by Portainer.

So I guess you are probably right, this is all premature and very likely pointless optimisation. But it doesn't hurt to know how it all works at a deeper level, so thank you for your time and responses!
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
there is also a post here about NFS.
 