
Resource Sync writes, or: Why is my ESXi NFS so slow, and why is iSCSI faster?

This post is not specific to ESXi; however, ESXi users typically run into this issue more often because of the way VMware works.

When an application on a client machine wishes to write something to storage, it issues a write request. On a machine with local storage, this involves talking to the hard drive HBA and issuing a command to write a given block to a given spot on the disk. This much should be obvious. However, we are writing to a filesystem, and a filesystem contains both metadata and file data. In many cases, it is very important to the stability of the filesystem that metadata be updated in a careful and thoughtful manner, because if it isn't, and a crash or power loss intervenes, the filesystem can be left unstable, unusable, or even unrecoverable.

A properly executed sync write provides assurance to the storage consumer (in this case the filesystem layer) that the requested operation has been committed to stable storage; upon completion, an attempt to read the block is guaranteed to return the data just written, regardless of any crashes, power failures, or other bad events. This usually means waiting for the drive to be available, then seeking to the sector, then writing the sector... a process that can take a little time. File data blocks are typically much less important, and are usually written asynchronously by most filesystems. This means that the disk controller might have many write requests queued up simultaneously, and the controller and drive can write them out in whatever order is convenient.
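You can feel the difference for yourself on an ordinary system by forcing sync behavior from the command line. A quick illustration, assuming GNU dd on a Linux box, run in a directory that lives on a real disk (the filenames are made up):

  # buffered (async) writes: the kernel caches them and flushes at its leisure
  dd if=/dev/zero of=async-test.dat bs=4k count=25000

  # sync writes: oflag=sync (O_SYNC) forces each block to stable storage
  # before the next one is accepted
  dd if=/dev/zero of=sync-test.dat bs=4k count=25000 oflag=sync

The second run will typically report dramatically lower throughput, for exactly the reasons described above.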

Because users are, for the most part, copying data files around, we're used to seeing throughput measured in MBytes/sec. Sync writes, even on a local system, are usually substantially slower.

So here's the problem. ESXi represents your VM's disk as data in a vmdk file. However, it really has no clue as to what that VM is trying to write to disk, so it treats everything as precious. In other words, VMware writes nearly everything to the datastore as a sync write. This can result in significant slowdowns. On a local host, the typical solution to the problem is to get a RAID controller with write-back cache and a BBU (battery backup unit).

The same thing happens over the network. VMware wants writes to be committed to stable storage, because generally speaking, ESXi has no clue as to the value, importance, or other attributes of the data that your VM is attempting to write.

Now, it is important to recognize here that VMware is actually trying to be careful and cautious with your data, because here's the problem. Let's say your VM writes some filesystem metadata, such as the free block table: it writes an update indicating that a bunch of blocks have been used. ESXi sends that to NAS/SAN storage. Now, let's just say that it isn't instantly committed to stable storage, and let's further pretend that at that exact moment, the NAS crashes and reboots. Upon recovery, you have a functioning NAS again - but the update was lost, so the storage still shows those blocks as free and available, even though the VM has already placed data there. Eventually, the VM may try to write to those blocks again. Catastrophe.

So VMware's solution is to make sure writes are flagged as sync. Basically, they hand the problem over to the storage subsystem, which is, in their defense, a fair thing to do. A lot of NAS devices cope with this by quietly treating those sync writes as async. You can do this too, but then you become susceptible to damage from a crash or power loss... the very thing VMware was trying to protect you from.

So.

ZFS stinks at sync writes. Much of what ZFS does is geared towards moving massive quantities of data around on large systems quickly, with layers of buffering and caching to increase performance. A sync write basically forces ZFS to flush a lot of things before it would prefer to. To partially address this, ZFS has a feature called the ZIL, the ZFS Intent Log, which lets the system log sync writes a little more quickly (the pool is then updated as part of the normal, efficient transaction group write process). That's still not all that fast, so ZFS includes the capability to split the ZIL off onto a separate device, called a SLOG, which can be a fast SSD. With an SSD for SLOG, ZFS can perform fairly well under a sync write load.
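For reference, splitting the ZIL off is a one-liner. A minimal sketch, assuming a pool named tank and spare SSDs at da5 and da6 (all names hypothetical):

  # add a fast SSD as a separate log (SLOG) device to the pool
  zpool add tank log da5

  # better: mirror the SLOG so a single SSD failure can't lose
  # in-flight sync writes
  zpool add tank log mirror da5 da6

  # confirm the log vdev appears
  zpool status tank

Note that the SLOG only needs to absorb a few seconds' worth of writes, so a small, fast SSD (ideally one with power-loss protection) is the usual choice.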

If you care about the integrity of your VM disks, sync writes - and guaranteeing that the data is actually written to stable storage - are both mandatory.

These are the four options you have:

  1. NFS by default will implement sync writes as requested by the ESXi client. By default, FreeNAS will properly store data using sync mode for an ESXi client. That's why it is slow. You can make it faster with an SSD SLOG device (see the zpool example above); how much faster is basically a function of how fast the SLOG device is.

  2. Some people suggest using "sync=disabled" on an NFS share to gain speed (the exact command appears after this list). This causes async writes of your VM data, and yes, it is lightning fast. However, in addition to turning off sync writes for the VM data, it also turns off sync writes for ZFS metadata. This may be hazardous both to your VMs and to the integrity of your pool and ZFS filesystem.

  3. iSCSI by default does not implement sync writes. As such, it often appears to users to be much faster, and therefore a much better choice than NFS. However, your VM data is being written async, which is hazardous to your VMs. On the other hand, the ZFS filesystem and pool metadata are being written synchronously, which is a good thing. That means this is probably the way to go if you refuse to buy an SSD SLOG device and are okay with some risk to your VMs.

  4. iSCSI can be made to implement sync writes: set "sync=always" on the dataset (see the example after this list). Write performance will, of course, be poor without a SLOG device.
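For completeness, the "sync" property mentioned in options 2 and 4 is set per dataset (or per zvol) with the zfs command. A minimal sketch, using hypothetical dataset names tank/nfs-vms and tank/iscsi-vms:

  # default behavior: honor sync requests from the client (option 1's NFS case)
  zfs set sync=standard tank/nfs-vms

  # option 2: force everything async - fast, but hazardous as described
  zfs set sync=disabled tank/nfs-vms

  # option 4: force everything sync, even iSCSI writes to a zvol
  zfs set sync=always tank/iscsi-vms

  # verify the current settings
  zfs get sync tank/nfs-vms tank/iscsi-vms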
Author: jgreco