
Sync writes, or: Why is my ESXi NFS so slow, and why is iSCSI faster?

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
13,759
This post is not specific to ESXi, however, ESXi users typically experience more trouble with this topic due to the way VMware works.

When an application on a client machine wishes to write something to storage, it issues a write request. On a machine with local storage, this involves talking to the hard drive HBA and issuing a command to write a given block to a given spot on the disk. This much should be obvious.

However, we are writing to a filesystem. The filesystem contains metadata and file data. In many cases, it is very important to the stability of the filesystem that metadata be updated in a careful and thoughtful manner, because if it isn't, and there's a crash or power loss, it could render the filesystem unstable, unusable, or even unrecoverable. A properly executed sync write provides assurance to the storage consumer (in this case the filesystem layer) that the requested operation has been committed to stable storage; upon completion, an attempt to read the block is guaranteed to return the data just written, regardless of any crashes, power failures, or other bad events. This usually means waiting for the drive to be available, then seeking to the sector, then writing the sector... a process that can take a little time.

File data blocks are typically much less important, and are usually written asynchronously by most filesystems. This means that the disk controller might have many write requests simultaneously queued up, with the controller and drive writing them out in whatever order is convenient.

Because users are mostly copying data files around, we're used to seeing throughput measured in MBytes/sec. Sync writes, even on a local system, are usually substantially slower.
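For a gut feel for the size of that gap, here's a rough benchmark sketch using GNU dd. The target path is a placeholder; point it at a file on the filesystem you actually want to test.

```shell
# Rough sketch of the async vs. sync write gap, assuming GNU dd.
# TARGET is a placeholder; point it at a file on the filesystem under test.
TARGET=${TARGET:-/tmp/ddtest}

# Async: writes are acknowledged from cache, so this mostly measures RAM
# and write-back buffering rather than the disks.
dd if=/dev/zero of="$TARGET" bs=1M count=64

# Sync: oflag=sync opens the file O_SYNC, so every write must reach stable
# storage before dd continues. Expect far lower throughput on spinning disks.
dd if=/dev/zero of="$TARGET" bs=1M count=64 oflag=sync

rm -f "$TARGET"
```

The two throughput numbers dd reports make the sync penalty obvious; the sync run is the one a SLOG would speed up.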

So here's the problem. ESXi represents your VM's disk as data in a vmdk file. However, it really has no clue as to what that VM is trying to write to disk, and as a result, it treats everything as precious. In other words, VMware writes nearly everything to the datastore as a sync write. This can result in significant slowdowns. On a local host, the typical solution to the problem is to get a RAID controller with write-back cache and a BBU.

The same thing happens over the network. VMware wants writes to be committed to stable storage, because generally speaking, ESXi has no clue as to the value, importance, or other attributes of the data that your VM is attempting to write.

Now, it is important to recognize here that VMware is actually trying to be careful and cautious with your data. Because here's the problem. Let's say your VM writes some filesystem metadata - say, the free block table. It writes an update that indicates a bunch of blocks have been used. ESXi sends that to NAS/SAN storage. Now, let's just say that it isn't instantly committed to stable storage, and let's further pretend that at that exact moment, the NAS crashes and reboots. Upon recovery, you now have a functioning NAS again - but the NAS storage is showing those blocks as free and available. Eventually, the VM may try to write to those blocks again. Catastrophe.

So VMware's solution is to make sure writes are flagged as sync. Basically they hand the problem over to the storage subsystem, which is, in their defense, a fair thing to do. A lot of NAS devices cope with this by turning on async writes. You can do this, but then you become susceptible to damage from a crash or power loss... the very thing VMware was trying to protect you from.

So.

ZFS stinks at sync writes. Much of what ZFS does is geared towards moving around massive quantities of data on large systems quickly. There are layers of buffering and caching to increase performance. A sync write basically forces ZFS to flush a lot of things before it would prefer to. To partially address this, ZFS has a feature called the ZIL, the ZFS Intent Log, which allows the system to commit a log of sync writes to the pool a little more quickly (and then update the pool as part of the normal and efficient transaction group write process). That's still not all that fast, so ZFS includes the capability to split the ZIL off onto a separate device, called a SLOG, which can be a fast SSD. With an SSD for SLOG, ZFS can perform fairly well under a sync write load.
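As a sketch of what adding a SLOG looks like in practice (the pool name and device names here are hypothetical examples, not recommendations):

```shell
# Hypothetical example: attach a fast SSD (da6) to the pool "tank" as a SLOG.
zpool add tank log da6

# Better: a mirrored SLOG, so losing one device can't cost in-flight sync writes.
# zpool add tank log mirror da6 da7

# Confirm the log vdev shows up under the pool.
zpool status tank
```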

If you care about the integrity of your VM disks, sync writes - and guaranteeing that the data is actually written to stable storage - are both mandatory.

These are the four options you have:

  1. NFS by default will implement sync writes as requested by the ESXi client. By default, FreeNAS will properly store data using sync mode for an ESXi client. That's why it is slow. You can make it faster with a SSD SLOG device. How much faster is basically a function of how fast the SLOG device is.

  2. Some people suggest using "sync=disabled" on an NFS share to gain speed. This causes async writes of your VM data, and yes, it is lightning fast. However, in addition to turning off sync writes for the VM data, it also turns off sync writes for ZFS metadata. This may be hazardous to both your VM's and the integrity of your pool and ZFS filesystem.

  3. iSCSI by default does not implement sync writes. As such, it often appears to users to be much faster, and therefore a much better choice than NFS. However, your VM data is being written async, which is hazardous to your VM's. On the other hand, the ZFS filesystem and pool metadata are being written synchronously, which is a good thing. That means that this is probably the way to go if you refuse to buy a SSD SLOG device and are okay with some risk to your VM's.

  4. iSCSI can be made to implement sync writes. Set "sync=always" on the dataset. Write performance will be, of course, poor without a SLOG device.
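All four options above revolve around the per-dataset sync property. As a quick reference (the dataset name is an example):

```shell
# Inspect the current setting.
zfs get sync tank/vmware

# Option 2: force async writes - fast, but risky for the reasons above.
zfs set sync=disabled tank/vmware

# Option 4: force sync writes even for iSCSI zvols.
zfs set sync=always tank/vmware

# The default: honor whatever the client requests.
zfs set sync=standard tank/vmware
```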
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Nice write-up by the way. :D

Just a few comments (and I guess questions, if I'm wrong):

You mentioned a hardware RAID with BBU, but you didn't mention whether this is a preferred method for ZFS. Personally, when I've used the write cache on my RAID controller, my zpool performance tanked, so I'm kind of thinking you should have a comment that a RAID controller with BBU isn't preferred for ZFS. But you may have information contrary to this. I just know I'd never recommend a RAID with BBU based on my experience with it: my performance was less than 1/5th of what it was with the RAID controller's write cache disabled.

For #1, should the last sentence say "SLOG device is, and if the SLOG is actually caching the writes", since not all writes are committed to the SLOG?

For #2, maybe it would be good to clarify that your data is safe except during a loss of system power, a kernel panic, or another situation where unwritten data in RAM cannot be written to the zpool. If those things never happen, your data is just as safe as always. Of course, I'm not exactly advocating people take this path because "their system is stable and has a UPS". That's still taking a big risk and putting all your eggs in one basket. After all, one accidental reboot or kernel panic and it may be game over for the zpool.
 

paleoN

Neophyte Sage
Joined
Apr 22, 2012
Messages
1,403
Some people suggest using "sync=disabled" on an NFS share to gain speed. This causes async writes of your VM data, and yes, it is lightning fast. However, in addition to turning off sync writes for the VM data, it also turns off sync writes for ZFS metadata. This may be hazardous to both your VM's and the integrity of your pool and ZFS filesystem.
The integrity of the pool is not affected by this. Data is affected as you noted. The ZIL is not a journal and is not required for ZFS on-disk consistency. I fail to see how porting to FreeBSD changes any of that. I would be interested in any references to the contrary.

Perhaps you are thinking of vfs.zfs.cache_flush_disable which disables the flushing of the disk cache and can lead to pool-wide corruption.
Code:
/*
 * Tunable parameter for debugging or performance analysis.  Setting
 * zfs_nocacheflush will cause corruption on power loss if a volatile
 * out-of-order write cache is enabled.
 */
boolean_t zfs_nocacheflush = B_FALSE;
 

pbucher

Member
Joined
Oct 15, 2012
Messages
180
Excellent write up..

Just a quick additional note for folks who wish to run ESXi with NFS and sync=disabled. To help contain the damage to just your VMs (if you use FreeNAS for other things), create a dataset within your zpool that is dedicated to just your VMs, export it via NFS, and set sync=disabled on just that dataset rather than the entire pool. You still have risks, though it's hard to say how the additional risks of this method compare to iSCSI with sync=standard; either way you could lose a very critical write from VMware. With iSCSI it could be a VMFS filesystem update, versus a ZFS filesystem write with NFS; either way almost everything will be hosed. I just don't know how VMFS compares to ZFS for recovery.
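A minimal sketch of that layout (pool and dataset names are illustrative):

```shell
# Confine sync=disabled to a dataset holding only the VMs, so the rest of
# the pool keeps the default sync behavior.
zfs create tank/vms
zfs set sync=disabled tank/vms

# Export only tank/vms via NFS and point the ESXi datastore at it.
# Everything else on tank stays at sync=standard.
zfs get sync tank tank/vms
```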

In some brief testing I found that iSCSI with sync=always performed pretty much in line with NFS and sync=standard for a zpool without a SLOG device.
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
13,759
The integrity of the pool is not affected by this.
The jury's still out on that one. That statement is supposed to be true for Solaris, but I've seen conflicting statements for FreeBSD, and there is a compelling argument that since ZFS was not designed with the feature from the ground up, there is also a lot of room for unanticipated/unexpected interactions - even on Solaris.
 

paleoN

Neophyte Sage
Joined
Apr 22, 2012
Messages
1,403
The jury's still out on that one.
Whose jury?

That statement is supposed to be true for Solaris, but I've seen conflicting statements for FreeBSD, and there is a compelling argument that since ZFS was not designed with the feature from the ground up, there is also a lot of room for unanticipated/unexpected interactions - even on Solaris.
I haven't run across conflicting statements for FreeBSD, but there is a lot I haven't seen. I will reiterate that I'm interested in links to any such statements, assuming Oracle hasn't broken them.

What wasn't designed from the ground up - disabling the ZIL per dataset, or globally disabling the ZIL? Neil Perrin, the ZFS dev who wrote the ZIL code, has stated from the beginning that ZFS does not require the ZIL for on-disk consistency. Metadata is essentially written after the data, and I don't understand how disabling the ZIL suddenly breaks all these metadata writes. But then, I'm no ZFS expert either.
 

KTrain

Member
Joined
Dec 29, 2013
Messages
36
Thanks for the great write up! Helps me a lot as a newcomer.
 

datnus

Member
Joined
Jan 25, 2013
Messages
102
Hi jgreco,
Let's say I'm using an SSD SLOG and sync=standard for iSCSI.
1) What happens to my VMs if the power to FreeNAS is cut or FreeNAS is shut down?
Will my VMs be corrupted if they are writing data to their vmdk disks?

2) Would ZFS metadata be helpful in this situation?
Thanks for clearing up my doubt.

iSCSI by default does not implement sync writes. As such, it often appears to users to be much faster, and therefore a much better choice than NFS. However, your VM data is being written async, which is hazardous to your VM's. On the other hand, the ZFS filesystem and pool metadata are being written synchronously, which is a good thing. That means that this is probably the way to go if you refuse to buy a SSD SLOG device and are okay with some risk to your VM's.
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
13,759
1) While not a guarantee, "yes" is the answer you seek.
2) No.
 

pbucher

Member
Joined
Oct 15, 2012
Messages
180
1) While not a guarantee, "yes" is the answer you seek.
2) No.
Agreed. On #1, it's the same as pulling the plug on your physical server when it's in the middle of writing data. You may be lucky and everything just recovers when it's turned back on, or you might have nothing but corrupted files and need to find your backups.
 

Ken Almond

Junior Member
Joined
May 11, 2014
Messages
19
Thanks to all for the info - and to me, it solves my issue.
1) I did many things (new network cards, new net cables, new switch) to see why my FreeNAS was 'so slow' ... no good answer. All the while, it was VMware itself that was a key contributor. I understand that now. I also discovered iSCSI was about 2x faster than NFS.
2) Today, I am cloning about 800GB to my NFS (FreeNAS) datastore and it was running at about 3GB/sec... too slow. I had previously read about zfs set sync... so I knew it applies a) immediately and b) dynamically (you can change it while things are going on). And my system is rock solid and I have a UPS... so, for the duration of the 800GB transfer, I set sync=disabled and WOW, my performance went from 3GB/sec to 111GB/sec as shown on VMware KBps and on the FreeNAS Interface Traffic graph. Amazing!!!
3) I liked the suggestion about a VMware volume - will do that in future.
4) This meets my personal goals - e.g. a way to dynamically speed up writes for a few hours while I make some backup copies. Then I can reset to sync=always, AND I can test the copied VMs to double-check.

But mainly I wanted to thank you all for this info - it's really hard to get good info sometimes.

Thanks
Ken
 

Ken Almond

Junior Member
Joined
May 11, 2014
Messages
19
my performance went from 3GB/sec to 111GB/sec as shown on VMware KBps
Sorry, I meant 3MB/sec to 111MB/sec (not GB).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I was going to ask what kind of hardware do you have that makes your pool go something like 4x the theoretical maximum for most server RAM throughput. :P
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
13,759
cyberjock, check back in 30 years ... we've gone from ST-506 and a few hundred KB/sec to now where a few hundred MB/sec is possible. Figure another 30 and we'll see a few hundred GB/sec :)
 

Ken Almond

Junior Member
Joined
May 11, 2014
Messages
19
I was going to ask what kind of hardware do you have that makes your pool go something like 4x the theoretical maximum for most server RAM throughput. :p
- FreeNAS box: Intel Quad Core Q9550, 8GB RAM, single 1GbE NIC, 5 x 3TB Seagate 7200 drives in a single RAIDZ1 volume.
- VMware box: NFS mount/copying --to--> FreeNAS volume is an i7-3930K, 64GB RAM, dual 1GbE NICs (teamed), Adaptec 3405 talking to 3TB Seagate 7200 drives.
Part of what drove up the throughput was running simultaneous clones of 5 x 150GB VMs (instead of just doing 1 at a time).
The CPU on FreeNAS was showing close to 100% busy with an average load of 6-ish.
 

pbucher

Member
Joined
Oct 15, 2012
Messages
180
Thanks to all for the info - and to me, it solves my issue.
1) I did many things (new network cards, new net cables, new switch) to see why my FreeNAS was 'so slow' ... no good answer. All the while, it was VMware itself that was a key contributor. I understand that now. I also discovered iSCSI was about 2x faster than NFS.

Ken,

Nice work on navigating the fun of ZFS & VMware. You didn't mention it, but I'll assume you figured out that iSCSI with VMware isn't committing your data to stable storage, which is where your 2x increase over NFS comes from; if not, read around some more.

-Paul
 

Ken Almond

Junior Member
Joined
May 11, 2014
Messages
19
Ken,

Nice work on navigating the fun of ZFS & VMware. You didn't mention it, but I'll assume you figured out that iSCSI with VMware isn't committing your data to stable storage, which is where your 2x increase over NFS comes from; if not, read around some more.

-Paul
Yes, the iSCSI approach doesn't solve anything - NFS plus being aware of the sync setting seems like the safer approach. The other reason I tried iSCSI was just to see how it worked with VMware from a 'hookup' standpoint.
 

Ken Almond

Junior Member
Joined
May 11, 2014
Messages
19
> something like 4x the theoretical maximum for most server RAM throughput
Hopefully you can see this screenshot of FreeNAS performance - Interface Traffic at 900,000,000 bps / 8 = 112,500,000 bytes/sec; / 1024 = 109,863 KB/s; / 1024 = 107.28 MB/s.
The dip down to 400 Mbps is when some of the 5 simultaneous VMware migrations finished. The uptick is when I added more.
In addition to this interface traffic on FreeNAS, the VMware side showed 100+ MB/s in terms of Write Bytes on the NFS datastore.
 


Peppermint

Neophyte
Joined
Jul 8, 2014
Messages
9
Hello,

I ran into this problem... slow write and read performance with my ESXi and FreeNAS (9.2.1.5) using NFS (~130MB/s max read inside the VM). NFS to a SLES machine runs faster (>400MB/s write). All involved machines are using 10GbE and the newest drivers for the ethernet cards. I tested the option zfs set sync=disabled, but without any rise in IO performance. I had hoped that this would increase the speed temporarily! (I am aware that turning sync off may corrupt the pool.) I also tested FreeNAS 9.3, with slower read and write speeds via NFS than with FreeNAS 9.2...
What might be going wrong here? Any ideas?


2x Xeon E5-2609 @ 2,5GHz
128GB RAM
Intel x540 T2 dual port 10 GbE-card
LSI 9300 12Gb/s SAS 8i with 12 HGST UltraStar 15K600 (6 of them in a software RAID 10)


Thanks
Peppermint
 

reqlez

Member
Joined
Mar 15, 2014
Messages
84
I ran into this problem... slow write and read performance with my ESXi and FreeNAS (9.2.1.5) using NFS (~130MB/s max read inside the VM).
Wow, that's a killer system. My under-$2k system can do more than that.

So what are you using to test speed? Let me use your tool and I can compare my ESXi NFS speeds. By the way, I saw somewhere that 12Gb/s controller support is experimental in FreeNAS. Can somebody confirm that is still the case?

Also, maybe you have atime enabled on the pool? That can be bad for performance.
 