Performance notes: sync vs. async, iSCSI vs. NFS


Will Dormann
I've been doing some performance testing with a FreeNAS rig that's being used as a datastore for vSphere. Some aspects were not obvious, so I'm sharing the results here.

Storage usage
The FreeNAS box provides the datastores for a VMware vSphere instance, with probably about 100 concurrently running VMs. All of the VMs using this storage are performing fuzz testing. The output from those machines is collected in a zpool that is synchronous (exported via NFS) and backed up. The output is important. The VMs themselves are NOT important: they are ephemeral and created dynamically from scripts. This aspect turns out to be quite important.
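To make that split concrete, here is a rough sketch of how the two datasets could be configured. The pool and dataset names (tank/fuzz-output, tank/vmstore) are made up for illustration, not our actual layout.

```python
#!/usr/bin/env python3
"""Sketch: apply per-dataset sync policies for the layout described above.
The pool/dataset names are hypothetical -- adjust for your own system."""
import subprocess

SYNC_POLICY = {
    "tank/fuzz-output": "always",    # important fuzzer results, exported over NFS
    "tank/vmstore":     "disabled",  # disposable VM storage; relaxed per Conclusion #1 below
}

def zfs_set_sync(dataset, value):
    """Set the sync property on one dataset and read it back to confirm."""
    subprocess.check_call(["zfs", "set", "sync=%s" % value, dataset])
    out = subprocess.check_output(
        ["zfs", "get", "-H", "-o", "value", "sync", dataset])
    print("%s sync=%s" % (dataset, out.strip().decode()))

if __name__ == "__main__":
    for ds, value in SYNC_POLICY.items():
        zfs_set_sync(ds, value)
```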

Experimentation: sync vs. async
First zpool configuration: sync=always, shared to vSphere (initially over iSCSI, NFS by the time of the screenshots; see the edit note below). As noted in another thread of mine, this was slower than I expected. However, the benchmarking I was doing wasn't realistic for a number of reasons, so I wasn't too concerned, and the SLOG device at that point was only a consumer-grade SATA drive.
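For reference, attaching a dedicated log device to an existing pool looks roughly like the sketch below; the pool name and device node are placeholders, not our hardware.

```python
#!/usr/bin/env python3
"""Sketch: attach a dedicated log (SLOG) device to an existing pool and
confirm it shows up. Pool name and device path are placeholders."""
import subprocess

POOL = "tank"            # hypothetical pool name
SLOG_DEV = "/dev/da6"    # hypothetical device node for the SLOG drive

# 'zpool add <pool> log <device>' adds a separate intent-log vdev.
subprocess.check_call(["zpool", "add", POOL, "log", SLOG_DEV])

# 'zpool status' should now list the device under a "logs" section.
print(subprocess.check_output(["zpool", "status", POOL]).decode())
```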

Even with a SAS2-attached ZeusRAM SLOG device, the sync=always configuration was significantly faster than with no SLOG, but still sluggish. This is the part that surprised me a bit. The VMs that lived on the iSCSI share were sluggish, and interacting with a guest OS had noticeable latency. The fuzzing campaigns should have pegged the available vSphere CPU, but CPU usage was significantly lower than it should have been:
[Screenshot: vSphere CPU usage, lower than expected under sync=always]


Each guest OS was spending more time waiting on I/O than using the CPU as it should have been. Just as an experiment, I tried setting sync=standard while the VMs were running. Disregard the screenshot label; it really is sync=standard. Edit: As I review my notes, the system was in NFS rather than iSCSI mode at the time of the screenshots. That's why the annotations indicate sync=disabled vs. sync=standard. The conclusions about sync vs. async should still apply. The results were quite impressive:
[Screenshot: vSphere CPU usage after switching away from sync=always]


At the point of disabling sync=always, the vSphere environment's throughput went up significantly. That is, CPU usage is now pretty much maxed out, just from removing the sync requirement.
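To see why forcing sync on every write has this effect, here's a crude standalone sketch (not the benchmark behind the graphs here) comparing fsync-per-write against buffered writes. The path, block size, and counts are arbitrary.

```python
#!/usr/bin/env python3
"""Crude sketch: throughput of fsync-per-write vs. buffered writes.
A stand-in illustration only; path, block size, and counts are arbitrary."""
import os
import time

PATH = "/mnt/tank/bench.tmp"   # hypothetical path on the pool under test
BLOCK = b"\0" * 4096           # 4 KiB writes
COUNT = 2000

def run(sync_each_write):
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    start = time.time()
    for _ in range(COUNT):
        os.write(fd, BLOCK)
        if sync_each_write:
            os.fsync(fd)       # force the write to stable storage (ZIL/SLOG)
    os.close(fd)
    elapsed = time.time() - start
    mb = COUNT * len(BLOCK) / 1e6
    print("%s: %.1f MB in %.2fs (%.1f MB/s)" %
          ("fsync each write" if sync_each_write else "buffered",
           mb, elapsed, mb / elapsed))

run(sync_each_write=True)
run(sync_each_write=False)
os.unlink(PATH)
```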

While CPU usage is a good indication of how well the VMs are running, I also looked at network and disk activity over a window where both sync and async transfers were visible:
[Screenshot: network throughput across the sync/async transition]


Here is a clear indication of higher network throughput once sync=always was disabled at the zpool level. And again at the disk level:
[Screenshot: disk throughput across the sync/async transition]

In sync=always mode, the disk activity was less consistent, with large differences between the peaks and valleys. At the point where sync=always is disabled, the disk activity is more consistent and at a higher throughput.
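If you'd rather capture raw numbers than graphs, one simple way to get the same disk-level view is to timestamp zpool iostat samples. A minimal sketch; the pool name and interval are placeholders.

```python
#!/usr/bin/env python3
"""Sketch: log pool throughput over time by timestamping 'zpool iostat'
samples. Pool name and interval are placeholders."""
import subprocess
import sys
import time

POOL = "tank"      # hypothetical pool name
INTERVAL = "5"     # seconds between samples

# 'zpool iostat <pool> <interval>' prints bandwidth/ops stats
# every INTERVAL seconds until interrupted.
proc = subprocess.Popen(["zpool", "iostat", POOL, INTERVAL],
                        stdout=subprocess.PIPE, universal_newlines=True)
try:
    for line in proc.stdout:
        sys.stdout.write("%s  %s" % (time.strftime("%H:%M:%S"), line))
except KeyboardInterrupt:
    proc.terminate()
```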

Conclusion #1: sync=always doesn't make sense for us, even with a fast SLOG
Before you say that not running in sync mode is dangerous / foolish, refer once again to our specific use case for the storage. The ZeusRAM simply didn't come close enough to async mode for our purpose. I don't have the ability to test an NVMe SLOG, so it's not clear where it would lie in the performance spectrum between sync+ZeusRAM and async.

Experimentation: iSCSI vs. NFS
The initial configuration of our FreeNAS system used iSCSI for vSphere. However, FreeNAS would occasionally panic, with details matching those outlined in another thread. Until that bug was fixed, I experimented with NFS as an alternative for providing the vSphere store.

When just using the guest VMs, they seemed responsive. The zpool was explicitly set to sync=disabled, so the storage would be async regardless of what NFS requested. However, looking at the vSphere performance counters, the latency of the storage was higher than expected: up to, and occasionally over, 100ms for read/write latency. That's about 10x higher than expected for a hard disk, and it occurred even under light load.

Once the iSCSI panic was fixed, I switched back to iSCSI. At this point, the latency for the same underlying disk structure went down to about 5ms!
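For anyone without access to the vSphere counters, a crude in-guest probe like the following can give a rough feel for per-write latency. It's only a sanity check; the 100ms and 5ms figures above came from esxtop / the vSphere client, not from this script, and the path and sizes are arbitrary.

```python
#!/usr/bin/env python3
"""Sketch: rough per-I/O write latency probe run from inside a guest VM.
Only a sanity check; path and sizes are arbitrary."""
import os
import time

PATH = "latency_probe.tmp"   # a file on the datastore-backed guest disk
SAMPLES = 200

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
latencies = []
for _ in range(SAMPLES):
    t0 = time.time()
    os.write(fd, b"x" * 4096)
    os.fsync(fd)                         # push through the guest cache
    latencies.append((time.time() - t0) * 1000.0)
os.close(fd)
os.unlink(PATH)

latencies.sort()
print("median: %.1f ms   95th pct: %.1f ms" %
      (latencies[len(latencies) // 2],
       latencies[int(len(latencies) * 0.95)]))
```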

Conclusion #2: NFS has a significantly higher latency than iSCSI
Whether or not this is important to you depends on what you're doing.
 

cyberjock
I'd like to make a few comments. These may sound scathing, but I haven't had coffee yet, so I apologize if they do. That is not my intention!

1. If you are still on the latest build and having panics, you should put in a bug ticket at bugs.freenas.org and include a copy of your debug file. I don't think you are, but I'm just stating this for clarity.

2. You said that latency was often 10x higher than expected for a hard drive. Can you provide a debug file? That will allow me to possibly tell you why it was so high.

3. iSCSI is 'where it's at' with regard to getting great performance with FreeNAS/TrueNAS when using ESXi hosts. NFS is a second-class citizen on the ESXi side of the house. With Xen, though, the opposite is somewhat true: if you are using Xen, I'd recommend you consider NFS first and foremost. Additionally, iSCSI has some major performance opportunities because it is kernel-based while NFS is not. So, all things being equal on the VM host side, iSCSI should outperform NFS virtually every time. For many, the fact that NFS uses files is far more important than performance, so you can still justify using NFS on ESXi if that is what you want/need.

Conclusion #1: sync=always doesn't make sense for us, even with a fast SLOG
Before you say that not running in sync mode is dangerous / foolish, refer once again to our specific use case for the storage. The ZeusRAM simply didn't come close enough to async mode for our purpose. I don't have the ability to test an NVMe SLOG, so it's not clear where it would lie in the performance spectrum between sync+ZeusRAM and async.

Allow me to clarify a few things you didn't mention. You may or may not know these things, so I'll say them and you can figure out if this is something to learn or something you know.

1. Assuming you have a respectable workload of sync writes, a SLOG will always be slower than setting sync=disabled. With sync=disabled, any write is immediately acknowledged (i.e., cached in RAM); with sync=standard or sync=always, any sync write must incur additional latency because the data must be written to the SLOG for POSIX compliance (in addition to the copy in RAM). See the back-of-the-envelope numbers sketched after point 2.

2. If you value your data, sync=always is the conservative answer, because iSCSI generally isn't using sync writes with ESXi and the like. As you said, your VMs aren't important. That is probably the single biggest reason why your tests in no way reflect 99.999% of real-world scenarios. I have not yet met a single person who says "my VMs are disposable", which is what you are effectively saying when you set sync=disabled. More than a few people have lost most or all of their VMs when running with sync=disabled. So while your tests were another validation that sync=disabled is always faster than sync=standard or sync=always, they really weren't any indication of how people do things in the real world. In fact, if someone read this and then did what you did, they'd be taking a serious risk of losing their VMs.
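To put rough numbers on the latency penalty described in point 1 (every figure below is made up purely for illustration; it is not a measurement of a ZeusRAM or any other device):

```python
#!/usr/bin/env python3
"""Back-of-the-envelope sketch: how per-write latency caps single-stream
sync-write throughput. All latency figures are made-up illustrative values,
not measurements of any particular device."""

SCENARIOS = {
    "sync=disabled (ack from RAM)":       0.00002,   # 20 us, assumed
    "sync=always + fast SLOG":            0.0001,    # 100 us, assumed
    "sync=always, no SLOG (pool disks)":  0.005,     # 5 ms, assumed
}

for name, latency_s in SCENARIOS.items():
    iops = 1.0 / latency_s                 # one outstanding write at a time
    mb_s = iops * 4096 / 1e6               # assuming 4 KiB writes
    print("%-36s ~%8.0f IOPS  (~%6.1f MB/s at 4 KiB)" % (name, iops, mb_s))
```

The point is simply that with one outstanding write at a time, per-write latency sets a hard ceiling on sync-write throughput; queue depth and batching change the numbers but not the ordering.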

I could easily argue that any production VM on an iSCSI target should be on a target with sync=always. Does this mean you are basically forced to use a SLOG? Yep. But it also means you aren't going to have an unbootable VM someday after a spontaneous power loss, panic, etc. And believe me, it can happen, and it has happened to quite a few people. Some of them even ran tests where they randomly powered off their storage just to see if sync=disabled was really as dangerous as everyone claimed. They couldn't find anything wrong after multiple attempts, decided that the people claiming serious damage to VMs with sync=disabled were full of you-know-what, and set sync=disabled to maximize performance. It's really like playing Russian roulette. You might be the only guy who walks away from the game, but it's far safer to never play.

Quite a few people (and more every day) are virtualizing massive amounts of infrastructure. This puts all your eggs in one basket, and a failure of your storage can spell disaster for your company. This means you should be babying your storage server even more, since your entire enterprise may depend on that one storage system not suddenly blacking out on you.

Even back in 2013 when I did one-time support for some very desperate users in the forums, there were a few companies that basically lost everything in one swoop due to failure of their storage and having no verified good backups. The fact that your VMs are disposable makes you the exception, not the norm. ;)

What has been discussed in the past is having sync=always be the default setting when you set up a zvol as an iSCSI extent. It looks like this is being considered yet again.
 

Pointeo13
1. If you are still on the latest build and having panics, you should put in a bug ticket at bugs.freenas.org and include a copy of your debug file. I don't think you are, but I'm just stating this for clarity.

^ This. Ever since they put the fix in place, I have had zero random reboots. Now I just need to figure out my last issue with FreeNAS and it will run like a dream.
 

Will Dormann
I'd like to make a few comments. These may sound scathing, but I haven't had coffee yet, so I apologize if they do. That is not my intention!
1. If you are still on the latest build and having panics, you should put in a bug ticket at bugs.freenas.org and include a copy of your debug file. I don't think you are, but I'm just stating this for clarity.
2. You said that latency was often 10x higher than expected for a hard drive. Can you provide a debug file? That will allow me to possibly tell you why it was so high.

Nothing sounds scathing to me. Every point you made is quite valid. Yes, having disposable VMs puts us in a unique situation where we don't require sync. To answer your questions:

1. We had those panics only until the December release of FreeNAS 9.3. The bug was fixed with that release: https://bugs.freenas.org/issues/12080
(note that mere mortals will likely not have permission to access that bug)
2. The latency was measured on the vSphere side of things, using esxtop and the vSphere client GUI, so there's not much on the FreeNAS side that I can provide. I merely wanted to share that, from the vSphere side, NFS shows much higher latency than iSCSI. From a real-world usability perspective (e.g. poking around and using a guest OS VM), I couldn't notice any problem, though.
 