Write performance discrepancy between pool and VM on NFS

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
Hi everyone

I could use your help in getting some understanding of what I'm seeing here.

I am unhappy with the disk write performance I see on an ESXi VM running on an NFSv3 datastore. That datastore is provisioned on TrueNAS and backed by a three-disk NVMe RAIDZ1 pool. The pool has neither L2ARC nor SLOG, by the way.

Running fio on both TrueNAS and the VM, I get an enormous discrepancy, and I'd appreciate help understanding where it's coming from.

The commands I used:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=posixaio, iodepth=16
and
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4.0KiB-4.0KiB, (W) 4.0KiB-4.0KiB, (T) 4.0KiB-4.0KiB, ioengine=posixaio, iodepth=16


TrueNAS result:
Run status group 0 (all jobs):
WRITE: bw=786MiB/s (824MB/s), 48.9MiB/s-50.2MiB/s (51.2MB/s-52.6MB/s), io=49.0GiB (52.6GB), run=62611-63766msec
and
Run status group 0 (all jobs):
WRITE: bw=16.7MiB/s (17.5MB/s), 970KiB/s-1161KiB/s (993kB/s-1189kB/s), io=1016MiB (1066MB), run=60692-60765msec

VM result:
Run status group 0 (all jobs):
WRITE: bw=10.3MiB/s (10.8MB/s), 537KiB/s-1153KiB/s (550kB/s-1180kB/s), io=1095MiB (1148MB), run=66888-106317msec
and
Run status group 0 (all jobs):
WRITE: bw=47.6MiB/s (49.9MB/s), 2989KiB/s-3472KiB/s (3061kB/s-3555kB/s), io=3432MiB (3598MB), run=63187-72113msec
Hardware-wise, I have a single ESXi server that also contains all the disks. The HBA and the NVMe drives are passed through via PCI passthrough to a TrueNAS VM.
The server is connected with 10G fiber to a Mikrotik switch. However, since both are VMs running on the same hypervisor, I would expect traffic never to leave the vSwitch. And indeed, keeping an eye on the Rx and Tx rates on the Mikrotik switch port, I see no change in traffic when running fio in the VM.

So the interesting question becomes: where is the bottleneck? I'll be honest, from the GUI I am currently unable to determine exactly how I have configured networking on TrueNAS. All I see is two vmx interfaces with an IP each, one for management, one in the storage VLAN. Since the host only has the one physical NIC, even if I had switched up the NICs somehow, they'd still both go out on the same vSwitch and thus the same physical NIC to the outside world... which is unused...

I find it weird that the 64k block size test is so much slower remotely, while the 4k test is actually faster than it is locally. Have I misconfigured the record size of the pool?

What information do you need from me, and what should I look into? I thought about deactivating sync (currently set to standard), but given that fio gives adequate performance when run directly on TrueNAS, I'm not sure that's the issue?
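For reference, a dataset's current sync behaviour can be checked and changed from the TrueNAS shell; the dataset name below is a placeholder, not from this thread:

```shell
# Show the sync property (standard / always / disabled) for a dataset.
# "standard" means: honor sync only when the client requests it.
# "tank/esxi-datastore" is a placeholder dataset name.
zfs get sync tank/esxi-datastore

# To experiment (unsafe for VM data if the host loses power):
zfs set sync=disabled tank/esxi-datastore
```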

Grateful for any pointers you could provide.

Marco
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
Okay I feel like I lack some fundamental understanding here.

I just added an NVMe as a SLOG to my 12 spindle RAIDZ1 pool. I am restoring a VM onto a NFS shared dataset on it. atime is off. I get 60 MB/s over 10G.

That just can't be right, can it? Shouldn't all writes go to the SLOG now?

And yes, I am aware that having a single NVMe for a SLOG is a risk. This is for testing purposes only.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
Sync is set to Standard on all ESXi datasets. From my understanding, sync should only be active when requested by the client, no?

I ran fio both in the TrueNAS shell, directly in the mountpoint of that dataset, and in a VM I restored onto it.

On TrueNAS I saw about 180 MiB/s, in the VM only about 60. So a factor of three.

For the other datastore, which resides on an NVMe RAIDZ1, I see almost 1 GiB/s throughput on TrueNAS with --end_fsync=1, while in the VM without --end_fsync=1 I only get a measly 90 MiB/s.

Okay, I see. The moment I truly set sync to disabled, I see 300 MiB/s. Still not really getting anywhere near 10G, but it's something.

I'm just not sure whether async writes are a good idea for VMs?
 

nabsltd

Contributor
Joined
Jul 1, 2022
Messages
133
I just added an NVMe as a SLOG to my 12 spindle RAIDZ1 pool. I am restoring a VM onto a NFS shared dataset on it. atime is off. I get 60 MB/s over 10G.

That just can't be right, can it? Shouldn't all writes go to the SLOG now?
Although sync writes are held in RAM and written to SLOG, they eventually have to be written to the final destination in the pool. This happens after ZFS decides that too much data in the transaction group is being held in RAM, even though the data is "safe" on the SLOG. The data is then written to the final destination in the pool, and your 12 spindle RAIDZ1 pool will eventually report that all the disks have the data that needed to be written. Over the long term, this means that the write speed is exactly equal to the write speed of the spinning rust pool of disks with async writes.

SLOG is good for absorbing bursty writes, but not as helpful with a sustained write, although there are ZFS settings that can increase the delay before the data starts to be written from RAM to pool, thus allowing the SLOG to absorb a larger amount. I'm not a ZFS expert by any means, but I'm pretty sure that no matter what you do, the long term size of the data in the SLOG that has not been written to the pool cannot exceed your RAM size.
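The "ZFS settings that can increase the delay" mentioned above refer, as far as I know, to the OpenZFS dirty-data and transaction-group tunables; a sketch of where to find them on Linux (on TrueNAS CORE/FreeBSD the equivalents live under the vfs.zfs sysctl tree):

```shell
# How much dirty (not-yet-flushed) data ZFS will hold in RAM, in bytes,
# before it starts throttling writers and flushing to the pool:
cat /sys/module/zfs/parameters/zfs_dirty_data_max

# How often a transaction group is forced out regardless of size (seconds):
cat /sys/module/zfs/parameters/zfs_txg_timeout
```

Raising these lets the SLOG and RAM absorb a larger burst, at the cost of more data in flight if the system crashes.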

When people say "SLOG is not cache", this is what they mean. Writes to SLOG are not considered "safely written" by ZFS, regardless of how "safe" the SLOG device is.

Also, 12 disks in a RAIDZ1 is far too many. You'd be much better off with a pair of 6-disk RAIDZ1 vDevs.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
 

Marco2G

Dabbler
Joined
Sep 26, 2019
Messages
14
Although sync writes are held in RAM and written to SLOG, they eventually have to be written to the final destination in the pool. This happens after ZFS decides that too much data in the transaction group is being held in RAM, even though the data is "safe" on the SLOG. The data is then written to the final destination in the pool, and your 12 spindle RAIDZ1 pool will eventually report that all the disks have the data that needed to be written. Over the long term, this means that the write speed is exactly equal to the write speed of the spinning rust pool of disks with async writes.

SLOG is good for absorbing bursty writes, but not as helpful with a sustained write, although there are ZFS settings that can increase the delay before the data starts to be written from RAM to pool, thus allowing the SLOG to absorb a larger amount. I'm not a ZFS expert by any means, but I'm pretty sure that no matter what you do, the long term size of the data in the SLOG that has not been written to the pool cannot exceed your RAM size.

When people say "SLOG is not cache", this is what they mean. Writes to SLOG are not considered "safely written" by ZFS, regardless of how "safe" the SLOG device is.

Also, 12 disks in a RAIDZ1 is far too many. You'd be much better off with a pair of 6-disk RAIDZ1 vDevs.

The pool consists of two vdevs with six spindles each.

My current understanding: if the SLOG never holds more data than is already in the write cache in RAM, and the VM has 12 GB of RAM, then any burst larger than roughly 12 GB will need to flush to disk and await the sync no matter what. Even if the SLOG is 1 TB, RAM is the limiting factor.
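The practical ceiling is actually lower than total RAM: by default OpenZFS caps dirty data at roughly 10% of physical memory (bounded by zfs_dirty_data_max_max, historically 4 GiB). A back-of-the-envelope sketch, with illustrative numbers:

```python
# Rough sketch of the OpenZFS dirty-data ceiling. Assumes the default of
# ~10% of physical RAM, capped (historically at 4 GiB via
# zfs_dirty_data_max_max). Numbers are illustrative, not measured.

def dirty_data_ceiling(ram_bytes: int, cap_bytes: int = 4 * 2**30) -> int:
    """Approximate how much un-flushed write data ZFS will buffer."""
    return min(ram_bytes // 10, cap_bytes)

ram = 12 * 2**30  # 12 GiB in the TrueNAS VM
ceiling = dirty_data_ceiling(ram)
print(f"~{ceiling / 2**30:.1f} GiB of dirty data before writes throttle")
# → ~1.2 GiB of dirty data before writes throttle
```

So with 12 GiB of RAM, only about 1.2 GiB of burst gets absorbed before writes are paced at pool speed, regardless of SLOG size.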

An NVMe SLOG makes no sense for an NVMe pool IMO; even though the disk will have to handle parallel I/O for the ZIL and the data, I don't think my noob use case would notice the difference. However, a mirror vdev instead of RAIDZ1 could save a bit on parity calculations and write amplification.


That all being said, I will try increasing RAM. However, I still lack an explanation for why fio reaches nearly 1 GiB/s directly from the TrueNAS shell but only a third of that over NFS. RAM constraints from having to handle both ZFS and NFS?
 

Marco2G

Dabbler
Joined
Sep 26, 2019
Messages
14
Thank you, I've read through those already. Obviously I'm not smart enough to comprehend that :D.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
However, I still lack an explanation for why fio reaches nearly 1 GiB/s directly from the TrueNAS shell but only a third of that over NFS. RAM constraints from having to handle both ZFS and NFS?

What about compression?
 

Marco2G

Dabbler
Joined
Sep 26, 2019
Messages
14
I have read on many occasions that lz4 does not negatively impact performance. Is that wrong? I have it active in some places.
And since when is NFS block storage?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776

Marco2G

Dabbler
Joined
Sep 26, 2019
Messages
14
Since you stored VM disk images on it?
I have been a storage engineer working with EMC, HP, PureStorage, and lately NetApp, and THAT just sounds wrong on so many levels.

Sure, the VM itself interacts with the hypervisor at the block level, but the hypervisor MUST interact with TrueNAS via a file share when the datastore is attached via NFS.

Please elaborate, because I do not understand what you mean.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
The guest VM interacts with its virtual disk at the block level. That does not change just because the underlying datastore is a file system. Combine that with the fact that VMware uses sync writes on NFS by default.
 

Marco2G

Dabbler
Joined
Sep 26, 2019
Messages
14

What about compression?
I have read these before, but this time a point caught my eye: large contiguous files are better for RAIDZ. That would imply I would be better off using thick eager-zeroed VM disks on NFS datastores, no?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
ZFS is copy on write. No block is overwritten in place ever. So "contiguous" is a property unlikely to last.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Each block written by the guest to the virtual disk device is committed to ZFS synchronously (if the default settings of ESXi are active).

That means a new block is allocated in ZFS, the guest's virtual disk block is written to it, a block in your "large contiguous file" is marked as free. ZFS never overwrites blocks in place, but I am repeating myself. What exactly is unclear about this?

Block gets written to Ext4, ends up in the Ext4 cache inside the guest. Block is eventually flushed to the virtual "disk" inside the guest. VMware takes the block and writes it synchronously to NFS, which pushes it synchronously to ZFS. ZFS acknowledges the finished write operation to the NFS layer, which in turn ACKs it to ESXi, which in turn pretends to be a disk drive and ACKs it to the guest OS.

You do not have large contiguous writes with virtual machines, but a gazillion independent synchronous writes of a few blocks each. Just like with a database. The fact that there is an NFS layer in between does not change that characteristic.

Large contiguous write: copy video file to archive, perform VM export/backup with ghettoVCB, ... stuff like that.
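That VM-style write pattern can be approximated locally with fio by forcing small synchronous random writes, which should reproduce the slow NFS numbers even on the TrueNAS shell. A sketch; the target directory and sizes are illustrative, not from this thread:

```shell
# Approximate a VM workload: small random writes, each one fsync'd,
# so ZFS must treat them as sync writes just like ESXi does over NFS.
# /mnt/tank/vmtest is a placeholder path.
fio --name=vm-like --directory=/mnt/tank/vmtest \
    --ioengine=posixaio --rw=randwrite --bs=4k \
    --size=1g --numjobs=4 --iodepth=16 \
    --fsync=1 --runtime=60 --time_based
```

Compare this against the same command without --fsync=1 to see how much of the gap is the sync-write path rather than NFS itself.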
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Quoting the first part of that same resource:
RAIDZ (including Z2, Z3) is good for storing large sequential files. ZFS will allocate long, contiguous stretches of disk for large blocks of data, compressing them, and storing the parity in an efficient manner. RAIDZ makes very good use of the raw disk space available when you are using it in this fashion. However, RAIDZ is not good at storing small blocks of data. To illustrate what I mean, consider the case where you wish to store an 8K block of data on RAIDZ3. In order to store that, you store the 8K data block, then three additional parity blocks... not efficient. Further, from an IOPS perspective, a RAIDZ vdev tends to exhibit the IOPS behaviour of a single component disk (and the slowest one, at that).

So we see a lot of people coming into the forums trying to store VM data on their 12 disk wide RAIDZ2 and wonder why their 12 disk 30 TB array sucks for performance. It's exhibiting the speed of a single disk.

The solution to this is mirrors. Mirrors aren't as good at making good use of the raw disk space (because you only end up with 1/2 or 1/3 the space), but in return for the greater resource commitment, you get much better performance. First, mirrors do not consume a variable amount of space for parity. Second, you're likely to have more vdevs. That 12 drive system we were just talking about will have 4 three-way mirrors or 6 two-way mirrors, which is 4x or 6x the number of vdevs. This translates directly to greatly enhanced performance!​

Basically, you want mirrors in order to skip the write amplification (parity data) of RAIDZ and to get the increased parallel performance of multiple vdevs; additionally, you want around 50% free space in order to let the cow... graze freely.

Using mixed-use drives (rated for mixed read/write workloads) might also improve your pool's performance.
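For the 12-disk layout discussed above, the mirror arrangement would look something like this; pool and device names are placeholders:

```shell
# Sketch: 12 disks as six 2-way mirror vdevs instead of two RAIDZ1 vdevs.
# Six vdevs means roughly 6x the IOPS of a single-vdev RAIDZ layout,
# at the cost of 50% raw capacity. Names are placeholders.
zpool create tank \
    mirror da0 da1 \
    mirror da2 da3 \
    mirror da4 da5 \
    mirror da6 da7 \
    mirror da8 da9 \
    mirror da10 da11
```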

I'm assuming you are not CPU bottlenecked.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Block gets written to Ext4, ends up in Ext4 cache inside the guest. Block is eventually flushed to virtual "disk" inside the guest. VMware takes block, writes it synchronously to NFS which pushes it synchronously to ZFS. ZFS acknowledges finished write operation to NFS layer wich in turn ACKs it to ESXi which in turn pretends to be a disk drive and ACKs it to the guest OS.

You do not have large contigous writes with virtual machines but a gazillion of independent synchronous writes of few blocks. Just like with a database. The fact that there is an NFS layer inbetween does not change that characteristic.
NFS sort of matters here (vs. iSCSI) because the NFS client on VMware's ESXi hypervisor is requesting sync writes of TrueNAS. The iSCSI initiator, on the other hand, leaves data integrity assurances up to the target array.

If you force sync=always at the pool/dataset level, you'll likely see much closer results between a local fio run and a remote one over NFS.
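Forcing that at the dataset level is a one-liner; the dataset name here is a placeholder:

```shell
# Make every write to the VM dataset synchronous, so local fio and NFS
# clients get the same persistence guarantees (and comparable numbers).
# "tank/vmstore" is a placeholder dataset name.
zfs set sync=always tank/vmstore

# Verify:
zfs get sync tank/vmstore
```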

There's more at play here, including the potential for read-modify-write cycles against the .VMDK file itself (if it was provisioned fully and has 128K records; I don't believe we can do reclamation inside of them through VAAI NAS, only over VAAI block). There's also the nature of the NAND needed for an SLOG device: it has to be write-latency-optimized.
 