Hi everyone
I could use your help in understanding what I'm seeing here.
I am unhappy with the disk write performance I see on an ESXi VM that is running on an NFSv3 datastore. That datastore is provisioned on TrueNAS and backed by a RAIDZ1 of three NVMe drives; the pool has neither an L2ARC nor a SLOG, by the way. Hardware-wise I have a single ESXi server that also contains all the disks, and I pass the HBA and the NVMe drives through to the TrueNAS VM via PCI passthrough.
Running fio on both TrueNAS and the VM, I get an enormous discrepancy, and I would appreciate help understanding where it is coming from.
The commands I used:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=posixaio, iodepth=16
and
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=posixaio, iodepth=16
TrueNAS result:
Run status group 0 (all jobs):
WRITE: bw=786MiB/s (824MB/s), 48.9MiB/s-50.2MiB/s (51.2MB/s-52.6MB/s), io=49.0GiB (52.6GB), run=62611-63766msec
and
Run status group 0 (all jobs):
WRITE: bw=16.7MiB/s (17.5MB/s), 970KiB/s-1161KiB/s (993kB/s-1189kB/s), io=1016MiB (1066MB), run=60692-60765msec
VM result:
Run status group 0 (all jobs):
WRITE: bw=10.3MiB/s (10.8MB/s), 537KiB/s-1153KiB/s (550kB/s-1180kB/s), io=1095MiB (1148MB), run=66888-106317msec
and
Run status group 0 (all jobs):
WRITE: bw=47.6MiB/s (49.9MB/s), 2989KiB/s-3472KiB/s (3061kB/s-3555kB/s), io=3432MiB (3598MB), run=63187-72113msec
The server is connected with 10G fiber to a Mikrotik switch. However, since both are VMs running on the same hypervisor, I would expect the traffic never to leave the vSwitch. And indeed, keeping an eye on the Rx and Tx rates on the Mikrotik switch port, I see no change in traffic when running the fio command in the VM.
So the interesting question becomes: where did I bottleneck this? I'll be honest, from the GUI I am currently unable to determine exactly how I have configured networking on TrueNAS. All I see is two vmx interfaces, each with an IP: one for management, one in the storage VLAN. Since the host only has the one physical NIC, even if I had mixed up the NICs somehow, they would still both go out on the same vSwitch and thus the same physical NIC to the outside world... which is unused...
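If it helps, here is roughly how I could pull the interface and routing picture from the TrueNAS shell rather than the GUI; vmx1 and the IP below are placeholders for my storage interface and the ESXi storage-VLAN address, and the exact commands depend on whether this is CORE (FreeBSD) or SCALE (Linux):

ifconfig                        # list both vmx interfaces and their addresses ("ip addr" on SCALE)
route -n get 192.168.10.10      # CORE: which interface/route reaches the ESXi NFS client ("ip route get" on SCALE)
netstat -I vmx1 -w 1            # CORE: per-second traffic on the storage interface while fio runs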
I find it weird that the 64k run is so much slower over NFS, while the 4k run is actually much faster on the VM than locally. Have I misconfigured the blocksize of the pool?
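For what it's worth, this is how I would check the pool layout and the dataset's recordsize from the shell; tank/vmstore is only a placeholder for the dataset that backs the NFS datastore:

zpool status tank                             # confirm the three-disk RAIDZ1 layout
zfs get recordsize,compression tank/vmstore   # recordsize that incoming NFS writes get chunked into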
What information do you need from me, and what should I look into? I thought about deactivating sync (currently set to standard), but given that fio gives adequate performance when run directly on TrueNAS, I'm not sure that is the issue?
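For the sync question, this is the kind of check and test I had in mind, in case it is relevant; again, tank/vmstore is a placeholder, and I would only disable sync temporarily for a test since it is unsafe for VM data:

zfs get sync tank/vmstore            # currently "standard", so sync writes from NFS are honoured synchronously
zfs set sync=disabled tank/vmstore   # test only: re-run the fio job in the VM with sync off
zfs set sync=standard tank/vmstore   # revert straight after the test

Conversely, adding --sync=1 to the local 64k fio run on TrueNAS should force O_SYNC writes and make the local numbers more comparable to what ESXi does over NFS:

fio --name=sync-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1 --sync=1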
Grateful for any pointers you could provide.
Marco