Write performance discrepancy between pool and VM on NFS

nabsltd

Contributor
Joined
Jul 1, 2022
Messages
133
I have been a storage engineer working with EMC, HP, PureStorage and lately NetApp and THAT just sounds wrong on so many levels.
ZFS pools store blocks...they know nothing about files.

A ZFS dataset can expose either blocks or files. You have chosen to have your dataset expose files using the NFS protocol.

On top of that, you have ESXi that writes one large file on the NFS filesystem, and tells the VM "this is block storage".

Next, you have a VM that creates a filesystem on top of the virtualized block storage created by ESXi.

Basically, it's turtles all the way down.
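If it helps to see the distinction, this is roughly what the two flavours look like from the shell (the pool and dataset names are just placeholders):

# a filesystem dataset exposes files - this is what an NFS share sits on
zfs create tank/vmstore
# a zvol exposes a block device - this is what an iSCSI extent would use
zfs create -V 100G tank/vmvol
# one shows up as a filesystem, the other as a volume
zfs list -t filesystem,volume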
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
Each block written by the guest to the virtual disk device is committed to ZFS synchronously (if the default settings of ESXi are active).

That means a new block is allocated in ZFS, the guest's virtual disk block is written to it, a block in your "large contiguous file" is marked as free. ZFS never overwrites blocks in place, but I am repeating myself. What exactly is unclear about this?

Block gets written to Ext4, ends up in Ext4 cache inside the guest. Block is eventually flushed to the virtual "disk" inside the guest. VMware takes the block, writes it synchronously to NFS, which pushes it synchronously to ZFS. ZFS acknowledges the finished write operation to the NFS layer, which in turn ACKs it to ESXi, which in turn pretends to be a disk drive and ACKs it to the guest OS.

You do not have large contiguous writes with virtual machines but a gazillion independent synchronous writes of a few blocks each. Just like with a database. The fact that there is an NFS layer in between does not change that characteristic.

Large contiguous write: copy video file to archive, perform VM export/backup with ghettoVCB, ... stuff like that.
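If you want to see what that difference does on the pool itself, something along these lines should show it (the dataset path and sizes are just examples) - the first run mimics the VM pattern, the second the large contiguous copy:

# many small synchronous writes, like a VM or a database
fio --name=vm-like --directory=/mnt/tank/test --ioengine=posixaio --rw=randwrite --bs=4k --sync=1 --size=1g --runtime=60 --time_based
# one big sequential asynchronous write, like copying a video file
fio --name=copy-like --directory=/mnt/tank/test --ioengine=posixaio --rw=write --bs=1m --size=8g --runtime=60 --time_based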
With an MTU of 1500 on the network, subtracting packet overhead, that means less than 1500 bytes are available per IP packet. So no, the VM writing 4k blocks will not result in ESXi sending 4k packets to TrueNAS' NFS share. That is my whole point. NFS isn't a block protocol and will not deal in blocks but in packets.

At least that would be my understanding of it. So what I imagine happens:

Guest writes 4k to its virtual disk (the vmdk file). ESXi takes that 4k block, splits it into three packets and sends them over the network. The NFS server needs to put them back together, presumably, to write that 4k block to the dataset. ZFS then takes the 4k block, allocates space for it several times due to RAIDZ logic and, if sync is active, waits for each one to be written before sending an ACK to the NFS server, which then sends an ACK to ESXi, which then sends an ACK to the VM.

Is this even remotely correct?
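I guess one way to check what actually arrives on the TrueNAS side, rather than speculating, would be to look at the NFS server counters or the traffic itself while a test runs (the interface name is just an example):

# per-operation counters of the NFS server - watch the Write column grow
nfsstat -s
# raw look at the NFS traffic coming from the ESXi host
tcpdump -i em0 -s 0 port 2049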
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
ZFS pools store blocks...they know nothing about files.

A ZFS dataset can expose either blocks or files. You have chosen to have your dataset expose files using the NFS protocol.

On top of that, you have ESXi that writes one large file on the NFS filesystem, and tells the VM "this is block storage".

Next, you have a VM that creates a filesystem on top of the virtualized block storage created by ESXi.

Basically, it's turtles all the way down.
That's how I imagine it. But tell me, how is a large 20GB vmdk updated through NFS? I can hardly believe it would be usable for VMs if the whole file had to be rewritten on every IO. Can NFS partially change a file? And if so, on what basis? On a disk that would happen through proper block allocation, but NFS has no blocks..?
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
Oh, by the way, as an FYI: I have upped the host's memory and given TrueNAS 64 GB RAM. fio performance stayed more or less the same and the ZFS cache never went beyond 13.5 GB.

I guess as my next trick I'll move the VMs away and rebuild the NVMe pool with mirrored vdevs.
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Can NFS partially change a file? And if so, on which basis? On a disk that would happen through proper block allocation but NFS has no blocks..?
NFSv3: see RFC 1813, Section 3.3.7 (the WRITE procedure)
NFSv4: see RFC 3530, Section 14.2.36 (the WRITE operation)

The protocol details are open standards; this is roughly equivalent to the pwrite syscall.
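In other words, the client sends an offset, a length and the data, and the server updates just that range of the file. Locally the same thing looks roughly like this (the file name and offset are just examples):

# overwrite 4 KiB at a 1 MiB offset inside an existing large file,
# without truncating or rewriting the rest of it
dd if=/dev/random of=/mnt/tank/vm-flat.vmdk bs=4k count=1 seek=256 conv=notrunc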
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
Okay, this is my current status:
TrueNAS has 64 GB RAM, of which I've seen up to 50-something GB used as ZFS cache.
This is the new NVMe pool:
  pool: virtualmachines
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        virtualmachines                                 ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/572b07b1-d5bc-11ee-b079-000c2917d2c9  ONLINE       0     0     0
            gptid/572998ae-d5bc-11ee-b079-000c2917d2c9  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/57436c0a-d5bc-11ee-b079-000c2917d2c9  ONLINE       0     0     0
            gptid/573204e7-d5bc-11ee-b079-000c2917d2c9  ONLINE       0     0     0

errors: No known data errors
The pool has Auto TRIM enabled.

Attached are screenshots of the dataset configs. The only thing I can realistically fiddle with at this point is record size.
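For what it's worth, recordsize can be changed on the fly, but it only affects newly written blocks. A test would look something like this (the dataset name and value are just examples - smaller records are often suggested for VM storage):

zfs set recordsize=16K virtualmachines/vmware
zfs get recordsize virtualmachines/vmware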

After building the datasets, I ran a fio test that reached almost 2 GiB/s.

Now all I get is about 200+ MiB/s. That is on TrueNAS itself. Funny enough, the VMs can reach the same-ish values now. It's just not all that much.
 

Attachments

  • esx-nfs-1.PNG (56.9 KB)
  • esx-nfs-2.PNG (57.3 KB)
  • virtualmachines.PNG (52.4 KB)

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
What is the fragmentation value of the dataset you are testing?
What is the % of available space?
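Both show up in something like this (pool name taken from your zpool status output):

zpool list -o name,size,alloc,free,frag,cap virtualmachines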
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
Obviously not impossible but I haven't knowingly done any read tests so far and I used the shell history...

Nonetheless... 200 MiB/s on a two-vdev mirror pool? The Crucial P2 drives have a theoretical write throughput of 1.8 GB/s...

I have seen ~1 GiB/s several times before. After this major change, I saw 2 GiB/s. That much seems coherent. It's also a given that when I put an actual workload on it, IO contention will become a thing and rates will drop... but by a factor of 8?

I'm starting to wonder whether I have accidentally bifurcated a PCIe 3.0 x8 slot instead of an x16... Edit: Nope, it's an x16... so that should be 4 GB/s per gumstick...
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
You are not giving us much to work with apart from your numbers. How did you see those GiB/s numbers? Which tests did you run? What are your system's specs? You might be CPU bottlenecked as far as we know. Are you running anything else in parallel? How is the networking set up? Are you using dedup? How are you connecting the drives?

Edit: it appears the drives you are using (Crucial P2s, as far as I understand) are your issue here, since according to this review their 64K (sequential!) write performance is perfectly compatible with your numbers. You are most likely hitting the drive's SLC cache (or ZFS's own caching) when you reach the higher numbers.

(Attached image: sustained write performance chart for the Crucial P2 from the review.)
 
Last edited:

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
This VM has 8 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz cores

If I don't give you much to go on, how about linking a comprehensive list of what information is needed and how to gather it, so I can follow it?

I have shown the fio commands I use but here goes:

TrueNAS:
fio --name=random-write --ioengine=posixaio --rw=write --bs=4k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based
Run status group 0 (all jobs):
WRITE: bw=460MiB/s (482MB/s), 27.2MiB/s-29.6MiB/s (28.6MB/s-31.0MB/s), io=27.0GiB (29.0GB), run=60007-60009msec
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based
Run status group 0 (all jobs):
WRITE: bw=30.4MiB/s (31.9MB/s), 1925KiB/s-1994KiB/s (1972kB/s-2041kB/s), io=1830MiB (1918MB), run=60075-60115msec

I think what I witnessed here is the first job doing very well up until about 50%. It was going at almost 1 GiB/s, then took a knee to around 100+ MiB/s. That's probably when the ZFS cache ran full.

For the second test, ZFS cache was still full, so abysmal performance from the start.
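Next time I'll watch the pool live while fio runs to confirm when exactly it falls off, something like:

zpool iostat -v virtualmachines 1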

So yes, this would indicate poor performance by the P2s. However, even at the 120 MB/s you show in your test, I have two vdevs and should be able to sustain around 250 MB/s over the whole pool, right?

So let's get down to basics: How do I performance test the hardware devices directly?

EDIT: What is worse about the P2s, Crucial seems to have swapped the TLC for QLC chips without changing model numbers. I might have a mix going here... God damn it, the enshittification these days. Can't trust anybody...
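If I want to check whether my four drives are at least the same hardware revision, comparing the model and firmware strings should show a mismatch (device names are just examples - this won't say TLC or QLC outright, but differing revisions would be a hint):

smartctl -i /dev/nvme0
smartctl -i /dev/nvme1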
 
Last edited:

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
So yes, this would indicate poor performance by the P2s. However, even at the 120 MB/s you show in your test, I have two vdevs and should be able to sustain around 250 MB/s over the whole pool, right?
Nope, the review I posted refers to sequential writes: the random write performance is definitely going to be way lower than that (as you have seen in your testing).

So let's get down to basics: How do I performance test the hardware devices directly?
Other than fio? dd, which is also used in @jgreco's solnet array test.
As far as I am concerned, I don't think your tests are wrong; you simply have exceptionally bad drives for your use case.
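A quick non-destructive read check of a single drive could look like this (the device name is just an example; note that writing with dd straight to a pool member would destroy the pool, so don't do that on live disks):

dd if=/dev/nvd0 of=/dev/null bs=1m count=8192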
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
Okay, thanks.

In that case this leaves me with the decision to upgrade the NVMe pool at some point, if I want to invest that kinda dough.

While it's completely uncool performance for NVMe, it's still much better than what I'm seeing on the spindles, so yeah :D.

I want to thank everyone in this thread for taking the time to help.

I think at this point it is safe to say that the NFS/local discrepancy stems from the local test having come first and having profited from ZFS cache.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
That's probably when the ZFS cache ran full.

For the second test, ZFS cache was still full, so abysmal performance from the start.
Unless you ran the two tests directly after each other in a scripted fashion, it's likely less the ZFS "cache" (transaction logging/dirty data thresholds) and more the inability of the P2's QLC NAND to sustain steady writes.

The local fio test will have been able to run asynchronously, masking the slow writes of the P2's NAND behind ZFS's write buffering. If you want a more apples-to-apples test that removes the network factor, use zfs set sync=always Poolname/Datasetname and re-run the test. To change it back to the previous (inherit from parent) setting, use zfs inherit sync Poolname/Datasetname
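With the pool from earlier in the thread, that would be something along these lines (the dataset name is just an example):

zfs set sync=always virtualmachines/vmware
fio --name=sync-test --directory=/mnt/virtualmachines/vmware --ioengine=posixaio --rw=randwrite --bs=4k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based
zfs inherit sync virtualmachines/vmware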
 