Atrocious write performance

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
Hi everyone

A bit of background: I noticed that some shows in my Jellyfin VM took a long time to start playing. The VM runs on a zvol served over iSCSI from three NVMe gumsticks, with the media mounted via NFS from another ZFS pool on 12 spindles.

While troubleshooting that, I noticed that the transcoding directory had very bad write performance. I ran tests with dd and hdparm, and while hdparm reported outrageous read figures of multiple GB/s (impossible, as the LAN is 10G), dd was much more interesting.

I used
dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync
and
dd if=/dev/zero of=/tmp/test2.img bs=512 count=1000 oflag=dsync

And while the former gave me about 250 MB/s IIRC, the latter was down in the few hundred KB/s range.
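A quick back-of-the-envelope check (using a made-up round figure, since I didn't keep the exact dd output): at 512 bytes per synced write, a few hundred KB/s works out to only a few hundred sync writes per second, i.e. a couple of milliseconds per write:

echo "scale=1; 300*1024/512" | bc   # hypothetical 300 KB/s at bs=512 -> 600 synced writes per second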

So I SSHed into TrueNAS and ran

fio --direct=1 --rw=write --bs=4k --ioengine=posixaio --iodepth=64 --name=throughput-test-job --ramp_time=5 --size=1G --filename=/dev/zvol/pool0/somezvol
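(In hindsight I should have time-boxed these runs instead of Ctrl-C'ing the slow ones later; fio can do that with something like

fio --direct=1 --rw=write --bs=4k --ioengine=posixaio --iodepth=64 --name=throughput-test-job --ramp_time=5 --runtime=60 --time_based --size=1G --filename=/dev/zvol/pool0/somezvol

which is the same command, just capped at 60 seconds.)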

I tested a relatively empty, new zvol on the NVMe RAIDZ1 (lz4 compression and dedup on) and got:
Jobs: 1 (f=1): [W(1)][83.3%][w=180MiB/s][w=46.2k IOPS][eta 00m:02s]
throughput-test-job: (groupid=0, jobs=1): err= 0: pid=8683: Fri Feb 23 08:40:17 2024
write: IOPS=43.0k, BW=168MiB/s (176MB/s)(776MiB/4621msec); 0 zone resets
slat (usec): min=7, max=3267, avg=20.31, stdev=61.39
clat (usec): min=104, max=19307, avg=782.09, stdev=993.58
lat (usec): min=113, max=20324, avg=802.43, stdev=1014.72
clat percentiles (usec):
| 1.00th=[ 127], 5.00th=[ 159], 10.00th=[ 212], 20.00th=[ 314],
| 30.00th=[ 416], 40.00th=[ 519], 50.00th=[ 619], 60.00th=[ 725],
| 70.00th=[ 824], 80.00th=[ 930], 90.00th=[ 1106], 95.00th=[ 1647],
| 99.00th=[ 6063], 99.50th=[ 7504], 99.90th=[11076], 99.95th=[13042],
| 99.99th=[16319]
bw ( KiB/s): min=28781, max=236660, per=99.82%, avg=171602.11, stdev=83549.79, samples=9
iops : min= 7195, max=59165, avg=42900.22, stdev=20887.36, samples=9
lat (usec) : 250=13.68%, 500=24.56%, 750=24.53%, 1000=22.98%
lat (msec) : 2=10.39%, 4=1.91%, 10=1.82%, 20=0.16%
cpu : usr=5.74%, sys=69.91%, ctx=17869, majf=0, minf=1
IO depths : 1=1.6%, 2=3.1%, 4=6.2%, 8=12.5%, 16=25.0%, 32=50.0%, >=64=1.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=98.5%, 8=0.0%, 16=0.0%, 32=0.0%, 64=1.5%, >=64=0.0%
issued rwts: total=0,198531,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
WRITE: bw=168MiB/s (176MB/s), 168MiB/s-168MiB/s (176MB/s-176MB/s), io=776MiB (813MB), run=4621-4621msec

Not eye-wateringly, blazingly fast, but it looks workable.

Then, for shits and giggles, I made a zvol on the spindles, just to see. No compression, no dedup:
Jobs: 1 (f=1): [W(1)][15.8%][w=248KiB/s][w=62 IOPS][eta 48m:30s]
fio: terminating on signal 2

throughput-test-job: (groupid=0, jobs=1): err= 0: pid=8717: Fri Feb 23 08:52:06 2024
write: IOPS=76, BW=304KiB/s (312kB/s)(171MiB/575563msec); 0 zone resets
slat (msec): min=3, max=218, avg=13.15, stdev= 9.37
clat (usec): min=13, max=1815.6k, avg=417436.62, stdev=276328.38
lat (msec): min=3, max=1838, avg=430.58, stdev=277.27
clat percentiles (usec):
| 1.00th=[ 139], 5.00th=[ 33162], 10.00th=[ 68682],
| 20.00th=[ 149947], 30.00th=[ 231736], 40.00th=[ 312476],
| 50.00th=[ 396362], 60.00th=[ 480248], 70.00th=[ 566232],
| 80.00th=[ 641729], 90.00th=[ 759170], 95.00th=[ 943719],
| 99.00th=[1149240], 99.50th=[1216349], 99.90th=[1434452],
| 99.95th=[1518339], 99.99th=[1719665]
bw ( KiB/s): min= 487, max= 513, per=100.00%, avg=500.72, stdev= 4.64, samples=684
iops : min= 121, max= 128, avg=124.88, stdev= 1.19, samples=684
lat (usec) : 20=0.01%, 250=1.43%, 500=0.14%
lat (msec) : 4=0.01%, 10=0.26%, 20=1.39%, 50=4.03%, 100=6.32%
lat (msec) : 250=18.84%, 500=30.06%, 750=27.31%, 1000=6.49%, 2000=3.81%
cpu : usr=0.05%, sys=0.87%, ctx=87673, majf=0, minf=1
IO depths : 1=1.6%, 2=3.1%, 4=6.3%, 8=12.5%, 16=25.0%, 32=50.0%, >=64=1.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=98.5%, 8=0.0%, 16=0.1%, 32=0.0%, 64=1.5%, >=64=0.0%
issued rwts: total=0,43755,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
WRITE: bw=304KiB/s (312kB/s), 304KiB/s-304KiB/s (312kB/s-312kB/s), io=171MiB (179MB), run=575563-575563msec

I aborted this one because, well, it sucked :D

So I did another test on the NVMe pool, this time against an actual ESXi iSCSI datastore that is in use. lz4 and dedup on:

Jobs: 1 (f=1): [W(1)][1.4%][w=1009KiB/s][w=252 IOPS][eta 19m:46s]
fio: terminating on signal 2

throughput-test-job: (groupid=0, jobs=1): err= 0: pid=8809: Fri Feb 23 08:52:02 2024
write: IOPS=229, BW=929KiB/s (951kB/s)(11.1MiB/12290msec); 0 zone resets
slat (usec): min=2180, max=22360, avg=4360.02, stdev=2749.40
clat (usec): min=56, max=715089, avg=139909.62, stdev=114453.03
lat (msec): min=3, max=718, avg=144.34, stdev=115.49
clat percentiles (usec):
| 1.00th=[ 133], 5.00th=[ 10290], 10.00th=[ 22676], 20.00th=[ 47449],
| 30.00th=[ 71828], 40.00th=[ 96994], 50.00th=[121111], 60.00th=[145753],
| 70.00th=[173016], 80.00th=[198181], 90.00th=[270533], 95.00th=[367002],
| 99.00th=[566232], 99.50th=[641729], 99.90th=[708838], 99.95th=[708838],
| 99.99th=[717226]
bw ( KiB/s): min= 499, max= 1536, per=99.62%, avg=925.21, stdev=356.95, samples=24
iops : min= 124, max= 384, avg=230.96, stdev=89.20, samples=24
lat (usec) : 100=0.04%, 250=1.56%
lat (msec) : 4=1.28%, 10=1.70%, 20=4.29%, 50=12.31%, 100=20.76%
lat (msec) : 250=48.01%, 500=9.19%, 750=2.09%
cpu : usr=0.02%, sys=1.68%, ctx=5663, majf=0, minf=1
IO depths : 1=1.6%, 2=3.1%, 4=6.2%, 8=12.5%, 16=25.0%, 32=50.0%, >=64=1.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=98.4%, 8=0.0%, 16=0.0%, 32=0.0%, 64=1.6%, >=64=0.0%
issued rwts: total=0,2818,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
WRITE: bw=929KiB/s (951kB/s), 929KiB/s-929KiB/s (951kB/s-951kB/s), io=11.1MiB (11.7MB), run=12290-12290msec


And again atrocious values: 230 IOPS and about 1 MB/s of throughput.

I am fairly sure this is due to some misconfiguration on my part, but being the noob I am, I have trouble seeing it.

Here are the pools. newstorage is the spindle pool and, ignoring my test zvol, contains only filesystems. virtualmachines, as the name suggests, is the NVMe pool for ESXi:

pool: newstorage
state: ONLINE
scan: scrub repaired 0B in 05:03:45 with 0 errors on Sun Jan 28 05:03:45 2024
config:

NAME STATE READ WRITE CKSUM
newstorage ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
da1 ONLINE 0 0 0
da3 ONLINE 0 0 0
da5 ONLINE 0 0 0
da7 ONLINE 0 0 0
da9 ONLINE 0 0 0
da12 ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
da2 ONLINE 0 0 0
da4 ONLINE 0 0 0
da6 ONLINE 0 0 0
da8 ONLINE 0 0 0
da10 ONLINE 0 0 0
da13 ONLINE 0 0 0
spares
da14 AVAIL
da11 AVAIL

errors: No known data errors

pool: virtualmachines
state: ONLINE
scan: scrub repaired 0B in 00:38:23 with 0 errors on Sun Feb 18 00:38:57 2024
config:

NAME STATE READ WRITE CKSUM
virtualmachines ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gptid/4665d5bf-4a44-11ed-9615-002590f39cea ONLINE 0 0 0
gptid/4668c633-4a44-11ed-9615-002590f39cea ONLINE 0 0 0
gptid/466c8862-4a44-11ed-9615-002590f39cea ONLINE 0 0 0


My understanding is that a ZIL makes sense on slower disks only. Under the assumption that the NVMe should be plenty fast, I left it out. L2ARC is a read cache and should have no impact on my issue, right?

The gumsticks behind the NVMe pool are Crucial P2s. Certainly nothing too elaborate, but I would still expect more than 300 IOPS.

TrueNAS is running as a VM on ESXi, with the host's HBA and the NVMes passed through to it via PCI passthrough. It's a poor man's hyperconverged setup. The VM has 8 cores and 12 GB of RAM. Is it possible that I am RAM constrained due to dedup?

Any other ideas what I could have a look at here?
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Just to get you started, please list all your hardware in detail and use CODE tags.

My understanding is that a ZIL makes sense on slower disks only.
A ZIL is present on every pool by default; you mean a SLOG.

TrueNAS is running as a VM on ESXI, the Host's HBA and the NVMes being pcipassthroughed to it.
Are the NVMes passed through individually? Guide on virtualization

I don't have a resource at hand, but don't use dedup. When fast dedup is available you may rethink that; as of now, stay clear of it. It's resource intensive and will hurt performance.
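If you just want to stop new writes from being deduplicated, it's a single property change (blocks already written stay in the DDT until they are rewritten or the pool is recreated); the dataset name below is only an example:

zfs set dedup=off virtualmachines/yourzvol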
 
Last edited:

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Is it possible that I am RAM constrained due to dedup?
Not "possible": Certain!
Run zpool status -D.
You'll get a breakdown of the DDT, indicating how much space you're actually saving (probably not enough to make it worth it), and a line like
dedup: DDT entries #########, size 1.47K on disk, 167B in core
Multiply the number of entries by the "on disk" size and you have the size of the DDT in the pool; multiply the entries by the "in core" size and you have what ZFS wants to keep loaded in RAM at all times. If it doesn't fit, TrueNAS is swapping on each and every operation, and there go your hopes of performance.

Recommended minimum RAM is 16 GB.
Recommended minimum RAM for 10G is 32 GB. (RAM is the write cache: two 5-second transaction groups at 10 Gb/s, i.e. about 1.25 GB/s, come to 12.5 GB.)
Recommended minimum RAM for dedup is 64 GB.
Minimum RAM for dedup AND 10G is going to be a lot more than 12 GB.
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
@chuck32 I wrote ZIL because I meant ZIL. It may well be that I misunderstood what the ZIL is, but that is another matter. I am aware that there is always a ZIL, but on spinning rust the ZIL's I/O will, as far as I understand it, contend with the actual writes of user data.
Therefore you could move ZIL to an SSD to enhance responsiveness.
ARC is a read cache, as far as I remember, and L2ARC, as the name suggests, can be put on an SSD as well; but since my trouble is with write performance, I don't think ARC has any impact on my issue.

The NVMes are on a bifurcated PCIe card and are passed through individually.

I think dedup may be the culprit here. Unrelated to this issue, I have ordered another stick to fill the PCIe card and give me more storage. I understand I will have to recreate the pool anyway, so I will build the new pool without dedup or lz4 compression.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
LZ4 is harmless, even if your data does not compress.
Remove dedup, and forget about a SLOG (a dedicated device for the ZIL): if you seek write performance, set sync=disabled.
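For a zvol that is a one-liner (the dataset name here is only an example):

zfs set sync=disabled virtualmachines/yourzvol
zfs get sync virtualmachines/yourzvol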
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
@Etorix: I don't think I need maximum write performance. If that were necessary, I would have gone with better gumsticks. I'd just like the bottleneck to be a hypothetical 100 Mbit network rather than the NVMe pool, is all :D
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
Not "possible": Certain!
Run zpool status -D.
You'll get a breakdown of the DDT, indicating how much space you're actually saving (probably not enough to make it worth it), and a line like
dedup: DDT entries #########, size 1.47K on disk, 167B in core
Multiply the number of entries by the "on disk" size and you have the size of the DDT in the pool; multiply the entries by the "in core" size and you have what ZFS wants to keep loaded in RAM at all times. If it doesn't fit, TrueNAS is swapping on each and every operation, and there go your hopes of performance.

Recommended minimum RAM is 16 GB.
Recommended minimum RAM for 10G is 32 GB. (RAM is the write cache: two 5-second transaction groups at 10 Gb/s, i.e. about 1.25 GB/s, come to 12.5 GB.)
Recommended minimum RAM for dedup is 64 GB.
Minimum RAM for dedup AND 10G is going to be a lot more than 12 GB.
I get this:

dedup: DDT entries 2654604, size 3.01K on disk, 747B in core
So about 1.8 GiB in core?
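Quick sanity check on those numbers:

echo "2654604 * 747 / 1024^3" | bc -l    # ~1.85 GiB of DDT that ZFS wants resident in RAM
echo "2654604 * 3.01 / 1024^2" | bc -l   # ~7.6 GiB of DDT on disk (3.01 KiB per entry)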
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
Well, it seems it's not going to be as straightforward as all that.
The pool itself does not have dedup active. So I created a new zvol with neither lz4 nor dedup, plain vanilla, and gave it a 4K block size, seeing as the iSCSI block size is 512.

TrueNAS warned that 16K is recommended; I'll try that next. But so far, fio performance is still in the 300 IOPS range... Then I tried disabling sync, as recommended above, and now I'm seeing 45k IOPS at 170 MB/s.

Sync Always seems to have a brutal impact on my setup.
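For the 16K attempt I'll probably just do it from the shell, something along these lines (name and size are placeholders; volblocksize has to be set at creation time):

zfs create -V 100G -o volblocksize=16K -o compression=lz4 virtualmachines/testzvol
zfs get sync,volblocksize,compression,dedup virtualmachines/testzvol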
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Therefore you could move ZIL to an SSD to enhance responsiveness.
I edited my posts accordingly; you are right, I meant to say SLOG.
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
I am trying to test the iSCSI zvols with the new settings, and I'm coming to the conclusion that the current patch of ESXi 7.0 is REALLY sensitive to anything going on with those LUNs. Any change, it seems, even a resize, leads to ESXi seeing the device but being completely unwilling to mount the VMFS6 datastore.

I wonder if I should just do NFS instead and make my life easier.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Ouch! I had missed the part of your initial post that mentioned iSCSI. That is a performance killer, a notorious mismatch with any form of raidz, and another resource hog which would on its own call for 64 GB of RAM to begin with.

Make your life easier: No block storage and no dedup.
 

Marco3G

Dabbler
Joined
Mar 15, 2022
Messages
28
Okay, so I have two choices when adding a fourth gumstick: two mirrored vdevs with iSCSI, or NFS on a four-stick RAIDZ1.

Since I'm kinda loath to trust ESXi not to corrupt its datastores whenever iSCSI has a hiccup, I'll probably go with the latter.
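For reference, the two layouts would look roughly like this (pool and device names are just placeholders):

zpool create fastpool mirror nvd0 nvd1 mirror nvd2 nvd3   # option 1: striped mirrors for iSCSI
zpool create fastpool raidz1 nvd0 nvd1 nvd2 nvd3          # option 2: 4-wide RAIDZ1 for NFS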
 