Hi everyone
A bit of background: I noticed that some shows in my Jellyfin VM took a long time to start playing. The VM runs on a zvol served over iSCSI from three NVMe gumsticks, with the media mounted via NFS from another ZFS pool on 12 spindles.
While troubleshooting that, I noticed that the transcoding directory had very bad write performance. I ran tests with dd and hdparm; hdparm reported outrageous read speeds of multiple GB/s (impossible, as the LAN is 10G), but dd was much more interesting.
I used
dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync
and
dd if=/dev/zero of=/tmp/test2.img bs=512 count=1000 oflag=dsync
While the former gave me about 250 MB/s IIRC, the latter managed only a few hundred kB/s.
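If I'm reading that right, it's not raw throughput that is the problem but the per-write sync: a few hundred kB/s at bs=512 works out to only a few hundred synchronous writes per second (roughly 300 kB/s ÷ 512 B ≈ 600 writes/s). A comparison I still want to run (just a sketch, assuming GNU dd; the paths are only examples) is the same small writes without the per-write sync:
dd if=/dev/zero of=/tmp/test3.img bs=512 count=1000              # no sync at all
dd if=/dev/zero of=/tmp/test4.img bs=512 count=1000 conv=fsync   # a single fsync at the end instead of one per write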
So I SSHed into TrueNAS and ran
fio --direct=1 --rw=write --bs=4k --ioengine=posixaio --iodepth=64 --name=throughput-test-job --ramp_time=5 --size=1G --filename=/dev/zvol/pool0/somezvol
I tested a relatively empty, new zvol on the NVMe RAIDZ1 (lz4 compression and dedup on) and got:
Jobs: 1 (f=1): [W(1)][83.3%][w=180MiB/s][w=46.2k IOPS][eta 00m:02s]
throughput-test-job: (groupid=0, jobs=1): err= 0: pid=8683: Fri Feb 23 08:40:17 2024
write: IOPS=43.0k, BW=168MiB/s (176MB/s)(776MiB/4621msec); 0 zone resets
slat (usec): min=7, max=3267, avg=20.31, stdev=61.39
clat (usec): min=104, max=19307, avg=782.09, stdev=993.58
lat (usec): min=113, max=20324, avg=802.43, stdev=1014.72
clat percentiles (usec):
| 1.00th=[ 127], 5.00th=[ 159], 10.00th=[ 212], 20.00th=[ 314],
| 30.00th=[ 416], 40.00th=[ 519], 50.00th=[ 619], 60.00th=[ 725],
| 70.00th=[ 824], 80.00th=[ 930], 90.00th=[ 1106], 95.00th=[ 1647],
| 99.00th=[ 6063], 99.50th=[ 7504], 99.90th=[11076], 99.95th=[13042],
| 99.99th=[16319]
bw ( KiB/s): min=28781, max=236660, per=99.82%, avg=171602.11, stdev=83549.79, samples=9
iops : min= 7195, max=59165, avg=42900.22, stdev=20887.36, samples=9
lat (usec) : 250=13.68%, 500=24.56%, 750=24.53%, 1000=22.98%
lat (msec) : 2=10.39%, 4=1.91%, 10=1.82%, 20=0.16%
cpu : usr=5.74%, sys=69.91%, ctx=17869, majf=0, minf=1
IO depths : 1=1.6%, 2=3.1%, 4=6.2%, 8=12.5%, 16=25.0%, 32=50.0%, >=64=1.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=98.5%, 8=0.0%, 16=0.0%, 32=0.0%, 64=1.5%, >=64=0.0%
issued rwts: total=0,198531,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=168MiB/s (176MB/s), 168MiB/s-168MiB/s (176MB/s-176MB/s), io=776MiB (813MB), run=4621-4621msec
Not eye-wateringly, blazingly fast, but it looks workable.
Then, for shits and giggles, I made a zvol on the spindles, just to see. No compression, no dedup:
^C
Jobs: 1 (f=1): [W(1)][16.8%][w=250KiB/s][w=62 IOPS][eta 47m:58s]
fio: terminating on signal 2
throughput-test-job: (groupid=0, jobs=1): err= 0: pid=8717: Fri Feb 23 08:52:06 2024
write: IOPS=76, BW=304KiB/s (312kB/s)(171MiB/575563msec); 0 zone resets
slat (msec): min=3, max=218, avg=13.15, stdev= 9.37
clat (usec): min=13, max=1815.6k, avg=417436.62, stdev=276328.38
lat (msec): min=3, max=1838, avg=430.58, stdev=277.27
clat percentiles (usec):
| 1.00th=[ 139], 5.00th=[ 33162], 10.00th=[ 68682],
| 20.00th=[ 149947], 30.00th=[ 231736], 40.00th=[ 312476],
| 50.00th=[ 396362], 60.00th=[ 480248], 70.00th=[ 566232],
| 80.00th=[ 641729], 90.00th=[ 759170], 95.00th=[ 943719],
| 99.00th=[1149240], 99.50th=[1216349], 99.90th=[1434452],
| 99.95th=[1518339], 99.99th=[1719665]
bw ( KiB/s): min= 487, max= 513, per=100.00%, avg=500.72, stdev= 4.64, samples=684
iops : min= 121, max= 128, avg=124.88, stdev= 1.19, samples=684
lat (usec) : 20=0.01%, 250=1.43%, 500=0.14%
lat (msec) : 4=0.01%, 10=0.26%, 20=1.39%, 50=4.03%, 100=6.32%
lat (msec) : 250=18.84%, 500=30.06%, 750=27.31%, 1000=6.49%, 2000=3.81%
cpu : usr=0.05%, sys=0.87%, ctx=87673, majf=0, minf=1
IO depths : 1=1.6%, 2=3.1%, 4=6.3%, 8=12.5%, 16=25.0%, 32=50.0%, >=64=1.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=98.5%, 8=0.0%, 16=0.1%, 32=0.0%, 64=1.5%, >=64=0.0%
issued rwts: total=0,43755,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=304KiB/s (312kB/s), 304KiB/s-304KiB/s (312kB/s-312kB/s), io=171MiB (179MB), run=575563-575563msec
I aborted this one because, well, it sucked :D
So I did another test on the NVMe pool, this time against an actual ESXi iSCSI datastore that is in use. lz4 and dedup on:
^C
Jobs: 1 (f=1): [W(1)][1.4%][w=1009KiB/s][w=252 IOPS][eta 19m:46s]
fio: terminating on signal 2
throughput-test-job: (groupid=0, jobs=1): err= 0: pid=8809: Fri Feb 23 08:52:02 2024
write: IOPS=229, BW=929KiB/s (951kB/s)(11.1MiB/12290msec); 0 zone resets
slat (usec): min=2180, max=22360, avg=4360.02, stdev=2749.40
clat (usec): min=56, max=715089, avg=139909.62, stdev=114453.03
lat (msec): min=3, max=718, avg=144.34, stdev=115.49
clat percentiles (usec):
| 1.00th=[ 133], 5.00th=[ 10290], 10.00th=[ 22676], 20.00th=[ 47449],
| 30.00th=[ 71828], 40.00th=[ 96994], 50.00th=[121111], 60.00th=[145753],
| 70.00th=[173016], 80.00th=[198181], 90.00th=[270533], 95.00th=[367002],
| 99.00th=[566232], 99.50th=[641729], 99.90th=[708838], 99.95th=[708838],
| 99.99th=[717226]
bw ( KiB/s): min= 499, max= 1536, per=99.62%, avg=925.21, stdev=356.95, samples=24
iops : min= 124, max= 384, avg=230.96, stdev=89.20, samples=24
lat (usec) : 100=0.04%, 250=1.56%
lat (msec) : 4=1.28%, 10=1.70%, 20=4.29%, 50=12.31%, 100=20.76%
lat (msec) : 250=48.01%, 500=9.19%, 750=2.09%
cpu : usr=0.02%, sys=1.68%, ctx=5663, majf=0, minf=1
IO depths : 1=1.6%, 2=3.1%, 4=6.2%, 8=12.5%, 16=25.0%, 32=50.0%, >=64=1.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=98.4%, 8=0.0%, 16=0.0%, 32=0.0%, 64=1.6%, >=64=0.0%
issued rwts: total=0,2818,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=929KiB/s (951kB/s), 929KiB/s-929KiB/s (951kB/s-951kB/s), io=11.1MiB (11.7MB), run=12290-12290msec
And again, atrocious values: 230 IOPS and about 1 MB/s of throughput.
I'm fairly sure this is due to some misconfiguration on my part, but being the noob I am, I have trouble spotting it.
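In case it helps anyone follow along, this is what I plan to double-check on the zvols next (just a sketch; the pool/zvol names are placeholders for my actual datasets):
# properties that directly affect small-block write behaviour on a zvol
zfs get volblocksize,compression,dedup,sync,logbias virtualmachines/somezvol
# pool-wide view: how full and how fragmented the pools are
zpool list -o name,size,allocated,free,fragmentation,capacity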
Here are the pools. newstorage is the spindle pool and (ignoring my test zvol) contains only filesystems. virtualmachines, as the name suggests, is the NVMe pool for ESXi:
pool: newstorage
state: ONLINE
scan: scrub repaired 0B in 05:03:45 with 0 errors on Sun Jan 28 05:03:45 2024
config:
NAME STATE READ WRITE CKSUM
newstorage ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
da1 ONLINE 0 0 0
da3 ONLINE 0 0 0
da5 ONLINE 0 0 0
da7 ONLINE 0 0 0
da9 ONLINE 0 0 0
da12 ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
da2 ONLINE 0 0 0
da4 ONLINE 0 0 0
da6 ONLINE 0 0 0
da8 ONLINE 0 0 0
da10 ONLINE 0 0 0
da13 ONLINE 0 0 0
spares
da14 AVAIL
da11 AVAIL
errors: No known data errors
pool: virtualmachines
state: ONLINE
scan: scrub repaired 0B in 00:38:23 with 0 errors on Sun Feb 18 00:38:57 2024
config:
NAME STATE READ WRITE CKSUM
virtualmachines ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gptid/4665d5bf-4a44-11ed-9615-002590f39cea ONLINE 0 0 0
gptid/4668c633-4a44-11ed-9615-002590f39cea ONLINE 0 0 0
gptid/466c8862-4a44-11ed-9615-002590f39cea ONLINE 0 0 0
My understanding is that a dedicated log device (SLOG) only makes sense in front of slower disks. Under the assumption that the NVMe should be plenty fast on its own, I left it out. L2ARC is a read cache and should have no impact on my issue, right?
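One diagnostic I'm considering to narrow this down (only as a test, not something I'd leave on, since it risks losing in-flight writes on a crash): temporarily disable sync on a throwaway zvol and re-run fio, along these lines (the zvol name is just a placeholder):
zfs set sync=disabled virtualmachines/testzvol
fio --direct=1 --rw=write --bs=4k --ioengine=posixaio --iodepth=64 --name=throughput-test-job --ramp_time=5 --size=1G --filename=/dev/zvol/virtualmachines/testzvol
zfs set sync=standard virtualmachines/testzvol
If the numbers jump with sync disabled, that would point at sync-write latency rather than the drives themselves.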
The NVMe gumsticks are Crucial P2s. Certainly nothing fancy, but I would still expect more than 300 IOPS out of them.
TrueNAS is running as a VM on ESXi, with the host's HBA and the NVMe drives passed through to it via PCI passthrough. It's a poor man's hyperconverged setup. The VM has 8 cores and 12 GB of RAM. Is it possible that I am RAM-constrained because of dedup?
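To answer the RAM question myself, I would first want to see how big the dedup table actually is. If I understand it correctly, every DDT entry costs a few hundred bytes of RAM, so with enough unique blocks the table can easily outgrow 12 GB and spill to disk. Something like this should show it (again just a sketch):
zpool status -D virtualmachines   # DDT summary and histogram, including how much of it is held in core
zdb -DD virtualmachines           # more detailed dedup table statistics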
Any other ideas on what I could have a look at here?