ZVOL write performance <10MB/s when blocks were previously written

ret1 · Feb 9, 2023

Hello Everyone!

Im getting really bad write performance (less than 10MB/s) on my ZVOL, if the blocks were already previously written. If the write benchmark is done on a newly created ZVOL, performance looks ok. ZVOL read, and Dataset read and write benchmarks seem to perform as expected. Please see benchmark section for details.

i also did a "iostat -x 1" during the 2nd run of my ZVOL write benchmark. which seems odd is, there are many reads between the write operations. see video: https://storage.r3t.at/temp/iostat_during_zvol_write_benchmark.mov

Is this typical behavior or is something wrong with my setup?

thanks in advance
Thomas

Hardware:

ProLiant DL380e Gen8
2x Intel(R) Xeon(R) CPU E5-2450L 0 @ 1.80GHz
8x 8GB DDR3 ECC 1333 MT/s
8x 12TB TOSHIBA MG07ACA12TE 7200rpm (RAID-Z2) -> LSI SAS2008 HBA-Mode
2x 128GB Samsung SSD 840 PRO (SLOG; 16GB Over-provisioning) -> Smart Array P420 Controller (fw: 8.32)
Intel Corporation 82599 10Gbit/s Dualport LACP configuration

Software:

TrueNAS-13.0-U3.1

Configuration:

root@storage-01[~]# zpool list -v pool0
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
pool0 87T 58.4T 28.6T - - 2% 67% 1.00x ONLINE /mnt
raidz2-0 87T 58.4T 28.6T - - 2% 67.1% - ONLINE
gptid/e9396616-4aa5-11eb-b5fd-90e2ba88c0fc 10.9T - - - - - - - ONLINE
gptid/7e30bb66-597a-11eb-b5fd-90e2ba88c0fc 10.9T - - - - - - - ONLINE
gptid/f295c5d0-4ea3-11eb-b5fd-90e2ba88c0fc 10.9T - - - - - - - ONLINE
gptid/40d3736b-5fd3-11eb-b5fd-90e2ba88c0fc 10.9T - - - - - - - ONLINE
gptid/c7ce9b60-44aa-11eb-b5fd-90e2ba88c0fc 10.9T - - - - - - - ONLINE
gptid/e71d4f91-3e23-11eb-b5fd-90e2ba88c0fc 10.9T - - - - - - - ONLINE
gptid/fc988100-6966-11eb-b5fd-90e2ba88c0fc 10.9T - - - - - - - ONLINE
gptid/caade709-6aac-11eb-b5fd-90e2ba88c0fc 10.9T - - - - - - - ONLINE
logs - - - - - - - - -
mirror-3 14.5G 396K 14.5G - - 0% 0.00% - ONLINE
gptid/e47cbdbe-a6e9-11ed-84c6-90e2ba88c0fc 14.9G - - - - - - - ONLINE
gptid/e47fbd5e-a6e9-11ed-84c6-90e2ba88c0fc 14.9G - - - - - - - ONLINE

root@storage-01[~]# zpool get all pool0
NAME PROPERTY VALUE SOURCE
pool0 size 87T -
pool0 capacity 67% -
pool0 altroot /mnt local
pool0 health ONLINE -
pool0 guid 2080558856356982529 -
pool0 version - default
pool0 bootfs - default
pool0 delegation on default
pool0 autoreplace off default
pool0 cachefile /data/zfs/zpool.cache local
pool0 failmode continue local
pool0 listsnapshots off default
pool0 autoexpand on local
pool0 dedupratio 1.00x -
pool0 free 28.6T -
pool0 allocated 58.4T -
pool0 readonly off -
pool0 ashift 0 default
pool0 comment - default
pool0 expandsize - -
pool0 freeing 0 -
pool0 fragmentation 2% -
pool0 leaked 0 -
pool0 multihost off default
pool0 checkpoint - -
pool0 load_guid 8493025088334420322 -
pool0 autotrim off default
pool0 compatibility off default
pool0 feature@async_destroy enabled local
pool0 feature@empty_bpobj active local
pool0 feature@lz4_compress active local
pool0 feature@multi_vdev_crash_dump enabled local
pool0 feature@spacemap_histogram active local
pool0 feature@enabled_txg active local
pool0 feature@hole_birth active local
pool0 feature@extensible_dataset active local
pool0 feature@embedded_data active local
pool0 feature@bookmarks enabled local
pool0 feature@filesystem_limits enabled local
pool0 feature@large_blocks enabled local
pool0 feature@large_dnode enabled local
pool0 feature@sha512 enabled local
pool0 feature@skein enabled local
pool0 feature@userobj_accounting active local
pool0 feature@encryption enabled local
pool0 feature@project_quota active local
pool0 feature@device_removal enabled local
pool0 feature@obsolete_counts enabled local
pool0 feature@zpool_checkpoint enabled local
pool0 feature@spacemap_v2 active local
pool0 feature@allocation_classes enabled local
pool0 feature@resilver_defer enabled local
pool0 feature@bookmark_v2 enabled local
pool0 feature@redaction_bookmarks enabled local
pool0 feature@redacted_datasets enabled local
pool0 feature@bookmark_written enabled local
pool0 feature@log_spacemap active local
pool0 feature@livelist enabled local
pool0 feature@device_rebuild enabled local
pool0 feature@zstd_compress enabled local
pool0 feature@draid enabled local

root@storage-01[~]# zdb -U /data/zfs/zpool.cache
pool0:
version: 5000
name: 'pool0'
state: 0
txg: 26286855
pool_guid: 2080558856356982529
errata: 0
hostid: 1283492834
hostname: 'storage-01.local'
com.delphix:has_per_vdev_zaps
hole_array[0]: 1
hole_array[1]: 2
vdev_children: 4
vdev_tree:
type: 'root'
id: 0
guid: 2080558856356982529
create_txg: 4
children[0]:
type: 'raidz'
id: 0
guid: 9609669695960293150
nparity: 2
metaslab_array: 45
metaslab_shift: 39
ashift: 12
asize: 95983889285120
is_log: 0
create_txg: 4
com.delphix:vdev_zap_top: 36
children[0]:
type: 'disk'
id: 0
guid: 2980144553633230025
path: '/dev/gptid/e9396616-4aa5-11eb-b5fd-90e2ba88c0fc'
phys_path: 'id1,enc@n50014380260d09a0/type@0/slot@2/p2'
DTL: 2050
create_txg: 4
com.delphix:vdev_zap_leaf: 2049
children[1]:
type: 'disk'
id: 1
guid: 15671853429811190557
path: '/dev/gptid/7e30bb66-597a-11eb-b5fd-90e2ba88c0fc'
phys_path: 'id1,enc@n50014380260d09a0/type@0/slot@3/p2'
DTL: 3202
create_txg: 4
com.delphix:vdev_zap_leaf: 3201
children[2]:
type: 'disk'
id: 2
guid: 10042218666031593438
path: '/dev/gptid/f295c5d0-4ea3-11eb-b5fd-90e2ba88c0fc'
phys_path: 'id1,enc@n50014380260d09a0/type@0/slot@5/p2'
DTL: 385
create_txg: 4
com.delphix:vdev_zap_leaf: 2565
children[3]:
type: 'disk'
id: 3
guid: 2360911892941709616
path: '/dev/gptid/40d3736b-5fd3-11eb-b5fd-90e2ba88c0fc'
phys_path: 'id1,enc@n50014380260d09a0/type@0/slot@6/p2'
DTL: 3457
create_txg: 4
com.delphix:vdev_zap_leaf: 642
children[4]:
type: 'disk'
id: 4
guid: 819974807475987832
path: '/dev/gptid/c7ce9b60-44aa-11eb-b5fd-90e2ba88c0fc'
phys_path: 'id1,enc@n50014380260d09a0/type@0/slot@8/p2'
DTL: 178
create_txg: 4
com.delphix:vdev_zap_leaf: 1027
children[5]:
type: 'disk'
id: 5
guid: 8124138941333668689
path: '/dev/gptid/e71d4f91-3e23-11eb-b5fd-90e2ba88c0fc'
phys_path: 'id1,enc@n50014380260d09a0/type@0/slot@9/p2'
DTL: 768
create_txg: 4
com.delphix:vdev_zap_leaf: 640
children[6]:
type: 'disk'
id: 6
guid: 11580297040101880531
path: '/dev/gptid/fc988100-6966-11eb-b5fd-90e2ba88c0fc'
phys_path: 'id1,enc@n50014380260d09a0/type@0/slot@b/p2'
DTL: 1795
create_txg: 4
com.delphix:vdev_zap_leaf: 774
children[7]:
type: 'disk'
id: 7
guid: 14392914346318225443
path: '/dev/gptid/caade709-6aac-11eb-b5fd-90e2ba88c0fc'
phys_path: 'id1,enc@n50014380260d09a0/type@0/slot@c/p2'
DTL: 2054
create_txg: 4
com.delphix:vdev_zap_leaf: 775
children[1]:
type: 'hole'
id: 1
guid: 0
whole_disk: 0
metaslab_array: 0
metaslab_shift: 0
ashift: 0
asize: 0
is_log: 0
is_hole: 1
children[2]:
type: 'hole'
id: 2
guid: 0
whole_disk: 0
metaslab_array: 0
metaslab_shift: 0
ashift: 0
asize: 0
is_log: 0
is_hole: 1
children[3]:
type: 'mirror'
id: 3
guid: 10628891141652935674
metaslab_array: 1796
metaslab_shift: 29
ashift: 12
asize: 16009134080
is_log: 1
create_txg: 26286852
com.delphix:vdev_zap_top: 2178
children[0]:
type: 'disk'
id: 0
guid: 16451764569266323683
path: '/dev/gptid/e47cbdbe-a6e9-11ed-84c6-90e2ba88c0fc'
phys_path: 'id1,enc@n50014380260d09a0/type@0/slot@a/p1'
create_txg: 26286852
com.delphix:vdev_zap_leaf: 2308
children[1]:
type: 'disk'
id: 1
guid: 17405883628691890282
path: '/dev/gptid/e47fbd5e-a6e9-11ed-84c6-90e2ba88c0fc'
phys_path: 'id1,enc@n50014380260d09a0/type@0/slot@d/p1'
create_txg: 26286852
com.delphix:vdev_zap_leaf: 2309
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data

root@storage-01[~]# zfs get all pool0
NAME PROPERTY VALUE SOURCE
pool0 type filesystem -
pool0 creation Thu Dec 20 21:06 2018 -
pool0 used 44.5T -
pool0 available 17.0T -
pool0 referenced 205K -
pool0 compressratio 1.00x -
pool0 mounted yes -
pool0 quota none default
pool0 reservation none default
pool0 recordsize 128K default
pool0 mountpoint /mnt/pool0 default
pool0 sharenfs off default
pool0 checksum on default
pool0 compression lz4 local
pool0 atime off local
pool0 devices on default
pool0 exec on default
pool0 setuid on default
pool0 readonly off default
pool0 jailed off default
pool0 snapdir hidden default
pool0 aclmode passthrough local
pool0 aclinherit passthrough local
pool0 createtxg 1 -
pool0 canmount on default
pool0 xattr on default
pool0 copies 1 local
pool0 version 5 -
pool0 utf8only off -
pool0 normalization none -
pool0 casesensitivity sensitive -
pool0 vscan off default
pool0 nbmand off default
pool0 sharesmb off default
pool0 refquota none default
pool0 refreservation none default
pool0 guid 12075911625983732024 -
pool0 primarycache all default
pool0 secondarycache all default
pool0 usedbysnapshots 0B -
pool0 usedbydataset 205K -
pool0 usedbychildren 44.5T -
pool0 usedbyrefreservation 0B -
pool0 logbias latency default
pool0 objsetid 21 -
pool0 dedup off default
pool0 mlslabel none default
pool0 sync standard local
pool0 dnodesize legacy default
pool0 refcompressratio 1.00x -
pool0 written 205K -
pool0 logicalused 38.9T -
pool0 logicalreferenced 34.5K -
pool0 volmode default default
pool0 filesystem_limit none default
pool0 snapshot_limit none default
pool0 filesystem_count none default
pool0 snapshot_count none default
pool0 snapdev hidden default
pool0 acltype nfsv4 default
pool0 context none default
pool0 fscontext none default
pool0 defcontext none default
pool0 rootcontext none default
pool0 relatime off default
pool0 redundant_metadata all default
pool0 overlay on default
pool0 encryption off default
pool0 keylocation none default
pool0 keyformat none default
pool0 pbkdf2iters 0 default
pool0 special_small_blocks 0 local
pool0 org.freebsd.ioc:active yes local

root@storage-01[~]# zfs get all pool0/rettest2
NAME PROPERTY VALUE SOURCE
pool0/rettest2 type volume -
pool0/rettest2 creation Tue Feb 7 23:39 2023 -
pool0/rettest2 used 1.07T -
pool0/rettest2 available 17.9T -
pool0/rettest2 referenced 107G -
pool0/rettest2 compressratio 1.00x -
pool0/rettest2 reservation none default
pool0/rettest2 volsize 1.00T local
pool0/rettest2 volblocksize 64K -
pool0/rettest2 checksum on default
pool0/rettest2 compression lz4 local
pool0/rettest2 readonly off default
pool0/rettest2 createtxg 26293163 -
pool0/rettest2 copies 1 inherited from pool0
pool0/rettest2 refreservation 1.07T local
pool0/rettest2 guid 5780249142669288686 -
pool0/rettest2 primarycache all default
pool0/rettest2 secondarycache all default
pool0/rettest2 usedbysnapshots 0B -
pool0/rettest2 usedbydataset 107G -
pool0/rettest2 usedbychildren 0B -
pool0/rettest2 usedbyrefreservation 989G -
pool0/rettest2 logbias latency default
pool0/rettest2 objsetid 834 -
pool0/rettest2 dedup off local
pool0/rettest2 mlslabel none default
pool0/rettest2 sync standard local
pool0/rettest2 refcompressratio 1.00x -
pool0/rettest2 written 107G -
pool0/rettest2 logicalused 100G -
pool0/rettest2 logicalreferenced 100G -
pool0/rettest2 volmode default default
pool0/rettest2 snapshot_limit none default
pool0/rettest2 snapshot_count none default
pool0/rettest2 snapdev hidden default
pool0/rettest2 context none default
pool0/rettest2 fscontext none default
pool0/rettest2 defcontext none default
pool0/rettest2 rootcontext none default
pool0/rettest2 redundant_metadata all default
pool0/rettest2 encryption off default
pool0/rettest2 keylocation none default
pool0/rettest2 keyformat none default
pool0/rettest2 pbkdf2iters 0 default
pool0/rettest2 org.freenas:description local
pool0/rettest2 org.truenas:managedby 172.16.0.11 local
pool0/rettest2 org.freebsd.ioc:active yes inherited from pool0

Benchmark (run locally on TrueNAS server):

root@storage-01[~]# fio --direct=1 --rw=write --bs=4k --ioengine=posixaio --iodepth=64 --name=throughput-test-job --ramp_time=5 --size=100G --filename=/dev/zvol/pool0/rettest2
throughput-test-job: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=64
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [W(1)][98.9%][w=193MiB/s][w=49.5k IOPS][eta 00m:06s]
throughput-test-job: (groupid=0, jobs=1): err= 0: pid=31931: Wed Feb 8 14:30:55 2023
write: IOPS=49.0k, BW=191MiB/s (201MB/s)(99.0GiB/530294msec); 0 zone resets
slat (usec): min=11, max=1211.5k, avg=16.73, stdev=270.93
clat (usec): min=131, max=224809, avg=700.08, stdev=532.41
lat (usec): min=144, max=1213.7k, avg=716.81, stdev=603.23
clat percentiles (usec):
| 1.00th=[ 157], 5.00th=[ 198], 10.00th=[ 247], 20.00th=[ 347],
| 30.00th=[ 474], 40.00th=[ 570], 50.00th=[ 693], 60.00th=[ 799],
| 70.00th=[ 898], 80.00th=[ 1020], 90.00th=[ 1123], 95.00th=[ 1188],
| 99.00th=[ 1352], 99.50th=[ 1434], 99.90th=[ 2638], 99.95th=[ 8717],
| 99.99th=[20841]
bw ( KiB/s): min=26517, max=218262, per=100.00%, avg=196190.04, stdev=23447.66, samples=1056
iops : min= 6629, max=54565, avg=49047.22, stdev=5861.94, samples=1056
lat (usec) : 250=10.29%, 500=22.35%, 750=22.52%, 1000=22.56%
lat (msec) : 2=22.17%, 4=0.03%, 10=0.04%, 20=0.03%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=12.73%, sys=85.11%, ctx=79514, majf=0, minf=1
IO depths : 1=1.6%, 2=3.1%, 4=6.2%, 8=12.5%, 16=25.0%, 32=50.0%, >=64=1.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=98.5%, 8=0.0%, 16=0.0%, 32=0.0%, 64=1.5%, >=64=0.0%
issued rwts: total=0,25960491,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
WRITE: bw=191MiB/s (201MB/s), 191MiB/s-191MiB/s (201MB/s-201MB/s), io=99.0GiB (106GB), run=530294-530294msec

root@storage-01[~]# fio --direct=1 --rw=write --bs=4k --ioengine=posixaio --iodepth=64 --name=throughput-test-job --ramp_time=5 --size=4G --filename=/dev/zvol/pool0/rettest2
throughput-test-job: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=64
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [W(1)][88.0%][w=12.5MiB/s][w=3187 IOPS][eta 01m:15s]
throughput-test-job: (groupid=0, jobs=1): err= 0: pid=55087: Thu Feb 9 10:06:17 2023
write: IOPS=1701, BW=6806KiB/s (6969kB/s)(3600MiB/541645msec); 0 zone resets
slat (usec): min=12, max=626203, avg=584.53, stdev=6991.28
clat (usec): min=138, max=907952, avg=14357.87, stdev=40460.10
lat (usec): min=152, max=908401, avg=14942.37, stdev=41503.53
clat percentiles (usec):
| 1.00th=[ 163], 5.00th=[ 208], 10.00th=[ 262], 20.00th=[ 379],
| 30.00th=[ 898], 40.00th=[ 1057], 50.00th=[ 1598], 60.00th=[ 2147],
| 70.00th=[ 2573], 80.00th=[ 9503], 90.00th=[ 26084], 95.00th=[114820],
| 99.00th=[204473], 99.50th=[233833], 99.90th=[358613], 99.95th=[455082],
| 99.99th=[633340]
bw ( KiB/s): min= 498, max=34372, per=100.00%, avg=6867.22, stdev=4092.88, samples=1063
iops : min= 124, max= 8593, avg=1716.49, stdev=1023.22, samples=1063
lat (usec) : 250=8.93%, 500=14.13%, 750=0.94%, 1000=12.86%
lat (msec) : 2=20.99%, 4=14.18%, 10=8.44%, 20=5.79%, 50=6.22%
lat (msec) : 100=1.32%, 250=5.89%, 500=0.29%, 750=0.03%, 1000=0.01%
cpu : usr=0.47%, sys=5.18%, ctx=57675, majf=0, minf=1
IO depths : 1=1.6%, 2=3.1%, 4=6.2%, 8=12.5%, 16=25.0%, 32=50.0%, >=64=1.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=98.5%, 8=0.0%, 16=0.0%, 32=0.0%, 64=1.5%, >=64=0.0%
issued rwts: total=0,921551,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
WRITE: bw=6806KiB/s (6969kB/s), 6806KiB/s-6806KiB/s (6969kB/s-6969kB/s), io=3600MiB (3775MB), run=541645-541645msec

root@storage-01[~]# fio --direct=1 --rw=write --bs=4k --ioengine=posixaio --iodepth=64 --name=throughput-test-job --ramp_time=5 --size=100G --filename=/mnt/pool0/rettestdataset/test5
throughput-test-job: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=64
fio-3.28
Starting 1 process
throughput-test-job: Laying out IO file (1 file / 102400MiB)
Jobs: 1 (f=1): [W(1)][98.7%][w=273MiB/s][w=69.8k IOPS][eta 00m:05s]
throughput-test-job: (groupid=0, jobs=1): err= 0: pid=55325: Thu Feb 9 10:17:15 2023
write: IOPS=70.3k, BW=275MiB/s (288MB/s)(98.6GiB/367859msec); 0 zone resets
slat (nsec): min=1134, max=1400.5k, avg=7945.12, stdev=17357.09
clat (usec): min=58, max=20451, avg=716.09, stdev=646.06
lat (usec): min=63, max=20456, avg=724.03, stdev=645.66
clat percentiles (usec):
| 1.00th=[ 178], 5.00th=[ 227], 10.00th=[ 269], 20.00th=[ 338],
| 30.00th=[ 412], 40.00th=[ 498], 50.00th=[ 586], 60.00th=[ 668],
| 70.00th=[ 750], 80.00th=[ 865], 90.00th=[ 1074], 95.00th=[ 1467],
| 99.00th=[ 3818], 99.50th=[ 4080], 99.90th=[ 4686], 99.95th=[ 4752],
| 99.99th=[ 5342]
bw ( KiB/s): min=54506, max=354778, per=100.00%, avg=281401.82, stdev=92028.96, samples=732
iops : min=13626, max=88694, avg=70350.17, stdev=23007.25, samples=732
lat (usec) : 100=0.01%, 250=7.63%, 500=32.66%, 750=29.50%, 1000=17.57%
lat (msec) : 2=8.59%, 4=3.47%, 10=0.59%, 20=0.01%, 50=0.01%
cpu : usr=18.01%, sys=63.94%, ctx=2689836, majf=0, minf=1
IO depths : 1=0.1%, 2=0.2%, 4=1.4%, 8=5.4%, 16=19.9%, 32=68.4%, >=64=4.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=97.7%, 8=0.1%, 16=0.1%, 32=0.3%, 64=1.7%, >=64=0.0%
issued rwts: total=0,25850409,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=98.6GiB (106GB), run=367859-367859msec

root@storage-01[~]# fio --direct=1 --rw=read --readonly --bs=4k --ioengine=posixaio --iodepth=64 --name=throughput-test-job --ramp_time=5 --size=100G --filename=/dev/zvol/pool0/rettest2
throughput-test-job: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=64
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [R(1)][99.3%][r=195MiB/s][r=50.0k IOPS][eta 00m:04s]
throughput-test-job: (groupid=0, jobs=1): err= 0: pid=56172: Thu Feb 9 10:56:43 2023
read: IOPS=48.0k, BW=188MiB/s (197MB/s)(99.3GiB/541941msec)
slat (usec): min=8, max=414120, avg=17.33, stdev=193.59
clat (usec): min=121, max=416582, avg=684.31, stdev=1022.62
lat (usec): min=131, max=416632, avg=701.64, stdev=1043.55
clat percentiles (usec):
| 1.00th=[ 153], 5.00th=[ 186], 10.00th=[ 225], 20.00th=[ 306],
| 30.00th=[ 457], 40.00th=[ 545], 50.00th=[ 668], 60.00th=[ 775],
| 70.00th=[ 881], 80.00th=[ 1004], 90.00th=[ 1106], 95.00th=[ 1188],
| 99.00th=[ 1336], 99.50th=[ 1385], 99.90th=[ 2540], 99.95th=[11076],
| 99.99th=[32637]
bw ( KiB/s): min= 1488, max=224830, per=100.00%, avg=192249.00, stdev=27290.40, samples=1080
iops : min= 372, max=56207, avg=48062.05, stdev=6822.63, samples=1080
lat (usec) : 250=13.24%, 500=21.70%, 750=22.10%, 1000=22.47%
lat (msec) : 2=20.39%, 4=0.01%, 10=0.04%, 20=0.03%, 50=0.02%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%
cpu : usr=11.76%, sys=84.78%, ctx=16896, majf=2, minf=1
IO depths : 1=1.6%, 2=3.1%, 4=6.2%, 8=12.5%, 16=25.0%, 32=50.0%, >=64=1.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=98.5%, 8=0.0%, 16=0.0%, 32=0.0%, 64=1.5%, >=64=0.0%
issued rwts: total=26022201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=188MiB/s (197MB/s), 188MiB/s-188MiB/s (197MB/s-197MB/s), io=99.3GiB (107GB), run=541941-541941msec

root@storage-01[~]# fio --direct=1 --rw=read --readonly --bs=4k --ioengine=posixaio --iodepth=64 --name=throughput-test-job --ramp_time=5 --size=100G --filename=/mnt/pool0/rettestdataset/test5
throughput-test-job: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=64
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [R(1)][98.3%][r=467MiB/s][r=120k IOPS][eta 00m:04s]
throughput-test-job: (groupid=0, jobs=1): err= 0: pid=56375: Thu Feb 9 11:02:51 2023
read: IOPS=116k, BW=453MiB/s (475MB/s)(98.0GiB/221668msec)
slat (nsec): min=767, max=876763, avg=5261.16, stdev=5128.82
clat (usec): min=50, max=235582, avg=375.27, stdev=846.86
lat (usec): min=55, max=235585, avg=380.54, stdev=846.82
clat percentiles (usec):
| 1.00th=[ 172], 5.00th=[ 194], 10.00th=[ 215], 20.00th=[ 253],
| 30.00th=[ 285], 40.00th=[ 318], 50.00th=[ 355], 60.00th=[ 392],
| 70.00th=[ 429], 80.00th=[ 461], 90.00th=[ 510], 95.00th=[ 603],
| 99.00th=[ 717], 99.50th=[ 758], 99.90th=[ 922], 99.95th=[ 1172],
| 99.99th=[27919]
bw ( KiB/s): min=193952, max=487113, per=100.00%, avg=464024.33, stdev=35115.81, samples=441
iops : min=48488, max=121778, avg=116005.74, stdev=8778.83, samples=441
lat (usec) : 100=0.01%, 250=19.13%, 500=69.37%, 750=10.93%, 1000=0.50%
lat (msec) : 2=0.03%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=14.48%, sys=83.44%, ctx=87581, majf=0, minf=1
IO depths : 1=0.1%, 2=1.6%, 4=5.2%, 8=12.4%, 16=26.3%, 32=52.8%, >=64=1.7%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=98.4%, 8=0.1%, 16=0.1%, 32=0.1%, 64=1.6%, >=64=0.0%
issued rwts: total=25692073,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=453MiB/s (475MB/s), 453MiB/s-453MiB/s (475MB/s-475MB/s), io=98.0GiB (105GB), run=221668-221668msec

Samuel Tai · Feb 9, 2023

Zvols are recommended on mirror pools, not RAIDZx pools, because mirrors allow much better throughput. Your RAIDZ2 VDEV is also at the top end of acceptable device width, with 8 drives. You also appear to have dedup enabled, which won't perform well unless you have > 256 G of RAM.

With 8 disks, you'll see much better throughput with a 4-way stripe of 2-way mirrors, and dedup disabled. Remember, ZFS is a copy-on-write file system, and there is no such thing as writing to a previous zvol block. You're reading the old block and writing a new block, and adjusting file system pointers along the way. Your pool also shows 67% fragmentation, which is where you're also losing performance.

jgreco · Feb 9, 2023

ret1 said:
Im getting really bad write performance (less than 10MB/s) on my ZVOL, if the blocks were already previously written.

Yup. You're at 67% utilization, so 10MB/s counts as AWESOME performance, especially on RAIDZ.

Resource - The path to success for block storage

ZFS does two different things very well. One is storage of large sequentially-written files, such as archives, logs, or data files, where the file does not have the middle bits modified after creation. This is optimal for RAIDZ. It's what most...

www.truenas.com

ret1 said:
8x 8GB DDR3 ECC 1333 MT/s

8x 12TB TOSHIBA MG07ACA12TE 7200rpm (RAID-Z2) -> LSI SAS2008 HBA-Mode

You are substantially under the recommended RAM; for basic performance filesharing (not iSCSI) on a 96TB raw pool, you should have something more like 96GB or 128GB of RAM. For iSCSI, 256GB of RAM.

The use of RAIDZ2 will kill your performance; RAIDZ is optimized towards large sequential file accesses by a single consumer, and typically IOPS are very poor.

Having 67% utilization puts you far over the 40%-ish maximum I would recommend as well, you should be able to find a copy of the Delphix steady state performance chart in the link above, IIRC.

ret1 said:
Smart Array P420 Controller (fw: 8.32)

This is not suitable for use with TrueNAS.

What's all the noise about HBAs, and why can't I use a RAID controller?

1) An HBA is a Host Bus Adapter. This is a controller that allows SAS and SATA devices to be attached to, and communicate directly with, a server. RAID controllers typically aggregate several disks into a Virtual Disk abstraction of some sort...

www.truenas.com

ret1 said:
Intel Corporation 82599 10Gbit/s Dualport LACP configuration

If you are doing something like iSCSI, LACP is not recommended.

ret1 · Feb 9, 2023

Samuel Tai said:
Zvols are recommended on mirror pools, not RAIDZx pools, because mirrors allow much better throughput. Your RAIDZ2 VDEV is also at the top end of acceptable device width, with 8 drives. You also appear to have dedup enabled, which won't perform well unless you have > 256 G of RAM.

With 8 disks, you'll see much better throughput with a 4-way stripe of 2-way mirrors, and dedup disabled. Remember, ZFS is a copy-on-write file system, and there is no such thing as writing to a previous zvol block. You're reading the old block and writing a new block, and adjusting file system pointers along the way. Your pool also shows 67% fragmentation, which is where you're also losing performance.

hello Samuel!

thank you very much for your fast response. because, i dont need a high performance storage, i decided to go with raid-z2. everything over 50mb/s ist good enough.
in the TrueNAS webinterface, the "ZFS Deduplication" pool-settings on pool0 ist set to "off". in the output from "zfs get all pool0", i can see "pool0 dedup off default" and also for "zfs get all pool0/rettest2" it says "pool0/rettest2 dedup off local".
im sorry, but the formatting is a little bit unfavourable. as you can see, fragmentation is at 2%:

Code:

root@storage-01[~]# zpool list
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
freenas-boot   232G  8.61G   223G        -         -     1%     3%  1.00x    ONLINE  -
pool0           87T  58.4T  28.6T        -         -     2%    67%  1.00x    ONLINE  /mnt

the fact that ZFS is a COW filesystem, makes this behavior even stranger. if i create a new ZVOL and write 4GB at the beginning of that block device, everything is fast (>200MB/s). when i do a second run with 8GB, the first 4GB get written slow (<10MB/s), after that it speeds up to 200MB/s again. this is reproduce. is there any explanation for this?

regards
Thomas

ret1 · Feb 9, 2023

Hello jgreco!

jgreco said:
Yup. You're at 67% utilization, so 10MB/s counts as AWESOME performance, especially on RAIDZ.

Resource - The path to success for block storage

ZFS does two different things very well. One is storage of large sequentially-written files, such as archives, logs, or data files, where the file does not have the middle bits modified after creation. This is optimal for RAIDZ. It's what most...

www.truenas.com

10MB/s counts as AWESOME doesn't sounds good to me :(. if i understand the graph under 6 correctly, this performance penalty is not unique to ZVOL, but also to ZFS. if that's the problem on my system, shouldn't there also a performance problem if i write to the Dataset?

jgreco said:
You are substantially under the recommended RAM; for basic performance filesharing (not iSCSI) on a 96TB raw pool, you should have something more like 96GB or 128GB of RAM. For iSCSI, 256GB of RAM.

thanks you mention that bottleneck. i will do a ram upgrade when i start to use the filesharing feature. but for now, i did all the benchmarks locally on the TrueNAS console. can that be the problem when locally executed Dataset benchmarks are fine?

jgreco said:
The use of RAIDZ2 will kill your performance; RAIDZ is optimized towards large sequential file accesses by a single consumer, and typically IOPS are very poor.

this is a single consumer setup without concurrent access. like i said. Dataset works really good.

jgreco said:
This is not suitable for use with TrueNAS.

What's all the noise about HBAs, and why can't I use a RAID controller?

1) An HBA is a Host Bus Adapter. This is a controller that allows SAS and SATA devices to be attached to, and communicate directly with, a server. RAID controllers typically aggregate several disks into a Virtual Disk abstraction of some sort...

www.truenas.com

The P420i controller is set to HBA mode and is only used for the SLOG disks. ive already tried without SLOG, but no difference.

jgreco said:
If you are doing something like iSCSI, LACP is not recommended.

never read that before. thanks for the info. i will keep that in mind if there are problems when i export the ZVOL over iSCSI.

jgreco · Feb 9, 2023

ret1 said:
10MB/s counts as AWESOME doesn't sounds good to me :(. if i understand the graph under 6 correctly, this performance penalty is not unique to ZVOL, but also to ZFS. if that's the problem on my system, shouldn't there also a performance problem if i write to the Dataset?

For block storage, ZFS may be storing your data in a ZFS block (for iSCSI, this'd be volblocksize) which most people do not set to the actual device block size (512/4096) because this is rather onerous. What this means is that for a rewrite of a 512-byte chunk of a 16384-byte ZFS block, ZFS has to read that block, overwrite in memory the relevant contents, search metaslabs for free space for the new block (often involving more reads), write the new block, then also write out metadata updates for the ZVOL (which might also involve more reads to fetch the old metadata), which includes pointers to the new block and also freeing the old block and old metadata. And then you're on to doing this for the next block, etc., as well.

Writing fresh data is generally easier because you're just laying down the new data and new metadata, and not having to do anywhere as much metadata maintenance. These can often just be laid down sequentially relatively quickly.

Dataset access works differently, because ZFS has clues as to what is going on. See

Resource - Why iSCSI often requires more resources for the same result

iSCSI is a SAN protocol. NFS, CIFS, etc., are NAS protocols. For a NAS protocol, the client sends a command to the filer, such as "open this file", or "read ten blocks", or "remove this file." On the filer, the local NAS protocol daemon...

www.truenas.com

ret1 said:
thanks you mention that bottleneck. i will do a ram upgrade when i start to use the filesharing feature. but for now, i did all the benchmarks locally on the TrueNAS console. can that be the problem when locally executed Dataset benchmarks are fine?

You're not required to believe me. You haven't done meaningful "locally executed Dataset benchmarks" however. Try installing a FreeBSD or Linux VM backed by a ZFS ZVOL and run random I/O tests within.

ret1 said:
this is a single consumer setup without concurrent access.

You don't just get to drop inconvenient words out of my definitions. I said large sequential file accesses as well. A single consumer block storage setup (which almost by definition does not have large sequential file accesses unless it is a WORM-style usage) is still not good with RAIDZ.

ret1 said:
Dataset works really good.

Trucks work real well too, unless you decide to drive them on the railroad tracks. Apples, oranges. Different things. Very different.

ret1 said:
The P420i controller is set to HBA mode and is only used for the SLOG disks.

Clearly didn't read the linked resource.

ret1 · Feb 10, 2023

Hello jgreco!

jgreco said:
For block storage, ZFS may be storing your data in a ZFS block (for iSCSI, this'd be volblocksize) which most people do not set to the actual device block size (512/4096) because this is rather onerous. What this means is that for a rewrite of a 512-byte chunk of a 16384-byte ZFS block, ZFS has to read that block, overwrite in memory the relevant contents, search metaslabs for free space for the new block (often involving more reads), write the new block, then also write out metadata updates for the ZVOL (which might also involve more reads to fetch the old metadata), which includes pointers to the new block and also freeing the old block and old metadata. And then you're on to doing this for the next block, etc., as well.

Writing fresh data is generally easier because you're just laying down the new data and new metadata, and not having to do anywhere as much metadata maintenance. These can often just be laid down sequentially relatively quickly.

Dataset access works differently, because ZFS has clues as to what is going on. See

Resource - Why iSCSI often requires more resources for the same result

iSCSI is a SAN protocol. NFS, CIFS, etc., are NAS protocols. For a NAS protocol, the client sends a command to the filer, such as "open this file", or "read ten blocks", or "remove this file." On the filer, the local NAS protocol daemon...

www.truenas.com

thanks for the really good information. this explains exactly what i can see with iostat.

jgreco said:
You're not required to believe me. You haven't done meaningful "locally executed Dataset benchmarks" however. Try installing a FreeBSD or Linux VM backed by a ZFS ZVOL and run random I/O tests within.

its not about don't believing you. you wrote:

for basic performance filesharing (not iSCSI) on a 96TB raw pool

but the provided benchmarks had nothing todo with filesharing. i only did write tests on a local dataset and on a local ZVOL. based on your information, i concluded ram is not the reason for my problems at the moment.

jgreco said:
You don't just get to drop inconvenient words out of my definitions. I said large sequential file accesses as well. A single consumer block storage setup (which almost by definition does not have large sequential file accesses unless it is a WORM-style usage) is still not good with RAIDZ.

you are right, i forgot to mention that part, but not out of inconvenience. in most case i do large sequential file accesses, both read and write.

jgreco said:
Trucks work real well too, unless you decide to drive them on the railroad tracks. Apples, oranges. Different things. Very different.

i know, Datasets and ZVOL are different things, but both are "layers" on top of ZFS pools. i said "Dataset performance is fine" multiple times because most of your arguments aimed at ZFS in general, but i clearly only have ZVOL related problems.

jgreco said:
Clearly didn't read the linked resource.

i have read the information in the linked resource, but also wrote that my problems don't disappear when i remove the SLOG disks which are connected to unsupported P420 controller. all other disks are connected to the LSI SAS 9211-8i (SAS2008 in IT mode)

root@storage-01[~]# sas2flash -c 0 -list
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

Adapter Selected is a LSI SAS: SAS2008(B2)

Controller Number : 0
Controller : SAS2008(B2)
PCI Address : 00:07:00:00
SAS Address : 5d4ae52-0-accf-0400
NVDATA Version (Default) : 14.01.00.08
NVDATA Version (Persistent) : 14.01.00.08
Firmware Product ID : 0x2213 (IT)
Firmware Version : 20.00.07.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9211-8i
BIOS Version : N/A
UEFI BSD Version : N/A
FCODE Version : N/A
Board Name : SAS9211-8i
Board Assembly : N/A
Board Tracer Number : N/A

Finished Processing Commands Successfully.
Exiting SAS2Flash.

thank you for your help and regards
Thomas

ret1 · Feb 10, 2023

Hello again jgreco!

one last thing. you told me_

You haven't done meaningful "locally executed Dataset benchmarks" however

can you tell me how good sequential r/w tests would look like with fio? and maybe explain why my tests aren't good, if you want to?

regards
Thomas

jgreco · Feb 10, 2023

ret1 said:
Hello again jgreco!

one last thing. you told me_

can you tell me how good sequential r/w tests would look like with fio? and maybe explain why my tests aren't good, if you want to?

regards
Thomas

I don't use fio because I believe it is generally crap, and hard for people to understand what they are actually testing.

Good sequential r/w tests need be no more complicated than disabling compression and using dd on suitably sized files. However, this is basically just changing from a crappy testing tool (fio) to a crappy test (dd-based sequential I/O). It does little to address the question that caused this thread; to understand THAT, you need to fix your understanding of "sequential". There are at least two different types of sequential I/O in ZFS. One is conventional sequential I/O that results in adjacent LBA's on a HDD being written to in order. This may be what you're expecting when I say "sequential" I/O, but ZFS is a CoW filesystem and allocates space from free ranges where it can find large blocks of free space, and as a result, it only makes a tepid attempt to allocate blocks conveniently to such access. The other type of sequential access is where you have a file where you access a long run of adjacent bytes from the file. In such a case, these are typically only written to adjacent LBA's when the initial write happens and there's lots of free space on the pool. However, when you "rewrite" or "overwrite" this file, you are no longer accessing that same set of LBA's, and ZFS is writing its blocks elsewhere on the pool. This is actually very stressy because it has no good mechanism to optimize how this happens, and because you end up mixing data and metadata blocks, you end up with a "slurry" of block types being written into whatever free space is being found. This causes lots of seeking.

ret1 said:
but the provided benchmarks had nothing todo with filesharing.

Neither ZFS nor I care whether your client accesses originate with filesharing -- if that sounds harsh, it is, but it is also really important to drive this point through, so I want that to really hit you. I said filesharing because TrueNAS just happens to be a NAS product. Make your tests resemble your intended workload if you want them to provide relevant clues. My suggestion was to install a local VM with FreeBSD or Linux, IIRC, which is much more likely to give you the sort of environment that will be present in a ZVOL. Doing sequential I/O to a dataset is not going to be representative of the pain experienced by your typical ZVOL and you do yourself no favors by conflating them.

ret1 said:
i concluded ram is not the reason for my problems at the moment.

It's possible. However, your conclusion is probably wrong. One of the things that trips people up with ZFS is that they forget that one of the major tasks of a filesystem is to store metadata. For ZFS, you have 96TB of raw space on your pool. What do you have on a typical NTFS, FFS, or EXT3? One terabyte? Maybe ten for a really large one? Your ZFS pool is broken up into (probably) a few hundred metaslabs, each of which act as a somewhat independent pool for free space purposes. ZFS has to read and maintain the free space information for these metaslabs (called "space maps") in ARC, and a common newbie error is to shortchange ZFS by not giving it the RAM to do so. If it is able to, ZFS will keep the space maps in the ARC and this makes it easier to find space that is nearby or adjacent to an existing block. If you do not give it the ARC space, then you get less efficient behaviours when allocating space for new blocks. ARC is not just space for caching data from your files.

ret1 said:
you are right, i forgot to mention that part, but not out of inconvenience. in most case i do large sequential file accesses, both read and write.

You are unlikely to do "large sequential file accesses" in a ZVOL. Particularly writes, which in turn informs how reads will behave. The first time you write a megabyte to a 1.5MB ZVOL with a 64KB block size, you write 16 blocks to the pool, and since there was nothing out there, this may indeed be written sequentially. However, the next time you do this, unless you have lots of free space on the pool, you are much less likely to get that sequential behaviour as ZFS has to work harder to find free space. You overwrite block 1, which allocates new space after block 16 and then frees block 1's space, and then -- assuming some optimization that doesn't necessarily exist -- the same for blocks 17-23, but then you're out of fresh contiguous space, so now you have to go back (incurring a seek) to a previously allocated set of LBA's. And here, fragmentation hell begins.

ret1 said:
i know, Datasets and ZVOL are different things, but both are "layers" on top of ZFS pools. i said "Dataset performance is fine" multiple times because most of your arguments aimed at ZFS in general, but i clearly only have ZVOL related problems.

You can say that dataset performance is fine as many times as you please. It doesn't change what's going on. This isn't about ZFS in general. It's about the black magic of CoW storage and how to cope with two incredibly different methods for interacting with the pool. Until you grasp the vast difference between the things, you'll continue to experience problems. Once you do grasp the vast difference, you'll be very disappointed, but will suddenly have an aha moment about ZFS and throwing vast resources at the problem as a viable fix.

Important Announcement for the TrueNAS Community.

ZVOL write performance <10MB/s when blocks were previously written

ret1

Cadet

Samuel Tai

Never underestimate your own stupidity

jgreco

Resident Grinch

Resource - The path to success for block storage

What's all the noise about HBAs, and why can't I use a RAID controller?

ret1

Cadet

ret1

Cadet

Resource - The path to success for block storage

What's all the noise about HBAs, and why can't I use a RAID controller?

jgreco

Resident Grinch

Resource - Why iSCSI often requires more resources for the same result

ret1

Cadet

Resource - Why iSCSI often requires more resources for the same result

ret1

Cadet

jgreco

Resident Grinch

Similar threads

Important Announcement for the TrueNAS Community.

ZVOL write performance <10MB/s when blocks were previously written

Cadet

Never underestimate your own stupidity

Resident Grinch

Cadet

Cadet

Resident Grinch

Cadet

Cadet

Resident Grinch

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "ZVOL write performance <10MB/s when blocks were previously written"

Similar threads