Lower IOPS than expected / slow small-file transfers over SMB

truenasuserh (Cadet)
I've got the following setup:
[attachment: screenshot of the pool layout]


Disks here are Seagate IronWolf 12 TB hard disks (ST12000VN0008)

I've got a test folder of representative production data, which I copy over SMB. The folder is 2.2 GB with mixed file sizes, some 12 KB, some 2 MB, split roughly 50/50 by total size (not by file count). During the copy, the smallest files seem to slow things down a lot. My boss is convinced that small or big files shouldn't matter, as small files would get aggregated using transactions, and therefore the hard drives would only see large sequential writes. This copy takes roughly 59.5 seconds.

Copying this folder twice in parallel, I'd expect roughly two minutes in total. However, it actually takes about 8 minutes 20 seconds, with transfer speeds dropping to a couple hundred KB/s when the transfers hit the small files.
[attachments: screenshots of the two SMB transfer dialogs]

I'd assume this is an IOPS issue. But what's killing my IOPS? The fio tests below show around 2K IOPS at minimum, so a folder with roughly 2K files shouldn't be throttled on IOPS, right?


Code:
fio --randrepeat=1 --direct=1 --gtod_reduce=1 --numjobs=1 --bs=4k --iodepth=64 --size=1G --readwrite=randwrite --ramp_time=4 --group_reporting --name=test --filename=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [w(1)][90.6%][w=344MiB/s][w=88.1k IOPS][eta 00m:03s]
test: (groupid=0, jobs=1): err= 0: pid=4860: Wed Nov 29 15:04:23 2023
  write: IOPS=10.4k, BW=40.8MiB/s (42.8MB/s)(1023MiB/25076msec); 0 zone resets
   bw (  KiB/s): min=  341, max=579808, per=95.83%, avg=40027.96, stdev=106981.49, samples=49
   iops        : min=   85, max=144952, avg=10006.61, stdev=26745.43, samples=49
  cpu          : usr=0.87%, sys=14.31%, ctx=12697, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,261841,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=40.8MiB/s (42.8MB/s), 40.8MiB/s-40.8MiB/s (42.8MB/s-42.8MB/s), io=1023MiB (1073MB), run=25076-25076msec


Code:
fio --randrepeat=1 --direct=1 --gtod_reduce=1 --numjobs=10 --bs=4k --iodepth=64 --size=1G --readwrite=randwrite --ramp_time=4 --group_reporting --name=test --filename=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64
...
fio-3.28
Starting 10 processes
Jobs: 10 (f=10): [w(10)][46.7%][w=1151MiB/s][w=295k IOPS][eta 00m:08s]
test: (groupid=0, jobs=10): err= 0: pid=4889: Wed Nov 29 15:06:06 2023
  write: IOPS=402k, BW=1572MiB/s (1648MB/s)(3773MiB/2400msec); 0 zone resets
   bw (  MiB/s): min= 1092, max= 2089, per=89.32%, avg=1404.13, stdev=41.10, samples=40
   iops        : min=279768, max=535000, avg=359453.75, stdev=10520.42, samples=40
  cpu          : usr=4.14%, sys=61.84%, ctx=148692, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,965810,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=1572MiB/s (1648MB/s), 1572MiB/s-1572MiB/s (1648MB/s-1648MB/s), io=3773MiB (3956MB), run=2400-2400msec

Code:
fio --randrepeat=1 --direct=1 --gtod_reduce=1 --numjobs=10 --bs=4k --iodepth=64 --size=4G --readwrite=randwrite --ramp_time=4 --group_reporting --name=test --filename=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64
...
fio-3.28
Starting 10 processes
Jobs: 5 (f=5): [_(1),w(1),_(1),w(3),_(2),w(1),_(1)][99.8%][w=298MiB/s][w=76.2k IOPS][eta 00m:01s]
test: (groupid=0, jobs=10): err= 0: pid=4910: Wed Nov 29 15:17:34 2023
  write: IOPS=16.9k, BW=65.8MiB/s (69.0MB/s)(40.0GiB/621485msec); 0 zone resets
   bw (  KiB/s): min= 7887, max=427386, per=100.00%, avg=67421.05, stdev=3381.92, samples=12352
   iops        : min= 1968, max=106843, avg=16852.13, stdev=845.49, samples=12352
  cpu          : usr=1.09%, sys=16.80%, ctx=8370246, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10474974,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=65.8MiB/s (69.0MB/s), 65.8MiB/s-65.8MiB/s (69.0MB/s-69.0MB/s), io=40.0GiB (42.9GB), run=621485-621485msec
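Note that the default psync ioengine is synchronous, so --iodepth=64 is effectively ignored here and each job only ever has one I/O outstanding. A single large test file is also not quite the same workload as thousands of small files being created. A closer approximation of the SMB copy might be something along these lines (just a sketch; the directory path, file counts and sizes are placeholders):

Code:
fio --name=smallfiles --directory=/mnt/pool1/smbtest --rw=write \
    --ioengine=psync --bs=12k --nrfiles=1000 --filesize=12k-2m \
    --size=2200m --numjobs=2 --group_reporting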

Code:
zpool status pool1
  pool: pool1
  state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 19.9G in 00:12:05 with 0 errors on Wed Nov 29 13:31:05 2023
config:

        NAME                                              STATE     READ WRITE CKSUM
        pool1                                             ONLINE       0     0     0
          raidz3-0                                        ONLINE       0     0     0
            gptid/96687099-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/f7429afe-dc42-11ed-9b7a-e0d55e60faba    ONLINE       0     0     0
            spare-2                                       ONLINE       0     0     0
              gptid/97f3a674-d203-11eb-9e64-e0d55e60faba  ONLINE       0     0     0
              gptid/a6c07145-d203-11eb-9e64-e0d55e60faba  ONLINE       0     0     0
            gptid/99e77ff3-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/18d7ebc7-dc43-11ed-9b7a-e0d55e60faba    ONLINE       0     0     0
            gptid/9e3ed33b-d203-11eb-9e64-e0d55e60faba    ONLINE      47     0     0
            gptid/396b46cc-dc43-11ed-9b7a-e0d55e60faba    ONLINE       0     0     0
            gptid/9bf4f4c2-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/9eb74444-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/97416f1e-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a10904dd-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a2003ccd-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a468b099-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a3ec783a-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a5a34558-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
        special
          mirror-2                                        ONLINE       0     0     0
            gptid/a51b321a-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a59ed3da-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a59dcf5d-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
        logs
          gptid/a06d9015-d203-11eb-9e64-e0d55e60faba      ONLINE       0     0     0
        spares
          gptid/a6c07145-d203-11eb-9e64-e0d55e60faba      INUSE     currently in use

errors: No known data errors


System information:
OS Version: TrueNAS-13.0-U5.3
Model: D120-C21
Memory: 32 GiB
CPU: Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz
(CPU max according to the reporting is 70%.)
[attachment: screenshot of the CPU usage graph]


 

truenasuserh (Cadet)
From what I understand, since "metadata (special) small block size" is set to 0, the small files should go to the raidz3 vdev. As there are a lot of them, they would significantly reduce the available capacity if they went to the special devices. The 3 special disks are INTEL SSDSC2KB960G8, configured as a triple mirror, just under 1 TB each.

The log disk is a SAMSUNG MZVLB1T0HALR-00000
 

asap2go (Patron)
From what I understand, since "metadata (special) small block size" is set to 0, the small files should go to the raidz3 vdev. As there are a lot of them, they would significantly reduce the available capacity if they went to the special devices. The 3 special disks are INTEL SSDSC2KB960G8, configured as a triple mirror, just under 1 TB each.

The log disk is a SAMSUNG MZVLB1T0HALR-00000
Ah, I didn't see that it was set to 0.
Then it makes sense that the files are on the HDDs.
 

Davvo (MVP)

I'm not clear how you are using that special VDEV then.

Are writes sync, standard, or async? You have a log drive but seem to be using async writes.

You are also over the recommended maximum VDEV width of 12 disks.
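The current setting can be checked per dataset with something like this (the dataset path is a placeholder):

Code:
zfs get sync pool1/<dataset>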
 

Davvo (MVP)
Plausible.
So we have 12 KB files in a dataset with a 128 KB recordsize, a VDEV wider than the recommended 12 disks, and possibly a very fragmented pool.

Output of zpool list pool1 please.

 

truenasuserh (Cadet)
Code:
zpool list pool1
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool1   165T  79.6T  85.0T        -         -    16%    48%  1.00x    ONLINE  /mnt
 

truenasuserh (Cadet)

I'm not clear how you are using that special VDEV then.

Are writes sync, standard, or async? You have a log drive but seem to be using async writes.

You are also over the recommended maximum VDEV width of 12 disks.
It's been set up as metadata.
[attachment: screenshot showing the special VDEV configured as metadata]

From what I understand, this means that metadata such as allocation tables, file sizes and write timestamps is stored there, while the actual file contents are stored in the pool. So it should cause faster read times, and not (or even positively) affect write times, right?

Sync=disabled, as with sync=always the write speeds seem to drop even further. This server is in a datacenter, so adding the log device up front and figuring out later whether it should be used was easier than working it out beforehand... I could run another set of tests tomorrow if that data is of any interest.

The recommended max VDEV size sounds like something I'd want to know more about. Do you have a link that actually explains why that is? (I can find a lot of people claiming 8, 10, 12, some 16, but nobody seems to explain it. E.g. https://www.truenas.com/community/threads/vdev-raidz2-max-recommend-disks-per-vdev.81552/ suggests 15 disks for RAIDZ3.)
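For reference, the properties being discussed here can be confirmed from the shell roughly like this (the dataset name is a placeholder):

Code:
zfs get recordsize,special_small_blocks pool1/<dataset>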
 

Davvo (MVP)
So it should cause faster read times, and not (or even positively) affect write times, right?
Yup.

Sync=disabled, as with sync=always the write speeds seem to drop even further. This server is in a datacenter, so adding the log device up front and figuring out later whether it should be used was easier than working it out beforehand... I could run another set of tests tomorrow if that data is of any interest.
Yup, sync writes will always be slower than async ones. No test required.

The recommended max VDEV size sounds like something I'd want to know more about. Do you have a link that actually explains why that is? (I can find a lot of people claiming 8, 10, 12, some 16, but nobody seems to explain it. E.g. https://www.truenas.com/community/threads/vdev-raidz2-max-recommend-disks-per-vdev.81552/ suggests 15 disks for RAIDZ3.)
I don't right now, but I am pretty sure the consensus here is 12: generally, the wider the VDEV, the riskier it gets from a resiliency standpoint; performance-wise, a RAIDZx VDEV has roughly the IOPS of a single drive.
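As a rough illustration, assuming ~200 random write IOPS for a single 7200 rpm disk: a 16-wide RAIDZ3 VDEV still delivers roughly single-disk IOPS for small random I/O, so ~2,000 small files handled more or less one at a time work out to about 2000 / 200 = 10 seconds of seek time alone per copy, before any metadata overhead.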
 

truenasuserh (Cadet)
a RAIDZx VDEV has roughly the IOPS of a single drive.
And do transaction groups improve this for my workload of 12 KB files mixed with larger files? Or is that only for really tiny files? Is there anything I can do with other settings, like zfs_vdev_aggregation_limit?
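For what it's worth, on TrueNAS Core (FreeBSD) the OpenZFS module parameters are exposed as sysctls, so I assume the current aggregation limits could be inspected with something like this (exact names may differ between versions):

Code:
sysctl vfs.zfs.vdev.aggregation_limit
sysctl vfs.zfs.vdev.aggregation_limit_non_rotating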
 

Davvo (MVP)
My boss is convinced that small or big files shouldn't matter, as small files would get aggregated using transactions, and therefore the hard drives would only see large sequential writes.
To put it simply, that's sadly not how things work.

I suggest reading the following links (especially the comments of the third link).

No tunable comes to mind that would help you, but I'm no expert; you could allow files under a certain size to be written to the special VDEV you already have: that would speed up your writes by removing the I/O amplification caused by small files in a large recordsize.
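If you try that, it would be something along these lines (16K is only an example threshold and the dataset name is a placeholder; it only affects newly written blocks, and setting it as large as the recordsize would send all data to the special VDEV):

Code:
zfs set special_small_blocks=16K pool1/<dataset>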
 