Lower IOPS than expected / slow small-file transfers over SMB

truenasuserh (Cadet)
I've got the following setup:
[attachment: screenshot of the pool layout]


Disks here are Seagate IronWolf 12 TB hard disks (ST12000VN0008)

I've got a test folder of representative production data, which I copy over SMB. The folder is 2.2 GB with mixed file sizes, some 12 KB, some 2 MB, split roughly 50/50 by total size (not by file count). During the copy, the smallest files seem to slow things down a lot. My boss is convinced that small or big files shouldn't matter, as small files would get aggregated using transactions, and therefore the hard drives would only see large sequential writes. This copy takes roughly 59.5 seconds.

Copying this folder twice in parallel, I'd expect roughly two minutes in total. However, it actually takes about 8 minutes 20 seconds, with transfer speeds dropping to a couple hundred KB/s when the transfers hit the small files.
[attachments: screenshots of the two SMB transfer dialogs]

I'd assume this is an IOPS issue. But what's killing my IOPS? The fio tests below show around 2K IOPS at minimum, so a folder with roughly 2K files shouldn't be throttled on IOPS, right?


Code:
fio --randrepeat=1 --direct=1 --gtod_reduce=1 --numjobs=1 --bs=4k --iodepth=64 --size=1G --readwrite=randwrite --ramp_time=4 --group_reporting --name=test --filename=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [w(1)][90.6%][w=344MiB/s][w=88.1k IOPS][eta 00m:03s]
test: (groupid=0, jobs=1): err= 0: pid=4860: Wed Nov 29 15:04:23 2023
  write: IOPS=10.4k, BW=40.8MiB/s (42.8MB/s)(1023MiB/25076msec); 0 zone resets
   bw (  KiB/s): min=  341, max=579808, per=95.83%, avg=40027.96, stdev=106981.49, samples=49
   iops        : min=   85, max=144952, avg=10006.61, stdev=26745.43, samples=49
  cpu          : usr=0.87%, sys=14.31%, ctx=12697, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,261841,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=40.8MiB/s (42.8MB/s), 40.8MiB/s-40.8MiB/s (42.8MB/s-42.8MB/s), io=1023MiB (1073MB), run=25076-25076msec


Code:
fio --randrepeat=1 --direct=1 --gtod_reduce=1 --numjobs=10 --bs=4k --iodepth=64 --size=1G --readwrite=randwrite --ramp_time=4 --group_reporting --name=test --filename=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64
...
fio-3.28
Starting 10 processes
Jobs: 10 (f=10): [w(10)][46.7%][w=1151MiB/s][w=295k IOPS][eta 00m:08s]
test: (groupid=0, jobs=10): err= 0: pid=4889: Wed Nov 29 15:06:06 2023
  write: IOPS=402k, BW=1572MiB/s (1648MB/s)(3773MiB/2400msec); 0 zone resets
   bw (  MiB/s): min= 1092, max= 2089, per=89.32%, avg=1404.13, stdev=41.10, samples=40
   iops        : min=279768, max=535000, avg=359453.75, stdev=10520.42, samples=40
  cpu          : usr=4.14%, sys=61.84%, ctx=148692, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,965810,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=1572MiB/s (1648MB/s), 1572MiB/s-1572MiB/s (1648MB/s-1648MB/s), io=3773MiB (3956MB), run=2400-2400msec

Code:
fio --randrepeat=1 --direct=1 --gtod_reduce=1 --numjobs=10 --bs=4k --iodepth=64 --size=4G --readwrite=randwrite --ramp_time=4 --group_reporting --name=test --filename=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64
...
fio-3.28
Starting 10 processes
Jobs: 5 (f=5): [_(1),w(1),_(1),w(3),_(2),w(1),_(1)][99.8%][w=298MiB/s][w=76.2k IOPS][eta 00m:01s]
test: (groupid=0, jobs=10): err= 0: pid=4910: Wed Nov 29 15:17:34 2023
  write: IOPS=16.9k, BW=65.8MiB/s (69.0MB/s)(40.0GiB/621485msec); 0 zone resets
   bw (  KiB/s): min= 7887, max=427386, per=100.00%, avg=67421.05, stdev=3381.92, samples=12352
   iops        : min= 1968, max=106843, avg=16852.13, stdev=845.49, samples=12352
  cpu          : usr=1.09%, sys=16.80%, ctx=8370246, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10474974,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=65.8MiB/s (69.0MB/s), 65.8MiB/s-65.8MiB/s (69.0MB/s-69.0MB/s), io=40.0GiB (42.9GB), run=621485-621485msec
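Note that the default psync ioengine is synchronous, so --iodepth=64 is effectively ignored here and each job only ever has one I/O outstanding. A single large test file is also not quite the same workload as thousands of small files being created. A closer approximation of the SMB copy might be something along these lines (just a sketch; the directory path, file counts and sizes are placeholders):

Code:
fio --name=smallfiles --directory=/mnt/pool1/smbtest --rw=write \
    --ioengine=psync --bs=12k --nrfiles=1000 --filesize=12k-2m \
    --size=2200m --numjobs=2 --group_reporting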

Code:
zpool status pool1
  pool: pool1
  state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 19.9G in 00:12:05 with 0 errors on Wed Nov 29 13:31:05 2023
config:

        NAME                                              STATE     READ WRITE CKSUM
        pool1                                             ONLINE       0     0     0
          raidz3-0                                        ONLINE       0     0     0
            gptid/96687099-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/f7429afe-dc42-11ed-9b7a-e0d55e60faba    ONLINE       0     0     0
            spare-2                                       ONLINE       0     0     0
              gptid/97f3a674-d203-11eb-9e64-e0d55e60faba  ONLINE       0     0     0
              gptid/a6c07145-d203-11eb-9e64-e0d55e60faba  ONLINE       0     0     0
            gptid/99e77ff3-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/18d7ebc7-dc43-11ed-9b7a-e0d55e60faba    ONLINE       0     0     0
            gptid/9e3ed33b-d203-11eb-9e64-e0d55e60faba    ONLINE      47     0     0
            gptid/396b46cc-dc43-11ed-9b7a-e0d55e60faba    ONLINE       0     0     0
            gptid/9bf4f4c2-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/9eb74444-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/97416f1e-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a10904dd-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a2003ccd-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a468b099-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a3ec783a-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a5a34558-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
        special
          mirror-2                                        ONLINE       0     0     0
            gptid/a51b321a-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a59ed3da-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
            gptid/a59dcf5d-d203-11eb-9e64-e0d55e60faba    ONLINE       0     0     0
        logs
          gptid/a06d9015-d203-11eb-9e64-e0d55e60faba      ONLINE       0     0     0
        spares
          gptid/a6c07145-d203-11eb-9e64-e0d55e60faba      INUSE     currently in use

errors: No known data errors


System information:
OS Version: TrueNAS-13.0-U5.3
Model: D120-C21
Memory: 32 GiB
CPU: Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz
(CPU max according to the reporting is 70%.)
[attachment: screenshot of the CPU usage graph]


 

truenasuserh (Cadet)
From what I understand, since "metadata (special) small block size" is set to 0, the small files should go to the raidz3 vdev. As there are a lot of them, they would significantly reduce the available capacity if they went to the special devices. The 3 special disks are INTEL SSDSC2KB960G8, configured as a triple mirror, just under 1 TB each.

The log disk is a SAMSUNG MZVLB1T0HALR-00000
 

asap2go (Patron)
From what I understand, since "metadata (special) small block size" is set to 0, the small files should go to the raidz3 vdev. As there are a lot of them, they would significantly reduce the available capacity if they went to the special devices. The 3 special disks are INTEL SSDSC2KB960G8, configured as a triple mirror, just under 1 TB each.

The log disk is a SAMSUNG MZVLB1T0HALR-00000
Ah, I didn't see that it was set to 0.
Then it makes sense that the files are on the HDDs.
 

Davvo (MVP)

I'm not clear how you are using that special VDEV then.

Are writes sync, standard, or async? You have a log drive but seem to be using async writes.

You are also over the recommended maximum VDEV width of 12 disks.
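The current setting can be checked per dataset with something like this (the dataset path is a placeholder):

Code:
zfs get sync pool1/<dataset>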
 

Davvo (MVP)
Plausible.
So we have 12 KB files in a dataset with a 128 KB recordsize, a VDEV wider than the recommended 12 disks, and possibly a very fragmented pool.

Output of zpool list pool1 please.

 

truenasuserh (Cadet)
Code:
zpool list pool1
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool1   165T  79.6T  85.0T        -         -    16%    48%  1.00x    ONLINE  /mnt
 

truenasuserh (Cadet)

I'm not clear how you are using that special VDEV then.

Are writes sync, standard, or async? You have a log drive but seem to be using async writes.

You are also over the recommended maximum VDEV width of 12 disks.
It's been set up as metadata.
[attachment: screenshot showing the special VDEV configured as metadata]

From what I understand, this means that metadata such as allocation tables, file sizes and write timestamps is stored there, while the actual file contents are stored in the pool. So it should cause faster read times, and not (or even positively) affect write times, right?

Sync=disabled, as with sync=always the write speeds seem to drop even further. This server is in a datacenter, so adding the log device up front and figuring out later whether it should be used was easier than working it out beforehand... I could run another set of tests tomorrow if that data is of any interest.

The recommended max VDEV size sounds like something I'd want to know more about. Do you have a link that actually explains why that is? (I can find a lot of people claiming 8, 10, 12, some 16, but nobody seems to explain it. E.g. https://www.truenas.com/community/threads/vdev-raidz2-max-recommend-disks-per-vdev.81552/ suggests 15 disks for RAIDZ3.)
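For reference, the properties being discussed here can be confirmed from the shell roughly like this (the dataset name is a placeholder):

Code:
zfs get recordsize,special_small_blocks pool1/<dataset>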
 

Davvo (MVP)
So it should cause faster read times, and not (or even positively) affect write times, right?
Yup.

Sync=disabled, as with sync=always the write speeds seem to drop even further. This server is in a datacenter, so adding the log device up front and figuring out later whether it should be used was easier than working it out beforehand... I could run another set of tests tomorrow if that data is of any interest.
Yup, sync writes will always be slower than async ones. No test required.

The recommended max VDEV size sounds like something I'd want to know more about. Do you have a link that actually explains why that is? (I can find a lot of people claiming 8, 10, 12, some 16, but nobody seems to explain it. E.g. https://www.truenas.com/community/threads/vdev-raidz2-max-recommend-disks-per-vdev.81552/ suggests 15 disks for RAIDZ3.)
I don't right now, but I am pretty sure the consensus here is 12: generally, the wider the VDEV, the riskier it gets from a resiliency standpoint; performance-wise, a RAIDZx VDEV has roughly the IOPS of a single drive.
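As a rough illustration, assuming ~200 random write IOPS for a single 7200 rpm disk: a 16-wide RAIDZ3 VDEV still delivers roughly single-disk IOPS for small random I/O, so ~2,000 small files handled more or less one at a time work out to about 2000 / 200 = 10 seconds of seek time alone per copy, before any metadata overhead.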
 

truenasuserh (Cadet)
a RAIDZx VDEV has roughly the IOPS of a single drive.
And do transaction groups improve this for my workload of 12 KB files mixed with larger files? Or is that only for really tiny files? Is there anything I can do with other settings, like zfs_vdev_aggregation_limit?
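For what it's worth, on TrueNAS Core (FreeBSD) the OpenZFS module parameters are exposed as sysctls, so I assume the current aggregation limits could be inspected with something like this (exact names may differ between versions):

Code:
sysctl vfs.zfs.vdev.aggregation_limit
sysctl vfs.zfs.vdev.aggregation_limit_non_rotating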
 

Davvo (MVP)
My boss is convinced that small or big files shouldn't matter, as small files would get aggregated using transactions, and therefore the hard drives would only see large sequential writes.
To put it simply, that's sadly not how things work.

I suggest reading the following links (especially the comments of the third link).

No tunable comes to mind that would help you, but I'm no expert; you could allow files under a certain size to be written to the special VDEV you already have: that would speed up your writes by removing the I/O amplification caused by small files in a large recordsize.
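If you try that, it would be something along these lines (16K is only an example threshold and the dataset name is a placeholder; it only affects newly written blocks, and setting it as large as the recordsize would send all data to the special VDEV):

Code:
zfs set special_small_blocks=16K pool1/<dataset>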
 