Sudden drop in file transfer speed

AndroidBot · Mar 7, 2024

I have a rather powerful server (E5-2699A v4, 512GB RAM, 25GB/s ethernet), running only TrueNAS, it does have 36 HDDs, each 18TB, and I have 512GB RAM and 2x 4TB fast NVMe SSDs added as a cache to the pool.

I have iPerf3 server running on TrueNAS, from my windows machine, when I run
iperf3 -P 50 -bidir -c truenasIP -n 20G

The iPerf3 runs smoothly and I get a constant 10GB/s speed with no trouble. I re-ran the test many times back to back. So the issue is not the network, it's not jumbo frames, not the router etc. (I think?)

So when I transfer a large file, initially it peaks at the speed it should be transferring files, it goes up to 1.1GB/s, but after 5 seconds or so it immediately drops to 200MB/s range and sometimes it goes even lower.

Any suggestions on what to check or what to test?

Arwen · Mar 7, 2024

Without more information we can only guess. First, your entire hardware, including how the disks are wired to your system board.

Then please describe your pool layout. Some people new to ZFS have made too wide of RAID-Zx vDev. With 36 disks it may seem logical to make 2 x 18 disk RAID-Zx vDevs. (Or worse, a single 36 disk RAID-Zx vDev.) However, 10 to 12 disks is considered more reasonable maximum. The reasons are a bit too complex for the moment, until we know if you even used such.

Next, 8TByte of L2ARC / Cache when using 512MByte of RAM is a bit too much. That is 20 times the size of RAM. In general, 5 to 10 times is thought to be acceptable. But, it may be okay due to changes in ZFS' index / header for the entries in the L2ARC / Cache.

Last, the 5 second sounds like a ZFS transaction write to the pool. With your pool limited to 200MBytes/ps continuous write speed. Thus, back to the pool layout.

Davvo · Mar 7, 2024

AndroidBot said:
25GB/s ethernet

Take a read at the following resource.

Resource - High Speed Networking Tuning to maximize your 10G, 25G, 40G networks

Both FreeBSD and Linux come by default highly optimized for classic 1Gbps ethernet. This is by far the most commonly deployed networking for both clients and servers, and a lot of research has been done to tune performance especially for local...

www.truenas.com

AndroidBot said:
36 HDDs

Model? I assume three 12-wide RAIDZ(at least)2 VDEVs. Also, pool's fragmentation and free space values.

AndroidBot said:
2x 4TB fast NVMe SSDs added as a cache

Model? I assume you are mirroring them and using as SLOG; I doubt you are overprivisioning them. Do note that "fast" means very little: they might be fast until you fill the NAND, or their performance might plummet during mixed workloads (which is a SLOG's workload), hence the need for model number.
If you are using them as L2ARC, you are likely wasting money in having two... and that's beyond the fact that, as pointed out by @Arwen, they are too big: you want four/six times the RAM size due to how L2ARC works.

AndroidBot said:
So when I transfer a large file

How big? Using which service? On which dataset's recordsize? On which syncwrites value?

AndroidBot said:
Any suggestions on what to check or what to test?

Arwen nailed it, I suggest a fio test and a pass of jgreco's solnet array for starters.

AndroidBot · Mar 7, 2024

@Davvo @Arwen

HDDs are all 18TB WD Gold Enterprise Class SATA HDD

zpool status output:

Code:

# zpool status
  pool: Data
 state: ONLINE
  scan: resilvered 1.97T in 1 days 11:06:16 with 0 errors on Thu Mar  7 08:39:13 2024
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0 0
          raidz2-0                                      ONLINE       0     0 0
            gptid/7ab18138-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7c057610-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7d95f255-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7e3873a8-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7e650bd1-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7ddab2c0-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7f84da6c-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7ea60eaf-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7fb19f80-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7faecf98-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7fa918af-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7fac0afa-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/804bc82b-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/80c19294-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/d7116d26-db61-11ee-9c6a-649d99b17ae8  ONLINE       0     0 0
            gptid/8108d998-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/a7a4b6f3-c01c-11ec-8ddb-649d99b17ae8  ONLINE       0     0 0
            gptid/8349e47b-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/853ede6a-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/85e91b20-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8677e1cd-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/87084ced-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/87593963-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/87f7e22f-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/88a55a82-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/88df14bc-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/890179ab-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/890f3208-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/895277d2-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8969caec-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8a0e84b9-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8a8d7e79-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8af14687-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8ad1599f-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8b72bba8-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8c492ab3-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
        cache
          gptid/2324147c-a304-11eb-8d5e-649d99b17ae8    ONLINE       0     0 0
          gptid/232d5cbf-a304-11eb-8d5e-649d99b17ae8    ONLINE       0     0 0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:01:26 with 0 errors on Wed Mar  6 03:46:26 2024
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          nvd0p2    ONLINE       0     0     0

errors: No known data errors

As for SSD cache, sorry I was wrong, it's actually 2x 3 TB, they are INTEL SSDPECKE064T8

I transfer files using SMB

When I try to do `fio` test, I get `fio` engine libaio not loadable, failed to load engine.

What do you think?

Davvo · Mar 7, 2024

Totally a pool layout issue. Break it in 3 VDEVs and you will see much better performance and resiliency.

fio needs parameters, search on forum or on it's manpage.

I am appalled by that ~~time bomb~~ pool; a single 36-wide RAIDZ2 VDEV might be a record.

ZFS Storage Pool Layout

This resource was originally created by user: @Davvo on the TrueNAS Community Forums Archive. https://www.truenas.com/community/resources/zfs-storage-pool-layout.201/download [1] This amazing document, created by iXsystems in February 2022 as a “White Paper”, cleanly explains how to qualify...

www.truenas.com

Assessing the Potential for Data Loss

This guide was written to be read from top to bottom without jumps, with the intent of spreading awareness to both new and experienced users; the author of this document assumes the understanding of the concepts explained in the following...

www.truenas.com

AndroidBot · Mar 7, 2024

@Davvo
Ok, here's fio result

Code:

# fio --filename=test --direct=1 --rw=randrw --randrepeat=0 --rwmixread=100 --iodepth=128 --numjobs=12 --runtime=60 --group_reporting --name=4ktest --ioengine=psync --size=4G --bs=4k
4ktest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=128
...
fio-3.28
Starting 12 processes
Jobs: 6 (f=6): [r(3),_(2),r(2),_(1),E(1),_(1),r(1),_(1)][95.0%][r=2654MiB/s][r=680k IOPS][eta 00m:01s]
4ktest: (groupid=0, jobs=12): err= 0: pid=31715: Thu Mar  7 22:04:51 2024
  read: IOPS=671k, BW=2622MiB/s (2749MB/s)(48.0GiB/18748msec)
    clat (nsec): min=1888, max=4521.1k, avg=17144.27, stdev=76703.90
     lat (nsec): min=1910, max=4521.1k, avg=17170.16, stdev=76703.92
    clat percentiles (usec):
     |  1.00th=[    5],  5.00th=[    6], 10.00th=[    6], 20.00th=[    7],
     | 30.00th=[    7], 40.00th=[    8], 50.00th=[    8], 60.00th=[    8],
     | 70.00th=[    9], 80.00th=[   10], 90.00th=[   13], 95.00th=[   25],
     | 99.00th=[  273], 99.50th=[  537], 99.90th=[ 1156], 99.95th=[ 1401],
     | 99.99th=[ 1975]
   bw (  MiB/s): min= 2375, max= 3231, per=100.00%, avg=2650.76, stdev=13.93, samples=425
   iops        : min=608184, max=827155, avg=678589.98, stdev=3565.58, samples=425
  lat (usec)   : 2=0.01%, 4=0.07%, 10=82.58%, 20=11.30%, 50=3.35%
  lat (usec)   : 100=0.84%, 250=0.56%, 500=0.65%, 750=0.32%, 1000=0.16%
  lat (msec)   : 2=0.14%, 4=0.01%, 10=0.01%
  cpu          : usr=3.07%, sys=96.91%, ctx=4118, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=12582912,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=2622MiB/s (2749MB/s), 2622MiB/s-2622MiB/s (2749MB/s-2749MB/s), io=48.0GiB (51.5GB), run=18748-18748msec

Sorry for the appalling pool layout. How can I break this into 3 vdevs without losing data?

As for SSD cache, this is what I mean

Davvo · Mar 7, 2024

AndroidBot said:
Sorry for the appalling pool layout. How can I break this into 3 vdevs without losing data?

Backup data since you have to destroy the pool and recreate it in the new layout, no other solution.
You were lucky to find out the layout issue due to performance and not data loss, which is great!

AndroidBot · Mar 7, 2024

I don't have large storage to store 70TB data temporarily.... Based on fio, is this still pool layout issue you think?

Davvo · Mar 7, 2024

I believe that fio was run on the SSD: you need to run it on the Data pool. Also, you might want different parameters... but honestly the test is of little significance in front of your actual pool layout, which puts you data at great risk.

AndroidBot · Mar 7, 2024

I didn't run fio on SSD, I did it on /mnt/poolname/sharename folder

So you think I should find some temporary 70TB storage, move my data there, then delete the whole vdev and create new one with 3 layouts? But if I create 3 separate layouts, can I still get 1 SMB share with all my data in one place? I mean whole size being available to me in one shared drive?

Other than pool layout, anything else that might be impacting my speed ? I mean everything was working fine up to couple weeks ago

Davvo · Mar 7, 2024

AndroidBot said:
So you think I should find some temporary 70TB storage, move my data there, then delete the whole vdev and create new one with 3 layouts?

Delete the whole pool, yes.

AndroidBot said:
But if I create 3 separate layouts, can I still get 1 SMB share with all my data in one place? I mean whole size being available to me in one shared drive?

Yes, VDEVs are parts of the same pool: data is striped between them, but it will appear as a single drive.

Introduction to ZFS

This is a short introduction to ZFS. It is really only intended to convey the bare minimum knowledge needed to start diving into ZFS and is in no way meant to cut Michael W. Lucas' and Allan Jude's book income. It is a bit of a spiritual...

www.truenas.com

AndroidBot said:
Other than pool layout, anything else that might be impacting my speed ? I mean everything was working fine up to couple weeks ago

Define working fine: were you reaching greater speeds on the same files size?
You could post the output of zpool list Data inside [CODE][/CODE] tags.

AndroidBot · Mar 7, 2024

Define working fine: Yes, I was getting 1.1GB/s SMB file transfer, both ways, read and write

Code:

# zpool list Data
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Data       589T  75.0T   514T        -         -     1%    12%  1.00x    ONLINE  /mnt

Davvo · Mar 8, 2024

With the same file sizes?

AndroidBot · Mar 8, 2024

Yes, with same file sizes, always moving around large MP4 files

Davvo · Mar 8, 2024

Try running the following after cding to the HDD's dataset, which should be a more accurate test of what happens using SMB with a single client:

fio --name WTEST --filename=fio-writefile.dat --rw=write --ioengine=posixaio --filesize=10g --iodepth=16 --direct=1 --numjobs=1 --runtime=120 --group_reporting

.

AndroidBot · Mar 8, 2024

Sure, here are the results

Code:

# fio --name WTEST --filename=fio-writefile.dat --rw=write --ioengine=posixaio --filesize=10g --iodepth=16 --direct=1 --numjobs=1 --runtime=120 --group_reporting
WTEST: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.28
Starting 1 process
WTEST: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=301MiB/s][w=77.1k IOPS][eta 00m:00s]
WTEST: (groupid=0, jobs=1): err= 0: pid=42368: Fri Mar  8 12:59:51 2024
  write: IOPS=81.8k, BW=320MiB/s (335MB/s)(10.0GiB/32047msec); 0 zone resets
    slat (nsec): min=656, max=758334, avg=3027.86, stdev=2889.14
    clat (usec): min=8, max=475356, avg=179.24, stdev=1523.86
     lat (usec): min=10, max=475360, avg=182.27, stdev=1523.87
    clat percentiles (usec):
     |  1.00th=[   20],  5.00th=[   26], 10.00th=[   32], 20.00th=[   42],
     | 30.00th=[   53], 40.00th=[   73], 50.00th=[  118], 60.00th=[  167],
     | 70.00th=[  219], 80.00th=[  306], 90.00th=[  379], 95.00th=[  453],
     | 99.00th=[  709], 99.50th=[  799], 99.90th=[ 1156], 99.95th=[ 1205],
     | 99.99th=[ 8356]
   bw (  KiB/s): min=61801, max=446355, per=100.00%, avg=327519.08, stdev=116801.18, samples=63
   iops        : min=15450, max=111588, avg=81879.49, stdev=29200.29, samples=63
  lat (usec)   : 10=0.01%, 20=1.08%, 50=27.08%, 100=18.77%, 250=27.15%
  lat (usec)   : 500=22.62%, 750=2.55%, 1000=0.47%
  lat (msec)   : 2=0.26%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%
  cpu          : usr=10.39%, sys=39.51%, ctx=476016, majf=0, minf=1
  IO depths    : 1=0.1%, 2=2.0%, 4=12.6%, 8=64.7%, 16=20.7%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.4%, 8=2.6%, 16=3.9%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2621440,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=320MiB/s (335MB/s), 320MiB/s-320MiB/s (335MB/s-335MB/s), io=10.0GiB (10.7GB), run=32047-32047msec

Davvo · Mar 8, 2024

Much closer to your numbers.

AndroidBot · Mar 8, 2024

So you think I should move my 70T data to some temporary storage, delete entire pool, re-create it with 3 devs, each dev to have 12 disks in it and this would solve my issue?

Davvo · Mar 8, 2024

I am confident you will see a performance improvement.

Arwen · Mar 8, 2024

AndroidBot said:
...
I mean everything was working fine up to couple weeks ago

That is part of the problem with too wide RAID-Zx vDevs. With 36 disk wide RAID-Zx vDev, the performance problems will likely just get worse.

We have had people that have had too wide RAID-Zx vDevs end up with performance problems so bad that they can barely get their data off the pool. So be prepared for performance problems copying the data off.

AndroidBot said:
So you think I should move my 70T data to some temporary storage, delete entire pool, re-create it with 3 devs, each dev to have 12 disks in it and this would solve my issue?

It will help, maybe a lot. On occasion a person runs across other performance limiting items, like single thread CPU speed. (Faster cores are better for Samba / SMB, than more, slower cores.)

The way ZFS works with vDevs, is that a pool with more than one data vDev, data is normally striped across them. (Special vDevs, like L2ARC / Cache are not part of "data vDevs"...)

Last, using 2 x 3TByte NVMe for L2ARC / Cache drives, is still 12 times the size of RAM. Probably still too much. A single 3TByte would be about ideal.

Important Announcement for the TrueNAS Community.

Sudden drop in file transfer speed

Dabbler

MVP

MVP

Dabbler

MVP

Dabbler

MVP

Dabbler

MVP

Dabbler

MVP

Dabbler

MVP

Dabbler

MVP

Dabbler

MVP

Dabbler

MVP

MVP

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Sudden drop in file transfer speed"

Similar threads