Sudden drop in file transfer speed

AndroidBot

Dabbler
Joined
Apr 25, 2021
Messages
28
I have a rather powerful server (E5-2699A v4, 512GB RAM, 25GB/s ethernet) running only TrueNAS. It has 36 HDDs, each 18TB, plus 2x 4TB fast NVMe SSDs added as a cache to the pool.

I have an iPerf3 server running on TrueNAS. From my Windows machine, when I run
iperf3 -P 50 --bidir -c truenasIP -n 20G

iPerf3 runs smoothly and I get a constant 10Gb/s with no trouble. I re-ran the test many times back to back, so the issue is not the network, not jumbo frames, not the router, etc. (I think?)

So when I transfer a large file, it initially peaks at the speed it should be transferring at, up to 1.1GB/s, but after 5 seconds or so it drops to the 200MB/s range, and sometimes even lower.

Any suggestions on what to check or what to test?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Without more information we can only guess. First, please describe your entire hardware, including how the disks are wired to your system board.

Then please describe your pool layout. Some people new to ZFS have made RAID-Zx vDevs that are too wide. With 36 disks it may seem logical to make 2 x 18-disk RAID-Zx vDevs (or worse, a single 36-disk RAID-Zx vDev). However, 10 to 12 disks is considered a more reasonable maximum. The reasons are a bit too complex for the moment, until we know whether you even used such a layout.

Next, 8TByte of L2ARC / Cache when using 512GByte of RAM is a bit too much. That is 16 times the size of RAM. In general, 5 to 10 times is thought to be acceptable. But, it may be okay due to changes in ZFS' index / header for the entries in the L2ARC / Cache.
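To put a rough number on that header cost, here is a back-of-envelope sketch; the 128K average record size and ~80 bytes per header are assumptions (the actual header size varies across OpenZFS versions, and smaller records make this much worse):

```shell
# Rough RAM cost of L2ARC headers for 8TiB of cache.
# Assumptions: 128KiB average record size, ~80 bytes of ARC header per
# cached record. Both numbers are illustrative, not measured.
l2arc_bytes=$((8 * 1024 * 1024 * 1024 * 1024))   # 8 TiB of L2ARC
recordsize=$((128 * 1024))                        # 128 KiB records
header_bytes=80
records=$((l2arc_bytes / recordsize))
ram_cost=$((records * header_bytes))
echo "records cached:   $records"
echo "RAM for headers:  $((ram_cost / 1024 / 1024 / 1024)) GiB"
```

So at large record sizes the header overhead is tolerable on 512GByte of RAM; with small records (databases, VMs) it balloons quickly.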

Last, the 5-second drop sounds like a ZFS transaction group write to the pool, with your pool limited to about 200MBytes/s of continuous write speed. Thus, back to the pool layout.
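A sketch of why the transfer stays fast for only a few seconds: ZFS buffers incoming writes in RAM ("dirty data") and flushes them in transaction groups, so the client sees wire speed until the dirty-data cap is hit, then gets throttled to what the pool can sink. The ~4GiB cap below is an assumption (the real limit is tunable and version-dependent):

```shell
# Why the transfer looks fast for ~5s and then collapses (illustrative
# numbers from this thread; the dirty-data cap is an assumed value).
incoming=1100   # MB/s arriving over SMB
sinking=200     # MB/s the pool can actually write
backlog_rate=$((incoming - sinking))    # MB/s of dirty data piling up in RAM
dirty_limit=4096                        # assumed ~4 GiB dirty-data cap, in MB
seconds_until_throttle=$((dirty_limit / backlog_rate))
echo "dirty data grows at ${backlog_rate} MB/s"
echo "writers throttled to pool speed after ~${seconds_until_throttle} s"
```

Which lands right around the ~5 seconds of full speed the original post describes.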
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
25GB/s ethernet
Take a look at the following resource.

Model? I assume three 12-wide RAIDZ2 (at least) VDEVs. Also, please post the pool's fragmentation and free space values.

2x 4TB fast NVMe SSDs added as a cache
Model? I assume you are mirroring them and using them as SLOG; I doubt you are overprovisioning them. Do note that "fast" means very little: they might be fast until you fill the NAND, or their performance might plummet during mixed workloads (which is a SLOG's workload), hence the need for the model number.
If you are using them as L2ARC, you are likely wasting money having two... and that's beyond the fact that, as pointed out by @Arwen, they are too big: you want four to six times the RAM size due to how L2ARC works.

So when I transfer a large file
How big? Using which service? On which dataset's recordsize? On which syncwrites value?

Any suggestions on what to check or what to test?
Arwen nailed it, I suggest a fio test and a pass of jgreco's solnet array for starters.
 
Last edited:

AndroidBot

Dabbler
Joined
Apr 25, 2021
Messages
28
@Davvo @Arwen

HDDs are all 18TB WD Gold Enterprise Class SATA HDD

zpool status output:

Code:
# zpool status
  pool: Data
 state: ONLINE
  scan: resilvered 1.97T in 1 days 11:06:16 with 0 errors on Thu Mar  7 08:39:13 2024
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0 0
          raidz2-0                                      ONLINE       0     0 0
            gptid/7ab18138-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7c057610-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7d95f255-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7e3873a8-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7e650bd1-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7ddab2c0-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7f84da6c-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7ea60eaf-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7fb19f80-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7faecf98-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7fa918af-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/7fac0afa-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/804bc82b-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/80c19294-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/d7116d26-db61-11ee-9c6a-649d99b17ae8  ONLINE       0     0 0
            gptid/8108d998-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/a7a4b6f3-c01c-11ec-8ddb-649d99b17ae8  ONLINE       0     0 0
            gptid/8349e47b-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/853ede6a-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/85e91b20-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8677e1cd-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/87084ced-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/87593963-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/87f7e22f-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/88a55a82-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/88df14bc-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/890179ab-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/890f3208-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/895277d2-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8969caec-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8a0e84b9-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8a8d7e79-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8af14687-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8ad1599f-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8b72bba8-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
            gptid/8c492ab3-9fc8-11eb-b1d6-000af79a3030  ONLINE       0     0 0
        cache
          gptid/2324147c-a304-11eb-8d5e-649d99b17ae8    ONLINE       0     0 0
          gptid/232d5cbf-a304-11eb-8d5e-649d99b17ae8    ONLINE       0     0 0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:01:26 with 0 errors on Wed Mar  6 03:46:26 2024
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          nvd0p2    ONLINE       0     0     0

errors: No known data errors
       


As for the SSD cache, sorry, I was wrong: it's actually 2x 3TB; they are INTEL SSDPECKE064T8.

I transfer files using SMB

When I try to run a `fio` test, I get "fio: engine libaio not loadable, failed to load engine".

What do you think?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Totally a pool layout issue. Break it into 3 VDEVs and you will see much better performance and resiliency.

fio needs parameters; search the forum or its manpage.

I am appalled by that time-bomb pool; a single 36-wide RAIDZ2 VDEV might be a record.
 

AndroidBot

Dabbler
Joined
Apr 25, 2021
Messages
28
@Davvo
Ok, here's fio result

Code:
# fio --filename=test --direct=1 --rw=randrw --randrepeat=0 --rwmixread=100 --iodepth=128 --numjobs=12 --runtime=60 --group_reporting --name=4ktest --ioengine=psync --size=4G --bs=4k
4ktest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=128
...
fio-3.28
Starting 12 processes
Jobs: 6 (f=6): [r(3),_(2),r(2),_(1),E(1),_(1),r(1),_(1)][95.0%][r=2654MiB/s][r=680k IOPS][eta 00m:01s]
4ktest: (groupid=0, jobs=12): err= 0: pid=31715: Thu Mar  7 22:04:51 2024
  read: IOPS=671k, BW=2622MiB/s (2749MB/s)(48.0GiB/18748msec)
    clat (nsec): min=1888, max=4521.1k, avg=17144.27, stdev=76703.90
     lat (nsec): min=1910, max=4521.1k, avg=17170.16, stdev=76703.92
    clat percentiles (usec):
     |  1.00th=[    5],  5.00th=[    6], 10.00th=[    6], 20.00th=[    7],
     | 30.00th=[    7], 40.00th=[    8], 50.00th=[    8], 60.00th=[    8],
     | 70.00th=[    9], 80.00th=[   10], 90.00th=[   13], 95.00th=[   25],
     | 99.00th=[  273], 99.50th=[  537], 99.90th=[ 1156], 99.95th=[ 1401],
     | 99.99th=[ 1975]
   bw (  MiB/s): min= 2375, max= 3231, per=100.00%, avg=2650.76, stdev=13.93, samples=425
   iops        : min=608184, max=827155, avg=678589.98, stdev=3565.58, samples=425
  lat (usec)   : 2=0.01%, 4=0.07%, 10=82.58%, 20=11.30%, 50=3.35%
  lat (usec)   : 100=0.84%, 250=0.56%, 500=0.65%, 750=0.32%, 1000=0.16%
  lat (msec)   : 2=0.14%, 4=0.01%, 10=0.01%
  cpu          : usr=3.07%, sys=96.91%, ctx=4118, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=12582912,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=2622MiB/s (2749MB/s), 2622MiB/s-2622MiB/s (2749MB/s-2749MB/s), io=48.0GiB (51.5GB), run=18748-18748msec


Sorry for the appalling pool layout. How can I break this into 3 vdevs without losing data?


As for SSD cache, this is what I mean
[attachment: 1709867282676.png]
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Sorry for the appalling pool layout. How can I break this into 3 vdevs without losing data?
Back up your data, since you have to destroy the pool and recreate it with the new layout; there is no other solution.
You were lucky to discover the layout issue through performance and not data loss, which is great!
 

AndroidBot

Dabbler
Joined
Apr 25, 2021
Messages
28
I don't have enough storage to hold 70TB of data temporarily... Based on fio, do you think this is still a pool layout issue?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I believe that fio was run on the SSD: you need to run it on the Data pool. Also, you might want different parameters... but honestly the test is of little significance compared to your actual pool layout, which puts your data at great risk.
 

AndroidBot

Dabbler
Joined
Apr 25, 2021
Messages
28
I didn't run fio on the SSD, I ran it in the /mnt/poolname/sharename folder.

So you think I should find some temporary 70TB storage, move my data there, then delete the whole pool and create a new one with 3 vdevs? But if I create 3 separate vdevs, can I still get 1 SMB share with all my data in one place? I mean, the whole size being available to me in one shared drive?

Other than pool layout, is there anything else that might be impacting my speed? I mean, everything was working fine up to a couple of weeks ago.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
So you think I should find some temporary 70TB storage, move my data there, then delete the whole vdev and create new one with 3 layouts?
Delete the whole pool, yes.

But if I create 3 separate layouts, can I still get 1 SMB share with all my data in one place? I mean whole size being available to me in one shared drive?
Yes, VDEVs are parts of the same pool: data is striped between them, and it will appear as a single drive.

Other than pool layout, anything else that might be impacting my speed ? I mean everything was working fine up to couple weeks ago
Define working fine: were you reaching greater speeds on the same file sizes?
You could post the output of zpool list Data inside [CODE][/CODE] tags.
 
Last edited:

AndroidBot

Dabbler
Joined
Apr 25, 2021
Messages
28
Define working fine: Yes, I was getting 1.1GB/s SMB file transfer, both ways, read and write

Code:
# zpool list Data
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Data       589T  75.0T   514T        -         -     1%    12%  1.00x    ONLINE  /mnt
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Try running the following after `cd`ing to the HDD pool's dataset, which should be a more accurate test of what happens when using SMB with a single client: fio --name WTEST --filename=fio-writefile.dat --rw=write --ioengine=posixaio --filesize=10g --iodepth=16 --direct=1 --numjobs=1 --runtime=120 --group_reporting.
 
Last edited:

AndroidBot

Dabbler
Joined
Apr 25, 2021
Messages
28
Sure, here are the results

Code:
# fio --name WTEST --filename=fio-writefile.dat --rw=write --ioengine=posixaio --filesize=10g --iodepth=16 --direct=1 --numjobs=1 --runtime=120 --group_reporting
WTEST: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.28
Starting 1 process
WTEST: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=301MiB/s][w=77.1k IOPS][eta 00m:00s]
WTEST: (groupid=0, jobs=1): err= 0: pid=42368: Fri Mar  8 12:59:51 2024
  write: IOPS=81.8k, BW=320MiB/s (335MB/s)(10.0GiB/32047msec); 0 zone resets
    slat (nsec): min=656, max=758334, avg=3027.86, stdev=2889.14
    clat (usec): min=8, max=475356, avg=179.24, stdev=1523.86
     lat (usec): min=10, max=475360, avg=182.27, stdev=1523.87
    clat percentiles (usec):
     |  1.00th=[   20],  5.00th=[   26], 10.00th=[   32], 20.00th=[   42],
     | 30.00th=[   53], 40.00th=[   73], 50.00th=[  118], 60.00th=[  167],
     | 70.00th=[  219], 80.00th=[  306], 90.00th=[  379], 95.00th=[  453],
     | 99.00th=[  709], 99.50th=[  799], 99.90th=[ 1156], 99.95th=[ 1205],
     | 99.99th=[ 8356]
   bw (  KiB/s): min=61801, max=446355, per=100.00%, avg=327519.08, stdev=116801.18, samples=63
   iops        : min=15450, max=111588, avg=81879.49, stdev=29200.29, samples=63
  lat (usec)   : 10=0.01%, 20=1.08%, 50=27.08%, 100=18.77%, 250=27.15%
  lat (usec)   : 500=22.62%, 750=2.55%, 1000=0.47%
  lat (msec)   : 2=0.26%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%
  cpu          : usr=10.39%, sys=39.51%, ctx=476016, majf=0, minf=1
  IO depths    : 1=0.1%, 2=2.0%, 4=12.6%, 8=64.7%, 16=20.7%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.4%, 8=2.6%, 16=3.9%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2621440,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=320MiB/s (335MB/s), 320MiB/s-320MiB/s (335MB/s-335MB/s), io=10.0GiB (10.7GB), run=32047-32047msec
 

AndroidBot

Dabbler
Joined
Apr 25, 2021
Messages
28
So you think I should move my 70TB of data to some temporary storage, delete the entire pool, and re-create it with 3 vdevs, each with 12 disks, and this would solve my issue?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I am confident you will see a performance improvement.
 
Last edited:

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...
I mean everything was working fine up to couple weeks ago
That is part of the problem with too-wide RAID-Zx vDevs. With a 36-disk-wide RAID-Zx vDev, the performance problems will likely just get worse.

We have had people with too-wide RAID-Zx vDevs end up with performance problems so bad that they could barely get their data off the pool. So be prepared for performance problems while copying the data off.

So you think I should move my 70T data to some temporary storage, delete entire pool, re-create it with 3 devs, each dev to have 12 disks in it and this would solve my issue?
It will help, maybe a lot. On occasion a person runs across other performance-limiting items, like single-thread CPU speed. (Faster cores are better for Samba / SMB than more, slower cores.)


The way ZFS works, in a pool with more than one data vDev, data is normally striped across them. (Support vDevs, like L2ARC / Cache, are not "data vDevs"...)
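As a back-of-envelope comparison of the current 1 x 36-wide vDev versus the proposed 3 x 12-wide RAIDZ2 layout (the per-disk numbers below are assumptions for illustration, not measurements of your WD Golds):

```shell
# Rough comparison: 1 x 36-wide RAIDZ2 versus 3 x 12-wide RAIDZ2.
# Assumed per-disk figures (illustrative only):
disk_mbps=150    # streaming write per disk, MB/s
disk_iops=100    # random IOPS per disk
# Random I/O: each RAIDZ vDev behaves roughly like a single disk,
# so more vDevs means proportionally more random IOPS.
iops_one=$((1 * disk_iops))
iops_three=$((3 * disk_iops))
echo "random IOPS, 1 vDev:   $iops_one"
echo "random IOPS, 3 vDevs:  $iops_three"
# Streaming: parity disks add no bandwidth. Both layouts have 30 data
# disks (36 minus 2 parity per vDev x 3), so the paper ceiling is similar;
# the wide vDev loses it to allocation overhead and long resilvers.
stream_mbps=$(( (36 - 2 * 3) * disk_mbps ))
echo "data-disk bandwidth, 3 vDevs: $stream_mbps MB/s"
```

The tripled random-IOPS figure is the main reason the 3-vDev layout should feel dramatically faster under real workloads, quite apart from the resilver-time risk.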


Last, using 2 x 3TByte NVMe drives for L2ARC / Cache is still 12 times the size of RAM. Probably still too much; a single 3TByte drive would be about ideal.
 