SSD IOPS and throughput seem low.

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
I'm having a hard time understanding what's going on with the flash pool below.

Code:
root@nas1:~ # zpool status vol3
  pool: vol3
 state: ONLINE
  scan: scrub repaired 0B in 00:04:39 with 0 errors on Sun Oct 25 00:04:41 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    vol3                                            ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/5dcdd488-f99c-11ea-aab7-90e2ba3a89c4  ONLINE       0     0     0
        gptid/5ee2636b-f99c-11ea-aab7-90e2ba3a89c4  ONLINE       0     0     0
      mirror-1                                      ONLINE       0     0     0
        gptid/5f604727-f99c-11ea-aab7-90e2ba3a89c4  ONLINE       0     0     0
        gptid/5fcbc4a4-f99c-11ea-aab7-90e2ba3a89c4  ONLINE       0     0     0
      mirror-2                                      ONLINE       0     0     0
        gptid/5fdf2d58-f99c-11ea-aab7-90e2ba3a89c4  ONLINE       0     0     0
        gptid/5fd6316a-f99c-11ea-aab7-90e2ba3a89c4  ONLINE       0     0     0
      mirror-3                                      ONLINE       0     0     0
        gptid/5e0f3c1f-f99c-11ea-aab7-90e2ba3a89c4  ONLINE       0     0     0
        gptid/5fe2580d-f99c-11ea-aab7-90e2ba3a89c4  ONLINE       0     0     0
      mirror-4                                      ONLINE       0     0     0
        gptid/5e9980e5-f99c-11ea-aab7-90e2ba3a89c4  ONLINE       0     0     0
        gptid/5f7733ed-f99c-11ea-aab7-90e2ba3a89c4  ONLINE       0     0     0
      mirror-5                                      ONLINE       0     0     0
        gptid/5e57b6f6-f99c-11ea-aab7-90e2ba3a89c4  ONLINE       0     0     0
        gptid/5fbf3133-f99c-11ea-aab7-90e2ba3a89c4  ONLINE       0     0     0
    logs   
      gptid/b0ceec12-15c0-11eb-85d6-00074333ba50    ONLINE       0     0     0

errors: No known data errors


I have attached the data sheet of the SSDs I'm using. The SLOG is an Intel Optane 900P 280GB, the PCIe add-in-card version.
The SSD data sheet states 24,000 max random 4K write IOPS, and that is roughly what I'm getting. Every test I run matches the data sheet, but the data sheet is for a single disk; I have 12 disks arranged as 6 x 2-way mirrors, so I would have thought the speed would scale well beyond a single disk. Full spec of the server is in my signature.
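Back-of-envelope for the scaling I'd expect, assuming random 4K writes scale per vdev (each mirror has to write every block to both members):

Code:
# datasheet, single disk:   24,000 random 4K write IOPS
# 6 mirror vdevs x 24,000 = ~144,000 IOPS theoretical pool ceiling
# measured over NFS, sync:  ~18,500 IOPS, i.e. about one disk's worth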

Test done inside a CentOS 7 VM over NFS, sync enabled:
Code:
[root@cloud ~]# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=78.0MiB/s][r=0,w=20.2k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=5120: Sun Oct 25 12:37:45 2020
  write: IOPS=18.5k, BW=72.4MiB/s (75.9MB/s)(4096MiB/56603msec)
   bw (  KiB/s): min=56352, max=92368, per=99.91%, avg=74029.83, stdev=4936.87, samples=113
   iops        : min=14088, max=23092, avg=18507.44, stdev=1234.26, samples=113
  cpu          : usr=9.99%, sys=28.69%, ctx=376272, majf=0, minf=28
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=72.4MiB/s (75.9MB/s), 72.4MiB/s-72.4MiB/s (75.9MB/s-75.9MB/s), io=4096MiB (4295MB), run=56603-56603msec

Disk stats (read/write):
    dm-0: ios=0/1043192, merge=0/0, ticks=0/3570970, in_queue=3571337, util=99.93%, aggrios=18/1043212, aggrmerge=0/5394, aggrticks=122/3569746, aggrin_queue=3569753, aggrutil=99.91%
  xvda: ios=18/1043212, merge=0/5394, ticks=122/3569746, in_queue=3569753, util=99.91%


The same test repeated inside the CentOS 7 VM over NFS, sync enabled:
Code:
[root@cloud ~]# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [w(1)][98.3%][r=0KiB/s,w=70.1MiB/s][r=0,w=17.9k IOPS][eta 00m:01s]
test: (groupid=0, jobs=1): err= 0: pid=5232: Sun Oct 25 12:41:34 2020
  write: IOPS=18.1k, BW=70.9MiB/s (74.3MB/s)(4096MiB/57785msec)
   bw (  KiB/s): min=64328, max=85800, per=99.85%, avg=72477.19, stdev=3393.64, samples=115
   iops        : min=16082, max=21450, avg=18119.23, stdev=848.40, samples=115
  cpu          : usr=9.92%, sys=28.96%, ctx=384178, majf=0, minf=28
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=70.9MiB/s (74.3MB/s), 70.9MiB/s-70.9MiB/s (74.3MB/s-74.3MB/s), io=4096MiB (4295MB), run=57785-57785msec

Disk stats (read/write):
    dm-0: ios=1/1045455, merge=0/0, ticks=2/3650287, in_queue=3650931, util=99.92%, aggrios=19/1043307, aggrmerge=0/5363, aggrticks=46/3643350, aggrin_queue=3643369, aggrutil=99.90%
  xvda: ios=19/1043307, merge=0/5363, ticks=46/3643350, in_queue=3643369, util=99.90%


As can be seen, the NVMe SLOG has no detrimental impact when in play. The SLOG shows 12% busy in gstat while the disks sit near 100% busy yet only see about 1,200 IO/s.
Are there any tunables I should be looking at for an all-flash pool? Surely this pool should perform better, or am I wrong?
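For reference, one place such tunables live is the OpenZFS per-vdev I/O scheduler limits, exposed as sysctls on FreeBSD-based TrueNAS (names vary between releases; this is a read-only inspection sketch, not a recommendation to change anything):

Code:
# list the per-vdev I/O scheduler limits
sysctl vfs.zfs.vdev | grep -E 'max_active|min_active'
# e.g. vfs.zfs.vdev.async_write_max_active caps concurrent async writes
# per vdev; the defaults are HDD-oriented and are sometimes raised for
# all-flash pools; benchmark before and after any change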
 

Attachments

  • data-sheet-ultrastar-ssd400m.pdf
    289.6 KB

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Normally, we would test with 10+ clients accessing a NAS for max performance. There are limits to what a single NFS client can do. The NFS server and ZFS have to do a lot more work to make a 4K write reliable: mirroring, metadata updates, etc. The NFS protocol is not that efficient.

Our recommendation is to work out what your application needs and then validate that performance. If you need lots of random 4K writes, NFS may not be the protocol to use.

How did you measure the SSDs at 100% busy? An SSD showing 100% busy just means there is at least one I/O in the queue at all times. Unlike an HDD, an SSD needs far more than one outstanding I/O before it is fully utilized. Look at queue depth.
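In gstat terms that is the L(q) column. A hypothetical illustration (made-up numbers, only to show what to look for):

Code:
# gstat on FreeBSD: L(q) = current queue length per provider
#  L(q)  ops/s   %busy  name
#     1   1200    99.8  da0   <- "100% busy" with 1 queued I/O: the SSD
#                                 is mostly idle internally
#    32  24000    99.9  da0   <- deep queue: the SSD is actually saturated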
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Every test I run matches the data sheet, but the data sheet is for a single disk; I have 12 disks arranged as 6 x 2-way mirrors, so I would have thought the speed would scale well beyond a single disk. Full spec of the server is in my signature.

The misconception (which I also fell for) is assuming that a single writing process will make use of multiple vdevs in any meaningful way; my experiments say otherwise (at least at this point in time): https://www.truenas.com/community/threads/ssd-array-performance.75429/, https://www.truenas.com/community/threads/pool-performance-scaling-at-1j-qd1.80417/.

You will see better (aggregated) performance if you use multiple jobs (as opposed to deeper queues), but whether that's a realistic representation of your workload is up to you.

From a general testing point of view, I'd recommend starting with the local pool first, i.e. identify its maximum capabilities by running locally without sync, then move to NFS (without sync), then enable sync/add the SLOG (or swap the order of the last two) to see where you "lose" performance.
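A sketch of that sequence in commands (the dataset name vol3/test is illustrative):

Code:
# 1. local pool ceiling, no sync
zfs set sync=disabled vol3/test
fio --name=test --filename=/mnt/vol3/test/test --direct=1 --bs=4k \
    --iodepth=64 --size=4G --readwrite=randwrite
# 2. sync back on, same run: prices the ZIL/SLOG path
zfs set sync=standard vol3/test
# 3. same fio command from an NFS client: isolates protocol overhead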
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
Hi @morganL, I'm using XCP-ng as the VM host, and NFS is the only option that allows thin provisioning, which is why I've chosen NFS over iSCSI. I will test with multiple VMs when I have time. This is a lab setup, so it's for testing purposes.

Here are direct writes to a dataset on the same pool with sync disabled.

First, I got the following error:
Code:
root@nas1:/mnt/vol3/test # sync ; fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=8G --readwrite=randwrite --ramp_time=4
fio: engine libaio not loadable
fio: engine libaio not loadable
fio: failed to load engine


I removed the libaio switch and reran the command; I'm not sure how this will change the results.
Code:
root@nas1:/mnt/vol3/test # sync ; fio --randrepeat=1 --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=8G --readwrite=randwrite --ramp_time=4
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [w(1)][95.7%][w=239MiB/s][w=61.1k IOPS][eta 00m:04s]
test: (groupid=0, jobs=1): err= 0: pid=14335: Sun Oct 25 18:20:59 2020
  write: IOPS=23.5k, BW=91.6MiB/s (96.1MB/s)(7858MiB/85768msec)
   bw (  KiB/s): min=64396, max=265539, per=99.75%, avg=93582.57, stdev=24764.26, samples=168
   iops        : min=16099, max=66384, avg=23395.30, stdev=6191.09, samples=168
  cpu          : usr=5.12%, sys=87.29%, ctx=353224, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2011686,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=91.6MiB/s (96.1MB/s), 91.6MiB/s-91.6MiB/s (96.1MB/s-96.1MB/s), io=7858MiB (8240MB), run=85768-85768msec


There is only a slight improvement. I will set up 10 VMs, test, and report back, probably next weekend.
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
Sorry, here are the results with 10 jobs. I'll try posixaio next. Better results this time.

Direct to pool sync off
Code:
root@nas1:/mnt/vol3/test # sync ; fio --randrepeat=1 --direct=1 --gtod_reduce=1 --numjobs=10 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite --ramp_time=4 --group_reporting
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64
...
fio-3.19
Starting 10 processes
Jobs: 3 (f=3): [_(5),w(3),_(2)][95.0%][w=492MiB/s][w=126k IOPS][eta 00m:04s]
test: (groupid=0, jobs=10): err= 0: pid=14886: Sun Oct 25 18:47:35 2020
  write: IOPS=139k, BW=543MiB/s (570MB/s)(38.2GiB/71940msec)
   bw (  KiB/s): min=59846, max=1248802, per=100.00%, avg=559336.90, stdev=24875.33, samples=1410
   iops        : min=14958, max=312197, avg=139830.56, stdev=6218.83, samples=1410
  cpu          : usr=3.42%, sys=26.40%, ctx=1783614, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10008213,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=543MiB/s (570MB/s), 543MiB/s-543MiB/s (570MB/s-570MB/s), io=38.2GiB (40.0GB), run=71940-71940msec
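
Worth noting: ioengine=psync is synchronous, so --iodepth=64 collapses to an effective depth of 1 (the "IO depths    : 1=100.0%" line above shows it). A sketch of the same run with the posixaio engine, which does queue multiple I/Os (untested here):

Code:
sync ; fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 \
    --numjobs=10 --name=test --filename=test --bs=4k --iodepth=64 \
    --size=4G --readwrite=randwrite --ramp_time=4 --group_reporting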
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
So you get 570 MB/s instead of 90 MB/s, which is roughly 6 times higher... for 6 vdevs. Seems fitting.
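A quick sanity check on those numbers, arithmetic only:

Code:
# 139k IOPS x 4 KiB        = ~543 MiB/s (matches the reported bandwidth)
# 543 MiB/s / 91.6 MiB/s   = ~5.9x over the single-job run
# ~6x across 6 mirror vdevs: consistent with one job driving roughly one vdev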

How does the remote result look with more than 6 jobs?
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
So you get 570 MB/s instead of 90 MB/s, which is roughly 6 times higher... for 6 vdevs. Seems fitting.

How does the remote result look with more than 6 jobs?
18,000 -> 139,000 random write IOPS is a lot... especially from one host and only 12 SSDs. What is the network speed?
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
That was direct to the pool with sync disabled, run from the TrueNAS console. I will test with 10 VMs in parallel to test NFS performance.

I am using 10GbE LACP.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Why don't you run the same fio command on a Linux box (libaio) first, with and without sync?
That will give you proper comparison values showing the "raw" capabilities over the network.

If that is ok then you can do application level tests to see if there is further loss.

In the end it will depend on your use case and, of course, on having a proper expectation level and requirements defined.
Just mentioning this since testing 4K random writes is probably not a realistic representation of, say, ESXi-hosted virtual machines...
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
This is a purely academic exercise. With 2 hosts, each with 2 x 10GbE in LACP to TrueNAS, and 5 VMs on each host, I'm getting 1440MB/s with 64K blocks and 3090MB/s with 1M blocks, sync off. With sync on I seem to hit a ceiling of 1440MB/s.

Do you have any suggestions for running simultaneous tests on multiple VMs at once? I'm currently using Ansible.
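A minimal sketch of one way to launch fio on all VMs at once with an Ansible ad-hoc command (the inventory group "vms" is hypothetical):

Code:
# -f 10 = run on up to 10 hosts in parallel
ansible vms -f 10 -m shell -a "fio --ioengine=libaio --direct=1 --name=test --filename=/root/test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite"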
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
Thanks for the info. The Intel Optane seems better until it loses out on throughput at larger block sizes. Compare the Optane results below to the link above.

Code:
root@nas1:~ # diskinfo -citwS /dev/nvd0
/dev/nvd0
    512             # sectorsize
    280065171456    # mediasize in bytes (261G)
    547002288       # mediasize in sectors
    0               # stripesize
    0               # stripeoffset
    INTEL SSDPED1D280GA    # Disk descr.
    PHMB742301K1280CGN    # Disk ident.
    Yes             # TRIM/UNMAP support
    0               # Rotation rate in RPM

I/O command overhead:
    time to read 10MB block      0.006517 sec    =    0.000 msec/sector
    time to read 20480 sectors   0.272275 sec    =    0.013 msec/sector
    calculated command overhead            =    0.013 msec/sector

Seek times:
    Full stroke:      250 iter in   0.003609 sec =    0.014 msec
    Half stroke:      250 iter in   0.003562 sec =    0.014 msec
    Quarter stroke:      500 iter in   0.006846 sec =    0.014 msec
    Short forward:      400 iter in   0.005633 sec =    0.014 msec
    Short backward:      400 iter in   0.005486 sec =    0.014 msec
    Seq outer:     2048 iter in   0.026915 sec =    0.013 msec
    Seq inner:     2048 iter in   0.027314 sec =    0.013 msec

Transfer rates:
    outside:       102400 kbytes in   0.068837 sec =  1487572 kbytes/sec
    middle:        102400 kbytes in   0.050872 sec =  2012895 kbytes/sec
    inside:        102400 kbytes in   0.050821 sec =  2014915 kbytes/sec

Asynchronous random reads:
    sectorsize:    876013 ops in    3.000074 sec =   291997 IOPS
    4 kbytes:      867927 ops in    3.000074 sec =   289302 IOPS
    32 kbytes:     237031 ops in    3.001652 sec =    78967 IOPS
    128 kbytes:     61489 ops in    3.006170 sec =    20454 IOPS

Synchronous random writes:
     0.5 kbytes:     18.0 usec/IO =     27.1 Mbytes/s
       1 kbytes:     18.1 usec/IO =     53.8 Mbytes/s
       2 kbytes:     17.5 usec/IO =    111.7 Mbytes/s
       4 kbytes:     15.6 usec/IO =    250.8 Mbytes/s
       8 kbytes:     17.9 usec/IO =    435.3 Mbytes/s
      16 kbytes:     23.0 usec/IO =    678.6 Mbytes/s
      32 kbytes:     31.0 usec/IO =   1009.4 Mbytes/s
      64 kbytes:     51.1 usec/IO =   1223.9 Mbytes/s
     128 kbytes:     90.9 usec/IO =   1374.7 Mbytes/s
     256 kbytes:    162.6 usec/IO =   1537.9 Mbytes/s
     512 kbytes:    309.7 usec/IO =   1614.6 Mbytes/s
    1024 kbytes:    610.3 usec/IO =   1638.5 Mbytes/s
    2048 kbytes:   1112.1 usec/IO =   1798.4 Mbytes/s
    4096 kbytes:   2238.5 usec/IO =   1786.9 Mbytes/s
    8192 kbytes:   4456.6 usec/IO =   1795.1 Mbytes/s


I'll do some more research but I'll most likely get one just to try.
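Reading the 4K line of the synchronous-write table above as an IOPS ceiling, arithmetic only:

Code:
# 4 KiB sync write: 15.6 usec/IO
# 1 / 15.6e-6 s   = ~64,000 sync writes/s at queue depth 1
# well above the ~18.5k IOPS seen over NFS, so the SLOG itself is
# unlikely to be the bottleneck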
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
It sure is cheap enough to give it a try, but NVDIMM compatibility is an issue even at the E5 v3/v4 level; I only got it to run (more or less properly) on Skylake.

You will also need a matching PowerGem (battery module), or you will not actually have non-volatile memory.
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
I've asked the seller if he's got the battery; just waiting for a response. I found a 4GB module that has supercaps built in, but 4GB is too small.
 
Top