SOLVED LSI 2008 vs 3008 all-flash pool max IOPS

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Trying to understand bottlenecks in a future all-flash pool of consumer SSDs, I've come across an interesting question - how much does a single LSI 2008 throttle back a mirrored pool of 8 SSDs? The idea is to use this pool as a fast SAN for VMs. After spending a fair amount of time running these tests, I wanted to share my current results and observations in the hope of finding any errors I might have made in testing. I am also curious about interpretations of some of the things I've found.

The specified maximum throughput of an LSI 2008 tops out at 290 kIOPS, which I initially suspected would hold back an all-flash pool, so I set out to investigate the max IOPS capabilities of three scenarios:

a) 4 SATA SSD drives on the on-board HBA + 4 SATA SSD drives on an LSI 2008
b) 8 SATA SSD drives on LSI 2008
c) 8 SATA SSD drives on LSI 3008

The SATA SSD drives are all consumer-grade Samsung drives - a mix of five 256 GB and three 512 GB drives (some PROs, some EVOs). Basically, these are just left-over drives I am trying to consolidate into a single fast pool.

The system under test is an E3-1231 v3 with a Supermicro X10SLM-F and 32 GB of DDR3. Each scenario was run on a fresh pool with ashift=12, encryption=off, atime=off, recordsize=128k and auto-trim disabled. The pool configuration is four 2-way mirror vdevs, set up so that the pool loses the least amount of space with the available hardware. Additionally, for scenario a), I distributed half of the pool to the on-board HBA and the other half to the LSI HBA (basically, each half of a 2-way mirror was "sitting" on a different HBA). The enclosure is a Supermicro Mobile Rack CSE-M35TQB (not SAS3-capable, but it is a SATA/SAS enclosure).
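For reference, the scenario a) pool was laid out along these lines - a sketch, not copied from my shell history; device names are placeholders (on TrueNAS Core, onboard SATA drives typically show up as ada* and drives on the LSI HBA as da*):

Code:
# sketch: four 2-way mirrors, each mirror spanning both HBAs (scenario a)
# ada* = onboard Intel PCH SATA, da* = LSI HBA (placeholder names)
zpool create -o ashift=12 Test \
    mirror ada0 da0 \
    mirror ada1 da1 \
    mirror ada2 da2 \
    mirror ada3 da3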

The testing platform is the latest stable TrueNAS (12.0-U7).

I ran the tests in each scenario on two separate datasets - both with recordsize=4k, one with sync=disabled, the other with sync=always.
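The two datasets were created roughly like this (a sketch; the names match the /mnt/Test/4k* paths in the outputs below):

Code:
# sketch: one dataset per sync mode; compression was toggled per run
zfs create -o recordsize=4k -o sync=disabled -o compression=off -o atime=off Test/4k_nosync
zfs create -o recordsize=4k -o sync=always -o compression=off -o atime=off Test/4k
# for the ZSTD runs: zfs set compression=zstd Test/4k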

Right off the bat - I am not even going to bother tabulating scenario b). The results were inferior enough compared to the other two that I didn't even write them down (from memory, I'd say something like 30-50% worse than a) and c)). That was certainly unexpected.

All fio tests are 4 KiB random write tests (bs=4k). Reported IOPS are rounded to the nearest thousand - where there is a range of values, those are the lowest/highest observed values.

Scenario | sync     | jobs | iodepth | file_size | compression | IOPS reported
---------|----------|------|---------|-----------|-------------|--------------
a)       | DISABLED | 1    | 1       | 4g        | OFF         | 73k
a)       | DISABLED | 16   | 16      | 256m      | OFF         | 82k - 161k
a)       | DISABLED | 64   | 64      | 256m      | OFF         | 168k - 183k
a)       | ALWAYS   | 1    | 1       | 4g        | OFF         | 0.5k
a)       | ALWAYS   | 1    | 1       | 4g        | ZSTD        | 0.5k
a)       | ALWAYS   | 16   | 16      | 256m      | OFF         | 16k
a)       | ALWAYS   | 16   | 16      | 256m      | ZSTD        | 15k
c)       | DISABLED | 1    | 1       | 4g        | OFF         | 72k - 103k
c)       | DISABLED | 16   | 16      | 256m      | OFF         | 131k - 184k
c)       | DISABLED | 64   | 64      | 256m      | OFF         | 81k - 122k
c)       | ALWAYS   | 1    | 1       | 4g        | OFF         | 0.5k
c)       | ALWAYS   | 16   | 16      | 256m      | OFF         | 17.7k
c)       | ALWAYS   | 16   | 16      | 256m      | ZSTD        | 8.1k (!)

Here is what I found interesting (on this system):
- combined LSI 2008 + Intel PCH performs better than the LSI 3008... in some cases even 50% better!
- with sync=always, ZSTD compression comes basically for free in scenario a) (no reason to turn it off)
- sync=always absolutely devastates IOPS, irrespective of the HBA
- the single LSI 2008 was considerably inferior to the combined LSI + Intel PCH setup. I expected some variance, but not 30+ percent.
- none of the reported IOPS came even close to the manufacturers' stated maximums
- the CPU was maxed out in any test with jobs/iodepth > 1 (Hyper-Threading was ON for all tests)

One big piece of this puzzle is latency - and I haven't tracked it rigorously (due to just too much work and too little time available). What I could visually scan was roughly comparable across scenarios, though (i.e. sub-ms latencies in most cases, except for the 64 jobs / 64 iodepth combination, where latencies would shoot up into 10 ms territory).

Questions:
0) Is there anything wrong with these tests, or are these SSDs just incapable of reaching higher IOPS in these settings? Why is this pool maxing out far below the stated IOPS limits of the HBAs? I understand that there is a lot going on under the bonnet of such a pool and that the definition of an IOPS is... relaxed... but what I found absolutely shocking are the results for sync=always, which feel tremendously poor (500 - 15,000 IOPS... for a pool of 8 SSDs and an LSI 3008?).
1) What is the purpose of having SANs with HBAs that can nominally reach hundreds of thousands or even millions of IOPS, when sync=always reduces these to a fraction of that capability?
2) What would it take to reach 100 kIOPS with sync=always? Is this NVMe-only territory?
3) Why would an LSI+PCH combination defeat an LSI 3008 that should be almost three times faster than the 2008? I expected the LSI 3008 to shred everything.
4) What tuning could I try to eke more out of this setup? I still have a hunch this should be capable of far more IOPS.
5) It seems to me the CPU is the most limiting factor in this system - if going dual-processor (say, a Supermicro X10DRH-CT), what would be the better upgrade for higher IOPS in a moderately busy pool (i.e. what is represented by fio jobs=16 and iodepth=16)?
- 2x E5-2637 v4 (3.5 GHz, quad core)
- 2x E5-2640 v4 (2.4 GHz, 10-core)
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Why would an LSI+PCH combination defeat an LSI 3008 that should be almost three times faster than the 2008? I expected the LSI 3008 to shred everything.
My first guess would be PCIe lanes... are you sure you're in one of the PCIe 3.0 slots? It seems one of the x8 slots on that board is actually a PCIe 2.0 x4 in disguise.

I'm also not sure about the CPU's support for lanes and memory, but that might be a reason why your onboard controller can do a lot better, as it will be custom-tuned for the right number of lanes.

I'm sure @jgreco will have some wisdom to add, so I won't go too broad with my comments, but I'd be curious to see if the way you're using fio is part of the problem... are you using a pre-made random sample file as the input? (If not, generating the random content during the test may be making the CPU a bottleneck for the tests... but it wouldn't have that same impact with real data.)

I would also regard components from the DDR3 era as better suited to spinning disks than to all-flash technologies.
 
Last edited:

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Yeah, I am positive it was in the PCIe 3.0 slot - you're right that there is a PCIe 2.0 x4 in disguise, but I didn't use that one.

No, I wasn't using a pre-made random sample - I wanted to minimise the chance of data being found in the ARC and inflating the results. That being said, you make a good point - it might be that the random data generation is taxing the CPU, especially with parallel jobs and high IO depths. Any thoughts on how to reconcile this? Is there any way to use a pre-made random data set and somehow turn off or reduce the ARC so that it doesn't interfere with the test results?

I suspected it might be a tech-generation problem as well - you'd regard the Supermicro X10DRH-CT as the better choice then, more capable of eking performance out of these drives?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
you'd regard the Supermicro X10DRH-CT as the better choice then, more capable of eking performance out of these drives?
Yes. I think without squeezing the PCIe lanes like you are now, it would put the HBA in a better position to get the GT/s it needs to do the job to its full capability.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Is there any way to use a pre-made random data set and somehow turn off or reduce the ARC so that it doesn't interfere with the test results?
Yes, fio allows specifying a pre-made sample file to use during its work (check out the read_iolog option and the blktrace utility to create the input file for it). You could either restart the server right before testing, or put that file in a dataset with primarycache and secondarycache set to none.
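Something along these lines, for example (the dataset name is just an example):

Code:
# example only - stop the ARC and L2ARC from caching the test dataset
zfs set primarycache=none Test/4k
zfs set secondarycache=none Test/4k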
 
Last edited:

QonoS

Explorer
Joined
Apr 1, 2021
Messages
87
2) What would it take to reach 100 kIOPS with sync=always? Is this NVMe-only territory?
Just get some enterprise-grade SSDs with proper Power-Loss Protection. ;)

Do you mind sharing your fio cmd example?
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Yes, fio allows specifying a pre-made sample file to use during its work (check out the read_iolog option and the blktrace utility to create the input file for it). You could either restart the server right before testing, or put that file in a dataset with primarycache and secondarycache set to none.

Excellent, thank you - I'll give it a shot and report back with the results.

Just get some enterprise-grade SSDs with proper Power-Loss Protection. ;)

Do you mind sharing your fio cmd example?

Certainly - here are the fio commands I've been running with. Nothing fancy - just everything set to 4k, run on a 4k dataset.

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=16 --size=256m --iodepth=16 --runtime=60 --time_based --end_fsync=1
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=64 --size=256m --iodepth=64 --runtime=120 --time_based --end_fsync=1
 

QonoS

Explorer
Joined
Apr 1, 2021
Messages
87
zpool of a single Micron 5100 Max 1.92TB :

root@truenas[~]# zpool get all micron5100max | grep 'ashift\|trim'
micron5100max ashift 12 local
micron5100max autotrim off default
root@truenas[~]# zfs get all micron5100max | grep 'encryp\|atime\|record\|sync\|compress'
micron5100max recordsize 4K default
micron5100max compression off local
micron5100max atime off local
micron5100max sync always local
micron5100max relatime off default
micron5100max encryption off default

root@truenas[/mnt/micron5100max]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
>> 7.9k IOPS

root@truenas[/mnt/micron5100max]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=16 --size=256m --iodepth=16 --runtime=60 --time_based --end_fsync=1
>> 50.9k IOPS

root@truenas[/mnt/micron5100max]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=64 --size=256m --iodepth=64 --runtime=120 --time_based --end_fsync=1
>> 44k IOPS




edit:
the same test, just with a zpool of 2x striped Intel SSD 320 80 GB SATA 3Gb/s (feat. PLP), results in:

>> 3.8k
>> 16.5k
>> 14.8k
 
Last edited:

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Thanks for these results - I am confused to the point that either I am misreading the results or all my drives are faulty.

I've tested each drive individually as the sole drive in a pool, with sync=always, on a 4k dataset, comp=off and atime=off. Tested like that in isolation, each drive gets from 300 IOPS (256 GB drives) to 950 IOPS (860 PRO 512 GB) - I am at a loss to explain these numbers. I am looking at this white paper from TrueNAS - assuming a 2-way mirror gets its write IOPS from the slowest drive, I still don't see how a 4x 2-way mirror of SSDs would get ~500 IOPS....
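(Each isolation test was set up roughly like this - a sketch, with a placeholder device name:)

Code:
# sketch of the single-drive isolation pool; "ada0" is a placeholder
zpool create -o ashift=12 Test ada0
zfs create -o recordsize=4k -o sync=always -o compression=off -o atime=off Test/4k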

Here is the test output of that 860 PRO in isolation with sync=disabled and sync=always - am I reading these results wrong? Same pool, two datasets - a >80x drop in performance.

sync=disabled => 75.5 kIOPS (CPU usage during test 33-38%)
Code:
root@truenas[/mnt/Test/4k_nosync]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.27
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=313MiB/s][w=80.2k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=7836: Fri Jan 28 23:28:50 2022
  write: IOPS=75.5k, BW=295MiB/s (309MB/s)(17.3GiB/60001msec); 0 zone resets
    slat (nsec): min=771, max=161678, avg=1562.80, stdev=3329.44
    clat (nsec): min=728, max=4555.0k, avg=11134.28, stdev=7402.02
     lat (usec): min=7, max=4557, avg=12.70, stdev= 7.80
    clat percentiles (nsec):
     |  1.00th=[  1256],  5.00th=[  8256], 10.00th=[  8640], 20.00th=[  9024],
     | 30.00th=[  9280], 40.00th=[  9536], 50.00th=[  9664], 60.00th=[  9792],
     | 70.00th=[ 10048], 80.00th=[ 10560], 90.00th=[ 13632], 95.00th=[ 17024],
     | 99.00th=[ 50944], 99.50th=[ 59136], 99.90th=[ 73216], 99.95th=[ 80384],
     | 99.99th=[103936]
   bw (  KiB/s): min=245272, max=406332, per=100.00%, avg=302475.52, stdev=30617.64, samples=119
   iops        : min=61318, max=101583, avg=75618.63, stdev=7654.38, samples=119
  lat (nsec)   : 750=0.01%, 1000=0.28%
  lat (usec)   : 2=1.38%, 4=0.07%, 10=68.04%, 20=26.43%, 50=2.73%
  lat (usec)   : 100=1.05%, 250=0.01%, 500=0.01%
  lat (msec)   : 10=0.01%
  cpu          : usr=12.39%, sys=19.37%, ctx=4648791, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4532876,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=295MiB/s (309MB/s), 295MiB/s-295MiB/s (309MB/s-309MB/s), io=17.3GiB (18.6GB), run=60001-60001msec


sync=always => 0.92 kIOPS (CPU usage during test 2-3%)
Code:
root@truenas[/mnt/Test/4k]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.27
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=3708KiB/s][w=927 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=7818: Fri Jan 28 23:27:15 2022
  write: IOPS=897, BW=3591KiB/s (3677kB/s)(210MiB/60001msec); 0 zone resets
    slat (nsec): min=899, max=14140, avg=1376.08, stdev=493.83
    clat (usec): min=114, max=6296, avg=1111.63, stdev=340.01
     lat (usec): min=115, max=6301, avg=1113.01, stdev=340.18
    clat percentiles (usec):
     |  1.00th=[  807],  5.00th=[  930], 10.00th=[  988], 20.00th=[ 1004],
     | 30.00th=[ 1029], 40.00th=[ 1057], 50.00th=[ 1074], 60.00th=[ 1090],
     | 70.00th=[ 1090], 80.00th=[ 1123], 90.00th=[ 1172], 95.00th=[ 1385],
     | 99.00th=[ 1958], 99.50th=[ 3982], 99.90th=[ 5669], 99.95th=[ 5669],
     | 99.99th=[ 5800]
   bw (  KiB/s): min= 2642, max= 3944, per=100.00%, avg=3592.66, stdev=279.96, samples=118
   iops        : min=  660, max=  986, avg=897.94, stdev=70.06, samples=118
  lat (usec)   : 250=0.01%, 1000=19.44%
  lat (msec)   : 2=79.76%, 4=0.31%, 10=0.50%
  cpu          : usr=0.21%, sys=0.29%, ctx=53869, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,53866,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=3591KiB/s (3677kB/s), 3591KiB/s-3591KiB/s (3677kB/s-3677kB/s), io=210MiB (221MB), run=60001-60001msec


I did something silly and threw all these drives into an 8-wide stripe (RAID0), then ran the tests with sync=always - I am getting around 3.5 MB/s. For an 8-wide stripe. Of SSDs! This can't be right.
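(The stripe was created roughly like this - a sketch with placeholder device names; the actual fio output follows:)

Code:
# sketch: 8-wide stripe, no redundancy - for testing only!
zpool create -o ashift=12 Test ada0 ada1 ada2 ada3 da0 da1 da2 da3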

Code:
root@truenas[/mnt/Test/4k]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.27
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=3384KiB/s][w=846 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=6698: Fri Jan 28 23:22:28 2022
  write: IOPS=824, BW=3297KiB/s (3376kB/s)(193MiB/60001msec); 0 zone resets
    slat (nsec): min=907, max=716118, avg=1627.10, stdev=3364.64
    clat (usec): min=359, max=15315, avg=1210.23, stdev=1017.94
     lat (usec): min=360, max=15316, avg=1211.86, stdev=1018.02
    clat percentiles (usec):
     |  1.00th=[  383],  5.00th=[  416], 10.00th=[  441], 20.00th=[  510],
     | 30.00th=[  676], 40.00th=[  824], 50.00th=[  865], 60.00th=[  938],
     | 70.00th=[ 1221], 80.00th=[ 1647], 90.00th=[ 2769], 95.00th=[ 2868],
     | 99.00th=[ 6456], 99.50th=[ 7046], 99.90th=[ 8717], 99.95th=[ 8848],
     | 99.99th=[12780]
   bw (  KiB/s): min= 2809, max= 3557, per=100.00%, avg=3301.52, stdev=147.86, samples=118
   iops        : min=  702, max=  889, avg=825.08, stdev=37.00, samples=118
  lat (usec)   : 500=19.20%, 750=15.98%, 1000=26.01%
  lat (msec)   : 2=25.09%, 4=12.38%, 10=1.32%, 20=0.02%
  cpu          : usr=0.26%, sys=0.31%, ctx=49478, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,49461,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=3297KiB/s (3376kB/s), 3297KiB/s-3297KiB/s (3376kB/s-3376kB/s), io=193MiB (203MB), run=60001-60001msec


Any diagnostic output I could provide? Either I am overlooking something stupidly simple, or usable all-flash pools are a fool's dream on DDR3 platforms (which, to be honest, I have difficulty accepting).

EDIT: Enabling autotune didn't do anything to improve the sync performance. I might chuck in an Optane 900p to use as a SLOG, but that feels absurd.
 
Last edited:

QonoS

Explorer
Joined
Apr 1, 2021
Messages
87
Thanks for these results - I am confused to the point that either I am misreading the results or all my drives are faulty.
[...]
Here is the test output of that 860 PRO in isolation with sync=disabled and sync=always - am I reading these results wrong? Same pool, two datasets - a >80x drop in performance.
[...]
sync=disabled => 75.5 kIOPS (CPU usage during test 33-38%)
[...]
sync=always => 0.92 kIOPS (CPU usage during test 2-3%)

These results are to be expected, and the explanation is quite simple. You are just missing some key information. ;)

For an SSD without Power-Loss Protection (PLP), sync writing effectively disables any form of caching. Why? Sync writing ensures that data is written to NAND: even if power fails, the data is safely stored. Contrary to that, SSDs with PLP do cache - via their onboard DRAM - and are thus a lot faster. They are able to do so because they have capacitors that hold just enough energy to power themselves and write the data from volatile DRAM to non-volatile NAND in the event of a power failure. And so the data is also safely stored.
 
Last edited:

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
As surprised as I am by this, it may be the explanation I was looking for.

What I find immensely surprising is that with sync=always, NAND flash behaves basically like a platter drive - which I am still having a hard time wrapping my head around... how can the ZIL still be so slow on an all-NAND pool? I've tried adding one of these consumer SSDs as a SLOG and got even worse performance than without it. My understanding was that PLP serves to provide the guarantee of no data loss, but that even without it, NAND would still be at least an order of magnitude faster than spinners... but that seems not to be the case. It wasn't until I added Optane that things actually started to perform well enough to be considered useful.

Here is the same pool before (0.6 kIOPS = 2.3 MB/s) and after (14 kIOPS = 76.7 MB/s) adding an Optane 900p as a SLOG.
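(Attaching the SLOG was along these lines - a sketch; the Optane's device name is a placeholder:)

Code:
# sketch: add the Optane 900p as a dedicated log (SLOG) vdev
# "nvd0" is a placeholder for the Optane's device name on TrueNAS Core
zpool add Test log nvd0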

Code:
// without Optane SLOG
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1

Jobs: 1 (f=1): [w(1)][100.0%][w=2362KiB/s][w=590 IOPS][eta 00m:00s]
  write: IOPS=579, BW=2318KiB/s (2373kB/s)(136MiB/60001msec); 0 zone resets...

Run status group 0 (all jobs):
  WRITE: bw=2318KiB/s (2373kB/s), 2318KiB/s-2318KiB/s (2373kB/s-2373kB/s), io=136MiB (142MB), run=60001-60001msec


Code:
// with Optane SLOG
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1

Jobs: 1 (f=1): [w(1)][100.0%][w=54.8MiB/s][w=14.0k IOPS][eta 00m:00s]
  write: IOPS=18.7k, BW=73.1MiB/s (76.7MB/s)(4387MiB/60001msec); 0 zone resets...

Run status group 0 (all jobs):
  WRITE: bw=73.1MiB/s (76.7MB/s), 73.1MiB/s-73.1MiB/s (76.7MB/s-76.7MB/s), io=4387MiB (4600MB), run=60001-60001msec


Here are a few other results with higher jobs/iodepths (16/16 => 96 kIOPS, 345 MB/s ... 64/64 => 82 kIOPS, 279 MB/s - probably due to a CPU bottleneck).

Code:
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16

Jobs: 16 (f=16): [w(16)][100.0%][w=375MiB/s][w=96.0k IOPS][eta 00m:00s]

Run status group 0 (all jobs):
  WRITE: bw=329MiB/s (345MB/s), 20.3MiB/s-20.9MiB/s (21.3MB/s-21.9MB/s), io=19.3GiB (20.7GB), run=60001-60001msec


Code:
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=64

Jobs: 46 (f=18): [f(1),E(2),f(1),E(1),f(1),E(1),f(1),_(1),E(2),f(2),E(2),f(2),E(1),f(2),E(2),f(2),E(4),f(1),E(1),f(3),E(1),f(10),F(1),f(2),F(1),w(16)][100.0%][w=316MiB/s][w=80.9k IOPS][eta 00m:00s]

Run status group 0 (all jobs):
  WRITE: bw=266MiB/s (279MB/s), 4227KiB/s-4321KiB/s (4328kB/s-4424kB/s), io=31.2GiB (33.5GB), run=120001-120008msec
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
One additional data point - consider this (problematic) pool:

[attached screenshot: pool layout - two 4-wide RAID-Z1 vdevs]


Again, surprisingly (to me), this pool behaves either the same as, or with only a modest performance penalty compared to, the pool with four 2-way mirrors.
Especially surprising is that this RAID-Z1 pool significantly outperformed the striped mirror in the initial single-job test??

Results:
Pool               | Jobs | IO depth | IOPS                            | Mirror IOPS perf delta
-------------------|------|----------|---------------------------------|-----------------------
4 x 2-way mirror   | 1    | 1        | 19.9 kIOPS (initially 14 kIOPS) | N/A
4 x 2-way mirror   | 16   | 16       | 96 kIOPS                        | N/A
4 x 2-way mirror   | 64   | 64       | 82 kIOPS                        | N/A
2x RAID-Z (above)  | 1    | 1        | 19.7 kIOPS                      | - 1.1 %
2x RAID-Z (above)  | 16   | 16       | 86.7 kIOPS                      | - 9.7 %
2x RAID-Z (above)  | 64   | 64       | 47.4 kIOPS                      | - 43.2 %

Note - the large perf delta at 64/64 is at least partially attributable to a CPU bottleneck - as @sretalla had pointed out before, I am generating random data at test time, so the CPU is taxed with a load non-existent in normal operations.

Test excerpts:
Code:
// jobs = 1, iodepth = 1
root@truenas[/mnt/Test/4k]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
Jobs: 1 (f=1): [w(1)][100.0%][w=76.9MiB/s][w=19.7k IOPS][eta 00m:00s]

Run status group 0 (all jobs):
  WRITE: bw=72.1MiB/s (75.6MB/s), 72.1MiB/s-72.1MiB/s (75.6MB/s-75.6MB/s), io=4328MiB (4538MB), run=60001-60001msec

// jobs = 16, iodepth = 16
root@truenas[/mnt/Test/4k]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=16 --size=256m --iodepth=16 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
Jobs: 16 (f=16): [w(16)][98.4%][w=339MiB/s][w=86.7k IOPS][eta 00m:01s]

Run status group 0 (all jobs):
  WRITE: bw=312MiB/s (327MB/s), 19.1MiB/s-19.8MiB/s (20.1MB/s-20.7MB/s), io=18.3GiB (19.6GB), run=60001-60003msec

// jobs = 64, iodepth = 64
root@truenas[/mnt/Test/4k]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=64 --size=256m --iodepth=64 --runtime=120 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=64
Jobs: 64 (f=64): [w(64)][99.2%][w=185MiB/s][w=47.4k IOPS][eta 00m:01s]

Run status group 0 (all jobs):
  WRITE: bw=227MiB/s (238MB/s), 3601KiB/s-3684KiB/s (3687kB/s-3773kB/s), io=26.6GiB (28.6GB), run=120001-120006msec
 
Last edited:

QonoS

Explorer
Joined
Apr 1, 2021
Messages
87
Yes, Optane is just another league, since it is not NAND but 3D XPoint technology, which offers extremely low latencies.

For completeness' sake: if an application/use case requires doing sync writes, there are 3 ways to improve performance:
  • Optane/3D XPoint, alone or additionally as a LOG device
  • NAND with PLP, alone or additionally as a LOG device
  • configuring sync=disabled, which transmutes any sync write into a standard (cached) write - this is not recommended of course, but with a UPS and regular backups it is a viable alternative (see the sketch below)
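A sketch of that last option (the dataset name is just an example):

Code:
# example only: per-dataset, turns sync writes into standard async (cached) writes
zfs set sync=disabled Test/4k
# revert with: zfs set sync=standard Test/4k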


RAID-Z always hurts performance. It's recommended only where pool capacity matters. If performance matters, the choice is mirror vdevs.
(And I have no clue about the +40.7%... maybe retest?)

PS: Great testing so far - in my eyes the best way to make sure. :)
 
Last edited:

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Thanks - you're right, that 14 kIOPS was a test anomaly - re-testing the low-QD case on the mirror pool yielded close to 20 kIOPS (again with a few outliers, which don't seem to be representative). It now makes more sense to me as well - a performance drop that increases proportionally with QD and the number of parallel jobs is what I would've expected in the mirror vs RAID-Zx comparison.

Doing really is the best way of learning - I was convinced that PLP refers to the supercaps only (not the DRAM/SLC cache these supercaps back) and that the actual speed at which a system with an all-NAND pool responds with an ACK would still trounce HDDs considerably. I never would've guessed that a ZIL-on-NAND pool is still such a bottleneck. Who knows if I would've picked up on this subtlety without these tests and your insight.

Thank you again, and thanks @sretalla for the insights around in-flight random data generation!
 