TrueNAS Scale NVMe Performance Scaling

NickF (Guru) · Joined Jun 12, 2014 · Messages: 763
Over the past year or so I have been obsessively exploring various aspects of ZFS performance, from large SATA arrays with multiple HBA cards to NVMe performance. In my previous testing I was leveraging cast-off enterprise servers built on Westmere, Sandy Bridge, and Ivy Bridge platforms. There were some interesting performance variations between those platforms, and I was determined to see what a more modern platform would do. Most of my testing indicated that ZFS was being bottlenecked by the platform it was running on, with high CPU usage present during testing.

I recently picked up an AMD EPYC 7282, a Supermicro H12SSL-I, and 256GB of DDR4-2133 RAM. While the RAM is certainly not the fastest, I now have a lot of PCIe lanes to play with, and I don't have to worry as much about which slot goes to which CPU.

For today's adventure, I tested 4 and 8 Samsung 9A1 512GB SSDs (PCIe Gen 4) in two PLX PEX8747-based (PCIe Gen 3) Linkreal quad M.2 adapters, as well as two bifurcation-based Linkreal quad M.2 adapters, on that new platform. My goal was to determine the performance differences between relying on motherboard bifurcation and relying on a PLX switch chip. I also wanted to test the performance impact of compression and deduplication on NVMe drives in both configurations. Testing was done using fio in a mixed random read/write workload:
fio --bs=128k --direct=1 --directory=/mnt/newprod/ --gtod_reduce=1 --ioengine=posixaio --iodepth=32 --group_reporting --name=randrw --numjobs=16 --ramp_time=10 --runtime=60 --rw=randrw --size=256M --time_based

I hope this helps some folks. :)

The first set of tests was done on a single card, with (4) 9A1s in a 2-VDEV mirrored configuration, repeated on each of the two different cards.
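
For reference, the pool layout for this round looked conceptually like the sketch below (an illustration only: the pool name comes from the fio path above, the device names are placeholders, and the compression/dedup properties were toggled between runs; lz4 is an assumption):

zpool create newprod mirror nvme0n1 nvme1n1 mirror nvme2n1 nvme3n1
zfs set compression=lz4 newprod    # or compression=off for the "No Compression" runs
zfs set dedup=on newprod           # or dedup=off for the "No Dedupe" runs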

Test Setup | Read BW MiB/s (min / max / avg / stdev) | Read IOPS (min / max / avg / stdev) | Write BW MiB/s (min / max / avg / stdev) | Write IOPS (min / max / avg / stdev)
Bifurcation 4x9A1 2xMirrors with Dedupe and Compression | 142 / 7679 / 1163 / 103.88 | 1140 / 61437 / 9305 / 830.99 | 176 / 7642 / 1160 / 103.45 | 1414 / 61139 / 9282 / 827.56
PLX 4x9A1 2xMirrors with Dedupe and Compression | 87 / 10065 / 1110 / 116.32 | 698 / 80264 / 8885 / 930.54 | 116 / 10064 / 1109 / 116.32 | 928 / 80264 / 8874 / 930.65
Bifurcation 4x9A1 2xMirrors with Compression No Dedupe | 952 / 1967 / 1334 / 13.6 | 7621 / 15739 / 10655 / 95.55 | 1043 / 1931 / 1332 / 11.95 | 8346 / 15452 / 10655 / 95.55
PLX 4x9A1 2xMirrors with Compression No Dedupe | 693 / 2031 / 1114 / 14.38 | 5548 / 16252 / 8918 / 115.05 | 777 / 2033 / 1112 / 13.29 | 6216 / 16264 / 8898 / 106.33
Bifurcation 4x9A1 2xMirrors No Compression No Dedupe | 835 / 2471 / 1578 / 21.13 | 6686 / 6686 / 19770 / 168.97 | 857 / 2387 / 1579 / 20.02 | 6856 / 19098 / 12632 / 160.15
PLX 4x9A1 2xMirrors No Compression No Dedupe | 692 / 1654 / 1091 / 13.01 | 5542 / 13232 / 8734 / 104.04 | 764 / 1574 / 1089 / 11.58 | 6114 / 12598 / 8716 / 92.66


The second set of tests was done with two matching cards and (8) 9A1s in a 4-VDEV mirrored configuration. The mirrors span between the two cards, so if one entire card were to fail, the pool would remain intact.
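
A sketch of how mirrors spanning the two cards might be laid out (placeholder device names; assume nvme0-3 sit on one card and nvme4-7 on the other, so each mirror pairs one drive from each card):

zpool create newprod \
  mirror nvme0n1 nvme4n1 \
  mirror nvme1n1 nvme5n1 \
  mirror nvme2n1 nvme6n1 \
  mirror nvme3n1 nvme7n1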

Test Setup | Read BW MiB/s (min / max / avg / stdev) | Read IOPS (min / max / avg / stdev) | Write BW MiB/s (min / max / avg / stdev) | Write IOPS (min / max / avg / stdev)
Bifurcation 8x9A1 4xMirrors Cross Cards with Dedupe and Compression | 131 / 7641 / 1207 / 106.91 | 1055 / 61131 / 9658 / 855.27 | 171 / 7661 / 1204 / 106.68 | 1372 / 61294 / 9636 / 853.4
PLX 8x9A1 4xMirrors Cross Cards with Dedupe and Compression | 285 / 6266 / 1273 / 89.59 | 2910 / 50290 / 10169 / 713.47 | 363 / 6286 / 1271 / 89.19 | 2910 / 50290 / 10169 / 89.19
Bifurcation 8x9A1 4xMirrors Cross Cards with Compression No Dedupe | 1063 / 2092 / 1496 / 14.9 | 8506 / 16743 / 11968 / 108.81 | 1187 / 1979 / 1494 / 13.6 | 9500 / 15834 / 11959 / 108.81
PLX 8x9A1 4xMirrors Cross Cards with Compression No Dedupe | 1074 / 2152 / 1519 / 14.8 | 8594 / 17217 / 12155 / 118.4 | 1241 / 2009 / 1518 / 13.29 | 9930 / 16075 / 12147 / 106.31
Bifurcation 8x9A1 4xMirrors Cross Cards No Compression No Dedupe | 1664 / 3476 / 2412 / 22.98 | 13316 / 27809 / 19298 / 183.85 | 1741 / 3384 / 2415 / 172.6 | 13926 / 27077 / 19323 / 172.6
PLX 8x9A1 4xMirrors Cross Cards No Compression No Dedupe | 2010 / 3718 / 2811 / 23.32 | 16082 / 29747 / 22490 / 186.51 | 2073 / 3594 / 2815 / 21.73 | 16588 / 28758 / 22524 / 173.81

Some bar graphs:

[Four bar-graph images attached.]

Some interesting conclusions can be drawn. :) The narrower 4-disk pools seem to perform better with the bifurcation-based solution, likely because these are PCIe Gen 4 drives sitting behind a Gen 3 PLX switch. However, as the pool gets wider, the overhead of relying on the mainboard to do the switching seems to grow, and the PLX-based solution seems to deliver better performance.
 


justjasch (Dabbler) · Joined May 8, 2022 · Messages: 20
Sorry to say (not meant badly): you did a lot of work, but I don't think this is the way to do it.
You measured a lot of things, but not really PCIe performance.
In all of the tests with compression and dedupe, you are only measuring ZFS performance (CPU, memory, and other bottlenecks).

You should just make a 4-way stripe on both cards with everything turned off (IMHO).
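
For example, something along these lines (pool and device names are placeholders):

zpool create testpool nvme0n1 nvme1n1 nvme2n1 nvme3n1   # plain 4-way stripe, no redundancy
zfs set compression=off testpool
zfs set dedup=off testpool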

From my personal tests there is no difference, and there shouldn't be any, because bifurcation adds nothing: it is handled directly in the CPU's PCIe controller, and the board has no impact on it.
It's the same as having two PCIe x4 ports instead of one x8 port on the board, just like the rest of your lanes are spread across the board.
(A few years back I also tested this with 4 Intel DC P3600 2TB U.2 drives. In the enterprise you don't really use M.2, you use U.2/U.3, which is technically the same but with hot-swap support. I compared the on-board U.2 connector vs a Supermicro AOC-SLG3-4E4R vs a Broadcom 9405W-16i Tri-Mode, and as expected it was all the same.)

Those extra switch-chip cards only draw more power (and add a negligible amount of latency). They are meant for boards without bifurcation support, or for putting 4 drives in an x8 or x4 slot, because the bridge will split the lanes (but then with only 2 or 1 lanes per drive).
That can be useful if you don't use all the drives at once: if you only access one drive, you still get full speed in an x4 slot.

wbr Alex
 

NickF (Guru) · Joined Jun 12, 2014 · Messages: 763
Hello,
Comparisons with compression and dedupe on, with compression only, and with both off are all in the charts. I thought it important to show performance scaling on a relatively modern platform so that people have some idea of what to expect in real-world use, and running the same test with both features on, with compression only, and with neither is how I chose to illustrate that scaling. If I wanted to test raw NVMe performance and review SSDs, I wouldn't be using ZFS, so I understand the bottleneck is not the drives themselves.

I would love to have done similar tests with U.2 drives, but my budget is obviously not unlimited. I was able to get the M.2 drives for ~$60 each, whereas the P4610s I have cost $200 each.

As far as the PLX chips go, at least in my testing it does seem that once you go beyond 4 drives, the cards with PLX chips performed more consistently. This is testing I have not seen anyone else do, and it is valuable information for someone speccing out a build for an actual production workload, as opposed to me, some guy on the internet.

Similar cards, like this one:
https://www.aliexpress.us/item/2255...00030357810338!sea&curPageLogUid=DG277n4nDrsP
[image attached]

exist to fill that niche, and I would bet that comparing its scaling against something like this:
https://www.aliexpress.us/item/2255...00000025709768!sea&curPageLogUid=D0173VXaAKm4
[image attached]

would prove that PLX chips still have a place in the market, even in a world with bifurcation. I will test that theory someday, but I can't afford over $3,000 in SSDs to explore it, so I chose an option that was affordable to show my friends here what is possible.
 

mgerdts (Cadet) · Joined Nov 11, 2022 · Messages: 3
One thing to watch for with the PM9A1 is that its write performance is wildly different when it has at least 30% of its NAND blocks erased than when it doesn't. The difference is on the order of 5 GB/s vs 2 GB/s. This is because it will use 30% of the capacity as pseudo TLC (pTLC), which allows you to burst-write up to 10% of the drive's capacity at the higher rate. Over time, data written to pTLC should be destaged to TLC if the drive remains idle.

You can ensure that your drives stay in this higher-performance mode by creating a partition that uses at least 30% of the space and ensuring that space is, and stays, erased. On Linux I use blkdiscard to erase the partition. I'm not sure what the equivalent tool is on FreeBSD.
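
On Linux that could look roughly like this (a sketch only; the device name and the 70/30 split are examples, and these commands are destructive, so only run them on an empty drive):

parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart zfs 0% 70%          # the part handed to ZFS
parted -s /dev/nvme0n1 mkpart reserved 70% 100%   # ~30% kept aside
blkdiscard /dev/nvme0n1p2                         # erase the reserved partition so its blocks stay free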

I think that ZFS now has an auto trim option. That may be helpful as well.
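
On OpenZFS that looks like this (pool name is a placeholder):

zpool set autotrim=on tank   # pass discards down to the SSDs as blocks are freed
zpool trim tank              # or kick off a one-time manual trim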

Even with these attempts to ensure pTLC is available, writing more than 10% of the drive's capacity quickly will cause the drive to start writing to TLC, which will be slower than pTLC.
 

Etorix (Wizard) · Joined Dec 30, 2020 · Messages: 2,134
Some interesting conclusions can be drawn. :) The narrower 4-disk pools seem to perform better with the bifurcation-based solution, likely because these are PCIe Gen 4 drives sitting behind a Gen 3 PLX switch. However, as the pool gets wider, the overhead of relying on the mainboard to do the switching seems to grow, and the PLX-based solution seems to deliver better performance.
Thanks! This is interesting food for thought.

Do you know of PLX switching cards for PCIe 4.0? The Gen 4 PLX chips exist, but I have not found equivalents to the Linkreal cards for PCIe 4.0.
 

NickF (Guru) · Joined Jun 12, 2014 · Messages: 763
Thanks! This is interesting food for thought.

Do you know of PLX switching cards for PCIe 4.0? The Gen 4 PLX chips exist, but I have not found equivalents to the Linkreal cards for PCIe 4.0.
Not that I have seen, at least not yet.

Linus posted a video on this topic recently.

That solution exists, but pre-order pricing is $2,800.

And, of course, the Honey Badger from Liqid (not to be confused with @HoneyBadger):
The Liqid Element LQD4500

But I have no idea what pricing is on that card. I'd bet it's more than the Apex one, and it's not even as "good".

We're going to have to wait a little longer for Shenzhen.
 