TrueNAS Scale NVMe Performance Scaling

NickF (Guru) · Joined Jun 12, 2014 · Messages: 763
Over the past year or so I have been obsessively exploring various aspects of ZFS performance, from large SATA arrays with multiple HBA cards to NVMe performance. In my previous testing I was leveraging cast-off enterprise servers built on Westmere, Sandy Bridge, and Ivy Bridge platforms. There were some interesting performance variations between those platforms, and I was determined to see what a more modern platform would do. Most of my testing indicated that ZFS was being bottlenecked by the platform it was running on, with high CPU usage present during testing.

I recently picked up an AMD EPYC 7282, a Supermicro H12SSL-I, and 256GB of DDR4-2133 RAM. While the RAM is certainly not the fastest, I now have a lot of PCIe lanes to play with, and I don't have to worry as much about which slot goes to which CPU.

For today's adventure, I tested 4 and 8 Samsung 9A1 512GB SSDs (PCIe Gen 4) in two PLX PEX8747-based (PCIe Gen 3) Linkreal quad M.2 adapters, as well as two bifurcation-based Linkreal quad M.2 adapters, on that new platform. My goal was to determine the performance differences between relying on motherboard bifurcation and relying on a PLX switch chip. I also wanted to test the performance impact of compression and deduplication on NVMe drives in both configurations. Testing was done using fio in a mixed random read/write workload:
fio --bs=128k --direct=1 --directory=/mnt/newprod/ --gtod_reduce=1 --ioengine=posixaio --iodepth=32 --group_reporting --name=randrw --numjobs=16 --ramp_time=10 --runtime=60 --rw=randrw --size=256M --time_based

I hope this helps some folks. :)

The first set of tests was done on a single card, with (4) 9A1s in a 2-VDEV mirrored configuration, repeated on each of the two different cards.
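
For reference, the pool layout for this round looked conceptually like the sketch below (an illustration only: the pool name comes from the fio path above, the device names are placeholders, and the compression/dedup properties were toggled between runs; lz4 is an assumption):

zpool create newprod mirror nvme0n1 nvme1n1 mirror nvme2n1 nvme3n1
zfs set compression=lz4 newprod    # or compression=off for the "No Compression" runs
zfs set dedup=on newprod           # or dedup=off for the "No Dedupe" runs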

Test Setup | Read BW MiB/s (min / max / avg / stdev) | Read IOPS (min / max / avg / stdev) | Write BW MiB/s (min / max / avg / stdev) | Write IOPS (min / max / avg / stdev)
Bifurcation 4x9A1 2xMirrors with Dedupe and Compression | 142 / 7679 / 1163 / 103.88 | 1140 / 61437 / 9305 / 830.99 | 176 / 7642 / 1160 / 103.45 | 1414 / 61139 / 9282 / 827.56
PLX 4x9A1 2xMirrors with Dedupe and Compression | 87 / 10065 / 1110 / 116.32 | 698 / 80264 / 8885 / 930.54 | 116 / 10064 / 1109 / 116.32 | 928 / 80264 / 8874 / 930.65
Bifurcation 4x9A1 2xMirrors with Compression No Dedupe | 952 / 1967 / 1334 / 13.6 | 7621 / 15739 / 10655 / 95.55 | 1043 / 1931 / 1332 / 11.95 | 8346 / 15452 / 10655 / 95.55
PLX 4x9A1 2xMirrors with Compression No Dedupe | 693 / 2031 / 1114 / 14.38 | 5548 / 16252 / 8918 / 115.05 | 777 / 2033 / 1112 / 13.29 | 6216 / 16264 / 8898 / 106.33
Bifurcation 4x9A1 2xMirrors No Compression No Dedupe | 835 / 2471 / 1578 / 21.13 | 6686 / 6686 / 19770 / 168.97 | 857 / 2387 / 1579 / 20.02 | 6856 / 19098 / 12632 / 160.15
PLX 4x9A1 2xMirrors No Compression No Dedupe | 692 / 1654 / 1091 / 13.01 | 5542 / 13232 / 8734 / 104.04 | 764 / 1574 / 1089 / 11.58 | 6114 / 12598 / 8716 / 92.66


The second set of tests was done with two matching cards and (8) 9A1s in a 4-VDEV mirrored configuration. The mirrors span between the two cards, so if one entire card were to fail, the pool would remain intact.
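
A sketch of how mirrors spanning the two cards might be laid out (placeholder device names; assume nvme0-3 sit on one card and nvme4-7 on the other, so each mirror pairs one drive from each card):

zpool create newprod \
  mirror nvme0n1 nvme4n1 \
  mirror nvme1n1 nvme5n1 \
  mirror nvme2n1 nvme6n1 \
  mirror nvme3n1 nvme7n1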

Test Setup | Read BW MiB/s (min / max / avg / stdev) | Read IOPS (min / max / avg / stdev) | Write BW MiB/s (min / max / avg / stdev) | Write IOPS (min / max / avg / stdev)
Bifurcation 8x9A1 4xMirrors Cross Cards with Dedupe and Compression | 131 / 7641 / 1207 / 106.91 | 1055 / 61131 / 9658 / 855.27 | 171 / 7661 / 1204 / 106.68 | 1372 / 61294 / 9636 / 853.4
PLX 8x9A1 4xMirrors Cross Cards with Dedupe and Compression | 285 / 6266 / 1273 / 89.59 | 2910 / 50290 / 10169 / 713.47 | 363 / 6286 / 1271 / 89.19 | 2910 / 50290 / 10169 / 89.19
Bifurcation 8x9A1 4xMirrors Cross Cards with Compression No Dedupe | 1063 / 2092 / 1496 / 14.9 | 8506 / 16743 / 11968 / 108.81 | 1187 / 1979 / 1494 / 13.6 | 9500 / 15834 / 11959 / 108.81
PLX 8x9A1 4xMirrors Cross Cards with Compression No Dedupe | 1074 / 2152 / 1519 / 14.8 | 8594 / 17217 / 12155 / 118.4 | 1241 / 2009 / 1518 / 13.29 | 9930 / 16075 / 12147 / 106.31
Bifurcation 8x9A1 4xMirrors Cross Cards No Compression No Dedupe | 1664 / 3476 / 2412 / 22.98 | 13316 / 27809 / 19298 / 183.85 | 1741 / 3384 / 2415 / 172.6 | 13926 / 27077 / 19323 / 172.6
PLX 8x9A1 4xMirrors Cross Cards No Compression No Dedupe | 2010 / 3718 / 2811 / 23.32 | 16082 / 29747 / 22490 / 186.51 | 2073 / 3594 / 2815 / 21.73 | 16588 / 28758 / 22524 / 173.81

Some bar graphs:

[Four bar-graph images attached.]

Some interesting conclusions can be drawn. :) The narrower 4-disk pools seem to perform better with the bifurcation-based solution, likely because these are PCIe Gen 4 drives sitting behind a Gen 3 PLX switch. However, as the pool gets wider, the overhead of relying on the mainboard to do the switching seems to grow, and the PLX-based solution seems to deliver better performance.
 


justjasch (Dabbler) · Joined May 8, 2022 · Messages: 20
Sorry to say (not meant badly): you did a lot of work, but I don't think this is the way to do it.
You measured a lot of things, but not really PCIe performance.
In all of the tests with compression and dedupe, you are only measuring ZFS performance (CPU, memory, and other bottlenecks).

You should just make a 4-way stripe on both cards with everything turned off (IMHO).
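
For example, something along these lines (pool and device names are placeholders):

zpool create testpool nvme0n1 nvme1n1 nvme2n1 nvme3n1   # plain 4-way stripe, no redundancy
zfs set compression=off testpool
zfs set dedup=off testpool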

From my personal tests there is no difference, and there shouldn't be any, because bifurcation adds nothing: it is handled directly in the CPU's PCIe controller, and the board has no impact on it.
It's the same as having two PCIe x4 ports instead of one x8 port on the board, just like the rest of your lanes are spread across the board.
(A few years back I also tested this with 4 Intel DC P3600 2TB U.2 drives. In the enterprise you don't really use M.2, you use U.2/U.3, which is technically the same but with hot-swap support. I compared the on-board U.2 connector vs a Supermicro AOC-SLG3-4E4R vs a Broadcom 9405W-16i Tri-Mode, and as expected it was all the same.)

Those extra switch-chip cards only draw more power (and add a negligible amount of latency). They are meant for boards without bifurcation support, or for putting 4 drives in an x8 or x4 slot, because the bridge will split the lanes (but then with only 2 or 1 lanes per drive).
That can be useful if you don't use all the drives at once: if you only access one drive, you still get full speed in an x4 slot.

wbr Alex
 

NickF (Guru) · Joined Jun 12, 2014 · Messages: 763
Hello,
Comparisons with compression and dedupe on, with compression only, and with both off are all in the charts. I thought it important to show performance scaling on a relatively modern platform so that people have some idea of what to expect in real-world use, and running the same test with both features on, with compression only, and with neither is how I chose to illustrate that scaling. If I wanted to test raw NVMe performance and review SSDs, I wouldn't be using ZFS, so I understand the bottleneck is not the drives themselves.

I would love to have done similar tests with U.2 drives, but my budget is obviously not unlimited. I was able to get the M.2 drives for ~$60 each, whereas the P4610s I have cost $200 each.

As far as the PLX chips go, at least in my testing it does seem that once you go beyond 4 drives, the cards with PLX chips performed more consistently. This is testing I have not seen anyone else do, and it is valuable information for someone speccing out a build for an actual production workload, as opposed to me, some guy on the internet.

Similar cards, like this one:
https://www.aliexpress.us/item/2255...00030357810338!sea&curPageLogUid=DG277n4nDrsP
[image attached]

exist to fill that niche, and I would bet that comparing its scaling against something like this:
https://www.aliexpress.us/item/2255...00000025709768!sea&curPageLogUid=D0173VXaAKm4
[image attached]

would prove that PLX chips still have a place in the market, even in a world with bifurcation. I will test that theory someday, but I can't afford over $3,000 in SSDs to explore it, so I chose an option that was affordable to show my friends here what is possible.
 

mgerdts (Cadet) · Joined Nov 11, 2022 · Messages: 3
One thing to watch for with the PM9A1 is that its write performance is wildly different when it has at least 30% of its NAND blocks erased than when it doesn't. The difference is on the order of 5 GB/s vs 2 GB/s. This is because it will use 30% of the capacity as pseudo TLC (pTLC), which allows you to burst-write up to 10% of the drive's capacity at the higher rate. Over time, data written to pTLC should be destaged to TLC if the drive remains idle.

You can ensure that your drives stay in this higher-performance mode by creating a partition that uses at least 30% of the space and ensuring that space is, and stays, erased. On Linux I use blkdiscard to erase the partition. I'm not sure what the equivalent tool is on FreeBSD.
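
On Linux that could look roughly like this (a sketch only; the device name and the 70/30 split are examples, and these commands are destructive, so only run them on an empty drive):

parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart zfs 0% 70%          # the part handed to ZFS
parted -s /dev/nvme0n1 mkpart reserved 70% 100%   # ~30% kept aside
blkdiscard /dev/nvme0n1p2                         # erase the reserved partition so its blocks stay free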

I think that ZFS now has an auto trim option. That may be helpful as well.
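
On OpenZFS that looks like this (pool name is a placeholder):

zpool set autotrim=on tank   # pass discards down to the SSDs as blocks are freed
zpool trim tank              # or kick off a one-time manual trim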

Even with these attempts to ensure pTLC is available, writing more than 10% of the drive's capacity quickly will cause the drive to start writing to TLC, which will be slower than pTLC.
 

Etorix (Wizard) · Joined Dec 30, 2020 · Messages: 2,134
Some interesting conclusions can be drawn. :) The narrower 4-disk pools seem to perform better with the bifurcation-based solution, likely because these are PCIe Gen 4 drives sitting behind a Gen 3 PLX switch. However, as the pool gets wider, the overhead of relying on the mainboard to do the switching seems to grow, and the PLX-based solution seems to deliver better performance.
Thanks! This is interesting food for thought.

Do you know of PLX switching cards for PCIe 4.0? The Gen 4 PLX chips exist, but I have not found equivalents to the Linkreal cards for PCIe 4.0.
 

NickF (Guru) · Joined Jun 12, 2014 · Messages: 763
Thanks! This is interesting food for thought.

Do you know of PLX switching cards for PCIe 4.0? The Gen 4 PLX chips exist, but I have not found equivalents to the Linkreal cards for PCIe 4.0.
Not that I have seen, at least not yet.

Linus posted a video on this topic recently.

That solution exists, but pre-order pricing is $2,800.

And, of course, the Honey Badger from Liqid (not to be confused with @HoneyBadger):
The Liqid Element LQD4500

But I have no idea what pricing is on that card. I'd bet it's more than the Apex one, and it's not even as "good".

We're going to have to wait a little longer for Shenzhen.
 