
NVME ZFS Pool over 100GbE

kyaky

Newbie
Joined
Feb 25, 2021
Messages
2
Hi Community,

I am having some issues with NVMe ZFS pool performance over 100GbE.

The setup:

The Server:
TrueNAS 12.0 U2.1
CPU: AMD Ryzen 9 3900X
Motherboard: X570 Aorus
Memory: 128 GB
Mellanox 100G SFP NIC
3x 2 TB PCIe 4.0 NVMe (each benchmarks at >3500 MB/s read/write)


The Client:
CPU: i7 9th Gen
Memory: 16 GB
Mellanox 100G SFP NIC
1x NVMe (benchmarks at >3500 MB/s read/write)

Network bench using iperf3 between client and server:

30~40 Gbit/s (so there is clearly a bottleneck somewhere between the two systems, but even that should theoretically allow 30/8 to 40/8 = 3.75~5 GByte/s)

[screenshots: iperf3_1.png, iperf3_2.png]


The pool test:

pool1:
1x NVMe by itself.

pool2:
2x NVMe in a stripe, i.e.
vdev1: 1x NVMe
vdev2: 1x NVMe


Both tested with write sync off (sync=disabled) and MTU 9000.
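
(For reference, the equivalent CLI commands look roughly like this; mce0 is just an example interface name, Mellanox cards can also show up as mlxen0 depending on the driver:)

zfs set sync=disabled pool1        # "write sync off" on both test pools
zfs set sync=disabled pool2
ifconfig mce0 mtu 9000             # jumbo frames; normally set under Network -> Interfaces in the UI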

pool1:
[screenshot: pool_1.png]


pool2:
[screenshot: pool_2.png]



As you can see, there is no performance gain at all for pool2. I also tested 3 NVMe drives in a stripe and still saw no gain.

Shouldn't pool2 perform twice as fast as pool1 and saturate the available (bottlenecked) 100G link (30~40 Gbit/s)?

In a real-world copy via a Samba share:

server -> client
60 GB file
pool1: 1.5 GByte/s
pool2: 1.5 GByte/s

To test whether that is a per-pool limit, I ran:

server -> client
2x 60 GB files from pool1 + pool2 at the same time: 800 MB/s + 800 MB/s

So it seems that on this system the total outbound traffic is limited to about 1.5~1.6 GByte/s.

With the default FreeBSD tunables the results were worse; the tests above were done after some tweaks to the tunables.
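
(The tweaks were along the lines of the usual FreeBSD socket/TCP buffer tunables, added as "sysctl" entries under System -> Tunables; the values below are illustrative examples rather than an exact record of what I set:)

kern.ipc.maxsockbuf=16777216        # raise the maximum socket buffer size
net.inet.tcp.sendbuf_max=16777216   # let TCP send buffers auto-grow further
net.inet.tcp.recvbuf_max=16777216   # same for receive buffers
net.inet.tcp.sendspace=262144       # larger initial send window
net.inet.tcp.recvspace=262144       # larger initial receive window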


Is this a bottleneck with TrueNAS or a bottleneck of my existing hardware? How can I improve this?


Many Thanks
 

ChrisRJ

Senior Member
Joined
Oct 23, 2020
Messages
336
With that network bandwidth you are in a league where details matter much more than usual. I have no practical experience with this particular question, but a fair amount with performance issues in general. What you need to do is find the bottlenecks (in many cases there is more than one) that contribute to the overall result. That usually means "strategically" replacing various parts of the setup (only one at a time!) and seeing what happens. So unless someone has the knowledge to point directly at a certain component, it may mean purchasing or borrowing some additional hardware.

As to your hardware, I would think that a consumer system is not an ideal choice, because you are probably the first person in the universe to run that exact combination. Also, this kind of network bandwidth does not usually get deployed, at least today, to connect a single client directly; rather, you would e.g. connect a bunch of beefy ESXi servers at 10 Gbps or 25 Gbps each. In other words, it is usually not trivial to put enough load on a system once we are in "serious business" territory.

My overall advice would be not to rely solely on specific recommendations from people, because that will be a hit-or-miss game. Rather, read up on how to approach such a situation in a structured way (e.g. what is a good order in which to look at individual components to narrow down the problem). You will not only learn a great deal in general, but also come to truly understand how your system behaves.

Lastly, please read your post again and think about whether or not it contains enough information for someone to provide help that is not pure guess-work. Hint: the forum rules also contain something about that :smile:.

Good luck!
 

ehsab

Member
Joined
Aug 2, 2020
Messages
45
How about trying more than one stream at the same time (iperf3 -c <ip> -P 4)? My bet is that your 100G is actually 4x 25Gbit.
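
Something like this, run from the client against the TrueNAS box (the address is just a placeholder):

iperf3 -s                           # on the server
iperf3 -c 10.0.0.10 -t 30           # single-stream baseline
iperf3 -c 10.0.0.10 -t 30 -P 4      # four parallel streams
iperf3 -c 10.0.0.10 -t 30 -P 4 -R   # reverse direction, server transmits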
 

Ericloewe

Not-very-passive-but-aggressive
Moderator
Joined
Feb 15, 2014
Messages
17,036
Some scattered thoughts:

Samba is single-threaded per connection. 25-30 Gb/s sounds like something I'd expect out of a single SMB connection using Samba. The client side of things is also bound to be limited. That is most likely what's behind this:
So it seems that on this system the total outbound traffic is limited to about 1.5~1.6 GByte/s.
Try using more clients at once and see what happens.
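
For example, pull large files from two or three client machines at once; a quick-and-dirty way from any spare Linux or FreeBSD box (share, user and file names are placeholders):

smbclient //truenas/pool2 -U someuser -c 'get bigfile1 /dev/null'    # client A
smbclient //truenas/pool2 -U someuser -c 'get bigfile2 /dev/null'    # client B, started at the same time

If the aggregate goes well past 1.5 GByte/s, the limit is per connection, not the pool.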

Three PCIe 3.0 NVMe SSDs are pretty much in the same ballpark as a 50 GbE connection in terms of theoretical maximum. Once you account for overheads, I have some doubts that your setup can sustain the "bottlenecked 100 GbE" speeds for long.

Mellanox 100G SFP
I'm not sure how good the Mellanox FreeBSD driver is. That can have a huge impact.

Memory: 128G
For optimum use of 100+ GbE, having most stuff in memory is probably a requirement. L2ARC is probably a hindrance if you're doing mostly reads and an ultra-low latency SLOG (think Optane or even Optane on the memory bus) is probably a necessity if you plan to do sync writes.
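
If you do go the sync-write route, attaching a dedicated SLOG is a one-liner (nvd3 is an example device name; FreeBSD exposes NVMe drives as nvdX, and in TrueNAS you would normally do this from the pool's "Add Vdevs" screen instead of the CLI):

zpool add pool2 log nvd3              # single SLOG device
# or mirrored, to protect in-flight sync writes:
# zpool add pool2 log mirror nvd3 nvd4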
 

Ericloewe

Not-very-passive-but-aggressive
Moderator
Joined
Feb 15, 2014
Messages
17,036
Sidenote: Don't those NICs kinda cost more than the rest of the systems they're in? I presume this is some sort of dev, test, lab setup thing.
 

jpmomo

Junior Member
Joined
Apr 13, 2021
Messages
12
I am trying to do similar benchmarking and have posed similar questions. There does not seem to be a lot of (or any) empirical data with regard to PCIe Gen 4 systems. I have a variety of motherboards, CPUs, SSDs and NIC cards. Any time I have a PCIe 4.0 device in the mix, I get a lot of error messages from IPMI and TrueNAS CLI reporting. I can still configure a TrueNAS setup, but I would like to eliminate/understand these error messages.

My ultimate goal is to use several PCIe Gen 4 SSDs (M.2 or other) in a striped configuration backed by a lot of fast 3200 MHz RAM, served over SMB across a proven (verified with separate testing software) 100 Gbps network. The benchmarking software I am using can run multiple parallel SMB streams with configurable file sizes. With a verified PCIe 4.0 end-to-end system (using the latest AMD EPYC Rome or Milan CPUs), there should not be any bottleneck on the hardware side. I am using Mellanox ConnectX-5 and ConnectX-6 NICs, both PCIe 4.0 compliant, dual-port 100GbE.

It would be helpful to get confirmation of any working TrueNAS configs with PCIe 4.0 components. Thanks for any help as we try to push the envelope with this amazing software!
 

tech.guru

Newbie
Joined
May 6, 2021
Messages
2
You need to confirm whether your TrueNAS build supports RoCEv2 for its file services. Otherwise your CPU is most likely limiting the performance you can achieve.

Try testing iSCSI over RDMA (iSER) and SMB Direct, if supported. You might have to turn them on explicitly; they may be experimental, so read the release notes.

Also make sure RoCEv2 is supported on the switch ports for your TrueNAS. Make sure you test at layer 2 on the same switch using iperf, then move on and test the switch uplinks afterwards for L3.

You appear to be rate limited, and you aren't getting full speeds.
 

jpmomo

Junior Member
Joined
Apr 13, 2021
Messages
12
Thanks for the helpful info. All of the hardware components are RoCE v1/v2 capable. Not sure where TN stands with that protocol, though. I am using the latest version of TN.

Can you let me know the details of why the CPU might be the bottleneck? I don't see it hitting any ceiling (utilization monitoring), but there might be some other metric that I am missing. In the current test, I am using an AMD EPYC 7502 (32-core) CPU.

Just as an update, I was able to perform some basic iperf tests (the iperf that is embedded in TN). It showed around 85 Gbps without any optimizations, which appears to validate the network part of this testing. Is there a way to perform basic I/O storage testing in a similar way (meaning built into TN)? I still do not have my test software set up yet (waiting on some hardware traffic generators, FPGA based).

I would like to ensure that I have my disks configured with the optimal block/sector size. I was told that some of the disks may be set to 512 or 512e and that I should try to change them to 4K. I am not sure if there is a generic way to do that within TN (like the offset setting?).

Another update with regard to the PCIe errors: this seems to be related to specific motherboards (Tyan). The errors do not appear with the ASRock Rack motherboard that I am now testing.

Thanks again for taking the time to give some suggestions.
 

Ericloewe

Not-very-passive-but-aggressive
Moderator
Joined
Feb 15, 2014
Messages
17,036
Is there a way to perform basic I/O storage testing in a similar way (meaning built into TN)?
Depends on how fancy you want to get. If you just want to see what the full-tilt, best case looks like, dd is a decent starting point.
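
A minimal sketch, assuming the pool is mounted at /mnt/pool2 (adjust the path, and disable compression on the test dataset first, since /dev/zero compresses to nothing and inflates the write numbers):

dd if=/dev/zero of=/mnt/pool2/ddtest bs=1m count=61440    # ~60 GiB sequential write
dd if=/mnt/pool2/ddtest of=/dev/null bs=1m                # sequential read back; use a file larger than RAM so ARC caching doesn't skew it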

I am not sure if there is a generic way to do that within TN
If I'm not mistaken, this tends to need some proprietary command.
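
On the ZFS side you can at least check what sector size the pool assumed; a quick sketch (the pool name is an example, and on TrueNAS zdb usually needs to be pointed at the cache file):

zdb -U /data/zfs/zpool.cache -C pool2 | grep ashift    # ashift: 12 = 4K sectors, 9 = 512
sysctl vfs.zfs.min_auto_ashift=12                      # make ZFS assume 4K on pools created from now on

Reformatting the NVMe namespace itself to 4Kn is where the vendor/proprietary tooling comes in.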
 

tech.guru

Newbie
Joined
May 6, 2021
Messages
2
(quoting jpmomo's post above)
From what I've read, TrueNAS doesn't support RoCEv2 yet, and you really need that support for the type of disks and bandwidth that you want.

I would download a Windows Server trial and try out S2D; if that gives the performance you are asking for, you will know it's an issue with Samba. From what I have read, RDMA support (SMB Direct) in Samba still seems very experimental.
 

Morris

Neophyte
Joined
Nov 21, 2020
Messages
8
Be careful about slot use on the X570 Aorus. One of your NVMe drives will be on the chipset. If your NIC is on the chipset as well, the chipset-to-CPU link will be a bottleneck.
 

jenksdrummer

Member
Joined
Jun 7, 2011
Messages
199
Consider running shell-based / on-host testing to confirm what your disks are able to provide. If that lines up with expectations, then that gives you a baseline to work from as you troubleshoot the rest of the topology.
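
If fio is available on your build it gives more control than plain dd; a rough sketch (the dataset path, sizes and job count are assumptions, and read results will be inflated by ARC unless the working set exceeds RAM):

fio --name=seqwrite --directory=/mnt/pool2 --rw=write --bs=1M --size=16g --numjobs=4 --ioengine=posixaio --runtime=60 --time_based --group_reporting
fio --name=seqread --directory=/mnt/pool2 --rw=read --bs=1M --size=16g --numjobs=4 --ioengine=posixaio --runtime=60 --time_based --group_reporting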
 