NVMe ZFS Pool over 100GbE

kyaky

Cadet
Joined
Feb 25, 2021
Messages
2
Hi Community,

I am having some issues with NVMe ZFS pool performance over 100GbE.

The setup:

The Server:
TrueNAS 12.0 U2.1
CPU: AMD Ryzen 9 3900X
Motherboard: X570 Aorus
Memory: 128 GB
Mellanox 100G SFP
3x 2 TB PCIe 4.0 NVMe (benchmarked at >3500 MB/s r/w)


The Client:
CPU: Intel i7 (9th gen)
Memory: 16 GB
Mellanox 100G SFP
1x NVMe (benchmarked at >3500 MB/s r/w)

Network bench using iperf3 between client and server:

30~40 Gbit/s (there is clearly a bottleneck somewhere between the two systems, but even so that should theoretically allow 30/8~40/8 = 3.75~5 GByte/s)

[screenshot: iperf3_1.png]

[screenshot: iperf3_2.png]


The pool test:

pool1:
1x nvme by itself.

pool2:
2x nvme stripe
or
vdev1: 1x nvme
vdev2: 1x nvme


Sync writes off, MTU 9000.

pool1:
[screenshot: pool_1.png]


pool2:
[screenshot: pool_2.png]



As you can see, there is no performance gain at all for pool2. I also tested 3 NVMe drives in a stripe and still saw no gain.

Shouldn't pool2 perform 2x faster than pool1 and saturate the available (bottlenecked) 100G bandwidth of 30~40 Gbit/s?

In a real-world copy via a Samba share:

server -> client
60 GB file
pool1: 1.5 GByte/s
pool2: 1.5 GByte/s

To test whether that is a single-pool limit, I tried:

server -> client
2x 60 GB files from pool1 + pool2 at the same time: 800 MB/s + 800 MB/s

So it seems the total traffic this system can transfer out is limited to about 1.5~1.6 GByte/s (roughly 12~13 Gbit/s, well below what iperf3 shows the network can do).

With the default FreeBSD tunables the results were worse; the tests above were done after some tweaks to the tunables.


Is this a bottleneck with TrueNAS or a bottleneck of my existing hardware? How can I improve this?


Many Thanks
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
With that network bandwidth you are in a league where details matter much more than usual. I have no practical experience with this particular question, but a fair amount with performance issues in general. What you need to do is find the bottlenecks (in many cases there is more than one) that contribute to the overall result. This usually means "strategically" replacing various parts of the setup (only one at a time!) and seeing what happens. So unless someone has the knowledge to point directly at a certain component, this will mean purchasing or borrowing some additional hardware.

As to your hardware, I would think that a consumer system is not an ideal choice, because you are probably the first person in the universe to run that combination. Also, this kind of network bandwidth does not, at least today, usually get deployed to connect a single client directly. Rather, you would e.g. connect a bunch of beefy ESXi servers with 10 Gbps or 25 Gbps each. In other words, it is usually not trivial to put enough load on a system once we are into "serious business".

My overall advice would be to not exclusively hope for specific recommendations from people, because that will be a hit-or-miss game. Rather, read up on how to approach such a situation in a structured way (e.g. what is a good order in which to look at individual components to narrow down the problem). You will not only learn a great deal in general, but also truly understand how your system behaves.

Lastly, please read your post again and think about whether or not it contains enough information for someone to provide help that is not purely guesswork. Hint: The forum rules also contain something about that :smile: .

Good luck!
 

ehsab

Dabbler
Joined
Aug 2, 2020
Messages
45
How about trying more than one stream at the same time (iperf3 -c <ip> -P 4)? My bet is that your 100G is actually 4x 25Gbit.
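
For reference, a minimal sketch of that comparison (the address is a placeholder, and this assumes nothing else is saturating the link):

Code:
# on the TrueNAS box: start a listener
iperf3 -s

# on the client: single-stream baseline
iperf3 -c 192.168.10.2 -t 30

# on the client: four parallel streams against the same listener
iperf3 -c 192.168.10.2 -t 30 -P 4

If the four-stream run scales well past the single-stream number, the limit is per-stream (CPU/TCP windowing), not the link itself.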
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Some scattered thoughts:

Samba is single-threaded per connection. 25-30 Gb/s sounds like something I'd expect out of a single SMB connection using Samba. The client side of things is also bound to be limited. This is most likely causing this:
So it seems the total traffic this system can transfer out is limited to about 1.5~1.6 GByte/s
Try using more clients at once and see what happens.
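
If a second client (anything that can mount the share) is available, a rough way to generate that kind of concurrent load is to run the same sequential read from each client at once; a hypothetical sketch with fio, where the mount point is a placeholder:

Code:
# run on each client simultaneously, with the SMB share mounted at /mnt/nas
fio --name=seqread --directory=/mnt/nas --rw=read --bs=1M --size=20G --numjobs=1

If the aggregate goes up as clients are added, the ceiling is per-connection (one smbd process per client); if it stays pinned around 1.5~1.6 GByte/s in total, something further down (NIC driver, PCIe layout, pool) is the limit.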

Three PCIe 3.0 NVMe SSDs are pretty much in the same ballpark as a 50 GbE connection in terms of theoretical maximum. Once you account for overheads, I have some doubts that your setup can sustain the "bottlenecked 100 GbE" speeds for long.

Mellanox 100G SFP
I'm not sure how good the Mellanox FreeBSD driver is. That can have a huge impact.

Memory: 128 GB
For optimum use of 100+ GbE, having most stuff in memory is probably a requirement. L2ARC is probably a hindrance if you're doing mostly reads, and an ultra-low-latency SLOG (think Optane, or even Optane on the memory bus) is probably a necessity if you plan to do sync writes.
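
As a quick sanity check on the sync-write side, the per-dataset policy is visible from the shell; pool and dataset names below are placeholders:

Code:
# show the sync policy for every dataset on the pool
zfs get -r sync pool2

# for testing only: turn sync writes off on one dataset
# (unsafe for data you care about; a proper SLOG is the real fix for sync workloads)
zfs set sync=disabled pool2/share

A SLOG only ever helps synchronous writes, so with sync already disabled it will make no difference to these numbers.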
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Sidenote: Don't those NICs kinda cost more than the rest of the systems they're in? I presume this is some sort of dev, test, lab setup thing.
 

jpmomo

Dabbler
Joined
Apr 13, 2021
Messages
12
I am trying to do similar benchmarking and have posed similar questions. There does not seem to be a lot of (or any) empirical data with regard to PCIe Gen 4 systems. I have a variety of motherboards, CPUs, SSDs and NICs. Any time I have a PCIe 4.0 device in the mix, I get a lot of error messages from IPMI and TrueNAS CLI reporting. I can still configure a TrueNAS setup, but I would like to eliminate/understand these error messages.

My ultimate goal is to use several PCIe Gen 4 SSDs (M.2 or other) in a striped configuration backed by a lot of fast 3200 MHz RAM, served over SMB across a proven (verified with separate testing software) 100 Gbps network. The benchmarking software I am using can run multiple parallel SMB streams with configurable file sizes. With a verified PCIe 4.0 end-to-end system (using the latest AMD EPYC Rome or Milan CPUs), there should not be any bottleneck in the hardware. I am using Mellanox ConnectX-5 and ConnectX-6 NICs, both PCIe 4.0 compliant and dual-port 100GbE. It would be helpful to get confirmation of any working TrueNAS configs with PCIe 4.0 components. Thanks for any help as we try to push the envelope with this amazing software!
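
For what it's worth, a stripe of NVMe devices in ZFS is simply a pool made of several single-disk vdevs. On TrueNAS you would normally build it through the Pool Manager in the UI so the middleware knows about it, but in plain ZFS terms it amounts to something like this (pool and device names are placeholders):

Code:
# a 3-way stripe: three single-disk vdevs, no redundancy
zpool create -o ashift=12 fastpool nvd0 nvd1 nvd2
zpool status fastpool

ashift=12 forces 4K-aligned allocations regardless of the logical sector size the drives report, which also sidesteps part of the 512 vs. 4K question discussed further down the thread.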
 

tech.guru

Cadet
Joined
May 6, 2021
Messages
2
You need to confirm whether the TrueNAS build supports RoCEv2 for its file services. Otherwise your CPU is most likely limiting the performance you can achieve.

Try testing iSCSI over RDMA (iSER) and SMB Direct, if supported. You might have to turn them on explicitly; they may be experimental, so read the release notes.

Also make sure RoCEv2 is supported on the switch ports facing your TrueNAS. Make sure you test at layer 2 on the same switch using iperf first, then move on and test the switch uplinks (L3) afterwards.

You appear to be rate limited, and you aren't getting full speeds.
 

jpmomo

Dabbler
Joined
Apr 13, 2021
Messages
12
Thanks for the helpful info. All of the hardware components are RoCE v1/v2 capable. Not sure where TN stands with that protocol, though. I am using the latest version of TN.

Can you let me know why the CPU might be the bottleneck? I don't see it hitting any ceiling in utilization monitoring, but there might be some other metric that I am missing. In the current test I am using an AMD EPYC 7502 (32-core) CPU.

Just as an update, I was able to perform some basic iperf tests (with the iperf that is bundled with TN). It was showing around 85 Gbps without any optimizations, which appears to validate the network part of this testing. Is there a way to perform basic I/O storage testing in a similar way (meaning built into TN)? I still do not have my test software set up yet (waiting on some FPGA-based hardware traffic generators).

I would like to ensure that I have my disks configured with the optimal block/sector size. I was told that some of the disks may be set to 512 or 512e and that I should try and change them to 4K. I am not sure if there is a generic way to do that within TN (like the offset setting?)

Another update with regard to the PCI errors: this seems to be related to specific motherboards (Tyan). The errors do not appear with the ASRock Rack board that I am now testing.

Thanks again for taking the time to give some suggestions.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Is there a way to perform basic I/O storage testing in a similar way (meaning built into TN)?
Depends on how fancy you want to get. If you just want to see what the full-tilt, best case looks like, dd is a decent starting point.
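
For example, a rough sketch of a best-case sequential run from the TrueNAS shell (dataset path and sizes are placeholders; note that /dev/zero compresses to nearly nothing if LZ4 is enabled on the dataset, so use non-compressible data or a dataset with compression off for meaningful numbers):

Code:
# sequential write of ~32 GiB into the dataset under test
dd if=/dev/zero of=/mnt/pool2/testfile bs=1M count=32768

# sequential read back (the ARC may serve part of this from RAM)
dd if=/mnt/pool2/testfile of=/dev/null bs=1M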

I am not sure if there is a generic way to do that within TN
If I'm not mistaken, this tends to need some proprietary command.
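
Checking what the drives currently report is at least possible with the tools shipped in TrueNAS CORE; a sketch, with device names as placeholders (actually switching the LBA format, where a drive supports it, is destructive and typically done with nvmecontrol's format subcommand or the vendor's own utility):

Code:
# list NVMe controllers and namespaces
nvmecontrol devlist

# namespace details, including supported and current LBA formats
nvmecontrol identify nvme0ns1

# smartctl also reports the formatted LBA size
smartctl -a /dev/nvme0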
 

tech.guru

Cadet
Joined
May 6, 2021
Messages
2
jpmomo said:
(post quoted in full above)

From what I read, TrueNAS doesn't support RoCEv2 yet, and you really need that support for the type of disks and bandwidth that you want.

I would download a Windows Server trial and try out S2D; if that gives the performance you are asking for, you will know it's an issue with Samba. From what I have read, RDMA support (SMB Direct) in Samba seems very experimental.
 

Morris

Contributor
Joined
Nov 21, 2020
Messages
120
Be careful about slot use on the X570 Aorus. One of your NVMe drives will be on the chipset. If your NIC is too, the chipset-to-CPU link will be a bottleneck.
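
One way to sanity-check where everything actually landed is to look at the negotiated PCIe links from the TrueNAS shell; a rough sketch (the grep patterns are just examples for NVMe and Mellanox devices):

Code:
# list devices with PCIe capabilities, including negotiated link speed and width
pciconf -lvc | grep -B 3 -A 12 -e nvme -e mlx

Anything negotiating fewer lanes or a lower generation than expected, or sitting behind the chipset bridge, shares the single x4 link between the X570 chipset and the CPU.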
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
Consider running shell-based / on-host testing to confirm what your disks are able to provide. If that lines up with expectations, then that can be a line to draw from as you troubleshoot the rest of the topology.
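
If fio is available on the box (it ships with TrueNAS CORE, if I'm not mistaken), a local run that bypasses the network entirely could look something like this, with the dataset path and sizes as placeholders:

Code:
# four parallel sequential readers against the pool, 1 MiB blocks
fio --name=localread --directory=/mnt/pool2 --rw=read --bs=1M --size=8G --numjobs=4 --group_reporting

If that comfortably exceeds what SMB delivers, the pool itself is not the limiting factor.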
 

skyyxy

Contributor
Joined
Jul 16, 2016
Messages
136
kyaky said:
(original post quoted in full above)

I have the same plan for building my new NAS server. What is your 100GbE NIC model, please? Thanks.
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Just as a side note, if you're getting your Windows iperf binaries from the official site (iperf.fr), then you're likely using an old version (3.1.3). I've found that the latest versions actually show much better performance: with 3.1.3 between Win10 and TrueNAS I was getting ~3 Gbit/s on a 10GbE link, whereas the latest build from https://files.budman.pw/ got full speed (i.e. 9.8 Gbit/s or thereabouts). It's not relevant for your use case but may help others.
 

Monstieur

Cadet
Joined
Apr 11, 2022
Messages
2
iPerf3 is single-threaded, which is a regression from iPerf2. The multiple connections option merely opens multiple sockets on the same thread. A single thread maxes out at about 20 Gb/s on my setup.
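
A common workaround is to run several iperf3 server instances on different ports and drive them from separate client processes, so each pair gets its own thread; a sketch, assuming the chosen ports are free:

Code:
# on the server: one listener per port
iperf3 -s -p 5201 &
iperf3 -s -p 5202 &

# on the client: one process per listener, run simultaneously
iperf3 -c <server-ip> -p 5201 -t 30 &
iperf3 -c <server-ip> -p 5202 -t 30 &
wait

Summing the results gives a better picture of what the link can carry than a single iperf3 process does.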

I have ConnectX-4 100 Gb/s NICs on my TrueNAS server and a directly attached client.
With iSCSI I get the expected maximum disk performance (~6.6 GiB/s) from a stripe of mirrors with a 4x Optane 905p 280 GB SLOG.
SMB maxes out at about 3.7 GiB/s with CrystalDiskMark, and at about 1.8 GiB/s with Windows Explorer.

I have not tried any SMB-specific tuning. Here are my other tunables.
Code:
kern.ipc.maxsockbuf = 134217728
net.inet.tcp.recvbuf_max = 134217728
net.inet.tcp.recvspace = 4194304
net.inet.tcp.sendbuf_inc = 16777216
net.inet.tcp.sendbuf_max = 134217728
net.inet.tcp.sendspace = 4194304
vfs.zfs.l2arc.rebuild_enabled = 1
vfs.zfs.l2arc_write_boost = 3670016000
vfs.zfs.l2arc_write_max = 3670016000
vfs.zfs.dirty_data_max = 85899345920
vfs.zfs.dirty_data_max_percent = 80            (loader)
vfs.zfs.dirty_data_max_max = 85899345920       (loader)
vfs.zfs.dirty_data_max_max_percent = 80        (loader)
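
For anyone reproducing these: the entries marked (loader) are loader tunables and only take effect at boot, while the rest are sysctls that can be tried live from the shell before committing them in System > Tunables, e.g.:

Code:
sysctl kern.ipc.maxsockbuf=134217728
sysctl net.inet.tcp.recvspace=4194304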
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
I'm about to build a 100-gig all-NVMe TrueNAS box to connect to some Proxmox hosts. I was thinking about going 100 gig in the Proxmox hosts too, but maybe that's a waste and 40 or 25 is the way to go. Anyway, subscribing to this thread to watch for updates.
 