We've built a storage appliance out of a Lenovo ThinkSystem SR650 V3. On board are 24x Micron 7450 NVMe drives and an Intel E810-XXVDA4 NIC with 4x 25GbE SFP28 ports. All network interfaces on client, server and switch are configured for jumbo frames, and we've verified that no fragmentation of ~8K packets is happening. Everything is fiber attached through a 25/40/100GbE capable switch.
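The fragmentation check can be sketched with a don't-fragment ping sized to exactly fill a jumbo frame (the host name below is a placeholder, and the 8972-byte payload assumes a 9000-byte MTU minus 28 bytes of IP + ICMP headers):

```shell
# Payload that exactly fills a 9000-byte jumbo frame:
# 9000 (MTU) - 20 (IP header) - 8 (ICMP header) = 8972 bytes
MTU=9000
PAYLOAD=$((MTU - 28))
echo "testing with ${PAYLOAD}-byte don't-fragment payload"

# Linux client (-M do sets don't-fragment; fails loudly if any hop can't pass it):
#   ping -c 3 -M do -s "$PAYLOAD" nas.example
# Windows client equivalent:
#   ping -f -l 8972 nas.example
```

If the ping succeeds at 8972 bytes but fails at 8973, every hop on the path is genuinely passing 9000-byte frames unfragmented.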
The NVMe drives get around 5-6 GB/s reads each, verified by fio tests directly against the raw devices. Using a 4x6 RAIDZ1 setup (no L2ARC configured, no compression), we can happily get around 16 GB/s reads straight off the storage at the ZFS filesystem layer, which is more than enough to feed the network and clients.
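For reference, the raw-device check was along these lines (a rough sketch: the device path is a placeholder, `posixaio` is assumed as an engine available on both FreeBSD and Linux, and `--readonly` guards against accidental writes; the block skips cleanly where fio isn't installed):

```shell
# Sequential-read sanity check against one raw NVMe device.
# /dev/nvme0n1 is a placeholder -- verify the path before running on real hardware.
if command -v fio >/dev/null 2>&1; then
  fio --name=seqread --filename=/dev/nvme0n1 --readonly \
      --rw=read --bs=1M --iodepth=32 --direct=1 \
      --ioengine=posixaio --runtime=10 --time_based
  STATUS=ran
else
  STATUS=skipped
fi
echo "fio check: $STATUS"
```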
We're sharing these out to Windows 10 workstations equipped with Intel XXV710 dual-port 25GbE network cards (we're only connecting one port per client, however). This resource was very helpful:
Resource - High Speed Networking Tuning to maximize your 10G, 25G, 40G networks (www.truenas.com)
Initially speeds were extremely poor, even for iperf3. With that resource's tunables, iperf3 went from needing 3-4 parallel processes to max out the line speed to hitting the full 25 Gbit/s on a single process between client and server (and still with a little CPU headroom). The Windows 10 workstations are running the latest version of Windows 10, with the latest Intel network card drivers and PROSet management software from Intel.
SMB reads haven't behaved as well, and the results are somewhat strange. Writes back to TrueNAS are excellent - we can happily get 2.3 GB/s (24-ish Gbit/s) writing from a local NVMe on the Windows 10 client over the wire to TrueNAS (sustained over 15 minutes of continual writing), so we're maxing out the 25GbE NICs there. Reads are extremely problematic. I've spent around 4 days solid trying a host of different combinations, including:
* SMB3 multichannel enabled and disabled, using LACP with a single IP
* SMB3 multichannel using 4x IPs (one per physical interface)
* Simple single-interface setup (no LACP, no VLANs, no multichannel)
* Various combinations of RSS on the server side (client and server are both verified RSS-capable via the PowerShell cmdlets Get-SmbClientNetworkInterface and Get-SmbMultichannelConnection), with both LACP and no LACP (single interface only) on the server side
* aio read and aio write configuration changes
* Disabling all antivirus on the client
* ZFS record sizes from 128K up to 1M
* Disabling ZFS prefetch
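For reference, the multichannel and aio toggles above were applied through smb.conf auxiliary parameters along these lines (the values shown are illustrative examples of what was flipped between tests, not recommendations):

```
# smb.conf [global] auxiliary parameters (illustrative values)
server multi channel support = no   # toggled yes/no per test run
aio read size = 1                   # 0 disables async reads; >0 enables
aio write size = 1                  # 0 disables async writes; >0 enables
```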
Where things get really weird is that, with the tunables from the linked resource above, we can get around 12 Gbit/s from a single-threaded read (SMB multichannel disabled on the server side, antivirus disabled on the client side, a simple copy to a local NVMe capable of 4 GB/s+ writes). Whether the server is configured for LACP or a single NIC makes zero difference here.
When we enable SMB multichannel by any method (single IP on the server with RSS, multiple IPs on a single NIC via VLANs, or multiple IPs on multiple NICs), read speeds come crashing down. The client struggles to peak above 1.4 Gbit/s on reads. Windows Task Manager confirms the reads are splitting over multiple connections, and when we configure multiple interfaces/VLANs/IPs on the client side we can see load being shared across the NICs. However, that load is tiny - around 700 Mbit/s per interface.
The client can definitely read 25 Gbit/s of data via iperf3 on a single thread, and likewise the client has CPU to burn (i9-7940X with 28 threads). However, at absolute best we're seeing around half the expected network read performance as soon as SMB is serving data. And again, the strange part is that writes to the same SMB share happily hit line speed, even with a single write thread copying a single file.
I should probably note I did attempt to install TrueNAS SCALE to see if that made a difference; however, the bootable installer crashed the system, and an attempted upgrade/crossgrade from CORE to SCALE caused the same crashes, so I reinstalled CORE fresh. I also played about with Ubuntu 22.04 LTS on the same hardware, which worked fine (including importing/mounting the ZFS pool), but performance differences across fio testing (both off raw NVMe and off the ZFS pool), iperf3, etc. were negligible between TrueNAS/FreeBSD and Ubuntu Linux.
At this stage I think the issue lies with Samba itself. Some reading led me to believe that Samba doesn't use O_DIRECT, which is definitely needed for fio to reach the numbers above (they only appear with "--direct=1"). Likewise, fio gives its highest read speeds at 64k block sizes and up, with a notable drop at 32k and lower. I'm unsure what read sizes Samba uses internally. There's a Samba VFS module that allows O_DIRECT for reads, but it's not bundled with CORE.
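The block-size sensitivity can be reproduced without fio using buffered dd reads against a scratch file (a rough sketch; the /tmp path and sizes are arbitrary, and because the reads are buffered the second pass mostly measures page-cache speed, which is still enough to expose per-syscall overhead at small block sizes):

```shell
# Create a 256 MiB scratch file, then read it back at two block sizes
# and compare the throughput dd reports on its final line.
F=/tmp/bsize-test.bin
dd if=/dev/zero of="$F" bs=1M count=256 2>/dev/null
for BS in 32k 1M; do
  echo "== bs=$BS =="
  dd if="$F" of=/dev/null bs="$BS" 2>&1 | tail -n 1
done
# Record the scratch file size before cleaning up (GNU stat, BSD fallback)
BYTES=$(stat -c %s "$F" 2>/dev/null || stat -f %z "$F")
rm -f "$F"
```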
It's worth noting too that various "cat file | pv > /dev/null" and "dd if=file of=/dev/null bs=1m status=progress" tests on the server itself don't give amazing results either. cat yields a mere 750 MB/s (which could just be the limit of the pipe), and the dd test hits around 4 GB/s - faster than the SMB read, but still about a quarter of what fio can read. This all leads me to think the way Samba is configured to actually read from disk might be part of the problem. But then I don't understand why writes are double the speed.
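Whether that 750 MB/s is really a pipe ceiling can be checked independently of ZFS by pushing zeroes through a pipe with no disk involved (a quick sketch):

```shell
# Push 1 GiB of zeroes through a pipe and let the reading dd report
# throughput on its final line. If this also lands near 750 MB/s, the
# "cat file | pv" number is measuring the pipe, not ZFS or Samba.
dd if=/dev/zero bs=1M count=1024 2>/dev/null | dd of=/dev/null bs=1M 2>&1 | tail -n 1

# Byte-count sanity check that the full amount traverses the pipe intact
BYTES=$(dd if=/dev/zero bs=1M count=64 2>/dev/null | wc -c)
echo "piped $BYTES bytes"
```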
If anyone can point out anything obvious I've missed, or avenues to chase to figure out why this read performance is capped, I'd be very grateful.