SMB performance not keeping up with fast NVMe drives

NickF

Guru
Joined
Jun 12, 2014
Messages
763
@NickF @elvisa has established the issue is only with Windows clients. We have talked on the phone.
I still think tuning the server may yield further improvements... but from a client perspective, this is also relevant.



I would say start with the values in their example... then start tweaking. I don't know enough about these tunables to give you a better starting point than that. @elvisa

Client tuning example

The general tuning parameters for client computers can optimize a computer for accessing remote file shares, particularly over some high-latency networks (such as branch offices, cross-datacenter communication, home offices, and mobile broadband). The settings are not optimal or appropriate on all computers. You should evaluate the impact of individual settings before applying them.

Parameter                     Value    Default
DisableBandwidthThrottling    1        0
FileInfoCacheEntriesMax       32768    64
DirectoryCacheEntriesMax      4096     16
FileNotFoundCacheEntriesMax   32768    128
MaxCmds                       32768    15
Starting in Windows 8, you can configure many of these SMB settings by using the Set-SmbClientConfiguration and Set-SmbServerConfiguration Windows PowerShell cmdlets. Registry-only settings can be configured by using Windows PowerShell as well.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
This is not an eventuality that I think software developers ever planned for. At the very least it's orders of magnitude faster than the hard drives a lot of code was written for.
I am aware that the issue is on the client side. But I still wanted to add something to this point.

About 14 years ago, when SSDs were relatively new, I spoke with a university professor who specialized in database management systems (esp. relational ones). He told me that, at the time, high-end DB servers wouldn't benefit from SSDs much, if at all. The reason was that e.g. Oracle had tuned their raw devices to such an extent that the latency induced by the rotating disk was factored into how operations were scheduled by the execution plan.

I don't know to what extent something comparable applies here. But the general point that @NickF makes is indeed a very important one: Performance-critical software is (or should be) designed with the relevant performance characteristics of hardware in mind. Performance optimization is about finding the next bottleneck and trying to get rid of it. Only to then run into the next one. And if, as an example, the software assumes that disk access is x orders of magnitude slower than RAM access, this will somehow be reflected in the overall architecture as well as the code.

Nothing that solves the problem at hand, I know, but perhaps interesting and/or helpful nonetheless.
 
Last edited:

NickF

Guru
Joined
Jun 12, 2014
Messages
763
@ChrisRJ I always appreciate the wisdom of folks who've been in the trenches longer than I. Thanks so much for that post :smile:

If nothing else, I'd like to reiterate and rephrase my position to the stakeholders here: this problem comes at us from multiple vectors, not just the client or just the server.
 

Nater41

Cadet
Joined
Sep 2, 2023
Messages
1
We've built a storage appliance out of a Lenovo ThinkSystem SR650 V3. On board are 24x Micron 7450 NVMe drives and an Intel E810-XXVDA4 NIC sporting 4x 25GbE QSFP adapters. All network interfaces on client, server, and switch are configured for jumbo frames, and we've verified that no fragmentation of ~8K packets is happening. Everything is fiber attached through a 25/40/100GbE capable switch.

The NVMe drives get around 5-6 GB/s reads each, verified by fio tests directly on the raw devices. Using a 4x6 RAIDZ1 setup (no L2ARC configured, no compression), we can get around 16 GB/s reads happily straight off the storage at the ZFS file system layer, which is more than enough to feed the network and clients.

When sharing these out to Windows 10 workstations equipped with Intel XXV710 2x QSFP network cards (we're only connecting one of these per client, however), this resource was very helpful:

Initially speeds were extremely poor, even for iperf3. That resource took us from iperf3 needing 3-4 processes to max out the line speed to hitting the full 25 Gbit/s on a single process between client and server (and still with a little CPU headroom). The Windows 10 workstations are running the latest version of Windows 10, and have the latest Intel network card drivers and PROSet management software from Intel.

SMB reads haven't behaved as well, but the results are somewhat strange. Writes back to TrueNAS are excellent - we can get 2.3 GB/s (24-ish Gbit/s) happily from a local NVMe on the Windows 10 client over the wire writing back to TrueNAS (sustained writes over 15 minutes of continual writing), so we're maxing out the local 25GbE NICs there. Reads are extremely problematic. I've spent around 4 days solid trying a host of different combinations, including:

* SMB3 multichannel enabled and disabled, using LACP with a single IP
* SMB3 multichannel using 4x IPs (1 per physical interface)
* Simple single-interface setup (no LACP, no VLANs, no multichannel)
* Various combinations of RSS on the server side (client and server are both verified as RSS-capable via the PowerShell cmdlets Get-SmbClientNetworkInterface and Get-SmbMultichannelConnection), including both LACP and no LACP (single interface only) on the server side
* Samba aio read and aio write configuration changes
* Disabling all antivirus on the client
* ZFS record sizes from 128K up to 1M
* Disabling ZFS prefetching

Where things get really weird is that, with the tunables from the linked resource above, we can get around 12 Gbit/s from a single-threaded read (SMB multichannel disabled on the server side, antivirus disabled on the client side, doing a simple copy to a local NVMe capable of 4 GB/s+ writes). Whether the server is configured for LACP or a single NIC makes zero difference to this.

When we enable SMB multichannel via any method (single IP on the server with RSS, multiple IPs on a single NIC via VLANs, or multiple IPs on multiple NICs), read speeds come crashing down. The client struggles to peak above 1.4 Gbit/s reads. Windows Task Manager confirms the reads are splitting over multiple connections, and when we configure multiple interfaces/VLANs/IPs on the client side we can see load being shared over the NICs. However, that load is tiny - around 700 Mbit/s per interface.

The client can definitely read 25 Gbit/s of data via iperf3 on a single thread, and likewise the client has CPU to burn (i9-7940X with 28 threads). However at absolute best we're seeing around half the network read performance expected as soon as SMB is serving data. And again, the strangeness there is that writes to SMB on the same share happily hit line speed, even on a single write thread copying a single file.

I should probably note I did attempt to install TrueNAS SCALE to see if that made a difference; however, the bootable installer crashed the system, and an attempted upgrade/crossgrade from CORE to SCALE caused the same crashes, so I reinstalled CORE fresh. I also played about with Ubuntu 22.04 LTS on the same hardware, which worked fine (including importing/mounting the ZFS pool); however, performance differences across things like fio testing (both off raw NVMe and off the ZFS pool), iperf3, etc. were negligible between TrueNAS/FreeBSD and Ubuntu Linux.

At this stage I think the issue lies with Samba itself. Some reading led me to believe that Samba doesn't use O_DIRECT from what I can tell, which is definitely needed to make tools like fio get the required results when using "-direct=1". Likewise, fio gives its highest read speeds for 64k reads and up, with a notable drop at 32k reads or lower. I'm unsure how Samba is configured internally when it comes to reads. There's a Samba VFS module that allows O_DIRECT for reads, but it's not bundled with CORE.

It is worth noting too that various "cat file | pv >/dev/null" and "dd if=file of=/dev/null bs=1m status=progress" type tests also don't give amazing results. cat yields a mere 750 MB/s (which could just be the limit of pipes), and the dd test does hit around 4 GB/s, which is faster than the SMB read (still about 1/4 of what can be read with fio). This all leads me to think that the way Samba is configured to actually read from disk might be part of the problem? But I also don't understand why writes are then double the speed.

If anyone can point out anything obvious I've missed, or avenues to chase to figure out why this read performance is capped, I'd be very grateful.
Lol, think you and I are messing around with a lot of the same stuff.


Granted, I am working with a QNAP, but it's ZFS based and I am tinkering with ZFS settings within it that I probably shouldn't be. I seem to be hitting very similar limitations to yours. In my post above from earlier this week, I was hitting this mystical 2.3 GB/sec write speed that I couldn't get past.

Last night I think I finally got past that 2.3 GB/sec, but I'm going to do some more testing here to make sure. I have been changing so many settings that I've got to figure out what's helping the most. It seems I hit a steady stream of 2.3 GB/sec for a minute or two, then it very slowly crawls up to 3.4 GB/sec, then peaks at 4.5-5 GB/sec at the tail end. Not sure yet if that tail end is just the QNAP's RAM flushing the files, which is why it can creep higher. I can max out my dual 25GbE NICs with reads at 5.7 GB/sec. However, so far it seems I can't really achieve these speeds with a normal Windows file copy in Windows Explorer. I can only seem to get really high throughput with various robocopy commands I am trying.
 

MrGuvernment

Patron
Joined
Jun 15, 2017
Messages
268
Can you verify the TCP window sizes... at each end?
At 25 Gbit/s = ~3 GByte/s... even at 1 ms latency, you would need ~3 MByte.
If it's faster (lower latency), 1 MB would be needed... that's larger than Windows defaults.

This. I was just doing some iperf3 testing myself, and on Windows I had to set the -w 1M flag to max out my 10Gbps link to my TrueNAS in testing.

Dest is my TrueNAS 13 CORE, running iperf3
Source : Win 10 Ent x64
iperf3

-Standard test - TCP
iperf3 -c [IP]
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-60.00 sec 37.6 GBytes 5.39 Gbits/sec sender
[ 4] 0.00-60.00 sec 37.6 GBytes 5.39 Gbits/sec receiver

-Increased buffer size - TCP no timer (-w 1M)
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-10.00 sec 11.0 GBytes 9.48 Gbits/sec sender
[ 4] 0.00-10.00 sec 11.0 GBytes 9.48 Gbits/sec receiver

-Standard test - TCP with timer
iperf3 -c [IP] -i 10 -t 60
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-60.00 sec 37.6 GBytes 5.39 Gbits/sec sender
[ 4] 0.00-60.00 sec 37.6 GBytes 5.39 Gbits/sec receiver

-Increased buffer size - TCP with timer (-w 1M)
iperf3 -c [IP] -i 10 -w 1M -t 60
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-60.00 sec 66.3 GBytes 9.49 Gbits/sec sender
[ 4] 0.00-60.00 sec 66.3 GBytes 9.49 Gbits/sec receiver
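
For anyone wanting to sanity-check the window-size numbers quoted above, it's just the bandwidth-delay product. A quick sketch (the 25 Gbit/s and 1 ms figures come from the quote; the 10 Gbit/s line matches my link):

# Bandwidth-delay product: bytes in flight needed to keep the pipe full.
# BDP = (link speed in bits/s / 8) * round-trip time in seconds
awk 'BEGIN { printf "25 GbE @ 1 ms RTT: %.2f MB\n", (25e9/8) * 0.001 / 1e6 }'
awk 'BEGIN { printf "10 GbE @ 1 ms RTT: %.2f MB\n", (10e9/8) * 0.001 / 1e6 }'

That works out to roughly 3.1 MB and 1.25 MB respectively, which is why -w 1M was enough to get my 10Gbps link close to line rate.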
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
This. I was just doing some iperf3 testing myself, and on Windows I had to set the -w 1M flag to max out my 10Gbps link to my TrueNAS in testing.

Dest is my TrueNAS 13 CORE, running iperf3
Source : Win 10 Ent x64
iperf3

-Standard test - TCP
iperf3 -c [IP]
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-60.00 sec 37.6 GBytes 5.39 Gbits/sec sender
[ 4] 0.00-60.00 sec 37.6 GBytes 5.39 Gbits/sec receiver

-Increased buffer size - TCP no timer (-w 1M)
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-10.00 sec 11.0 GBytes 9.48 Gbits/sec sender
[ 4] 0.00-10.00 sec 11.0 GBytes 9.48 Gbits/sec receiver

-Standard test - TCP with timer
iperf3 -c [IP] -i 10 -t 60
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-60.00 sec 37.6 GBytes 5.39 Gbits/sec sender
[ 4] 0.00-60.00 sec 37.6 GBytes 5.39 Gbits/sec receiver

-Increased buffer size - TCP with timer (-w 1M)
iperf3 -c [IP] -i 10 -w 1M -t 60
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-60.00 sec 66.3 GBytes 9.49 Gbits/sec sender
[ 4] 0.00-60.00 sec 66.3 GBytes 9.49 Gbits/sec receiver

Obviously 25 GbE will behave a bit differently

Throughput is also reduced if there is a need to actually perform reads and the queue depth is limited.
 

elvisa

Dabbler
Joined
Aug 14, 2023
Messages
11
Can you verify the TCP window sizes... at each end?
At 25 Gbit/s = ~3 GByte/s... even at 1 ms latency, you would need ~3 MByte.
If it's faster (lower latency), 1 MB would be needed... that's larger than Windows defaults.
Sorry for my delay in responding to this.

On the Windows side, I've followed this:

And inside HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters I set a DWORD "TcpWindowSize" to "0x3FFFC000" (the largest possible value according to that guide, although I've seen higher values around too). I've rebooted to be sure. The DWORD "Tcp1323Opts" is set to "3", which I believe is the default.

I've also used the PowerShell cmdlet Set-NetTCPSetting with -AutoTuningLevelLocal set to both "Normal" and "Experimental", neither of which changes the performance.

On the storage we've just migrated to TrueNAS SCALE, and on that system I have:

net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_rmem = 4096 131072 6291456
net.ipv4.tcp_wmem = 4096 131072 6291456
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

Changing these to larger values doesn't appear to make any difference (smaller begins to bring performance down at a certain point).

Are there better ways to verify this?
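
The closest I've found so far is to watch the live socket buffers while a copy is running, rather than trusting the sysctls alone. A rough sketch, assuming iproute2's ss is available on the SCALE box:

# Per-socket memory/window detail for established SMB (port 445) connections;
# rcv_space and the skmem fields show the buffer actually in use.
ss -tmi 'sport = :445'

# And confirm the sysctls above actually took effect:
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.rmem_max net.core.wmem_max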

[edit] The "core.rmem_max" and "wmem_max" options in particular on the Linux side set how high I can set the "-w" flag in iperf3 in either direction. By default these stop at around 500K or so, with the values above I can push the -w flag to 10M happily, although I don't get much benefit over 3M as you mentioned.

One thing I've just discovered is that iperf3 performance (running as default, no arguments other than the address to connect to) is asymmetric. I thought I'd tested in both directions earlier and gotten close to 25 Gbit/s both ways, but I've since worked out that this is only from Windows to TrueNAS, not the other way around. From TrueNAS back to Windows (aka the "read" direction), a default (no arguments) iperf3 run yields around 10 Gbit/s, which seems similar to our SMB cap.

I can change this by running the following on the Windows side (the "-R" flag here is telling the traffic to go in reverse direction, from TrueNAS to Windows):

iperf3 -i1 -R -P6 -A 4,4 -c TrueNAS-address

That combo of both the 6 parallel streams and the affinity argument at "4,4" is necessary, and it gets traffic up to a high of 21 Gbit/s into Windows ("read" direction). Any lower stream count or different affinity brings the numbers down, and anything higher makes no noticeable improvement.

Further testing with our AJA disk speed test under Windows shows the same ~2.4GB/s writes and ~1.3GB/s reads now with TrueNAS SCALE. A smidge better than TrueNAS CORE, but still just shy of our necessary ~1.5-1.6GB/s sustained.
 
Last edited:

NickF

Guru
Joined
Jun 12, 2014
Messages
763
I'm assuming these are L2 adjacent to each other with no sort of overlay network (VXLAN, etc.) and you are using large MTUs, aka jumbo frames at 9000?
Some other tuning suggestions that may help:
net.core.netdev_max_backlog = 8192
TCP Receive Queue

This parameter sets the maximum size of the network interface's receive queue. The queue is used to store received frames after removing them from the network adapter's ring buffer. High-speed adapters should use a high value to prevent the queue from becoming full and dropping packets, which causes retransmits.


net.ipv4.tcp_max_syn_backlog
May also be a good one to tune.

@jgreco had mentioned possibly tuning the TCP congestion control algorithm.
modprobe tcp_dctcp
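
If it helps, this is roughly how I'd apply and check those on the SCALE side. A sketch only - the values are starting points, not gospel, and DCTCP relies on ECN-capable switching, so treat that part as an experiment:

# Backlog tunables (runtime only; persist via TrueNAS tunables if they prove useful)
sysctl -w net.core.netdev_max_backlog=8192
sysctl -w net.ipv4.tcp_max_syn_backlog=8192

# Congestion control experiment per the suggestion above
modprobe tcp_dctcp
sysctl net.ipv4.tcp_available_congestion_control   # confirm dctcp is now listed
sysctl -w net.ipv4.tcp_congestion_control=dctcp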

These seem relevant:
 

MrGuvernment

Patron
Joined
Jun 15, 2017
Messages
268
Just a note on Jumbo Frames:
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Just a note on Jumbo Frames:
I think there is some merit to both sides of this argument, which is as old as time. For an L2 network that doesn't have a gateway, any traffic is effectively TTL of 1. In that situation, which is a common design for iSCSI and SAN networks, there is very little design downside. However, if you need to cross L3 boundaries, I concur it's not worth the headache IMO.
 
Last edited:

elvisa

Dabbler
Joined
Aug 14, 2023
Messages
11
This infra is on its own network, all connected to the same Dell S5212F-ON 25/40/50/100GbE switch (only that one switch between clients and storage). We're using exclusively 25GbE over fiber with SFP28 modules. The switch has the latest available firmware installed.

Jumbo frames are in play across the board - servers, clients, and switch. Gateway and upstream stuff is fine - there's a pfSense firewall acting as the inter-VLAN router (and default gateway to the Internet), and its interface on the subnet is manually set to MTU 1500 / MSS 1460, which solves all issues with these devices getting out to other networks (this works for Linux-based routers as well). But that isn't in the mix for client-to-server traffic.

For all of this testing and gear, yes it's all on the same IPv4 subnet, same switch, same VLAN, and all MTU 9000 or higher (Dell switches set port MTUs to 9216 to cover any packet overhead, etc). Packet captures at each end verify no fragmentation of 4-8K packets.
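
(If anyone wants to repeat that check without firing up packet captures, a do-not-fragment ping from the SCALE side does the same job; the size below assumes MTU 9000 minus 28 bytes of IP/ICMP headers, and the client address is a placeholder:)

# 9000-byte MTU minus 20 (IP) and 8 (ICMP) header bytes = 8972-byte payload; -M do forbids fragmentation.
ping -M do -s 8972 -c 3 <windows-client-ip>
# If this fails while -s 1472 succeeds, something in the path is not passing jumbo frames.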

It has been suggested that we direct-connect client and server without a switch in the middle to eliminate that as a potential issue, but I can't see any evidence to justify that test. I'm quite happy with the quality of Dell enterprise switching in general, and LibreNMS attached to the switch shows it's doing very little in terms of its full CPU/memory/packet-switching capacity during our tests; again, we're getting line speed on various tests without any issues, despite the real-world SMB problems.

In general, I'm a fan of jumbo frames. I hear a lot of network and storage admins complain about the idea, but in my history as a sysadmin in places like VFX, media, science, research, etc., they're something I've always deployed to good effect. Connecting to networks outside of our production environments has never been an issue with a correctly configured router with hard-defined MTU and MSS to handle the network boundaries properly. Zero issues with jumbo frames for me in either IPv4 or IPv6 networks (and now moving into converged networks with RDMA/RoCE and the like in play).
 
Last edited:

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Sorry for my delay in responding to this.

On the Windows side, I've followed this:

And inside HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters I set a DWORD "TcpWindowSize" to "0x3FFFC000" (the largest possible value according to that guide, although I've seen higher values around too). I've rebooted to be sure. The DWORD "Tcp1323Opts" is set to "3", which I believe is the default.

I've also used the PowerShell cmdlet Set-NetTCPSetting with -AutoTuningLevelLocal set to both "Normal" and "Experimental", neither of which changes the performance.

On the storage we've just migrated to TrueNAS SCALE, and on that system I have:

net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_rmem = 4096 131072 6291456
net.ipv4.tcp_wmem = 4096 131072 6291456
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

Changing these to larger values doesn't appear to make any difference (smaller begins to bring performance down at a certain point).

Are there better ways to verify this?

[edit] The "core.rmem_max" and "wmem_max" options in particular on the Linux side set how high I can set the "-w" flag in iperf3 in either direction. By default these stop at around 500K or so, with the values above I can push the -w flag to 10M happily, although I don't get much benefit over 3M as you mentioned.

One thing I've just discovered is that iperf3 performance (running as default, no arguments other than the address to connect to) is asymmetric. I thought I'd tested in both directions earlier and gotten close to 25 Gbit/s both ways, but I've since worked out that this is only from Windows to TrueNAS, not the other way around. From TrueNAS back to Windows (aka the "read" direction), a default (no arguments) iperf3 run yields around 10 Gbit/s, which seems similar to our SMB cap.

I can change this by running the following on the Windows side (the "-R" flag here is telling the traffic to go in reverse direction, from TrueNAS to Windows):

iperf3 -i1 -R -P6 -A 4,4 -c TrueNAS-address

That combo of both the 6 parallel streams and the affinity argument at "4,4" is necessary, and it gets traffic up to a high of 21 Gbit/s into Windows ("read" direction). Any lower stream count or different affinity brings the numbers down, and anything higher makes no noticeable improvement.

Further testing with our AJA disk speed test under Windows shows the same ~2.4GB/s writes and ~1.3GB/s reads now with TrueNAS SCALE. A smidge better than TrueNAS CORE, but still just shy of our necessary ~1.5-1.6GB/s sustained.

As discussed, the next test is to see whether it's the SCALE/Linux end or the Windows end causing the issue.

Testing iperf3 from SCALE to a Linux client on 25GbE would resolve that.
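
Something like this would do it (a rough sketch; the address is a placeholder for whatever Linux box you have on the same 25GbE segment):

# On the SCALE box:
iperf3 -s

# On the Linux client; -R makes SCALE the sender, i.e. the same direction as an SMB read:
iperf3 -c <scale-ip> -R -t 30 -i 5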

There is a normal limitation in Windows of only using a single core per TCP connection. Receive Side Scaling (RSS) may be required.
 
Joined
Oct 2, 2023
Messages
2
So I had a similar issue... and it may be related to yours.

I upgraded my pool from 2 striped vdevs (I was just testing and messing around) to a 3x 4-wide RAIDZ1 array. You'd think I would get better performance, right? Well, read speeds were about 30 MB/s. Write speeds were the opposite: I was getting around 0.9-1.3 GB/s.
I thought maybe the sector size was messed up, or some other ZFS parameter. I dug through all of them for hours and made some tuning adjustments to ARC, queues, etc. I even checked individual PCIe link speeds for my drives and the 10Gb Ethernet card.


I threw in some old drives and made a 4-wide stripe pool, and even a dual NVMe pool to test. Still terrible performance. I was getting around 100 MB/s from the dual NVMe pool, lol.

Solution? SMB config was the culprit. I added the following auxiliary parameters and just like that, I'm just about fully saturating my 10G network.


read raw=yes
write raw=yes
max xmit=65535
getwd cache=yes

Other ZFS parameters that appear to have helped are:

echo 12 >> /sys/module/zfs/parameters/zfs_vdev_async_read_max_active

This one was set to 3, vs 10 on async writes. Not sure if it came from when I had 2 drives in my previous pool and then migrated... but changing this has been positive.

I also changed zfs_vdev_max_active from 1000 to 4000. Not sure how much of a difference this one made, but it doesn't appear to have been a bad adjustment.

echo 4000 >> /sys/module/zfs/parameters/zfs_vdev_max_active
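
In case it helps anyone else, this is roughly how I read the values back and keep them across reboots. It's a generic Linux/OpenZFS sketch - on TrueNAS the supported route is probably an init task or tunable entry rather than editing modprobe.d:

# Read back the live values after the echo:
grep -H . /sys/module/zfs/parameters/zfs_vdev_async_read_max_active \
          /sys/module/zfs/parameters/zfs_vdev_max_active

# Make them survive a reboot on a stock install:
cat <<'EOF' > /etc/modprobe.d/zfs.conf
options zfs zfs_vdev_async_read_max_active=12
options zfs zfs_vdev_max_active=4000
EOF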

It would be nice if there were some scripts that would force re-tuning for things like ZFS when hardware changes.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Solution? SMB config was the culprit. I added the following auxiliary parameters and just like that, I'm just about fully saturating my 10G network.


read raw=yes
write raw=yes
max xmit=65535
getwd cache=yes

Which version of TrueNAS were you running?
How were you testing it?

If you could verify by removing them (the SMB aux parameters) and retesting, we'd like to confirm this, as we are not aware of the issue.
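
A quick way to see what the running Samba config actually contains (a sketch - the smb4.conf path is an assumption, adjust to wherever your install writes it):

# Show the effective values for those aux parameters; testparm will also warn if this
# Samba build no longer recognises some of them (several are legacy SMB1-era options).
testparm -sv /etc/smb4.conf 2>&1 | grep -Ei 'read raw|write raw|max xmit|getwd cache'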
 
Last edited:

NickF

Guru
Joined
Jun 12, 2014
Messages
763
So I had a similar issue... and it may be related to yours.

I upgraded my pool from 2 striped vdevs (I was just testing and messing around) to a 3x 4-wide RAIDZ1 array. You'd think I would get better performance, right? Well, read speeds were about 30 MB/s. Write speeds were the opposite: I was getting around 0.9-1.3 GB/s.
I thought maybe the sector size was messed up, or some other ZFS parameter. I dug through all of them for hours and made some tuning adjustments to ARC, queues, etc. I even checked individual PCIe link speeds for my drives and the 10Gb Ethernet card.

RAIDZ1 will always be slower than mirrors, and that may be exacerbated by relatively low CPU performance. Parity calculations and compression have a cost that's felt harder on low-end systems. Similarly, Ethernet performance above 1Gb can/will be bottlenecked by low-IPC CPUs, poor memory bandwidth, or both.

How are you testing and what are your specs?
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
@elvisa I did some further exploration on this topic, but focusing on the client side performance.

I think it would be interesting and helpful to compare notes on what you've done client side vs what I have tested to see where we are. I think there may be some performance to squeeze. Let me know if this helps.
 
Joined
Oct 2, 2023
Messages
2
I mean, sure, RAIDZ1 is the jack of all trades, master of none, but 100 MB/s is slower than even a single drive, plus the issue remained on an NVMe stripe... Case dismissed.

Further testing revealed my pool was performing just fine with things like copying files internally from one pool to another etc.

The final test that led me to the solution was creating an NFS share, which immediately saturated the 10G link.
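
For anyone wanting to repeat the comparison, the read side boils down to something like this from a Linux box (host and paths are placeholders, not my actual setup):

# Mount the NFS export read-only and pull a large file, bypassing SMB entirely:
mount -t nfs -o ro truenas:/mnt/tank/share /mnt/nfstest
dd if=/mnt/nfstest/bigfile.bin of=/dev/null bs=1M status=progress
umount /mnt/nfstest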

I suspect multichannel had a role in this too because I have multiple networks, but SMB is only supposed to be using the 10G network... not sure, but either way you've got to tune your SMB config, especially if you migrate hardware.

I used htop to help rule out the issue being a PCIe limitation, memory, bad hardware, too large a pool, etc.
 