SMB performance not keeping up with fast NVME drives

elvisa · Aug 14, 2023

We've built a storage appliance out of a Lenovo ThinkSystem SR650 V3. On board are 24x Micron 7450 NVMEs and an Intel E810-XXVDA4 NIC sporting 4x 25GbE QSFP adapters. All network interfaces on client, server and switch are configured for jumbo frames and verified that no fragmentation of ~8K packets is happening. Everything is fiber attached through a 25/40/100GbE capable switch.

The NVME drives get around 5-6 GB/s reads each, verified by fio tests directly on the raw devices. Using a 4x6 RAIDZ1 setup (no L2ARC configured, no compression), we can get around 16 GB/s reads happily straight off the storage at the ZFS file system layer, which is more than enough to feed the network and clients.

Sharing these out to Windows 10 workstations equipped with Intel XXV710 2xQSFP network cards (we're only connecting one of these per client however), this resource was very helpful:

Resource - High Speed Networking Tuning to maximize your 10G, 25G, 40G networks

Both FreeBSD and Linux come by default highly optimized for classic 1Gbps ethernet. This is by far the most commonly deployed networking for both clients and servers, and a lot of research has been done to tune performance especially for local...

www.truenas.com

Initially speeds were extremely poor, even for iperf3. That resource assisted in getting iperf3 requiring 3-4 processes to max out the line speed to now being able to hit the full 25 Gbit/s on a single process between client and server (and still with a little CPU headroom). The Windows 10 workstations are running the latest version of Windows 10, and have the latest Intel network card drivers and Pro Set management software from Intel.

SMB reads haven't behaved as well, but the results are somewhat strange. Writes back to TrueNAS are excellent - we can get 2.3 GB/s (24-ish Gbit/s) happily from a local NVME on the Windows 10 client over the wire writing back to TrueNAS (sustained writes over 15 minutes of continual writing), so we're maxing out the local 25GbE NICs there. Reads are extremely problematic. I've spent around 4 days solid trying a host of different combinations, including:

* SMB3 multi channel and disable multi channel, using LACP with a single IP
* SMB3 multii channel using 4x IPs (1 per physical interface)
* Simple single interface setup (no LACP, no VLANs, no multi channel)
* Various combinations of RSS on the server side (client and server are both verified able to do RSS via powershell commandlets Get-SmbClientNetworkInterface and Get-SmbMultichannelConnection ), including both LACP and no LACP (single interface only) on the server side
* aio read and aio write configuration changes
* Disabling all anti virus on the client
* ZFS record sizes from 128K up to 1M
* disable ZFS prefetching

Where things get really weird are that, with the tunables from the linked resource above, we can get around 12 Gbit/s from a single threaded read (disable smb multi channel on the server side, disable anti virus on the client side, do a simple copy to local NVME capable of 4GB/s+ writes). Whether the server is configured for LACP or a single NIC makes zero difference to this.

When we enable smb multi channel via any method (whether single IP on the server with RSS, whether multiple IPs on a single NIC via VLANs, whether multiple IPs on multiple NICs), read speeds come crashing down. The client struggles to peak above 1.4 Gbit/s reads. Windows task manager confirms the reads are splitting over multiple processes, and that when we configure multiple interfaces/VLANs/IPs on the client side, we can see load being shared over the NICs. However that load is tiny - 700 Mbit/s per interface.

The client can definitely read 25 Gbit/s of data via iperf3 on a single thread, and likewise the client has CPU to burn (i9-7940X with 28 threads). However at absolute best we're seeing around half the network read performance expected as soon as SMB is serving data. And again, the strangeness there is that writes to SMB on the same share happily hit line speed, even on a single write thread copying a single file.

I should probably note I did attempt to install TrueNAS SCALE to see if that made a difference, however the bootable installer crashed the system, and an attempted up/cross grade from CORE to SCALE caused the same crashes, so I reinstalled CORE fresh. I also played about with Ubuntu 22.04LTS on the same hardware which worked fine (including importing/mounting the ZFS pool), however performance differences across things like fio testing (both off raw NVME and off the ZFS pool), iperf3, etc were negligible between TrueNAS/FreeBSD and Ubuntu Linux.

At this stage I think the issue lies with Samba itself. Some reading let me to see that Samba doesn't use O_DIRECT from what I can tell, which definitely is needed to make things like fio get the results required when using "-direct=1". Likewise fio gives it's highest read speeds for 64k reads and up, with a notable drop using 32k reads or lower. I'm unsure what Samba is configured for internally when it comes to reads. There's a Samba VFS module that allows O_DIRECT for reads, but it's not bundled with CORE.

It is worth noting too that various "cat file | pv >/dev/null" and "dd if=file of=/dev/null bs=1m status=progress" type tests also don't give amazing results. cat yields a mere 750MB/s (which could be just the limit of pipes), and the dd test does hit around 4GB/s, which is faster than the smb read (still about 1/4 of what can be read with fio). This all leads me to think that the way Samba is configured to actually read from disk might be part of the problem? But I also don't understand why writes are then double the speed.

If anyone can point out anything obvious I've missed, or avenues to chase to figure out why this read performance is capped, I'd be very grateful.

Johnny Fartpants · Aug 18, 2023

How much RAM does the system have? Have you tried a config on the pool of just mirrors and tested that?

elvisa · Aug 18, 2023

Johnny Fartpants said:
How much RAM does the system have? Have you tried a config on the pool of just mirrors and tested that?

128 GB DDR5.

It's not a disk/storage problem that I can tell. The local storage happily delivers around 30 GB/s (300 Gbit/s) of sequential read throughput using tests like fio on the server itself. Whether that's presented as RAIDZ1 or simple stripes or even single drives (the latter capping out at 6 GB/s / 60 Gbit/s), it makes no difference to the SMB tests.

The Windows client can happily hit line speed when testing with iperf3 back to the server (24.3 Gbit/s single process). The storage can deliver well over 10x that throughput. As soon as SMB is in the mix, the speeds are hampered severely, and more than around 12 Gbit/s reads don't seem to want to occur to a single Windows client.

Again, storage tests orders of magnitude higher than this. Network throughput tests higher than this. CPU on all systems involved is never hitting 100% of a single thread, and in rarely sees activity on more than a single CPU.

I've also attempted a Linux install with a recent kernel and ksmbd (the Linux kernel SMB server). Server side CPU load is much lower than Samba with this configuration, however there's zero difference on the Windows client with this in place.

Playing with "Sett-SmbClientConfiguration" PowerShell options (and rebooting to be certain) on the Windows client makes negligible difference. Tweaking options like compression, encryption, large MTU, multi channel and others have almost zero impact.

I'm not seeing any obvious reason why this can't achieve more. But also, chatting with peers, I'm not seeing anyone achieving more than this upper limit of 12 Gbit/s reads to a single client. People are quick to tell me they can go higher, but on deeper investigation it's aggregate throughput to multiple clients, and their single client sustained/sequential read speeds are far lower (including those who have invested in 25GbE on the client, and found out that they're not getting close to that).

So before I send myself crazy - has anyone here witnessed higher read speeds than this to a single client? Or am I asking too much of Windows as a client OS?

elvisa · Aug 19, 2023

Some of the client-side tuning that is floating around the Internet:

Boost Windows Server File Sharing Performance | Some Notes on the Machines

notes.ponderworthy.com

As above, these make no difference. Rates are still capped at that magic 12Gbit/s mark. It also seems a lot of people are tuning to 10GbE workloads, and there's little documentation available of people pushing past that on a single client.

Somewhat interesting article here:

How Can I Help with the new TRUENAS / 100G testing?

L1 Team: I am interested in helping with any testing, brainstorming or otherwise with regard to ZFS performance over 100G. I have a rather high end setup with similar hardware to what was discussed in today’s video (TRUENAS SCALE: STATE OF THE BETA Q4 202) and would like to support in any...

forum.level1techs.com

There's a diskspd.exe command in there that yields higher results connecting from a Windows client to Linux+ksmbd (around 22Gbit/s), however it spawns far more read threads than standard Windows applications (hits peak sequential read speed around 5 threads or so, with no real benefit past that). I'll need to reinstall TrueNAS CORE to see what that looks like under that configuration, but it's also not much use when standard applications don't behave that way.

elvisa · Aug 21, 2023

So the best performance I've had to date is:
* Linux kernel 6.2 on the server, running ksmbd
* Linux kernel 6.3 on the client, running mount.cifs

That nets me close to 2.3 GB/s / 23 Gbit/s of SMB transfer speed testing with fio. Thread/process count can be 1 in that case, as long as the block size reads are 64k or higher. Any lower and speeds take a tumble.

The same testing with TrueNAS CORE and Samba on the server side (same client) yields only about 1.8 GB/s / 18 Gbit/s. However I do seem to need to get the thread count up on the client/fio side to hit that happily too, which is a worry.

Next test is putting a Windows VM on top of the Linux client and passing through the mount point using QEMU 8.0 and virtiofsd. Based on the testing I'm seeing people do in the thread I linked one post back, that might work. I'll need to see what control I've got over the virtual disk presented to Windows with that method, and whether or not I can increase things like queue depth and block size requests, or if I'm back to a complete lack of control inside of Windows clients.

Currently huge limiting factors appear to be Windows itself, as well as a complete lack of tuning for the SMB client within Windows. "Dumb" applications on top of this then do nothing clever with their requests, which sees their read rates abysmally low compared to what either the network stack or storage can achieve (less than half of what I can achieve with Linux under the right conditions).

As I mentioned in other posts, all of the tuning guides out there center around the server side dishing out data in parallel to a large number of client systems. There appears to be very few people documenting a low number of clients that need extremely fast reads. I've managed to contact a few peers running similar hardware, and they confirm their single system reads match the numbers I'm getting (and often at great surprise to them, as they expected much higher).

I wonder if there's a way to get the attention of Microsoft with this? Particularly for M&E clients moving into a raw 4K and soon 8K workflows (throw in Rec.2020/Rec.2100, wide colour gamuts, HDR, etc which are all going to make uncompressed content sizes hit very high numbers), and these data rates to single client systems are no longer going to be niche desires.

sretalla · Aug 22, 2023

elvisa said:
I wonder if there's a way to get the attention of Microsoft with this?

Use MS SMB stack on a Windows server.

They don't care at all about SAMBA as it nets them no revenue.

elvisa · Aug 22, 2023

sretalla said:
Use MS SMB stack on a Windows server.

They don't care at all about SAMBA as it nets them no revenue.

We did do that as part of our testing. Windows Server 2022 to a Windows 11 client yielded a very disappointing 7 Gbit/s / 700 MB/s peak, even with every single Microsoft technet recommended tweak on the server side.

I'm still wondering if anyone has seen better numbers using SMB over IP than what I've achieved here. The best I've seen involve SMB Direct (RDMA), which my client side cards can't do.

I can demonstrate much higher numbers with Linux as a client, so the doesn't appear to be the storage/server (although I'll be very clear that Linux+ksmbd on the storage/server is faster). However ignoring that, there's a clear ~25% performance loss using Windows compared to Linux even with the slowest of the non-Windows server options in play.

Repeating again that every tuning guide concentrates on raw non-SMB network speed (we have that already) or server side performance (we have that already), and there's very little that seems to concentrate on Windows SMB client performance tuning. If anyone has any light to shed on that, I'd be greatful.

morganL · Aug 23, 2023

Hi @elvisa

You confirmed that jumbo frames are set-up in client, switch and trueNAS,
When doing a sustained reads, you should see a high average packet size and no tcp drops/retransmits.
The TrueNAS is connected to the switch at what speed relative to the client?

It probably a good idea to confirm the detailed hardware specs of the client.

Can you describe the actual read performance test?... a problem often occurs with small read sizes if the queue depth is low. The client spends time waiting for 1 read to complete before issuing another request. Queue depth enables parallelism and increases bandwidth.

Writes have less issue because the NAS response can be faster.

elvisa · Aug 23, 2023

morganL said:
Hi @elvisa

You confirmed that jumbo frames are set-up in client, switch and trueNAS,
When doing a sustained reads, you should see a high average packet size and no tcp drops/retransmits.
The TrueNAS is connected to the switch at what speed relative to the client?

It probably a good idea to confirm the detailed hardware specs of the client.

Can you describe the actual read performance test?... a problem often occurs with small read sizes if the queue depth is low. The client spends time waiting for 1 read to complete before issuing another request. Queue depth enables parallelism and increases bandwidth.

Writes have less issue because the NAS response can be faster.

Client specs are a Dell Precision 5820 workstation with a Xeon W-2275 CPU, 64GB DDR4. Installed NIC is the xxv710-DA2 mentioned in the OP. Everything is connected physically by SFP28 to a Dell S5212F-ON switch which is running the latest SmartFabric OS10 firmware.

Raw network performance between client and server are confirmed by not only iperf3 (25GbE line speed achieved on a single thread), but by also running Linux on the client, mounting the storage via mount.cifs, and testing that way. We confirm > 1500 byte packets with wireshark. Definitely no fragmentation, drops or retransmits.

fio performance numbers with Linux kernel 6.3 and mount.cifs (i.e.: kernel space CIFS mount support) as a client are as follows to the following server installs (same hardware as OP, but with the following OS and software installs):

* TrueNAS CORE / FreeBSD + Samba 4.15 = 1.7GB/s (~17Gbit/s)
* Linux (kernel 6.2) + Samba 4.17 = 1.8 GB/s (~18Gbit/s)
* Linux (kernel 6.2) + ksmbd = 2.0 GB/s (~20Gbit/s)

No matter what the server side configuration listed above, if the client is Microsoft Windows 10 Pro or 11 (we've tried both, fully patched to latest OS versions, latest chipset/NIC/etc drivers), we're capped at around 1.2 GB/s.

Our performance tests are two fold. For synthetic tests under Windows, we use a combination of fio and diskspd.exe. These are fine and give us nice control over things like queue depth, process/thread count, direct IO, etc. However they're entirely synthetic because they never translate to our actual client applications. With these we can eke out slightly more performance (maybe peaking at 1.3GB/s).

For real world tests, we get identical results out of tools like DJV (an open source DPX frame reader), DaVinci Resolve, Adobe Premier, and the AJA disk speed tests (the latter nice tool because it's easy to download and benchmark, and gives real world performance numbers closer to the applications listed).

Very specifically, we need to read 4K uncompressed 10bit RGB frames from the storage in realtime to the client to review, fix/grade/correct, and save out again. That appears to be around 1.5-1.6GB/s or so (we're seeing that with writes when the footage is captured in realtime and written to the storage). Reading these back in from the storage results in slower speeds (yes, writes are faster than reads under Windows clients).

Repeating that we've confirmed the storage is not the culprit, as we can trivially push this an order of magnitude faster than the peak network speeds with tools running locally on the storage itself.

Ultimately the synthetic tests are nice but somewhat useless, as they don't reflect the tools being used. With that said, under Windows especially, they vary little. In fact there are even times when Windows multi-channel benchmarks are lower than when we disable multi-channel, which seems particularly odd.

Work-around workflows like "copy to local NVME, edit, copy back" aren't practical because of the sheer size of the media in question. These need to live on the fast NAS storage for space considerations (far in excess of what we can build practically on a local DAS storage system).

As far as I can tell, I've confirmed it is not any of the following:
* The network - MTU, iperf3 performance, settings (we've tried all combinations and permutations of hardware offload enable/disable, etc), etc
* The storage - all tests on the storage itself show this as fine (order of magnitude faster than required)
* The storage system's CPU - we're seeing this barely hit a few percent of a single core/thread
* The client system's CPU - it's high, but not ever 100% of a single thread. Maybe 80% absolute peak (again, of a single thread - see notes above about multi-channel weird behaviour, but either way it doesn't help)
* Either the client hardware or the server hardware as a whole - Linux running on both at the same time demonstrates we can hit 20 GBit/s real world sustained reads. As soon as Windows is in the mix, these numbers come crashing down.

What's left? Seems to be "Windows" - or at least, I can't think of anything else.

My outstanding questions that I can't find answers to anywhere on the wider Internet nor talking to industry peers:

1) Has anyone anywhere ever reported faster real-world results than the numbers reported here out of a Windows client mounting SMB over IP (not RDMA aka "SMB Direct")? I see people beating these numbers using RDMA, but we don't have that option, unfortantely. Are we simply hitting the upper peak of what Windows can do? And if not, what's the difference in their setup versus ours?

2) Are there any tuning options for the Windows SMB stack that aren't covered in the extremely limited options presented by
Set-SmbClientConfiguration (also ignoring trivial things like "check your MTU, check your hardware offloading", which have all been tested and confirmed).

I'm in the process of trying to build a Linux client running a Windows VM inside QEMU, and presenting a Linux "mount.cifs" storage path back up to virtual Windows via virtiofsd (RE: breadcrumbs in this thread). It's ugly, it's complex, it's nasty, but currently it's the only demonstrated way I've seen anyone manage to beat the numbers I've mentioned above, using the hardware we have available to us. I have no idea if it works yet, but I'll report back with whatever I find.

morganL · Aug 23, 2023

elvisa said:
Client specs are a Dell Precision 5820 workstation with a Xeon W-2275 CPU, 64GB DDR4. Installed NIC is the xxv710-DA2 mentioned in the OP. Everything is connected physically by SFP28 to a Dell S5212F-ON switch which is running the latest SmartFabric OS10 firmware.

Raw network performance between client and server are confirmed by not only iperf3 (25GbE line speed achieved on a single thread), but by also running Linux on the client, mounting the storage via mount.cifs, and testing that way. We confirm > 1500 byte packets with wireshark. Definitely no fragmentation, drops or retransmits.

fio performance numbers with Linux kernel 6.3 and mount.cifs (i.e.: kernel space CIFS mount support) as a client are as follows to the following server installs (same hardware as OP, but with the following OS and software installs):

* TrueNAS CORE / FreeBSD + Samba 4.15 = 1.7GB/s (~17Gbit/s)
* Linux (kernel 6.2) + Samba 4.17 = 1.8 GB/s (~18Gbit/s)
* Linux (kernel 6.2) + ksmbd = 2.0 GB/s (~20Gbit/s)

No matter what the server side configuration listed above, if the client is Microsoft Windows 10 Pro or 11 (we've tried both, fully patched to latest OS versions, latest chipset/NIC/etc drivers), we're capped at around 1.2 GB/s.

Our performance tests are two fold. For synthetic tests under Windows, we use a combination of fio and diskspd.exe. These are fine and give us nice control over things like queue depth, process/thread count, direct IO, etc. However they're entirely synthetic because they never translate to our actual client applications. With these we can eke out slightly more performance (maybe peaking at 1.3GB/s).

For real world tests, we get identical results out of tools like DJV (an open source DPX frame reader), DaVinci Resolve, Adobe Premier, and the AJA disk speed tests (the latter nice tool because it's easy to download and benchmark, and gives real world performance numbers closer to the applications listed).

Very specifically, we need to read 4K uncompressed 10bit RGB frames from the storage in realtime to the client to review, fix/grade/correct, and save out again. That appears to be around 1.5-1.6GB/s or so (we're seeing that with writes when the footage is captured in realtime and written to the storage). Reading these back in from the storage results in slower speeds (yes, writes are faster than reads under Windows clients).

Repeating that we've confirmed the storage is not the culprit, as we can trivially push this an order of magnitude faster than the peak network speeds with tools running locally on the storage itself.

Ultimately the synthetic tests are nice but somewhat useless, as they don't reflect the tools being used. With that said, under Windows especially, they vary little. In fact there are even times when Windows multi-channel benchmarks are lower than when we disable multi-channel, which seems particularly odd.

Work-around workflows like "copy to local NVME, edit, copy back" aren't practical because of the sheer size of the media in question. These need to live on the fast NAS storage for space considerations (far in excess of what we can build practically on a local DAS storage system).

As far as I can tell, I've confirmed it is not any of the following:
* The network - MTU, iperf3 performance, settings (we've tried all combinations and permutations of hardware offload enable/disable, etc), etc
* The storage - all tests on the storage itself show this as fine (order of magnitude faster than required)
* The storage system's CPU - we're seeing this barely hit a few percent of a single core/thread
* The client system's CPU - it's high, but not ever 100% of a single thread. Maybe 80% absolute peak (again, of a single thread - see notes above about multi-channel weird behaviour, but either way it doesn't help)
* Either the client hardware or the server hardware as a whole - Linux running on both at the same time demonstrates we can hit 20 GBit/s real world sustained reads. As soon as Windows is in the mix, these numbers come crashing down.

What's left? Seems to be "Windows" - or at least, I can't think of anything else.

My outstanding questions that I can't find answers to anywhere on the wider Internet nor talking to industry peers:

1) Has anyone anywhere ever reported faster real-world results than the numbers reported here out of a Windows client mounting SMB over IP (not RDMA aka "SMB Direct")? I see people beating these numbers using RDMA, but we don't have that option, unfortantely. Are we simply hitting the upper peak of what Windows can do? And if not, what's the difference in their setup versus ours?

2) Are there any tuning options for the Windows SMB stack that aren't covered in the extremely limited options presented by
Set-SmbClientConfiguration (also ignoring trivial things like "check your MTU, check your hardware offloading", which have all been tested and confirmed).

I'm in the process of trying to build a Linux client running a Windows VM inside QEMU, and presenting a Linux "mount.cifs" storage path back up to virtual Windows via virtiofsd (RE: breadcrumbs in this thread). It's ugly, it's complex, it's nasty, but currently it's the only demonstrated way I've seen anyone manage to beat the numbers I've mentioned above, using the hardware we have available to us. I have no idea if it works yet, but I'll report back with whatever I find.

I this the question is "why is Windows slower"?

There are two culprits I can think of:

TCP - are you sure there are no losses/retransmits? Windows and BSD use different algorithms and setting. A setting might be impacting latency/throughput.

Queue depth... if you adjust fio queue depth to 1 do you get similar performance?

morganL · Aug 23, 2023

Can you verify the TCP windows sizes...... at each end.
At 25Gbit/s = 3GByte/s... even at 1ms latency, you would need 3MByte
If its faster, 1MB would be needed... that larger than windows defaults.

elvisa · Aug 23, 2023

morganL said:
Queue depth... if you adjust fio queue depth to 1 do you get similar performance?

Yes, with low queue depth the Windows fio tests are very low.

Is there a way on a Windows SMB client to increase the queue depth generically? I've seen options to increase threads per queue for the SMB client (these make no difference), but nothing for queue depth. (All documentation I've seen talks about physical disks only, never SMB connections).

elvisa said:
I'm in the process of trying to build a Linux client running a Windows VM inside QEMU, and presenting a Linux "mount.cifs" storage path back up to virtual Windows via virtiofsd

This was a bust. Performance was quite low (far lower than the linked threads), whether from local disk or re-exporting a cifs mount. Unsure why, but the complexity of the solution isn't worth it at this point.

morganL said:
Can you verify the TCP windows sizes...... at each end.

I'll investigate.

morganL · Aug 23, 2023

Queue depth is an application issue. The testing software can adjust. eg fio
A lot of disk test software doesn't assume queue depth matters.... but when testing a storage system or NVMe drive, queue depth matters quite a lot.

TCP window size can both add latency and reduce throughput.... windows scaling is important at higher speeds.

morganL · Aug 23, 2023

There is another cause to check.. antivirus software. It adds latency.. which might exaggerate any queue depth issues.

SOLVED - Slow Windows R/W Performance over 10GbE with my TrueNAS box

Oh, I am surprised that the Intel went down to 2.5 - I had assumed that that just wouldn't work. I have a 10Gb client (windows), with a 10Gb server (TrueNAS Core) and a pool of SSD's with some spare space I can run some tests on for you to see if I can replicate. When I read a large file from a...

www.truenas.com

elvisa · Aug 24, 2023

morganL said:
There is another cause to check.. antivirus software. It adds latency.. which might exaggerate any queue depth issues.

SOLVED - Slow Windows R/W Performance over 10GbE with my TrueNAS box

Oh, I am surprised that the Intel went down to 2.5 - I had assumed that that just wouldn't work. I have a 10Gb client (windows), with a 10Gb server (TrueNAS Core) and a pool of SSD's with some spare space I can run some tests on for you to see if I can replicate. When I read a large file from a...

www.truenas.com

This one we've definitely already covered. That was flagged early as a contributor to high CPU as well as read slowness, and disabled for all benchmarks listed here.

NickF · Aug 24, 2023

Preface: I'm reading through this and it's difficult to read. Please add some BOLD words or COLORED sections...break in with some screenshots...any kind of formatting...so people can follow more easily. I literally have this same thread open in two windows so I can actually respond to you effectively. /rant TY

Questions/Context: I think you have documented well enough the hardware you are using, and the sharing protocol you are using. But WHAT is the purpose of the system. Here are some more specific questions, but any context would help. With some context, I may have a few paths forward and things to try.

How fungible is the data you are serving? I have many ideas, but I don't know what the risks involved are so I can't advise.
Do you have lots and lots of big files, lots and lots of small files?
What types of files are we talking about? Pictures, scientific research, video editing, Hyper-V vms, what?
Are you trying to get One or two clients REALLY, REALLY fast?
Are you trying 60 clients all simultaneously accessing the same system and still getting better performance than a local sata SSD?
Just so I am clear - you are currently on CORE? There are some tunables on Linux that are not on FreeBSD. I know you ALSO have issues getting SCALE booting... with would you be interested in working through them?

elvisa · Aug 24, 2023

NickF said:
How fungible is the data you are serving? I have many ideas, but I don't know what the risks involved are so I can't advise.

Willing to consider alternatives at this point. I've just done some iSCSI testing here to investigate as an alternative, but it's proven similarly disappointing on reads (similar caps at around 1000MB/s).

NickF said:
Do you have lots and lots of big files, lots and lots of small files?

Mostly large media files. Our current working set are 4K 10bit-RGB DPX frames, clocking in at 60MB per frame. These come from 24FPS film, so simple maths there is 60*24=1440MB/s required to read these real time.

Storing these on direct-attached storage doesn't work for the workflow, nor the size/scale of the data that is going to be used.

NickF said:
What types of files are we talking about? Pictures, scientific research, video editing, Hyper-V vms, what?

As above, mostly media, and mostly frame based. There could be other (video data inside compressed containers), but this specific 4K workload is the one that matters.

NickF said:
Are you trying to get One or two clients REALLY, REALLY fast?

A few more than that, but yes. Single client performance is the critical element, and there seems to be very little guidance from people like Microsoft on that matter. Most of their focus appears to be on many users accessing small files, and the aggregate performance of that. Understandable, as that's pretty banal average corporate stuff. But we're looking more towards what you've said there - fewer clients, higher speeds per client, with a hard minimum cap required that we're just falling short of.

NickF said:
Are you trying 60 clients all simultaneously accessing the same system and still getting better performance than a local sata SSD?

We've not tried that many. But a mixed read/write test with maybe 5 clients shows little relative impact on the storage side. Samba spawns a new process per unique client IP, and the numbers appear much the same for clients even in parallel.

NickF said:
Just so I am clear - you are currently on CORE? There are some tunables on Linux that are not on FreeBSD. I know you ALSO have issues getting SCALE booting... with would you be interested in working through them?

So far I've tested:

* TrueNAS CORE (FreeBSD) + Samba 4.15
* Linux (kernel 6.2) + Samba 4.17
* Linux (kernel 6.2) + ksmbd

With these three configurations, and specific tuning done on the SERVER side for each, there's almost no difference on single Windows client performance.

When we put Linux on the CLIENT side, we see (a) much better read performance, and (b) a difference between the three. Not a lot (maybe +10% as you move down that list).

All of the testing to date seems to indicate the problem is Windows as an SMB client, and it's complete absence of tuning available to the user. We've been digging at length through all sorts of registry settings, PowerShell cmdlets, and other things for the SMB layer, the TCP/IP layer, and the physical network card properties. None make any positive impact.

More importantly, I've yet to see anyone post higher real world numbers online using standard SMB over IP. I can see higher numbers from people testing SMB over RDMA (aka "SMB Direct"), but not IP.

Yes, we're interested in testing TrueNAS SCALE. Access to the system is starting to become limited now, so I'll need to organise an outage to test the latest beta. Latest stable wouldn't boot, and the failure was immediately on OS boot / kernel load. However this hardware happily runs Linux, so I'm sure we can find a solution to get TrueNAS SCALE working. However the testing I've done with other distros on the server side indicate to me that they won't deliver much more performance to Windows, as again I suspect Windows as a client is the issue.

NickF said:
Preface: I'm reading through this and it's difficult to read. Please add some BOLD words or COLORED sections...break in with some screenshots...any kind of formatting...so people can follow more easily. I literally have this same thread open in two windows so I can actually respond to you effectively. /rant TY

My apologies. These posts are usually done by me at 2am after long days of troubleshooting, OS reinstalling, testing, etc. They're not at my usual standard of clarity, and I'm often just brain-dumping immediately before I head to sleep.

I plan to at some point put together a simple table of things I've tested, and their resulting speeds. I hope that makes it easy for people to see all of the combinations and permutations tested so far.

And again, I invite anyone to share real world numbers of single client performance that are faster than our numbers. I sincerely hope they exist, and we find some silly tweak somewhere that magically makes this better. But as it stands it's quite frustrating to see Windows SMB limp along at 40% of the network card's upper limit, particularly when our requirement is only around 60% to achieve real-time use.

morganL · Aug 25, 2023

elvisa said:
Willing to consider alternatives at this point. I've just done some iSCSI testing here to investigate as an alternative, but it's proven similarly disappointing on reads (similar caps at around 1000MB/s).

If iSCSI with fio (and queue depth) is similarly slow, that's a clue that the issue is at the TCP or below level.

TCP windows size or scaling being too low would cause the same issue on SMB or iSCSI

NickF · Aug 25, 2023

LOL! My response is also at 3AM now. So I do understand. Here's some nuggets of information until I can form a more cohesive response

The TLDR; is that extracting the full potential of NVME drives on even the most modern current generation platforms is challenging for more than just TrueNAS.

I did alot of messing around with NVME last year

TrueNAS Scale NVME Performance Scaling

Over the past year or so I have been obsessively exploring various aspects of ZFS performance, from large SATA arrays with multiple HBA cards, to testing NVME performance. In my previous testing I was leveraging castoff enterprise servers that were Westmere, Sandy Bridge and Ivy Bridge based...

www.truenas.com

I learned a few lessons since then. This conversation from last week may help:

Z2 or Z3 for SSD pool?

I'm in the process of migrating my pool on Scale from hard drive (3-way mirror of 12 TB drives) to SSD using 4 TB NVME modules. I'm deciding between Z2 and Z3. That is: 5 x 4 TB SSD modules @ RAID Z2 ==> 12 TB capacity; or 6 x 4 TB SSD modules @ RAID Z3 ==> 12 TB capacity. I'd rather save the...

www.truenas.com

Here's some more breadcrumbs to look at:

Performance tuning — openzfs latest documentation

There is no GOOD resource for NVME performance scaling on ZFS. I've looked, and I've looked a lot. I have been interested and been trying to piece it together bits at a time, but I don't have enough information to really hammer it in yet.

My general, if perhaps incorrect, assumptions around your problem.
I think we collectively fall into the trap of accepting things "as givens" that we should not.

I can double check the math, but I'm fairly confident...NVME drives in this quantity are theoretically faster than your system RAM. This is not an eventuality that I think software developers ever planned for. At the very least it's orders of magnitude faster than the hard drives alot of code was written for.
We tend to forget that RAM is just faster storage than disks. The same factors matter in both storage mediums, latency, IOPS, bandwidth.
To make a badly bottlenecked situation worse, we have to factor in other bandwidth requirements on the bus, particularly your network adapters.
Storage systems in general, and NVME in particular tend to have "better" performance scaling with more clients. Low client counts work against our goals of MIN/MAX ing performance, but some tweaks can swing performance in low client count's favor.
NVME behaves best when it's got queue depths greater than 1. We need to feed the beast. Even more at odds with low client counts.
There are many things, even in 2023, that still rely heavily on single-threaded CPU performance characteristics (Frequency, IPC)

If you've tried any of these already I am sorry I am tired. These are some disjointed ideas on starting points. In no particular order, but there are going to be compromises you have to accept.

Use a Mirrored ZFS pool topology. RAIDZ requires CPU cycles to calculate parity.
Try disabling compression Compression increases the demand on both CPU and RAM.
zfs set compression=off pool/dataset
Try only caching metadata: Your pool is faster than your RAM. At best, you're going to make performance worse by trying to cache it.
zfs set primarycache=metadata pool/dataset
Try decreasing the record size of your dataset Typically for video I'd recommend making it larger...but NVME behaves differently. From my testing, it seems to prefer more, smaller fetches, of data, instead of fewer, larger ones, But I could be wrong.
zfs set recordsize=32k pool/dataset
Disable Prefetch on your system: We already have memory bandwidth saturation. Let's not make it worse by fetching extra blocks.
vfs.zfs.prefetch_disable="1"

If you want to color outside the lines a bit, read this:

Poll queueing on Linux (and maybe on SCALE?) might be the ultimate "solution". But with Linux kernel level tweaks there are always caveats and potential stability concerns...so YMMV
https://www.linkedin.com/pulse/tuning-performance-intel-optane-ssds-linux-operating-systems-ober/

morganL · Aug 25, 2023

NickF said:
LOL! My response is also at 3AM now. So I do understand. Here's some nuggets of information until I can form a more cohesive response

The TLDR; is that extracting the full potential of NVME drives on even the most modern current generation platforms is challenging for more than just TrueNAS.

I did alot of messing around with NVME last year

TrueNAS Scale NVME Performance Scaling

Over the past year or so I have been obsessively exploring various aspects of ZFS performance, from large SATA arrays with multiple HBA cards, to testing NVME performance. In my previous testing I was leveraging castoff enterprise servers that were Westmere, Sandy Bridge and Ivy Bridge based...

www.truenas.com

I learned a few lessons since then. This conversation from last week may help:

Z2 or Z3 for SSD pool?

I'm in the process of migrating my pool on Scale from hard drive (3-way mirror of 12 TB drives) to SSD using 4 TB NVME modules. I'm deciding between Z2 and Z3. That is: 5 x 4 TB SSD modules @ RAID Z2 ==> 12 TB capacity; or 6 x 4 TB SSD modules @ RAID Z3 ==> 12 TB capacity. I'd rather save the...

www.truenas.com

Here's some more breadcrumbs to look at:

Performance tuning — openzfs latest documentation

There is no GOOD resource for NVME performance scaling on ZFS. I've looked, and I've looked a lot. I have been interested and been trying to piece it together bits at a time, but I don't have enough information to really hammer it in yet.

My general, if perhaps incorrect, assumptions around your problem.
I think we collectively fall into the trap of accepting things "as givens" that we should not.

I can double check the math, but I'm fairly confident...NVME drives in this quantity are theoretically faster than your system RAM. This is not an eventuality that I think software developers ever planned for. At the very least it's orders of magnitude faster than the hard drives alot of code was written for.

We tend to forget that RAM is just faster storage than disks. The same factors matter in both storage mediums, latency, IOPS, bandwidth.

To make a badly bottlenecked situation worse, we have to factor in other bandwidth requirements on the bus, particularly your network adapters.

Storage systems in general, and NVME in particular tend to have "better" performance scaling with more clients. Low client counts work against our goals of MIN/MAX ing performance, but some tweaks can swing performance in low client count's favor.

NVME behaves best when it's got queue depths greater than 1. We need to feed the beast. Even more at odds with low client counts.

There are many things, even in 2023, that still rely heavily on single-threaded CPU performance characteristics (Frequency, IPC)

If you've tried any of these already I am sorry I am tired. These are some disjointed ideas on starting points. In no particular order, but there are going to be compromises you have to accept.

Use a Mirrored ZFS pool topology. RAIDZ requires CPU cycles to calculate parity.

Try disabling compression Compression increases the demand on both CPU and RAM.
zfs set compression=off pool/dataset

Try only caching metadata: Your pool is faster than your RAM. At best, you're going to make performance worse by trying to cache it.
zfs set primarycache=metadata pool/dataset

Try decreasing the record size of your dataset Typically for video I'd recommend making it larger...but NVME behaves differently. From my testing, it seems to prefer more, smaller fetches, of data, instead of fewer, larger ones, But I could be wrong.
zfs set recordsize=32k pool/dataset

Disable Prefetch on your system: We already have memory bandwidth saturation. Let's not make it worse by fetching extra blocks.
vfs.zfs.prefetch_disable="1"

If you want to color outside the lines a bit, read this:

Poll queueing on Linux (and maybe on SCALE?) might be the ultimate "solution". But with Linux kernel level tweaks there are always caveats and potential stability concerns...so YMMV
https://www.linkedin.com/pulse/tuning-performance-intel-optane-ssds-linux-operating-systems-ober/

@NickF @elvisa has established the issue is only with windows clients. We have talked on the phone.

Important Announcement for the TrueNAS Community.

SMB performance not keeping up with fast NVME drives

Dabbler

Guru

Dabbler

Dabbler

Dabbler

Powered by Neutrality

Dabbler

Captain Morgan

Dabbler

Captain Morgan

Captain Morgan

Dabbler

Captain Morgan

Captain Morgan

Dabbler

Guru

Dabbler

Captain Morgan

Guru

Captain Morgan

Similar threads