SOLVED: Slow SMB write speeds on TrueNAS-12.0-U6

Ralms

Dabbler
Joined
Jan 28, 2019
Messages
29
Hi all,

I noticed yesterday that my TrueNAS box's write speeds over SMB have become very slow for some reason.

In the past I could easily max out my 1Gbit LAN connection, but now it only does around 55MB/s.
Reading, on the other hand, still works fine and maxes out the connection at 110MB/s.
So I really have no idea what is going on with the writes.

My config is an HP DL380p Gen8:
CPU: E5-2630L v2 @ 2.40GHz (average usage is around 20% max)
RAM: 32GB ECC

5 spinning disks in RAID-Z1.
Current pool used space: 86% (1.48TiB free)
Data being transferred: Media and also tested with large Zip files.
Client PC: Ryzen 3800X, Intel NIC, running Windows 10 Pro 20H2.

Troubleshooting performed:
  1. Added a 250GB NVMe drive (Samsung 960 Evo) as cache to see if it would make a difference, but got the exact same results with or without it.
  2. Increased SMB threads with "aio max threads = 10"; no difference with or without it.
  3. Set "aio write size = 1"; no difference either (both settings are shown in the snippet after this list).
  4. All 5 disks sit at around 12% busy, doing only about 13MB/s each.
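For reference, the two settings above were applied as SMB auxiliary parameters (under the SMB service settings), like so:

    aio max threads = 10
    aio write size = 1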
I seriously have no idea what is going on, considering in the past I could hit 110MB/s writes all the time.
To clarify, my reads are fine; just the writes are the problem.

Any idea?

-------------------------

Edit: TL;DR = My Desktop and TrueNAS server are on separate subnets, which forces traffic to be routed by my VyOS router. For some reason, the router maxes out one core with interrupts, but only when traffic goes from my Desktop to TrueNAS. The other way around is fine.
Either way, it's a performance issue on the router, not TrueNAS.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Your pool is too full. This is the expected drop in performance when above 80%.
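You can verify this from the shell, which shows both capacity and fragmentation (pool name tank is a placeholder):

    zpool list -o name,size,alloc,free,cap,frag tank

CAP is the percentage allocated, and FRAG is a measure of free-space fragmentation.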
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
Not sure if it's the case, but try looking at how busy your disks are - the fill rate of your pool is quite high, so the disks might seek a lot to get to free space.

@Samuel Tai just beat me to it :smile:
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
For an explanation of why you can expect performance to degrade above 80%, see @jgreco's excellent post:
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
And while the answer is more-or-less there in the block storage post, you are basically fighting several things.

1) Fragmentation, or, well, really, "large scale" fragmentation. On a ZFS pool where you only ever write content and do not delete content ("archival use"), you can pretty much write at near full speed right up 'til 99% pool capacity. ZFS will tend to keep laying data down in the free space that it had already looked up. For ${reasons} it isn't QUITE as good as what I just said, but it isn't a huge dropoff. This means that there is minimal disk seeking to stash writes.

2) ZFS is having to work much harder to find free storage. ZFS is managing a "virtual disk" that is much larger than your average filesystem, and it makes great efforts to maintain as much of the metadata in memory as it can. However, if it burns through the easily available space in the metaslab space maps it has in ARC, then it has to go out and pull others from disk, and as free space plummets, this process yields less contiguous free space per retrieval, which means you have to read more and more metadata in order to be finding less and less good candidates for space. This gets significantly worse at 4% free space when the allocator moves from first-fit to best-fit.
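If you're curious, you can inspect the metaslab space maps yourself with zdb; pool name tank is a placeholder here, and note that while this is read-only, it can take a while on a large pool:

    zdb -mm tank

The more space map entries the allocator has to chew through per metaslab, the more work it's doing to find you free space.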

Lots of people talk about some magic number like 80% or 85% or 96% where ZFS suddenly starts exhibiting much worse performance, but in my experience, it's really much more of a long curve that starts really out at maybe 10%, with the falloff being barely noticeable below 25%, then barely noticeable for normal large-file usage up to about 50-60%, and gets worse from there. It is dictated by the number of write-delete cycles and the size of objects being written and deleted. The end result is that you get the sort of thing outlined in the Delphix steady state experiment from years ago;

https://extranet.www.sol.net/files/freenas/fragmentation/delphix-small.png

It's mainly just a matter of what the numbers end up being, and where the big dip in the curve happens -- it will generally move towards the left of the graph as the pool ages and write cycles increase, unless you have an event where you clear a bunch of space off the pool.
 

Ralms

Dabbler
Joined
Jan 28, 2019
Messages
29
@c77dk @Samuel Tai I have a hard time accepting that a pool with 1.4TiB free now performs this horribly.
Especially considering the disks are almost idle: the average busy time per disk is 12%, very, very low.

CPU usage is very low, memory is plenty for this volume size, disks are not maxed out, not even close.

@jgreco Thank you for the insights.
Yeah, that would be my expectation also, but when I look at Disk I/O, the disks are not reading at all, leading me to believe it's fine for the file size I'm transferring.
The screenshot below was taken while copying a 15GB file; it transferred 8GB and I canceled, but you can see 0 reads during this write process.
[Screenshot: Disk I/O graphs showing zero reads during the write]


Disk Busy:
[Screenshot: disk busy graphs]


I'm only showing 3 of the 5 drives due to the size of the screenshots, but the other 2 are doing the same.

Because of this I don't think it's due to the pool being at 86% full; it still has 1.39TiB free and I'm copying something like 50 to 100GB of files.
If the available space were the problem in this case, it would be a massive optimization issue in ZFS.

Any other idea?

Thank you.
 

Ralms

Dabbler
Joined
Jan 28, 2019
Messages
29
An update regarding my write issue, and a counter-argument to everyone leaning on the high storage usage of my main pool, even though there is no indication of that being the issue.

I just created an SMB share on a pool that is 7% used (207GiB free), running on a single Samsung SSD.
And even against a pretty much empty SSD, writes don't go over 40MB/s.

This is a 50GB file transfer:
[Screenshot: 50GB file transfer showing writes below 40MB/s]


So this leaves me with 3 possible issues:
- Networking - This is unlikely; it's an Intel NIC on the client desktop running at 1Gbit/s, and the NAS also uses Intel NICs, with 2 ports in a LAGG. I've had this setup for a very long time, though I'm only using 2 of the 4 LAGG ports I have configured.
- Issue with TrueNAS.
- Issue with the client.

Any ideas or suggestions would be appreciated.

Thanks.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The plot thickens.

What are your iperf3 numbers showing?

How does it work when you disassemble the lagg and just use a single ethernet interface?

LACP can be troublesome even under good conditions.
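For reference, a minimal iperf3 check in both directions looks like this (the hostname is a placeholder):

    # On the TrueNAS box:
    iperf3 -s
    # On the client, normal direction (client sends):
    iperf3 -c truenas.local
    # Reverse direction (server sends):
    iperf3 -c truenas.local -R

If the two directions differ wildly, the network path is your suspect.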

 

Ralms

Dabbler
Joined
Jan 28, 2019
Messages
29
I've found the source of the issue: it's my Desktop, for some reason. It's clearly a software problem, since the hardware is more than capable.
I tested with a high-end workstation laptop and straight away got a constant 112MB/s.

So this is clearly an issue with my main desktop client.

Interestingly enough, iperf3 shows the same limitation.
  • Desktop as client, TrueNAS as server: 482 Mbits/sec (574 MBytes transferred)
  • TrueNAS as client, Desktop as server: 906 Mbits/sec (1.05 GBytes transferred)
This is the part I don't get: why it affects only traffic going from my Desktop to TrueNAS but not the other way around.

I wonder if it's something with my VyOS router, as TrueNAS and the Desktop are on separate subnets, while the test with the laptop was done on the same subnet as TrueNAS.
Need to run some more tests, but we are getting somewhere :D
 

Ralms

Dabbler
Joined
Jan 28, 2019
Messages
29
@jgreco Wow, interesting... I checked top on my VyOS router while doing the iperf tests again.

Desktop -> TrueNAS (the affected direction): I have an interrupt reaching 99% CPU, essentially maxing out a single core, which would explain the performance hit.

TrueNAS -> Desktop: the max CPU usage I see on a single interrupt is like 5%.
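For anyone following along, besides top, the per-NIC interrupt counts on the Linux-based VyOS can be checked with something like this (interface names are placeholders):

    grep eth /proc/interrupts

Watching that while the transfer runs shows which queue/core is getting hammered.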

So I might be missing some NIC offload in the router.
The more we learn lol
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So LACP *and* passing traffic through a software router. Not a "1Gbit LAN connection". See, this is what can be frustrating. When you fail to describe your environment accurately, it makes us puzzle our way through a problem, looking for potential problems, and then have it turn out to be the fact that you were routing traffic from a LACP uplink to a different network.

Your problem may be as simple as you're swamping the router. It isn't going to be TCP offload, because routers only PASS packets; they do not validate or reassemble them (which is "offload" stuff). It could be interrupt coalescing stuff. Without insight into the way you have your router set up, it's hard to say, but you can look at factors like whether the router is finding it difficult to send packets in some direction and needing to queue. This is one of the classic causes of excessive CPU usage on a router, as are things like fragmentation when you have different MTUs on your networks.
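A quick don't-fragment ping will expose an MTU mismatch on the path (hostnames are placeholders; 1472 = 1500 minus 28 bytes of IP/ICMP headers). On the Windows client:

    ping -f -l 1472 truenas.local

And from the TrueNAS (FreeBSD) shell:

    ping -D -s 1472 desktop.local

If those fail while smaller sizes succeed, something on the path has a smaller MTU.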
 

Ralms

Dabbler
Joined
Jan 28, 2019
Messages
29
I understand; however, the LACP is being performed by the switch and has been working for several years at this point without any issue.

The router, on the other hand: when VyOS moved from 1.2 to 1.3, they disabled all offloads by default, so they have to be explicitly enabled now.
This caused some performance issues on my setup, but it had felt fixed for a couple of months now.
Since reenabling the offloads I can easily pull 1Gbit/s from the Internet, so I was not expecting this edge case.
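For reference, reenabling the offloads on VyOS 1.3 looks roughly like this (eth1 here is a placeholder for the LAN-facing interface):

    configure
    set interfaces ethernet eth1 offload gro
    set interfaces ethernet eth1 offload gso
    set interfaces ethernet eth1 offload sg
    set interfaces ethernet eth1 offload tso
    commit
    save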

As there are few to no firewall rules between the subnets, I didn't expect the router to be the bottleneck.
Like everything in IT, if nobody ever forgot anything, there wouldn't be issues.
Also, nobody asked about the network topology either, and instead focused on a UI warning :)

What you mention regarding MTUs is a good point, need to double check.
Thanks.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Also, nobody asked about the network topology either

Because you DESCRIBED the topology as a 1Gbit LAN connection.

my 1Gbit LAN connection

Speaking as someone who does infrastructure professionally, two networks separated by a router would not be considered a "1Gbit LAN connection." Using LACP would also not be considered a "1Gbit LAN connection." You actually have two LANs connected with a router, with one of them being a multiple gig LACP setup of unknown design. It stopped being a local area network at the introduction of the second network.

We trust users to be eyes and ears, because we're not there with you, and have to trust in what you say. Your words in your first post derailed the discussion, and I understand why @Samuel Tai jumped where he did, because you ALSO listed another massive red flag. However, we did circle around right back to suspecting the network itself. I am always suspicious of the hardware and network because this is so often the issue.

I see Vyatta has taken to calling a bunch of stuff "offloading" that aren't conventionally understood to be such. Mmm.
 

Ralms

Dabbler
Joined
Jan 28, 2019
Messages
29
I never described how my network connection was configured; I only mentioned that I could max out a 1Gbit LAN connection, which is not incorrect.
Either way, I don't feel continuing this discussion will be helpful in any way.

Regarding the network issue: VyOS is a project that started from Vyatta, but a lot has changed since they forked a long time ago.
1.2 Crux still had a lot of things from Vyatta, but 1.3 Equuleus is already moving away from it, with many things rewritten, and 1.4 even more so.

For anyone interested, I've raised a thread on VyOS Forum also where I describe the network configuration in more detail: https://forum.vyos.io/t/help-troubleshooting-routing-performance-issue-1-way-only/7978
 