Slow SMB write speeds over 40GbE to high-spec server - where is the bottleneck?

simonj

Dabbler
Joined
Feb 28, 2022
Messages
32
Hi Everyone

I'm fairly new to TrueNAS but reasonably experienced with IT and storage, as I have been managing the IT of our small film post house.

We are currently installing and testing two new TrueNAS servers to replace our QNAP, which had a hardware failure that cost us a lot of time to fix and killed our trust in QNAP.

Specs:
Supermicro SC848 case with 6G SAS expander backplane
Motherboard: Supermicro X9QRI-F
4 x Intel Xeon E5-4650 (8 cores @ 2.70 GHz, 8 GT/s, 20 MB cache)
RAM: 512 GB ECC DDR3 PC3-12800 (16x 32 GB)
HP H220 SAS2 PCIe 3.0 HBA, flashed to IT mode (JBOD)
Intel X540 10G copper Ethernet
Mellanox ConnectX-3 40G Ethernet
TrueNAS-12.0-U8

Storage:
16 x Seagate Exos 16TB SATA disks in a pool with 2 vdevs of 8 disks each in RAIDZ2
(we will add another 8-disk RAIDZ2 vdev of the same disks as soon as our hardware-RAID archive is offloaded)

Client:
2019 Mac Pro
macOS Catalina
Chelsio T580 40G Ethernet

40GbE directly connected to the TrueNAS server with a QSFP cable
10GbE connected over a Netgear 10G switch
MTU 9000

The problem:
Performance is very erratic, mostly with writes.

I've experimented a lot with tunables, mostly using the ones from https://calomel.org/freebsd_network_tuning.html:

Code:
# loader tunables
hint.isp.0.role = 2
hint.isp.1.role = 2
hint.isp.2.role = 2
hint.isp.3.role = 2
vfs.zfs.dirty_data_max_max = 1.37439E+11   # displayed in scientific notation, ~128 GiB
vfs.zfs.dirty_data_max_percent = 25
net.inet.tcp.hostcache.enable = 0
net.inet.tcp.hostcache.cachelimit = 0
machdep.hyperthreading_allowed = 0
net.inet.tcp.soreceive_stream = 1
net.isr.maxthreads = -1
net.isr.bindthreads = 1
net.pf.source_nodes_hashsize = 1048576

# sysctl tunables
kern.ipc.maxsockbuf = 16777216
net.inet.tcp.recvbuf_max = 4194304
net.inet.tcp.recvspace = 65536
net.inet.tcp.sendbuf_inc = 65536
net.inet.tcp.sendbuf_max = 4194304
net.inet.tcp.sendspace = 65536
net.inet.tcp.mssdflt = 8934
net.inet.tcp.minmss = 536
net.inet.tcp.abc_l_var = 7
net.inet.tcp.initcwnd_segments = 7
net.inet.tcp.cc.abe = -1
net.inet.tcp.rfc6675_pipe = 1
net.inet.tcp.syncache.rexmtlimit = 0
vfs.zfs.txg.timeout = 120
vfs.zfs.delay_min_dirty_percent = 98
vfs.zfs.dirty_data_sync_percent = 95
vfs.zfs.min_auto_ashift = 12
vfs.zfs.trim.txg_batch = 128
vfs.zfs.vdev.def_queue_depth = 128
vfs.zfs.vdev.write_gap_limit = 0

# sysctl tunable, currently disabled
net.iflib.min_tx_latency = 1
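
To verify that a given tunable actually took effect (loader tunables only apply after a reboot), I read the values back from a shell; a small sketch using variable names from the list above:

Code:
root@truenas[~]# sysctl vfs.zfs.txg.timeout
root@truenas[~]# sysctl net.inet.tcp.sendbuf_max
root@truenas[~]# sysctl kern.ipc.maxsockbuf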

The network connection and SMB seem capable of high R/W speeds, as this test with a very high vfs.zfs.txg.timeout shows.
I guess this is all written to and read from ARC:
[screenshot]

However, more realistic tests show write speeds of around 200 MB/s.
Again, this varies a lot: sometimes ramping up to 500 MB/s or more, sometimes stalling below 100 MB/s.
[screenshot]


A real-world transfer that is running at the moment:

[screenshot]


Disk activity during this transfer is very low, with the disks mostly idle and short spikes every few minutes:

[screenshot]


A dd test on the TrueNAS box showed that the pool (compression off) should be capable of ~950 MB/s write and ~1500 MB/s read.

Do any of you have an idea where to look?
I don't expect 1000 MB/s write speeds, but something around 500 MB/s should be possible with this hardware.

My suspicion is that something is wrong with my ARC / ZFS tunables, but completely removing all ZFS tunables doesn't help.
I haven't systematically toggled the tunables on and off, but I did start with a clean system and added them gradually.
The Calomel network tunables led to an improvement; otherwise, setting MTU to 9000 and strict sync = no on the SMB shares brought the biggest gains.
I have of course read many similar posts, but nothing has led to a breakthrough yet.

Happy for any hints.
I'm also interested in hiring a TrueNAS / ZFS specialist who could help us set everything up.
 

simonj

Dabbler
Joined
Feb 28, 2022
Messages
32
Hi All

I know this was perhaps a complicated question.

At the moment I am getting a whopping 7.9 MB/s on a real-world transfer test of a .dpx sequence.

Please tell me that this is not what should be expected on a TrueNAS setup with the kind of hardware we have.

[screenshot]
 


ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,909
If you need speed, you should go for mirrors, not RAIDZ2.

Also, you need to provide more details on the use case and the circumstances under which you observe this behavior. Please try to quantify whatever is possible. It does not help to read about e.g. "large files", which is what many people talk about. Also, whenever you write "seems" or "guess", it means that something is actually not clear at all. I do not mean to pick on you, but that is a critical distinction.

A good approach is to describe exactly (and in a quantitative way) what you did, what the expected result was, and how the actual result deviated from the expected one.
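
For illustration, a mirror layout stripes across many two-disk vdevs, which gives you far more independent write queues than two RAIDZ2 vdevs; a minimal sketch with placeholder device names (in TrueNAS you would normally build this through the pool creation UI, not the CLI):

Code:
# device names are placeholders; extend the pattern across all 16 disks
zpool create tank \
  mirror da0 da1 \
  mirror da2 da3 \
  mirror da4 da5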
 

simonj

Dabbler
Joined
Feb 28, 2022
Messages
32
Thank you.

This server is supposed to be the backup / archive part of a two-server system.

We are a small video post house, so we work with large media files (between a few gigabytes and 2-3 terabytes) and image sequences (DPX files of ~10 MB each).

I'm doing the tests with tools that simulate a video-editing workflow: Blackmagic Disk Speed Test and AJA System Test. I'm transferring files with a tool called Hedge.

The results vary a lot. I get OK speeds for some short moments, which I cannot explain, but mostly write speeds are below 200 MB/s, sometimes dropping to completely unacceptable speeds below 10 MB/s, as in my last post.

This is a test when speeds are inexplicably good:

[screenshot]


Same test a bit later with nothing changed:

[screenshot]


There is nothing else going on on the network.

I know that mirrors would be faster, but I don't think the ZFS pools are the bottleneck.

dd tests on a pool with compression off show the following.

Write:
Code:
root@truenas[~]# dd if=/dev/zero of=/mnt/Tank/Benchmark/dd-test count=50000 bs=2048k
50000+0 records in
50000+0 records out
104857600000 bytes transferred in 113.592102 secs (923106433 bytes/sec)
root@truenas[~]#


Read:
Code:
root@truenas[~]# dd of=/dev/zero if=/mnt/Tank/Benchmark/dd-test count=50000 bs=2048k
50000+0 records in
50000+0 records out
104857600000 bytes transferred in 55.068231 secs (1904139624 bytes/sec)
root@truenas[~]#
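
(One caveat on the read number: with 512 GB of RAM, a 100 GB file that was just written will be read back largely from ARC. A test file larger than RAM would give a more honest read figure; a sketch using the same path, with /dev/null as the conventional sink:)

Code:
root@truenas[~]# dd if=/dev/zero of=/mnt/Tank/Benchmark/dd-test count=300000 bs=2048k   # ~600 GB, larger than RAM
root@truenas[~]# dd if=/mnt/Tank/Benchmark/dd-test of=/dev/null bs=2048k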


In the meantime I think I have narrowed it down to a networking issue.
The 40G Chelsio adapter in the Mac has died (the Mac no longer recognizes the card - another problem).

So I'm testing over the 10G link, which shows the same behaviour: OK-ish read and terrible write speeds.

iperf from TrueNAS to Mac Pro. Not great, but OK:

Code:
root@truenas[~]# iperf -c 192.168.178.25 -P 1 -i 1 -t 30
------------------------------------------------------------
Client connecting to 192.168.178.25, TCP port 5001
TCP window size: 4.01 MByte (default)
------------------------------------------------------------
[  3] local 192.168.178.81 port 57727 connected with 192.168.178.25 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   611 MBytes  5.12 Gbits/sec
[  3]  1.0- 2.0 sec   695 MBytes  5.83 Gbits/sec
[  3]  2.0- 3.0 sec   763 MBytes  6.40 Gbits/sec
[  3]  3.0- 4.0 sec   905 MBytes  7.59 Gbits/sec
[  3]  4.0- 5.0 sec   648 MBytes  5.43 Gbits/sec
[  3]  5.0- 6.0 sec   815 MBytes  6.84 Gbits/sec
[  3]  6.0- 7.0 sec   761 MBytes  6.39 Gbits/sec
[  3]  7.0- 8.0 sec   828 MBytes  6.94 Gbits/sec
[  3]  8.0- 9.0 sec   712 MBytes  5.98 Gbits/sec
[  3]  9.0-10.0 sec   785 MBytes  6.59 Gbits/sec
[  3] 10.0-11.0 sec   665 MBytes  5.58 Gbits/sec
[  3] 11.0-12.0 sec   749 MBytes  6.28 Gbits/sec
[  3] 12.0-13.0 sec   732 MBytes  6.14 Gbits/sec
[  3] 13.0-14.0 sec   935 MBytes  7.84 Gbits/sec
[  3] 14.0-15.0 sec   790 MBytes  6.63 Gbits/sec
[  3] 15.0-16.0 sec   821 MBytes  6.88 Gbits/sec
[  3] 16.0-17.0 sec   926 MBytes  7.77 Gbits/sec
[  3] 17.0-18.0 sec   758 MBytes  6.36 Gbits/sec
[  3] 18.0-19.0 sec   746 MBytes  6.25 Gbits/sec
[  3] 19.0-20.0 sec   904 MBytes  7.58 Gbits/sec
[  3] 20.0-21.0 sec   670 MBytes  5.62 Gbits/sec
[  3] 21.0-22.0 sec   709 MBytes  5.95 Gbits/sec
[  3] 22.0-23.0 sec   762 MBytes  6.39 Gbits/sec
[  3] 23.0-24.0 sec   791 MBytes  6.64 Gbits/sec
[  3] 24.0-25.0 sec   733 MBytes  6.15 Gbits/sec
[  3] 25.0-26.0 sec   794 MBytes  6.66 Gbits/sec
[  3] 26.0-27.0 sec   863 MBytes  7.24 Gbits/sec
[  3] 27.0-28.0 sec   730 MBytes  6.12 Gbits/sec
[  3] 28.0-29.0 sec   721 MBytes  6.05 Gbits/sec
[  3] 29.0-30.0 sec   793 MBytes  6.65 Gbits/sec
[  3]  0.0-30.0 sec  22.6 GBytes  6.46 Gbits/sec
root@truenas[~]#


From Mac Pro to TrueNAS. Horrible:

Code:
postpro8horses@Postpros-Mac-Pro ~ % iperf -c 192.168.178.81 -P 1 -i 1 -t 30
------------------------------------------------------------
Client connecting to 192.168.178.81, TCP port 5001
TCP window size:  131 KByte (default)
------------------------------------------------------------
[  1] local 192.168.178.25 port 64588 connected with 192.168.178.81 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-1.00 sec   123 MBytes  1.03 Gbits/sec
[  1] 1.00-2.00 sec   111 MBytes   928 Mbits/sec
[  1] 2.00-3.00 sec  91.8 MBytes   770 Mbits/sec
[  1] 3.00-4.00 sec  99.1 MBytes   832 Mbits/sec
[  1] 4.00-5.00 sec   101 MBytes   847 Mbits/sec
[  1] 5.00-6.00 sec  95.6 MBytes   802 Mbits/sec
[  1] 6.00-7.00 sec   104 MBytes   875 Mbits/sec
[  1] 7.00-8.00 sec   107 MBytes   900 Mbits/sec
[  1] 8.00-9.00 sec   128 MBytes  1.07 Gbits/sec
[  1] 9.00-10.00 sec  89.1 MBytes   748 Mbits/sec
[  1] 10.00-11.00 sec   108 MBytes   909 Mbits/sec
[  1] 11.00-12.00 sec  95.6 MBytes   802 Mbits/sec
[  1] 12.00-13.00 sec   116 MBytes   972 Mbits/sec
[  1] 13.00-14.00 sec  97.0 MBytes   814 Mbits/sec
[  1] 14.00-15.00 sec   108 MBytes   903 Mbits/sec
[  1] 15.00-16.00 sec  97.1 MBytes   815 Mbits/sec
[  1] 16.00-17.00 sec   101 MBytes   846 Mbits/sec
[  1] 17.00-18.00 sec   114 MBytes   955 Mbits/sec
[  1] 18.00-19.00 sec  99.2 MBytes   833 Mbits/sec
[  1] 19.00-20.00 sec  89.2 MBytes   749 Mbits/sec
[  1] 20.00-21.00 sec   116 MBytes   972 Mbits/sec
[  1] 21.00-22.00 sec   108 MBytes   904 Mbits/sec
[  1] 22.00-23.00 sec   108 MBytes   908 Mbits/sec
[  1] 23.00-24.00 sec  95.5 MBytes   801 Mbits/sec
[  1] 24.00-25.00 sec  96.1 MBytes   806 Mbits/sec
[  1] 25.00-26.00 sec  96.0 MBytes   805 Mbits/sec
[  1] 26.00-27.00 sec   105 MBytes   878 Mbits/sec
[  1] 27.00-28.00 sec   126 MBytes  1.05 Gbits/sec
[  1] 28.00-29.00 sec   114 MBytes   954 Mbits/sec
[  1] 29.00-30.00 sec   100 MBytes   843 Mbits/sec
[  1] 0.00-30.05 sec  3.06 GBytes   876 Mbits/sec


Jumbo frames are enabled on both sides, but a ping with jumbo frames shows that they don't go through from the Mac to the TrueNAS. From TrueNAS to the Mac they work:

Code:
postpro8horses@Postpros-Mac-Pro ~ % ping -s 8772 192.168.178.81           
PING 192.168.178.81 (192.168.178.81): 8772 data bytes
ping: sendto: Message too long
ping: sendto: Message too long
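
(The "Message too long" error comes from the Mac's own sendto() before anything reaches the wire, which typically points at the local interface MTU rather than the path. Worth confirming what the interface actually negotiated; en0 is a placeholder for whichever interface faces the server:)

Code:
ifconfig en0 | grep mtu                # confirm the MTU the Mac is actually using
ping -D -s 8972 192.168.178.81         # -D sets don't-fragment; 8972 = 9000 minus 28 bytes IP/ICMP headers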


I will check the cables and probably replace the Intel X540 in the TrueNAS.

Or does anyone have other suggestions?
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,909
I know that mirrors would be faster, but I don't think the ZFS pools are the bottleneck.
This is exactly what I meant when I said that "guessing" means it is unclear. It is not meant to pick on you, but it is an observation I have made countless times over many years. I do performance analysis on transactional business software professionally, and the concepts are the same for storage.

I am not saying that the ZFS pool design is or is not responsible. All I am saying is that you have the hypothesis that it is not, and this hypothesis should be tested.

What also usually helps is to write down (without the technical details) what has been done so far. Example:

- Tested network connectivity using iperf; got confirmation from the forum that the way it was done delivers correct results; for the time being excluded as root cause
- Tested local disk speed using dd; checked with the forum that the parameters ensure caching does not impact the result
- Tested with Mac; bad result
- Tested with Windows desktop; ??? result

Lastly, when you write that you tested a video workflow, you are not saying that your plan is to edit video directly off the NAS, right? Because in that case you would need an all-SSD pool.
 

GreaseMonkey88

Dabbler
Joined
Dec 8, 2019
Messages
27
About your iperf performance on 10Gbit: check your switch to see whether you have packet collisions, rx/tx pauses, or a problem with flow control. If the TrueNAS server is connected via 40Gbit and your Mac via 10Gbit, there might be a problem...
Maybe try setting the TrueNAS server to 10Gbit as well and measure again.
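
If the switch counters are hard to get at, the interface error counters on both hosts can hint at the same thing; a rough sketch on the TrueNAS (FreeBSD) side:

Code:
netstat -i -d    # watch the Ierrs/Oerrs and Drop columns grow during a transfer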
 

simonj

Dabbler
Joined
Feb 28, 2022
Messages
32
Hi

Thanks for all the answers. After much testing, I think I have found the bottleneck: Samba.

Removed almost all tunables and auxiliary parameters.

For the SMB shares I kept only these:
Code:
strict sync = no
case sensitive = yes


tunables:
Code:
net.inet.tcp.recvspace= 4194304
net.inet.tcp.sendbuf_inc = 2097152
net.inet.tcp.sendbuf_max = 16777216
net.inet.tcp.sendspace = 4194304
net.isr.bindthreads = 1
net.isr.maxthreads = -1
vfs.zfs.l2arc.rebuild_enabled = 1
vfs.zfs.l2arc_noprefetch = 0
vfs.zfs.l2arc_write_boost = 400000000
vfs.zfs.l2arc_write_max = 400000000


When I test a single file with the AJA tool, I get OK speeds:

[screenshot]


As soon as I switch to file-per-frame, the speed drops, falling in direct proportion to the file size, and smbd maxes out at nearly 100% CPU.

[screenshot]



Exact same settings, tested against an AFP share. The speed is much better:

[screenshot]



Some real-world copy tests with folders full of small-ish (8 MB) files also confirm these results.

A workaround for the moment could be to use AFP for all write-heavy work and SMB for read-heavy work.

But in the long term it would be great to use SMB.

Does anyone have an idea how to fix SMB? I suspect this could be an issue between Macs and TrueNAS. All our workstations are Macs.

(Specs are still the same as in the first post: Mac Pro workstation connected to the TrueNAS over a direct 40G cable.)
 

JoHapp

Dabbler
Joined
Jul 29, 2021
Messages
44
Hi, I have quite a similar issue (also video-editing post-production).

We have a TrueNAS with 4 NICs: one connected via a switch, the others directly connected to the workstations, each with their own subnet.

The Win10 workstations easily perform at 800 MB/s. The Mac dies at 10 MB/s (all 10GbE). BUT the Mac also gets good speeds of around 800-1000 MB/s when connected via the switch rather than directly. I believe it's some subnet issue I haven't understood yet.

How is it in your environment - do you get better speeds when you don't attach directly?

(And yes, in a video-editing environment you have scenarios where you want to connect directly without a switch.)
 

JoHapp

Dabbler
Joined
Jul 29, 2021
Messages
44
Hi @simonj - did you check the Network > Ethernet > Hardware settings on your Mac?
I noticed that as soon as you don't have a switch in between, the automatic configuration in macOS drops to 100baseTX with MTU 1500.
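
You can check both from a terminal as well; a quick sketch (en0 is a placeholder for the interface in question):

Code:
ifconfig en0 | egrep 'media|mtu'    # negotiated speed/duplex and current MTU
networksetup -getMTU en0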
 

ianwood

Dabbler
Joined
Sep 27, 2021
Messages
14
Trying some of your settings, as I am still getting very poor write speeds over 10Gb Ethernet even though iperf shows great results. Mirrored 4TB NVMe drives. Both Mac and PC.

[screenshot]
 

JoHapp

Dabbler
Joined
Jul 29, 2021
Messages
44
Hi @ianwood - is that the same account / issue as @simonj?

Personally, my advice is to start with a setup that is as normal as possible: a clean test pool, the SMB service / connection, and a normal test client behind a switch with regular NICs, whatever is built in. Even on 1GbE the speed should be above what you are seeing. From that point, if everything performs as expected, you can start improving performance...

Greetings!
 

simonj

Dabbler
Joined
Feb 28, 2022
Messages
32
Hi @simonj - did you check the Network > Ethernet > Hardware settings on your Mac?
I noticed that as soon as you don't have a switch in between, the automatic configuration in macOS drops to 100baseTX with MTU 1500.
Hi. That's a good idea. The direct link is 40G Ethernet. We do have a switch with 40G ports, but it's not installed yet. I will test.

I still suspect a Mac <> TrueNAS SMB compatibility issue, as the speed drop on image sequences with many files happens both over the 40G direct link and over 10G with a switch in between. It could also just be a limitation of TrueNAS / ZFS when writing many small files, but with 24 disks set up as 3 RAIDZ2 vdevs I would expect a bit more. What also haunts me is that our old QNAP with just 16 disks in RAID6 never had this issue.

@ianwood: This looks really bad, especially for an NVMe pool. For us, setting strict sync = no on the SMB share helped a lot in the beginning.
 
Joined
Jul 3, 2015
Messages
926
Have you disabled SMB signing on your Mac client?
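
(On the Mac side this is the Apple-documented knob in /etc/nsmb.conf; create the file if it doesn't exist, then reconnect the share:)

Code:
# /etc/nsmb.conf
[default]
signing_required=no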
 

JoHapp

Dabbler
Joined
Jul 29, 2021
Messages
44
Hi. That's a good idea. The direct link is 40G Ethernet. We do have a switch with 40G ports, but it's not installed yet. I will test.

I still suspect a Mac <> TrueNAS SMB compatibility issue, as the speed drop on image sequences with many files happens both over the 40G direct link and over 10G with a switch in between. It could also just be a limitation of TrueNAS / ZFS when writing many small files, but with 24 disks set up as 3 RAIDZ2 vdevs I would expect a bit more. What also haunts me is that our old QNAP with just 16 disks in RAID6 never had this issue.

@ianwood: This looks really bad, especially for an NVMe pool. For us, setting strict sync = no on the SMB share helped a lot in the beginning.
You can also investigate further by using an NFS test pool and checking performance on that. If the results are fine, it would point in the direction of a configuration / Samba software issue.
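
A minimal sketch of such a test from the Mac side (server address and paths are examples; macOS wants the resvport option for most NFS servers):

Code:
sudo mkdir -p /private/nfstest
sudo mount -t nfs -o resvport,vers=3 192.168.178.81:/mnt/Tank/nfstest /private/nfstest
# then copy the same image sequence to /private/nfstest and compare speeds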
 

simonj

Dabbler
Joined
Feb 28, 2022
Messages
32
I couldn't investigate this further for some time, as the Chelsio 40G card in the Mac Pro lost driver support.
We resorted to 10GbE, and things were running fairly well apart from some problems when accessing directories with many files: Finder-browsing gets SMB stuck at 100% cpu - in otherwise almost perfect film post-production setup with Mac clients

Now I got an ATTO NQ41 40G card and tried again with a direct connection to the servers. The card has only one port, so I aimed to attach the Mac to "Workhorse", the server holding recent projects:

Read speeds on the Mac are fine at around 1500 MB/s.
But write speeds totally suck:
[screenshot]


Compared to the 10GbE connection:
[screenshot]


The same shows up in iperf tests.

TrueNAS to Mac:

Code:
[ 80] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 48540
[ 81] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 34708
[ 82] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 65518
[ 83] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 63339
[ 84] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 19772
[ 86] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 35182
[ 87] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 17663
[ 85] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 20679
[ ID] Interval       Transfer     Bandwidth
[ 85] 0.00-10.01 sec  4.56 GBytes  3.91 Gbits/sec
[ 84] 0.00-10.01 sec  7.35 GBytes  6.31 Gbits/sec
[ 87] 0.00-10.01 sec  5.16 GBytes  4.42 Gbits/sec
[ 81] 0.00-10.01 sec  6.86 GBytes  5.88 Gbits/sec
[ 83] 0.00-10.01 sec  5.16 GBytes  4.42 Gbits/sec
[ 80] 0.00-10.02 sec  4.16 GBytes  3.57 Gbits/sec
[ 82] 0.00-10.01 sec  5.19 GBytes  4.45 Gbits/sec
[ 86] 0.00-10.01 sec  7.61 GBytes  6.53 Gbits/sec
[SUM] 0.00-10.01 sec  46.0 GBytes  39.5 Gbits/sec


Mac to TrueNAS:

Code:
[ 72] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 11251 (reverse)
[ 73] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 23752 (reverse)
[ 74] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 27617 (reverse)
[ 75] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 10403 (reverse)
[ 76] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 11214 (reverse)
[ 77] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 38580 (reverse)
[ 78] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 56181 (reverse)
[ 79] local 192.168.2.1 port 5001 connected with 192.168.2.2 port 36659 (reverse)
[ ID] Interval       Transfer     Bandwidth
[ 72] 0.00-10.26 sec   208 MBytes   170 Mbits/sec
[ 73] 0.00-10.26 sec   198 MBytes   162 Mbits/sec
[ 79] 0.00-10.26 sec   192 MBytes   157 Mbits/sec
[ 78] 0.00-10.29 sec   198 MBytes   161 Mbits/sec
[ 74] 0.00-10.30 sec   202 MBytes   165 Mbits/sec
[ 76] 0.00-10.31 sec   198 MBytes   161 Mbits/sec
[ 75] 0.00-10.33 sec   290 MBytes   236 Mbits/sec
[ 77] 0.00-10.33 sec   196 MBytes   159 Mbits/sec
[SUM] 0.00-10.16 sec  1.64 GBytes  1.39 Gbits/sec


The first tests were with a Chelsio card in the "Workhorse" TrueNAS. I switched the cable over to our other server, "Archive", which has a Mellanox 40G card. And behold: speeds to "Archive" were as they should be, with writes around 1000 MB/s (I didn't take screenshots).

I thought the Chelsio card in "Workhorse" must clearly be at fault, swapped it for a Mellanox ConnectX-3 Pro (the same model as in "Archive"), and redid the tests. The results are the ones I posted above.

I experimented with some tunables. We recently removed most of them on the "Workhorse" server and left some on "Archive". I copied all the network tunables over, but it did not make any difference: with the same tunables on both servers, 40G write speeds are still desolate only to the "Workhorse" server.

I'm truly lost here. What could make write speeds this bad on only one of two servers with close-to-identical hardware?
  • Storage cannot be the culprit, as I get 900 MB/s write speeds over 10G.
  • It can't be an SMB issue, as the same asymmetric speed shows up in iperf, and 10G SMB speeds are fine.
  • The Ethernet cards are the same model with the same settings (all defaults, only MTU 9000).
  • It cannot be the cable, as I used the same one on both servers.
  • The hardware (processors, mainboard, RAM) is identical on both servers.
The only exotic thing I can still imagine is that the PCIe slot where the 40G card sits is bad, or some configuration I am overlooking; one way to check the slot from software is sketched below.
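
(On TrueNAS CORE / FreeBSD, pciconf lists each device's PCIe capability; under the NIC's entry, a line like "link x8(x8) speed 8.0(8.0)" shows negotiated vs. maximum lane count and speed:)

Code:
root@truenas[~]# pciconf -lvc    # look under the Mellanox entry for the PCI-Express link line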

Any ideas?
 
Joined
Dec 29, 2014
Messages
1,135
I had a similar issue when I first moved to 40G in my network. All of the Chelsio T580 cards I have were used eBay purchases, and one of them was not quite right. It would have been easy to diagnose if it had been stone dead, but it wasn't: it "kinda" worked, but with lousy throughput. I eventually replaced it and the problem went away. I was getting mid-to-upper 30Gb throughput with iperf with the new cards and life was good. It stinks, because those cards are expensive even used.
 

simonj

Dabbler
Joined
Feb 28, 2022
Messages
32
I had a similar issue when I first moved to 40G in my network. All of the Chelsio T580 cards I have were used eBay purchases, and one of them was not quite right. It would have been easy to diagnose if it had been stone dead, but it wasn't: it "kinda" worked, but with lousy throughput. I eventually replaced it and the problem went away. I was getting mid-to-upper 30Gb throughput with iperf with the new cards and life was good. It stinks, because those cards are expensive even used.
I suspected the same when it was slow on one server and working on the other, and those cards are all from eBay. But after replacing the Chelsio card with a Mellanox and seeing the exact same issue (that same Mellanox used to work well in a QNAP server), I think it must be something else, and I'm out of ideas. I will still try putting the fully working Mellanox card from the "Archive" server into "Workhorse" and see if that changes anything.

Maybe it's really something weird like a bad PCIe lane or a faulty processor. Is that a thing?

I did some tests with disabling TSO and LRO, and it seemed to help a bit, but still only around 400 MB/s. The full-speed write bursts got a bit longer.
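
Concretely, the offload test looked roughly like this on the FreeBSD side (mlxen0 being the Mellanox interface name; the change is not persistent across reboots):

Code:
root@truenas[~]# ifconfig mlxen0 -tso -lro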
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,152
I suspected the same when it was slow on one server and working on the other, and those cards are all from eBay. But after replacing the Chelsio card with a Mellanox and seeing the exact same issue (that same Mellanox used to work well in a QNAP server), I think it must be something else, and I'm out of ideas. I will still try putting the fully working Mellanox card from the "Archive" server into "Workhorse" and see if that changes anything.

Maybe it's really something weird like a bad PCIe lane or a faulty processor. Is that a thing?

I did some tests with disabling TSO and LRO, and it seemed to help a bit, but still only around 400 MB/s. The full-speed write bursts got a bit longer.
Be aware that SMB defaults to sync writes when talking to macOS.
If it's not a network issue, this could be the cause if you don't have a SLOG.
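
(A quick way to see what a dataset is actually doing, with a placeholder dataset name:)

Code:
zfs get sync Tank/projects    # "standard" honors client sync requests; "always"/"disabled" override them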
 
Last edited:

simonj

Dabbler
Joined
Feb 28, 2022
Messages
32
Be aware that SMB defaults to sync writes when talking to macOS.
If it's not a network issue, this could be the cause if you don't have a SLOG.
Thanks, I know. Both my servers have
Code:
strict sync = no

in the SMB aux parameters. Since the bad speed in the Mac > TrueNAS direction also shows up in iperf and not only with SMB transfers, it looks like this must be a networking or hardware issue.

I'm leaning more and more towards this being a hardware thing. I will switch PCIe slots as a first test.
 