Chelsio T520-SO-CR 10Gb iperf test speeds slow down drastically after a short time

kforbus

Cadet
Joined
Jun 17, 2021
Messages
2
I've run into a weird issue over the past couple of days. I just installed a 10Gb Chelsio card (T520-SO-CR) in my FREENAS-MINI-3.0-XL+ (TrueNAS-12.0-U4). I put the same model Chelsio in a Proxmox machine as well and connected both to a MikroTik 10Gb switch via DAC cables. Since they are the only two devices on the switch, I'm using a /30 and have assigned 10.0.0.1 to TrueNAS and 10.0.0.2 to Proxmox.

When I run iperf from TrueNAS, at first I get results like this:
Code:
iperf -c 10.0.0.2 -i1                     
------------------------------------------------------------
Client connecting to 10.0.0.2, TCP port 5001
TCP window size: 1.72 MByte (default)
------------------------------------------------------------
[  3] local 10.0.0.1 port 41027 connected with 10.0.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3]  1.0- 2.0 sec  1.09 GBytes  9.40 Gbits/sec
[  3]  2.0- 3.0 sec  1.08 GBytes  9.29 Gbits/sec
[  3]  3.0- 4.0 sec  1.09 GBytes  9.36 Gbits/sec
[  3]  4.0- 5.0 sec  1.09 GBytes  9.32 Gbits/sec
[  3]  5.0- 6.0 sec  1.08 GBytes  9.28 Gbits/sec
[  3]  6.0- 7.0 sec  1.09 GBytes  9.35 Gbits/sec
[  3]  7.0- 8.0 sec  1.09 GBytes  9.34 Gbits/sec
[  3]  8.0- 9.0 sec  1.07 GBytes  9.23 Gbits/sec
[  3]  9.0-10.0 sec  1.09 GBytes  9.33 Gbits/sec
[  3]  0.0-10.0 sec  10.9 GBytes  9.33 Gbits/sec


After a short while, less than an hour usually, every time I run iperf from TrueNAS, I get these results:
Code:
iperf -c 10.0.0.2 -i1
------------------------------------------------------------
Client connecting to 10.0.0.2, TCP port 5001
TCP window size: 96.8 KByte (default)
------------------------------------------------------------
[  3] local 10.0.0.1 port 19966 connected with 10.0.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   108 MBytes   907 Mbits/sec
[  3]  1.0- 2.0 sec   102 MBytes   858 Mbits/sec
[  3]  2.0- 3.0 sec   104 MBytes   872 Mbits/sec
[  3]  3.0- 4.0 sec   103 MBytes   861 Mbits/sec
[  3]  4.0- 5.0 sec   105 MBytes   884 Mbits/sec
[  3]  5.0- 6.0 sec   106 MBytes   885 Mbits/sec
[  3]  6.0- 7.0 sec   106 MBytes   886 Mbits/sec
[  3]  7.0- 8.0 sec   104 MBytes   875 Mbits/sec
[  3]  8.0- 9.0 sec   109 MBytes   915 Mbits/sec
[  3]  9.0-10.0 sec   105 MBytes   879 Mbits/sec
[  3]  0.0-10.0 sec  1.03 GBytes   882 Mbits/sec


Every run after that stays at this 1Gb rate until I do something like reboot the TrueNAS box. After the reboot I get 10Gb speeds again, but a little while later iperf drops back down to 1Gb speeds. I don't see anything related to the NIC in dmesg or /var/log/messages when the slowdown starts, either. And the switch itself always shows an autonegotiated speed of 10Gb for this connection.

If I run iperf from the Proxmox machine, it always shows 10Gb speeds, so the weird slowdown just seems to be in one direction.

I checked and the cxgbe firmware version is 1.25.0.0, which I believe is the expected version for this version of TrueNAS.

I was hoping maybe someone has seen this behavior before or would know if there is anything in particular that needs to be configured for the T520 series of cards when used with TrueNAS? I'm kind of at a loss here and would appreciate any help anyone might be able to offer.
 
Joined
Dec 29, 2014
Messages
1,135
This may not be what you want to hear, but I had some really mysterious issues with a T580, and it turned out one of the cards I had was bad. It is really hard to troubleshoot when you have a limited number of endpoints. The key is to try and find a way to isolate the failing component. Are you seeing any problems from a netstat -in? Look for things in the error or drop columns. Here is what that command shows on my FreeNAS.
Code:
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
igb0   1500 <Link#1>      5c:83:8f:80:6d:36 18012277     0     0 37134010     0     0
igb1   1500 <Link#2>      5c:83:8f:80:6d:36 14011743     0     0   708953     0     0
cxl0   9000 <Link#3>      00:07:43:2d:da:d0 661085701     0    12 470325537     0     0
cxl0      - 192.168.252.0 192.168.252.27    110233292     -     - 456267787     -     -
cxl1   9000 <Link#4>      00:07:43:2d:da:d8      135     0     0      218     0     0
cxl1      - 192.168.250.0 192.168.250.27         332     -     -      228     -     -
lo0   16384 <Link#5>      lo0               20644676     0     0 20644676     0     0
lo0       - ::1/128       ::1                     32     -     -       32     -     -
lo0       - fe80::%lo0/64 fe80::1%lo0              0     -     -        0     -     -
lo0       - 127.0.0.0/8   127.0.0.1         20643188     -     - 20644644     -     -
lagg1  1500 <Link#6>      5c:83:8f:80:6d:36 32030159     0     0 37842963    11     0
lagg1     - 10.180.3.0/24 10.180.3.27       11450030     -     - 11694467     -     -
vlan1  1500 <Link#7>      5c:83:8f:80:6d:36 16238141     0     0 25703826     1     0
vlan1     - 192.168.253.0 192.168.253.27    14213119     -     - 25703786     -     -
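
To pick the non-zero error and drop counters out of output like the above, a small filter along these lines might help. This is only a sketch: `nic_errors` is a made-up helper name, and it assumes the column layout shown here (Ierrs, Idrop and Oerrs in fields 6, 7 and 9), which may differ on other FreeBSD versions.

```shell
# Read `netstat -in` output on stdin and print link-level rows
# (those with <Link#N> in the Network column) that have a non-zero
# Ierrs, Idrop or Oerrs counter.
# Assumes the 10-column layout shown in the post above.
nic_errors() {
    awk '$3 ~ /^<Link#/ && ($6 + $7 + $9 > 0) {
        printf "%s Ierrs=%s Idrop=%s Oerrs=%s\n", $1, $6, $7, $9
    }'
}
```

Usage would be `netstat -in | nic_errors`. Counters that keep climbing between runs are more interesting than a handful of old drops.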
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Also, check your temps. These cards run hot, and it may be thermal throttling you're seeing on the NIC controller.
 

kforbus

Cadet
Joined
Jun 17, 2021
Messages
2
Okay, I think I've got it figured out. Turns out the issue isn't on the TrueNAS side at all. I had completely disregarded the fact that I have a 1Gb port on the 10Gb switch going to another management switch. There are also interfaces on TrueNAS and Proxmox connected to this switch, for, you know, management. Normally this wouldn't be too problematic, except it seems the Proxmox machine sometimes responds to an ARP request for 10.0.0.2 with the MAC of its interface on the management switch, rather than with the MAC of the interface that actually has the 10.0.0.2 address. So then TrueNAS ends up with the wrong MAC in its ARP table and my traffic ultimately ends up traveling over the 1Gb network.
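
For anyone wanting to check for the same thing, here is a rough sketch of comparing the ARP entry on the TrueNAS side against the MAC you expect. `check_arp` is a made-up helper name and the MAC in the usage line is a placeholder, not the real address from this setup. It assumes `arp -an` prints lines of the form `? (IP) at MAC on IFACE ...`.

```shell
# check_arp IP EXPECTED_MAC
# Reads `arp -an` style output on stdin and reports whether the entry
# for IP carries the expected MAC (case-insensitive comparison).
# Prints "ok MAC", "mismatch MAC", or "missing".
check_arp() {
    ip="$1"
    expected="$2"
    awk -v ip="($ip)" -v mac="$expected" '
        $2 == ip {
            found = 1
            if (tolower($4) == tolower(mac)) print "ok " $4
            else print "mismatch " $4
        }
        END { if (!found) print "missing" }
    '
}
```

Example: `arp -an | check_arp 10.0.0.2 00:07:43:aa:bb:cc`, where the second argument is the MAC of the 10Gb port on the Proxmox machine. A "mismatch" while iperf is slow would point at the same ARP problem.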

My assumption had been that only the interface holding the requested IP address would respond to an ARP request with its MAC address. After some digging into why Proxmox is replying with the wrong MAC address, I learned something new. Apparently Linux will, by default, answer an ARP request on whichever interface receives it, as long as the address being asked for exists somewhere on the same system. The Proxmox management interface seems to end up being the winner of the race: it says "sure, 10.0.0.2 is here on this system" and then sends its own MAC as the response. So my solution ends up being either port isolation on the MikroTik to disallow traffic between the 10Gb ports and its management port, or enabling arp_filter on the Proxmox machine.
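
For the arp_filter option, a minimal sketch of what the Proxmox side might look like, assuming a stock Linux kernel. The helper name `arp_sysctl_settings` and the file name in the usage note are made up for illustration; `net.ipv4.conf.all.arp_filter` is the standard kernel setting.

```shell
# Print a sysctl fragment that enables ARP filtering: an interface then
# answers an ARP request only if the kernel would route traffic back to
# the requester out of that same interface.
# Hypothetical helper name; review before applying.
arp_sysctl_settings() {
    cat <<'EOF'
# Enable ARP filtering on all interfaces.
net.ipv4.conf.all.arp_filter = 1
EOF
}
```

One possible way to apply it, as root on the Proxmox host: `arp_sysctl_settings > /etc/sysctl.d/90-arp-filter.conf` followed by `sysctl --system`. Deleting the stale entry on TrueNAS with `arp -d 10.0.0.2` should then let it relearn the MAC.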

That's my quick, probably terrible, explanation but at least everything is working as expected now. Plus it's Friday so now I can enjoy the weekend without thinking about this, so yay!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If you are interested in more stuff related to how UNIX networking actually works and how this impacts stuff, there are useful hints at

https://www.truenas.com/community/threads/multiple-network-interfaces-on-a-single-subnet.20204/

Basically, you have the same conceptual error at play that caused me to write that, but you were trying to do something a bit different than the normal reasons people do this, and it bit you in an interesting way. This behaviour is not limited to Linux, but also FreeBSD, Solaris, and pretty much any modern UNIX OS with an abstracted networking stack, which is "almost everything these days."

Plus, you figured it out, which is a nontrivial bit of figuring-it-out for someone who probably isn't a network engineer. So I'm quite happy to make a suggestion here: if you actually have a second switch for management stuff, you might want to see if you can also come up with an additional router port and build a properly routed network, so that you have a 1Gbps management network and another network (10Gbps, or mixed 1/10G) for normal traffic. This may not be worth doing, but it is the next logical step down the path of having a more complex network design, and these days, the Ubiquiti and MikroTik routers are sub-$100.

Either way, good job on figuring it out, it wasn't likely to be easily discovered through back-and-forth on the forums. :smile:
 