TrueNAS Core 13 + Ubiquiti 10 gig switch + LACP = hate?

McFuzz89

Dabbler
Joined
Mar 19, 2017
Messages
10
Hi all,

Before I begin - I have a feeling the answer to this is likely going to be "Ubiquiti fail" - but I wanted to bounce my issue past the experts. The players:
  • HPE MicroServer Gen8 running TrueNAS with a Mellanox-2 dual-port SFP+ card and 2x 10 gig SFP+ modules.
  • Ubiquiti USW-AGG 8-port 10 gig aggregation switch with the same 2x 10 gig SFP+ modules.
  • Mac Studio with a 10 gig connection, connected to the same switch.
  • Xeon-based XCP-NG server with 2x RJ-45 10 gig ports, connected to the same switch.
The issue:

Both the XCP-NG and TrueNAS boxes are connected to the switch with trunked ports configured for Aggregation (that's what Ubiquiti calls it), which by default enables LACP (i.e. 802.3ad). As far as link status goes, both are online; my Mac Studio can connect to both boxes, and both boxes can connect to each other.

The problem, however, is that speed with the TrueNAS box is severely inconsistent. When I have both interfaces up and running, my iperf3 results between TrueNAS and XCP, as well as between TrueNAS and the Mac Studio, are all over the place - for example:

root@fuzznas:~ # iperf3 -c 10.32.0.10
Connecting to host 10.32.0.10, port 5201
[  5] local 10.32.0.22 port 18682 connected to 10.32.0.10 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.01   sec  23.1 MBytes   191 Mbits/sec  205   1.41 KBytes
[  5]   1.01-2.00   sec  12.9 MBytes   109 Mbits/sec  128    106 KBytes
[  5]   2.00-3.01   sec  5.91 MBytes  49.2 Mbits/sec   67   4.28 KBytes
[  5]   3.01-4.01   sec  10.0 MBytes  84.4 Mbits/sec  100   11.3 KBytes
[  5]   4.01-5.00   sec  7.14 MBytes  60.1 Mbits/sec   69   7.08 KBytes
[  5]   5.00-6.01   sec  8.69 MBytes  72.3 Mbits/sec   94   7.11 KBytes
[  5]   6.01-7.01   sec  8.67 MBytes  73.1 Mbits/sec   83   4.27 KBytes
[  5]   7.01-8.00   sec  11.1 MBytes  94.1 Mbits/sec  109   14.2 KBytes
[  5]   8.00-9.00   sec  2.51 MBytes  21.0 Mbits/sec   33   25.7 KBytes
[  5]   9.00-10.01  sec  6.94 MBytes  57.6 Mbits/sec   51   9.96 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.01  sec  97.0 MBytes  81.3 Mbits/sec  939            sender
[  5]   0.00-10.01  sec  96.8 MBytes  81.1 Mbits/sec                 receiver

iperf Done.

root@fuzznas:~ # iperf3 -c 10.32.0.10
Connecting to host 10.32.0.10, port 5201
[  5] local 10.32.0.22 port 47229 connected to 10.32.0.10 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   480 MBytes  4.03 Gbits/sec   68    124 KBytes
[  5]   1.00-2.00   sec   335 MBytes  2.81 Gbits/sec   53    242 KBytes
[  5]   2.00-3.00   sec   354 MBytes  2.97 Gbits/sec   54   80.6 KBytes
[  5]   3.00-4.00   sec   354 MBytes  2.96 Gbits/sec   68   71.1 KBytes
[  5]   4.00-5.00   sec   369 MBytes  3.10 Gbits/sec   64   48.3 KBytes
[  5]   5.00-6.00   sec   244 MBytes  2.05 Gbits/sec   59    111 KBytes
[  5]   6.00-7.00   sec   335 MBytes  2.81 Gbits/sec   56   92.6 KBytes
[  5]   7.00-8.00   sec   398 MBytes  3.34 Gbits/sec   69    151 KBytes
[  5]   8.00-9.00   sec   359 MBytes  3.01 Gbits/sec   52   76.5 KBytes
[  5]   9.00-10.00  sec   377 MBytes  3.15 Gbits/sec   80   65.4 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  3.52 GBytes  3.02 Gbits/sec  623            sender
[  5]   0.00-10.01  sec  3.52 GBytes  3.02 Gbits/sec                 receiver

I can re-run this test many times, either with the Mac Studio or the XCP box, and the results will be different each time. Mostly, though, they are on the sub-gigabit side, often under 300 Mbit/s.

If I simply disconnect one of the fiber patch cables from the switch and drop TrueNAS down to a single NIC (with aggregation still enabled), I consistently get 9+ Gbit/s. It doesn't matter which patch cable I use or how I arrange the SFPs, the results are the same -- two NICs enabled = fail; one NIC "disconnected" and one "connected" = win. The results, btw, are the same whether the TrueNAS box acts as server or client.

Between the Mac Studio and XCP I consistently get 8.5-9 Gbit/s (running iperf3 in a VM, so there's some overhead) - again, regardless of which side is server or client.

I confirmed that LACP is not running in strict mode on TrueNAS; I tried enabling it, but then TrueNAS will no longer pick up a DHCP address and overall exhibits bizarre behavior.
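
For reference, the check I mean is roughly this from the TrueNAS shell, going by the lagg(4) man page (so take the exact sysctl name with a grain of salt):

# system-wide default for LACP strict mode on newly created lagg interfaces
sysctl net.link.lagg.lacp.default_strict_mode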

One extra bit of info to share: before I got the 10 gig switch and the Mellanox card, TrueNAS was connected, via LACP and its on-board gig ports, to my Ubiquiti USW-24-Pro without any issues and with consistent speed results.

If anyone has any ideas or suggestions, or is maybe experiencing a similar issue - I am all ears (and eyes, haha). Ultimately I don't mind going single NIC; really the only reason I have LACP is that my rack is a pain to work in... but I prefer to aggregate when I can :D

Thanks!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I confirmed that LACP is not running in strict mode on TrueNAS; I tried enabling it, but then TrueNAS will no longer pick up a DHCP address and overall exhibits bizarre behavior.

This suggests that LACP is not negotiated or implemented correctly. This does not need to be an error on your part; it could well be the Ubiquiti. You're also seeing a lot of Retr in there. I would suggest debugging that as a first step. I'm a little suspicious of the Mellanox cards; they have the ability to be the Realtek of the 10G world in some instances.
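
A quick first check from the FreeBSD side would be something along these lines (lagg0 and the member names will be whatever your box calls them):

# with working LACP, both laggports should show ACTIVE,COLLECTING,DISTRIBUTING;
# anything less means the host and switch never finished negotiating
ifconfig lagg0

# per-interface error counters; climbing Ierrs/Oerrs point at the link layer
netstat -i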

really the only reason I have LACP is because my rack is a pain to work in...

So if your main concern is redundancy, set this to passive failover mode and you will get both your good speeds and the redundancy benefit of link aggregation.
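
On FreeBSD that is just the failover lagg protocol. Done by hand it looks roughly like this - the interface names and address are placeholders, and on TrueNAS you would set the equivalent through the network UI rather than at the shell:

# failover lagg: traffic goes out the first (primary) port; the second port
# only takes over if the primary loses link
ifconfig lagg0 create
ifconfig lagg0 laggproto failover laggport mlxen0 laggport mlxen1
ifconfig lagg0 inet 10.32.0.22/24 up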

Bonus material:

 

McFuzz89

Dabbler
Joined
Mar 19, 2017
Messages
10
This suggests that LACP is not negotiated or implemented correctly. This does not need to be an error on your part; it could well be the Ubiquiti. You're also seeing a lot of Retr in there. I would suggest debugging that as a first step. I'm a little suspicious of the Mellanox cards; they have the ability to be the Realtek of the 10G world in some instances.



So if your main concern is redundancy, set this to passive failover mode and you will get both your good speeds and the redundancy benefit of link aggregation.

Bonus material:


Lmao - the realtek of the 10G world...

The problem with setting passive failover is that Ubiquiti does not support it... it's either LACP or bust. Also - pardon my ignorance but what do you mean by Retr?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Lmao - the realtek of the 10G world...

Well, what else would you call it? It's an inexpensive chipset that has some quirks, was sold by OEMs in quantity, had lots of people complaining about performance issues that might not be due to the chipset/design itself, and was an early player in the marketplace. Okay, Realtek didn't make switches or deal with IB, so there are some differences too.

The problem with setting passive failover is that Ubiquiti does not support it... it's either LACP or bust.

Please don't make my head explode. It's too early in the morning. UBNT does not need to "support" passive failover. It's passive. You listen on both interfaces for packets. You send packets to your preferred primary interface. You call it a day. Switches really cannot refuse to support it. The main problem is that many people forget to read the manpage and don't set net.link.lagg.failover_rx_all to tell the lagg driver to accept packets on both interfaces. It is primarily useful for failover in the event of a member channel failure on a two link LAGG, which sounds to me like a possible resolution in your situation.
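
Concretely, it is one sysctl, plus making it persist across reboots:

# accept inbound frames on every lagg member, not only the active one
sysctl net.link.lagg.failover_rx_all=1

# to make it stick, add the same name/value as a sysctl-type tunable in the
# TrueNAS web UI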

Also - pardon my ignorance but what do you mean by Retr?

It refers to the retransmission (Retr) column in your iperf3 output. It suggests some modest packet loss.
 

McFuzz89

Dabbler
Joined
Mar 19, 2017
Messages
10
I understand what you're saying about UBNT not needing to support passive, but at the same time their literature appeared to be quite explicit. Of course, I can't seem to find where it was spelled out (it was late last night), but I do take whatever they are saying with a bigass grain of salt.

Thanks for the tip regarding failover_rx_all - I have enabled it, but it does not seem to have done much... although iperf3 does appear to show better results (about 2 Gbit/s) consistently so far. However, packet loss (Retr) is still pretty high - about 900-1000 retransmits per test.
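
In case it is useful to anyone else, something like this run alongside iperf3 shows whether the loss is landing on a given member (mlxen0 is just a placeholder for whatever the Mellanox shows up as):

# one-second stats for a single interface; the errs/drops columns show
# whether this member is the one losing packets
netstat -w 1 -I mlxen0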

I honestly think I made a mistake buying that card... guess I may want to look at other brands :\
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
but at the same time their literature appeared to be quite explicit.

I'm being equally explicit and not to put too fine a point on it, but I've been building networks since the '80's, professionally. Longer than UBNT.

The reason that LACP needs "support" in the switch is that you need the ability to monitor the state of your component links. This is not true for passive failover, however. You're only relying on the existence of ethernet link status to determine usability of a path. This then results in a situation where your ethernet switch(es) might deliver a packet down either path, because in normal usage both links will be up. That's the purpose of the sysctl tweak. The reverse direction, from the host to the ethernet switch, is also not a problem, because the host can simply choose to send the packet down whichever link it would like. The packet will make it to the switching silicon either way.

What you *might* have seen is some sort of statement that the switches do not support passive failover, which seems like the exact thing we're talking about, right? Except it's not. We are discussing host-to-switch passive failover. You could also have switch-to-switch passive failover, which is where you have defined a bundle of links (so as to be compliant with STP) and then just use link status to determine membership eligibility. That's messy and crappy, which is why LACP exists in the first place. I'm fine with them saying they don't support that. Good for them.

I honestly think I made a mistake buying that card... guess I may want to look at other brands :\

Might or might not help. So I guess I have to get a bit deeper here.

The FreeBSD lagg driver is an artificial, software-defined ethernet interface that is made to look like a "faster" ethernet interface. When packets are handed off to it, it picks a link to shove them down. Due to the design of LACP (see linked article), the path that any given IP flow takes is supposed to be deterministic in order to guarantee in-order delivery of packets. However, when you have, let's say, a single flow, this means that all of that flow's traffic is sent out Link #C (on an LACP lagg with links #A-D).

The problem is that at 10Gbps, this can be quite a bit of flow. If the system does not have sufficient buffering to queue up this traffic to go out Link#C, the lower level ethernet driver probably just refuses to do it and reports an error back up the stack to the LAGG driver. I'm *guessing* this is where your Retr is ultimately coming from.

My observation over the years is that the high performance ethernet drivers, such as for Chelsio and Intel, tend to handle certain types of issues better. I can't say for sure that this is the case here, no one's paying me to go try to figure it out, and it isn't a problem I'm running into. But if you want an experienced guess, I'd say that it is an inability of the driver to cope. Probably transmit. Possibly receive. Either way, guessing that since LAGG was developed for 1GbE and ported from NetBSD, this is just an edge case issue where too much is being asked of SOMEthing, whether that's the LAGG or Mellanox driver or whatever.
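
One cheap way to probe this from the iperf3 side is to compare a single stream against several parallel streams, which at least have a chance of hashing onto different member links (no guarantee with only two members):

# single TCP flow: pinned to one member link by the LACP hash
iperf3 -c 10.32.0.10

# several parallel flows: may or may not spread across both members
iperf3 -c 10.32.0.10 -P 4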
 

McFuzz89

Dabbler
Joined
Mar 19, 2017
Messages
10
I'm being equally explicit and not to put too fine a point on it, but I've been building networks since the '80's, professionally. Longer than UBNT.

The reason that LACP needs "support" in the switch is that you need the ability to monitor the state of your component links. This is not true for passive failover, however. You're only relying on the existence of ethernet link status to determine usability of a path. This then results in a situation where your ethernet switch(es) might deliver a packet down either path, because in normal usage both links will be up. That's the purpose of the sysctl tweak. The reverse direction, from the host to the ethernet switch, is also not a problem, because the host can simply choose to send the packet down whichever link it would like. The packet will make it to the switching silicon either way.

What you *might* have seen is some sort of statement that the switches do not support passive failover, which seems like the exact thing we're talking about, right? Except it's not. We are discussing host-to-switch passive failover. You could also have switch-to-switch passive failover, which is where you have defined a bundle of links (so as to be compliant with STP) and then just use link status to determine membership eligibility. That's messy and crappy, which is why LACP exists in the first place. I'm fine with them saying they don't support that. Good for them.



Might or might not help. So I guess I have to get a bit deeper here.

The FreeBSD lagg driver is an artificial, software-defined ethernet interface that is made to look like a "faster" ethernet interface. When packets are handed off to it, it picks a link to shove them down. Due to the design of LACP (see linked article), the path that any given IP flow takes is supposed to be deterministic in order to guarantee in-order delivery of packets. However, when you have, let's say, a single flow, this means that all of that flow's traffic is sent out Link #C (on an LACP lagg with links #A-D).

The problem is that at 10Gbps, this can be quite a bit of flow. If the system does not have sufficient buffering to queue up this traffic to go out Link#C, the lower level ethernet driver probably just refuses to do it and reports an error back up the stack to the LAGG driver. I'm *guessing* this is where your Retr is ultimately coming from.

My observation over the years is that the high performance ethernet drivers, such as for Chelsio and Intel, tend to handle certain types of issues better. I can't say for sure that this is the case here, no one's paying me to go try to figure it out, and it isn't a problem I'm running into. But if you want an experienced guess, I'd say that it is an inability of the driver to cope. Probably transmit. Possibly receive. Either way, guessing that since LAGG was developed for 1GbE and ported from NetBSD, this is just an edge case issue where too much is being asked of SOMEthing, whether that's the LAGG or Mellanox driver or whatever.

Thanks for the breakdown of possible failure modes. I've ended up ordering an HPE-branded Intel X520 for $30, as I figured it's cheap enough to experiment with. It also dawned on me that maybe there are other bottlenecks at play here - certainly not the CPU, since even though it's the dinky Celeron in my Gen8 MicroServer, it barely goes over 10% - but the buffer part may play a role, as I've seen elsewhere that with low RAM (I am at 12 GB) you may hit some 10 gig bottlenecks.
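
In case buffering really is part of it, these are the knobs I keep seeing mentioned for 10 gig on FreeBSD; for now I am only reading the current values so I have something to compare against later:

# kernel ceiling on socket buffer size
sysctl kern.ipc.maxsockbuf

# TCP auto-tuning ceilings for send/receive buffers
sysctl net.inet.tcp.sendbuf_max net.inet.tcp.recvbuf_max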

Nevertheless - I can't really say one way or another, and I'd rather spend a few bucks on this system to try and get it up to today's snuff before spending a lot more on a fresh system.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
before spending a lot more on a fresh system.

Oh absolutely. And not to put too fine a point on it, a fresh system is unlikely to resolve what may well be a card-level or driver-level issue. I am very curious any time someone shows up who is willing to spend at least a little effort on parts swaps and debugging. We've run through a lot of speculative territory and it may be that this doesn't resolve, but there's significant room for exploration of the territory. I tend not to spend a lot of time tuning for single-stream performance because that's not my use case, but we do have a number of people here who have put a lot of time and effort into this, if only you can get their attention. ;-)
 

McFuzz89

Dabbler
Joined
Mar 19, 2017
Messages
10
Oh absolutely. And not to put too fine a point on it, a fresh system is unlikely to resolve what may well be a card-level or driver-level issue. I am very curious any time someone shows up who is willing to spend at least a little effort on parts swaps and debugging. We've run through a lot of speculative territory and it may be that this doesn't resolve, but there's significant room for exploration of the territory. I tend not to spend a lot of time tuning for single-stream performance because that's not my use case, but we do have a number of people here who have put a lot of time and effort into this, if only you can get their attention. ;-)

Lol - it's a slippery slope for sure. I've been running TrueNAS on a Gen8 HPE MicroServer for about 5 years now? Same hardware, same hard drives - no real problems (I've been using it strictly as a NAS) using LACP on bonded 1 gig interfaces - maxed out the pipeline feeding media into my Xeon-based SuperMicro microserver that's about the same age.

Decided - hey, why not 10 gig it since the agg switch is cheap - and now I see the hole I dug myself into... lmao. I am looking at upgrading the CPU and memory on the HPE to give it a bit more oomph and the ability to run services better... and I can already see myself looking into upgrading the drives (4x 4TB in RAIDZ1) for more storage cuz I am running low... hey, Xmas is around the corner, though!
 

McFuzz89

Dabbler
Joined
Mar 19, 2017
Messages
10
Oh absolutely. And not to put too fine a point on it, a fresh system is unlikely to resolve what may well be a card-level or driver-level issue. I am very curious any time someone shows up who is willing to spend at least a little effort on parts swaps and debugging. We've run through a lot of speculative territory and it may be that this doesn't resolve, but there's significant room for exploration of the territory. I tend not to spend a lot of time tuning for single-stream performance because that's not my use case, but we do have a number of people here who have put a lot of time and effort into this, if only you can get their attention. ;-)

Welps - wish I had good news but alas, I do not.

Got the HP-branded X520 (HP 560SFP+); slapped it in and it identified itself as an Intel 82599 10 Gbit interface. From the get-go, when aggregated, performance was abysmal - we're talking 10 Mbit/s. So I started in-depth troubleshooting - tried different patch cables, tried playing musical chairs with the SFPs - and it all had the same results:

* Crazy packet loss
* When in single-interface mode (i.e. not LACP), I was maxing out at 1 gig no matter what I did.

Put the Mellanox back in and now it's doing the same exact thing, being stuck at 1 gig. The only other difference is that I put in a Xeon 1265L v2 instead of the dual-core CPU mentioned above - so this has 4 cores and 8 threads... can't imagine it would degrade the system that much...?
 

McFuzz89

Dabbler
Joined
Mar 19, 2017
Messages
10
Holy moly, this was a rollercoaster... long story short, the X520 does not work well in TrueNAS with 10Gtek transceivers. I was able to get decent iperf3 speeds, but actual file transfers? Nope - stuck at under 50 kb/s, yes - kb/s!

What's worse is that LACP still does not work, but that ended up being due to... :drum roll please: BAD PATCH CABLES! I have 1.5 m cables running to my patch panel, terminating at an LC-LC keystone, and from there 0.5 m patch cables to the switch. The 1.5 m cables both have bad connectors or crimps... I noticed the interfaces would not come up unless I tugged on the connector... a lot. Got new cables coming in on Monday.

I did have a 1 m cable lying around that I used to test each interface individually (LACP disabled on TrueNAS and the switch) - with the X520, the results were actually very poor, and I suspect one of the interfaces is flaky. With the Mellanox, I am able to hit 9+ Gbit fairly reliably using iperf, but I still see a bunch of retries. However, the real test was file transfer - over 250 MB/s (2 gbit) with my ol' RAIDZ1 setup, which is more or less what I expected. The X520 was, oddly, stuck at 30-50 kb/s no matter what I did.

BTW, that part about being stuck at 1 Gbit while doing iperf testing - that, I think, is some sort of weird bug with TrueNAS. When I was doing that, I had one of the 1 Gbit copper interfaces online, and even though I was binding iperf to the 10 gig interface, it somehow got confused and used the 1 gig interface...?
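
For reference, the binding I mean is roughly this (assuming iperf3's -B flag, which from what I have read only pins the source address - the routing table still decides which interface the traffic actually leaves on, which would explain the confusion):

# bind iperf3 to the 10 gig interface's address (10.32.0.22 in my case)
iperf3 -c 10.32.0.10 -B 10.32.0.22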

Final note - running iperf on the 1 gig interfaces in LACP mode gives stellar results: no retries and a maxed-out 1 gig. I am really confused about why I get retries with the PCIe cards... especially since I threw a Xeon 1265L v2 at it (an upgrade from a dual-core Pentium)... so there really shouldn't be any bottlenecks, unless the PCIe bus gets choked somehow.

Anyway - that's my saga. We'll see how the Mellanox behaves with new cables on Monday!
 