Network Goes Down when using LACP (2x 1Gb)

mroptman

Dabbler
Joined
Dec 2, 2019
Messages
23
I cannot figure out why the networking stack becomes unreliable or fails once the Scale host begins taking traffic (SMB/SSH). Whenever the console says the web interface is not available, a restart restores web console access. This is most likely a configuration problem with how I set up LACP on Scale or on my other networking hardware. Apart from this, the system is almost stable enough to begin moving data over from Core.

As a control test, the same hardware booted into Proxmox VE 7.1 does NOT exhibit this issue: LACP works as intended and does not drop off. I am able to get two ~935 Mbps transfers in/out on Proxmox, and the goal is to achieve the same with Scale.

The Scale host "reliably" loses network connectivity about 30 minutes into several different SMB transfers. The console repeats the message "The web interface could not be accessed. Please check network configuration" until the host is restarted. I am looking for any pointers or help correcting this most likely self-inflicted LACP network configuration issue.

Physical Network
  • UniFi 1Gb switch with two adjacent ports set to aggregate mode
  • Two Cat 6 cables from switch to Scale NICs
  • One Cat 5 cable to IPMI NIC (not relevant, but mentioned for awareness)
Scale Network Setup
  • bond0
    • Link Aggregation Protocol = LACP
    • Transmit Hash Policy = LAYER2+3
    • LACPDU Rate = SLOW
    • Link Aggregation Interfaces = eno1,eno2
    • Hardware Offloading enabled
    • MTU = 1500
    • No IP Addresses
  • br0
    • DHCP enabled, static IP reserved on DHCP server for the MAC of this bridge
    • Bridge Members = bond0
    • Hardware Offloading enabled
    • MTU = 1500
    • No IP Addresses
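
One way to confirm that both links in a setup like the one above actually negotiated LACP with the switch is to inspect the bonding driver's status file. The following is a minimal sketch, assuming Scale uses the standard Linux bonding driver and exposes its status at /proc/net/bonding/bond0; the function names are hypothetical and the checks are only a rough heuristic, not an official diagnostic.

```python
def parse_bond_status(text):
    """Split the text of /proc/net/bonding/<bond> into bond-level and per-slave fields."""
    bond = {}
    slaves = []
    current = bond  # fields before the first "Slave Interface" line describe the bond itself
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # headings and blank lines carry no "key: value" pair
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = {"name": value}
            slaves.append(current)
        else:
            # keep the first occurrence; later LACP PDU detail blocks repeat some keys
            current.setdefault(key, value)
    return bond, slaves


def find_lacp_problems(bond, slaves):
    """Return a list of human-readable warnings (empty if nothing looks wrong)."""
    problems = []
    if "802.3ad" not in bond.get("Bonding Mode", ""):
        problems.append("bond is not in 802.3ad (LACP) mode")
    for slave in slaves:
        if slave.get("MII Status") != "up":
            problems.append(f"{slave['name']}: link is {slave.get('MII Status', 'unknown')}")
    aggregators = {slave.get("Aggregator ID") for slave in slaves}
    if len(aggregators) > 1:
        # assumption: slaves left in separate aggregators have not been bundled by the switch
        problems.append("slaves are in different aggregators; LACP may not have negotiated")
    return problems
```

On the Scale host this could be called as `find_lacp_problems(*parse_bond_status(open("/proc/net/bonding/bond0").read()))`, once while the transfer is healthy and again after the connection drops, to see whether a link went down or fell out of the aggregator.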
 

Attachments

  • scale-console-config-lacp-bridge.png
  • scale-console-web-int-down.png

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I think you'd be best off assigning the bridge an IP address manually (DHCP might work, but it might not) and then using the GUI from there.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
I have found Link Aggregation (even on a failover basis) to be among the most aggravating and failure-prone aspects of Free- and TrueNAS. I'm not trying to be punny either.

Both from the command line as well as the GUI, the wasted time trying to troubleshoot why the Failover LA failed this time made me finally see the light, turn off failover LACP and move on. I have no idea if it's the TrueNAS or the Switch, and it really doesn't matter. It simply never worked for me reliably.

Similarly, setting the NAS LAN interfaces to bridge mode is another "feature" that is only directly supported by the GUI (with all the attendant pitfalls). Yes, you can do your magic via option 9 in the CLI but the console LAN menu should be able to set bridge mode too. Last time I checked, it still was not enabled to do so, which I find very disappointing.
 

mroptman

Dabbler
Joined
Dec 2, 2019
Messages
23
Thanks for the above replies. I tested further, and after about 30 minutes of sustained network transfer the LACP connection dies completely.

Removing DHCP and switching to a static address did not help either. I may reload the hardware with Core to check whether LACP is any more (or less) stable.
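
To pin down what happens at the 30-minute mark, it may help to log link state changes with timestamps from the local console while a transfer runs. This is a rough sketch, assuming a Linux host that exposes the standard sysfs attributes /sys/class/net/<iface>/operstate and /sys/class/net/<iface>/carrier; the function names and the 5-second interval are hypothetical choices.

```python
import os
import time


def read_link_state(iface, sysfs_root="/sys/class/net"):
    """Return the operstate and carrier flag of one interface as read from sysfs."""
    base = os.path.join(sysfs_root, iface)
    state = {}
    for attr in ("operstate", "carrier"):
        try:
            with open(os.path.join(base, attr)) as fh:
                state[attr] = fh.read().strip()
        except OSError:
            # missing interface, or an attribute that cannot be read in its current state
            state[attr] = "unknown"
    return state


def describe_changes(previous, current):
    """Return one message per interface whose state differs between two snapshots."""
    changes = []
    for iface, state in current.items():
        if previous.get(iface) != state:
            changes.append(f"{iface}: {previous.get(iface)} -> {state}")
    return changes


def watch(ifaces, interval=5.0, sysfs_root="/sys/class/net"):
    """Print a timestamped line whenever a watched interface changes state (runs until interrupted)."""
    previous = {}
    while True:
        current = {iface: read_link_state(iface, sysfs_root) for iface in ifaces}
        for message in describe_changes(previous, current):
            print(time.strftime("%Y-%m-%d %H:%M:%S"), message, flush=True)
        previous = current
        time.sleep(interval)
```

Running `watch(["eno1", "eno2", "bond0", "br0"])` during an SMB transfer would show whether a physical port, the bond, or the bridge is the first thing to change state when connectivity is lost.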
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
LACP is an OS function, so whatever is going wrong is likely inherited behaviour from Debian.

The 30 minute thing rings a bell of some sort in my head but isn't summoning the data associated with it.

Constantin said:
Both from the command line as well as the GUI, the wasted time trying to troubleshoot why the Failover LA failed this time made me finally see the light, turn off failover LACP and move on.

Interesting. Rock solid for me. You did remember to set net.link.lagg.failover_rx_all ... right?
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Well, that may be the root of my issues, even though the manual suggests this is not necessary, as my home setup is as simple as they come.

By default, received traffic is only accepted when received through the active port. This constraint can be relaxed, which is useful for certain bridged network setups, by creating a tunable with a Variable of net.link.lagg.failover_rx_all, a Value of a non-zero integer, and a Type of Sysctl in System ➞ Tunables ➞ Add Tunable.

Thank you for the pointer. Should it be set to 1?
 