Unable to maintain stable connections to TrueNAS Scale when using two NICs in two different subnets

Johan
Dear all,

I can’t find a way to maintain stable communication with TrueNAS Scale (TNS) when two bridge interfaces (br10 and br20) are active at the same time, each in a different subnet (HomeLAN on br10 and ServerLAN on br20). Both subnets are directly connected to the same OPNsense router, as you can see in the network scheme below.


[Attached network scheme: LAN.png]



For this setup, each subnet is a separate VLAN on the same managed switch, but I have not defined a trunk port on the switch and router because each VLAN is connected to a separate port on the router (hence I drew two switches in the network scheme). The switch is a layer 2 switch. Currently I use only IPv4, no IPv6.


Yesterday I did a fresh install of TrueNAS Scale (TNS) 22.02.04, hoping a config error had slipped into my previous TNS installation, since I have been trying to solve this issue since TNS 22.02.04. Unfortunately, the issue remains.


What I am trying to accomplish

  • the apps and the management WebGUI of TrueNAS Scale are accessible only from the ServerLAN subnet via bridge interface br20; in other words, this traffic has to go over the router,
  • SMB shares are accessible from both the ServerLAN and HomeLAN subnets (via bridge interfaces br20 and br10), in order to keep SMB traffic in the local broadcast domain and reduce traffic that has to go over the router.
Everything works just fine when only br20 is active (10.20.20.80/24, ServerLAN). TNS is accessible as expected and connections are stable for clients connecting from both the ServerLAN and the HomeLAN.

When the trouble begins​

The trouble begins when both br10 and br20 are made active and each gets an IPv4 address assigned.

When I log in to the WebGUI (br20, 10.20.20.80) of TrueNAS Scale from the HomeLAN (10.10.10.0/24 subnet) using http://ip_address_TNS_ServerLAN, the connection to the WebGUI is unstable: it repeatedly tries to re-establish the connection, showing me a message in the WebGUI:
"Waiting for Active TrueNAS controller to come up…"

I can repeat the issue by just going to the Dashboard page of TNS and doing nothing: I see the counters for the network interfaces of TNS being updated. After about 40 seconds those counters stall. 30 seconds later I see the message "Waiting for Active TrueNAS controller to come up…", and about 5 seconds later TNS has a live connection again (for about 40 seconds). This cycle repeats over and over again.

The same goes for apps running on TNS when accessed by a web browser from the HomeLAN; for instance, the Ubiquiti UniFi controller app suffers from the same reconnection issue. It does restore the connection after about 30 seconds or so. Ditto for SMB file copy sessions.

When I do the same from the ServerLAN subnet 10.20.20.0/24, the connection to the webGUI and the apps is stable.

It makes no difference if I bind the TNS WebGUI only to the ServerLAN br20 IPv4 address. If I do, the GUI tells me it is bound only to br20, but when I look at the console of TNS it mentions the GUI is still accessible via the IPv4 addresses of both br10 and br20.
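
(For what it’s worth, checking the listening sockets from the TNS shell seems to confirm this; a sketch of what I mean, using the usual HTTP/HTTPS ports:)

    # a 0.0.0.0 (wildcard) entry means the GUI answers on every configured address
    ss -tlnp | grep -E ':(80|443) '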

Both bridge interfaces don’t use DHCP but have static IP addresses assigned.

Aside from my Cisco switch, all network interface cards – including those in my OPNsense router – are from Intel and run in full-duplex Gbit mode.

What I tried so far​

First I thought it had something to do with TNS not being ready for such a configuration, but now that I have upgraded TNS to 22.12.2 the issue seems to remain.
Kubernetes on TNS (TNS WebGUI/Apps/Settings/Advanced Settings) uses
  • br20 as Node IP,
  • the br20 interface for routing, and
  • the OPNsense router as default gateway for the 10.20.20.0/24 subnet.

For now, the TNS WebGUI is configured as 0.0.0.0 (TNS WebGUI/System settings/General/GUI/Web Interface IPv4 Address). Even when I restrict the WebGUI to only br20, the issue remains. The same goes for the apps.

When I check the log on my switch, I don’t see any STP-related events. As an experiment, I enabled Per-VLAN Rapid STP instead of the default Rapid STP on my Cisco switch. The issue remains the same.

Disabling mDNS on the OPNsense router or on TNS also makes no difference.

My local DNS server runs on my OPNsense router. Telling the local computers – including TNS – to use another DNS server (8.8.8.8) did not make any difference (I rebooted TNS, the clients and the router).

Currently, no static routes are defined in TNS. As an experiment, I did create a static route telling traffic for 10.10.10.0/24 to use br10, but that did not help either. I believe it is also not needed, since br10 is directly connected to that local broadcast domain.
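
For illustration, this is roughly what the routing table on TNS looks like with both bridges up (a sketch: the br10 address 10.10.10.80 and the gateway 10.20.20.1 are assumptions, only br20’s 10.20.20.80 is given above). The kernel installs the connected 10.10.10.0/24 route automatically the moment br10 gets its address, which is probably why my extra static route changed nothing:

    # routing table sketch on TNS with both bridges active
    ip route
    # default via 10.20.20.1 dev br20
    # 10.10.10.0/24 dev br10 proto kernel scope link src 10.10.10.80
    # 10.20.20.0/24 dev br20 proto kernel scope link src 10.20.20.80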

When I ping the router on the br10 or br20 interface from a client in the HomeLAN, the response time is consistently between 0.9 ms and 1.2 ms, EVEN when I have again lost the connection to the TNS WebGUI and see the message "Waiting for Active TrueNAS controller to come up" in my browser.

At OSI layer 1, I also swapped the network cables for new ones (I only use CAT6). CAT6 UTP should be more than enough for a Gbit network infrastructure.

Support for jumbo frames (MTU 9000) is activated on the switch and on TNS. OPNsense does not support an MTU of 9000, but even when I configure the network cards to the default MTU (1500), the reconnect issue remains.
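
For completeness, this is how I verify the usable MTU along a path from a Linux client (a sketch; 8972 = 9000 minus 28 bytes of IP and ICMP headers):

    # send a non-fragmentable jumbo-sized ping to TNS on the ServerLAN
    ping -M do -s 8972 10.20.20.80
    # a path limited to MTU 1500 only passes payloads up to 1472 bytes
    ping -M do -s 1472 10.20.20.80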

Second thoughts​

What puzzles me is why in TNS you can only define one default gateway (TNS WebGUI/Network/Global Configuration/Default Route) instead of one per network or bridge interface. That way I could assign a default gateway only to the br20 bridge interface and none to br10, so br10 would only serve the local broadcast domain it is connected to.

The thought came to mind that this is a router issue, but I have had a similar dual-NIC configuration on a macOS based system with the same router hardware and configuration and faced no issues at all. The difference between macOS and TNS – in my case – is that in macOS default gateways are assigned per network interface card instead of per server as in TNS. I gave only one NIC in macOS a default gateway.

So, again, it might be a router issue in regard to asymmetric routing. Thing is: if I tell TNS to advertise apps and the WebGUI on only one network interface/bridge, and keep SMB file sharing available only in the local broadcast domain, I eliminate the issue of asymmetric routing. Or am I thinking wrong here?




I have run out of options and ideas, guys. I hope someone can give me the “golden” tip that solves this issue. I might respond with a delay due to the time zone difference of 6 to 9 hours between the US and Europe. Sorry if variants of this issue have already been asked; after searching for so many hours, I have the impression that I could not find a solution that would help me out.

Thank you for your time and thoughts.


Johan
 

Patrick M. Hausen

Hall of Famer
You have a textbook case of asymmetric routing. Even if the packet from a client to TrueNAS goes over the router/firewall, TN will always send the return packet according to its routing table. And if there is a directly connected interface to that client, that is what will be used. This is how it's supposed to work; there is only a single network stack in TN.
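
You can ask the kernel on SCALE directly which way a reply will go (a sketch; substitute one of your actual HomeLAN client addresses for 10.10.10.50):

    # which route does TN pick for a packet back to a HomeLAN client?
    ip route get 10.10.10.50
    # -> 10.10.10.50 dev br10 src <br10 address>
    # the directly connected interface wins, no matter which interface
    # the request came in on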

So in a way you have a bad network design. You need to change your topology.

Yes, there are architectures where you can have multiple separate network stacks for various services, TN just is not one of them. Services like SMB and management all run on the same stack.

What you can separate from this TN network stack is bridged networking for VMs. In this case the networking is entirely in the OS running inside the VM and TN is only providing a layer 2 bridge or vSwitch if you want to call it that. In this case don't assign an IP address to brNN at all and just connect the VMs.

TN CORE can also connect jails (native FreeBSD containers) in this manner. I think apps in TN SCALE cannot use bridged networking because of the way container networking in Linux works, but I might be mistaken.

HTH,
Patrick
 

jgreco

Resident Grinch
Johan said:

What puzzles me is why in TNS you can only define one default gateway (TNS WebGUI/Network/Global Configuration/Default Route) instead of one per network or bridge interface.

How would you think this would work? If you were on the server and were to type "ping 8.8.8.8", which interface would the packet egress from? There's a reason that you're only supposed to have a single default route.
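
You can see the kernel's answer for yourself on SCALE (illustrative output; I'm assuming the ServerLAN gateway is 10.20.20.1):

    # the single default route decides the egress interface
    ip route get 8.8.8.8
    # -> 8.8.8.8 via 10.20.20.1 dev br20 src 10.20.20.80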
 

skittlebrau

Explorer
I’ve had the same problem before, except I had a single ethernet connection to TrueNAS SCALE, so I had it configured as a ‘trunk’ port.

Each VLAN was attached to the physical interface and to an appropriately named bridge (br10, br20, etc.).

VLAN 10 - Trusted devices
VLAN 20 - IoT
VLAN 30 - Security cameras
VLAN 50 - ‘DMZ’ (not an actual DMZ in the traditional sense)
VLAN 100 - Management

The web UI was bound to br100.
SMB was available on all bridges.
Kubernetes was bound to br20.

Default gateway was on VLAN 100.

The web UI would time out every 30 seconds, although Kubernetes apps on br20, VMs and such were all fine – only the web UI would time out.

What ended up being the cause was that I had inadvertently created exactly the problem the other posters mention above: I had a firewall rule that allowed just my desktop PC on the VLAN 10 network to reach all hosts on VLAN 100. I ended up tightening my firewall rules to fix the problem.
 
Johan

Patrick M. Hausen said:
You have a textbook case of asymmetric routing. Even if the packet from a client to TrueNAS goes over the router/firewall, TN will always send the return packet according to its routing table. And if there is a directly connected interface to that client, that is what will be used. This is how it's supposed to work; there is only a single network stack in TN.

So in a way you have a bad network design. You need to change your topology.

Yes, there are architectures where you can have multiple separate network stacks for various services, TN just is not one of them. Services like SMB and management all run on the same stack.

What you can separate from this TN network stack is bridged networking for VMs. In this case the networking is entirely in the OS running inside the VM and TN is only providing a layer 2 bridge or vSwitch if you want to call it that. In this case don't assign an IP address to brNN at all and just connect the VMs.

TN CORE can also connect jails (native FreeBSD containers) in this manner. I think apps in TN SCALE cannot use bridged networking because of the way container networking in Linux works, but I might be mistaken.

HTH,
Patrick
Patrick,

First of all, thank you very much for your comment and for being so willing to help people on this forum.

I needed some time to process what you wrote. In my Cisco CCNA Routing Protocols and Concepts book, exactly five lines of text are used to discuss asymmetric routing. Not that much.
One is never too old to learn.

Taking into account what you wrote, it seems the best solution is to adjust the network design so the asymmetric situation becomes a symmetric one (read: there is only one path from A to B and vice versa). In particular, I believe route preference is causing the trouble here: directly connected networks have a lower cost than routes that must go over a router, and the IP stack always picks the cheapest route to a given destination. So establishing a session from my HomeLAN to TNS on my ServerLAN works at first when it goes over my router, but since TNS is also directly connected to the HomeLAN, it sends its replies directly into the HomeLAN during the same session, causing the asymmetric routing issue.

I now also understand why ping kept working all the time while my HTTP (TCP) session was kicked out after 30 seconds: ping (ICMP) is a stateless protocol, while TCP is stateful.
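
You can even watch the asymmetry on the wire. A capture sketch (the client address 10.10.10.50 is illustrative) shows requests arriving on br20 while the replies leave on br10:

    # on TNS: the client's requests arrive via the router on br20 ...
    tcpdump -ni br20 host 10.10.10.50
    # ... while the replies egress directly on br10
    tcpdump -ni br10 host 10.10.10.50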

Now that I understand the root cause, I completely agree with you that a symmetric network design is the best way forward. Thing is, in my case following your recommendation comes with a speed penalty. I use jumbo frames on the HomeLAN segment to speed up Time Machine backups and SMB traffic, while my OPNsense router only supports an MTU of 1500 due to restrictions of the netmap driver it uses.
Next to this, my layer 2 switch is faster than my OPNsense router. I could maybe solve that by buying a layer 3 switch (it was my preferred choice), but unfortunately, due to supply chain issues, I would have had to wait about a year and a half. Hence I settled for a layer 2 switch last year.

My second best solution​

The second-best solution I came up with is to adjust my firewall in order to mitigate the asymmetric routing issue you mentioned.

As mentioned before, what I want to accomplish:
  • All apps and the WebGUI on TrueNAS Scale are accessible only via the TNS ServerLAN br20 interface,
  • SMB traffic to TNS on both the HomeLAN and the ServerLAN stays on the local subnet.

For Apps and TNS WebGUI​

For traffic from the HomeLAN to the TNS ServerLAN address that runs through the firewall, the firewall applies sloppy state tracking. In other words: let the firewall honor the fact that not all traffic of a TCP session passes through the firewall.
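
In pf terms (the packet filter underneath OPNsense) such a rule would look roughly like this. It is only a sketch of what the GUI generates, and the HomeLAN-side interface name igc0 is an assumption:

    # sloppy state: don't insist on seeing both directions of a TCP session
    pass in on igc0 inet proto tcp from 10.10.10.0/24 to 10.20.20.80 \
        keep state (sloppy)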

Later I can change this so that traffic to apps and the WebGUI on TNS runs via HAProxy on OPNsense (later, because writing a rewrite rule in HAProxy for Nextcloud gives me a headache).
All traffic from the HomeLAN to the TNS ServerLAN that does not go through the HAProxy will then be blocked by the firewall.

For SMB traffic​

I ensure SMB traffic stays local by blocking all SMB ports towards the TNS ServerLAN and TNS HomeLAN addresses in the firewall.
I also make sure my DNS returns a local HomeLAN IP address when Bonjour queries tns.local from the HomeLAN segment.
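
A pf sketch of the blocking part, one direction shown (again with an assumed interface name; TCP 139 and 445 are the standard SMB ports):

    # stop SMB from crossing the router, so clients can only reach SMB
    # on the directly connected bridge in their own subnet
    block in quick on igc0 proto tcp from 10.10.10.0/24 to 10.20.20.80 port { 139 445 }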

By using HAProxy and making sure SMB traffic to TNS stays local, I have basically converted the asymmetric routing situation into a symmetric one. Is my thinking right?

Regards,

Johan
 
Johan

jgreco said:
How would you think this would work? If you were on the server and were to type "ping 8.8.8.8", which interface would the packet egress from? There's a reason that you're only supposed to have a single default route.
jgreco,

Thank you for your comment. I was more thinking in terms of routing.
When I give no route info or do not assign a default gateway to a network interface, that network interface can only serve the local subnet it is connected to.

So for multiple-NIC hosts, it's basically an XOR function: you provide default gateway information to only ONE network/bridge interface. That way, you don't have the egress issues you mentioned, since your ping – for non-directly-connected networks – will always go to the single default gateway that is configured.

But, as Patrick mentioned here, TrueNAS Scale is not one of them.

Johan
 
Johan

skittlebrau said:
I’ve had the same problem before, except I had a single ethernet connection to TrueNAS SCALE, so I had it configured as a ‘trunk’ port.

Each VLAN was attached to the physical interface and to an appropriately named bridge (br10, br20, etc.).

VLAN 10 - Trusted devices
VLAN 20 - IoT
VLAN 30 - Security cameras
VLAN 50 - ‘DMZ’ (not an actual DMZ in the traditional sense)
VLAN 100 - Management

The web UI was bound to br100.
SMB was available on all bridges.
Kubernetes was bound to br20.

Default gateway was on VLAN 100.

The web UI would time out every 30 seconds, although Kubernetes apps on br20, VMs and such were all fine – only the web UI would time out.

What ended up being the cause was that I had inadvertently created exactly the problem the other posters mention above: I had a firewall rule that allowed just my desktop PC on the VLAN 10 network to reach all hosts on VLAN 100. I ended up tightening my firewall rules to fix the problem.
Skittlebrau,

Thank you for your comment. Your situation is a bit different in that you connected your TNS to a trunk port and used VLANs. To be honest, I prefer VLANs on my network equipment instead of on host systems, to keep a clear overview and to prevent loopholes.

Although your situation and mine are different, after a few nights’ sleep I indeed think the direction I have to take is adjusting my firewall and using HAProxy to access apps on TNS, all with the purpose of mitigating the asymmetric routing issue.

You gave me the idea to look into my firewall rules and to add a sloppy-state rule to my firewall (https://docs.netgate.com/pfsense/en/latest/routing/static.html#asymmetric-routing).
So thank you Skittlebrau.

Johan
 

jgreco

Resident Grinch
Johan said:

Thank you for your comment. I was more thinking in terms of routing.
When I give no route info or do not assign a default gateway to a network interface, that network interface can only serve the local subnet it is connected to.

Well, that's not true. Ingress can happen on any interface – a hazard with modern UNIX systems. This may be a non-issue for many people, but I'm typically looking at networks for the vulnerabilities that they generate. If you really need to restrict a network interface, you actually have to place a firewall rule on it to forbid unwanted ingress (and you might as well block unwanted egress too).
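
On SCALE that would be something along these lines. A sketch only, since hand-editing the host firewall on TrueNAS is not a supported configuration:

    # drop management-UI ingress on the interface that shouldn't serve it
    iptables -A INPUT -i br10 -p tcp --dport 443 -j DROP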

Johan said:

So for multiple-NIC hosts, it's basically an XOR function: you provide default gateway information to only ONE network/bridge interface. That way, you don't have the egress issues you mentioned, since your ping – for non-directly-connected networks – will always go to the single default gateway that is configured.

That's correct. The reality is just a little more complicated, as you can also have other static routes. But I'm thankful I don't seem to need to drag you over the finish line here. We occasionally have people who are Absolutely Very Certain They Know How It Works And Won't Take Kindly To Education. I like the ping example because it is a great way to reach people who are thinking about this from the point of view of an externally initiated connection – which is certainly one way you could use to provide a hint about which link to use. TrueNAS doesn't do that either, at least not by default, of course.

Johan said:

But, as Patrick mentioned here, TrueNAS Scale is not one of them.

Huh? I don't get what you're saying.
 

Patrick M. Hausen

Hall of Famer
I wrote that there are systems that can provide a separate IP stack for different interfaces or services or similar but that TrueNAS wasn't one of them.
 

jgreco

Resident Grinch
Patrick M. Hausen said:

I wrote that there are systems that can provide a separate IP stack for different interfaces or services or similar, but that TrueNAS wasn't one of them.

Oh. Sorry. Still wakin' up. I think it would be totally doable on CORE at least, given both the strong jail+vnet support (don't forget you don't actually need to chroot a jail) and the multiple routing table support via setfib(1). Just a matter of developer interest, of which I'm sure there is zero. Just imagine the fun it would be trying to support that here on the forums.
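
Something like this, purely as a FreeBSD sketch (the daemon path and gateway are made up, and net.fibs has to be raised in /boot/loader.conf at boot):

    # give FIB 1 its own default route, then run one service inside it
    setfib 1 route add default 10.10.10.1
    setfib 1 /usr/local/bin/some_daemon   # hypothetical daemon using FIB 1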
 