Faulty Kubernetes Routing - passing all traffic to Gateway

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
There have been a couple of threads on this issue, but none of them seem to have a working solution (for me, at least).
I changed the default gateway of my SCALE NAS (from .254 to .15, if it matters).
None of my containers/charts that need access to the outside world work properly now.
NAS DNS is 192.168.38.10 & 11 (I use AD - so need to use local DNS)
NAS is 192.168.38.32
Gateway is 192.168.38.15

Diagnosis Steps:
Shell into a Heimdall container (it has ping and nslookup); the same checks are sketched as commands after the screenshot below. I have noted where the result differs from a NAS shell.
Ping 1.1.1.1 - works
Ping 192.168.38.15 - works
Ping 192.168.38.10 (or .11, or anything else) - does not work (but does work from the NAS itself). The NAS can ping anywhere; the container can only ping the gateway (this may well be the issue)
nslookup ibm.com - does not work (works from the NAS) - the below is what I get from the container
[Screenshot: failed nslookup output from the container]
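For anyone who wants to repeat these checks, here is a minimal sketch of the commands, run from the TrueNAS SCALE shell as root. The namespace and pod name are examples (TrueCharts apps normally land in an ix-<appname> namespace) and will differ on your system:

# Find the Heimdall pod (namespace is an example; adjust to your install)
k3s kubectl get pods -A | grep -i heimdall

# Open a shell inside the pod
k3s kubectl exec -it -n ix-heimdall <heimdall-pod-name> -- sh

# From inside the container:
ping -c 3 1.1.1.1           # internet IP - works
ping -c 3 192.168.38.15     # gateway - works
ping -c 3 192.168.38.10     # LAN DNS server - fails (works from the NAS shell)
nslookup ibm.com            # fails (works from the NAS shell)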

At the very least DNS seems fubarr'd - but there may be more to it.

Further messing around: I added nameserver 8.8.8.8 into resolv.conf in the container (obviously this will not persist) and theregister.co.uk then resolves correctly, albeit not quickly - there is a noticeable delay of 4-5 seconds.
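A sketch of that temporary test, run from a shell inside the container (the change does not survive a pod restart, and whether /etc/resolv.conf is writable depends on the image):

# Inside the container: add a public resolver (non-persistent)
echo "nameserver 8.8.8.8" >> /etc/resolv.conf
nslookup theregister.co.uk              # resolves, but with a noticeable 4-5 second delay

# The second test (firewall / DNS proxy instead of 8.8.8.8) can also be done without
# editing the file, by pointing nslookup at the server directly:
nslookup sex.com 192.168.38.15          # also resolves, again slowly

The 4-5 second delay may simply be the resolver timing out against the nameserver still listed first before falling back to the added one, but that is speculation.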

My take on this is that the containers within TN, for some reason, will only talk to the gateway - and nothing else on the LAN - and thus DNS is failing. A further test: I removed 8.8.8.8 and added 192.168.38.15 (my firewall and DNS proxy) and attempted to resolve sex.com (it's short and easy to type). This also works, again slowly (4-5 seconds).

BTW - under TrueNAS Global Configuration - Outbound Network I have Allow All selected.
Also, the 4-5 second delay is only from the container; TrueNAS itself resolves everything very quickly.

Anyone got an idea of what's wrong, what I have messed up, and (of course) how to fix it - preferably without rebuilding everything?
 


NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I don't think it's a TrueCharts issue. It feels more like a TN issue with Kubernetes. Oh, and yes, I have been on the Discord channel.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I have done a lot more diagnosis on this - and it's a whole hell of a lot more complex than I thought, to the point that I am going "WTAF is going on?"
Simplified Network View:
[Diagram: simplified network view]

The basic premise is that all traffic from specific hosts is policy-routed at the pfSense firewall. At the moment this is simple: anything from .32 or .36 is routed out of the OpenVPN gateway, whilst traffic from (for example) .10 is routed normally, unencrypted, onto the internet. I can then get more granular over time, but I need to start somewhere.

The pfSense config is really quite simple at the moment:
1 LAN interface, 1 WAN interface
1 OpenVPN tunnel, 1 VPN interface, 1 VPN gateway
Rules:
MASQ (outbound NAT) rules on the WAN and VPN interfaces
PBR rule on LAN (and this may be the immediate issue) pointing certain IP addresses at the VPN gateway; matching traffic is tagged
Floating rule to block tagged traffic from going out of the WAN

With the network in this state, containers cannot ping other LAN devices but can ping the gateway. DNS resolution also does not work, as the TrueNAS DNS servers are LAN-based. This is suboptimal (as in useless).

If I switch off the VPN gateway (as in disable it) then the containers work properly (but not encrypted, obviously).
If I switch the VPN gateway back on then the containers stop working again.

Running a traceroute from the Heimdall container (it has nslookup, ping and traceroute), I get:
[Screenshot: traceroute output from the container]

Traffic is leaving the container and going straight to the firewall, which then routes/redirects it back to the actual destination. This implies that all traffic leaving a container goes to the firewall first and then back into the network. Surely this is wrong on all sorts of levels and not the way traffic should be routed.

I think what is happening is that the firewall, through the policy-based route, is grabbing the traffic and trying to push it out of the VPN interface - which is entirely the wrong direction - and I think I have confirmed that with some other tests (yes, confirmed by a traceroute).

However, the bigger question is why all the local traffic is being sent to the firewall rather than being put onto the LAN properly - which I feel must be a bug, unless someone can correct me.
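One way to confirm where the container traffic is really going, sketched below with the addresses from this thread, is to capture it on the TrueNAS side while pinging a LAN host from a container. If the echo requests for a 192.168.38.x destination carry the firewall's MAC address rather than the target's, the traffic is indeed being bounced via the gateway:

# On the TrueNAS host - interface name is an example, use your LAN NIC or bridge
tcpdump -eni enp1s0 icmp and host 192.168.38.10

# Meanwhile, from inside the container:
ping -c 3 192.168.38.10

# Compare the destination MAC shown by tcpdump with the MACs of the gateway and the target:
ip neigh show 192.168.38.15
ip neigh show 192.168.38.10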
 

Bazoogle

Cadet
Joined
Apr 27, 2023
Messages
9
Were you ever able to make any progress on this? The same issue is still happening nearly 8 months later.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
However, the bigger question is why all the local traffic is being sent to the firewall rather than being put onto the LAN properly - which I feel must be a bug, unless someone can correct me.
What is the netmask/prefixlength of your TrueNAS systems?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
/24 - all my LAN is /24. The traffic should NOT be going to the router (when it's local) - it's being misdirected at the kube-router.
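For anyone wanting to look at what k3s / kube-router actually programs on the SCALE host, a rough sketch (chain names vary between releases, so treat this as a starting point only):

# Host routing table and policy rules
ip route
ip rule

# NAT rules added for Kubernetes - the KUBE-* chains come from kube-router / kube-proxy
iptables -t nat -L -n -v | grep -i kube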

IX have acknowledged the issue and declared it an upstream problem that they cannot / will not fix.

I think they are wrong. I doubt it's an upstream issue (although I have no mechanism for proving that) and I actually think they are wrong in not trying to fix this.

For a home user it's probably irrelevant - I am just being fussy.
But for an enterprise, "Why is ALL my internal traffic bouncing off my firewall?" would be something of a deal breaker to me, causing me to switch off K3S and say "use something else instead". Note that TrueNAS would be fine, but K3S would be out.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
/24 - all my LAN is /24. The traffic should NOT be going to the router (when it's local) - it's being misdirected at the kube-router.
100% agree. Following the SCALE and k3s development all the time, I see more and more reasons to stick with CORE, vnet and jails. It's just so much simpler and more reliable in so many aspects.
 

Bazoogle

Cadet
Joined
Apr 27, 2023
Messages
9
For a home user it's probably irrelevant - I am just being fussy.
As a home user, this is causing me a massive headache. I cannot get this to work. I feel like my knowledge of networking is good enough to know it's screwed up, but not enough to fix anything. You were able to disable your VPN gateway to resolve it; however, I am having the same issue except there is nothing I can disable to fix it. I am able to ping 8.8.8.8 but not google.com. Several apps cannot work because they cannot access the internet. IX needs to fix this. It's enough that I'm considering just using a different NAS service.

Is there a static route I can assign in my router to ensure the traffic goes to the correct location?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
@Bazoogle - there are lots of things that could be causing this issue, not just this kube-router "bug/feature".
SCALE relies on the router redirecting the traffic back to the LAN - the only reason I spotted this was because I was using PBR on my firewall (to push traffic down the VPN), which was grabbing the traffic before the redirect happened. Everything was working if I disabled PBR. DNS worked, both internally and externally - it's just that the routing is sub-optimal.

If you can ping 8.8.8.8 but cannot resolve names, then this (the above) is NOT your issue. You have a different problem, to do with DNS resolution.

Load up the Heimdall TrueCharts chart if you can, as that one did (when I tested) contain sufficient tools to run tests with.
Also check your Kubernetes settings: is the Cluster DNS IP set to 172.17.0.10?

The way it works is (I think - someone please correct me if I am wrong): 172.17.0.10 is a K3S service that forwards DNS requests through the kube-router out to the LAN. All traffic from a pod goes through the kube-router. Pods use the K3S cluster DNS by default for DNS resolution, which is then forwarded to the TrueNAS-defined DNS.
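A quick way to check that chain from inside a pod, assuming the pod has been given the cluster resolver (172.17.0.10 being the Cluster DNS IP from the Kubernetes settings above):

# Inside the pod: what resolver has the pod been handed?
cat /etc/resolv.conf               # should list nameserver 172.17.0.10

# Query the cluster DNS directly, then a LAN DNS server for comparison
nslookup ibm.com 172.17.0.10
nslookup ibm.com 192.168.38.10     # only works if LAN routing from the pod works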
 

Drocona

Cadet
Joined
Apr 17, 2020
Messages
4
Just here to acknowledge this problem, as I've been pulling my hair out for the last week trying to figure out what the actual F is going on with my network, only to find it's a problem caused by TrueNAS SCALE's implementation of networking in K3s.

To confirm: all traffic coming from K3s is being blasted towards the gateway instead of using layer 2 communication. This is easy to see when using a strict router/firewall. Traffic coming from K3s that should be L2 traffic is sent to the router/firewall, which then blocks it because it is supposed to be L2 traffic, not L3. Result: K3s workloads trying to reach IPs in the same subnet K3s is connected to will FAIL. Accessing other subnets works fine, as that is genuinely L3 traffic.
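That symptom can be demonstrated from inside any K3s workload; a sketch with example addresses matching the layout described below:

# From a shell inside a pod on the 172.16.0.x (VLAN20) network:
ping -c 3 172.16.0.30       # example host in the SAME subnet - fails (L2 traffic sent to the gateway and dropped)
ping -c 3 192.168.0.100     # host in a DIFFERENT subnet - works (genuine L3 traffic via the router)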

To make matters worse, I suspect TrueNAS SCALE also does some seriously bad asymmetric routing, again causing network connectivity issues.
My TrueNAS SCALE setup uses VLANs; it has the following interfaces:
An IP in VLAN10 on 192.168.0.20, this is used for general access and connectivity for shares
An IP in VLAN20 on 172.16.0.20, this is used for K3s workloads

When I try to access a K3s workload from VLAN10 (my PC has IP 192.168.0.100) I seem to be getting the following traffic flow:
192.168.0.100 (PC) to 192.168.0.1/172.31.0.1 (Router/FW)
192.168.0.1/172.16.0.1 (Router/FW) to 172.31.0.54 (TrueNAS K3s Workload LoadBalancer IP)
172.31.0.54 (K3s LB) to 172.16.0.225 (K3s POD IP)
So far so good, but then the reply:
172.16.0.225 (K3s POD IP) to 172.16.0.1 (K3s Kube Router)
172.16.0.1 (K3s Kube Router) to 192.168.0.100 (PC)

It seems that since TrueNAS has an IP in the 192.168.0.x range, it decides packets to 192.168.0.100 should go out on that interface instead of staying on the correct VLAN for the K3s cluster even though K3s is configured to be VLAN20 only.
This throws an asymmetrical routing problem on the router/firewall and kills the traffic flow which causes extremely intermittent connectivity.
I've implemented a workaround in my network to accept this behavior on the firewall, but obviously this is not how it's supposed to be working!
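For the asymmetric return path, a quick check on the SCALE host itself (addresses are the ones from this example) is to ask the kernel which way a reply to the client would leave:

# On the TrueNAS SCALE host
ip route get 192.168.0.100
# If this shows the VLAN10 interface (source 192.168.0.20) while the request arrived via the
# K3s / VLAN20 path, replies are taking a different route than the requests - the asymmetric
# flow described above.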
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Yay - someone else agrees with me. K3S routing is broken, badly.
I did log this as a fault and had to explain the issue to IX, who (and this is the individual on the ticket, rather than IX as a whole) clearly did not understand how routing works. IX have declined to fix it, saying it is working as expected. [Which it clearly isn't.]

I cannot confirm the VLAN issues - I haven't tried.

From my PoV it makes K3S borderline unusable. BTW - if you have a "host networking" option in the pod, tick it; that seems to solve some of the issues.
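For what it's worth, a sketch of how to check whether a given app's pod actually ended up with host networking (namespace and pod name are examples):

# Host networking puts the pod on the NAS's own network stack, bypassing the kube-router hop
k3s kubectl get pod -n ix-heimdall <pod-name> -o yaml | grep hostNetwork
# Prints "hostNetwork: true" when the option took effect; nothing is shown when it is off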
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
*sigh*
 