SOLVED Kubernetes DNS resolution broken

Xenthalon

Cadet
Joined
Dec 26, 2023
Messages
2
After running fine for the past few months, as of this morning some of my containers started throwing errors because they couldn't resolve external addresses. I've been trying to fix it for the past few hours, but I'm kinda stuck.

From the TrueNAS SCALE shell nslookup is fine:

Code:
root@freenas[~]# nslookup google.com
Server:         1.1.1.1
Address:        1.1.1.1#53

Non-authoritative answer:
Name:   google.com
Address: 142.250.186.78
Name:   google.com
Address: 2a00:1450:4001:827::200e


From inside a container things don't look too good though:
Code:
root@freenas[~]# k3s kubectl exec -i -t gitea-76db8c7bd6-h5jz8 --namespace ix-gitea -- nslookup google.com
Defaulted container "gitea" out of: gitea, gitea-init-postgres-wait (init)
Server:         172.17.0.10
Address:        172.17.0.10:53

;; connection timed out; no servers could be reached

command terminated with exit code 1


So I dug around, and the culprit seems to be within Kubernetes CoreDNS:

Code:
root@freenas[~]# k3s kubectl logs --namespace=kube-system -l k8s-app=kube-dns                             
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 172.16.1.116:48981->1.1.1.1:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 172.16.1.116:45498->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 172.16.1.116:32862->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 172.16.1.116:46529->1.1.1.1:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 172.16.1.116:48613->1.1.1.1:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 172.16.1.116:37692->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 172.16.1.116:50928->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 172.16.1.116:50217->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 172.16.1.116:43422->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 172.16.1.116:48322->8.8.8.8:53: i/o timeout


So this is good and bad: my DNS servers (1.1.1.1, 8.8.8.8) from /etc/resolv.conf are forwarded correctly into CoreDNS, and the requests from the gitea container do make it to CoreDNS, but it seems the DNS requests don't make it out from there. I'm stuck on what to do next. I restarted my server several times, hoping it would resolve itself, and googling turned up lots of similar issues, ranging from firewalls to incorrect iptables entries to straight-up Kubernetes reinstalls fixing the problem.

But I don't think there is a firewall in my SCALE install, and I really don't want to reinstall. The weirdest thing is I didn't change anything; all I did was run a couple of TrueCharts container updates yesterday evening, but how could those break my entire k3s setup?
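I guess the next thing to check is whether the pods' queries ever leave the box at all, and with which source address. A sketch of what I have in mind (the interface name enp1s0 is just a placeholder and has to match the actual NIC):

Code:
# Watch outbound DNS traffic while a pod retries the failing lookup
# (enp1s0 is a placeholder - substitute your real NIC)
tcpdump -ni enp1s0 port 53

# Read-only: list the NAT POSTROUTING chain with packet counters
iptables -t nat -L POSTROUTING -v -n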

Everything is default, running latest TrueNAS-SCALE-23.10.1.

Network config:
Screenshot 2023-12-26 211038.png


Kubernetes Settings:

Screenshot 2023-12-26 211253.png
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
So did the problem occur after the TrueCharts updates?
You might want to detail which TrueCharts apps were updated, and also check with them whether there is any change they have made.

It would also be useful to know if the problem persists through a reboot.
 

Xenthalon

Cadet
Joined
Dec 26, 2023
Messages
2
It persists through several reboots.

TrueCharts made a major version bump of all their charts, for me these were: bazarr, plex, prowlarr, radarr, sonarr and sabnzbd. Didn't see any bug reports in their Discord or their issue tracker though. And rolling back the containers to previous versions didn't change anything either.

I bet it's a bad iptables entry, but I just don't know anything about that. And if I were to change my Kubernetes Cluster DNS IP, that would completely reset my Kubernetes setup and delete all containers, right? That'll be my last resort then.
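(For anyone else reading along: just listing the NAT rules is read-only and doesn't touch the cluster, so it's a safe first look before resorting to a cluster reset. Something like:)

Code:
# Read-only: dump the NAT table with packet/byte counters
iptables -t nat -L -v -n

# Read-only: show the kube-specific NAT chains and rules, if any
iptables -t nat -S | grep -i kube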
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694

I don't know if anyone else has reported the issue...
perhaps change the thread title to include the TrueCharts update.

There are some diagnostic suggestions in this thread:

Do you want to report a bug from the UI? If so, get the NAS ticket number and post it here.
 

BlackHunter

Cadet
Joined
Dec 28, 2023
Messages
4
Bumping this thread, the same thing started happening to me. I'd been running fine for the past 2 months and yesterday it stopped working. I also dug around and found that the CoreDNS pod is failing to resolve external DNS.

The pod itself seems to be fine. I've come across many past issues, and none of the fixes worked: most of them were about a bad /etc/resolv.conf that had to be overridden, but my file seems to be fine. Restarting the pod didn't work either.

I'm not familiar with Kubernetes, but I came to the same conclusion: the issue seems to reside in the communication between the CoreDNS pod and the underlying OS.

I have tried:
- Debugging the CoreDNS container with an ephemeral debug container; /etc/resolv.conf looked fine and was using the host's DNS servers.
- Hard-coding the DNS servers into the CoreDNS config (see the sketch after this list); that didn't work either.
- Debugging internal DNS queries (between pods); they work fine.
- Shutting down the CoreDNS pod cuts internal DNS resolution too.
- Resetting the pod (by deleting, scaling down, or a rollout restart) didn't yield different results.
- nslookup within the CoreDNS container didn't resolve external DNS queries.
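
For reference, the hard-coding step above means editing the Corefile's forward line. Roughly like this (assuming the default k3s names, ConfigMap and deployment both called coredns; not necessarily exactly what I ran):

Code:
# Edit the Corefile (ConfigMap name assumed to be "coredns" on k3s)
k3s kubectl -n kube-system edit configmap coredns
#   ...change "forward . /etc/resolv.conf" to e.g. "forward . 1.1.1.1 8.8.8.8"

# Restart CoreDNS so it reloads the new Corefile
k3s kubectl -n kube-system rollout restart deployment coredns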

What puzzles me is that I don't remember going through any major TrueNAS SCALE upgrade, and rolling back the problematic pods didn't work either. My current TrueNAS version is 23.10.0.1. I'm updating to TrueNAS 23.10.1 now to see if it fixes anything, and will post again if it changes something...
 

BlackHunter

Cadet
Joined
Dec 28, 2023
Messages
4
Config:

1703766101277.png
1703766150698.png



/etc/resolv.conf in the CoreDNS pod:

Code:
root@truenas[/home/admin]# k3s kubectl debug -n kube-system -it $(./dns-pod.sh) --target=coredns --image=busybox
root@coredns-b85f967f9-b7g2m:/# cat /etc/resolv.conf 
nameserver 100.90.1.1
nameserver 8.8.8.8


Don't know if it matters or if it even works that way, but I also tried an nslookup inside it:
Code:
/ # nslookup google.com
;; connection timed out; no servers could be reached
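
(dns-pod.sh above isn't shown here; it just grabs the CoreDNS pod name. A hypothetical equivalent, purely for illustration:)

Code:
# Hypothetical stand-in for dns-pod.sh: print the CoreDNS pod name
k3s kubectl get pods -n kube-system -l k8s-app=kube-dns -o name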


Also tried running a dnsutils pod from this guide:

Code:
root@truenas[/home/admin]# k3s kubectl run dnsutils --image=k8s.gcr.io/e2e-test-images/jessie-dnsutils:1.3 --command -- sleep 3600
pod/dnsutils created
root@truenas[/home/admin]# k3s kubectl exec -i -t dnsutils -- nslookup kubernetes.default
Server:         172.17.0.10
Address:        172.17.0.10#53


Name:   kubernetes.default.svc.cluster.local
Address: 172.17.0.1


root@truenas[/home/admin]# k3s kubectl exec -i -t dnsutils -- nslookup google.com       
Server:         172.17.0.10
Address:        172.17.0.10#53


** server can't find google.com: SERVFAIL


command terminated with exit code 1
root@truenas[/home/admin]#
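
One more check that's probably worth doing: query an upstream resolver directly from the same pod, bypassing CoreDNS, to separate a CoreDNS problem from a pod-egress problem. A sketch using the dnsutils pod above:

Code:
# Ask an upstream server directly instead of the cluster DNS service
k3s kubectl exec -i -t dnsutils -- nslookup google.com 1.1.1.1

# If this also times out, pod traffic isn't making it off the host,
# which points at NAT/iptables rather than CoreDNS itself.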
 

BlackHunter

Cadet
Joined
Dec 28, 2023
Messages
4
Follow up

I checked the Discord and someone figured out a quick fix.
Apparently adding a MASQUERADE rule for 172.16.0.0/16 fixes the DNS issues, which suggests the DNS traffic might have been blocked by iptables.

The command that I used was:
Code:
iptables -t nat -A POSTROUTING -o enp9s0 -s 172.16.0.0/16 -j MASQUERADE


And my iptables POSTROUTING chain went from:
Code:
root@truenas[/home/admin]# iptables -t nat -L -v
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 MASQUERADE  all  --  any    enp9s0  10.8.0.0/24          anywhere           
    0     0 SNAT       all  --  any    any    !172.16.0.0/16       !172.16.0.0/16        vdir ORIGINAL vmethod MASQ /*  */ to:192.168.1.200 random-fully

Chain KUBE-KUBELET-CANARY (0 references)
 pkts bytes target     prot opt in     out     source               destination         

Chain KUBE-MARK-DROP (0 references)
 pkts bytes target     prot opt in     out     source               destination         

Chain KUBE-MARK-MASQ (0 references)
 pkts bytes target     prot opt in     out     source               destination         

Chain KUBE-POSTROUTING (0 references)
 pkts bytes target     prot opt in     out     source               destination


to

Code:
root@truenas[/home/admin]# iptables -t nat -A POSTROUTING -o enp9s0 -s 172.16.0.0/16 -j MASQUERADE
root@truenas[/home/admin]# iptables -t nat -L -v                                                 
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 MASQUERADE  all  --  any    enp9s0  10.8.0.0/24          anywhere           
    0     0 SNAT       all  --  any    any    !172.16.0.0/16       !172.16.0.0/16        vdir ORIGINAL vmethod MASQ /*  */ to:192.168.1.200 random-fully
   40  2326 MASQUERADE  all  --  any    enp9s0  172.16.0.0/16        anywhere           

Chain KUBE-KUBELET-CANARY (0 references)
 pkts bytes target     prot opt in     out     source               destination         

Chain KUBE-MARK-DROP (0 references)
 pkts bytes target     prot opt in     out     source               destination         

Chain KUBE-MARK-MASQ (0 references)
 pkts bytes target     prot opt in     out     source               destination         

Chain KUBE-POSTROUTING (0 references)
 pkts bytes target     prot opt in     out     source               destination
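
One caveat (an assumption on my part, I haven't verified it on SCALE): a rule added by hand like this probably won't survive a reboot, so it may need to be re-added, e.g. from a post-init command. An idempotent version that can be run repeatedly without duplicating the rule could look like this (again, enp9s0 has to match your NIC):

Code:
# Add the MASQUERADE rule only if it isn't already present (-C = check)
iptables -t nat -C POSTROUTING -o enp9s0 -s 172.16.0.0/16 -j MASQUERADE 2>/dev/null || \
iptables -t nat -A POSTROUTING -o enp9s0 -s 172.16.0.0/16 -j MASQUERADE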
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694

Did anyone work out why the change was needed... was it a SCALE issue or the TrueCharts apps?

If it is a SCALE issue, we'd prefer that a ticket is filed (try out the new UI); then report the NAS ticket number here.
 

BlackHunter

Cadet
Joined
Dec 28, 2023
Messages
4
I haven't looked it up since. I haven't updated SCALE in a good while, but I did update SCALE apps.

The apps I had when this happened were:
- bazarr (truecharts stable)
- cloudflareddns (truecharts stable)
- cloudnative-pg (truecharts stable)
- crafty (truecharts stable)
- dashy (truecharts stable)
- firefly-iii (trueNAS community)
- jellyfin (truecharts stable)
- jellyseerr (truecharts stable)
- pgadmin (truecharts stable)
- YouTransfer (truecharts stable)
- prometheus-operator (truecharts stable)
- prowlarr (truecharts stable)
- nginx custom app (docker container)
- qbittorrent (truecharts stable)
- radarr (truecharts stable)
- rustpad (truecharts stable)
- scrutiny (truecharts stable)
- sonarr (trueNAS community)
- wg-easy (trueNAS charts)

I usually update apps as soon as I see an update available, so you might start by checking which of these apps had a meaningful update on December 27th or 28th.

Hope it helps.
 