Containers all seem to be offline, need help troubleshooting

dealy663

Dabbler
Joined
Dec 4, 2021
Messages
32
Hi

I've been running SCALE for about a year now, and yesterday my system became relatively unusable. I am able to connect via the console, but it isn't really clear to me what has gone wrong. The file systems look good, and zpool reports that all of my disk pools are functioning properly. Basic networking also seems to be OK; however, if I connect to TrueNAS via SSH, the connection works for only a couple of minutes. Samba is down and all of my shares are inaccessible. The TrueNAS GUI is not working either, which makes it more difficult for me to troubleshoot. PiHole is also running in a container which is inaccessible, so my DNS services are down. One final oddity is that one of my VMs is still up and functioning properly. I think it was the only one on by default. I can SSH into that VM normally, and the web service running on that VM seems to be fine.

I couldn't find anything helpful in syslog or the error log. However, in the k3s log I do see some suspicious messages. The initial status of k3s looks OK; it starts:
  1. kube-apiserver
  2. kube-scheduler
  3. kube-controller-manager
  4. ... and several other normal sounding controllers
The first indication of a problem is when it tries to start PiHole: "PLEG is not healthy, failed to read pod IP..." Then there is more about CNI failing to retrieve the network namespace, container not found in pod, and more plugin/docker errors related to networkPlugin for CNI. Then a slew of messages about CNI config uninitialized for several containers:
  • ix-pihole
  • kube-system/openebs-zfs-controller
  • ix-pihole/svclb-pihole-dns-qnbg7
Followed by lots of other failures, such as mount device failed, failed to remove container...
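
To make those log observations easier to share, the recurring phrases can be tallied. This is only a minimal sketch: it assumes the k3s messages have been saved to a text file, the function name `count_k3s_errors` is hypothetical, and the patterns are just the phrases paraphrased above rather than an official list of kubelet errors.

```shell
#!/bin/sh
# Tally how often each suspicious phrase appears in a saved log file.
# The log file path is whatever you saved the k3s output to (an assumption).
count_k3s_errors() {
    logfile="$1"
    for pattern in "PLEG is not healthy" "CNI config uninitialized" "failed to remove container"; do
        # -c counts matching lines, -i ignores case, -F treats the pattern literally.
        # grep exits non-zero when nothing matches, so "|| true" keeps the "0" output.
        count=$(grep -c -i -F -- "$pattern" "$logfile" || true)
        printf '%s: %s\n' "$pattern" "$count"
    done
}
```

Running `count_k3s_errors saved-k3s.log` (file name hypothetical) prints one line per phrase with its count, which shows at a glance which failure dominates.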

Any ideas about what has gone wrong? How can I troubleshoot, or better yet fix, this? Or am I even barking up the wrong tree?

Thanks, Derek
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
You have multiple issues... usually there is a common cause.

My general recommendation is to focus on the simplest problem.... resolve that, then go on to the next.

The simplest seems to be ssh... looks like a networking issue.

Which version of SCALE? When did you update?
 

dealy663

I'm on TrueNAS SCALE 22.02.04; I last updated several months ago.

It doesn't really seem like a network problem. I can ping TrueNAS and the ping times just continue at about the same pace. When I SSH over I can log in, and although the SSH session eventually dies, ping stays fine. Also, I can SSH to a VM hosted on TrueNAS without issue, so I'm not sure how there could be a network problem that is borking SSH and Kubernetes. I've been running iPerf now and it is also holding solid at ~9Gb/sec.
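
Since ping only exercises ICMP, a per-port TCP check can show which services actually accept connections. The following is a rough sketch, not a definitive test: the function names are hypothetical, it relies on bash's `/dev/tcp` feature, and it assumes a `timeout` command is available (on macOS that usually means installing GNU coreutils). The port list is an assumption based on the services mentioned in this thread (SSH, web UI, SMB).

```shell
#!/bin/bash
# Report whether a TCP port accepts a connection within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so it needs bash rather than plain sh.
port_state() {
    local host="$1" port="$2"
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# Check SSH (22), the web UI (80, 443), and SMB (445) on one host.
check_basic_services() {
    local host="$1" port
    for port in 22 80 443 445; do
        printf '%s:%s %s\n' "$host" "$port" "$(port_state "$host" "$port")"
    done
}
```

For example, `check_basic_services 192.168.86.11` would print one line per port. Note that a port reported as open only proves the connection was accepted; it would not by itself explain a session that dies a few minutes later.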

Accessing the networking settings via the text UI on the console, everything looks just fine, as I last configured it. I have a bridge defined and am only using the 10GbE NIC. I could enable the onboard 1GbE NIC, but I'm doubtful that will make much of a difference. Also, I'm just not that familiar with where all the network controls are on TrueNAS when going through the command line. iPerf3 has been running for 5 min or so with no drops.

Man this is a bummer.
 

morganL

How are you setting up ssh security...
Any certificates involved?
 

dealy663

An RSA public key was uploaded to my TrueNAS logon .ssh folder, nothing very elaborate for a home lab setup.
 

morganL

dealy663 said: "An RSA public key was uploaded to my TrueNAS logon .ssh folder, nothing very elaborate for a home lab setup."
And the webUI is not working... these are basic services.

I'd suggest you document the network and bridge/app settings well. Perhaps someone will see a mistake. Otherwise, I'd suggest tearing apps and bridges down and making sure the fundamentals work.

Were there any config changes shortly before the issues started (e.g. a new app)?
 

dealy663

No recent changes; the last app change was maybe an update to Plex around Aug-Sep. I might have installed PiHole around that time too, I guess.

No other config changes for maybe 6-9 months.

network interface> query
name    | type     | state.alias                     | aliases          | ipv4_dhcp | desc            | mtu  | bridge member | stp
enp4s0  | physical |                                 |                  | true      | 1Gbe (disabled) | 1500 |               |
enp11s0 | physical | fe80::8261...                   |                  | false     | 10Gbe           |      |               |
br1     | bridge   | 192.168.86.11/24, fe80::342f... | 192.168.86.11/24 | false     | bridged if      | 9000 | enp11s0       | true
 

morganL

And you are testing ping and iperf from the same machine you are trying to reach ssh and the webUI from?
What address is that machine?
How are DNS and the IP default gateway set?
Do you have apps set to same IP address or their own IP address?
 

dealy663

Ping and the iPerf client are running from my MacBook: 192.168.86.136
DNS on the MacBook points to PiHole, which is running in a container on the TrueNAS box
All default gateways point to: 192.168.86.1
DNS on TrueNAS points to 192.168.86.11 (the TrueNAS IP address)
All app containers have their default IP address, which I assume is the same as TrueNAS: 192.168.86.11
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
morganL said: "And the webUI is not working"
What are the links displayed in the console? I ask because I run the UI on a custom port, while Traefik is using the usual web ports.

console-scale.png
 

dealy663

192.168.86.11:80 & 443 (normal default)

For just about 3 min a while ago I was able to get to PiHole and also the main login page of the UI. After logging in the UI never finished loading the page and then PiHole was no longer responsive. This is so so weird.
 

Daisuke

@dealy663 I'm on a Mac also, and I only use Safari as my browser. Sometimes when I have issues with websites, I open a private window to make sure it is not something weird stored in the cache. I'm pretty sure it is not related to your issue, but it's worth trying. At home I use Unifi Pro infrastructure, so I have never experienced any network-related issues. From what you mentioned earlier, this is not a network-related issue either.

Do you have any DNS settings related to SCALE in Pi-Hole, like I do? I run redundant Pi-Holes on 2 Raspberry Pis, set as DNS servers in my UDM SE. I was wondering if it is possible to stop the Pi-Hole app and test access to your SCALE UI without any Pi-Hole interaction on the network, from a Safari private window.

Also, what happens when you run these commands? Try that first please.
Code:
# systemctl status nginx
# systemctl restart middlewared
# iptables -L INPUT -n --line-numbers

If Nginx is not running or something changed in your iptables, that would explain your issues. I know that iptables rules are changed dynamically while the Kubernetes cluster is starting at boot time.

Testing my UI from Mac terminal:
Code:
$ curl -I http://uranus.lan:8080/ui/
HTTP/1.1 200 OK
Server: nginx
Date: Tue, 20 Dec 2022 02:25:17 GMT
Content-Type: text/html
Content-Length: 7852
Last-Modified: Tue, 13 Dec 2022 12:49:41 GMT
Connection: keep-alive
Etag: TrueNAS-SCALE-22.12.0
Cache-Control: must-revalidate
Strict-Transport-Security: max-age=0; includeSubDomains; preload
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Permissions-Policy: geolocation=(),midi=(),sync-xhr=(),microphone=(),camera=(),magnetometer=(),gyroscope=(),fullscreen=(self),payment=()
Referrer-Policy: strict-origin
X-Frame-Options: SAMEORIGIN
Accept-Ranges: bytes

If my browser did not display the UI, at least I would know the problem is with the browser, since the Nginx service is functional.
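
The same header check can be reduced to a one-line verdict. This is just a sketch under a few assumptions: the function names `http_status` and `ui_verdict` are made up for illustration, and treating any 2xx or 3xx code as "working" is a simplification rather than anything TrueNAS defines.

```shell
#!/bin/sh
# Extract the numeric status code from HTTP response headers read on stdin.
http_status() {
    # The status line is the first line, e.g. "HTTP/1.1 200 OK"; field 2 is the code.
    awk 'NR == 1 { print $2 }'
}

# Turn a status code into a short verdict about the web server.
ui_verdict() {
    case "$1" in
        2??|3??) echo "nginx is serving the UI" ;;
        "")      echo "no HTTP response" ;;
        *)       echo "nginx answered with an error" ;;
    esac
}
```

With the address and path taken from earlier in this thread, it could be used as `ui_verdict "$(curl -sI http://192.168.86.11/ui/ | http_status)"`. An empty result means curl got no response at all, which points away from the browser and toward the service or the network path.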
 

dealy663

Yeah, with PiHole being down, I've just been using IP addresses for all of my testing while troubleshooting. I have a secondary DNS on an old Synology NAS, which I'm now kind of depending on while TrueNAS is being so weird. I've noticed that the UI seems to work better in Chrome than Safari. But I know that I have bigger problems than browser oddities.
 

qmcb23YR

Dabbler
Joined
Mar 30, 2020
Messages
12
What version are you running? I had the same issue (I believe that was on TrueNAS-SCALE-22.02.2?) due to a memory leak caused by the web UI.

I fixed it by logging in via SSH and restarting the middleware service. I then set up a weekly cronjob to do that for me, and that solved it until the bug was fixed in a later version.
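
To see whether memory is actually running out before resorting to a restart, something like the following could be run from the console. It is a sketch only: the function name `mem_available_pct` is hypothetical, and it assumes a Linux `/proc/meminfo` with `MemTotal` and `MemAvailable` lines.

```shell
#!/bin/sh
# Print the percentage of RAM still available, read from a meminfo-style file.
# Defaults to /proc/meminfo; a different file can be passed for testing.
mem_available_pct() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "$meminfo"
}
```

If that number keeps shrinking over days, a leak is plausible. The restart itself would be `systemctl restart middlewared`, the command given earlier in the thread; as a weekly cron entry it might look like `0 3 * * 0 systemctl restart middlewared`, where the Sunday 03:00 schedule is purely an example and not the one I used.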
 