k3s not starting after upgrade

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
If you need any help with our Tools, please be sure to reach out to our support staff directly on Discord :)
I'd love to, if I could figure out how the hell Discord is supposed to work. I appear to already be logged in--I see my username and status at the lower-left corner of the window--but I "must claim and verify my account before I can send messages in the TrueCharts server." So I click the button and enter the email and password that are already registered, whereupon Discord tells me what I already know--that that email is already registered. No idea what I'm supposed to be doing here.

Edit: some progress; apparently Discord had created two accounts for me, one attached to my email address and one apparently not attached to anything but a browser cookie. Deleted the latter, logged in with the former, and it looks like I'm past that hurdle.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
there are very few changes that need to be done post upgrade.
Redoing users (and permissions, because those will necessarily change when the users change) is a pretty significant change. And, of course, the fact that our existing apps appear to have disappeared into never-never land, and the currently-leading suggestion to resolve this issue is to destroy the ix-applications dataset and start over, doesn't really count as "very few changes"--and while this obviously isn't the case for everyone, it's equally obvious that I'm not the only one affected.

In terms of status, "set applications pool back to the original and wait several hours" results in k3s not running. So does "unset pool, reboot, set pool back to original and wait several hours."

So yeah, I'm frustrated. I'm frustrated with iX, who have proven time and again that they're utterly incapable of shipping a major release without (what should have been) show-stopping bugs (I'm also frustrated with them, though far less so, for pulling complete 180s on longstanding, and long-defended, design decisions without the decency of even a mention in the release notes). And I'm frustrated with myself, as I should have known better than to trust a .0 release from them, of all outfits. And I know that's taking itself out on you more than it should.
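(For anyone following along at home: "k3s not running" means exactly that--the service never comes up. Easy enough to confirm from a root shell; the kubectl line is just a sanity check and will obviously fail if the service is down:)
Code:
# check whether the k3s service is currently active
systemctl is-active k3s
# if it claims to be active, see whether the API actually answers
k3s kubectl get nodes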
 

truecharts

Guru
Joined
Aug 19, 2021
Messages
788
Redoing users (and permissions, because those will necessarily change when the users change) is a pretty significant change. And, of course, the fact that our existing apps appear to have disappeared into never-never land, and the currently-leading suggestion to resolve this issue is to destroy the ix-applications dataset and start over, doesn't really count as "very few changes"--and while this obviously isn't the case for everyone, it's equally obvious that I'm not the only one affected.

In terms of status, "set applications pool back to the original and wait several hours" results in k3s not running. So does "unset pool, reboot, set pool back to original and wait several hours."

So yeah, I'm frustrated. I'm frustrated with iX, who have proven time and again that they're utterly incapable of shipping a major release without (what should have been) show-stopping bugs (I'm also frustrated with them, though far less so, for pulling complete 180s on longstanding, and long-defended, design decisions without the decency of even a mention in the release notes). And I'm frustrated with myself, as I should have known better than to trust a .0 release from them, of all outfits. And I know that's taking itself out on you more than it should.

We want to highlight that the bugs you're describing are incredibly niche.
We estimate that maybe 1% or fewer of users are reporting such issues.

Though we agree there have been more bugs than needed, and some features were rushed to meet a release date. They definitely should've stuck with the industry-standard Alpha, Beta, RC system (i.e. Beta == feature complete, RC == should be mostly bug free).

Deleting the ix-applications dataset is not needed when using TrueTool backups, as a restore includes a complete reinit of the k3s subsystems.
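For reference, the restore invocation is the same one used later in this thread; roughly something like this, run from wherever truetool.sh lives (how it prompts you to pick a backup may differ between versions):
Code:
# restore apps from an existing TrueTool backup; expect it to take a long time
./truetool.sh --restore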
 

Ryan Haver

Dabbler
Joined
Jul 6, 2013
Messages
23
Dealing with this now. I've never had apps installed on this TrueNAS SCALE server and ran into the issues outlined in this thread when attempting to deploy my first app today. I resolved some of it by setting the v4 gateway and interface in the Advanced Settings for Applications, but I am now seeing a loop of the failure in the screenshot below.

[attached screenshot: Screenshot 2022-12-21 164448.png]


Eventually, I end up getting the error others are seeing:
Failed to configure PV/PVCs support: Cannot connect to host 127.0.0.1:6443 ssl:default [Connect call failed ('127.0.0.1', 6443)]

Update:
After multiple reboots and a few rounds of removing the dataset used by Kubernetes...things are just working now...
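For anyone else hitting the 127.0.0.1:6443 error, the quick sanity check I'd suggest (just a root-shell check, nothing official) is to see whether anything is answering on that port at all:
Code:
# "connection refused" means the k3s API server isn't up; any HTTP response means it is listening
curl -k https://127.0.0.1:6443/version
# if it is listening, the node and pods should be queryable
k3s kubectl get nodes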
 

zyrex

Dabbler
Joined
Nov 20, 2019
Messages
17
OP here;
After just not messing with the apps for a while (since there's a bug report now), my k3s service has decided to just work...
I can now see my apps, but they have been stuck deploying since yesterday.

At least some progress?
 

João Fernandes

Dabbler
Joined
Dec 20, 2022
Messages
14
OP here;
After just not messing with the apps for a while (since there's a bug report now), my k3s service has decided to just work...
I can now see my apps, but they have been stuck deploying since yesterday.

At least some progress?
Are you able to connect to the cluster with Lens? I've been progressively identifying the multiple issues when deploying new apps, and Lens makes it really easy to detect those quickly. Yes, I could achieve the same through the command line, but I'm lazy.
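If it helps, the rough idea (assuming root shell access on the NAS; the thread I link a few posts down has the full walk-through) is to export the k3s kubeconfig and point it at the NAS address instead of loopback before importing it into Lens:
Code:
# dump the cluster kubeconfig (run as root on the NAS)
k3s kubectl config view --raw > kubeconfig.yaml
# the file points at 127.0.0.1:6443; swap in the NAS address before importing into Lens
# (192.168.1.50 is just a placeholder -- use your own NAS IP)
sed -i 's/127.0.0.1/192.168.1.50/' kubeconfig.yaml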
 

zyrex

Dabbler
Joined
Nov 20, 2019
Messages
17
I haven't heard of Lens, but I've never used anything Kubernetes-related besides these apps in SCALE.

Been 24 hours now, and my apps are still deploying.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Well, ./truetool.sh --restore has been running for the last 24 hours or so. No errors reported, and k3s is actually running, but when the tool says "this will take a LONG time," apparently it isn't kidding.
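About the only way I've found to tell whether it's still doing work rather than hung (just guesses at useful checks, nothing official):
Code:
# is the script still alive?
pgrep -af truetool
# is the pool seeing steady read/write activity from the restore?
zpool iostat 5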
 

João Fernandes

Dabbler
Joined
Dec 20, 2022
Messages
14
I thought so. Link it next time for others. :smile:
Sure: https://www.truenas.com/community/threads/connecting-to-kubernetes-cluster-with-lens.105060/ It's great when there are insights from the community on how to overcome some limitations.
Regarding the k3s cluster issues, I finally found my issue, which was related to the server time in the BIOS. Once that was fixed, everything else started working.

I'm well aware that SCALE is not yet ready for prime time and that I'll have to dig a bit further at this stage to overcome some limitations; Lens is a great tool for bringing visibility to possible deployment issues. Thank you once again for the original thread.
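For anyone else chasing this: the clock matters because the Kubernetes certificates are time-bound, so a skewed BIOS clock can leave the API server rejecting connections. A couple of checks from the shell (just what I'd look at, adjust as needed):
Code:
# confirm the system clock and NTP sync look sane
timedatectl
# compare against the validity window of the k3s API server certificate
echo | openssl s_client -connect 127.0.0.1:6443 2>/dev/null | openssl x509 -noout -dates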
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
OK, I'm back up and running, with a good deal of assistance through a ticket on the TrueCharts Discord. One important thing I'd overlooked in @truecharts' earlier post is that the backups are only helpful once you revert--you can't restore an Angelfish backup on Bluefin. But what seems to have done it is to revert to Angelfish (including setting that as the default boot environment), unset the pool, reboot, then set the pool again.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Well, let's see. I'd reverted to Angelfish, gotten my apps back, taken a fresh backup, and re-run the upgrade to Bluefin, just on the chance that the problem was caused by my messing with things before the (apparently lengthy, silent, and fragile) migration process finished. After two hours of uptime where I didn't touch anything other than to start SSH (which for some reason didn't start on boot despite being configured to do so), k3s wasn't running, and the system was showing the same "Applications are not running" screen I posted an image of up-thread.

I shouldn't have any apps that would be affected by the host path validation issue, but I've gone ahead and unchecked that box, despite the scary warning telling me it will eat my data. Ten minutes later, k3s appears to be in a restart loop:
Code:
root@freenas2[~]# systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/lib/systemd/system/k3s.service; disabled; vendor preset: disabled)
     Active: activating (start) since Fri 2022-12-23 15:32:13 EST; 6s ago
       Docs: https://k3s.io
    Process: 1701296 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 1701298 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 1701303 (k3s-server)
      Tasks: 130
     Memory: 541.8M
        CPU: 12.206s
     CGroup: /system.slice/k3s.service
             └─1701303 /usr/local/bin/k3s server

Dec 23 15:32:19 freenas2 k3s[1701303]: I1223 15:32:19.196432 1701303 trace.go:205] Trace[86147235]: "List(recursive=true) etcd3" au>
Dec 23 15:32:19 freenas2 k3s[1701303]: Trace[86147235]: [840.301406ms] [840.301406ms] END
Dec 23 15:32:19 freenas2 k3s[1701303]: I1223 15:32:19.197611 1701303 trace.go:205] Trace[1989960934]: "List(recursive=true) etcd3" >
Dec 23 15:32:19 freenas2 k3s[1701303]: Trace[1989960934]: [855.897411ms] [855.897411ms] END
Dec 23 15:32:19 freenas2 k3s[1701303]: I1223 15:32:19.200446 1701303 trace.go:205] Trace[1282533019]: "List(recursive=true) etcd3" >
Dec 23 15:32:19 freenas2 k3s[1701303]: Trace[1282533019]: [833.652061ms] [833.652061ms] END
Dec 23 15:32:19 freenas2 k3s[1701303]: I1223 15:32:19.208926 1701303 trace.go:205] Trace[248270492]: "List(recursive=true) etcd3" a>
Dec 23 15:32:19 freenas2 k3s[1701303]: Trace[248270492]: [861.43907ms] [861.43907ms] END
Dec 23 15:32:19 freenas2 k3s[1701303]: I1223 15:32:19.372255 1701303 trace.go:205] Trace[1608905525]: "List(recursive=true) etcd3" >
Dec 23 15:32:19 freenas2 k3s[1701303]: Trace[1608905525]: [1.167433119s] [1.167433119s] END

...and a moment later,
Code:
root@freenas2[~]# systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/lib/systemd/system/k3s.service; disabled; vendor preset: disabled)
     Active: activating (auto-restart) (Result: exit-code) since Fri 2022-12-23 15:33:51 EST; 4s ago
       Docs: https://k3s.io
    Process: 1722552 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 1722555 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 1722567 ExecStart=/usr/local/bin/k3s server --flannel-backend=none --disable=traefik,metrics-server,local-storage --di>
   Main PID: 1722567 (code=exited, status=1/FAILURE)
        CPU: 27.457s

...and the error appears to be this, from k3s_daemon.log: Dec 23 15:36:39 freenas2 k3s[1761888]: E1223 15:36:39.918072 1761888 kubelet.go:1397] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"
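(For anyone trying to catch that error themselves: following the unit log beats hoping systemctl status happens to show the fatal line. Assuming the daemon log quoted above lives under /var/log:)
Code:
# follow the k3s unit while it cycles and watch for the fatal error
journalctl -u k3s -f
# or pull recent failures out of the daemon log after the fact
grep -iE "fatal|failed to start" /var/log/k3s_daemon.log | tail -n 20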
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I assume you are running on Angelfish again
No, I'm back on Bluefin, though will probably permanently chuck 22.12.0 in the bin if we can't come up with some explanation for what's going on in the next day or so. The suggestion had been made that I just didn't give whatever migration Bluefin does to the apps enough time to complete before I started changing things (in order to fix the apparent problem, because nothing in the UI actually tells the user that there's a migration going on and not to interrupt it), but if it isn't complete after two hours I don't think it's going to complete. I disabled host path validation a little over a half-hour ago, and k3s is still on the same restart loop I posted 20 minutes ago.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
No, I'm back on Bluefin, though will probably permanently chuck 22.12.0 in the bin if we can't come up with some explanation for what's going on in the next day or so. The suggestion had been made that I just didn't give whatever migration Bluefin does to the apps enough time to complete before I started changing things (in order to fix the apparent problem, because nothing in the UI actually tells the user that there's a migration going on and not to interrupt it), but if it isn't complete after two hours I don't think it's going to complete. I disabled host path validation a little over a half-hour ago, and k3s is still on the same restart loop I posted 20 minutes ago.

As indicated in the other thread... it seems like a config issue or a timing issue that exposes some flaw. Any help from you or anyone else on the thread who can recreate the issue would be appreciated.

App migration probably includes App updates... so it will depend on the number and size of the Apps and the network speed. Each App could take tens of minutes.

I would agree that we need some UI element to provide status. Would you like to make the suggestion? In the meantime, perhaps checking with network stats would work.
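Until there's a UI indicator, a rough way to tell whether the migration is still doing something (a sketch of shell checks, not an official procedure):
Code:
# pods stuck in Pending/ContainerCreating usually mean images are still being pulled
k3s kubectl get pods -A
# recent cluster events, newest last, show image pulls and mount errors
k3s kubectl get events -A --sort-by=.lastTimestamp | tail -n 20
# images that have already finished downloading
k3s crictl images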
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I should have stopped when I had the apps back up and running under Angelfish...

So my apps have been down since I tried to upgrade to Bluefin on Monday, and the apps are the only reason I migrated to SCALE in the first place (aside from my continuing and growing suspicion, which iX continues to deny, that they'll be dropping CORE entirely before too long). Bluefin may be the greatest thing ever, or it may be a giant steaming pile--but whichever it is (or wherever in between it is, which I suppose is more likely), it clearly wasn't written to cope with my apparently-very-demanding (really?) use case. So, since this worked yesterday morning, I figured I'd revert back to Angelfish, unset pool/reboot/choose pool, and the apps would be back.

But no, of course it isn't that simple. Unset pool and reboot worked just fine. After the system comes back up, I go to Apps, and it immediately prompts me to choose a pool. I do, a progress bar runs for about 8 seconds, and then I get Error: [EFAULT] Docker service is not running:
[attached screenshot: 1671885472951.png]

This is a new error to me--I haven't seen it previously while working on this issue.

When I close that, go to Settings -> Choose Pool, and select the same pool again, it returns "Success" within about one second:
[attached screenshot: 1671885542361.png]

But apps are not running:
[attached screenshot: 1671885745904.png]

k3s is dead:
Code:
root@freenas2[~]# systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/lib/systemd/system/k3s.service; disabled; vendor preset: disabled)
     Active: inactive (dead)
       Docs: https://k3s.io

...and running systemctl restart k3s first returns failure, and then it seems to put k3s into the same restart loop we've seen before:
Code:
root@freenas2[~]# systemctl restart k3s
Job for k3s.service failed because the control process exited with error code.
See "systemctl status k3s.service" and "journalctl -xe" for details.
root@freenas2[~]# systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/lib/systemd/system/k3s.service; disabled; vendor preset: disabled)
     Active: activating (start) since Sat 2022-12-24 07:43:40 EST; 5s ago
       Docs: https://k3s.io
    Process: 231874 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 231875 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 231876 (k3s-server)
      Tasks: 130
     Memory: 800.5M
     CGroup: /system.slice/k3s.service
             └─231876 /usr/local/bin/k3s server

Dec 24 07:43:43 freenas2 k3s[231876]: I1224 07:43:43.340630  231876 shared_informer.go:247] Caches are synced for crd-autoregist>
Dec 24 07:43:43 freenas2 k3s[231876]: I1224 07:43:43.341401  231876 shared_informer.go:247] Caches are synced for cluster_authen>
Dec 24 07:43:43 freenas2 k3s[231876]: W1224 07:43:43.748066  231876 lease.go:233] Resetting endpoints for master service "kubern>
Dec 24 07:43:43 freenas2 k3s[231876]: I1224 07:43:43.751233  231876 controller.go:611] quota admission added evaluator for: endp>
Dec 24 07:43:43 freenas2 k3s[231876]: I1224 07:43:43.927488  231876 controller.go:611] quota admission added evaluator for: endp>
Dec 24 07:43:44 freenas2 k3s[231876]: I1224 07:43:44.238952  231876 controller.go:132] OpenAPI AggregationController: action for>
Dec 24 07:43:44 freenas2 k3s[231876]: I1224 07:43:44.238982  231876 controller.go:132] OpenAPI AggregationController: action for>
Dec 24 07:43:44 freenas2 k3s[231876]: W1224 07:43:44.368221  231876 lease.go:233] Resetting endpoints for master service "kubern>
Dec 24 07:43:44 freenas2 k3s[231876]: I1224 07:43:44.368963  231876 storage_scheduling.go:109] all system priority classes are c>
Dec 24 07:43:44 freenas2 k3s[231876]: E1224 07:43:44.811559  231876 controller.go:159] Found stale data, removed previous endpoi>
root@freenas2[~]# systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/lib/systemd/system/k3s.service; disabled; vendor preset: disabled)
     Active: activating (start) since Sat 2022-12-24 07:43:56 EST; 9s ago
       Docs: https://k3s.io
    Process: 235461 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 235462 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 235463 (k3s-server)
      Tasks: 137
     Memory: 844.7M
     CGroup: /system.slice/k3s.service
             └─235463 /usr/local/bin/k3s server

Dec 24 07:44:01 freenas2 k3s[235463]: I1224 07:44:01.308678  235463 storage_scheduling.go:109] all system priority classes are c>
Dec 24 07:44:01 freenas2 k3s[235463]: W1224 07:44:01.308806  235463 lease.go:233] Resetting endpoints for master service "kubern>
Dec 24 07:44:01 freenas2 k3s[235463]: E1224 07:44:01.754728  235463 controller.go:159] Found stale data, removed previous endpoi>
Dec 24 07:44:03 freenas2 k3s[235463]: time="2022-12-24T07:44:03-05:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1>
Dec 24 07:44:03 freenas2 k3s[235463]: time="2022-12-24T07:44:03-05:00" level=info msg="Handling backend connection request [ix-t>
Dec 24 07:44:03 freenas2 k3s[235463]: time="2022-12-24T07:44:03-05:00" level=info msg="Running kubelet --address=0.0.0.0 --anony>
Dec 24 07:44:03 freenas2 k3s[235463]: Flag --network-plugin has been deprecated, will be removed along with dockershim.
Dec 24 07:44:03 freenas2 k3s[235463]: time="2022-12-24T07:44:03-05:00" level=info msg="Waiting to retrieve kube-proxy configurat>
Dec 24 07:44:03 freenas2 k3s[235463]: I1224 07:44:03.088788  235463 server.go:442] "Kubelet version" kubeletVersion="v1.23.5+k3s>
Dec 24 07:44:03 freenas2 k3s[235463]: I1224 07:44:03.091153  235463 dynamic_cafile_content.go:156] "Starting controller" name="c>


Multiple unset pool/reboot/choose pool cycles have given the same results each time.
 

truecharts

Guru
Joined
Aug 19, 2021
Messages
788
OK, I'm back up and running, with a good deal of assistance through a ticket on the TrueCharts Discord. One important thing I'd overlooked in @truecharts' earlier post is that the backups are only helpful once you revert--you can't restore an Angelfish backup on Bluefin. But what seems to have done it is to revert to Angelfish (including setting that as the default boot environment), unset the pool, reboot, then set the pool again.
A warning in the script is coming within a month or so...
 