k3s not starting after upgrade

danb35 · Dec 21, 2022

truecharts said:
If you need any help with our Tools, please be sure to reach out to our support staff directly on Discord :)

I'd love to, if I could figure out how the hell Discord is supposed to work. I appear to already be logged in--I see my username and status at the lower-left corner of the window--but I "must claim and verify my account before I can send messages in the TrueCharts server." So I click the button and enter the email and password that are already registered, whereupon Discord tells me what I already know--that that email is already registered. ~~No idea what I'm supposed to be doing here.~~ Edit: some progress; apparently Discord had created two accounts for me, one attached to my email address and one apparently not attached to anything but a browser cookie. Deleted the latter, logged in with the former, and it looks like I'm past that hurdle.

danb35 · Dec 21, 2022

Daisuke said:
there very few changes that are required to be done, post upgrade.

Redoing users (and permissions, because those will necessarily change when the users change) is a pretty significant change. And, of course, the fact that our existing apps appear to have disappeared into never-never land, and the currently-leading suggestion to resolve this issue is to destroy the ix-applications dataset and start over, doesn't really count as "very few changes"--and while this obviously isn't the case for everyone, it's equally obvious that I'm not the only one affected.

In terms of status, "set applications pool back to the original and wait several hours" results in k3s not running. So does "unset pool, reboot, set pool back to original and wait several hours."

So yeah, I'm frustrated. I'm frustrated with iX, who have proven time and again that they're utterly incapable of shipping a major release without (what should have been) show-stopping bugs (I'm also frustrated with them, though far less so, for pulling complete 180s on longstanding, and long-defended, design decisions without the decency of even a mention in the release notes). And I'm frustrated with myself, as I should have known better than to trust a .0 release from them, of all outfits. And I know that's taking itself out on you more than it should.

truecharts · Dec 21, 2022

danb35 said:
Redoing users (and permissions, because those will necessarily change when the users change) is a pretty significant change. And, of course, the fact that our existing apps appear to have disappeared into never-never land, and the currently-leading suggestion to resolve this issue is to destroy the ix-applications dataset and start over, doesn't really count as "very few changes"--and while this obviously isn't the case for everyone, it's equally obvious that I'm not the only one affected.

In terms of status, "set applications pool back to the original and wait several hours" results in k3s not running. So does "unset pool, reboot, set pool back to original and wait several hours."

So yeah, I'm frustrated. I'm frustrated with iX, who have proven time and again that they're utterly incapable of shipping a major release without (what should have been) show-stopping bugs (I'm also frustrated with them, though far less so, for pulling complete 180s on longstanding, and long-defended, design decisions without the decency of even a mention in the release notes). And I'm frustrated with myself, as I should have known better than to trust a .0 release from them, of all outfits. And I know that's taking itself out on you more than it should.

We want to highlight that the bugs you're describing are incredibly niche.
We estimate about maybe 1% or less of users reporting such issues.

Though we agree there have been more bugs than needed and some features where rushed to meet a release date. They definately should've sticked with industry-standard Alpha, Beta, RC systems (aka: Beta == feature complete, RC == should be mostly bug free)

Deleting the ix-applications dataset is not needed when using TrueTool backups, as restore includes complete reinit of the k3s subsystems.

Ryan Haver · Dec 21, 2022

Currently dealing with this now. I've never had apps installed on this TrueNAS Scale server and ran into the issues outlined in this thread when attempting to deploy my first app today. I resolved some of it by settings the v4 gateway and interface in the advanced settings for Applications, but I am now seeing a loop of the failure in the screenshot below.

Eventually, I end up getting the error others are seeing:
Failed to configure PV/PVCs support: Cannot connect to host 127.0.0.1:6443 ssl:default [Connect call failed ('127.0.0.1', 6443)]

Update: After multiple reboots and a few rounds of removing the dataset used by Kubernetes...things are just working now...

zyrex · Dec 21, 2022

OP here;
After just not messing with the apps for a while, since there's a bug report now, my k3s service has decided to just work..
I can now see my apps, but they are stuck at deploying since yesterday.

At least some progress?

João Fernandes · Dec 22, 2022

zyrex said:
OP here;
After just not messing with the apps for a while, since there's a bug report now, my k3s service has decided to just work..
I can now see my apps, but they are stuck at deploying since yesterday.

At least some progress?

Are you able to connect with LENS to the cluster? I've been progressively identified the multiple issues when deploying new apps and lens make it really easy to fast detect those. Yes I could achieve the same through command line but I'm lazy.

zyrex · Dec 22, 2022

I haven't heard of Lens, but I've never used anything kubernetes besides these apps in Scale.

Been 24 hours now, and my apps are still deploying.

Daisuke · Dec 23, 2022

João Fernandes said:
Are you able to connect with LENS to the cluster?

You modify iptables, right?

João Fernandes · Dec 23, 2022

Daisuke said:
You modify iptables, right?

I had to, otherwise I wouldn't be able to connect to the cluster remotely. Used your post as a reference on how to achieve it :)

Daisuke · Dec 23, 2022

João Fernandes said:
I had to, otherwise I wouldn't be able to connect to the cluster remotely. Used your post as a reference on how to achieve it :)

I thought so. Link it next time for others.

danb35 · Dec 23, 2022

Well, ./truetool.sh --restore has been running for the last 24 hours or so. No errors reported, and k3s is actually running, but when the tool says "this will take a LONG time," apparently it isn't kidding.

João Fernandes · Dec 23, 2022

Daisuke said:
I thought so. Link it next time for others.

sure, https://www.truenas.com/community/threads/connecting-to-kubernetes-cluster-with-lens.105060/ it's great when there is insights from the community on how to overcome some limitations.
Regarding the K3S cluster issues, I finally found out my issue which was related to server time in BIOS. Once that fixed, everything else started working.

I'm well aware that scale is not yet prime time and I'll have to dig a bit further at this stage in order to overcome some limitations and LENS is a great tool to bring visibility of possible deployment issues. Thank you once again for the original thread.

danb35 · Dec 23, 2022

OK, I'm back up and running, with a good deal of assistance through a ticket on the TrueCharts Discord. One important thing I'd overlooked in @truecharts' earlier post is that the backups are helpful once you revert--you can't restore an Angelfish backup on Bluefin. But what seems to have done it is to revert to Angelfish (including setting that as the default boot environment), unset the pool, reboot, then set the pool again.

morganL · Dec 23, 2022

zyrex said:
View attachment 61210

I have tried changing to my local ip for Node IP, however that didn't change much in terms of working :)

Are the host path safety checks causing the issue?
Please disable them and see if it resolves the issue.

danb35 · Dec 23, 2022

Well, let's see. I'd reverted to Angelfin, gotten my apps back, taken a fresh backup, and re-ran the upgrade to Bluefin, just on the chance that the problem was caused by my messing with things before the (apparently lengthy, silent, and fragile) migration process finished. After two hours of uptime where I didn't touch anything other than to start SSH (which for some reason didn't start on boot despite being configured to do so), k3s wasn't running, and the system was showing the same "Applications are not running" screen I posted an image of up-thread.

I shouldn't have any apps that would be affected by the host path validation issue, but I've gone ahead and unchecked that box, despite the scary warning telling me it will eat my data. Ten minutes later, k3s appears to be in a restart loop:

Code:

root@freenas2[~]# systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/lib/systemd/system/k3s.service; disabled; vendor preset: disabled)
     Active: activating (start) since Fri 2022-12-23 15:32:13 EST; 6s ago
       Docs: https://k3s.io
    Process: 1701296 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 1701298 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 1701303 (k3s-server)
      Tasks: 130
     Memory: 541.8M
        CPU: 12.206s
     CGroup: /system.slice/k3s.service
             └─1701303 /usr/local/bin/k3s server

Dec 23 15:32:19 freenas2 k3s[1701303]: I1223 15:32:19.196432 1701303 trace.go:205] Trace[86147235]: "List(recursive=true) etcd3" au>
Dec 23 15:32:19 freenas2 k3s[1701303]: Trace[86147235]: [840.301406ms] [840.301406ms] END
Dec 23 15:32:19 freenas2 k3s[1701303]: I1223 15:32:19.197611 1701303 trace.go:205] Trace[1989960934]: "List(recursive=true) etcd3" >
Dec 23 15:32:19 freenas2 k3s[1701303]: Trace[1989960934]: [855.897411ms] [855.897411ms] END
Dec 23 15:32:19 freenas2 k3s[1701303]: I1223 15:32:19.200446 1701303 trace.go:205] Trace[1282533019]: "List(recursive=true) etcd3" >
Dec 23 15:32:19 freenas2 k3s[1701303]: Trace[1282533019]: [833.652061ms] [833.652061ms] END
Dec 23 15:32:19 freenas2 k3s[1701303]: I1223 15:32:19.208926 1701303 trace.go:205] Trace[248270492]: "List(recursive=true) etcd3" a>
Dec 23 15:32:19 freenas2 k3s[1701303]: Trace[248270492]: [861.43907ms] [861.43907ms] END
Dec 23 15:32:19 freenas2 k3s[1701303]: I1223 15:32:19.372255 1701303 trace.go:205] Trace[1608905525]: "List(recursive=true) etcd3" >
Dec 23 15:32:19 freenas2 k3s[1701303]: Trace[1608905525]: [1.167433119s] [1.167433119s] END

...and a moment later,

Code:

root@freenas2[~]# systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/lib/systemd/system/k3s.service; disabled; vendor preset: disabled)
     Active: activating (auto-restart) (Result: exit-code) since Fri 2022-12-23 15:33:51 EST; 4s ago
       Docs: https://k3s.io
    Process: 1722552 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 1722555 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 1722567 ExecStart=/usr/local/bin/k3s server --flannel-backend=none --disable=traefik,metrics-server,local-storage --di>
   Main PID: 1722567 (code=exited, status=1/FAILURE)
        CPU: 27.457s

...and the error appears to be this, from k3s_daemon.log:

Dec 23 15:36:39 freenas2 k3s[1761888]: E1223 15:36:39.918072 1761888 kubelet.go:1397] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"

danb35 · Dec 23, 2022

morganL said:
I assume you are running on Angelfish again

No, I'm back on Bluefin, though will probably permanently chuck 22.12.0 in the bin if we can't come up with some explanation for what's going on in the next day or so. The suggestion had been made that I just didn't give whatever migration Bluefin does to the apps enough time to complete before I started changing things (in order to fix the apparent problem, because nothing in the UI actually tells the user that there's a migration going on and not to interrupt it), but if it isn't complete after two hours I don't think it's going to complete. I disabled host path validation a little over a half-hour ago, and k3s is still on the same restart loop I posted 20 minutes ago.

morganL · Dec 23, 2022

danb35 said:
No, I'm back on Bluefin, though will probably permanently chuck 22.12.0 in the bin if we can't come up with some explanation for what's going on in the next day or so. The suggestion had been made that I just didn't give whatever migration Bluefin does to the apps enough time to complete before I started changing things (in order to fix the apparent problem, because nothing in the UI actually tells the user that there's a migration going on and not to interrupt it), but if it isn't complete after two hours I don't think it's going to complete. I disabled host path validation a little over a half-hour ago, and k3s is still on the same restart loop I posted 20 minutes ago.

As indicated in the other thread... it seems like a config issue or a timing issue that exposes some flaw. Any help you or anyone else on the thread that can recreate the issue would be appreciated.

App migration probably includes App updates.... so it will be dependent on the number and size of the Apps and the network speed. Each App could take tens of minutes.

I would agree that we need some UI element to provide status. Would you like to make the suggestion. In the meantime, perhaps checking with network stats would work.

danb35 · Dec 24, 2022

morganL said:
I would agree that we need some UI element to provide status. Would you like to make the suggestion.

[NAS-119563] - iXsystems TrueNAS Jira

ixsystems.atlassian.net

danb35 · Dec 24, 2022

I should have stopped when I had the apps back up and running under Angelfish...

So my apps have been down since I tried to upgrade to Bluefin on Monday, and the apps are the only reason I migrated to SCALE in the first place (aside from my continuing and growing suspicion, which iX continues to deny, that they'll be dropping CORE entirely before too long). Bluefin may be the greatest thing ever, or it may be a giant steaming pile--but whichever it is (or wherever in between it is, which I suppose is more likely), it clearly wasn't written to cope with my apparently-very-demanding (really?) use case. So, since this worked yesterday morning, I figured I'd revert back to Angelfish, unset pool/reboot/choose pool, and the apps would be back.

But no, of course it isn't that simple. Unset pool and reboot worked just fine. After the system comes back up, go to Apps, and it immediately prompts me to choose a pool. I do, a progress bar runs for about 8 seconds, and then I get Error: [EFAULT] Docker service is not running:

This is a new error to me--I haven't seen it previously while working on this issue.

When I close that, go to Settings -> Choose Pool, and select the same pool again, it returns "Success" within about one second:

But apps are not running:

k3s is dead:

Code:

root@freenas2[~]# systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/lib/systemd/system/k3s.service; disabled; vendor preset: disabled)
     Active: inactive (dead)
       Docs: https://k3s.io

...and running systemctl restart k3s first returns failure, and then it seems to put k3s into the same restart loop we've seen before:

Code:

root@freenas2[~]# systemctl restart k3s
Job for k3s.service failed because the control process exited with error code.
See "systemctl status k3s.service" and "journalctl -xe" for details.
root@freenas2[~]# systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/lib/systemd/system/k3s.service; disabled; vendor preset: disabled)
     Active: activating (start) since Sat 2022-12-24 07:43:40 EST; 5s ago
       Docs: https://k3s.io
    Process: 231874 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 231875 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 231876 (k3s-server)
      Tasks: 130
     Memory: 800.5M
     CGroup: /system.slice/k3s.service
             └─231876 /usr/local/bin/k3s server

Dec 24 07:43:43 freenas2 k3s[231876]: I1224 07:43:43.340630  231876 shared_informer.go:247] Caches are synced for crd-autoregist>
Dec 24 07:43:43 freenas2 k3s[231876]: I1224 07:43:43.341401  231876 shared_informer.go:247] Caches are synced for cluster_authen>
Dec 24 07:43:43 freenas2 k3s[231876]: W1224 07:43:43.748066  231876 lease.go:233] Resetting endpoints for master service "kubern>
Dec 24 07:43:43 freenas2 k3s[231876]: I1224 07:43:43.751233  231876 controller.go:611] quota admission added evaluator for: endp>
Dec 24 07:43:43 freenas2 k3s[231876]: I1224 07:43:43.927488  231876 controller.go:611] quota admission added evaluator for: endp>
Dec 24 07:43:44 freenas2 k3s[231876]: I1224 07:43:44.238952  231876 controller.go:132] OpenAPI AggregationController: action for>
Dec 24 07:43:44 freenas2 k3s[231876]: I1224 07:43:44.238982  231876 controller.go:132] OpenAPI AggregationController: action for>
Dec 24 07:43:44 freenas2 k3s[231876]: W1224 07:43:44.368221  231876 lease.go:233] Resetting endpoints for master service "kubern>
Dec 24 07:43:44 freenas2 k3s[231876]: I1224 07:43:44.368963  231876 storage_scheduling.go:109] all system priority classes are c>
Dec 24 07:43:44 freenas2 k3s[231876]: E1224 07:43:44.811559  231876 controller.go:159] Found stale data, removed previous endpoi>
root@freenas2[~]# systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/lib/systemd/system/k3s.service; disabled; vendor preset: disabled)
     Active: activating (start) since Sat 2022-12-24 07:43:56 EST; 9s ago
       Docs: https://k3s.io
    Process: 235461 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 235462 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 235463 (k3s-server)
      Tasks: 137
     Memory: 844.7M
     CGroup: /system.slice/k3s.service
             └─235463 /usr/local/bin/k3s server

Dec 24 07:44:01 freenas2 k3s[235463]: I1224 07:44:01.308678  235463 storage_scheduling.go:109] all system priority classes are c>
Dec 24 07:44:01 freenas2 k3s[235463]: W1224 07:44:01.308806  235463 lease.go:233] Resetting endpoints for master service "kubern>
Dec 24 07:44:01 freenas2 k3s[235463]: E1224 07:44:01.754728  235463 controller.go:159] Found stale data, removed previous endpoi>
Dec 24 07:44:03 freenas2 k3s[235463]: time="2022-12-24T07:44:03-05:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1>
Dec 24 07:44:03 freenas2 k3s[235463]: time="2022-12-24T07:44:03-05:00" level=info msg="Handling backend connection request [ix-t>
Dec 24 07:44:03 freenas2 k3s[235463]: time="2022-12-24T07:44:03-05:00" level=info msg="Running kubelet --address=0.0.0.0 --anony>
Dec 24 07:44:03 freenas2 k3s[235463]: Flag --network-plugin has been deprecated, will be removed along with dockershim.
Dec 24 07:44:03 freenas2 k3s[235463]: time="2022-12-24T07:44:03-05:00" level=info msg="Waiting to retrieve kube-proxy configurat>
Dec 24 07:44:03 freenas2 k3s[235463]: I1224 07:44:03.088788  235463 server.go:442] "Kubelet version" kubeletVersion="v1.23.5+k3s>
Dec 24 07:44:03 freenas2 k3s[235463]: I1224 07:44:03.091153  235463 dynamic_cafile_content.go:156] "Starting controller" name="c>

Multiple unset pool/reboot/choose pool cycles have given the same results each time.

truecharts · Dec 24, 2022

danb35 said:
OK, I'm back up and running, with a good deal of assistance through a ticket on the TrueCharts Discord. One important thing I'd overlooked in @truecharts' earlier post is that the backups are helpful once you revert--you can't restore an Angelfish backup on Bluefin. But what seems to have done it is to revert to Angelfish (including setting that as the default boot environment), unset the pool, reboot, then set the pool again.

Warning in the script is incoming within a month or so...

Important Announcement for the TrueNAS Community.

k3s not starting after upgrade

Hall of Famer

Hall of Famer

Guru

Dabbler

Dabbler

Dabbler

Dabbler

Contributor

Dabbler

Contributor

Hall of Famer

Dabbler

Hall of Famer

Captain Morgan

Hall of Famer

Hall of Famer

Captain Morgan

Hall of Famer

Hall of Famer

Guru

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "k3s not starting after upgrade"

Similar threads