k3s not starting after upgrade

morganL · Jan 2, 2023

danb35 said:
It's hardly a tangent when you say:

and:

and:

(to quote only from your last post to me) ...all of which indicate that "users" (clearly including myself) just ought to know that there's a massive change in good operating practices with SCALE--which would have to mean that it's documented somewhere we should have known to look, because after all the only intuitive user interface is the nipple. It also ignores that the only actionable item you mention is to check the release notes before you upgrade, which is certainly good practice, but you know as well as I do that there's nothing in there about "upgrading will destroy your applications to the point that you need to do a clean install and rebuild from scratch." So, however rare my particular issue may be, there's nothing in the release notes that addresses it. If you want to toss out those sweeping assertions without being challenged on them or defending them, I guess that's your right, but it isn't really useful to any sort of conversation.

Oh, and contrary to:

iX say that it's ready for "general use", which they define as "Field tested software with mature features. Few issues are expected."

Software Status - TrueNAS Roadmap - Open Source NAS Software

Get up-to-date insight on the TrueNAS roadmap. Explore the TrueNAS project timeline and stay informed with the latest updates.

www.truenas.com

I'd equate that to "production", at least through the SMB level (though the word "production" doesn't appear anywhere on that page).

I realize I'm pushing back pretty aggressively. That's because what I'm hearing from you is "you should have known better," without your having the common courtesy to specify what I should have known better (other that what I've already admitted: that iX have a long and ignoble history of shipping releases with show-stopping bugs, and therefore not to trust a .0 release) or on what basis; with the further implication that whatever trouble I'm having is my own fault.

On the software status page... we don't use the word "production" because it has too broad a definition. Production at home is different from production in a bank.

The definitions we chose were:

Tester
Early adopter
General
Conservative
Mission Critical.

Its very clear Bluefin is ready for Tester and Early Adopter.

"General" was a toss-up between Angelfish 22.02.4 and Bluefin 22.12.0 - we chose Bluefin because it had many bug fixes and overlayfs support compared with angelfish. No major issues were reported in 1st 10 days. We preferred new users to start there since that is where any bug fixes will go. However we expect a few issues. This App upgrade issue has surprised us and has still not been identified (nor do we know how many systems out of 10,000 have been impacted)

Explicitly for SCALE.. we have not declared it ready for Conservative or Mission Critical.. we expect to get there with 22.12.2. Our experience has been that we need tens of thousands of users for months to get to that quality level. TrueNAS CORE 13.0-U3.1 has reached that.

jsherm101 · Jan 3, 2023

morganL said:
On the software status page... we don't use the word "production" because it has too broad a definition. Production at home is different from production in a bank.

The definitions we chose were:

Tester
Early adopter
General
Conservative
Mission Critical.

Its very clear Bluefin is ready for Tester and Early Adopter.

"General" was a toss-up between Angelfish 22.02.4 and Bluefin 22.12.0 - we chose Bluefin because it had many bug fixes and overlayfs support compared with angelfish. No major issues were reported in 1st 10 days. We preferred new users to start there since that is where any bug fixes will go. However we expect a few issues. This App upgrade issue has surprised us and has still not been identified (nor do we know how many systems out of 10,000 have been impacted)

Thanks for the perspective -- FWIW from my product development perspective the tossup ignores the upgrade path audience but also different user types (i.e. new users with complex chart configs) . There are a lot more changes to Bluefin vs. Angelfish than what I expected for k3s. My upgrade broke my apps and I called it a total loss and started over with Bluefin. And yet I too am experiencing a random k3s error and it won't start with no real way to debug the cause, so seems to be something at the IX level causing it (along with UPS service issues).

In the future I'd argue even if SCALE Is so much more complex, the number of changes to the environment seem to imply an "upgrade" user may be safer with Angelfish until the Feb. release, but I'm also curious why someone starting fresh like me is also hitting k3s issues with complex chart setups -- it seems to imply that even "general" "power users" for many scale charts might be better off with angelfish too or at least better viz on how to track the bugs (i.e. do I sit and wait on 22.12.0 or wait for 23.2.1?) The most frustrating bit isn't just being blindsided by the system level issues, but also the difficulty in knowing the best path forward to resolve it. There are even issues like viewing/accessing shells with admin users that I think "general" audience users would definitely get confused by if they didn't read the fine print to switch to a root user.

Another issues is that it's hard to get insight into how meta-tickets (i.e. bundles of tickets with similar issues) are being triaged and managed. Understandably individual tickets contain lots of private logs/data but then when I'm trying to find a solution based on a given error it's very hard to find the best solution to try between discord, forums, Jira, etc. because of this. This is something IX may want to consider given how many layers there are and while there are many people willing to help, we can greatly benefit from the value of something like a knowledgebase when dealing with big bugs and common errors

morganL · Jan 3, 2023

jsherm101 said:
Thanks for the perspective -- FWIW from my product development perspective the tossup ignores the upgrade path audience but also different user types (i.e. new users with complex chart configs) . There are a lot more changes to Bluefin vs. Angelfish than what I expected for k3s. My upgrade broke my apps and I called it a total loss and started over with Bluefin. And yet I too am experiencing a random k3s error and it won't start with no real way to debug the cause, so seems to be something at the IX level causing it (along with UPS service issues).

In the future I'd argue even if SCALE Is so much more complex, the number of changes to the environment seem to imply an "upgrade" user may be safer with Angelfish until the Feb. release, but I'm also curious why someone starting fresh like me is also hitting k3s issues with complex chart setups -- it seems to imply that even "general" "power users" for many scale charts might be better off with angelfish too or at least better viz on how to track the bugs (i.e. do I sit and wait on 22.12.0 or wait for 23.2.1?) The most frustrating bit isn't just being blindsided by the system level issues, but also the difficulty in knowing the best path forward to resolve it. There are even issues like viewing/accessing shells with admin users that I think "general" audience users would definitely get confused by if they didn't read the fine print to switch to a root user.

Another issues is that it's hard to get insight into how meta-tickets (i.e. bundles of tickets with similar issues) are being triaged and managed. Understandably individual tickets contain lots of private logs/data but then when I'm trying to find a solution based on a given error it's very hard to find the best solution to try between discord, forums, Jira, etc. because of this. This is something IX may want to consider given how many layers there are and while there are many people willing to help, we can greatly benefit from the value of something like a knowledgebase when dealing with big bugs and common errors

Thanks for the note and suggestions.

In addition of the Software status page, there is a general software lifecycle page where we try to highlight the maturation process that each release goes through. This captures some of the sentiment. We recommend at release that "less complex deployments".

Software Development Life Cycle

Notice about the typical development timeframe and end of life expectations for TrueNAS major versions.

www.truenas.com

I'm not sure whether Angelfish would have been better in your case. We may have resolved some bugs but added some more. Swings and roundabouts.

I agree with your assessment of the upgrade issues.... these are the hardest to test, given diversity or hardware and configuration. Its the user testing that uncovers issues. We didn't see the issues in BETA and RC1... but perhaps too few upgrades vs new systems.

We don't have anything between user tickets and these community threads. However, we are making tickets less private. Ideally, these community posts act like a freeform knowledge base. We also want suggestions added to the documentation pages.

Apologize for the wait, but the Eng team is working hard to get the 22.12.1 out.. due in early Feb. Probably 160+ bug fixes expected.

dtype · Feb 25, 2023

I'm running into the same thing. TrueNAS Scale 22.12.1.

No applications installed. No additional repos configured. I unset the applications pool, then deleted the ix-applications dataset, and then re-selected the pool. k3s never gets off the ground, with the same looped errors others have noted.

Starts with:

Code:

Feb 25 18:34:25 s2 k3s[1138022]: time="2023-02-25T18:34:25-08:00" level=info msg="Starting k3s v1.25.3+k3s-9afcd6b9-dirty (9afcd6b9)"

Ends with:

Code:

Feb 25 18:34:20 s2 k3s[1135307]: time="2023-02-25T18:34:20-08:00" level=info msg="Starting helm.cattle.io/v1, Kind=HelmChart controller"
Feb 25 18:34:20 s2 k3s[1135307]: time="2023-02-25T18:34:20-08:00" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
Feb 25 18:34:20 s2 k3s[1135307]: I0225 18:34:20.103254 1135307 event.go:294] "Event occurred" object="kube-system/kuberouter" fieldPath="" kind="Addon" apiVersion="k3s.cattle.io/v1" type="Normal" reason="AppliedManifest" message="Applied manifest at \"/mnt/neo/ix-applications/k3s/server/manifests/kuberouter.yaml\""
Feb 25 18:34:20 s2 k3s[1135307]: time="2023-02-25T18:34:20-08:00" level=info msg="Starting batch/v1, Kind=Job controller"
Feb 25 18:34:20 s2 k3s[1135307]: time="2023-02-25T18:34:20-08:00" level=info msg="Starting /v1, Kind=Node controller"
Feb 25 18:34:20 s2 k3s[1135307]: time="2023-02-25T18:34:20-08:00" level=info msg="Starting /v1, Kind=ConfigMap controller"
Feb 25 18:34:20 s2 k3s[1135307]: time="2023-02-25T18:34:20-08:00" level=info msg="Starting /v1, Kind=ServiceAccount controller"
Feb 25 18:34:20 s2 k3s[1135307]: I0225 18:34:20.113459 1135307 event.go:294] "Event occurred" object="kube-system/multus-daemonset" fieldPath="" kind="Addon" apiVersion="k3s.cattle.io/v1" type="Normal" reason="ApplyingManifest" message="Applying manifest at \"/mnt/neo/ix-applications/k3s/server/manifests/multus-daemonset.yaml\""
Feb 25 18:34:20 s2 k3s[1135307]: E0225 18:34:20.125267 1135307 kubelet.go:2034] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
Feb 25 18:34:20 s2 k3s[1135307]: E0225 18:34:20.132422 1135307 kubelet.go:1397] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"

I'm moderately k8s and k3s aware, so happy to help troubleshoot here. I'm looking into this a bit more this weekend.

dtype · Feb 25, 2023

Just for completeness, this is one complete cycle of the k3s failure to launch:

gist:616e09507ab7ee9072cedddfebbe4585

GitHub Gist: instantly share code, notes, and snippets.

gist.github.com

Again:
* Unset applications pool
* systemctl stop k3s
* manually shot container/docker processes to free up file handles so I could delete dataset
* Removed ix-applications dataset
* Set application pool (/mnt/neo)
* Showing /var/log/k3s_daemon.log here

This has persisted reboots and existed for me on 12.22.0 and now 12.22.1.

Motherboard is https://www.supermicro.com/en/products/motherboard/a2sdi-12c-hln4f (not that dissimilar to what IX ships in the Mini boxes). This "neo" pool is a pair of mirrored SSDs. I do have a couple of VMs running with fully subscribed non-pinned cpus (in case some kind of need to pin cpus is a suspect). 128GB of RAM and lots available. Fully willing to wipe any k3s/docker state (I've tried to do this).

edmondtan3 · Jan 31, 2024

Hi all,

Anybody knows how the "unset pool" and "set pool" in script?

After reboot, I need to unset pool and then set pool again in application to fix the issues.

Thanks

Ed

edmondtan3 · Feb 7, 2024

Hi All,

Server Spec:
OS Version:TrueNAS-SCALE-23.10.1.3
Product:Super Server
Model:Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz
Memory:64 GiB
Supermicro x11ssl-f

I have fixed my issues, going to share it here hope this help.

Issues:
My application cannot start after my TNS rebooted. I usually have to unset the pool and set the pool again to fix the issue. (very annoying)
Something I received error message after reboot, the k3s cannot detect my network interface br1 or SSL issues.

What I use TNS for:
I need my TrueNAS Scale to do the following:
- samba
- qbittorent
- plex
- nextcloud
- VM x 3 (window server, Ubuntu with OpenVPN, Ubuntu test server)

I need my VMs able to talk to Samba or TrueNas itself, I need my TNS server after reboot everthing just coming back online automatically. Which I don't have to do unset and set pool etc.

To fix the issues:
I have 2 nic on my motherboard and both connected to a unmanaged switch.
eno1 - used for TrueNas web ui (Assign static IP address 192.168.0.18/24)
eno2 - used for VM but I need to created br1 for the VM to use it. (untick the DHCP and no static IP assign)

On the Virtual Machine nic you need to select br1(eno2) and login to your VM and set a static IP address to your server. My VM will be using the same IP range as the TNS Web UI (VM IP 192.168.0.19/24 and gateway set to .1)

On the Application setting I have the following.

I also have schuduled reboot every Wed and Sat. Now everything just running as expected, VM can talk to TNS host, Samba, Plex and Qbittorrent. After reboot VM and application started without issues.

Hope this help!

Important Announcement for the TrueNAS Community.

k3s not starting after upgrade

morganL

Captain Morgan

Software Status - TrueNAS Roadmap - Open Source NAS Software

jsherm101

Dabbler

morganL

Captain Morgan

Software Development Life Cycle

dtype

Cadet

dtype

Cadet

gist:616e09507ab7ee9072cedddfebbe4585

edmondtan3

Cadet

edmondtan3

Cadet

Similar threads

Important Announcement for the TrueNAS Community.

k3s not starting after upgrade

Captain Morgan

Dabbler

Captain Morgan

Cadet

Cadet

Cadet

Cadet

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "k3s not starting after upgrade"

Similar threads