k3s not starting after upgrade

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
It's hardly a tangent when you say:

and:

and:

(to quote only from your last post to me) ...all of which indicate that "users" (clearly including myself) just ought to know that there's a massive change in good operating practices with SCALE--which would have to mean that it's documented somewhere we should have known to look, because after all the only intuitive user interface is the nipple. It also ignores that the only actionable item you mention is to check the release notes before you upgrade, which is certainly good practice, but you know as well as I do that there's nothing in there about "upgrading will destroy your applications to the point that you need to do a clean install and rebuild from scratch." So, however rare my particular issue may be, there's nothing in the release notes that addresses it. If you want to toss out those sweeping assertions without being challenged on them or defending them, I guess that's your right, but it isn't really useful to any sort of conversation.

Oh, and contrary to:

iX say that it's ready for "general use", which they define as "Field tested software with mature features. Few issues are expected."
I'd equate that to "production", at least through the SMB level (though the word "production" doesn't appear anywhere on that page).

I realize I'm pushing back pretty aggressively. That's because what I'm hearing from you is "you should have known better," without your having the common courtesy to specify what I should have known better (other than what I've already admitted: that iX have a long and ignoble history of shipping releases with show-stopping bugs, and therefore not to trust a .0 release) or on what basis; with the further implication that whatever trouble I'm having is my own fault.

On the software status page... we don't use the word "production" because it has too broad a definition. Production at home is different from production in a bank.

The definitions we chose were:

Tester
Early Adopter
General
Conservative
Mission Critical

It's very clear Bluefin is ready for Tester and Early Adopter.

"General" was a toss-up between Angelfish 22.02.4 and Bluefin 22.12.0 - we chose Bluefin because it had many bug fixes and overlayfs support compared with angelfish. No major issues were reported in 1st 10 days. We preferred new users to start there since that is where any bug fixes will go. However we expect a few issues. This App upgrade issue has surprised us and has still not been identified (nor do we know how many systems out of 10,000 have been impacted)

Explicitly for SCALE: we have not declared it ready for Conservative or Mission Critical; we expect to get there with 22.12.2. Our experience has been that we need tens of thousands of users for months to get to that quality level. TrueNAS CORE 13.0-U3.1 has reached that.
 

jsherm101

Dabbler
Joined
Nov 25, 2016
Messages
20
On the software status page... we don't use the word "production" because it has too broad a definition. Production at home is different from production in a bank.

The definitions we chose were:

Tester
Early Adopter
General
Conservative
Mission Critical

It's very clear Bluefin is ready for Tester and Early Adopter.

"General" was a toss-up between Angelfish 22.02.4 and Bluefin 22.12.0 - we chose Bluefin because it had many bug fixes and overlayfs support compared with angelfish. No major issues were reported in 1st 10 days. We preferred new users to start there since that is where any bug fixes will go. However we expect a few issues. This App upgrade issue has surprised us and has still not been identified (nor do we know how many systems out of 10,000 have been impacted)

Thanks for the perspective -- FWIW, from my product development perspective, the toss-up ignores not only the upgrade-path audience but also different user types (e.g. new users with complex chart configs). There are a lot more changes in Bluefin vs. Angelfish than I expected for k3s. My upgrade broke my apps, so I called it a total loss and started over with a fresh Bluefin install. Yet I too am experiencing a random k3s error where it won't start, with no real way to debug the cause, so it seems to be something at the iX level (along with UPS service issues).

In the future I'd argue that even if SCALE is that much more complex, the number of changes to the environment implies an "upgrade" user may be safer staying on Angelfish until the February release. But I'm also curious why someone starting fresh like me is hitting k3s issues with complex chart setups -- it seems to imply that even "general" power users running many SCALE charts might be better off on Angelfish too, or at least need better visibility into how to track the bugs (i.e. do I sit and wait on 22.12.0, or wait for 22.12.1?). The most frustrating bit isn't just being blindsided by the system-level issues, but the difficulty of knowing the best path forward to resolve them. There are even issues like viewing/accessing shells with admin users that I think "general" audience users would definitely get confused by if they didn't read the fine print about switching to the root user.

Another issue is that it's hard to get insight into how meta-tickets (i.e. bundles of tickets with similar issues) are being triaged and managed. Understandably, individual tickets contain lots of private logs/data, but that makes it very hard, when I'm searching on a given error, to find the best solution to try across Discord, the forums, Jira, etc. This is something iX may want to consider: given how many layers there are, and while there are many people willing to help, we could greatly benefit from something like a knowledge base for big bugs and common errors.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Thanks for the perspective -- FWIW, from my product development perspective, the toss-up ignores not only the upgrade-path audience but also different user types (e.g. new users with complex chart configs). There are a lot more changes in Bluefin vs. Angelfish than I expected for k3s. My upgrade broke my apps, so I called it a total loss and started over with a fresh Bluefin install. Yet I too am experiencing a random k3s error where it won't start, with no real way to debug the cause, so it seems to be something at the iX level (along with UPS service issues).

In the future I'd argue that even if SCALE is that much more complex, the number of changes to the environment implies an "upgrade" user may be safer staying on Angelfish until the February release. But I'm also curious why someone starting fresh like me is hitting k3s issues with complex chart setups -- it seems to imply that even "general" power users running many SCALE charts might be better off on Angelfish too, or at least need better visibility into how to track the bugs (i.e. do I sit and wait on 22.12.0, or wait for 22.12.1?). The most frustrating bit isn't just being blindsided by the system-level issues, but the difficulty of knowing the best path forward to resolve them. There are even issues like viewing/accessing shells with admin users that I think "general" audience users would definitely get confused by if they didn't read the fine print about switching to the root user.

Another issue is that it's hard to get insight into how meta-tickets (i.e. bundles of tickets with similar issues) are being triaged and managed. Understandably, individual tickets contain lots of private logs/data, but that makes it very hard, when I'm searching on a given error, to find the best solution to try across Discord, the forums, Jira, etc. This is something iX may want to consider: given how many layers there are, and while there are many people willing to help, we could greatly benefit from something like a knowledge base for big bugs and common errors.

Thanks for the note and suggestions.

In addition to the Software Status page, there is a general software lifecycle page where we try to highlight the maturation process that each release goes through. This captures some of the sentiment: at release, we recommend only less complex deployments.


I'm not sure whether Angelfish would have been better in your case. We may have resolved some bugs but added some more. Swings and roundabouts.

I agree with your assessment of the upgrade issues... these are the hardest to test, given the diversity of hardware and configurations. It's the user testing that uncovers issues. We didn't see the issues in BETA and RC1... perhaps there were too few upgrades vs. new systems.

We don't have anything between user tickets and these community threads. However, we are making tickets less private. Ideally, these community posts act like a freeform knowledge base. We also want suggestions added to the documentation pages.

Apologies for the wait, but the engineering team is working hard to get 22.12.1 out... due in early Feb, with probably 160+ bug fixes expected.
 

dtype

Cadet
Joined
Feb 25, 2023
Messages
2
I'm running into the same thing. TrueNAS SCALE 22.12.1.

No applications installed. No additional repos configured. I unset the applications pool, then deleted the ix-applications dataset, and then re-selected the pool. k3s never gets off the ground, with the same looped errors others have noted.

Starts with:
Code:
Feb 25 18:34:25 s2 k3s[1138022]: time="2023-02-25T18:34:25-08:00" level=info msg="Starting k3s v1.25.3+k3s-9afcd6b9-dirty (9afcd6b9)"


Ends with:
Code:
Feb 25 18:34:20 s2 k3s[1135307]: time="2023-02-25T18:34:20-08:00" level=info msg="Starting helm.cattle.io/v1, Kind=HelmChart controller"
Feb 25 18:34:20 s2 k3s[1135307]: time="2023-02-25T18:34:20-08:00" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
Feb 25 18:34:20 s2 k3s[1135307]: I0225 18:34:20.103254 1135307 event.go:294] "Event occurred" object="kube-system/kuberouter" fieldPath="" kind="Addon" apiVersion="k3s.cattle.io/v1" type="Normal" reason="AppliedManifest" message="Applied manifest at \"/mnt/neo/ix-applications/k3s/server/manifests/kuberouter.yaml\""
Feb 25 18:34:20 s2 k3s[1135307]: time="2023-02-25T18:34:20-08:00" level=info msg="Starting batch/v1, Kind=Job controller"
Feb 25 18:34:20 s2 k3s[1135307]: time="2023-02-25T18:34:20-08:00" level=info msg="Starting /v1, Kind=Node controller"
Feb 25 18:34:20 s2 k3s[1135307]: time="2023-02-25T18:34:20-08:00" level=info msg="Starting /v1, Kind=ConfigMap controller"
Feb 25 18:34:20 s2 k3s[1135307]: time="2023-02-25T18:34:20-08:00" level=info msg="Starting /v1, Kind=ServiceAccount controller"
Feb 25 18:34:20 s2 k3s[1135307]: I0225 18:34:20.113459 1135307 event.go:294] "Event occurred" object="kube-system/multus-daemonset" fieldPath="" kind="Addon" apiVersion="k3s.cattle.io/v1" type="Normal" reason="ApplyingManifest" message="Applying manifest at \"/mnt/neo/ix-applications/k3s/server/manifests/multus-daemonset.yaml\""
Feb 25 18:34:20 s2 k3s[1135307]: E0225 18:34:20.125267 1135307 kubelet.go:2034] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
Feb 25 18:34:20 s2 k3s[1135307]: E0225 18:34:20.132422 1135307 kubelet.go:1397] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"


I'm moderately k8s and k3s aware, so happy to help troubleshoot here. I'm looking into this a bit more this weekend.
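For anyone else poking at the "Failed to start ContainerManager ... [kubepods] doesn't exist" error above, here are the read-only checks I've been running -- a minimal sketch, assuming a standard systemd cgroup layout and the k3s log path as it is on my system:

Code:
# Which cgroup filesystem is mounted? (cgroup2fs = unified v2, tmpfs = v1/hybrid)
stat -fc %T /sys/fs/cgroup
# On cgroup v2, which controllers are delegated?
cat /sys/fs/cgroup/cgroup.controllers 2>/dev/null
# Does the kubepods cgroup the kubelet complains about exist anywhere?
find /sys/fs/cgroup -maxdepth 3 -name 'kubepods*' 2>/dev/null
# Follow the k3s log that the middleware writes
tail -f /var/log/k3s_daemon.log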
 

dtype

Cadet
Joined
Feb 25, 2023
Messages
2
Just for completeness, this is one complete cycle of the k3s failure to launch:


Again, the steps (rough shell equivalents follow the list):
* Unset applications pool
* systemctl stop k3s
* manually killed container/docker processes to free up file handles so I could delete the dataset
* Removed ix-applications dataset
* Set application pool (/mnt/neo)
* Showing /var/log/k3s_daemon.log here
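In shell terms, roughly -- a sketch of the sequence, using my pool name "neo"; the unset/set pool steps themselves were done in the Apps UI, and the lsof pipeline is just one way to find the processes holding handles:

Code:
systemctl stop k3s
# Find and kill whatever still holds files open under the dataset
lsof +D /mnt/neo/ix-applications 2>/dev/null | awk 'NR>1 {print $2}' | sort -u | xargs -r kill
# Destroy the applications dataset and all of its children
zfs destroy -r neo/ix-applications
# Re-select the pool in the Apps UI (recreates ix-applications), then watch the log
tail -f /var/log/k3s_daemon.log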

This has persisted across reboots, and existed for me on 22.12.0 and now 22.12.1.

Motherboard is https://www.supermicro.com/en/products/motherboard/a2sdi-12c-hln4f (not that dissimilar to what iX ships in the Mini boxes). This "neo" pool is a pair of mirrored SSDs. I do have a couple of VMs running with fully subscribed, non-pinned CPUs (in case a need to pin CPUs is a suspect). 128 GB of RAM with plenty available. Fully willing to wipe any k3s/docker state (I've tried to do this).
 

edmondtan3

Cadet
Joined
Oct 30, 2020
Messages
4
Hi all,

Does anybody know how to do the "unset pool" and "set pool" steps in a script?

After a reboot, I need to unset the pool and then set it again under Applications to fix the issue.
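Something like this is what I'm hoping for -- a sketch assuming the midclt middleware client is available and that kubernetes.update accepts a "pool" key (I haven't verified this; it just mirrors what the Apps settings screen does, so please test on a non-critical box first):

Code:
#!/bin/bash
# Hypothetical unset/set cycle for the apps pool via the middleware client
POOL="tank"  # replace with your actual pool name
midclt call kubernetes.update '{"pool": null}'         # unset the apps pool
midclt call kubernetes.update "{\"pool\": \"$POOL\"}"  # set it again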

Thanks

Ed
 

edmondtan3

Cadet
Joined
Oct 30, 2020
Messages
4
Hi All,

Server spec:
OS version: TrueNAS-SCALE-23.10.1.3
Product: Super Server
CPU: Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz
Memory: 64 GiB
Motherboard: Supermicro X11SSL-F


I have fixed my issue; sharing it here in the hope it helps.

Issue:
My applications cannot start after my TNS reboots. I usually have to unset the pool and then set it again to fix the issue (very annoying). Sometimes I receive an error message after reboot: k3s cannot detect my network interface br1, or SSL issues.

What I use TNS for:
I need my TrueNAS SCALE to do the following:
- Samba
- qBittorrent
- Plex
- Nextcloud
- VM x 3 (Windows Server, Ubuntu with OpenVPN, Ubuntu test server)

I need my VMs to be able to talk to Samba and to TrueNAS itself, and after a reboot I need everything to come back online automatically, without my having to unset and set the pool, etc.

To fix the issue:
I have 2 NICs on my motherboard, both connected to an unmanaged switch.
eno1 - used for the TrueNAS web UI (static IP address 192.168.0.18/24)
eno2 - used for the VMs; I created br1 on top of it for the VMs to use (DHCP unticked, no static IP assigned)

(screenshot: network interface configuration)
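To sanity-check the bridge from a shell -- a couple of read-only commands (the interface names br1/eno2 are from my setup):

Code:
# br1 should exist and be UP; eno2 should show "master br1" and carry no IP of its own
ip link show br1
ip addr show eno2
# br1 itself should not clash with the web UI NIC's address
ip addr show br1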


On the virtual machine's NIC you need to select br1 (eno2), then log in to your VM and set a static IP address. My VM uses the same IP range as the TNS web UI (VM IP 192.168.0.19/24, gateway set to .1).
(screenshot: VM NIC attached to br1)


In the Applications settings I have the following.

(screenshot: Applications settings)


I also have a scheduled reboot every Wednesday and Saturday. Now everything runs as expected: the VMs can talk to the TNS host, Samba, Plex and qBittorrent, and after a reboot the VMs and applications start without issues.

Hope this helps!
 