k3s not starting after upgrade

danb35 · Dec 29, 2022

Whatever Bluefin broke, it really broke. Maybe I'll remember this in the future: never, ever, EVER trust a .0 release from iXSystems.

As recommended in @Daisuke's other thread, I got a pair of SSDs the other day for apps and such, and installed them this afternoon. Created a pool, chose it for apps, then waited, and waited, and waited for k3s to finally start, which it eventually did. Then I tried to install something--Traefik seemed like a good place to start. And of course it fails to install:

[EFAULT] Failed to install chart release: Error: INSTALLATION FAILED: failed pre-install: timed out waiting for the condition

The traceback isn't very helpful, at least for me:

Code:

Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 426, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 461, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1152, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1284, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/chart_releases_linux/chart_release.py", line 486, in do_create
    await self.middleware.call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1306, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1266, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1169, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/chart_releases_linux/helm.py", line 44, in helm_action
    raise CallError(f'Failed to {tn_action} chart release: {stderr.decode()}')
middlewared.service_exception.CallError: [EFAULT] Failed to install chart release: Error: INSTALLATION FAILED: failed pre-install: timed out waiting for the condition

Wondering if it might be unique to Traefik, I tried Heimdall--same thing. On the off chance that it might be something unique to TrueCharts, I tried installing the "official" photoprism--same result there as well.

So: Upgrading to Bluefin destroys my apps--they don't run on Bluefin, but Bluefin has broken them in such a way that they won't run if I revert to Angelfish either. And not only that, even choosing a fresh pool for the apps won't let me start over and install them from scratch. And this is a release iX recommends everyone upgrade to.

truecharts · Dec 29, 2022

danb35 said:
Wondering if it might be unique to Traefik, I tried Heimdall--same thing. On the off chance that it might be something unique to TrueCharts, I tried installing the "official" photoprism--same result there as well.

Unlikely, because that official app doesn't have any "pre-install" at all ;-)

danb35 said:
So: Upgrading to Bluefin destroys my apps--they don't run on Bluefin, but Bluefin has broken them in such a way that they won't run if I revert to Angelfish either. And not only that, even choosing a fresh pool for the apps won't let me start over and install them from scratch. And this is a release iX recommends everyone upgrade to.

Passive-aggressive statements based on n=1 are not really helpful.
We have not seen what you're describing from any of our users to date.

Based on what you describe, your system has never worked well.
But without A LOT more info and a Jira ticket, no one is going to be able to give any worthwhile response to this.

danb35 · Dec 29, 2022

truecharts said:
Unlikely, because that official app doesn't have any "pre-install" at all ;-)

Here's the error message when I try to install the official photoprism:

[EFAULT] Failed to install chart release: Error: INSTALLATION FAILED: failed pre-install: timed out waiting for the condition

And the traceback:

Code:

Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 426, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 461, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1152, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1284, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/chart_releases_linux/chart_release.py", line 486, in do_create
    await self.middleware.call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1306, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1266, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1169, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/chart_releases_linux/helm.py", line 44, in helm_action
    raise CallError(f'Failed to {tn_action} chart release: {stderr.decode()}')
middlewared.service_exception.CallError: [EFAULT] Failed to install chart release: Error: INSTALLATION FAILED: failed pre-install: timed out waiting for the condition

I haven't parsed through every line, but that looks identical to me.

truecharts said:
Passive-aggressive statements based on n=1 are not really helpful.

Fair enough, but I don't think my frustration here is unreasonable. 20+ apps were running smoothly under Angelfish. I upgrade to Bluefin and they seem to be very badly broken, and nobody seems to know why.

truecharts said:
Also your conclusion: "Upgrading to Bluefin destroys my apps", has no correlation with your experience above it "new installs wont work".

I'm not sure how to better describe it--my apps didn't migrate to Bluefin, I can't get them to run on Angelfish, and I can't even (back under Bluefin) set up a new pool for apps and start over there. I think that sounds like "destroys my apps." Sure, they're (presumably) still on disk--there's a tank/ix-applications dataset with a little over 100 GB of data present--but I can't do anything with them. As a recap:

1. Under Angelfish, 20+ apps are running smoothly
2. Upgrade to Bluefin, zero apps are running; the UI gives the red stop sign and says "applications are not running."
3. Tried following suggestions from a couple of other threads on the same error message: set the node IP, set the interface and gateway, to no effect.
4. Figured "no problem, that's why TrueNAS has boot environments," and reverted to Angelfish. The apps still aren't running. Now I'm getting concerned.
5. After opening a ticket on your Discord and some back and forth (and a great deal of timely help there), I'm able to get my existing apps running again under Angelfish. And here's where I should have stopped, but didn't. Because one of those helping me in that ticket suggested I do the Bluefin upgrade again, wait, and give plenty of time to allow the apparently-problematic migration process to complete.
6. So, ran the upgrade again, booted Bluefin, and waited for a number of hours. No dice, the apps still don't come up. Morgan suggests up-thread that it might have something to do with host path validation, so I disable that and wait some more. Still nothing.
7. Since I think I know how to get them back under Angelfish after my last go-round, I decide to give up on Bluefin and go back to Angelfish. Revert to that BE, unset pool, reboot, choose pool--but unlike last time, the apps don't come up. At all. It's stuck on "applications are not running" in the GUI, and k3s is in a restart loop. I can't restore the TrueTool backup, because TrueTool doesn't see any of the backups.
8. At this point I pretty much give up on getting my existing apps back and decide to follow @Daisuke's recommendation to set up a separate pool on SSD for applications, and just start over with them. And that goes as I describe in this post and my last--apps don't install. Last response on my Discord ticket is "Yeah something is off with your SCALE box/setup," which sounds to me like "not our problem"--for which I don't blame you (plural), as I don't think it is.

truecharts · Dec 29, 2022

There are some options to try:
- Making a TrueTool backup on the borked Bluefin, then restoring that one
- Restoring a TrueTool Angelfish (pre-upgrade) backup on Bluefin.
- It's correct that Angelfish does not allow a backup restore on a non-functional cluster; Bluefin does, however.

That assumes you followed our backup guide though.

danb35 · Dec 29, 2022

truecharts said:
That assumes you followed our backup guide though.

That's this guide, right?

Backup, Migrations and Restore | TrueCharts (truecharts.org)
"This guide has been written with the best efforts of the staff and tested as best possible. We are not responsible if it doesn't work for every scenario or user situation."

If I'm understanding it correctly (since I am also moving--or wanting to move--the apps from a pool of spinners ("tank") to a SSD pool ("software")), and I'm running under Bluefin, I should be able to:

1. Back up the NAS config file
2. Reset the NAS to defaults (as the guide emphasizes to "Ensure a Clean system")
3. Replicate tank/ix-applications to software/ix-applications (with the parameters/exclusions noted there)
4. Create the various datasets (software/ix-applications/docker, software/ix-applications/k3s, etc.)
5. Restore the backed-up config file to the NAS
6. After reboot, Apps -> Choose Pool -> software
7. Restore using TrueTool

There's been a good bit of messing around over the past couple of weeks, but I believe all the backups on tank would be of apps under Angelfish.

browntiger · Dec 30, 2022

Personally I never thought that a location of a pool will make a difference...

Can I make a suggestion:
- Stop all your apps (or set replicas to 0 in k3s)
- Start them one by one, letting each complete its install and become available.
- It could be that a couple of app charts are not coded well and require access to some slow external API, or that the timing settings are totally off.

You are claiming that this is related to BlueFin, but it could be an issue with TrueCharts chart (a chart may need delay/probe).

@truecharts are not doing you any favors, telling you some different random facts...

Facts:
- Kubernetes deadlocks are not all that uncommon.
- I am very grateful for TrueCharts efforts and maintenance of their ginormous catalog, but those charts are NOT coded by developers with expertise in a particular container (containers).
- You seem to have an issue with a "pre-install" deadlock (maybe). I would go container by container and let each one complete its install actions.
- It could be a side effect of OverlayFS introduced in this release. You need to let a chart completely download and start (especially if it relies on some external APIs)...

- Could be an issue with pulling 20 docker images ALL at once and running into some delay.

As suggested above I would start an image and wait until it fully started before going to the next.

Kubernetes could still fail.
- Kubernetes charts use "probes" to figure out whether a container started properly or failed. Kubernetes will try to restart a "failed" container a few times before giving up.
- Typically a chart designer will configure: initialDelaySeconds, failureThreshold, periodSeconds (hopefully)
- The default periodSeconds is ONLY 10s (may be very BAD for your server)
- Next you may have the probes: "startup probe", "readiness probe" and "liveness probe". (those probes may or may not be configured correctly)
- Those probes may or may not be specified. Failure to include a "startup probe" could cause severe issues with a chart.

You can google up more details on this subject, but I would not necessarily blame TrueNas team for a failure of Kubernetes to start. At least a lot more details will be needed to figure out your issue, other than this brief log.
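To put rough numbers on the probe settings listed above, here is a minimal sketch. It assumes the commonly documented Kubernetes probe defaults (initialDelaySeconds 0, periodSeconds 10, failureThreshold 3) and approximates the time a container gets as initialDelaySeconds + failureThreshold * periodSeconds. The class and function names are made up for illustration and are not taken from any TrueCharts or iX chart; the per-check timeout is ignored.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    # Defaults mirror the commonly documented Kubernetes probe defaults.
    initial_delay_seconds: int = 0
    period_seconds: int = 10
    failure_threshold: int = 3


def startup_budget_seconds(probe: Probe) -> int:
    """Approximate seconds a container has before the probe counts as failed."""
    return probe.initial_delay_seconds + probe.failure_threshold * probe.period_seconds


if __name__ == "__main__":
    # With every default left alone, a slow container gets about 30 seconds.
    print(startup_budget_seconds(Probe()))
    # A more generous (hypothetical) startup probe: 30 failures allowed.
    print(startup_budget_seconds(Probe(failure_threshold=30)))
```

On a slow box, comparing this budget with how long a container really takes to come up might show whether the probe timing is even a plausible suspect.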

truecharts · Dec 30, 2022

browntiger said:
Facts:
- Kubernetes deadlocks are not all that uncommon.
- I am very grateful for TrueCharts efforts and maintenance of their ginormous catalog, but those charts are NOT coded by developers with expertise in a particular container (containers).
- You seem to have an issue with a "pre-install" deadlock (maybe). I would go container by container and let each one complete its install actions.
- It could be a side effect of OverlayFS introduced in this release. You need to let a chart completely download and start (especially if it relies on some external APIs)...

- Could be an issue with pulling 20 docker images ALL at once and running into some delay.

As suggested above I would start an image and wait until it fully started before going to the next.

Kubernetes could still fail.
- Kubernetes charts use "probes" to figure out whether a container started properly or failed. Kubernetes will try to restart a "failed" container a few times before giving up.
- Typically a chart designer will configure: initialDelaySeconds, failureThreshold, periodSeconds (hopefully)
- The default periodSeconds is ONLY 10s (may be very BAD for your server)
- Next you may have the probes: "startup probe", "readiness probe" and "liveness probe". (those probes may or may not be configured correctly)
- Those probes may or may not be specified. Failure to include a "startup probe" could cause severe issues with a chart.

You can google up more details on this subject, but I would not necessarily blame TrueNas team for a failure of Kubernetes to start. At least a lot more details will be needed to figure out your issue, other than this brief log.

We're going to briefly respond to this, as we view this as needless suggestive content:

- We have a lot of experience in the thing we do. There is no evidence that a lack of quality or experience has anything to do with these issues.
- @danb35 already reported it was not isolated to TrueCharts
- While some of your comments are definitely true "in general", they do not really make sense for this issue, or needlessly suggest that things might, somehow, be broken on all our Apps.

This issue has nothing to do with TrueCharts or iX Apps quality and should be reported to iX as an issue with their Apps platform instead.
We stand by that advice.

---

On the topic of having "NOT coded by developers with expertise in a particular container".

In our experience, about 80% of the people making the containers for the software we use, often the first-party developers, are completely useless when it comes to building containers, making many huge mistakes and delivering CVE-ridden, piece-of-garbage containers.

It's actually people like the folks over at LSIO or Bitnami, who still suffer from some general design mistakes in our opinion, that often understand their many containers a LOT better than the original creator.

Why? Because they are specialised in containers and their deployment, just like we are.

What this comes down to is that we often know better how containers should be deployed on Kubernetes than their original creator does.
That does not mean that applications always cooperate with being deployed, but bugs caused by that do not lead to the above.

danb35 · Dec 30, 2022

browntiger said:
Can I make a suggestion:
- Stop all your apps (or set replicas to 0 in k3s)
- Start them one by one, letting them complete the install and become available.

I appreciate the suggestion, but it doesn't seem to understand the critical issue: I have no apps. None. When the Apps system is set to use the pool on which the existing apps had been (and I believe still are) stored, and were happily running under Angelfish, k3s doesn't start at all, I get the error message posted up-thread that "Applications are not running" with the big red stop sign, and nothing I've tried has changed that.

When I set the apps system to use a different pool (not migrating anything, just using a different pool), k3s does (eventually, sometimes after several hours) start, but of course there are no apps installed. And when I try to install one, whether from TrueCharts or from iX, it fails with the error noted up-thread, which appears to be identical whether it's an iX app or a TrueCharts app. There shouldn't be any sort of deadlock going on, unless there's a great deal of activity behind the scenes that the UI's choosing to hide--and if it's activity of the sort that means "hey dummy, leave the apps system alone for a while while it does stuff," the UI definitely shouldn't be hiding it.

truecharts said:
This issue has nothing to do with TrueCharts or iX Apps quality and should be reported to iX as an issue with their Apps platform instead.

...and it has been, nearly two weeks ago now. Crickets other than "nothing useful in the debug, can you send a different log?" I'm sure the holidays aren't helping, and I recognize I'm not a paying customer, but it'd sure be nice to get to the bottom of this.

I'm trying the restore option you mentioned now. It's been stopped at [25%] Initializing new kubernetes cluster... for the last 12 hours. I don't see that it hurts to wait, but that seems like it might be a bit excessive, particularly when the new pool is SSD rather than spinners. Maybe I'll hop on the Discord in the morning if it hasn't progressed by then.

Daisuke · Dec 30, 2022

@danb35 the time you invested in this issue could’ve been better used to simply do a clean Bluefin install and start with all your apps fresh. Be aware that even in Angelfish things changed in relation to app settings. For example, simple volumes were gone in Angelfish.

If you decide to do this, make sure the SSD pool is destroyed beforehand and created again with the new clean install. Basically, you should not keep any traces of the ix-applications dataset in any pool prior to the clean install. If you want to have a reminder of what settings each application had, run:

Code:

# midclt call chart.release.query '[["id", "=", "myappname"]]' | jq -M '.[].config' > /mnt/pool/dataset/myappname.json
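
If you have a lot of apps, the same idea could be looped over all of them. This is only a rough sketch: it assumes that `midclt call chart.release.query` with no filter returns a JSON list whose entries carry the same `id` and `config` fields used in the command above. The function names are made up for illustration.

```python
import json
import subprocess
from pathlib import Path


def query_releases() -> list:
    """Ask the middleware for all chart releases (same call as above, no filter)."""
    result = subprocess.run(
        ["midclt", "call", "chart.release.query"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def save_configs(releases: list, out_dir: Path) -> list:
    """Write each release's 'config' to <out_dir>/<id>.json and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for release in releases:
        path = out_dir / f"{release['id']}.json"
        # A release without a 'config' key is saved as an empty object.
        path.write_text(json.dumps(release.get("config", {}), indent=2))
        written.append(path)
    return written
```

Running `save_configs(query_releases(), Path("/mnt/pool/dataset"))` on the NAS would then leave one JSON file per app in that (placeholder) dataset path.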

Even though my upgrade was without issues, I still decided to perform a clean install and add all my apps fresh. I explained in this post what changed. I recommend that you not restore any form of backup, and that you install your apps one by one, from scratch.

Bluefin introduces a new version of k3s and Kubernetes, which moved away from dockershim as the main container backend and implements the overlayfs driver for Docker, as well as many improvements that might affect old applications or even basic operations (quick example: asyncio is gone). We can argue for hours that an upgrade should work, or stop wasting time and prepare for future releases by doing a clean install.

danb35 said:
Maybe I'll remember this in the future: never, ever, EVER trust a .0 release from iXSystems

This has nothing to do with the first release; you will experience the same issues with future releases when you upgrade from Angelfish. I expect similar issues to surface with any major Scale release, mainly because of Kubernetes' fast pace of change or major component changes. For example, in the future Scale Cobia major release (note the alphabetical release names), Docker will be replaced with containerd. I'd rather have my OS operating perfectly, so I'm very open to clean installs, and I'm happy that my pools will never be affected.

browntiger · Dec 30, 2022

danb35 said:
I appreciate the suggestion, but it doesn't seem to understand the critical issue: I have no apps. None. When the Apps system is set to use the pool on which the existing apps had been (and I believe still are) stored, and were happily running under Angelfish, k3s doesn't start at all, I get the error message posted up-thread that "Applications are not running" with the big red stop sign, and nothing I've tried has changed that.

When I set the apps system to use a different pool (not migrating anything, just using a different pool), k3s does (eventually, sometimes after several hours) start, but of course there are no apps installed. And when I try to install one, whether from TrueCharts or from iX, it fails with the error noted up-thread, which appears to be identical whether it's an iX app or a TrueCharts app. There shouldn't be any sort of deadlock going on, unless there's a great deal of activity behind the scenes that the UI's choosing to hide--and if it's activity of the sort that means "hey dummy, leave the apps system alone for a while while it does stuff," the UI definitely shouldn't be hiding it.

...and it has been, nearly two weeks ago now. Crickets other than "nothing useful in the debug, can you send a different log?" I'm sure the holidays aren't helping, and I recognize I'm not a paying customer, but it'd sure be nice to get to the bottom of this.

I'm trying the restore option you mentioned now. It's been stopped at [25%] Initializing new kubernetes cluster... for the last 12 hours. I don't see that it hurts to wait, but that seems like it might be a bit excessive, particularly when the new pool is SSD rather than spinners. Maybe I'll hop on the Discord in the morning if it hasn't progressed by then.

Sorry insanely long thread to read...

>err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"

have you tried disabling QOS? Do you really need it... See if you can add to k3s.service...

--cgroups-per-qos=false --enforce-node-allocatable=""
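
Before touching any service flags, a read-only look at which cgroup controllers the kernel exposes might be informative. This is a minimal sketch, assuming a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup (on a cgroup v1 system that file won't exist). The list of controllers it looks for is an illustrative guess, not an authoritative statement of what k3s needs, and a clean result would not rule out other causes of the kubepods error.

```python
from pathlib import Path

# Controllers to look for. This particular set is an illustrative assumption,
# not an authoritative list of what k3s or the kubelet requires.
WANTED = ("cpu", "cpuset", "memory", "pids")


def missing_controllers(controllers_text: str, wanted=WANTED) -> list:
    """Return the wanted controllers absent from a cgroup.controllers listing."""
    available = set(controllers_text.split())
    return [name for name in wanted if name not in available]


def read_controllers(path: str = "/sys/fs/cgroup/cgroup.controllers") -> str:
    """Read the cgroup v2 controller list; empty string if the file is absent."""
    p = Path(path)
    return p.read_text() if p.exists() else ""
```

Calling `missing_controllers(read_controllers())` only reads a file and changes nothing on the system.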

truecharts · Dec 31, 2022

browntiger said:
Sorry insanely long thread to read...

>err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"

have you tried disabling QOS? Do you really need it... See if you can add to k3s.service...

--cgroups-per-qos=false --enforce-node-allocatable=""

NEVER modify the k3s service.

You seem to be going at this as a kubernetes deployment. However it's not:
The Apps system is a system managed by iX-Systems' TrueNAS, which is *built* on Kubernetes.

Even if this "solves" it, you've a timebomb in your hands.

browntiger · Dec 31, 2022

What? Modifying the kubernetes startup in the systemd is "seem to be going at this as a kubernetes deployment. However it's not:"

Sorry Lol. It is definitely k8s. It is a pretty common solution to a failed QoS reservation. Google it up. The OP may have an issue with the mobo/CPU, so it is something worth trying. To undo it, place a # lol.

I have an older server at work running Debian that has this fix. Makes little difference. It is not something that is going to stick anyway.

>Even if this "solves" it, you've a timebomb in your hands.
Based on reported error, op has an issue with kubernetes service start up, specifically CPU reservation. Google says those Xeons e5-2670s are 2012s. All i am suggesting is bypass qos to see if k3s will start.

truecharts · Dec 31, 2022

browntiger said:
What? Modifying the kubernetes startup in the systemd is "seem to be going at this as a kubernetes deployment. However it's not:"

Sorry Lol. It is definitely k8s. It is a pretty common solution to a failed QoS reservation. Google it up. The OP may have an issue with the mobo/CPU, so it is something worth trying. To undo it, place a # lol.

I have an older server at work running Debian that has this fix. Makes little difference. It is not something that is going to stick anyway.

>Even if this "solves" it, you've a timebomb in your hands.
Based on reported error, op has an issue with kubernetes service start up, specifically CPU reservation. Google says those Xeons e5-2670s are 2012s. All i am suggesting is bypass qos to see if k3s will start.

TrueNAS is supposed to be an appliance OS, it's not supposed to be altered in these ways and users are not supposed to alter background services managed by the OS themselves.

You also don't have to try to educate us on something that is literally our reason for existing. We're just trying to explain to you that TrueNAS SCALE is not "just an OS running Kubernetes". SCALE Apps are based on it, but the whole system is not designed to be "user serviceable".

Making user-modifications to it, basically means you're not running TrueNAS anymore and can have all sorts of weird consequences with updates and trigger bugs with other services.

All services within TrueNAS are supposed to be accessed only through what's called "the middleware".
Any deviation from that by the user carries the risk of completely bricking the system in the future, because the middleware is not expecting these modifications.

So "I have an older server at work running Debian", is the wrong way to look at TrueNAS SCALE. It's better to look at it like it's router firmware or some other form of appliance OS/firmware.

danb35 · Jan 1, 2023

Daisuke said:
I'd rather have my OS operating perfectly, so I'm very open to clean installs, and I'm happy that my pools will never be affected.

"You should expect to need to do a clean install and reconfigure from scratch with every major release", as you're suggesting, is simply unacceptable. If iX can't do better than that, they might as well say so, so that I can find something else. That isn't the way they've worked for 10+ years, and I can't imagine it's how they intend to start working. Hell, they even support upgrades across two completely different products (CORE and SCALE--and it's becoming increasingly apparent that they're different, and rapidly diverging, products).

"Blow it away and start over from scratch because of some bizarre bug that borked the migration" is borderline acceptable, once--though the fact that whatever it did made it impossible to revert means it's hanging onto the edge of "acceptable" by a very fine thread. But what you're proposing is so far over the edge that the edge can't even be seen any more.

Daisuke · Jan 1, 2023

danb35 said:
it's becoming increasingly apparent that they're different

There is a key factor in all this: Kubernetes. Stop using apps and you will never have any issues upgrading; you will basically be using Debian, which is a battle-tested OS. With the rapid changes in Kubernetes, there is a big chance things will break easily. In every Kubernetes release, there are API changes and feature removals.

From my perspective, it is not entirely iXsystems' fault that upgrades do not work perfectly for every single soul. Users need to change their mentality if they want to use Kubernetes, and start paying attention to the iXsystems Release Notes, amongst other things. Most users don't do any upgrades for several versions and don't even bother to see if their apps change in design structure (hint: simple PVC, which nobody bothered to look at until they got upgrade errors). If they want to use Scale, they need to start developing these habits. At work, we go through extensive tests prior to every minor Kubernetes version upgrade, to make sure things don't break. Because they often break, by Kubernetes design.

danb35 said:
is simply unacceptable

You're going to tell me, "iXsystems should do all these extensive tests." They do, and they fix things, properly. But like everywhere on the planet, bugs exist, because we are humans. You also forget that this is free open-source software; if you want support and things to work perfectly, purchase a license, and you will be told not to use Scale yet, because it is not considered production ready. If you are not comfortable with that, you can always go back to Core and not face these issues, running your VMs and jails in a "stone age" way, so to speak.

The part I don't understand is why you spend all this energy complaining that iXsystems did not do this and that for 10 years, instead of taking action and aligning with the Scale reality of doing things.

morganL · Jan 1, 2023

Daisuke said:
There is a key factor in all this: Kubernetes. Stop using apps and you will never have any issues upgrading; you will basically be using Debian, which is a battle-tested OS. With the rapid changes in Kubernetes, there is a big chance things will break easily. In every Kubernetes release, there are API changes and feature removals.

From my perspective, it is not entirely iXsystems' fault that upgrades do not work perfectly for every single soul. Users need to change their mentality if they want to use Kubernetes, and start paying attention to the iXsystems Release Notes, amongst other things. Most users don't do any upgrades for several versions and don't even bother to see if their apps change in design structure (hint: simple PVC, which nobody bothered to look at until they got upgrade errors). If they want to use Scale, they need to start developing these habits. At work, we go through extensive tests prior to every minor Kubernetes version upgrade, to make sure things don't break. Because they often break, by Kubernetes design.

You're going to tell me, "iXsystems should do all these extensive tests." They do, and they fix things, properly. But like everywhere on the planet, bugs exist, because we are humans. You also forget that this is free open-source software; if you want support and things to work perfectly, purchase a license, and you will be told not to use Scale yet, because it is not considered production ready. If you are not comfortable with that, you can always go back to Core and not face these issues, running your VMs and jails in a "stone age" way, so to speak.

The part I don't understand is why you spend all this energy complaining that iXsystems did not do this and that for 10 years, instead of taking action and aligning with the Scale reality of doing things.

Clearly there are some current issues with updates that are not as straightforward as expected. We're still chasing down whether these are bugs that broadly apply or specific issues for some systems.

However, our goal is to get quality to a level where updates don't require sophisticated expertise to diagnose and resolve. Unfortunately, we don't get there with .0 releases after multi-thousand-system beta (& RC) testing. Probably too few production systems are updating and stressing all corners of the update process. The update process is much harder to test in the lab than the standard operating modes and features.

We try to map out the general progression with this lifecycle description: https://www.truenas.com/docs/scale/gettingstarted/useragreements/softwaredevelopmentlifecyclescale/

There are now 10,000 systems using the SCALE 22.12.0 release.... so there are bound to be some issues uncovered.. especially given the hardware and application/use-case diversity that exists. We appreciate the bug reports and will be working through them asap for the .1 and .2 releases.

Daisuke · Jan 2, 2023

morganL said:
The update process is much harder to test in the lab than the standard operating modes and features.

I was expecting this; we face the same issues at work. People don’t realize the amount of work involved, yet it is very easy for them to complain.

danb35 · Jan 2, 2023

Daisuke said:
Stop using apps and you will never have any issues upgrading,

...and then I might as well have stayed on CORE. The Apps ecosystem is the only reason for me to be on SCALE; I have a Proxmox cluster for virtualization, and last I heard, CORE was more performant for the core file-sharing functions anyway. So (if I believe iX that CORE is still a going concern) if I didn't want to use the apps, there was no significant reason to move to SCALE in the first place.

Daisuke said:
users need to change their mentality if they want to use Kubernetes and start paying attention to iXsystems Release Notes, amongst other things.

Show me where the release notes, or anything else from iX, say that "there is a big chance things will break easily." Show me where they say, "you shouldn't expect upgrades between major versions to work." Because that's the position you appear to be taking and defending.

Daisuke said:
The part I don't understand is why do you spend all this energy complaining about iXsystems did not do this and that for 10 years, instead of taking action and aligning with the Scale reality of doing things.

What is this "Scale reality of doing things"? Where is it documented? Or if it isn't (and, well, it isn't, at least from iX), how did you conclude that it was the new reality? But as to why I didn't just give up as you recommended a while back, a few reasons:

I'm kind of stubborn (my wife would likely say very stubborn), and don't like giving up. I think this is a generally positive trait, though it can certainly be overdone.
There's a bug there, obscure as it may be, that's somehow triggered by the combination of this system configuration and this hardware. My hope was that iX would be able to track it down and fix it--obscure as it may be, it's unlikely I'm the only person who was or will be affected by it. And if I don't have the skill to track it down myself, at least keeping the system in a condition where the bug is manifesting itself, in order that someone else can find it, is a contribution to the project.
"Blow it away and start over, by hand, from scratch" is a nontrivial amount of work, even for a home deployment--it occupied the better part of yesterday (aggravated by general flakiness in the UI that wasn't there previously), and it isn't done yet. SCALE has made major changes to the GUI, largely for the good, but that means many settings are tucked away in very different places than they had been previously. Odds are it will never be exactly as it was, though likely with inconsequential differences.

But with all of that said, two weeks of the apps not working was too much.

Daisuke · Jan 2, 2023

danb35 said:
Show me where the release notes, or anything else from iX

I have nothing to prove to you. You go into the same tangent over and over again. Good luck, I’m done with this.

danb35 · Jan 2, 2023

Daisuke said:
You go into the same tangent over and over again.

It's hardly a tangent when you say:

Daisuke said:
users need to change their mentality if they want to use Kubernetes

and:

Daisuke said:
If they want to use Scale, they need to start developing these habits.

and:

Daisuke said:
aligning with the Scale reality of doing things.

(to quote only from your last post to me) ...all of which indicate that "users" (clearly including myself) just ought to know that there's a massive change in good operating practices with SCALE--which would have to mean that it's documented somewhere we should have known to look, because after all the only intuitive user interface is the nipple. It also ignores that the only actionable item you mention is to check the release notes before you upgrade, which is certainly good practice, but you know as well as I do that there's nothing in there about "upgrading will destroy your applications to the point that you need to do a clean install and rebuild from scratch." So, however rare my particular issue may be, there's nothing in the release notes that addresses it. If you want to toss out those sweeping assertions without being challenged on them or defending them, I guess that's your right, but it isn't really useful to any sort of conversation.

Oh, and contrary to:

Daisuke said:
you will be told not to use Scale yet, because it is not considered production ready.

iX say that it's ready for "general use", which they define as "Field tested software with mature features. Few issues are expected."

Software Status - TrueNAS Roadmap - Open Source NAS Software (www.truenas.com)

I'd equate that to "production", at least through the SMB level (though the word "production" doesn't appear anywhere on that page).

I realize I'm pushing back pretty aggressively. That's because what I'm hearing from you is "you should have known better," without your having the common courtesy to specify what I should have known better (other than what I've already admitted: that iX have a long and ignoble history of shipping releases with show-stopping bugs, and therefore not to trust a .0 release) or on what basis; with the further implication that whatever trouble I'm having is my own fault.
