Looking for Bluefin systems where Apps don't start

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
(repost from another discussion)

Adding a status note on TrueNAS SCALE Bluefin.... There are now about 15,000 Bluefin systems up and active. Most of iX's TrueNAS test systems have been running Bluefin very reliably.

Apart from the bugs identified in the release notes, the Apps have had the most reports of problems. So far there are 8 bug reports. We'd guess tens of systems are having problems.... but frankly we don't have a great way of collecting this data, apart from these forums.

Unfortunately, we have not seen a pattern in those existing reports nor have we been able to reproduce a case. We are not sure if there are one or more causes of the issues... and whether there are specific processors/hardware that cause issues. It may or may not be related to having multiple pools.

So, we have three requests:

1) Please report any case where Apps are failing to run. Make a comment here and start a new thread with hardware details; we'll see if there are any obvious problems and then ask for a bug report. We can also see more clearly how many systems have an issue. Finding a simple system with an issue would be useful.

2) If anyone can reproduce the issue on a SCALE VM, that would be very useful. We might be able to reproduce and isolate any software causes faster.

3) If anyone has time and spare hardware available, try building an Angelfish system, installing Apps and then updating to Bluefin. Please report success/failure and the hardware spec. The first person with a failure should start a separate thread.

Obviously, we'd like to chase down any issues like this ASAP. Thanks in advance for any help.
 

nadinio

Dabbler
Joined
Jul 9, 2017
Messages
22
Hey morganL, I've been unable to get apps started after a Bluefin upgrade, even after attempting several fixes suggested by the community. I've made a thread here with hardware specs.
 

Lipsum Ipsum

Dabbler
Joined
Aug 31, 2022
Messages
22
It may not be quite what you're looking for, but I had an issue just getting Apps section working, let alone any particular app within it.

First, I'm a veteran of IT, but relatively new to the TrueNAS ecosystem. I'm still fumbling my way around with some things. I've been following this thread as a sort of guide to installing and configuring Bluefin prior to a production build. The first time I encountered the bug, I had my storage pool set up as described in the initial post of that thread in the "Pools and Datasets" section, except I omitted the uranus dataset.

I've been able to recreate what causes the issue with no additional steps other than to cause the error. It happens both with my bare metal hardware and a minimal Hyper-V VM. I don't believe the exact config details are necessary to duplicate, but if you really need them, I can provide details.

Steps to reproduce
1. Fresh install of Bluefin using the TrueNAS-SCALE-22.12.0.iso. Accept defaults just to get the initial system up.
2. Create initial pool with one or more drives however you'd like. Name doesn't seem to matter.
3. Add a dataset. Again, name doesn't seem to matter.
4. Add a sub-dataset to the one created in step 3. Call it "ix-applications".
5. Go to Apps.
6. System prompts "Choose a pool for Apps". Select the pool you created in step 2.
7. An error is shown:
Error: [EINVAL] kubernetes_update.force: Apps have been partially initialized on 'default' pool but it is missing 'default/ix-applications/docker, default/ix-applications/k3s, default/ix-applications/k3s/kubelet, default/ix-applications/releases, default/ix-applications/default_volumes, default/ix-applications/catalogs' datasets. Specify force to override this and let system re-initialize applications.

I found the error message already reported in Jira, along with its code changes in the repo from a few days ago. I don't know what the normal timeline is for bug fixes to become officially released, but I'm impatient and didn't want to wait to have it fixed. I couldn't imagine this error was normal, expected, or hadn't been encountered with a clean install unless I was doing something wrong.

So....yeah, it was me. I shouldn't have been manually creating the ix-applications dataset in my step 4. I was the "dumb" user doing something I shouldn't do. However, as a software developer myself, I know that a decent number of bugs are encountered exactly by regular users doing unexpected "dumb" things.

From the Jira bug, it looks like the option to force a (re)install is being addressed. However, perhaps the system shouldn't let me manually create a dataset with that name to begin with. Protect us dumb users from our own ignorance, so to speak, by showing a validation error such as "reserved name" or similar instead, maybe with the option to force it if I really, really want to use it. Or, when I select the pool to use for Apps, tell me that the dataset/directory already exists but is empty and that I should remove it first.
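
To make the suggested pre-check concrete, here is a minimal sketch. It is not the TrueNAS middleware code; the function name and the idea of passing in a plain list of dataset names are my own assumptions for illustration. The child dataset names are taken from the error message in step 7.

```python
# Hypothetical pre-check, NOT the actual middleware implementation.
# Child datasets listed in the error message above.
EXPECTED_CHILDREN = (
    "docker",
    "k3s",
    "k3s/kubelet",
    "releases",
    "default_volumes",
    "catalogs",
)


def missing_app_datasets(pool, existing):
    """Return the expected ix-applications children that are absent.

    `existing` is any iterable of full dataset names on the system.
    An empty result means either Apps were never set up on this pool
    or every expected child dataset is present.
    """
    existing = set(existing)
    root = f"{pool}/ix-applications"
    if root not in existing:
        # No ix-applications dataset at all: nothing is partially initialized.
        return []
    return [f"{root}/{child}" for child in EXPECTED_CHILDREN
            if f"{root}/{child}" not in existing]


if __name__ == "__main__":
    # My situation: the dataset was created by hand and is empty.
    datasets = ["default", "default/ix-applications"]
    for name in missing_app_datasets("default", datasets):
        print("missing:", name)
```

A non-empty result at pool selection time would be the moment to tell the user that the dataset exists but looks hand-made or incomplete, rather than failing later.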

Anyway, I figure people wiser than me can figure out what the best course of action is, if any. I consider my instance of this issue fixed. Maybe it can help someone else out like me who didn't know better.

If you need anything further, let me know.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Thanks.

It's an awkward problem. There is little/no restriction on what datasets are called... because the system has no prior knowledge of what the dataset will be used for. It would be a valid name for an SMB dataset.

We can only really improve error messages....or improve documentation to avoid this error.

The bugID mentioned is probably a partial fix.
 

Unburned3156

Cadet
Joined
Jan 21, 2023
Messages
7
Hey morganL, I just found this thread. I have logged Jira job NAS-119965 for this. Logs are attached to that.

I've logged thread https://www.truenas.com/community/t...ers-kubernetes-service-is-not-running.107164/

Reddit post: https://www.reddit.com/r/truenas/comments/10i5lmi/issues_installing_apps_and_creating_containers/

My system could be described as simple, I guess. It is a home NAS, mainly for stuffing around / learning on.
I don't mind assisting further to help resolve this as long as I can keep my data through the process (if it is lost it's not the end of the world, I have a copy of the majority until I get this sorted).
I do not have any spare hardware though unfortunately.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
For running Apps, we generally recommend more than 8GB of RAM

Not sure if this is causing an issue, but we don't test with only 8GB. Can you upgrade?

 

Unburned3156

Cadet
Joined
Jan 21, 2023
Messages
7
It definitely could be; I've planned to upgrade it anyway. I will try putting in 32GB of memory from my main PC and see if that makes a difference. I'll report back.
 

Unburned3156

Cadet
Joined
Jan 21, 2023
Messages
7
OK so I installed 8GB more RAM, but I still had the same message. I have added some notes to my Jira job https://ixsystems.atlassian.net/browse/NAS-119965 as it seemed after restarting the k3s service I get further. Will continue to update.
 

KrisH

Cadet
Joined
Jan 27, 2023
Messages
2
Hi,

Wanted to chime in. I have quite a lot of k8s and k3s experience, and have used TrueNAS SCALE since the initial release of 22.02.

With the upgrade, I wanted to set up things myself, as truecharts wasn't always the greatest experience.

I installed metallb using helm. It works: I can create services that get a usable IP. The services show up, so traefik, also installed using helm with a service of type LoadBalancer, also gets an external IP (10.0.1.5):
NAMESPACE        NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
default          kubernetes                ClusterIP      172.17.0.1       <none>        443/TCP                      20h
kube-system      kube-dns                  ClusterIP      172.17.0.10      <none>        53/UDP,53/TCP,9153/TCP       20h
metallb-system   metallb-webhook-service   ClusterIP      172.17.196.240   <none>        443/TCP                      19h
traefik          traefik                   LoadBalancer   172.17.209.7     10.0.1.5      80:41068/TCP,443:14972/TCP
After a reboot only the ClusterIP services "survive":
kubectl get svc -A
NAMESPACE        NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
default          kubernetes                ClusterIP   172.17.0.1       <none>        443/TCP                  20h
kube-system      kube-dns                  ClusterIP   172.17.0.10      <none>        53/UDP,53/TCP,9153/TCP   20h
metallb-system   metallb-webhook-service   ClusterIP   172.17.196.240   <none>        443/TCP                  20h
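
For anyone who wants to check the same thing on their own box, a small sketch that compares two saved listings and reports which services vanished. The helper name is made up for this post, and it assumes the default column layout of `kubectl get svc -A`, where the first two whitespace-separated fields are the namespace and the service name.

```python
def service_keys(listing):
    """Return the set of (namespace, name) pairs in a `kubectl get svc -A` listing."""
    keys = set()
    for line in listing.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "NAMESPACE":
            continue
        keys.add((fields[0], fields[1]))
    return keys


# The two listings from this post, before and after the reboot.
before = """
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 172.17.0.1 <none> 443/TCP 20h
kube-system kube-dns ClusterIP 172.17.0.10 <none> 53/UDP,53/TCP,9153/TCP 20h
metallb-system metallb-webhook-service ClusterIP 172.17.196.240 <none> 443/TCP 19h
traefik traefik LoadBalancer 172.17.209.7 10.0.1.5 80:41068/TCP,443:14972/TCP
"""

after = """
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 172.17.0.1 <none> 443/TCP 20h
kube-system kube-dns ClusterIP 172.17.0.10 <none> 53/UDP,53/TCP,9153/TCP 20h
metallb-system metallb-webhook-service ClusterIP 172.17.196.240 <none> 443/TCP 20h
"""

lost = service_keys(before) - service_keys(after)
print(sorted(lost))  # [('traefik', 'traefik')]
```

Saving the output before and after a reboot and diffing it this way makes it easy to attach the exact list of lost services to a bug report.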

My next step would be to see if truecharts would make a difference here, but I don't get why I should.

Seems pretty undocumented and unexpected behaviour.
 

browntiger

Explorer
Joined
Oct 18, 2022
Messages
58
Just like we do everywhere else, pull the k3s_daemon and kube_router logs and post the real errors. Without them we will be guessing...

>I installed metallb using helm.
No, I get it; I come from a similar background (not just SCALE). But I think this may be a touch crazy.

TrueCharts did an insane amount of work with countless apps. While I can easily pull my own on bare metal, would I want to?

If you are not crazy about their setup of a particular app (Traefik for me), fork / hack / pull your own.
If you are missing an app, or need it with your own features or an upgrade to "Enterprise Edition", it takes only seconds to modify truecharts.
And their stuff is fairly flexible for 99% of cases.

But if you do like bare metal k8s, go ahead and give us the logs.
 

Trexx

Dabbler
Joined
Apr 18, 2021
Messages
29
@morganL I haven’t submitted logs yet, but I think some of the problems may stem from people who have chosen to follow the recommendation to set up the system using the “admin” account vs. the standard “root”. Today, for example, I was having a problem with a truecharts application, and trying to get to the shell resulted in the following error:

‘Unable to read /etc/rancher/k3s/k3s.yaml —permission denied’.

This also occurred when trying the same thing on the iXsystems Plex package.

This seems to be a relatively common issue I am seeing in several areas. The recommendation for the change from root to admin has been made, but the platform hasn’t been fully retro-fitted in all areas to make that truly usable.
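
If you want to confirm whether your own login hits the same wall, here is a minimal sketch that only tests read access for the current user. The path is the one from the error above; the function name is just for illustration, and this says nothing about how the middleware itself opens the file.

```python
import os

# Path taken from the error message above.
K3S_CONFIG = "/etc/rancher/k3s/k3s.yaml"


def can_read(path):
    """Return True if the current user is allowed to read `path`."""
    return os.access(path, os.R_OK)


if __name__ == "__main__":
    if can_read(K3S_CONFIG):
        print(f"{K3S_CONFIG} is readable by this user")
    else:
        print(f"{K3S_CONFIG} is missing or not readable by this user")
```

Running it once as admin and once as root should show whether the difference is purely a file permission issue.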
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
There is a bug report NAS-119374 on that issue... not sure if it causes other symptoms.

 

KrisH

Cadet
Joined
Jan 27, 2023
Messages
2
The code that randomly deletes services and other stuff has already been removed, and will hopefully make it in the next .1 https://github.com/truenas/middleware/commit/9efe2a74d5cec9539f3c434a564e8f594c304da8

Then there are helm reinstalls each time to work around this issue for truecharts and built-in applications.

Cue the theme song https://www.youtube.com/watch?v=UKTNWI0eYJ4

It's often easier to stick to the vanilla charts/etc and the default k8s behaviour/expectations of consumers than to understand the black magic that truecharts/middlewared adds.
 

truecharts

Guru
Joined
Aug 19, 2021
Messages
788
>The code that randomly deletes services and other stuff has already been removed, and will hopefully make it in the next .1 https://github.com/truenas/middleware/commit/9efe2a74d5cec9539f3c434a564e8f594c304da8
>
>Then there are helm reinstalls each time to work around this issue for truecharts and built-in applications.

There are no services being "randomly" deleted from TrueCharts Apps.
The linked PR has no impact on anything related to TrueCharts Apps (one way or the other).

There is simply no reinstall required, nor is anything being "reinstalled".

>Cue the theme song https://www.youtube.com/watch?v=UKTNWI0eYJ4
>
>It's often easier to stick to the vanilla charts/etc and the default k8s behaviour/expectations of consumers than to understand the black magic that truecharts/middlewared adds.

Our Apps are 99% normal helm charts, that are even tested in "normal" helm clusters.

However, it is very well possible that more "random" deletes will happen in the future as well. Because.. well... direct Helm access is simply not an advertised feature of TrueNAS SCALE ;-)
 

3titter

Cadet
Joined
Feb 12, 2023
Messages
1
I did a clean-slate install of Bluefin and have been re-installing the apps I had on the Angelfish system previously. All work except for Jellyfin, which gets hung up at 'deploying'. When I check the events, it seems it's failing to pull the docker image from truecharts. This has not been a problem for the other truecharts apps I've installed, and nothing I try to change seems to work. Also, the install no longer shows the iGPU on my AMD CPU for use in HWA in Jellyfin.
 

truecharts

Guru
Joined
Aug 19, 2021
Messages
788
We always suggest you file a support ticket with our support staff on discord if you need any help with our Apps. :)
 

TinyWorkshop

Dabbler
Joined
Jul 14, 2022
Messages
40
I was having some serious issues with apps not being able to write to the host, and also not starting at all.

In my case the problem was setting up the root user using the GUI option during installation. I think it created a root user without all the permissions, which also messed up the dataset created with that installation.

I reinstalled with the default root option (the not recommended one :tongue: ) and now it works perfectly.
 

efalsken

Cadet
Joined
Feb 17, 2023
Messages
8
I've got one. I had Bluefin running just fine for months on a bonded ethernet connection. I moved the server to a new location and deleted the bond (same primary IP), and now none of the apps start and k3s is failing. Combinations of rebooting and unsetting/resetting the pool did not help.
 

patrick339

Cadet
Joined
Feb 19, 2023
Messages
2
I have encountered these problems on different servers, and I have reinstalled several times. Every time it is a new installation, a new pool (no import) and a new dataset (ix-applications).

The process is: install Apps (including TrueCharts apps), disable Host Path Safety Checks, add an SMB path and ACL to share the apps' mount data, then change the system dataset to boot-pool.
After a reboot the problem comes back. Here are the alerts:
Failed to start kubernetes cluster for Applications: [EFAULT] Failed to configure PV/PVCs support: Cannot connect to host 127.0.0.1:6443 ssl:default [Connect call failed ('127.0.0.1', 6443)]
* Glusterd work directory dataset is not mounted.
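
The "Cannot connect to host 127.0.0.1:6443" part of the alert can be checked independently of the UI. Below is a minimal sketch that only tests whether something accepts TCP connections on that address. The helper name is invented for this post, and a successful connect does not prove the Kubernetes API is healthy, only that the port is listening.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address and port taken from the alert text above.
    if port_open("127.0.0.1", 6443):
        print("something is listening on 127.0.0.1:6443")
    else:
        print("nothing is accepting connections on 127.0.0.1:6443")
```

If this reports nothing listening after a reboot, the k3s service most likely never came up, which matches the alert.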
 