Missing Pool and Datasets after SCALE 22.02.0 > SCALE 22.02.0.1

ian351c

Patron
Joined
Oct 20, 2011
Messages
219
Hello all,

I've been running TrueNAS in various incarnations for over 10 years, and I have finally run into a strange data loss issue that I could really use some help with. Today I upgraded from SCALE 22.02.0 to SCALE 22.02.0.1. After the reboot, my NAS took over 15 minutes to come back up: it sat at "A start job is running for Import ZFS pools" for the full 15m8s. Once I could log into the system, I found that my largest pool (imaginatively named "tank") was missing. In the GUI I can see all the disks for the pool just fine, but the pool itself shows only a single line, "OFFLINE". I am able to import the pool from the command line ("zpool import tank"), which works fine, and zpool status shows no errors. The GUI then shows the pool; however, most of my datasets are missing and only one dataset is visible. According to the GUI, the data itself is (probably) still there, since the pool is still 79% full. Also according to the GUI, all my snapshots are still there, though I cannot access /mnt/tank/.zfs from the CLI.
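For reference, the manual import and check were just:
Code:
zpool import tank
zpool status tank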

Update: after rebooting back to 22.02.0 and praying...
So the boot process still takes the full 15m8s. However, the tank pool does mount and it appears all my data is present. I have completed an overnight scrub of the pool successfully with no errors. All of my installed apps are gone (I suspect this is a result of the long boot time rather than the apps actually being deleted, since they live on a different pool).

tl;dr or What I need help with right now:
- How do I find out why the NAS is still taking 15m8s to boot?
- How do I restart the Apps environment once the NAS is up (if necessary)?

Thanks!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Maybe start by looking at dmesg or at /var/log/messages (cat it, or use whatever viewing method you want for that file).
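For example, from a shell on the NAS:
Code:
dmesg | less
less /var/log/messages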
 

ian351c

Patron
Joined
Oct 20, 2011
Messages
219
Thanks @sretalla. Since this is happening with whatever middlewared is doing during boot, I don't think /var/log/messages or dmesg would help much. Although if you have any hints on what to look for, I'm all ears.

On the flip side, you inspired me to look around a bit and in /var/log/middlewared.log I am seeing this error:
Code:
middlewared.service_exception.CallError: [ENOENT] Pool 12222627561934414124 not found.

This appears about 15 minutes after the previous entry in the log, at roughly the same time the system becomes accessible after a reboot. None of my pools have that GUID, so it appears that TrueNAS is trying to mount a pool that doesn't exist.
Hopefully there's a way to tell TrueNAS not to mount this pool at boot...


EDIT:
I take that back. That GUID is for one of my pools! However, once the timeout has occurred and I am able to log into the system, that pool is mounted just fine. So I'm not sure why my NAS spends 15 minutes telling me it can't mount a pool that is actually there, and then finishes booting with the pool mounted just fine.
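(For anyone else chasing a GUID: the GUIDs of the imported pools can be listed with the command below, which is how the number from the log can be matched to a pool.)
Code:
zpool get guid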
 
Last edited:

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
tl;dr or What I need help with right now:
- How do I find out why the NAS is still taking 15m8s to boot?
It's a wild guess, but given my own experience/issues with the NAS booting up, it might be the fact that there are a lot of snapshots on the pools imported during the boot-up process. In my case it's a multi-hour boot-up procedure, but my system has over 45k snaps on the ix-applications datasets, automatically created by TrueNAS/Kubernetes/Docker. Earlier, when there were something like 50-200 snaps, the reboot was brisk; now it's a multi-hour/multi-day process (technically the CLI on the screen shows the 7-option menu earlier, within 2 hours after reboot, but the start-on-boot services and pods/apps are not available/running for "a few more" hours). If you also start to suffer from this correlation between a large number of snaps and a really slow boot-up, please consider speaking up in the snapshots issue thread and/or the housekeeping-needed thread and, especially, adding your vote to the JIRA ticket/suggestion.
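If you want to check whether you're in the same boat, a rough snapshot count for the ix-applications tree can be taken with something like this (replace <pool> with your pool name):
Code:
zfs list -H -t snapshot -o name -r <pool>/ix-applications | wc -l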
 

ian351c

Patron
Joined
Oct 20, 2011
Messages
219
@mkarwin That appears to be my case as well, or at least similar. I had recently set up a couple of replication tasks to back up my permanent storage (the dataset where I keep all my bind mounts for app storage) and the ix-applications dataset. It was at about 6500 snapshots when I updated to SCALE 22.02.0.1. However, just deleting the snapshots wasn't enough; I had to delete the datasets (not the originals, but the backups; see the sketch just below). Now my NAS boots in a reasonable amount of time and all the pools are mounted. The issue I have now is that Apps > Installed Applications in the GUI gives me "Applications are not running". And I have a "pool.import_on_boot" task running from three reboots ago that I can't seem to get rid of: it shows up in the Task Manager, but doesn't show up in the job list from "midclt call core.get_jobs".
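For anyone needing to do the same, deleting a backup dataset together with all of its snapshots boils down to a recursive destroy along these lines (the dataset name is only an example, not my actual path; do the -n dry run first):
Code:
zfs destroy -rvn tank/backup/ix-applications   # dry run: list what would be destroyed
zfs destroy -rv tank/backup/ix-applications    # actually destroy the backup dataset and its snapshots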

Anyone know how to tell SCALE to start/restart its Apps infrastructure? I'd rather not just start k3s and docker from the shell, as that might confuse the GUI even further.

EDIT: Also, I have been following the snapshots saga. I just didn't think it affected me until just now. I have voted for and subscribed to the Jira ticket.
 

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
@ian351c I also had this issue with "Applications are not running", and ended up using some cli-fu provided in the TrueCharts Discord. Otherwise, I had to reboot a few times or leave the "import_on_boot" job going for the said hours before things auto-fixed themselves or clicked into place "fast" enough to show up/report in the WebUI on their own. That's basically what the teams suggested as well: just leave it running and wait for results... But that means in my case the full boot-up to a somewhat usable state takes hours, and that's without any manual filesystem/dataset/snapshot cli-fu being employed that could actually hurt the machine. The destroy action on a dataset automatically removes the associated snaps, but I believe Kubernetes/docker then attempts to use existing/saved snaps as the base for deployments, so these are presumably also checked for presence upon start, and then the process might hang until the timeout happens.
I am not sure that the snapshots issue is the one impacting your case predominantly, but I bet it has something to do with your problems one way or another...
 

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
@ian351c
FYI, while it's not an official solution: some SCALE users have suggested running
Code:
docker image prune --all --force --filter "until=24h"
once in a while to "manually" clean the system of any unneeded images, and some reported success in improving the situation just from running
Code:
docker container prune --force --filter "until=24h"
once in a while (though if you want to combine both, you need to prune containers first and then images, and then run
Code:
docker volume prune --force
because volumes do not understand the "until" filter). I went with a docker-specific but more combined/thorough solution...
I've used
Code:
docker system prune --all --force --volumes
(unfortunately, when cleaning volumes within this command I can't use the "until" filter, for the same reason as with the volume-specific prune) in a daily cron job (though I think I'll move it to something like a bi-weekly or monthly job after a testing period). Sure, it's a workaround: it's rough around the edges, it doesn't allow combining the volume cleanup with a date filter, and most importantly it somewhat "breaks" the applications' instant rollback capability (TrueNAS seems to tie the "previous entries" for an app to their on-filesystem/on-dataset sets of snapshots/clones, which are removed by the docker prune along with the previous pod versions). But I can kind of live with that, as I usually test app (re)deployments between 16:00 and 24:00 and have the daily prune run set for 03:00, so there's plenty of time to test and then roll back if needed before the prune runs.
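(For reference, the crontab line for that daily 03:00 run is essentially the following, whether it's added via the SCALE Cron Jobs UI or by hand:)
Code:
0 3 * * * docker system prune --all --force --volumes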
Of course, one could also chain a full stack of docker prune commands for each subdomain (container, image, volume, network, build cache) and use the best options/switches for each to clean more safely and with finer granularity, but I went with a single basic command instead. Whether you prefer the single (though limited) command or a combo of domain-specific ones is up to you, of course; a rough sketch of the chained variant follows.
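An untested sketch of that chained approach (containers first, then images, then the rest; the filters are just examples):
Code:
docker container prune --force --filter "until=24h"
docker image prune --all --force --filter "until=24h"
docker network prune --force --filter "until=24h"
docker builder prune --all --force
docker volume prune --force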
In effect, with my single daily system prune in place:
  • all the stopped/finished containers have been removed/deleted, and restarts of the remaining active containers/pods are much snappier again (back to how it was); the overall reported container count dropped from 2k+ to roughly 100...
  • thousands upon thousands of snaps, clones and filesystems/volumes have been removed along with the containers; I'm down from 46.6k to 0.6k snaps, and in storage terms that's nearly 100 GB freed...
  • hundreds of hanging/older image versions have been deleted; I'm down from well over 2k to fewer than 20 now...
  • network routing to the cluster and to/from the respective containers has also improved...
  • the docker build caches take dramatically less storage, down from over 3 GB used to less than 1 GB...
  • my CPU now reports under 10% usage at *idle* (idle meaning no active compute from running containers), with temps quickly dropping to around 40°C at idle; previously, even at *idle*, the CPU in my server hovered around 70% usage with temps at a similar number in °C...
  • my reboot times are now something like 15-20 minutes total (as in, from reboot commencing to all services and all apps up and running) instead of hours earlier... furthermore, I don't have to run through a few reboots to get apps into a working state; they are no longer reported as missing on the first reboot ;)
Overall, I think I'm going to be a somewhat happier user with this primitive workaround until a proper, smarter approach is offered in TrueNAS SCALE (I'm imagining something like an option to mark a container/pod as test, dev, or prod, so you could e.g. keep the containers and snapshots for debug analysis, or have them pruned on container failure or after a few hours/daily for already-tested and presumably working PROD ones). Docker support's stance is basically that this is by-design behaviour, to allow analysing volumes/layers of failing/broken containers (which I honestly did, and still do, use a lot when building/testing my own docker images against my own docker repository), and that any housekeeping has to be organised elsewhere. The TrueNAS team's stance currently remains that this is a docker design issue (and they are right about that; it's how docker's developers conceived it, more for apps started/stopped ad hoc than for those running for longer periods) and that docker simply doesn't clean up properly after itself in the long run. So I think this will do for now, until a better solution (periodic housekeeping of leftover docker/kubernetes rubbish) is devised/offered in some future version of TrueNAS.
I've also replaced all the aliases for
Code:
docker run
with
Code:
docker run --rm
for any stable docker image/container for my users (to reduce the count of stopped/finished containers hanging around in docker's lists, and to reduce the chance of my users generating noise/rubbish from impromptu/ad hoc container runs), and left the regular, non-self-cleaning docker run command for the small subset of build/test deployments kept for debug purposes.
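(For example, something along these lines in the users' shell profile; the alias name is just illustrative:)
Code:
alias drun='docker run --rm'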
Hopefully my set of workarounds will help others.
Please bear in mind that this workaround clears anything docker created for containers that are not currently running. So if you have some stopped app that you only start now and then, it needs to be running when the prune command analyses the containers; otherwise the containers/images/snaps/volumes/networks docker created for it will get purged. I currently have only 2 docker images (and corresponding pods) that I build from scratch/source in my CI/CD pipeline for one-time, fairly short-lived docker apps that are stopped most of the time; all other apps are constantly running/in use, so this solution works for me. But your mileage may vary...
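(To preview what a prune would sweep up, you can list the stopped/exited containers first:)
Code:
docker ps --all --filter "status=exited"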
 
Last edited:

ian351c

Patron
Joined
Oct 20, 2011
Messages
219
Thanks for the detailed reply @mkarwin! I saw the same update go into the Jira ticket for the many (many) snapshots issue. ;-) I tried out the commands (you might want to edit "images" to "image" in the first one for others that find this thread later) and while I didn't get any space back, I did delete a bunch of volumes and containers (but not images). I haven't tried rebooting yet, but I'll cross that bridge when I come to it.

EDIT:
So I tried rebooting... and I'm observing the same behavior I had before trying your commands. For now I'm using a magic incantation I put together that solves my particular issue with the GUI telling me that Apps aren't running:

Code:
# make the kubelet dataset mountable with mount(8), then mount it where k3s expects it
zfs set mountpoint=legacy apps/ix-applications/k3s/kubelet
mount -t zfs apps/ix-applications/k3s/kubelet /var/lib/kubelet
# do the same for all of the app PVC datasets
zfs set mountpoint=legacy $(zfs list -o name | grep ix-applications | grep pvc)
# restart the Apps (kubernetes) service through the middleware
midclt call service.stop kubernetes
midclt call service.start kubernetes


I put this together because I was (and still am) seeing this error in /var/log/middlewared.log on boot:
Code:
<Task finished name='Task-3692' coro=<Middleware.call() done, defined at /usr/lib/python3/dist-packages/middlewared/main.py:1312> exception=CalledProcessError(1, ('mount', '-t', 'zfs', 'apps/ix-applications/k3s/kubelet', '/var/lib/kubelet'), b'', b"filesystem 'apps/ix-applications/k3s/kubelet' cannot be mounted using 'mount'.\nUse 'zfs set mountpoint=legacy' or 'zfs mount apps/ix-applications/k3s/kubelet'.\nSee zfs(8) for more information.\n")>


I have my own ticket going for my issue (which appears different, but prolly related to the too many snapshots issue): NAS-115457.
 
Last edited:

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
Thanks for the detailed reply @mkarwin! I saw the same update go into the Jira ticket for the many (many) snapshots issue. ;-) I tried out the commands (you might want to edit "images" to "image" in the first one for others that find this thread later) and while I didn't get any space back, I did delete a bunch of volumes and containers (but not images). I haven't tried rebooting yet, but I'll cross that bridge when I come to it.

(...)

I have my own ticket going for my issue (which appears different, but prolly related to the too many snapshots issue): NAS-115457.

Thanks for pointing that out. Yeah, I've fixed this small code error. Fortunately, the system prune is correctly spelled ;)
Just FYI, this is not a complete solution; in my case about 500 snaps still remain, some of which are the ones from the Data Protection feature, i.e. the automated daily snaps that are cleared automatically. There are still droves of snaps, seemingly taken on reboot/kubernetes restart, that are non-removable: attempting to zfs destroy them by hand returns a "snapshot has dependent clones" message (these live on <pool>/ix-applications/<docker>/<dockerimagelayerfilesystem>). Others are removable (these live on the <pool>/ix-applications/release/<appname>/<numberedversion> subdatasets). Both kinds are snaps taken by docker/kubernetes for apps in the environment. The latter refer to specific (historical) versions of the app deployments used on the machine/server (I've had more than 50 release snaps for some of the deployed apps); they are not in use per se, but are the "recipes" used by the Installed Applications -> App -> Rollback functionality, if I recall/understand correctly. I'm not sure which docker/kubernetes mechanism manages this, but over time even a handful of running apps will grow the number of these two types of snaps to over a hundred, or easily over 500. It's still a reduction compared to the multiples of thousands before the daily docker prune runs, but these are hundreds of snapshots taken not by user-managed automation but by the Kubernetes/docker tooling itself.
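(To see which clones are holding a given snapshot, something like this helps; <pool> is a placeholder for your pool name:)
Code:
zfs list -t snapshot -o name,clones -r <pool>/ix-applications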
Perhaps in the future the TrueNAS SCALE dev team will offer an intelligent cleaning utility to take care of these automatically in a smart manner, or give an option to trim the creation of these in the app deployment forms. Sure, they're not taking up a lot of storage per se, thanks to the nature of snapshotting, but these are still hundreds of snapshots that should be better managed by the system.
It's great that you've created your JIRA ticket; it should help narrow down and fix the issues we're hitting by letting the devs know about them, allowing them to manage their time and resources, and ultimately helping them prepare the product for a commercial/enterprise release.
 

emanuelx

Cadet
Joined
Apr 26, 2022
Messages
5
I have exactly the same issue: the middleware is taking 15 minutes to run. I'm not sure, but my NAS is also stuck for a long time on this error message:

error: serial port `com0' isn't found.
error: terminal `serial' isn't found.
error: terminal `serial' isn't found.

https://imgur.com/ausaJfn
 