SOLVED [SCALE] Mass cleanup/removal of ix-applications snapshots solution required

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
As I've explained in another post/thread here, there is an issue with snapshots being created for the ix-applications dataset and its child datasets. In under a year of moderate-to-light use, my TrueNAS Scale server has accumulated over 47 thousand snapshots. Yeah, you read that right - 47502 snapshots, of which the boot pool snapshots taken on upgrade/update account for 5 entries, and roughly 48 are correctly managed through the Data Protection functionality I've set up. The rest, roughly 47450, were created automatically by TrueNAS/TrueCharts/docker/kubernetes. Because of them, all my storage-related tasks crawl to a halt, take ages, or simply time out.
How do I get rid of these, and then get rid of these on a recurrent basis/fix this?
As it is, I cannot delete them by hand by clicking them under snapshots and choosing delete - even if I go through the manual process and choose 100 items per page, that's still 475 such pages to click through. It's not just tedious, it's abhorrent ;) And it's nigh on impossible anyway, due to multiple errors/warnings such as:
Code:
Cannot destroy <pool>/ix-applications/docker/<sha256_like_string>@<some_numbered_snapshot>: snapshot has dependent clones

upon attempting to manually remove those.
So please provide a better, possibly automated, solution to deal with these in a proper way - one that would become part of the general solution in this platform. This is not something that would have impacted TrueNAS Core users, because it seems to be related to how TrueNAS Scale uses docker/kubernetes for its applications and the impact that has on the pool containing the applications.
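For anyone wanting to see where their own snapshots pile up, here's a rough one-liner sketch (the pool name is a placeholder) that counts snapshots per dataset:
Code:
# Count snapshots grouped by dataset, busiest first; <pool> is a placeholder.
zfs list -H -t snapshot -o name -r <pool> | cut -d@ -f1 | sort | uniq -c | sort -rn | head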
 

Attachments

  • zfs-snapshots.txt
    7.1 MB · Views: 284

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
I do agree with you about needing some better way of managing this from the GUI. However, there is a way to delete the snapshots automatically from the command line.

This answer on ServerFault has some pretty good suggestions. I'd strongly recommend taking a backup before going crazy with commands, and properly testing things using the "echo" example at the end of the post.
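The gist of the "echo" trick, sketched here with a placeholder dataset path:
Code:
# Dry run: prints one destroy command per snapshot instead of executing it.
zfs list -H -t snapshot -o name -r <pool>/ix-applications | xargs -n 1 echo zfs destroy
# If the output looks right, drop the "echo" to actually destroy. Snapshots with
# dependent clones will still refuse to go, as noted above.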

 

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
Yeah, I saw it. In fact, I had a similar piped one-liner (from my good old days of Oracle Solaris trainings :wink:) almost ready to run to manually destroy those snaps, but then I thought: what if I brick the thing or otherwise compromise the middleware? Plus it's mostly a one-time solution that might not suit how Scale has been organised to work. There's also that general ZFS script I've mentioned in another thread, offered in those ServerFault answers - I got to it through an old TrueNAS Core forum thread about manual/periodic snapshots not being removed despite a 14-day limit set in the UI. Still, that script only lists about 25k of the 45k+ snaps when scanning for ones older than 2 weeks, even though my manual scans show a number closer to 40k. Sure, the 20k remaining after such a script destroys those snaps is a lot less. But this is not something that has been tested against TrueNAS Scale specifically and how it works under the hood. It hasn't been vetted by the TrueNAS Scale team, nor presumably verified by Scale users/testers. As it stands, with last weekend's upgrade having taken so long, I cannot afford right now to risk bricking the whole server while running a compute project. Perhaps next weekend... once some TrueNAS Scale devs/testers/users with available "test" machines vet one of the solutions offered in this outside/3rd-party source?
Heck, maybe if enough people vote for my JIRA ticket/suggestion we'll get an official solution soon?
 

Ixian

Patron
Joined
May 11, 2015
Messages
218
I've been concerned about this for some time, given the long-standing issues with Docker's ZFS storage driver (short version: this has been a known problem since 2019 for any Docker volume service using Docker's ZFS storage driver, and the Docker team's response amounts to "some other service should clean up the snapshots, that isn't a Docker problem").

It's a very real problem that is going to start hitting a lot more users now that SCALE is GA.

Please register and vote for the issue @mkarwin reported here: https://jira.ixsystems.com/browse/NAS-115386

That is the only real way this will get the attention it needs.
 

truecharts

Guru
Joined
Aug 19, 2021
Messages
788
Docker Prune should solve most of these issues.
 

Ixian

Patron
Joined
May 11, 2015
Messages
218
Docker Prune should solve most of these issues.

No, it doesn't. If you read up on the problem with the Docker ZFS driver you'll understand why.

Docker specifically - and apparently, intentionally - doesn't clean up the snapshots, even though it creates them. This is different behavior from their other storage drivers like Overlay2.

It will prune the volumes and containers themselves, which actually makes the problem worse.
 

truecharts

Guru
Joined
Aug 19, 2021
Messages
788
No, it doesn't. If you read up on the problem with the Docker ZFS driver you'll understand why.

Docker specifically - and apparently, intentionally - doesn't clean up the snapshots, even though it creates them. This is different behavior from their other storage drivers like Overlay2.

It will prune the volumes and containers themselves, which actually makes the problem worse.

Well, yes and no.
It's not that simple.

The snapshots are mostly created by 3 things:
- Manual snapshots (a big no-no on ix-applications)
- SCALE Apps backups (either through truetool and/or automatically during SCALE upgrades)
- The ZFS docker driver when there are, somewhat simplified, relationships between layers.

Docker prune does completely clean up "orphan" datasets, but if there are relationships it will not remove them.
However, when it does clean them up, it takes all their snapshots along as well.

We've had multiple users test this behavior, and many reported hundreds, thousands, or even ten times that many snapshots being nuked from the system with docker prune. So we are quite confident that it works out pretty well.

Is it a 100% solution? No.
But it solves the problem well enough to keep users from hitting issues caused by sheer snapshot count. However, it's not a long-term solution and not enterprise-stable.

Also, on another note:
It's pretty rude to tell people who have actively worked on this problem for many, many hours (including discussions with iX staff and work on fixes for related issues) to go "read up" on something, without at least trying to assess what they know.

For example, we worked on solving issues where the backup solution for SCALE Apps, when used regularly, made this problematic behavior about 2-3 times worse, and on how that relates to the prune command.
 

Ixian

Patron
Joined
May 11, 2015
Messages
218
Don't get bent out of shape over it; sometimes a cigar is just a cigar, i.e. I didn't know you had already looked into it, so I suggested you should - no rudeness implied. Your original one-sentence answer was pretty vague, so I made a simple judgement call. Your reply is a much better explanation.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Upvoted! Why are there so many snapshots in the first place?
 

truecharts

Guru
Joined
Aug 19, 2021
Messages
788
Upvoted! Why are there so many snapshots in the first place?
@Ixian pretty much explained it: it has to do with the docker ZFS driver, and we added that there are a few more complications.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
So in short: It is primarily a Docker issue.
Which means that iXsystems is unlikely to spend much time fixing a "low" priority JIRA ticket.
And according to @Ixian the Docker team is of the opinion that this is a ZFS issue and that someone else must clean up their mess.

Which leaves these options:
  • Live with it.
  • Hope that SCALE will eventually move to another container engine than Docker.
  • Drop SCALE and run Docker on a non-ZFS platform (Ubuntu Server, Alpine Linux, etc.).
 

HarryMuscle

Contributor
Joined
Nov 15, 2021
Messages
161
Does anyone have information on what event triggers Docker to create a snapshot? Is it for example starting or stopping a container?

Thanks,
Harry
 

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
Does anyone have information on what event triggers Docker to create a snapshot? Is it for example starting or stopping a container?

Thanks,
Harry
From what I've heard in docker training at work, each docker image layer triggers snapshot creation, and from some of these a clone is made to apply the next layer, for every image layer built when docker starts. So basically, a start triggers a mountain of snapshots for heavier/multi-layered docker images. Furthermore, on stop another set of snapshots may be taken for security reasons. The same happens on pod and, of course, container updates - all the currently existing filesystems, including the snapshots cloned into filesystems to make them RW, are snapshotted again for security (i.e. rollback) reasons before the new version builds its own image, layer by layer, using the ZFS storage driver. And whenever a container needs to write data to a file that exists in any layer below the rewritable uppermost one, a snapshot is again taken before the block write is committed - so if your app updates/creates files living in the layered structure rather than on "mounted external" (from the container's PoV) storage, that may also trigger a snapshot. At least that's more or less what's pointed out in the docker docs as well. All in all, there are multiple occasions on which snaps/clones are triggered, but the docker team puts the maintenance of said snaps onto sys-admins/platform-admins (it's stated as:
Code:
The btrfs and zfs storage drivers allow for advanced options, such as creating “snapshots”, but require more maintenance and setup. Each of these relies on the backing filesystem being configured correctly.
). So perhaps we only need some ZFS-side restrictions, or a recurring housekeeping job on app (re)deployments, removals, etc.
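To see how many of those layer snapshots a SCALE box has accumulated, a quick sketch (the dataset path is a placeholder):
Code:
# Count snapshots under the docker-managed image-layer datasets.
zfs list -H -t snapshot -o name -r <pool>/ix-applications/docker | wc -l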
 
Last edited:

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
Which leaves these options:
  • Live with it.
  • Hope that SCALE will eventually move to another container engine than Docker.
  • Drop SCALE and run Docker on a non-ZFS platform (Ubuntu Server, Alpine Linux, etc.).
  1. This option is hard to recommend for obvious reasons, given how performance will tank over the system's lifetime due to those snapshots and layers. I guess the only way to live with it is to occasionally remove the pool from the Kubernetes/Apps storage selector, destroy the ix-applications dataset on said pool (which in turn removes all the ZFS snapshots created on it), then add the same storage pool back - so the ix-applications dataset gets created anew - and redeploy all the previously used apps from scratch (see the sketch at the end of this post)...
  2. If we are to stay with the Docker engine, this option would entail dropping the more efficient ZFS storage driver in favor of the more general overlay2 (as described in the docker storage drivers comparison). The ZFS driver is more efficient - both in terms of space usage and performance - and the switch to overlay2 would just trade ZFS snapshots for OverlayFS layer mounts; since we're already running ZFS-based filesystems, we might as well use it, although, as mentioned above, the docker devs put the maintenance (i.e. housekeeping) and configuration firmly in sys-admin/platform-admin hands. So I guess some extra work would be needed, e.g. clean-up on (re)deployments and on a recurring schedule - just as some docker folks at Rancher suggest using snapper when BTRFS is the underlying storage driver - otherwise the snapshots quickly get out of hand. Furthermore, none of the currently popular container engines offers automatic clean-up/housekeeping of ZFS snapshots; most of them support snapshot creation, with logic similar to what docker does with the ZFS driver today - they're good at creating snapshots as snaps and clones, but they don't clean up afterwards. So maybe some would create fewer, or could be configured to limit them (e.g. overlayfs has a limit of 128 lower layers plus 1 RW work layer per mount, and layers get rebuilt on deployment, so they take slightly longer and are less snappy than the ZFS underpinnings, though they should not litter as much). On the other hand, maybe we could set a similar limit on the ZFS filesystem currently in use, having ZFS automatically squash snaps (though I don't think ZFS can squash multiple snaps into one) or limit the creation of new snaps, requiring/enforcing the middleware to check first, destroy any unneeded snaps, and housekeep the existing ones to make room for newly requested ones. Either way, regardless of engine or storage driver, snapshot management needs to be implemented on the middleware side.
  3. If you run docker on a different platform, remember that the default overlay2 storage driver trades snaps for layers, which are simply recreated anew on "deployment", so you lose the quick-rollback feature the more advanced storage drivers support. If you instead opt for BTRFS (or XFS), i.e. a more advanced filesystem storage driver, you're going to use snapshots anyway, and the situation will become an issue sooner or later regardless of the OS...
Basically, we could drop the ZFS storage driver and switch back to the default overlay2, thereby losing the easy rollback/security features, being slightly less performant and efficient when done correctly, and having a limit of 128+1 layers per mount (i.e. per respective app filesystem). On the other hand, we would then not be impacted by this snapshotting issue. But that needs updating in the system config, the kubernetes config, and in the middleware to handle the expected behaviour for the higher-complexity security aspects. It would of course also mean any existing app deployments would have to be recreated from scratch, and it all needs to be added/tested on the TrueNAS side...
On the other hand, I guess it might be even better, or easier, if app (re)deployment (i.e. start/stop) had a clean-up procedure that destroys the app's datasets (thereby triggering the related ZFS snapshot removal, removal of no-longer-needed container images, and other housekeeping) before recreating them anew with the same/new image version. Of course, to keep persistence, all the PVCs would have to be kept outside of the regular docker-managed subdatasets/filesystems and actually be kept as normal PVs with e.g. hostpath volumes. At least that's my understanding...
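As for the reset procedure in option 1, a minimal sketch (the pool name is a placeholder; the pool has to be unset from Apps in the UI first, and every app then needs redeploying):
Code:
# Dry run: -n prints what would be destroyed, -v lists it, -R takes children
# and dependent clones along.
zfs destroy -nvR <pool>/ix-applications
# If the list looks sane, actually destroy it:
zfs destroy -R <pool>/ix-applications
# Re-selecting the pool under Apps settings recreates ix-applications empty.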
 

Ixian

Patron
Joined
May 11, 2015
Messages
218
There's an older (1y+) thread over on the Moby Git that covers the problem in more detail: https://github.com/moby/moby/issues/41055

No real solution (other than "create a zvol, format it with EXT4, and put the docker system directory there so it can use Overlay2"... which really isn't a solution), but some may be interested in the details.

Last time I tried it (RC2?), SCALE did allow formatting a zvol from the CLI (the packages for mkfs.ext4 weren't in earlier betas) and I gave it a whirl, but outside of VMs SCALE really isn't set up to support filesystems other than ZFS, and I ran into numerous issues. It also requires a bit of a workaround with pre-init scripts to mount the volume on boot, since you can't/shouldn't directly edit fstab on SCALE.
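For the curious, roughly what that workaround looks like (names and the size are placeholders, and as said, it comes with caveats):
Code:
# Create a zvol, format it ext4, and mount it for docker's data root.
zfs create -V 64G <pool>/docker-ext4
mkfs.ext4 /dev/zvol/<pool>/docker-ext4
mkdir -p /mnt/docker-ext4
mount /dev/zvol/<pool>/docker-ext4 /mnt/docker-ext4
# Point docker at it ("data-root": "/mnt/docker-ext4" in /etc/docker/daemon.json)
# so it uses overlay2; a pre-init script has to redo the mount on every boot.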

It would be a lot better if SCALE just ran what essentially amounts to garbage collection and cleaned up the snapshots the docker storage driver creates.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
So in short: It is primarily a Docker issue.
Which means that iXsystems is unlikely to spend much time fixing a "low" priority JIRA ticket.
And according to @Ixian the Docker team is of the opinion that this is a ZFS issue and that someone else must clean up their mess.

Which leaves these options:
  • Live with it.
  • Hope that SCALE will eventually move to another container engine than Docker.
  • Drop SCALE and run Docker on a non-ZFS platform (Ubuntu Server, Alpine Linux, etc.).

Just so everyone is aware, iX does take these types of issues seriously... even if in the short term we don't have a fix for it in Angelfish. There will be some reluctance to make a major change to the released software train. The issue started getting attention mid last year. Thanks to help from TrueCharts, we did reduce the number of snapshots to help make Angelfish more usable.

As discussed, there are other container engines, and there's also a method of using overlayfs on ZFS. For Bluefin, we expect to make one of these work.

Our preference is to use well-engineered and tested open source software components, but if necessary we can develop a solution as an additional "management feature" (e.g. @Ixian's garbage collection). However, it's better to minimize garbage in the first place.
 

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
While it's not an official solution, given how some SCALE users suggested e.g. running
Code:
docker image prune --all --force --filter "until=24h"
once in a while to "manually" clean the system of no-longer-needed images, and some reported success in improving the situation just from running
Code:
docker container prune --force --filter "until=24h"
once in a while (though if you want to combine both, you need to prune containers first and then images, and then run
Code:
docker volume prune --force
because volumes don't understand the until filter), I went ahead with the docker-specific commands...
I've used
Code:
docker system prune --all --force --volumes
(unfortunately volumes can't be combined with the until filter in this command) in a daily cron job (though I think I'll move it to something like a bi-weekly or monthly job) - sure, it's a workaround; it's rough around the edges; it doesn't allow combining the volumes and date filters; and most importantly it somewhat "breaks" the application's instant rollback capability (it seems TrueNAS ties the app's "previous entries" to their on-filesystem sets of snapshots/clones, which docker prune removes) - but I can live with that, as I usually test app (re)deployments between 16:00 and 24:00 and have the daily prune run set for 03:00.
Of course, one could also chain a full stack of docker * prune commands for each subdomain - container, image, volume, network, cache - with the respective best options/switches to clean more safely and with finer granularity, but I went with a single basic command instead. Whether you prefer the single, though limited, command or a combo of domain-specific ones is of course up to you.
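For those who prefer the granular route, a sketch of such a chain (the until window is just an example):
Code:
# Prune in dependency order: containers first, then images, then the rest.
docker container prune --force --filter "until=24h"
docker image prune --all --force --filter "until=24h"
docker volume prune --force          # volume prune does not accept "until"
docker network prune --force --filter "until=24h"
docker builder prune --all --force   # build cache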
In effect:
  • all the stopped/finished containers have been removed, and app restarts for the remaining active containers/pods are much snappier again (back to how it was) - the overall reported container count dropped from 2k+ to nearly 100...
  • thousands upon thousands of snaps, clones, and filesystems/volumes have been removed along with the containers - I'm down from 46.6k snaps to 0.6k, and in storage terms that's nearly 100GB freed...
  • hundreds of hanging/older versions of images have been deleted - I'm down from well over 2k to less than 20 now...
  • network routing to the cluster and through, to, and from the respective containers has also improved...
  • docker build caches have dramatically shrunk - down from over 3GB used to less than 1GB...
  • my CPU now reports under 10% at *idle* (idle meaning no active compute from running containers), with temps quickly dropping to around 40°C at idle - previously, even at *idle*, the CPU in my server hovered around 70% usage, with temps at a similar figure in degrees C...
Overall, I think I'll be a somewhat happier user with this primitive workaround until a smarter approach is offered in TrueNAS Scale (I'm thinking of something like an option to mark a container/pod as test, dev, or prod, to e.g. keep the containers and snapshots for debug analysis, or have them pruned on container failure, or after a few hours/daily for already-tested, presumably working PROD ones). But since docker support's stance is basically that this is by-design behaviour to enable analysis of docker volumes/layers on failing/broken containers (which I honestly did, and still do, use a lot when building/testing my own docker images against my own docker repository), with any maintenance to be organised elsewhere, and the TrueNAS team's position remains that this is a docker dogma issue (they're right about that; it's how the docker devs designed docker, more in line with ad hoc quickly started/stopped apps than ones running for longer periods) and that docker isn't properly cleaning up after itself in the long run, I think this will do for now, until a better solution is devised/offered in some future version of TrueNAS (i.e. periodic housekeeping of any docker/kubernetes leftovers).
I've also replaced all the aliases for
Code:
docker run
with
Code:
docker run --rm
for any stable docker image/container for my users (to cut the count of stopped/finished but still-listed containers and reduce the chance of my users generating noise/trash from impromptu/ad hoc container runs), and left the regular, non-self-cleaning-on-failure docker run command for the small subset of build/test deployments used for debugging.
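E.g. as a shell alias (a sketch; the alias name and its placement in the users' profiles are assumptions):
Code:
# Containers started this way are removed automatically once they exit.
alias docker-run='docker run --rm'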
Hopefully my set of workarounds will help others.
Please bear in mind that this workaround clears everything docker created for containers that aren't currently running, so if you have an app you start only now and then, it needs to be running when the prune command analyses the containers; otherwise the containers/images/snaps/volumes/networks docker created for it will get purged. I currently have only 2 docker images, and corresponding pods, that I build from scratch/source in my CI/CD pipeline for one-time, fairly short docker jobs that are stopped most of the time; the other apps are constantly running/in use, so this solution works for me. But your mileage may vary...


Just FYI, this is not a complete solution; the result is still about 500 snaps remaining, some of which are the ones from the Data Protection feature - i.e. the automated daily snaps that are correctly cleared automatically.
There are still droves of snaps, taken seemingly on reboot/kubernetes restart, that are non-removable due to the "snapshot has dependent clones" message returned when attempting zfs destroy by hand (these live on <pool>/ix-applications/docker/<dockerimagelayerfilesystem>), and multiples that are removable (these live on the <pool>/ix-applications/release/<appname>/<numberedversion> subdataset); both kinds are snaps taken by docker/kubernetes for the apps in the environment. The latter refer to specific (historical) versions of the app deployments used on the machine/server (I've had more than 50 release snaps for some of the deployed apps) - they're not in use per se, but are "recipes/snaps" for the specific application deployment rollbacks used in the Installed Applications -> App -> Rollback functionality, if I recall/understand correctly. I'm not sure which docker/kubernetes mechanism manages these, but over time even a handful of running apps will grow these 2 types of snaps to over a hundred, easily over 500. Sure, it's still a reduction compared to the multiple thousands before the daily docker prune runs, but these are hundreds of snaps not taken by user-managed/deployed automatas, nor are they related to (or pruned by) the aforementioned cron-job commands; they are snaps taken for the Kubernetes/docker environment by those application tools.
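To identify which leftovers are pinned by clones (and hence refuse to die), something along these lines works (the pool path is a placeholder):
Code:
# Snapshots with a non-empty "clones" property can't be destroyed until those
# clones are removed or promoted; "-" in the second column means no clones.
zfs list -H -t snapshot -o name,clones -r <pool>/ix-applications | awk '$2 != "-"'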
Perhaps in the future the TrueNAS Scale dev team will offer some intelligent cleaning utility that takes care of these automatically in a smart manner, or an option in the app deployment forms to trim their creation. Sure, thanks to the nature of snapshotting they don't take up much storage per se, but these are still hundreds of snapshots that should be better managed by the system.
 
Last edited:

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
Just so everyone is aware, iX does take these types of issues seriously... even if in the short term we don't have a fix for it in Angelfish. There will be some reluctance to make a major change to the released software train. The issue started getting attention mid last year. Thanks to help from TrueCharts, we did reduce the number of snapshots to help make Angelfish more usable.

As discussed, there are other container engines, and there's also a method of using overlayfs on ZFS. For Bluefin, we expect to make one of these work.

Our preference is to use well-engineered and tested open source software components, but if necessary we can develop a solution as an additional "management feature" (e.g. @Ixian's garbage collection). However, it's better to minimize garbage in the first place.
Yeah, and we understand that. Going for overlay2 as the storage solution would just trade ZFS snaps for less efficient layers in the overlay2 driver stack. Basically, some of the trash/noise would still be there, only encapsulated in different tech. Overlay drivers could help a bit, but ultimately they might still end up proliferating layers for different mounts, which could likewise be left behind for docker debugging/analysis/checks. Sure, in some cases this might help de-noise the WebUI, e.g. by moving this to a different subpage/pane. On the other hand, I was wondering whether it would be possible to apply e.g. multiple tags or labels, to signal/mark the origin/purpose clearly in both computer- and human-readable manner - these snaps are a product of the system/OS running, these are docker-created for a specific app/namespace, these are kubernetes-managed for a specific app/namespace, these are user-defined automatas, etc. - and then add an extra field/selector to the table to filter/select only specific subgroups: user-defined, OS/middlewared-required, kubernetes for this or that app, docker for this or that app...
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
I've used
Code:
docker system prune --all --force --volumes
(unfortunately volumes can't be combined with the until filter in this command) in a daily cron job (though I think I'll move it to something like a bi-weekly or monthly job) - sure, it's a workaround; it's rough around the edges; it doesn't allow combining the volumes and date filters; and most importantly it somewhat "breaks" the application's instant rollback capability (it seems TrueNAS ties the app's "previous entries" to their on-filesystem sets of snapshots/clones, which docker prune removes) - but I can live with that, as I usually test app (re)deployments between 16:00 and 24:00 and have the daily prune run set for 03:00.
Adopted! Thanks for the code.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Yeah, and we understand that. Going for overlay2 as the storage solution would just trade ZFS snaps for less efficient layers in the overlay2 driver stack. Basically, some of the trash/noise would still be there, only encapsulated in different tech. Overlay drivers could help a bit, but ultimately they might still end up proliferating layers for different mounts, which could likewise be left behind for docker debugging/analysis/checks. Sure, in some cases this might help de-noise the WebUI, e.g. by moving this to a different subpage/pane. On the other hand, I was wondering whether it would be possible to apply e.g. multiple tags or labels, to signal/mark the origin/purpose clearly in both computer- and human-readable manner - these snaps are a product of the system/OS running, these are docker-created for a specific app/namespace, these are kubernetes-managed for a specific app/namespace, these are user-defined automatas, etc. - and then add an extra field/selector to the table to filter/select only specific subgroups: user-defined, OS/middlewared-required, kubernetes for this or that app, docker for this or that app...
I don't disagree... but we have resource and time constraints. Off-the-shelf solutions are preferred - either existing code or developed and tested contributions. It's a good discussion topic for any developers on the discord channel.
 