How are Periodic Snapshots marked for deletion?

Joined
Oct 22, 2019
Messages
3,641
I could never find a definitive answer to this particular question. What exactly runs behind the scenes to mark and delete old snapshots?

Is there a script that searches for old snapshots based on their name? Or is it based on their creation timestamp?

Is it more sophisticated where it stores a table of snapshots and their expiration date, regardless of the snapshot's name and regardless of the Periodic Snapshot Task name? (In other words, renaming a snapshot will not save it from certain doom once its expiration arrives, since it is "tagged" regardless of its name?)

What happens if you change the "Naming Schema" for a Periodic Snapshot Task that already exists? Will expired snapshots eventually "slip through the cracks" and live for an eternity? :wink:

periodic-snapshot-deletion-method.png

Attached is a screenshot to illustrate my confusion.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
In my experience, yes, they will live on forever. I think the retention data is in the task, not the snapshots themselves. Not ideal, and I don't think it worked that way in 11.2 and prior. I hope I'm wrong though and have just been missing something.
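If anyone wants to see what the task itself stores, something like this should dump it from the middleware (this is the midclt client on 12.0; I'd expect the naming schema and lifetime fields to show up there rather than on the snapshots -- treat the exact field names as my guess):
Code:
# Dump the periodic snapshot task definitions from the middleware.
# I'd expect fields like naming_schema, lifetime_value and lifetime_unit here,
# i.e. the retention lives with the task, not the snapshot (field names are a guess).
midclt call pool.snapshottask.query | python3 -m json.tool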
 
Joined
Oct 22, 2019
Messages
3,641
You mean if you rename a snapshot, the task that is supposed to remove expired snapshots will skip it (since it looks for a matching name)?
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
You mean if you rename a snapshot, the task that is supposed to remove expired snapshots will skip it (since it looks for a matching name)?
Yes, I believe so. I think there are some things you can change on the task where it will continue to remove expired snapshots, and some things that break it. I'm not sure exactly what though. I just know that I've had orphaned snapshots hang around after modifying a task.
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
I did exactly that (changed the naming scheme) and FreeNAS stopped deleting out-of-date snapshots with the old name.
 
Joined
Oct 22, 2019
Messages
3,641
I just know that I've had orphaned snapshots hang around after modifying a task.
I did exactly that (changed the naming scheme) and FreeNAS stopped deleting out-of-date snapshots with the old name.

I wonder if this is by design, or just a "handy accident" that works to protect snapshots?

That makes me curious whether there's a risk that a similarly named snapshot task might inadvertently have its snapshots deleted, even though they have a longer expiration date. After all, the default naming schema is the same for every newly created snapshot task.
 

rudds

Dabbler
Joined
Apr 17, 2018
Messages
34
Chiming in here to say the logic for this stuff is confusing, or buggy, or both. I'm having a similar problem: my pool snapshots are set to a 3-day retention in this "periodic snapshot task":

Screenshot 2021-03-21 093453.png


Those snapshots are synced daily to an external pool overnight with a "replication task", and I set a 6-week retention on the replicated snapshots:

Screenshot 2021-03-21 093531.png


However, I just noticed this morning that the snapshots on my external pool are also getting purged after three days, which was a bit of a rude surprise. (Thankfully, I haven't needed to retrieve anything from older backups recently.)

I had this setup working fine under 11.3, so I'm not sure at what point the logic changed or a bug was introduced, but something does seem wrong here.
 
Joined
Oct 22, 2019
Messages
3,641
I had this setup working fine under 11.3, so I'm not sure at what point the logic changed or a bug was introduced, but something does seem wrong here.
The mystery continues. Do these replicated snapshots (on the external pool) happen to have "-3d" appended to their names? Maybe it's the "-3d" which signals their destruction, even though you set an override lifetime of 6 weeks?


I'm still trying to figure this one out, and I'm not savvy enough to comb through the source code to answer my own question. I don't even know if this is a "bug", if we're using snapshots incorrectly, or if we're missing a key detail not covered in the documentation.

I guess it's safe to say there's nothing in the snapshot's metadata itself that signals to the system "it's safe to delete me now!"
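If anyone wants to double-check that on their own system, a quick filter over a snapshot's properties should come up with nothing; the pool, dataset, and snapshot names below are just examples:
Code:
# Look for anything resembling a retention or expiry property on a snapshot.
# Pool, dataset and snapshot names are examples -- substitute your own.
zfs get all tank/mydata@auto-2021-03-22_13-00 | grep -iE 'expire|lifetime|retention'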

So that leaves an under-the-hood task (maybe run on a schedule, like a built-in cron job that cannot be reviewed or modified). Maybe it does in fact look at the name of the snapshot (or its creation date?) and compare it against the appended portion, such as "-6m" or "-3d". That would fit what was noted above, where changing a Snapshot Task's naming schema left the old snapshots untouched.

Maybe it has nothing to do with any part of the snapshot's name, and there's a separate database stored somewhere that links snapshots to their expiration dates, and the cron job destroys any existing snapshots on that "list"? Perhaps renaming a snapshot inadvertently "saves it from certain destruction" because the task/database/list references snapshot names rather than some sort of unique index ID?

At this point I'm just taking shots in the dark. It's not something I can feasibly test to know for certain. Does anyone else have theories that could answer this mystery?
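The closest thing to a test I can think of would be something like this on a throwaway dataset: rename one automatic snapshot so it no longer matches the naming schema, then check after its lifetime has passed whether it outlived its siblings. All the names below are made up.
Code:
# Rename one automatic snapshot so it no longer matches the naming schema
# (dataset and snapshot names are placeholders).
zfs rename tank/test@auto-2021-03-22_13-00 tank/test@keepme-2021-03-22_13-00
# Later, after the task's lifetime has passed, see which snapshots survived.
zfs list -t snapshot -o name,creation tank/test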


Once you create a Snapshot Task, you sort of just "set it and forget it". It's not really clear what takes place behind the scenes when it comes to retention and expiration, as seen above with @rudds' issue of overriding the lifetime to 6 weeks, yet having the snapshots destroyed after 3 days.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I'm not 100% sure which method TrueNAS is employing, but I have seen reference to autosnap before (https://github.com/ansemjo/autosnap), so maybe there's some code based on that involved.

If that is the case, names are most certainly important in designating the retention period. (There's also potentially a limit on the number of snapshots, which might be defined as a property of the dataset: snapshot_limit, checked against snapshot_count, I guess. I notice that the original script uses some named properties that I don't see on my replicated datasets, so there's clearly some variation.)
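For what it's worth, those two properties can be checked directly (the dataset name below is just an example):
Code:
# Check whether a snapshot limit is configured on a dataset (name is an example).
zfs get snapshot_limit,snapshot_count pool/backups/dataset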

You could probably consider a complementary script that uses the creation time of each snapshot to hold and release snapshots on the backup system, depending on when you want them to become eligible for removal.
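A rough sketch of what I mean, untested and with placeholder names: hold everything newer than a cutoff and release anything older, so the replication's own retention can only remove snapshots you've already let go of.
Code:
#!/bin/sh
# Sketch only: hold backup snapshots younger than KEEP_DAYS so nothing can
# destroy them, and release that hold once they age past the cutoff.
# DATASET and the hold tag are placeholders -- adjust to your system.
DATASET="pool/backup/dataset"
KEEP_DAYS=42
CUTOFF=$(( $(date +%s) - KEEP_DAYS * 86400 ))

# -H: no header, -p: creation time as a unix timestamp
zfs list -H -p -t snapshot -o name,creation "$DATASET" | while read -r snap created; do
    if [ "$created" -ge "$CUTOFF" ]; then
        zfs hold manual_keep "$snap" 2>/dev/null || true    # already held is fine
    else
        zfs release manual_keep "$snap" 2>/dev/null || true # no such hold is fine
    fi
done

You'd schedule something like that on the backup box (cron or a Cron Job task) so the holds roll forward over time.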

I also note that the Advanced Replication creation settings allow the retention of the target to be set differently from the source, so I guess that's something to look at and confirm whether it works (then raise a bug report if not).
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I don't have an explanation of the fundamentals, but I can probably add to the confusion :wink:

I have a snapshot task for my VMs that does hourly snapshots with a retention period of two weeks:
Bildschirmfoto 2021-03-22 um 13.44.19.png

The resulting snapshots look like this:
Code:
root@freenas[~]# zfs list -t snap -r ssd/vms/windows-pmh-disk0
NAME                                              USED  AVAIL     REFER  MOUNTPOINT
ssd/vms/windows-pmh-disk0@auto-2021-03-08_13-00  9.07M      -     30.4G  -
ssd/vms/windows-pmh-disk0@auto-2021-03-08_14-00  6.01M      -     30.4G  -
ssd/vms/windows-pmh-disk0@auto-2021-03-08_15-00  6.66M      -     30.4G  -
[...]
ssd/vms/windows-pmh-disk0@auto-2021-03-22_11-00  15.6M      -     30.8G  -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_12-00  9.57M      -     30.8G  -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  9.81M      -     30.8G  -


Then I have a replication task that replicates one snapshot per day and keeps those for 4 weeks on another system:
Bildschirmfoto 2021-03-22 um 13.46.22.png

The snapshots on the target system look like this:
Code:
root@freenas2[~]# zfs list -t snap -r fusion/backup/vms/windows-pmh-disk0
NAME                                                        USED  AVAIL     REFER  MOUNTPOINT
fusion/backup/vms/windows-pmh-disk0@auto-2021-02-23_00-00   644M      -     44.1G  -
fusion/backup/vms/windows-pmh-disk0@auto-2021-02-24_00-00   549M      -     44.1G  -
fusion/backup/vms/windows-pmh-disk0@auto-2021-02-25_00-00   572M      -     44.1G  -
[...]
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-20_00-00   572M      -     44.6G  -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-21_00-00   515M      -     44.6G  -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00     0B      -     44.6G  -


So, don't ask me why exactly, but at least for me the system is working perfectly as I intend it to. Perhaps my config screenshots help in coming to a conclusion.

Kind regards,
Patrick
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Source system:
Code:
root@freenas[~]# zfs get all ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00
NAME                                             PROPERTY                VALUE                   SOURCE
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  type                    snapshot                -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  creation                Mon Mar 22 13:00 2021   -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  used                    9.85M                   -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  referenced              30.8G                   -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  compressratio           1.00x                   -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  volsize                 80G                     local
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  createtxg               10889070                -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  guid                    3500501945768056423     -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  primarycache            all                     default
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  secondarycache          all                     default
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  defer_destroy           off                     -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  userrefs                0                       -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  objsetid                86133                   -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  mlslabel                none                    default
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  refcompressratio        1.00x                   -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  written                 19.5M                   -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  clones                                          -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  logicalreferenced       30.6G                   -
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  context                 none                    default
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  fscontext               none                    default
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  defcontext              none                    default
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  rootcontext             none                    default
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  encryption              off                     default
ssd/vms/windows-pmh-disk0@auto-2021-03-22_13-00  org.freebsd.ioc:active  yes                     inherited from ssd


Destination system:
Code:
root@freenas2[~]# zfs get all fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00
NAME                                                       PROPERTY               VALUE                  SOURCE
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  type                   snapshot               -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  creation               Mon Mar 22  0:00 2021  -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  used                   0B                     -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  referenced             44.6G                  -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  compressratio          1.00x                  -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  volsize                80G                    local
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  createtxg              1179316                -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  guid                   4868441139690642052    -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  primarycache           all                    default
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  secondarycache         all                    default
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  defer_destroy          off                    -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  userrefs               0                      -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  objsetid               53833                  -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  mlslabel               none                   default
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  refcompressratio       1.00x                  -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  written                994M                   -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  clones                                        -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  logicalreferenced      30.6G                  -
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  context                none                   default
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  fscontext              none                   default
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  defcontext             none                   default
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  rootcontext            none                   default
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  encryption             off                    default
fusion/backup/vms/windows-pmh-disk0@auto-2021-03-22_00-00  org.truenas:managedby  217.29.46.82           inherited from fusion/backu
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
So that's a clear "NO" to any information about snapshot retention being stored in the snapshot on either end.

I note from the API reference that nothing other than the creation and manipulation of replication jobs is available, so there's no option to run cleanup separately.

It seems to me that the replication task itself is doing the cleanup based on the rules defined in the task: it identifies snapshots to work on by their creation time and the naming schema defined in the task properties, and the task's settings control how each side is treated.
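In other words, the only inputs it would need are things you can see for yourself anyway (dataset name below is from Patrick's example):
Code:
# Everything a name/creation-based cleanup has to go on: the snapshot names
# and their creation times, sorted oldest first.
zfs list -t snapshot -o name,creation -s creation fusion/backup/vms/windows-pmh-disk0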
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
As far as I can deduce from some open issues in iXsystems' JIRA, they use zettarepl.

To quote:
zettarepl is a cross-platform ZFS replication solution. It provides:
  • Snapshot-based PUSH and PULL replication over SSH or high-speed unencrypted connection
  • Extensible snapshot creation and replication schedule, replication of manually created snapshots
  • Consistent recursive snapshots with possibility to exclude certain datasets
  • All modern ZFS features support including resumable replication
  • Flexible snapshot retention on both local and remote sides
  • Comprehensive logging that helps you to understand what is going on and why
  • Configuration via simple and clear YAML file
  • Full integration with FreeNAS, the World’s #1 data storage solution
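If anyone wants to confirm that on a live box, something like this should do it. The import and the log path are assumptions on my part (it's what I'd expect on 12.0), so adjust for your version.
Code:
# Confirm the zettarepl package is actually installed.
python3 -c "import zettarepl; print(zettarepl.__file__)"
# Retention/destroy decisions should show up in the zettarepl log
# (log path is an assumption -- check /var/log if yours differs).
grep -iE 'retention|destroy' /var/log/zettarepl.log | tail -20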
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
As far as I can deduce from some open issues in iXsystems' JIRA, they use zettarepl.
I agree, it seems identical to the options available in the GUI.

And indeed it has a built-in set of routines for managing the retention/cleaning of snapshots on both sides as part of the replication system.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
And indeed it has a built-in set of routines for managing the retention/cleaning of snapshots on both sides as part of the replication system.
Which is the only way that makes sense, don't you think? Picture a separate purging task based on the snapshot creation time and some retention configuration. If the replication stops for some reason and that goes undetected, the task will happily expire old snapshots until either all of them are gone or the problem is noticed and fixed.

Only deleting old data once the new data has been committed is a sane, conservative approach, IMHO.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Which is the only way that makes sense, don't you think? Picture a separate purging task based on the snapshot creation time and some retention configuration. If the replication stops for some reason and that goes undetected, the task will happily expire old snapshots until either all of them are gone or the problem is noticed and fixed.

Only deleting old data once the new data has been committed is a sane, conservative approach, IMHO.
I totally agree, but there is this:
  • hold_pending_snapshots will prevent source snapshots from being deleted by retention if replication fails for some reason
Or maybe it's more correct to say, "there is also this".
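If that option does what the name suggests and places ordinary ZFS holds on the not-yet-replicated snapshots, they should be visible like this (snapshot name is an example):
Code:
# List any user holds on a snapshot; a held snapshot cannot be destroyed
# until every hold is released. Snapshot name is an example.
zfs holds tank/mydata@auto-2021-03-22_13-00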
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
That's the other side, yes. I was primarily thinking of some purging task on the destination.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I was primarily thinking of some purging task on the destination.
Which would be stopped if the snapshots still pending are marked as held by the replication task.

Maybe there's some combination of logic with all of that which would be helpful, but there are so many moving parts.

I see you've confirmed that it behaves as expected if set correctly according to your wishes, so maybe we just leave it at that.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Which would be stopped if the snapshots still pending are marked as held by the replication task.
The point is: there are no separate tasks. The purging is done once the replication has completed. Which is good.
 