ZFS Rollup - A script for pruning snapshots, similar to Apple's Time Machine

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
This seems to work just fine - thanks!

I now need to figure out how to use this in a way that does not affect my ability to recover from replication issues. The problem: if there are replication problems, there needs to be a common snapshot (one that exists on both the primary and the secondary server) to be able to recover manually. If I run rollup on both servers, the snapshots that are kept may not be the same ones. So for now, I'll stick to running it only on the backup server.
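For reference, this is roughly what I mean by the common-snapshot check (just an untested sketch, not part of the rollup script; the hostname and dataset below are placeholders, not my actual setup):

Code:
# Rough sketch of the common-snapshot check (untested). The hostname and
# dataset name below are placeholders.
import subprocess

DATASET = "tank/data"          # placeholder dataset
REMOTE = "backup.example.com"  # placeholder secondary server

def list_snapshots(prefix):
    # 'zfs list -d 1 -t snapshot' lists only this dataset's own snapshots,
    # sorted oldest to newest by creation time.
    cmd = prefix + ["zfs", "list", "-H", "-d", "1", "-t", "snapshot",
                    "-o", "name", "-s", "creation", DATASET]
    out = subprocess.check_output(cmd, text=True)
    return [line.split("@", 1)[1] for line in out.splitlines() if "@" in line]

local = list_snapshots([])                      # run locally
remote = set(list_snapshots(["ssh", REMOTE]))   # run on the other box via SSH

common = [name for name in local if name in remote]
if common:
    print("newest common snapshot:", common[-1])
else:
    print("no common snapshot left; recovery would need a full resend")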
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
Yeah, this is getting out of my league, as I'm unfortunately not using replication. The only solution that seems satisfactory would be having the rollup script be in charge of both sender and receiver from one location or the other, i.e. it would shell into the other box to list and delete snapshots. That seems doable, but I don't know that I should take it on until I have a replication environment to test with. Donations accepted ;-)
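Roughly what I have in mind (untested sketch, hostname is a placeholder): whichever box runs the rollup decides what to prune and then issues the same destroys locally and over SSH on the other box, so both sides keep identical snapshot sets.

Code:
# Untested sketch of the idea: prune the same snapshots on both boxes from
# one location. The hostname is a placeholder; error handling is omitted.
import subprocess

REMOTE = "backup.example.com"  # placeholder for the other box

def destroy_on_both(snapshots):
    for snap in snapshots:
        # destroy locally, then issue the same destroy on the other box
        subprocess.check_call(["zfs", "destroy", snap])
        subprocess.check_call(["ssh", REMOTE, "zfs", "destroy", snap])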
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
I promise to share any money I make on this with you :smile:

All of this did make me review the clearempty.py script. It now keeps the latest snapshot and the latest NEW snapshot, which results in two daily snapshots being copied to my backup server. I tried to figure out why we keep the latest NEW, but I couldn't reconstruct that from memory or from this thread. I am considering removing that check. The script would still keep the latest snapshot; it just doesn't seem to matter whether it's NEW or not. Any comments? If not, I can send you a patch (which just removes a couple of lines).
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
I can probably fix that the same way I changed the rollup script, unless that would still lead to two snapshots being saved.

Unfortunately I don't have a replication setup to test with.

Can you post the output of the command above? Are the latest NEW and the latest different snapshots?
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
On my system, the latest NEW is always also the latest snapshot, although sometimes there is no latest NEW at all.
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
I'm confused about the issue you're referring to. At first you mentioned that clearempty saves the latest and the latest NEW, but then you also say these are the same snapshot. Is the issue that too much is being saved, or is it just redundant code?
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
Sorry for not making myself clear. The script first finds the latest snapshot and excludes it from the candidate list. Then it does the same for the latest NEW. But because the latest snapshot has already been removed from the candidate list, it selects the next-latest NEW from the remaining candidates. As a result, two snapshots are kept where one would suffice, because the latest snapshot is also the latest NEW.
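In simplified form, the logic looks roughly like this (not the actual clearempty.py code, just a sketch of the behaviour and of the change I'm proposing):

Code:
# Simplified sketch of the keep logic (not the actual clearempty.py code).
# 'snapshots' is assumed to be sorted oldest -> newest, and is_new(snap)
# is assumed to report whether the snapshot contains new data.

def snapshots_to_keep(snapshots, is_new):
    keep = set()
    if snapshots:
        keep.add(snapshots[-1])            # always keep the latest snapshot

    # Current behaviour: look for the latest NEW among the *remaining*
    # candidates. If the latest snapshot is itself the latest NEW, an older
    # NEW snapshot gets kept as well, so two end up being saved.
    candidates = [s for s in snapshots if s not in keep]
    for snap in reversed(candidates):
        if is_new(snap):
            keep.add(snap)
            break

    # Proposed change: search over *all* snapshots instead, so that when the
    # latest snapshot is also the latest NEW, only one snapshot is kept:
    #
    # for snap in reversed(snapshots):
    #     if is_new(snap):
    #         keep.add(snap)
    #         break

    return keep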
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
I recently updated my FreeNAS system and I now get many errors from the clearempty.py script:

cannot destroy snapshot <snapshotname>: dataset is busy

I make snapshots every 15 minutes, so this list gets very long. I haven't looked at the system's internals for some time, but my first guess is that this is related to the various 'zfs hold' calls I see in the console window. This hold/release behavior was probably introduced relatively recently, and the clearempty.py script hasn't been updated accordingly.

Is this a known problem? Is it safe to just add a zfs release command before the snapshot gets destroyed in the clearempty.py script? Or is this hold/release mechanism used for other internal housekeeping by FreeNAS?
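Concretely, what I mean is something like this around the destroy step (untested sketch; the tags are read from 'zfs holds' output rather than hard-coded, since I don't know which tag name FreeNAS uses):

Code:
# Untested sketch of what I mean: release whatever holds exist on the
# snapshot right before destroying it.
import subprocess

def release_and_destroy(snapshot):
    # 'zfs holds -H' prints one tab-separated line per hold: name, tag, time
    out = subprocess.check_output(["zfs", "holds", "-H", snapshot], text=True)
    for line in out.splitlines():
        if line.strip():
            tag = line.split("\t")[1]
            subprocess.check_call(["zfs", "release", tag, snapshot])
    subprocess.check_call(["zfs", "destroy", snapshot])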
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
This may just be a problem because I was using an out-of-date autorepl.py script. So if the hold/release mechanism should not cause any problems with the clearempty script, please ignore my message.
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
I just took a look at the scripts and they don't specifically handle snapshots that have been held. I can look into handling that.

In my own case, I haven't seen any errors related to snapshots on hold, but I don't think I have any snapshots on hold either.

Thanks for the feedback.
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
I'm still getting these errors whenever the script runs.

My guess is that the snapshot script puts the snapshots on hold if there is a replication task. After replication, the hold-status is cleared. I run snapshots during the day, and replicate them in the evening. What I observe is that in the morning, there are only a few lines of errors due to snapshots being on hold. This builds up during the day.

It's not difficult to clear the hold status in the cleanup script. The problem is that I don't know whether that introduces any new problems. I cannot find any explanation of the logic behind the snapshot and replication scripts, or of why this hold logic was introduced.
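The safer alternative is probably to leave held snapshots alone and only destroy the rest, something like this (again, just an untested sketch):

Code:
# Untested sketch of the safer option: skip any snapshot that still has a
# hold on it and let a later run clean it up once replication has released it.
import subprocess

def destroy_if_unheld(snapshot):
    out = subprocess.check_output(["zfs", "holds", "-H", snapshot], text=True)
    tags = [line.split("\t")[1] for line in out.splitlines() if line.strip()]
    if tags:
        print("skipping %s (held by: %s)" % (snapshot, ", ".join(tags)))
        return False
    subprocess.check_call(["zfs", "destroy", snapshot])
    return True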
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
Aren't they in sync? Oh shoot. I must have forgotten to push to GitHub. I'll update that tonight.

I also need to get the tmsnap utility in there. I'll try for that tonight as well.
 

leonroy

Explorer
Joined
Jun 15, 2012
Messages
77
Aren't they in sync? Oh shoot. I must have forgotten to push to GitHub. I'll update that tonight.

I also need to get the tmsnap utility in there. I'll try for that tonight as well.

Heh, maybe delete the src in one of them and just stick a URL in the README.md to the other ;)

Thanks btw, they worked great and helped me clean up a slow box that was stuffed with 10k+ snapshots.
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
Glad it helped.

I usually keep both repos in sync, and overall I like having both up, as some people prefer one host over the other.
 

dwoodard3950

Dabbler
Joined
Dec 16, 2012
Messages
18
I've struggled with this process of creating snapshot tasks on a given dataset with different intervals and differing lifetimes. I've created 3 tasks for a given dataset, with intervals of 15 min, hourly, and daily. The problem is that the shorter-interval snapshots expire and get deleted locally, which then causes the remote replication to fail. Has anyone sorted this out? Any suggestions?
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
Are you using this script? Or just the FreeNAS snapshot and replication settings?
 

dwoodard3950

Dabbler
Joined
Dec 16, 2012
Messages
18
Are you using this script? Or just the FreeNAS snapshot and replication settings?
I've used another script for backing up, called 'zfs-backup.sh', which looks for user properties to identify the datasets to replicate. It runs from a cron job and checks the snapshots on the destination and source to ensure consistency.
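For example, tagging a dataset for backup with a user property looks roughly like this (the property name 'backup:offsite' is just an illustration, not necessarily the one zfs-backup.sh actually reads):

Code:
# Illustration only: tag a dataset with a user property and let a cron job
# discover the tagged datasets. The property name 'backup:offsite' is a
# made-up example.
#
#   one-time, by hand:  zfs set backup:offsite=on tank/important
import subprocess

PROP = "backup:offsite"  # hypothetical user property

out = subprocess.check_output(
    ["zfs", "get", "-H", "-t", "filesystem", "-o", "name,value", PROP],
    text=True,
)
targets = []
for line in out.splitlines():
    if not line.strip():
        continue
    name, value = line.split("\t")[:2]
    if value == "on":       # unset user properties show up as '-'
        targets.append(name)
print("datasets to replicate:", targets)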
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
OK, I know others have run into similar issues at various times, but I think the FreeNAS replication setup is supposed to handle those cases now. My only suggestion would be that your script needs more intelligence when determining what can be deleted.

I personally don't have any experience with replication, and the only feedback I've had is that the rollup script I developed doesn't seem to be causing any problems.

Do you have a link to the source of this zfs-backup script? Or could you post the contents somewhere?
 

dwoodard3950

Dabbler
Joined
Dec 16, 2012
Messages
18
OK, I know others have run into similar issues at various times, but I think the FreeNAS replication setup is supposed to handle those cases now. My only suggestion would be that your script needs more intelligence when determining what can be deleted.

I personally don't have any experience with replication, and the only feedback I've had is that the rollup script I developed doesn't seem to be causing any problems.

Do you have a link to the source of this zfs-backup script? Or could you post the contents somewhere?
https://github.com/adaugherity/zfs-backup/blob/master/zfs-backup.sh

As for the built-in replication: I use that as well, to another server on the intranet. I use the zfs-backup.sh script for the off-site backup, since that requires a little more manipulation and I don't want all datasets pushed to the off-site backup.
 