How to leverage periodic snapshots for large tar backups?

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
Hi,

I'm not finding an example of this question using the search function, so apologies if I'm missing an existing post/discussion that answers this.

We have some backup jobs that output very large tar files; there aren't significant changes in the underlying data from week to week, and some of them remain unchanged. They are currently backed up onto NAS devices, but I'm experimenting with possibilities for using TrueNAS and wondering how I can leverage the advantages of ZFS over a standard NAS.

I was thinking that snapshots could be a way to allow us to store more versions of these large tar files by storing the difference between versions - is this possible? Am I misunderstanding what can be achieved in this case?

I've set up a periodic snapshot on a dataset and I've used that dataset as a target for the backup tar files, but each snapshot is the entire size of the backup file.

I've set up a periodic snapshot task to run every minute (just for testing) and disabled "Allow Taking Empty Snapshots". I'm able to run the script to back up to tar and write successfully over NFS to the datastore and subsequent runs of that backup job replace the existing tars.

I'm guessing this is because the entire tar file is written, replacing the previous version, rather than this being an edit of an existing file.

Have I made an error in my config or am I trying to do something that isn't possible?

Thanks very much for reading this far :)
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I was thinking that snapshots could be a way to allow us to store more versions of these large tar files by storing the difference between versions - is this possible? Am I misunderstanding what can be achieved in this case?
I think maybe you're confusing snapshots with deduplication... completely different things (although can be used together).

I've set up a periodic snapshot on a dataset and I've used that dataset as a target for the backup tar files, but each snapshot is the entire size of the backup file.

I've set up a periodic snapshot task to run every minute (just for testing) and disabled "Allow Taking Empty Snapshots". I'm able to run the script to back up to tar and write successfully over NFS to the datastore and subsequent runs of that backup job replace the existing tars.

I'm guessing this is because the entire tar file is written, replacing the previous version, rather than this being an edit of an existing file.
Right, the tar file being written in its entirety forces the "changed" contents of the disk to be included in the snapshot... the entire tar file.

Have I made an error in my config or am I trying to do something that isn't possible?
You seem to be trying to get deduplication, so it's not so much impossible as a different thing from what you're trying... and a resource-hungry one that will not work well if you don't set it up right.
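
If you want a rough idea of what dedup would actually save on your data before enabling it (zfs set dedup=on on the dataset), you can simulate it against an existing pool; the pool name here is just an example, and it can take a while and a fair amount of RAM on a big pool:

zdb -S tank

That prints a simulated dedup table histogram and an estimated dedup ratio without changing anything on the pool.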

You might want to get very familiar with the concepts and requirements before trying that with any real data, but here's a pointer to something that may help:
 

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
I think maybe you're confusing snapshots with deduplication... completely different things (although can be used together).


Right, the tar file being written in its entirety forces the "changed" contents of the disk to be included in the snapshot... the entire tar file.


You seem to be trying to get deduplication, so it's not so much impossible as a different thing from what you're trying... and a resource-hungry one that will not work well if you don't set it up right.

You might want to get very familiar with the concepts and requirements before trying that with any real data, but here's a pointer to something that may help:
Thanks very much for taking the time to respond.

I was deliberately avoiding dedupe initially as I'd read a few things that seemed to suggest it wasn't worth the effort for most use cases; although now you've mentioned it, I can see how, in this particular use case, block-level deduplication could be exactly the answer I need (assuming I stick to outputting a large tar).

I'm just pushing additional jobs out to a couple of TrueNAS test boxes on commodity hardware at the minute, before we commit to a hardware spec and funding. So while it's real data, it's not our actual backup; it's a learning exercise, a test of what is achievable, and a chance for me to make mistakes before migrating real systems!

I'll read more on dedupe and test that out initially, although I may end up approaching this a different way; I need to test whether it's even worth creating the tar.

Thanks again for your assistance, I appreciate it.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
I'm guessing this is because the entire tar file is written, replacing the previous version, rather than this being an edit of an existing file.
Such tar files would have to be backed up "in-place".

This is possible with a tool like rsync, which has an option to write the changes to the existing file "in-place". (It will inspect every 128 KiB with a checksum to know whether or not that segment needs to be modified, and it will append to the file as needed if the size is larger.)

This can take longer than simply sending the whole file from scratch, but it leverages block-based CoW filesystems, such as ZFS and Btrfs; and yields much more efficient snapshots.

Otherwise, as you noted, the snapshots will consume the space of the entire file with each iteration, even if it appears as the "same file" from your perspective.


I was thinking that snapshots could be a way to allow us to store more versions of these large tar files by storing the difference between versions - is this possible? Am I misunderstanding what can be achieved in this case?
Possible with a tool like rsync which supports "in-place" writes.
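
To see what each iteration is costing you, something like this can help (the dataset name is just an example; a snapshot's USED column is the space unique to that snapshot):

zfs list -t snapshot -o name,used,referenced -r tank/backups
zfs list -o space tank/backups

With whole-file rewrites you'll see snapshot space climb by roughly a full tar per retained version; with in-place writes it should only grow by the changed blocks.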
 

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
Such tar files would have to be backed up "in-place".

This is possible with a tool like rsync, which has an option to write the changes to the existing file "in-place". (It will inspect every 128 KiB with a checksum to know whether or not that segment needs to be modified, and it will append to the file as needed if the size is larger.)

This can take longer than simply sending the whole file from scratch, but it leverages block-based CoW filesystems, such as ZFS and Btrfs; and yields much more efficient snapshots.

Otherwise, as you noted, the snapshots will consume the space of the entire file with each iteration, even if it appears as the "same file" from your perspective.
Interesting - that sounds like it would work well.

The performance impact isn't as important as the disk space efficiency, so that might be something to test before exploring the deduplication option.

Thanks very much for your suggestion.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
Make sure to explicitly invoke the --inplace and --no-whole-file flags in the rsync command.
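
Something along these lines, for example (the paths are just placeholders for wherever the tar gets produced and the NFS-mounted dataset):

rsync -a --inplace --no-whole-file /path/to/backup.tar /mnt/backups/

The --no-whole-file part matters because rsync skips its delta-transfer algorithm by default when both paths look local (which an NFS mount does), and without the delta step you're back to rewriting the whole file.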
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
Shoot, I just realized something.

The way you described the backups, it sounds like there might be two problems:
  • You're already using a specific software that generates its own tar files
  • The tar filenames are probably different each time
 

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
Make sure to explicitly invoke the --inplace and --no-whole-file flags in the rsync command.
Thanks very much for this - I'll give it a test and see how it works.

I've not worked out how to quote different replies yet, but regarding your other message: the tar filename will be the same in my testing. Currently the script appends a date stamp to the filename, but I've removed that in my test script as I knew I'd need the same file name - thanks for thinking of this though :)
 

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
It's been a little while for testing but I wanted to update this thread.
@sretalla - thanks very much for the suggestion of deduplication. I think it can make a lot of sense for the backup use case I was testing; however, we're going to have other very mixed workloads, and I think that for the time being I'll avoid the added complexity/implications of deduplication.

@winnielinnie - the suggestion about using rsync has worked a charm. I really appreciate being pointed in the right direction.

There is, however, a consequence: there is an extra step in the process, which slows things down quite a lot.

Ultimately, the trade-off is extra time and work in creating and rsyncing the backup; but the return on that extra investment is the ability to access snapshots of those backups (potentially a lot of them) with minimal extra disk space being used.
This remains untested for now, but I imagine that if I back up that datastore to a cloud provider, we'll see advantages there too, with only the snapshot differences being sent out rather than whole files that are in some cases triple-digit GBs!
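
From what I've read so far, it should be possible to check the size of an incremental send with a dry run before committing to anything (dataset and snapshot names are made up here):

zfs send -nv -i tank/backups@previous tank/backups@latest

which is supposed to report the estimated size of just the changed blocks rather than actually sending anything.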
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
I think with tar files and rsync, once you have 1 byte of difference the rest of the file can be different, so in-place won't get you much then; but it depends on how the tar files are built, and maybe if anything changes then there are lots of changes anyway. I also use rsync via rsnapshot as another arrow in my backup quiver.

There are other backup systems that are more esoteric and IIRC handle partially changed files better. Both borg and restic come to mind; there are probably more, and it depends on your level of comfort with that software too. It has been a while since I looked at them, and zfs snapshots are working for me.
 

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
Thanks for the reply and suggestions samarium, I'll take a look at borg and/or restic at some point.

What does your pipeline look like?

Are you taking a backup and pushing that on to zfs with snapshots enabled? Is your output one large file? And are you successfully copying just the differences?
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
I am using 2 backup methods, with some overlap.

I am using rsnapshot, which basically uses rsync and references a previous version in the backup area, either hard-linking to the existing version or copying the new version. Kind of like a poor man's zfs snapshot, but it can work with hosts that don't have zfs, and doesn't have to be TN either.
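
The relevant part of an rsnapshot config is only a few lines, roughly like this (paths, host and retention counts made up, fields are tab-separated, and this isn't a complete config):

snapshot_root   /mnt/backups/rsnapshot/
retain          daily   7
retain          weekly  4
backup          root@somehost:/home/    somehost/

with cron (or a TN cron job) running rsnapshot daily and rsnapshot weekly to do the rotation.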

I am also using zfs snapshots, and replicating the snapshots to a zfs backup server.

So in neither case is my output a large file; that would actually make my life more difficult. Not sure why you are doing it that way, but I don't know your environment either.

In the rsync case, it is just the changed files, in a normal hierarchy with a classifier like host/{daily,weekly,monthly}/ prefixing the paths.

Neither rsync nor rsnapshot is new; I've been using them for many years, but then I've been using zfs for many years too. TN has rsync; I don't know about rsnapshot, but TN does have a web UI for setting up rsync replication IIRC.

In the zfs case I'm just pushing the hourly/daily/monthly snapshots to the backup server and it is just the changed blocks, but reconstituted into a normal zfs filesystem hierarchy. The snapshots need to be managed so they don't grow without bound, but that is just scripting, and TN zettarepl does a reasonable job mostly. It sounds like you need to read up on zfs and snapshots too; maybe make some snapshots and experiment with zfs send/recv to get a feel for how it works.
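
A quick hands-on experiment along these lines (pool/dataset names made up) will show you the mechanics:

zfs snapshot tank/test@snap1
# change some data in tank/test, then:
zfs snapshot tank/test@snap2
# full copy of the first snapshot to another pool:
zfs send tank/test@snap1 | zfs recv backup/test
# then only the blocks that changed between the two snapshots:
zfs send -i tank/test@snap1 tank/test@snap2 | zfs recv backup/test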
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
@T_PT , if you copy the individual files instead of the tar ball (ideally combined with rsync to identify and skip identical files), the snapshots should only contain the differences, thus consuming relatively little space.
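
For example, something like this (source and destination paths are placeholders):

rsync -a --delete /data/to/backup/ /mnt/tank/backups/data/

Unchanged files are skipped by rsync's default size+mtime check, so only new or changed files touch new blocks, and with --delete anything removed at the source disappears from the live dataset but stays reachable in older snapshots.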
 

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
@T_PT , if you copy the individual files instead of the tar ball (ideally combined with rsync to identify and skip identical files), the snapshots should only contain the differences, thus consuming relatively little space.
There are many ways to skin this cat!
I'm starting from the point of trying not to mess with the existing process too much, but that might also be worth investigating. TrueNAS is opening up a lot of possibilities to tweak things for efficiencies in all kinds of areas :)
 