How to leverage periodic snapshots for large tar backups?

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
Hi,

I'm not finding an example of this question using the search function, so apologies if I'm missing an existing post/discussion that answers this.

We have some backup jobs that output very large tar files; there aren't significant changes in the underlying data from week to week, and some of them remain unchanged. They are currently backed up onto NAS devices, but I'm experimenting with possibilities for using TrueNAS and wondering how I can leverage the advantages of ZFS over a standard NAS.

I was thinking that snapshots could be a way to allow us to store more versions of these large tar files by storing the difference between versions - is this possible? Am I misunderstanding what can be achieved in this case?

I've set up a periodic snapshot on a dataset and I've used that dataset as a target for the backup tar files, but each snapshot is the entire size of the backup file.

I've set up a periodic snapshot task to run every minute (just for testing) and disabled "Allow Taking Empty Snapshots". I'm able to run the script to back up to tar and write successfully over NFS to the datastore and subsequent runs of that backup job replace the existing tars.

I'm guessing this is because the entire tar file is written, replacing the previous version, rather than this being an edit of an existing file.

Have I made an error in my config or am I trying to do something that isn't possible?

Thanks very much for reading this far :)
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I was thinking that snapshots could be a way to allow us to store more versions of these large tar files by storing the difference between versions - is this possible? Am I misunderstanding what can be achieved in this case?
I think maybe you're confusing snapshots with deduplication... completely different things (although can be used together).

I've set up a periodic snapshot on a dataset and I've used that dataset as a target for the backup tar files, but each snapshot is the entire size of the backup file.

I've set up a periodic snapshot task to run every minute (just for testing) and disabled "Allow Taking Empty Snapshots". I'm able to run the script to back up to tar and write successfully over NFS to the datastore and subsequent runs of that backup job replace the existing tars.

I'm guessing this is because the entire tar file is written, replacing the previous version, rather than this being an edit of an existing file.
Right, the tar file being written in its entirety forces the "changed" contents of the disk to be included in the snapshot... the entire tar file.

Have I made an error in my config or am I trying to do something that isn't possible?
You seem to be trying to get deduplication, so it's not so much impossible as a different thing from what you're trying... and a resource-hungry one that will not work well if you don't set it up right.
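
If you want a rough idea of what dedup would actually save on your data before enabling it (zfs set dedup=on on the dataset), you can simulate it against an existing pool; the pool name here is just an example, and it can take a while and a fair amount of RAM on a big pool:

zdb -S tank

That prints a simulated dedup table histogram and an estimated dedup ratio without changing anything on the pool.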

You might want to get very familiar with the concepts and requirements before trying that with any real data, but here's a pointer to something that may help:
 

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
I think maybe you're confusing snapshots with deduplication... completely different things (although can be used together).


Right, the tar file being written in its entirety forces the "changed" contents of the disk to be included in the snapshot... the entire tar file.


You seem to be trying to get deduplication, so it's not so much impossible as a different thing from what you're trying... and a resource-hungry one that will not work well if you don't set it up right.

You might want to get very familiar with the concepts and requirements before trying that with any real data, but here's a pointer to something that may help:
Thanks very much for taking the time to respond.

I was deliberately avoiding dedupe initially as I'd read a few things that seemed to suggest it wasn't worth the effort for most use cases; although now you've mentioned it, I can see how, in this particular use case, block-level deduplication could be exactly the answer I need (assuming I stick to outputting a large tar).

I'm just pushing additional jobs out to a couple of TrueNAS test boxes on commodity hardware at the minute, before we commit to a hardware spec and funding. So while it's real data, it's not our actual backup; it's a learning exercise, a test of what is achievable, and a chance for me to make mistakes before migrating real systems!

I'll read more on dedupe and test that out initially, although I may end up approaching this a different way; I need to test whether it's even worth creating the tar.

Thanks again for your assistance, I appreciate it.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
I'm guessing this is because the entire tar file is written, replacing the previous version, rather than this being an edit of an existing file.
Such tar files would have to be backed up "in-place".

This is possible with a tool like rsync, which has an option to write the changes to the existing file "in-place". (It will inspect every 128 KiB with a checksum to know whether or not that segment needs to be modified, and it will append to the file as needed if the size is larger.)

This can take longer than simply sending the whole file from scratch, but it leverages block-based CoW filesystems, such as ZFS and Btrfs; and yields much more efficient snapshots.

Otherwise, as you noted, the snapshots will consume the space of the entire file with each iteration, even if it appears as the "same file" from your perspective.


I was thinking that snapshots could be a way to allow us to store more versions of these large tar files by storing the difference between versions - is this possible? Am I misunderstanding what can be achieved in this case?
Possible with a tool like rsync which supports "in-place" writes.
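
To see what each iteration is costing you, something like this can help (the dataset name is just an example; a snapshot's USED column is the space unique to that snapshot):

zfs list -t snapshot -o name,used,referenced -r tank/backups
zfs list -o space tank/backups

With whole-file rewrites you'll see snapshot space climb by roughly a full tar per retained version; with in-place writes it should only grow by the changed blocks.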
 

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
Such tar files would have to be backed up "in-place".

This is possible with a tool like rsync, which has an option to write the changes to the existing file "in-place". (It will inspect every 128 KiB with a checksum to know whether or not that segment needs to be modified, and it will append to the file as needed if the size is larger.)

This can take longer than simply sending the whole file from scratch, but it leverages block-based CoW filesystems, such as ZFS and Btrfs; and yields much more efficient snapshots.

Otherwise, as you noted, the snapshots will consume the space of the entire file with each iteration, even if it appears as the "same file" from your perspective.
Interesting - that sounds like it would work well.

The performance impact isn't as important as the disk space efficiency, so that might be something to test before exploring the deduplication option.

Thanks very much for your suggestion.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
Make sure to explicitly invoke the --inplace and --no-whole-file flags in the rsync command.
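
Something along these lines, for example (the paths are just placeholders for wherever the tar gets produced and the NFS-mounted dataset):

rsync -a --inplace --no-whole-file /path/to/backup.tar /mnt/backups/

The --no-whole-file part matters because rsync skips its delta-transfer algorithm by default when both paths look local (which an NFS mount does), and without the delta step you're back to rewriting the whole file.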
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
Shoot, I just realized something.

The way you described the backups, it sounds like there might be two problems:
  • You're already using a specific software that generates its own tar files
  • The tar filenames are probably different each time
 

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
Make sure to explicitly invoke the --inplace and --no-whole-file flags in the rsync command.
Thanks very much for this - I'll give it a test and see how it works.

I've not worked out how to quote different replies yet, but regarding your other message: the tar filename will be the same in my testing. Currently the script appends a date stamp to the filename, but I've removed that in my test script as I knew I'd need the same file name - thanks for thinking of this though :)
 

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
It's been a little while for testing but I wanted to update this thread.
@sretalla - thanks very much for the suggestion of deduplication. I think it can make a lot of sense for the backup use case I was testing; however, we're going to have other very mixed workloads, and I think that for the time being I'll avoid the added complexity/implications of deduplication.

@winnielinnie - the suggestion about using rsync has worked a charm. I really appreciate being pointed in the right direction.

There is, however, a consequence: there is an extra step in the process, which slows things down quite a lot.

Ultimately, the trade-off is extra time and work in creating and rsyncing the backup; but the return on that extra investment is the ability to access snapshots of those backups (potentially a lot of them) with minimal extra disk space being used.
This remains untested for now, but I imagine that if I back up that datastore to a cloud provider, we'll see advantages there too, with only the snapshot differences being sent out rather than whole files that are in some cases triple-digit GBs!
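
From what I've read so far, it should be possible to check the size of an incremental send with a dry run before committing to anything (dataset and snapshot names are made up here):

zfs send -nv -i tank/backups@previous tank/backups@latest

which is supposed to report the estimated size of just the changed blocks rather than actually sending anything.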
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
I think with tar files and rsync, once you have 1 byte of difference the rest of the file can be different, so in-place won't get you much then; but it depends on how the tar files are built, and maybe if anything changes then there are lots of changes anyway. I also use rsync via rsnapshot as another arrow in my backup quiver.

There are other backup systems that are more esoteric and IIRC handle partially changed files better. Both borg and restic come to mind; there are probably more, and it depends on your level of comfort with that software too. It has been a while since I looked at them, and zfs snapshots are working for me.
 

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
Thanks for the reply and suggestions samarium, I'll take a look at borg and/or restic at some point.

What does your pipeline look like?

Are you taking a backup and pushing that on to zfs with snapshots enabled? Is your output one large file? And are you successfully copying just the differences?
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
I am using 2 backup methods, with some overlap.

I am using rsnapshot, which basically uses rsync and references a previous version in the backup area, either hard-linking to the existing version or copying the new version. Kind of like a poor man's zfs snapshot, but it can work with hosts that don't have zfs, and doesn't have to be TN either.
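
The relevant part of an rsnapshot config is only a few lines, roughly like this (paths, host and retention counts made up, fields are tab-separated, and this isn't a complete config):

snapshot_root   /mnt/backups/rsnapshot/
retain          daily   7
retain          weekly  4
backup          root@somehost:/home/    somehost/

with cron (or a TN cron job) running rsnapshot daily and rsnapshot weekly to do the rotation.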

I am also using zfs snapshots, and replicating the snapshots to a zfs backup server.

So in neither case is my output a large file; that would actually make my life more difficult. Not sure why you are doing it that way, but I don't know your environment either.

In the rsync case, it is just the changed files, in a normal hierarchy with a classifier like host/{daily,weekly,monthly}/ prefixing the paths.

Neither rsync nor rsnapshot is new; I've been using them for many years, but then I've been using zfs for many years too. TN has rsync; I don't know about rsnapshot, but TN does have a web UI for setting up rsync replication IIRC.

In the zfs case I'm just pushing the hourly/daily/monthly snapshots to the backup server and it is just the changed blocks, but reconstituted into a normal zfs filesystem hierarchy. The snapshots need to be managed so they don't grow without bound, but that is just scripting, and TN zettarepl does a reasonable job mostly. It sounds like you need to read up on zfs and snapshots too; maybe make some snapshots and experiment with zfs send/recv to get a feel for how it works.
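
A quick hands-on experiment along these lines (pool/dataset names made up) will show you the mechanics:

zfs snapshot tank/test@snap1
# change some data in tank/test, then:
zfs snapshot tank/test@snap2
# full copy of the first snapshot to another pool:
zfs send tank/test@snap1 | zfs recv backup/test
# then only the blocks that changed between the two snapshots:
zfs send -i tank/test@snap1 tank/test@snap2 | zfs recv backup/test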
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
@T_PT , if you copy the individual files instead of the tar ball (ideally combined with rsync to identify and skip identical files), the snapshots should only contain the differences, thus consuming relatively little space.
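
For example, something like this (source and destination paths are placeholders):

rsync -a --delete /data/to/backup/ /mnt/tank/backups/data/

Unchanged files are skipped by rsync's default size+mtime check, so only new or changed files touch new blocks, and with --delete anything removed at the source disappears from the live dataset but stays reachable in older snapshots.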
 

T_PT

Dabbler
Joined
Mar 20, 2023
Messages
20
@T_PT , if you copy the individual files instead of the tar ball (ideally combined with rsync to identify and skip identical files), the snapshots should only contain the differences, thus consuming relatively little space.
There are many ways to skin this cat!
I'm starting from the point of trying not to mess with the existing process too much, but that might also be worth investigating. TrueNAS is opening up a lot of possibilities to tweak things for efficiencies in all kinds of areas :)
 