Adding / removing files from an existing .zip archive, how does ZFS handle it?

Joined
Oct 22, 2019
Messages
3,641
Since ZFS is copy-on-write (COW), and snapshots are done at the block level (unlike other alternatives that rely on the file level), what is "supposed" to happen when you update an existing .zip archive by adding more files to it?

Assume the following:
  • The .zip archive is using a default compression level (it's not simply "stored" like a .tar archive)
  • The .zip archive already contains many files
  • The .zip archive is decently large, say over 1GB
  • You're not using deduplication
When you modify this compressed archive by removing files (somewhere in the middle) and/or adding new files to it, does ZFS make a copy of the entire 1GB .zip archive each time? Or only the sections that it senses are being modified?

From what I understand, it's the latter (only the modified parts), yet doesn't that depend on the software being used, such as the third-party compression program?

What if the program doesn't "play nice" with ZFS and creates a temporary copy of the archive as it's being modified, then deletes the original archive, then finally renames the temporary version back to the original file name? Wouldn't that look like a "new" file to ZFS (with new blocks), and thus taking a snapshot would take up a lot of extra space (nearly 1GB) even though only a portion of the archive was truly modified?
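(For reference, this is roughly how I'd check how much space each snapshot ends up holding; the pool/dataset names below are just placeholders.)
Code:
# "USED" on a snapshot shows the space that only that snapshot is pinning
zfs list -t snapshot -o name,used,referenced -r tank/archives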
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
When you modify this compressed archive by removing files (somewhere in the middle) and/or adding new files to it, does ZFS make a copy of the entire 1GB .zip archive each time? Or only the sections that it senses are being modified?

From what I understand, it's the latter (only the modified parts), yet doesn't that depend on the software being used, such as the third-party compression program?

What if the program doesn't "play nice" with ZFS and creates a temporary copy of the archive as it's being modified, then deletes the original archive, then finally renames the temporary version back to the original file name? Wouldn't that look like a "new" file to ZFS (with new blocks), and thus taking a snapshot would take up a lot of extra space (nearly 1GB) even though only a portion of the archive was truly modified?
I agree with everything you have said.

If the zip software handles things at block level, only modified blocks will be worked with and hence written (into blocks not previously used).

If the zip software makes a new file and then moves that "copy" into place over the top of the original, the original and the copy will both be in separate blocks, so space consumption is a consideration.
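One way to see which of those two cases actually happened is the written@snapshot property; a rough sketch with made-up dataset/snapshot names:
Code:
# take a snapshot before touching the archive
zfs snapshot tank/archives@before-edit
# ...modify the .zip...
# how much new data exists relative to that snapshot?
# a few MB suggests in-place record updates; roughly the archive size suggests a full rewrite
zfs get written@before-edit tank/archives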
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
When you modify this compressed archive by removing files (somewhere in the middle) and/or adding new files to it, does ZFS make a copy of the entire 1GB .zip archive each time? Or only the sections that it senses are being modified?

From what I understand, it's the latter (only the modified parts), yet doesn't that depend on the software being used, such as the third-party compression program?

Correct. ZFS doesn't do anything with the file; it can only do things to the records the file is composed of.

But every archival/zip program I've seen does its own equivalent of "copy on write" at the application level, creating a new file called SomeGibberish.tmp and then, when the operation successfully completes, deleting the old file and renaming the .tmp - which plays out exactly as you've outlined for ZFS with snapshots in place.
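In shell terms, the pattern is roughly this (file names invented for illustration):
Code:
# application-level "copy on write": rebuild into a temp file, then swap it into place
cp archive.zip SomeGibberish.tmp     # or write a fresh archive into the temp file
# ...apply the add/remove operations to SomeGibberish.tmp...
rm archive.zip
mv SomeGibberish.tmp archive.zip     # to ZFS, every record of the result is brand new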

Amusingly enough, you'd be in better shape from a ZFS space utilization/snapshot perspective if the compression software literally wrote thick zeroes over the portion of the .zip that had the deleted file. ZFS would catch those zeroes and compress them into nothing (at record-level granularity, mind you) and only have those new empty records in your latest snapshot.
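Purely to illustrate that point (this would of course trash a real archive; the path and offsets are made up):
Code:
# overwrite 50MB in the middle of the file with zeroes, without truncating it
dd if=/dev/zero of=archive.zip bs=1M seek=100 count=50 conv=notrunc
# with compression enabled on the dataset, those all-zero records occupy next to nothing on disk
du -h archive.zip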

Unfortunately it's a consequence of copy-on-write on top of copy-on-write when snapshots come into play - not really an easy way to avoid this.
 
Joined
Oct 22, 2019
Messages
3,641
If the zip software makes a new file and then moves that "copy" into place over the top of the original, the original and the copy will both be in separate blocks, so space consumption is a consideration.
That's what I was afraid of, which ties into @HoneyBadger's point:


But every archival/zip program I've seen does its own equivalent of "copy on write" at the application level, creating a new file called SomeGibberish.tmp and then, when the operation successfully completes, deleting the old file and renaming the .tmp - which plays out exactly as you've outlined for ZFS with snapshots in place.
Again, this was my fear, as it appears to be the de facto method for archiving/compressing tools. Likely because such programs were designed before ZFS became widely adopted? It's unfortunate, since it decreases performance and efficiency on a file system which is designed for performance and efficiency. (And to be clear, I am referring to modifying a .zip file over SMB, in-place. I'm not referring to downloading the file, modifying it, and then re-sending it via SMB.)


Amusingly enough, you'd be in better shape from a ZFS space utilization/snapshot perspective if the compression software literally wrote thick zeroes over the portion of the .zip that had the deleted file. ZFS would catch those zeroes and compress them into nothing (at record-level granularity, mind you) and only have those new empty records in your latest snapshot.

So a natural consequence of this is that by taking snapshots over time, older snapshots would consume more space as they still have pointers to the "original" .zip archive records (yet in reality, you only modified a small portion of it, but the archival program is creating new .tmp files every time there is an operation done to the original .zip.)


Unfortunately it's a consequence of copy-on-write on top of copy-on-write when snapshots come into play - not really an easy way to avoid this.
Then wouldn't making a dedicated dataset with high compression (e.g., zstd) essentially achieve the same amount of saved space, but with the added benefit that you retain ZFS's features for managing snapshots and records? Consider these two scenarios:


Scenario A, numerous files and folders, each "archive" as a single .zip archive in a standard dataset
  • You have large folders with numerous files within, some amounting to large sizes overall
  • You compress each folder into its own large, highly compressed .zip archive
  • You store these .zip archive files into a dataset named archives
  • This dataset uses the default compression option (lz4)
  • Occasionally, you might need to remove, add, or modify a file within a .zip archive

Scenario B, numerous files and folders, each "archive" getting its own folder in a highly compressed dataset
  • You have large folders with numerous files within, some amounting to large sizes overall
  • You treat each "archive" as its own folder, with many files and subfolders within
  • You store these folders into a dataset named archives
  • This dataset uses high compression (zstd)
  • Occasionally, you might need to remove, add, or modify a file within a folder

How accurate is it to say that Scenario A has the "advantage" of far fewer individual files (since each .zip is a single file, and the hundreds or thousands of files within each .zip archive are not part of the dataset's file system table)?

If that's not a real benefit, then Scenario B technically has only benefits and no disadvantages compared to Scenario A?

However, there might still be a problem when rsync'ing something such as Thunderbird email folders. Each folder (Inbox, Sent, Trash) is its own single file, and this file can be modified and compacted (such as when you delete emails, send emails, receive new emails). Doesn't the same issue arise (of inefficient snapshots) if you regularly rsync your Thunderbird email folders, and make regular snapshots of this dataset?

Would not the same unavoidable problem arise when doing daily rsyncs to the NAS if you happen to have very large compressed .zip files that are modified somewhere in the middle?

EDIT: Or is this not the case with rsync if using its --inplace option? (As opposed to the default, which, as I understand it, creates temporary files on the destination when files have changed on the source.)
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Again, this was my fear, as it appears to be the de facto method for archiving/compressing tools. Likely because such programs were designed before ZFS became widely adopted? It's unfortunate, since it decreases performance and efficiency on a file system which is designed for performance and efficiency. (And to be clear, I am referring to modifying a .zip file over SMB, in-place. I'm not referring to downloading the file, modifying it, and then re-sending it via SMB.)

Realistically they're designed this way because most filesystems aren't copy-on-write, and implementing it at the application layer is the filesystem-agnostic way to guarantee the user's existing file doesn't get clobbered if the archive program crashes out.

So a natural consequence of this is that by taking snapshots over time, older snapshots would consume more space as they still have pointers to the "original" .zip archive records (yet in reality, you only modified a small portion of it, but the archival program is creating new .tmp files every time there is an operation done to the original .zip.)

Bingo. If you update that 1GB zip file between each snapshot, each snapshot will most likely be holding the entire old revision, rather than just the few records that changed.

Then wouldn't making a dedicated dataset with high compression (e.g., zstd) essentially achieve the same amount of saved space, but with the added benefit that you retain ZFS's features for managing snapshots and records? Consider these two scenarios:


Scenario A, numerous files and folders, each "archive" as a single .zip archive in a standard dataset
  • You have large folders with numerous files within, some amounting to large sizes overall
  • You compress each folder into its own large, highly compressed .zip archive
  • You store these .zip archive files into a dataset named archives
  • This dataset uses the default compression option (lz4)
  • Occasionally, you might need to remove, add, or modify a file within a .zip archive

Scenario B, numerous files and folders, each "archive" getting its own folder in a highly compressed dataset
  • You have large folders with numerous files within, some amounting to large sizes overall
  • You treat each "archive" as its own folder, with many files and subfolders within
  • You store these folders into a dataset named archives
  • This dataset uses high compression (zstd)
  • Occasionally, you might need to remove, add, or modify a file within a folder

How accurate is it to say that Scenario A has the "advantage" of far fewer individual files (since each .zip is a single file, and the hundreds or thousands of files within each .zip archive are not part of the dataset's file system table)?

If that's not a real benefit, then Scenario B technically has only benefits and no disadvantages compared to Scenario A?

Scenario B will save more physical space on the ZFS pool, as it won't have the snapshot bloat that Scenario A does. Logical used space may be larger, but ultimately you can just let compression do its thing. The volume of small files may be a concern - single larger archives/.tars often transfer more efficiently than hordes of small files - but that's a tradeoff whose pros and cons you'll have to weigh.
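Scenario B is essentially just a dataset with zstd enabled, something along these lines (pool/dataset names are examples only):
Code:
zfs create -o compression=zstd tank/archives
# check how well it is paying off over time
zfs get compression,compressratio tank/archives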

However, there might still be a problem when rsync'ing something such as Thunderbird email folders. Each folder (Inbox, Sent, Trash) is its own single file, and this file can be modified and compacted (such as when you delete emails, send emails, receive new emails). Doesn't the same issue arise (of inefficient snapshots) if you regularly rsync your Thunderbird email folders, and make regular snapshots of this dataset?

Would not the same unavoidable problem arise when doing daily rsyncs to the NAS if you happen to have very large compressed .zip files that are modified somewhere in the middle?

EDIT: Or is this not the case with rsync if using its --inplace option? (As opposed to the default, which, as I understand it, creates temporary files on the destination when files have changed on the source.)

You're continuing to hit the nail on the head here, both with the identified issue (a small change in a large file, if rewritten by the application, makes ZFS treat it as a whole new one) and the solution (overwrite in place wherever possible, causing fewer new records and therefore less churn in the snapshots)

rsync --inplace will get around this, but notably if you are targeting a local filesystem you need to explicitly add --no-whole-file as well. If you're rsync'ing to an NFS or SMB mount point it should imply it, but there's no harm in specifying it as well. Bear in mind, though, that an interrupted network or rsync job using --inplace means corruption at the target end, since your data was overwritten in place and will therefore be inconsistent.
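For example (hostname and paths below are placeholders), a job like this only rewrites the changed regions of each file on the ZFS side:
Code:
rsync -aH --inplace --no-whole-file /home/user/mail/ user@nas:/mnt/tank/mail/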

Edit: Although if it's an encrypted archive, something as mundane as changing a few bits may result in much larger changes to the file.
 
Last edited:
Joined
Oct 22, 2019
Messages
3,641
Thank you so much for your follow-up, @HoneyBadger! :smile: This helps clear up some things, and now I have to sit back and really rethink my strategies for using my NAS server (with ZFS). See my "wish" at the end of this post.

You're continuing to hit the nail on the head here, both with the identified issue (a small change in a large file, if rewritten by the application, makes ZFS treat it as a whole new one)


rsync --inplace will get around this, but notably if you are targeting a local filesystem you need to explicitly add --no-whole-file as well. If you're rsync'ing to an NFS or SMB mount point it should imply it, but there's no harm in specifying it as well. Bear in mind, though, that an interrupted network or rsync job using --inplace means corruption at the target end, since your data was overwritten in place and will therefore be inconsistent.

I just confirmed this with some tests, and it appears this is unfortunately the reality: there is a trade-off between "space-savings" and "data integrity" when transferring files from a non-ZFS source to a ZFS destination. I'm mentally and emotionally split on this issue. :frown:


Here is what my test unearthed...

My Thunderbird inbox is a single file named "Inbox", around 100MB in size.

As it exists on my dataset, it sits there as a 100MB file. I take a snapshot of this dataset and name it @test-snap.

Looking at the snapshot @test-snap immediately afterwards, it reports 0MB "Used" (basically nothing).

If I delete random emails somewhere in the middle of my Thunderbird inbox, and then rsync to the server with the --inplace option, the "Used" size of the snapshot @test-snap only grows by 1MB or less with each successful rsync. Barely anything.

Yet, if I delete a random email from my inbox and then rsync to the server without the --inplace option? After a successful rsync, I check the "Used" size of @test-snap, and it is now consuming just over 100MB! (Simply because I did not use the --inplace option this time.)
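For anyone who wants to reproduce this, the gist of the test was the following (the dataset names and the local Thunderbird path are simplified placeholders):
Code:
# on the TrueNAS server: snapshot the dataset holding the mail files
zfs snapshot tank/mail@test-snap
# on the PC: delete a few emails in Thunderbird, then push the changed Inbox file
rsync -a --inplace --no-whole-file /home/user/thunderbird-profile/Inbox user@nas:/mnt/tank/mail/
# back on the server: check how much space @test-snap is now pinning
zfs list -t snapshot -o name,used -r tank/mail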


Bear in mind, though, that an interrupted network or rsync job using --inplace means corruption at the target end, since your data was overwritten in place and will therefore be inconsistent.
Wouldn't a subsequent rsync pass detect this (if there has been an update to the source file), and "attempt" to update it in-place, but then simply overwrite the entire file since it cannot do an in-place update? I'm not sure how gracefully rsync deals with interrupted transfers / resumes when using the --inplace option. I wouldn't mind the extra cost of transferring a large file all over again for those occasions when a previous transfer was interrupted. (If rsync is unable to overwrite such a corrupted file, then I need to figure out an alternative, or bite the bullet and deal with the extra wasted space by not using --inplace for my rsync transfers.) <--- Trying to research this on the web brings up even more confusion from others, some posts are 6 years old.



So here's my Christmas wish this year and every year: all operating systems would serve everyone better if they supported ZFS natively (and used it as the default file system). It would make tools like rsync and Windows Backup redundant (and pointless).

Imagine a world where backing up to an external USB drive, or to a NAS server, was simply a matter of an incremental send between two snapshots.
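In other words, every backup would boil down to something like this (dataset names made up):
Code:
zfs snapshot tank/home@today
zfs send -i tank/home@yesterday tank/home@today | ssh user@nas zfs receive backup/home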

For those less savvy, it wouldn't be a stretch for software developers to create pretty GUIs labeled as "backup software" which, in reality, use ZFS snapshots and incremental sends under the hood. Heck, even compression and archive software would not need to implement their own "file-based copy-on-write" mechanisms.

A pure, only-ZFS world...

As it stands today, it seems like ZFS is relegated to specialty systems. It's not as if I leverage it on my laptop or desktop computer: it's only ever used by my NAS server. In order to back up, or send and receive, to and from the NAS box, we're essentially using middlemen: SMB, NFS, rsync, etc. How I would love to set up scheduled, incremental ZFS snapshot sends from my Windows PC to my NAS box.

(I'm no lawyer, but is it more of a legal reason than a technical one why this isn't the case and likely never will be?) :confused:
 
Last edited:
Joined
Oct 22, 2019
Messages
3,641
Bear in mind, though, that an interrupted network or rsync job using --inplace means corruption at the target end, since your data was overwritten in place and will therefore be inconsistent.

Wouldn't a subsequent rsync pass detect this (if there has been an update to the source file), and "attempt" to update it in-place, but then simply overwrite the entire file since it cannot do an in-place update? I'm not sure how gracefully rsync deals with interrupted transfers / resumes when using the --inplace option. I wouldn't mind the extra cost of transferring a large file all over again for those occasions when a previous transfer was interrupted.



UPDATE
I want to drop in with an update in regard to the previous points, which I quoted above.




I bring good news! Hopefully others might find relief in what my tests unearthed. :cool:



Long story short: I tried as hard as possible to corrupt a large .zip file (1GB). Not only did I interrupt the rsync process, I even pulled the ethernet cable during some transfers. I made all kinds of modifications to the .zip (removing files from the beginning, middle, and end; adding new files to it; and so on), and I even ran interrupted rsync transfers between modifications to the large .zip file. Basically, I tried to simulate the most unrealistic, worst-case scenarios.



As you can see below, I used the --inplace and --no-whole-file options (even though --no-whole-file is implied between remote targets). While I used SSH, I'm sure the same test would be successful with Rsync Daemon Mode on the TrueNAS server:
Code:
rsync -v -a -H -h --progress --inplace --no-whole-file /home/user/testdir/ user@ipaddr:/mnt/mainpool/playground/testdir/




No matter what I tried, as long as I eventually allowed rsync to complete uninterrupted, the resulting 1GB .zip file on the server matched the SHA1 checksum of the copy on my PC.

It didn't matter how many times I interrupted a previous transfer or how I modified the .zip file between attempted transfers. Once rsync had a chance to finish uninterrupted, the resulting file on the server was an exact copy (SHA1-confirmed) of the one on my PC.

I believe this happens because, by default, rsync checks the file's size and modification time. If either does not match the file on the source, rsync will transfer it again. (After all, an interrupted transfer may leave a "corrupted" file on the destination, but that corrupted file is going to have a different size and/or modification time than the source file, so the next time rsync runs, it will attempt to transfer that file again.)

Regardless of how much of the file has changed, or whether you use --inplace or not, --partial or not, --whole-file or not, the end result is essentially the same: an exact replica of the file from the source. How efficient this is or how long it takes matters less to me than the integrity of the resulting file after a successful rsync transfer, which according to my tests seems to work reliably! :grin:
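The check after each completed run was simply comparing checksums on both ends (the archive name is a placeholder; the directories are the ones from the rsync command above):
Code:
sha1sum /home/user/testdir/bigarchive.zip
ssh user@ipaddr sha1sum /mnt/mainpool/playground/testdir/bigarchive.zip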



I even redid my test comparing snapshot sizes with and without the --inplace option.

Without --inplace, snapshots are significantly larger! Even a minor addition to the .zip file makes a previous snapshot jump in size from almost nothing to 1GB! However, when using the --inplace option, a minor addition to the .zip file only makes the previous snapshot grow in size by mere kilobytes (since I only added a small file into the .zip archive.)



CONCLUSION
Using rsync's --inplace option appears to be the preferable method of rsync'ing to a TrueNAS server
(...or any ZFS / copy-on-write file system)
 
Last edited: