Archive or deduplication + integrated compression

K4M1L

Cadet
Joined
Jul 31, 2022
Messages
5
I am using TrueNAS as a backup server for some websites, and recently a question came to mind: what is more space efficient? Compressing a website (WordPress) into a zip archive, or copying it as files (thousands of little text files plus even more images) and running deduplication and compression on TrueNAS? Bear in mind there are hundreds of websites, some using the same plugins and core. They are not stored under the same user, and they sit in separate datasets under a parent dataset. Currently I am storing zip files, but I am wondering if there isn't a better way.
I don't know if it's necessary to disclose the hardware, but it runs dual socket, 8 cores, with 48 GB of RAM per core.
 
Joined
Oct 22, 2019
Messages
3,641
Using inline compression (LZ4, ZSTD) is the most "streamlined" way to do it, and you'll always have access to the individual files without needing to decompress a giant archive file. The larger the recordsize policy on the dataset (e.g., 1 MiB records), the more efficiently inline compression can work.

However, if this is only an occasional backup/archive, then consolidating the thousands of files into single archive files means less filesystem overhead.

Either way, deduplication might not be worth it and may introduce more issues down the line. It also requires more RAM for the dedup tables.

For write-once, read-many (or rarely) "archives", using a high-compression ZSTD level + 1 MiB recordsize is probably the best combination.
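For reference, that's just two dataset properties. A minimal sketch, assuming a dataset named tank/backups (substitute your own pool/dataset names); the settings only affect data written after the change:

    # 1 MiB records + high-level ZSTD on the backup dataset
    zfs set recordsize=1M tank/backups
    zfs set compression=zstd-19 tank/backups

    # verify, and check the achieved ratio once some data has been written
    zfs get recordsize,compression,compressratio tank/backups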
 
Last edited:

K4M1L

Cadet
Joined
Jul 31, 2022
Messages
5
It's a daily or weekly backup and the file sizes are in KB at most (pictures are bigger). I have already set a 1 MiB record size. It's almost never read. So is inline compression more space efficient than a normal zip? Or is it only a convenience?
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
It's a daily or weekly backup and the file sizes are in KB at most (pictures are bigger). I have already set a 1 MiB record size. It's almost never read. So is inline compression more space efficient than a normal zip? Or is it only a convenience?
I'd suggest you do the calculations based on storage capacity needed.
If you do daily backups, do you have a solution to avoid storing the same file every day?
(for example rsync with daily snapshots is space and bandwidth efficient)
If not, deduplication may be worthwhile.
 
Joined
Oct 22, 2019
Messages
3,641
How did you quote their post about the 1 MiB record size? I never read anything about what the record size was set to...
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
How did you quote their post about the 1 MiB record size? I never read anything about what the record size was set to...
See the post above mine.

The 1MB record size is probably too large to be maximally efficient for storage space.
 

K4M1L

Cadet
Joined
Jul 31, 2022
Messages
5
I'd suggest you do the calculations based on storage capacity needed.
If you do daily backups, do you have a solution to avoid storing the same file every day?
(for example rsync with daily snapshots is space and bandwidth efficient)
If not, deduplication may be worthwhile.
That's the issue. There is no feasible way for me to avoid storing the same file every day. Bandwidth is not a problem. Size is. Will dedup do the trick? In principle it sounds perfect: dedup plus inline compression. But it might not work out that way since it's block-level dedup. Are there any tools to calculate it or should I just try?
 

K4M1L

Cadet
Joined
Jul 31, 2022
Messages
5
IDK how to edit my post, sorry. I should've mentioned that the only limiting factors hardware-wise are drive capacity and the IOPS of spinning rust. The CPUs are powerful enough, and I have some memory to add if necessary.
 
Joined
Oct 22, 2019
Messages
3,641
The 1MB record size is probably too large to be maximally efficient for storage space.
Not to veer too far off topic, but if no modifications are being made to any file ("write once to archive and/or maybe read later"), then a 1 MiB recordsize policy won't have a negative impact on storage space efficiency. In fact, it can further help if there are compressible files.

Assuming ashift=12 (4 KiB blocks), on a dataset with a recordsize policy of 1 MiB:
  • a 4 KiB HTML file will only consume 4 KiB of space
  • a 128 KiB HTML file will only consume 128 KiB of space
  • a 512 KiB HTML file will only consume 512 KiB of space
  • a 1 MiB HTML file will consume the full 1 MiB

However, the above assumes no inline compression.

With the last example, the HTML file might only consume 768 KiB of space because the inline compression was able to squeeze the 256 blocks down to 192 blocks.

4 KiB x 256 = 1 MiB (uncompressed)

4 KiB x 192 = 768 KiB (compressed)

So depending on how the archives are being saved/written (assuming few to no modifications), the larger recordsize policy will yield better read performance (if needed later) and allow for more efficient inline compression.

Deduplication can be an extra layer on top of this.
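If you want to sanity-check those numbers yourself, here's a rough sketch, assuming a pool named "tank" (TrueNAS mounts it under /mnt/tank) and a throwaway dataset:

    # scratch dataset with a 1 MiB recordsize and default-level ZSTD
    zfs create -o recordsize=1M -o compression=zstd tank/space-test

    # copy in any sample file, flush writes, then compare logical size
    # against the space actually allocated on disk
    cp index.html /mnt/tank/space-test/
    sync
    ls -l /mnt/tank/space-test/index.html   # logical file size
    du -h /mnt/tank/space-test/index.html   # blocks allocated (after compression)

    zfs get compressratio tank/space-test
    zfs destroy tank/space-test             # clean up when finished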
 
Last edited:

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
That's the issue. There is no feasible way for me to avoid storing the same file every day. Bandwidth is not a problem. Size is. Will dedup do the trick? In principle it sounds perfect: dedup plus inline compression. But it might not work out that way since it's block-level dedup. Are there any tools to calculate it or should I just try?

Rsync with snapshots on the storage target is a more efficient solution for this problem.
Dedup will work, but will require RAM and perhaps a special SSD VDEV.

Perhaps drop down to a 128K record size so dedup has a better chance of matching blocks... unless all the files are well over 4 MB.
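If you do end up testing it, it's just a per-dataset property plus the smaller record size. A sketch with a placeholder dataset name; note that dedup only applies to data written after it's enabled:

    # enable dedup and drop the record size for finer-grained block matching
    zfs set dedup=on recordsize=128K tank/backups

    # pool-wide dedup ratio, once some backups have been written
    zpool get dedupratio tank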
 
Joined
Oct 22, 2019
Messages
3,641
See the post above mine.
Now I see it!

Previously it was empty, and clicking on your "quote" took me to a page that read something like "The requested post could not be found."
 
Joined
Oct 22, 2019
Messages
3,641
Are there any tools to calculate it or should I just try?
Do you have a spare pool to test this on? Or even perhaps create a "dummy" dataset for the sake of testing? (A "dummy" dataset such that you can safely destroy it once you're done playing around with it.)

I've never tried enabling and later disabling Deduplication, but I believe it can safely be disabled (assuming you're not using a special dedup vDev), or you can simply destroy the "dummy" dataset, which likewise will destroy the deduped records. (Again, no personal experience on this. Don't take my word for it.)
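There is also a way to get an estimate without enabling anything: zdb can simulate a dedup table over an existing pool, read-only. A sketch, assuming your pool is named "tank" (on TrueNAS you may need to point zdb at the system's zpool.cache with -U if it can't find the pool):

    # builds a simulated DDT and prints a histogram plus an estimated
    # dedup ratio at the end; read-only, but slow and I/O-heavy on spinning rust
    zdb -S tank

Keep in mind it walks the entire pool, so the estimate covers everything on it, not just one dataset.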



Rsync with snapshots on the storage target is a more efficient solution for this problem.
To expand on @morganL's suggestion, using rsync with the --inplace and --no-whole-file parameters will reduce the undesired effects of CoW if files are modified between runs. (By default, rsync writes a changed file out as a brand-new copy and renames it into place, which rewrites every block and defeats block sharing with earlier snapshots.) You'd then leverage ZFS snapshots to serve as the replacement for discrete .zip archive files.

So instead of the discrete files:
  • archive_2022-07-15-00-00.zip
  • archive_2022-08-01-00-00.zip
  • archive_2022-08-15-00-00.zip

You will instead have the snapshots:
  • @archive_2022-07-15-00-00
  • @archive_2022-08-01-00-00
  • @archive_2022-08-15-00-00

Essentially, any identical records across the snapshots will not take up extra space, which makes Deduplication a moot point and not really worth it for this use-case. And yet, you can treat/access the individual snapshots as if they are distinct read-only .zip archive files of what the website was in that moment in time. :cool:
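For example (paths and snapshot names are placeholders), every snapshot is browsable read-only through the hidden .zfs directory at the root of the dataset:

    # the site exactly as it looked at that backup, no restore step needed
    ls /mnt/tank/backups/example-site/.zfs/snapshot/archive_2022-07-15-00-00/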

It would be up to you to coordinate/automate (or do it manually) taking snapshots after a successful rsync run.
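Something along these lines, as a rough sketch only; the host, paths, and dataset names are placeholders you'd adapt, and you'd run it per site from a cron job:

    #!/bin/sh
    # pull one site, updating files in place so unchanged blocks keep
    # being shared with the blocks referenced by earlier snapshots
    rsync -a --delete --inplace --no-whole-file \
        webserver:/var/www/example-site/ /mnt/tank/backups/example-site/

    # only take the snapshot if the rsync run succeeded
    if [ $? -eq 0 ]; then
        zfs snapshot tank/backups/example-site@archive_$(date +%Y-%m-%d-%H-%M)
    fi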

This also assumes each website archive gets its own dataset, only for itself.

EDIT: About the last point ("This also assumes each website archive gets its own dataset, only for itself"): I re-read your initial post and you mention hundreds of websites, so separate datasets per website doesn't sound feasible anymore. Your snapshots are likely to be "staggered" in relation to the rsync runs, unless you're able to coordinate the rsync runs to complete within a reasonable timeframe, in which case you can "line up" snapshots to reflect successfully completed backups for all websites each time.

And if you do decide to use rsync, you might want to read through this thread, since rsync is very metadata heavy and you will likely see much speedier rsync runs by increasing how much of the ARC is reserved for metadata. The "solution" and further discussion (for those who might benefit from even higher values than the one I'm using) begins at post #9.

 
Last edited:

K4M1L

Cadet
Joined
Jul 31, 2022
Messages
5
[Attached screenshot: 1659466942554.png]

So I've tried these settings. Each dataset contains an identical website. Not a very good result.
 
Joined
Oct 22, 2019
Messages
3,641
So the dataset "test zip" I take it to be a single compressed .zip file of the entire website?

While "test dedup" is the same website, with everything saved using inline ZSTD-19 compression?

All else being equal, the larger size may in fact be due to many small files adding extra overhead with metadata.

However, each subsequent snapshot will take up only minimal extra space if you use this method:
To expand on @morganL's suggestion, using rsync with the --inplace and --no-whole-file parameters will reduce the undesired effects of CoW if files are modified between runs. (By default, rsync writes a changed file out as a brand-new copy and renames it into place, which rewrites every block and defeats block sharing with earlier snapshots.) You'd then leverage ZFS snapshots to serve as the replacement for discrete .zip archive files.

So instead of the discrete files:
  • archive_2022-07-15-00-00.zip
  • archive_2022-08-01-00-00.zip
  • archive_2022-08-15-00-00.zip

You will instead have the snapshots:
  • @archive_2022-07-15-00-00
  • @archive_2022-08-01-00-00
  • @archive_2022-08-15-00-00

Essentially, any identical records across the snapshots will not take up extra space, which makes Deduplication a moot point and not really worth it for this use-case. And yet, you can treat/access the individual snapshots as if they are distinct read-only .zip archive files of what the website was in that moment in time. :cool:

It would be up to you to coordinate/automate (or do it manually) taking snapshots after a successful rsync run.


Whereas, using distinct .zip archives for each additional backup will be much less efficient.


The only pragmatic problem I see is that this might become cumbersome if you try "one dataset for each website". It's still doable, and you can also have them all in a single dataset, but with staggered snapshots.

Deduplication isn't exclusive, one way or the other. But if you go the route of "snapshots instead of distinct .zip archives", you'll essentially be saving a lot of space going forwards, regardless.
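And to see what each additional snapshot actually costs, something like this (dataset name is a placeholder):

    # USED on a snapshot is the space held only by that snapshot, i.e. blocks
    # no longer referenced by the live data or any other snapshot
    zfs list -t snapshot -o name,used,referenced -r tank/backups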
 