Archive or deduplication + integrated compression

K4M1L

Cadet
Joined
Jul 31, 2022
Messages
5
I am using TrueNAS as a backup server for some websites, and recently a question came to mind: what is more space efficient? Compressing a website (WordPress) into a zip archive, or copying it as files (thousands of little text files plus even more images) and running deduplication and compression on TrueNAS? Bear in mind there are hundreds of websites, some using the same plugins and core. They are not stored under the same user, and they sit in separate datasets under a parent dataset. Currently I am storing zip files, but I am wondering if there isn't a better way.
I don't know if it's necessary to disclose the hardware, but it runs dual socket, 8 cores, with 48 GB of RAM per core.
 
Joined
Oct 22, 2019
Messages
3,641
Using inline compression (LZ4, ZSTD) is the most "streamlined" way to do it, and you'll always have access to the individual files without needing to decompress a giant archive file. The larger the recordsize policy on the dataset (e.g., 1 MiB records), the more efficiently inline compression can work.

However, if this is only an occasional backup/archive, then consolidating the thousands of files into single archive files means less filesystem overhead.

Either way, deduplication might not be worth it and may introduce more issues down the line. It also requires more RAM for the dedup tables.

For write-once, read-many (or rarely) "archives", using a high-compression ZSTD level + 1 MiB recordsize is probably the best combination.
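For reference, that's just two dataset properties. A minimal sketch, assuming a dataset named tank/backups (substitute your own pool/dataset names); the settings only affect data written after the change:

    # 1 MiB records + high-level ZSTD on the backup dataset
    zfs set recordsize=1M tank/backups
    zfs set compression=zstd-19 tank/backups

    # verify, and check the achieved ratio once some data has been written
    zfs get recordsize,compression,compressratio tank/backups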
 
Last edited:

K4M1L

Cadet
Joined
Jul 31, 2022
Messages
5
It's a daily or weekly backup and the file sizes are in KB at most (pictures are bigger). I have already set a 1 MiB record size. It's almost never read. So is inline compression more space efficient than a normal zip? Or is it only a convenience?
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
It's a daily or weekly backup and the file sizes are in KB at most (pictures are bigger). I have already set a 1 MiB record size. It's almost never read. So is inline compression more space efficient than a normal zip? Or is it only a convenience?
I'd suggest you do the calculations based on storage capacity needed.
If you do daily backups, do you have a solution to avoid storing the same file every day?
(for example rsync with daily snapshots is space and bandwidth efficient)
If not, deduplication may be worthwhile.
 
Joined
Oct 22, 2019
Messages
3,641
How did you quote their post about the 1 MiB record size? I never read anything about what the record size was set to...
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
How did you quote their post about the 1 MiB record size? I never read anything about what the record size was set to...
See the post above mine.

The 1MB record size is probably too large to be maximally efficient for storage space.
 

K4M1L

Cadet
Joined
Jul 31, 2022
Messages
5
I'd suggest you do the calculations based on storage capacity needed.
If you do daily backups, do you have a solution to avoid storing the same file every day?
(for example rsync with daily snapshots is space and bandwidth efficient)
If not, deduplication may be worthwhile.
That's the issue. There is no feasible way for me to avoid storing the same file every day. Bandwidth is not a problem. Size is. Will dedup do the trick? In principle it sounds perfect: dedup plus inline compression. But it might not work out that way since it's block-level dedup. Are there any tools to calculate it or should I just try?
 

K4M1L

Cadet
Joined
Jul 31, 2022
Messages
5
IDK how to edit my post, sorry. I should've mentioned that the only limiting factors hardware-wise are drive capacity and the IOPS of spinning rust. The CPUs are powerful enough, and I have some memory to add if necessary.
 
Joined
Oct 22, 2019
Messages
3,641
The 1MB record size is probably too large to be maximally efficient for storage space.
Not to veer too far off topic, but if no modifications are being made to any file ("write once to archive and/or maybe read later"), then a 1 MiB recordsize policy won't have a negative impact on storage space efficiency. In fact, it can further help if there are compressible files.

Assuming ashift=12 (4 KiB blocks), on a dataset with a recordsize policy of 1 MiB:
  • a 4 KiB HTML file will only consume 4 KiB of space
  • a 128 KiB HTML file will only consume 128 KiB of space
  • a 512 KiB HTML file will only consume 512 KiB of space
  • a 1 MiB HTML file will consume the full 1 MiB

However, the above assumes no inline compression.

With the last example, the HTML file might only consume 768 KiB of space because the inline compression was able to squeeze the 256 blocks down to 192 blocks.

4 KiB x 256 = 1 MiB (uncompressed)

4 KiB x 192 = 768 KiB (compressed)

So depending on how the archives are being saved/written (assuming few to no modifications), the larger recordsize policy will yield better read performance (if needed later) and allow for more efficient inline compression.

Deduplication can be an extra layer on top of this.
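If you want to sanity-check those numbers yourself, here's a rough sketch, assuming a pool named "tank" (TrueNAS mounts it under /mnt/tank) and a throwaway dataset:

    # scratch dataset with a 1 MiB recordsize and default-level ZSTD
    zfs create -o recordsize=1M -o compression=zstd tank/space-test

    # copy in any sample file, flush writes, then compare logical size
    # against the space actually allocated on disk
    cp index.html /mnt/tank/space-test/
    sync
    ls -l /mnt/tank/space-test/index.html   # logical file size
    du -h /mnt/tank/space-test/index.html   # blocks allocated (after compression)

    zfs get compressratio tank/space-test
    zfs destroy tank/space-test             # clean up when finished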
 
Last edited:

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
That's the issue. There is no feasible way for me to avoid storing the same file every day. Bandwidth is not a problem. Size is. Will dedup do the trick? In principle it sounds perfect: dedup plus inline compression. But it might not work out that way since it's block-level dedup. Are there any tools to calculate it or should I just try?

Rsync with snapshots on the storage target is a more efficient solution for this problem.
Dedup will work, but will require RAM and perhaps a special SSD VDEV.

Perhaps drop down to a 128K record size so dedup has a better chance of matching blocks... unless all the files are well over 4 MB.
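If you do end up testing it, it's just a per-dataset property plus the smaller record size. A sketch with a placeholder dataset name; note that dedup only applies to data written after it's enabled:

    # enable dedup and drop the record size for finer-grained block matching
    zfs set dedup=on recordsize=128K tank/backups

    # pool-wide dedup ratio, once some backups have been written
    zpool get dedupratio tank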
 
Joined
Oct 22, 2019
Messages
3,641
See the post above mine.
Now I see it!

Previously it was empty, and clicking on your "quote" took me to a page that read something like "The requested post could not be found."
 
Joined
Oct 22, 2019
Messages
3,641
Are there any tools to calculate it or should I just try?
Do you have a spare pool to test this on? Or even perhaps create a "dummy" dataset for the sake of testing? (A "dummy" dataset such that you can safely destroy it once you're done playing around with it.)

I've never tried enabling and later disabling Deduplication, but I believe it can safely be disabled (assuming you're not using a special dedup vDev), or you can simply destroy the "dummy" dataset, which likewise will destroy the deduped records. (Again, no personal experience on this. Don't take my word for it.)
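There is also a way to get an estimate without enabling anything: zdb can simulate a dedup table over an existing pool, read-only. A sketch, assuming your pool is named "tank" (on TrueNAS you may need to point zdb at the system's zpool.cache with -U if it can't find the pool):

    # builds a simulated DDT and prints a histogram plus an estimated
    # dedup ratio at the end; read-only, but slow and I/O-heavy on spinning rust
    zdb -S tank

Keep in mind it walks the entire pool, so the estimate covers everything on it, not just one dataset.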



Rsync with snapshots on the storage target is a more efficient solution for this problem.
To expand on @morganL's suggestion, using rsync with the --inplace and --no-whole-file parameters will reduce the undesired effects of CoW if files are modified between runs. (By default, rsync writes a changed file out as a brand-new copy and renames it into place, which rewrites every block and defeats block sharing with earlier snapshots.) You'd then leverage ZFS snapshots to serve as the replacement for discrete .zip archive files.

So instead of the discrete files:
  • archive_2022-07-15-00-00.zip
  • archive_2022-08-01-00-00.zip
  • archive_2022-08-15-00-00.zip

You will instead have the snapshots:
  • @archive_2022-07-15-00-00
  • @archive_2022-08-01-00-00
  • @archive_2022-08-15-00-00

Essentially, any identical records across the snapshots will not take up extra space, which makes Deduplication a moot point and not really worth it for this use-case. And yet, you can treat/access the individual snapshots as if they are distinct read-only .zip archive files of what the website was in that moment in time. :cool:
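For example (paths and snapshot names are placeholders), every snapshot is browsable read-only through the hidden .zfs directory at the root of the dataset:

    # the site exactly as it looked at that backup, no restore step needed
    ls /mnt/tank/backups/example-site/.zfs/snapshot/archive_2022-07-15-00-00/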

It would be up to you to coordinate/automate (or do it manually) taking snapshots after a successful rsync run.
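Something along these lines, as a rough sketch only; the host, paths, and dataset names are placeholders you'd adapt, and you'd run it per site from a cron job:

    #!/bin/sh
    # pull one site, updating files in place so unchanged blocks keep
    # being shared with the blocks referenced by earlier snapshots
    rsync -a --delete --inplace --no-whole-file \
        webserver:/var/www/example-site/ /mnt/tank/backups/example-site/

    # only take the snapshot if the rsync run succeeded
    if [ $? -eq 0 ]; then
        zfs snapshot tank/backups/example-site@archive_$(date +%Y-%m-%d-%H-%M)
    fi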

This also assumes each website archive gets its own dataset, only for itself.

EDIT: About the last point ("This also assumes each website archive gets its own dataset, only for itself"): I re-read your initial post and you mention hundreds of websites, so separate datasets per website doesn't sound feasible anymore. Your snapshots are likely to be "staggered" in relation to the rsync runs, unless you're able to coordinate the rsync runs to complete within a reasonable timeframe, in which case you can "line up" snapshots to reflect successfully completed backups for all websites each time.

And if you do decide to use rsync, you might want to read through this thread, since rsync is very metadata heavy and you will likely see much speedier rsync runs by increasing how much of the ARC is reserved for metadata. The "solution" and further discussion (for those who might benefit from even higher values than the one I'm using) begins at post #9.

 
Last edited:

K4M1L

Cadet
Joined
Jul 31, 2022
Messages
5
[Attached screenshot: 1659466942554.png]

So I've tried these settings. Each dataset contains an identical website. Not a very good result.
 
Joined
Oct 22, 2019
Messages
3,641
So the dataset "test zip" I take it to be a single compressed .zip file of the entire website?

While "test dedup" is the same website, with everything saved using inline ZSTD-19 compression?

All else being equal, the larger size may in fact be due to many small files adding extra overhead with metadata.

However, each subsequent snapshot will take up only minimal extra space if you use this method:
To expand on @morganL's suggestion, using rsync with the --inplace and --no-whole-file parameters will reduce the undesired effects of CoW if files are modified between runs. (By default, rsync writes a changed file out as a brand-new copy and renames it into place, which rewrites every block and defeats block sharing with earlier snapshots.) You'd then leverage ZFS snapshots to serve as the replacement for discrete .zip archive files.

So instead of the discrete files:
  • archive_2022-07-15-00-00.zip
  • archive_2022-08-01-00-00.zip
  • archive_2022-08-15-00-00.zip

You will instead have the snapshots:
  • @archive_2022-07-15-00-00
  • @archive_2022-08-01-00-00
  • @archive_2022-08-15-00-00

Essentially, any identical records across the snapshots will not take up extra space, which makes Deduplication a moot point and not really worth it for this use-case. And yet, you can treat/access the individual snapshots as if they are distinct read-only .zip archive files of what the website was in that moment in time. :cool:

It would be up to you to coordinate/automate (or do it manually) taking snapshots after a successful rsync run.


Whereas, using distinct .zip archives for each additional backup will be much less efficient.


The only pragmatic problem I see is that this might become cumbersome if you try "one dataset for each website". It's still doable, and you can also have them all in a single dataset, but with staggered snapshots.

Deduplication isn't exclusive, one way or the other. But if you go the route of "snapshots instead of distinct .zip archives", you'll essentially be saving a lot of space going forwards, regardless.
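And to see what each additional snapshot actually costs, something like this (dataset name is a placeholder):

    # USED on a snapshot is the space held only by that snapshot, i.e. blocks
    # no longer referenced by the live data or any other snapshot
    zfs list -t snapshot -o name,used,referenced -r tank/backups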
 