Are there any tools to calculate it or should I just try?
Do you have a spare pool to test this on? Or even perhaps create a "dummy" dataset for the sake of testing? (A "dummy" dataset such that you can safely destroy it once you're done playing around with it.)
I've never tried enabling and later disabling Deduplication, but I believe it can safely be disabled (assuming you're not using a special dedup vDev), or you can simply destroy the "dummy" dataset, which likewise will destroy the deduped records. (Again, no personal experience on this. Don't take my word for it.)
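If you want to try it out, here's a hedged sketch of that "dummy" dataset experiment (pool/dataset names are placeholders for your own):

```shell
# Create a throwaway dataset with dedup enabled, play with it,
# then destroy it; destroying it also frees its deduped records.
zfs create -o dedup=on tank/dedup-test

# ...copy some test data in, then inspect the dedup table:
zpool status -D tank

# To estimate the dedup ratio WITHOUT enabling dedup at all,
# zdb can simulate the dedup table for an existing pool:
# zdb -S tank

# Done experimenting? Remove the dataset and its DDT entries:
zfs destroy -r tank/dedup-test
```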
Rsync with snapshots on the storage target is a more efficient solution for this problem.
To extend on @morganL's suggestion: using rsync with the --inplace and --no-whole-file parameters will reduce the undesired effects of CoW if files are modified between runs. (By default, rsync writes changes to a temporary copy of the file and renames it into place, which rewrites every record on ZFS; these flags make it update only the changed regions of the existing file.) You'd then leverage ZFS snapshots as the replacement for discrete .zip archive files.
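As a sketch (hostname and paths are hypothetical), a run against the backing dataset might look like:

```shell
# Hypothetical example: mirror one site into its directory on the pool.
# --archive      preserves permissions, times, symlinks, etc.
# --inplace      rewrites changed regions of existing files instead of
#                creating a whole new copy, so unchanged records stay
#                shared with older snapshots under CoW.
# --no-whole-file forces the delta algorithm even when rsync would
#                otherwise copy whole files (e.g. local transfers).
rsync --archive --inplace --no-whole-file --delete \
    webserver:/var/www/example.org/ /mnt/tank/archives/example.org/
```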
So instead of the discrete files:
- archive_2022-07-15-00-00.zip
- archive_2022-08-01-00-00.zip
- archive_2022-08-15-00-00.zip
You will instead have the snapshots:
- @archive_2022-07-15-00-00
- @archive_2022-08-01-00-00
- @archive_2022-08-15-00-00
Essentially, any identical records across the snapshots will not take up extra space, which makes Deduplication a moot point and not really worth it for this use-case. And yet, you can treat/access the individual snapshots
as if they were distinct read-only .zip archive files of what the website was at that moment in time.
It would be up to you to coordinate/automate (or do it manually) taking snapshots after a successful rsync run.
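One way to sketch that coordination (dataset name and paths are assumptions, not a recommendation):

```shell
#!/bin/sh
# Sketch: only take the snapshot when the rsync run succeeded, so that
# every snapshot represents a complete, consistent copy of the site.
SRC="webserver:/var/www/example.org/"
DST="/mnt/tank/archives/example.org/"
DATASET="tank/archives/example.org"

if rsync --archive --inplace --no-whole-file --delete "$SRC" "$DST"; then
    zfs snapshot "${DATASET}@archive_$(date +%Y-%m-%d-%H-%M)"
fi
```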
This also assumes each website archive gets its own dataset, only for itself.
EDIT: About the last point of "This also assumes each website archive gets its own dataset, only for itself." I re-read your initial post and you mention hundreds of websites. So separate datasets per website doesn't sound feasible anymore. Your snapshots are likely to be "staggered" in relation to the rsync runs, unless you're able to coordinate the rsync runs to complete within a reasonable timeframe, in which case you can "line up" snapshots to reflect successfully completed backups for all websites each time.
And if you do decide to use rsync, you might want to read through this thread, since rsync is very metadata heavy, and you will likely benefit from much speedier rsync runs by increasing how much of the ARC is reserved for metadata. The "solution" and further discussion (for those who might benefit from higher values than the one I'm using) begins at post #9 and beyond.
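For reference, and with the caveat that this is version-dependent: on Linux-based OpenZFS before 2.2, the tunable usually discussed for this was the zfs_arc_meta_limit_percent module parameter (OpenZFS 2.2 reworked ARC metadata handling around zfs_arc_meta_balance instead). Treat the value below as an illustration, not a recommendation:

```shell
# Sketch (Linux OpenZFS < 2.2; the value 90 is illustrative only):
# allow metadata to occupy a larger share of the ARC.
echo 90 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent

# To persist across reboots, set it as a module option instead, e.g.:
# echo "options zfs zfs_arc_meta_limit_percent=90" >> /etc/modprobe.d/zfs.conf
```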