de-dup alternatives

Status
Not open for further replies.

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
tl;dr:
De-dup really is a specialized use case. If you think you need it, you don't. When you genuinely need it, you will know, and you will be able to implement it wisely.


There have been a few posts about using de-dup, so let's first get some things out of the way:
  • De-dup is a memory hog, as the de-dup table(s), (the DDT), must reside in memory. The more data de-dupped, the more memory needed. (A quick way to check the footprint is sketched just after this list.)
  • De-dup can be a CPU hog on writes, (every write has to be checked against the de-dup table for matches).
  • De-dup in general wants stronger checksum algorithms, (which tend to be slower), to prevent hash collisions.
  • De-dup can't be retroactively enabled, (and have all the existing data magically de-dupped). Data has to be written to a dedup-enabled dataset. Plus, ALL its supporting data, (the other data you want to de-dup against), has to be in the same dataset.
  • De-dup can't be retroactively disabled. All datasets that use de-dup would have to be copied to datasets without de-dup, and the source datasets then destroyed. Until then, the memory impact still exists.
  • De-dup is checksum dependent. Changing the checksum algorithm on an active de-dup dataset prevents de-dup from de-dupping new data against old data.
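For the curious, a minimal command sketch (pool and dataset names are placeholders): how de-dup gets switched on per dataset, and how to estimate what the DDT will cost you in RAM before you commit. The oft-quoted forum rule of thumb is roughly 5 GB of RAM per TB of de-dupped data, but it varies a lot with block size, so measure.

  # "tank/dedupset" is a placeholder dataset name
  zfs set dedup=on tank/dedupset   # de-dups NEW writes to this dataset only

  # Estimate / inspect the DDT footprint:
  zdb -S tank                      # simulate de-dup against existing data, prints a would-be DDT histogram
  zpool status -D tank             # summary of the real DDT once de-dup is in use
  zdb -DD tank                     # detailed DDT histogram, including on-disk and in-core entry sizes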
So, now that you understand that de-dup is not some magic fix for reducing storage use, you can simulate PART of the results of de-dup using snapshots. Here is how I do it for my client backups:
  1. Create a backup dataset
  2. Create a client dataset inside the backup dataset
  3. Use any file by file full backup tool to initially populate the client dataset
  4. Snapshot the client dataset, (I use the date, as in @20181002, for the name)
  5. All future backups use rsync to copy only the files that have changed, then snapshot again afterwards (see the sketch below)
Thus, I have perhaps 20 or 30 full backups that only take the space of the changes between each one. If I start running out of space, I can delete the oldest snapshots until I have enough. Rsync can be slower, but the space savings are worth it to me.
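As a rough sketch of the steps above (pool, dataset, and host names are placeholders, not a tested script):

  # 1-2. Create the backup dataset and a per-client dataset inside it
  zfs create tank/backups
  zfs create tank/backups/clientA

  # 3. Initial full copy with any file-by-file tool, e.g. rsync over SSH
  rsync -a clientA:/data/ /mnt/tank/backups/clientA/

  # 4. Snapshot it, named by date
  zfs snapshot tank/backups/clientA@20181002

  # 5. Each later run only transfers changed files, then snapshots again
  rsync -a --delete clientA:/data/ /mnt/tank/backups/clientA/
  zfs snapshot tank/backups/clientA@20181003

  # Running out of space? Drop the oldest snapshots first
  zfs destroy tank/backups/clientA@20181002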

NOTE: This does not do real de-dup. But it does save space on things that don't change.

Edit: Put the take-away at the top. Re-worded a bit for proper English syntax.
Edit: Per suggestion, added line item about not being able to disable de-dup easily.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Be Ye Fairly Warned: Badger nitpicks ahead.

De-dup is a memory hog, as the de-dup table(s) must reside in memory. The more data de-dupped, the more memory needed.
Technically, the DDT can reside on fast L2ARC devices. This still hurts performance, because even the fastest NVMe 3D XPoint DIMM Hyper Type-R storage has to go through an NVMe or block driver to query the portion(s) of the table that spill over, and updates to the DDT still have to be committed back to the pool vdevs. (Layer 2 nitpick: until we get metadata/DDT-only vdevs in OpenZFS. ZoL is experimenting with them in the latest RC, and Oracle ZFS already has them.)
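For reference, an L2ARC device is added like this (the device name is a placeholder, and whether the DDT actually spills onto it depends on ARC pressure and tuning):

  # nvd0 is a placeholder for a fast NVMe device on FreeBSD/FreeNAS
  zpool add tank cache nvd0     # add an L2ARC device; DDT entries evicted from ARC can be cached here
  zpool iostat -v tank 5        # watch how much of the cache device is actually in use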

De-dup in general wants stronger checksum algorithms, (which tend to be slower), to prevent hash collisions.
SHA512, Skein, and the ZoL-only Edon-R actually perform faster than SHA256 on modern hardware. fletcher4 shouldn't be used with dedup since there's the (small but possible) chance of hash collision. (And if your workload is bottlenecking on CSUM, maybe upgrade from that Pentium 4 you're using?)
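If your platform supports them, both the checksum and the dedup properties accept the newer algorithms; a minimal sketch (dataset name is a placeholder):

  zfs set checksum=skein tank/dedupset      # stronger than sha256 and usually faster on modern CPUs
  zfs set dedup=skein,verify tank/dedupset  # de-dup keyed on skein, with a byte-for-byte verify on matches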

Plus, ALL [its] supporting data, (the other data you want to de-dup against), has to be in the same dataset.
Dedup is tunable per dataset/zvol, but the DDT hashing is pool-wide, unless something's changed. Going to see if I can test this right now.
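A quick way to see both sides of that, assuming a pool named tank:

  zfs get -r dedup tank        # the dedup property is set (or inherited) per dataset/zvol
  zpool get dedupratio tank    # but the achieved ratio is reported pool-wide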

None of this detracts from the overall message of this thread, which is "You probably shouldn't use ZFS deduplication, manage your data instead for better results."
 
Joined
Dec 29, 2014
Messages
1,135
None of this detracts from the overall message of this thread, which is "You probably shouldn't use ZFS deduplication, manage your data instead for better results."

I forget who it was, but someone on the forums mentioned that they build new pools with max compression enabled. After the initial data transfer is complete, they set the compression back to the normal level. That way static/archival data gets max compression (which is probably as good as or better than de-dup), but new writes don't take a huge CPU hit.
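A sketch of that trick (dataset name is a placeholder). It works because changing the compression property only affects data written afterwards; the already-loaded blocks keep their gzip-9 compression:

  zfs set compression=gzip-9 tank/archive   # maximum compression for the bulk load
  # ... copy the archival data in ...
  zfs set compression=lz4 tank/archive      # new writes go back to cheap, fast lz4
  zfs get compressratio tank/archive        # see what the bulk load actually achieved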
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I forget who it was, but someone on the forums mentioned that they build new pools with max compression enabled. After the initial data transfer is complete, they set the compression back to the normal level. That way static/archival data gets max compression (which is probably as good as or better than de-dup), but new writes don't take a huge CPU hit.

I believe that's @MatthewSteinhoff - bulk-loading the "archival" tier of data with gzip9, and then switching back to lz4?
 
Joined
Feb 2, 2016
Messages
574
De-dup can't be retroactively enabled,

I think the more important note is that de-dupe can't be retroactively disabled. I'd be willing to put up with horrible performance on an initial data load or migration if it cleaned up all my duplicate files. But it's my understanding that de-dupe can't be disabled once the data is loaded and will always be sucking down huge hunks of RAM.

I believe that's @MatthewSteinhoff - bulk-loading the "archival" tier of data with gzip9, and then switching back to lz4?

That's right. Load data with gzip9 to get the best compression, then change to lz4 for better live performance.

If de-dupe worked the way compression worked, I'd use de-dupe for loading, too.

Cheers,
Matt
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
de-dupe can't be retroactively disabled.
... and when, one day, the DDT no longer fits in RAM, you can't even import the pool anymore, IIRC (unless you add more RAM)

Sent from my mobile phone
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
But it's my understanding that de-dupe can't be disabled once the data is loaded and will always be sucking down huge hunks of RAM.

Correct. The only way to get rid of it is to migrate the data to another dataset with dedup off and destroy the old one.
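A minimal sketch of that migration (names are placeholders). A plain zfs send doesn't carry dataset properties, so the received copy is written according to the destination's own dedup setting; once the old dataset is destroyed, its DDT entries go with it:

  zfs snapshot tank/deduped@migrate
  zfs send tank/deduped@migrate | zfs recv tank/plain   # data is rewritten on receive, without de-dup
  zfs get dedup tank/plain                              # confirm it really is off; zfs set dedup=off if not
  # verify the copy, then destroy the old dataset to free its DDT entries
  zfs destroy -r tank/deduped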
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Per the comments above from @MatthewSteinhoff & @HoneyBadger, I edited the original post to add a line item about disabling...

Gee, with a bit more editing, and removing my alternative to de-dup, we could make a de-dup sticky. That way we can point people to it when they think they know better.
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
and removing my alternative
I wouldn't remove it. I used to be stuck on the idea of de-duplication until I learned nicer ways to avoid it. But maybe that's just me :)

a disabling line item..
Will you emphasize that a day may come when the DDT no longer fits in RAM and performance drops rapidly (or even dies)?

Sent from my mobile phone
 

garm

Wizard
Joined
Aug 19, 2017
Messages
1,556
Another point that could be covered by this thread is snapshot cloning. A lot of ill-advised dedup setups (where the table would grow indefinitely) can be solved by cloning instead.
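A quick sketch of the idea, assuming a "golden" dataset named tank/template (placeholder): clones share every unchanged block with the snapshot they were created from, so you get de-dup-like space savings without a DDT.

  zfs snapshot tank/template@golden
  zfs clone tank/template@golden tank/vm1        # vm1 starts out sharing all blocks with @golden
  zfs clone tank/template@golden tank/vm2        # ten clones of a 20 GB image cost ~20 GB, not 200 GB
  zfs list -r -o name,used,refer tank            # "used" on each clone only grows as it diverges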
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
Some other ways:
  1. Rsync + hard-link incremental backups (so not ZFS-only), easy to find via a quick STFW (see the sketch below)
  2. Any client-side incremental backup tool (so not *nix-only)
  3. Any client-side de-duplication
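A minimal sketch of option 1 (paths and dates are placeholders): unchanged files in the new run become hard links to the previous run, so every dated directory looks like a full backup but only costs the space of what changed.

  # previous backup lives in /backups/clientA/2018-10-02
  rsync -a --delete \
      --link-dest=/backups/clientA/2018-10-02 \
      clientA:/data/ /backups/clientA/2018-10-03/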
Sent from my mobile phone
 