de-dup alternatives

Status
Not open for further replies.

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
tl;dr:
De-dup really is a specialized use case. If you think you need it, you don't. When you genuinely need it, you will know, and you will be able to implement it wisely.


There have been a few posts about using de-dup, so let's first get some things out of the way:
  • De-dup is a memory hog, as the de-dup table(s), (the DDT), must reside in memory. The more data de-dupped, the more memory needed. (A quick way to check the footprint is sketched just after this list.)
  • De-dup can be a CPU hog on writes, (every write has to be checked against the de-dup table for matches).
  • De-dup in general wants stronger checksum algorithms, (which tend to be slower), to prevent hash collisions.
  • De-dup can't be retroactively enabled, (and have all the existing data magically de-dupped). Data has to be written to a dedup-enabled dataset. Plus, ALL its supporting data, (the other data you want to de-dup against), has to be in the same dataset.
  • De-dup can't be retroactively disabled. All datasets that use de-dup would have to be copied to datasets without de-dup, and the source datasets then destroyed. Until then, the memory impact still exists.
  • De-dup is checksum dependent. Changing the checksum algorithm on an active de-dup dataset prevents de-dup from de-dupping new data against old data.
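For the curious, a minimal command sketch (pool and dataset names are placeholders): how de-dup gets switched on per dataset, and how to estimate what the DDT will cost you in RAM before you commit. The oft-quoted forum rule of thumb is roughly 5 GB of RAM per TB of de-dupped data, but it varies a lot with block size, so measure.

  # "tank/dedupset" is a placeholder dataset name
  zfs set dedup=on tank/dedupset   # de-dups NEW writes to this dataset only

  # Estimate / inspect the DDT footprint:
  zdb -S tank                      # simulate de-dup against existing data, prints a would-be DDT histogram
  zpool status -D tank             # summary of the real DDT once de-dup is in use
  zdb -DD tank                     # detailed DDT histogram, including on-disk and in-core entry sizes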
So, now that you understand that de-dup is not some magic fix for reducing storage use, you can simulate PART of the results of de-dup using snapshots. Here is how I do it for my client backups:
  1. Create a backup dataset
  2. Create a client dataset inside the backup dataset
  3. Use any file by file full backup tool to initially populate the client dataset
  4. Snapshot the client dataset, (I use the date, as in @20181002, for the name)
  5. All future backups use rsync to copy only the files that have changed, then snapshot again afterwards (see the sketch below)
Thus, I have perhaps 20 or 30 full backups that only take the space of the changes between each one. If I start running out of space, I can delete the oldest snapshots until I have enough. Rsync can be slower, but the space savings are worth it to me.
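As a rough sketch of the steps above (pool, dataset, and host names are placeholders, not a tested script):

  # 1-2. Create the backup dataset and a per-client dataset inside it
  zfs create tank/backups
  zfs create tank/backups/clientA

  # 3. Initial full copy with any file-by-file tool, e.g. rsync over SSH
  rsync -a clientA:/data/ /mnt/tank/backups/clientA/

  # 4. Snapshot it, named by date
  zfs snapshot tank/backups/clientA@20181002

  # 5. Each later run only transfers changed files, then snapshots again
  rsync -a --delete clientA:/data/ /mnt/tank/backups/clientA/
  zfs snapshot tank/backups/clientA@20181003

  # Running out of space? Drop the oldest snapshots first
  zfs destroy tank/backups/clientA@20181002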

NOTE: This does not do real de-dup. But it does save space on things that don't change.

Edit: Put the take-away at the top. Re-worded a bit for proper English syntax.
Edit: Per suggestion, added line item about not being able to disable de-dup easily.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Be Ye Fairly Warned: Badger nitpicks ahead.

De-dup is a memory hog, as the de-dup table(s) must reside in memory. The more data de-dupped, the more memory needed.
Technically, the DDT can reside on fast L2ARC devices. This still hurts performance, because even the fastest NVMe 3D XPoint DIMM Hyper Type-R storage has to go through an NVMe or block driver to query the portion(s) of the table that spill over, and updates to the DDT still have to be committed back to the pool vdevs. (Layer 2 nitpick: until we get metadata/DDT-only vdevs in OpenZFS. ZoL is experimenting with them in the latest RC, and Oracle ZFS already has them.)
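For reference, an L2ARC device is added like this (the device name is a placeholder, and whether the DDT actually spills onto it depends on ARC pressure and tuning):

  # nvd0 is a placeholder for a fast NVMe device on FreeBSD/FreeNAS
  zpool add tank cache nvd0     # add an L2ARC device; DDT entries evicted from ARC can be cached here
  zpool iostat -v tank 5        # watch how much of the cache device is actually in use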

De-dup in general wants stronger checksum algorithms, (which tend to be slower), to prevent hash collisions.
SHA512, Skein, and the ZoL-only Edon-R actually perform faster than SHA256 on modern hardware. fletcher4 shouldn't be used with dedup since there's the (small but possible) chance of hash collision. (And if your workload is bottlenecking on CSUM, maybe upgrade from that Pentium 4 you're using?)
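If your platform supports them, both the checksum and the dedup properties accept the newer algorithms; a minimal sketch (dataset name is a placeholder):

  zfs set checksum=skein tank/dedupset      # stronger than sha256 and usually faster on modern CPUs
  zfs set dedup=skein,verify tank/dedupset  # de-dup keyed on skein, with a byte-for-byte verify on matches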

Plus, ALL [its] supporting data, (the other data you want to de-dup against), has to be in the same dataset.
Dedup is tunable per dataset/zvol, but the DDT hashing is pool-wide, unless something's changed. Going to see if I can test this right now.
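A quick way to see both sides of that, assuming a pool named tank:

  zfs get -r dedup tank        # the dedup property is set (or inherited) per dataset/zvol
  zpool get dedupratio tank    # but the achieved ratio is reported pool-wide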

None of this detracts from the overall message of this thread, which is "You probably shouldn't use ZFS deduplication, manage your data instead for better results."
 
Joined
Dec 29, 2014
Messages
1,135
None of this detracts from the overall message of this thread, which is "You probably shouldn't use ZFS deduplication, manage your data instead for better results."

I forget who it was, but someone on the forums mentioned that they build new pools with max compression enabled. After the initial data transfer is complete, they set the compression back to the normal level. That way static/archival data gets max compression (which is probably as good as or better than de-dup), but new writes don't take a huge CPU hit.
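A sketch of that trick (dataset name is a placeholder). It works because changing the compression property only affects data written afterwards; the already-loaded blocks keep their gzip-9 compression:

  zfs set compression=gzip-9 tank/archive   # maximum compression for the bulk load
  # ... copy the archival data in ...
  zfs set compression=lz4 tank/archive      # new writes go back to cheap, fast lz4
  zfs get compressratio tank/archive        # see what the bulk load actually achieved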
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I forget who it was, but someone on the forums mentioned that they build new pools with max compression enabled. After the initial data transfer is complete, they set the compression back to the normal level. That way static/archival data gets max compression (which is probably as good as or better than de-dup), but new writes don't take a huge CPU hit.

I believe that's @MatthewSteinhoff - bulk-loading the "archival" tier of data with gzip9, and then switching back to lz4?
 
Joined
Feb 2, 2016
Messages
574
De-dup can't be retroactively enabled,

I think the more important note is that de-dupe can't be retroactively disabled. I'd be willing to put up with horrible performance on an initial data load or migration if it cleaned up all my duplicate files. But it's my understanding that de-dupe can't be disabled once the data is loaded and will always be sucking down huge hunks of RAM.

I believe that's @MatthewSteinhoff - bulk-loading the "archival" tier of data with gzip9, and then switching back to lz4?

That's right. Load data with gzip9 to get the best compression, then change to lz4 for better live performance.

If de-dupe worked the way compression worked, I'd use de-dupe for loading, too.

Cheers,
Matt
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
de-dupe can't be retroactively disabled.
... and when, one day, the DDT no longer fits in RAM, you can't even import the pool anymore, IIRC (unless you add more RAM)

Sent from my mobile phone
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
But it's my understanding that de-dupe can't be disabled once the data is loaded and will always be sucking down huge hunks of RAM.

Correct. The only way to get rid of it is to migrate the data to another dataset with dedup off and destroy the old one.
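A minimal sketch of that migration (names are placeholders). A plain zfs send doesn't carry dataset properties, so the received copy is written according to the destination's own dedup setting; once the old dataset is destroyed, its DDT entries go with it:

  zfs snapshot tank/deduped@migrate
  zfs send tank/deduped@migrate | zfs recv tank/plain   # data is rewritten on receive, without de-dup
  zfs get dedup tank/plain                              # confirm it really is off; zfs set dedup=off if not
  # verify the copy, then destroy the old dataset to free its DDT entries
  zfs destroy -r tank/deduped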
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Per the comments above from @MatthewSteinhoff & @HoneyBadger, I edited the original post to add a line item about disabling...

Gee, with a bit more editing, and removing my alternative to de-dup, we could make a de-dup sticky. That way we can point people to it when they think they know better.
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
and removing my alternative
I wouldn't remove it. I used to be stuck on the idea of de-duplication until I learned nicer ways to avoid it. But maybe that's just me :)

a disabling line item..
Will you emphasize that a day may come when the DDT no longer fits in RAM and performance drops rapidly (or even dies)?

Sent from my mobile phone
 

garm

Wizard
Joined
Aug 19, 2017
Messages
1,556
Another point that could be covered by this thread is snapshot cloning. A lot of ill-advised dedup setups (where the table would grow indefinitely) can be solved by cloning instead.
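A quick sketch of the idea, assuming a "golden" dataset named tank/template (placeholder): clones share every unchanged block with the snapshot they were created from, so you get de-dup-like space savings without a DDT.

  zfs snapshot tank/template@golden
  zfs clone tank/template@golden tank/vm1        # vm1 starts out sharing all blocks with @golden
  zfs clone tank/template@golden tank/vm2        # ten clones of a 20 GB image cost ~20 GB, not 200 GB
  zfs list -r -o name,used,refer tank            # "used" on each clone only grows as it diverges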
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
Some other ways:
  1. Rsync + hard-link incremental backups (so not ZFS-only), easy to find via a quick STFW (see the sketch below)
  2. Any client-side incremental backup tool (so not *nix-only)
  3. Any client-side de-duplication
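A minimal sketch of option 1 (paths and dates are placeholders): unchanged files in the new run become hard links to the previous run, so every dated directory looks like a full backup but only costs the space of what changed.

  # previous backup lives in /backups/clientA/2018-10-02
  rsync -a --delete \
      --link-dest=/backups/clientA/2018-10-02 \
      clientA:/data/ /backups/clientA/2018-10-03/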
Sent from my mobile phone
 