
ZFS de-Duplication - Or why you shouldn't use de-dup

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Arwen submitted a new resource:

ZFS de-Duplication - Or why you shouldn't use de-dup - ZFS de-duplication essentials

The TrueNAS forums occasionally have people who come across ZFS de-duplication and want to investigate its use, or who think it is a good idea and want to implement it.

Here are some suggested configuration details:
  • Understand that you need CPU power to compare ZFS blocks for every write to a de-dup dataset or zVol. This also means writes are delayed until the de-dup comparison is complete. So faster CPU cores can work better with de-dup than more cores.
  • In most cases...

Read more about this resource...
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The "1.25GB for 1TB" ratio often vastly underrepresents the requirements of deduplication.

More often we see the "5GB per 1TB" suggestion, and notably that applies to a dataset with an average record size of 64KB. If deduplication is applied to an iSCSI ZVOL (with its default volblocksize of 16K) the result is potentially 4x the memory usage, or "20GB per 1TB".
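
As a back-of-the-envelope check on those figures, here is a small Python sketch. It assumes the commonly quoted estimate of roughly 320 bytes of RAM per unique block in the dedup table, which is an approximation rather than an exact ZFS constant:

[CODE=python]
# Rough DDT memory estimate: number of unique blocks multiplied by an assumed
# ~320 bytes of RAM per dedup-table entry (a commonly quoted approximation).

def ddt_ram_gib(data_tib: float, block_kib: int, bytes_per_entry: int = 320) -> float:
    """Estimate dedup-table RAM in GiB for data_tib TiB of unique data."""
    blocks = data_tib * 2**40 / (block_kib * 2**10)
    return blocks * bytes_per_entry / 2**30

print(ddt_ram_gib(1, 64))   # ~5 GiB per TiB at 64 KiB records
print(ddt_ram_gib(1, 16))   # ~20 GiB per TiB at 16 KiB volblocksize
[/CODE]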

I would suggest leveraging and/or including a pointer to the resource from @Stilez on their experimentation with, and the hardware required to get, performant results (it took a dedup vdev on Optane devices) as a reference for deduplication demanding significant hardware.

 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Forum moderators, can we get the discussion moved out of:
Forums -> TrueNAS -> FreeNAS (Legacy Software Releases) -> FreeNAS Help & support -> General Questions and Help
This is really not a "FreeNAS (Legacy Software Releases)" resource.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The "1.25GB for 1TB" ratio often vastly underrepresents the requirements of deduplication.

More often, we see the "5GB per 1TB" suggestion and that notably also applies to a dataset with an average record size of 64KB. If deduplication is applied to an iSCSI ZVOL (with a default volblocksize of 16K) this will result in the potential for 4x the memory usage, or "20GB per 1TB"

I would suggest leveraging and/or including a point to the resource from @Stilez on their experimentations with, and hardware requirements in order to get performant results (it required a dedup vdev using Optane devices) as a reference for deduplication requiring significant hardware.

I've added both points: I copied your memory suggestion, and put the @Stilez resource in a suggested reading section.

Keep the suggestions coming. We want something straightforward that new users who want to use de-dup can be pointed to, so that at least they are informed.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Arwen said:
Forum moderators, can we get the discussion moved out of:
Forums -> TrueNAS -> FreeNAS (Legacy Software Releases) -> FreeNAS Help & support -> General Questions and Help
This is really not a "FreeNAS (Legacy Software Releases)" resource.

Moved to Operation and Performance, which seems like the most accurate place for the discussion thread.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
HoneyBadger said:
Moved to Operation and Performance, which seems like the most accurate place for the discussion thread.
Thank you.

We should probably move some more of the Resource discussion threads out of the legacy FreeNAS sub-forums, as the implication is that the resource would then only apply to legacy FreeNAS.
 

Okeur75

Dabbler
Joined
Nov 16, 2022
Messages
36
Thank you for the resource.
I would add, based on my personal experience, that if you really want to test de-dup, you should do it on a dedicated pool (even though it can be enabled on a particular dataset).
From what I experienced, an issue on a deduplication-enabled dataset can prevent you from mounting the whole pool (see my thread here).

Also, since dedup relies even more heavily on memory than ZFS already does, maybe emphasize the need for ECC RAM (already strongly recommended for ZFS, but even more so if you use dedup).

And for "spelling update" you missed a word here : when you a program like RSync
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Okeur75 said:
Thank you for the resource.
I would add, based on my personal experience, that if you really want to test de-dup, you should do it on a dedicated pool (even though it can be enabled on a particular dataset).
From what I experienced, an issue on a deduplication-enabled dataset can prevent you from mounting the whole pool (see my thread here).

Also, since dedup relies even more heavily on memory than ZFS already does, maybe emphasize the need for ECC RAM (already strongly recommended for ZFS, but even more so if you use dedup).

And as a spelling update, you missed a word here: "when you a program like RSync".
Thank you.

Added this section:
  • In some cases, it is better to use a dedicated pool for de-dup than to share one pool between de-dupped datasets and datasets without de-dup. That way, if pool problems arise, they will not affect all your data.

I've added a suggestion for the ECC RAM:
  • Because de-dup is memory intensive, some people suggest that ECC memory is even more important for this use.

I've fixed the wording for:
when you update with a program like RSync
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Something occurred to me.

Is the ZFS de-dup checksum table sorted?
If not, would sorting the de-dup checksum table improve performance?

I mean, imagine that you have ten 256-bit checksums you have to compare against your new one. If they are not in sorted order, you have to check every single one against your new one. BUT, if they are in sorted order like this:

New checksum - 5678...

Checksum table:
1234...
2345...
3456...
4567...
6789...
7890...
8901...
9012..
9999...

As you can see, you only have to compare the first 64-bit word of the first five checksums before you find that your new one is unique. The same would apply to some degree if there was a match, because you avoid all the unnecessary compares.

Plus, some search techniques for sorted data would apply, like starting the compare in the middle, then checking the middle entry of the upper half or lower half, and continuing until you hit either a match or a unique result.

Now obviously the de-dup table would need to be a sorted linked list, so that you can add a new entry anywhere in the middle, not just at the ends.
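
For illustration only, here is a minimal Python sketch of the sorted-table idea above, using made-up checksum strings and a binary search; it is not a claim about how ZFS actually stores or searches its dedup table:

[CODE=python]
import bisect

# Hypothetical, already-sorted checksum table (short hex strings here for
# readability; real dedup checksums such as SHA-256 are 256-bit values).
table = ["1234...", "2345...", "3456...", "4567...",
         "6789...", "7890...", "8901...", "9012...", "9999..."]

def is_duplicate(checksum: str) -> bool:
    """Binary search: roughly log2(n) compares instead of checking every entry."""
    i = bisect.bisect_left(table, checksum)
    return i < len(table) and table[i] == checksum

def add_checksum(checksum: str) -> None:
    """Keep the table sorted as new, unique checksums arrive."""
    bisect.insort(table, checksum)

print(is_duplicate("5678..."))   # False: the new checksum is unique
[/CODE]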

So, did my half-awake brain figure out how ZFS de-dup works?
Or did I come up with a huge optimization?

I searched the web, but could not find an answer (for my half-awake brain...).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It might be worthwhile to go into file-level deduplication strategies a little bit more.

File-level deduplication: This is cheap and super-easy, without any significant requirements. I like Phil Karn's dupmerge (look for "KA9Q dupmerge" to find the source code).

The code removes duplicate files and replaces copies other than the first with a hardlink back to the first. However, it isn't automatic; you have to run something like a script to cause it to do the deduplication. It also relies on you not changing the contents of the files, so it is mostly useful for archival file access.
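
Not dupmerge itself, but a minimal Python sketch of the same file-level idea (hash whole files, then replace later duplicate copies with hardlinks to the first); it is illustrative only and, like dupmerge, only safe for files that will not be modified afterwards:

[CODE=python]
import hashlib
import os
import sys

def dedup_tree(root: str) -> None:
    """Replace duplicate regular files under root with hardlinks to the first copy seen."""
    seen = {}                                # content hash -> first path seen
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:      # hash the whole file contents
                digest = hashlib.sha256(f.read()).hexdigest()
            first = seen.setdefault(digest, path)
            if first != path and not os.path.samefile(first, path):
                os.unlink(path)              # drop the duplicate...
                os.link(first, path)         # ...and hardlink it to the original
                                             # (assumes one filesystem; hardlinks
                                             # cannot cross filesystems)

if __name__ == "__main__":
    dedup_tree(sys.argv[1])
[/CODE]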
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
jgreco said:
It might be worthwhile to go into file-level deduplication strategies a little bit more.

File-level deduplication: This is cheap and super-easy, without any significant requirements. I like Phil Karn's dupmerge (look for "KA9Q dupmerge" to find the source code).

The code removes duplicate files and replaces copies other than the first with a hardlink back to the first. However, it isn't automatic; you have to run something like a script to cause it to do the deduplication. It also relies on you not changing the contents of the files, so it is mostly useful for archival file access.
Done. I made a separate section, and included RSync, which can do something similar.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Arwen said:
Done. I made a separate section, and included RSync, which can do something similar.

Excellent. You rock. You do a very nice job of writing on your resources; I'm just a little jealous. :smile:
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
jgreco said:
Excellent. You rock. You do a very nice job of writing on your resources; I'm just a little jealous. :smile:
I am a professional writer after all, and even have several published works!
 