ZFS de-Duplication - Or why you shouldn't use de-dup

The TrueNAS forums occasionally have people who come across ZFS de-duplication and want to investigate its use, or who think it is a good idea and want to implement it.

Here are some suggested configuration details:
  • Understand that you need CPU power to compute and compare ZFS block checksums for every write to a de-dup dataset or zVol. This also means writes are delayed until the de-dup comparison is complete. So, fewer but faster CPU cores can work better with de-dup than many slower ones.
  • In most cases, stronger checksum algorithms like sha256 or sha512 are better at avoiding checksum collisions (broken de-dup blocks). But those stronger algorithms take more CPU time to compute, and to verify on reads.
  • De-duplication is enabled per dataset, per zVol, or on an entire pool. Choose wisely, as some data you may want to store might not de-duplicate at all, wasting CPU time checking it for nothing. (See the example commands after this list.)
  • The working ZFS de-dup table is normally kept in memory. There are options to extend it onto different types of vDevs, like special vDevs or Cache / L2ARC vDevs. But they are not magic speed-up solutions.
  • And from the ZFS manual page zfsconcepts(7):
    It is generally recommended that you have at least 1.25 GiB of RAM per 1 TiB of storage when you enable deduplication. Calculating the exact requirement depends heavily on the type of data stored in the pool.
  • More often, we see the "5GB of RAM per 1TB" suggestion; note that figure assumes a dataset with an average record size of 64KB. If de-duplication is applied to an iSCSI zVol (with a default volblocksize of 16K), there are 4x as many blocks to track, resulting in the potential for 4x the memory usage, or "20GB of RAM per 1TB". (A worked estimate follows this list.)
  • Because de-dup is memory intensive, some people suggest that ECC memory is more important for this use.
  • In some cases, it is better to use a dedicated pool for de-dup than to share one pool between de-dupped datasets and ones without. Thus, if a de-dup related pool problem arises, it will not affect all your data.
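
As a concrete sketch of the commands involved, assuming for illustration a pool named "tank" and a dataset named "tank/dedup" (check the zfsprops(7) manual page for the exact property values your ZFS version supports):

    # Enable de-dup on one dataset, using sha256 checksums and byte-for-byte
    # verification of candidate blocks before they are de-duplicated:
    zfs set dedup=sha256,verify tank/dedup

    # Or enable it for every dataset in the pool (rarely a good idea):
    zfs set dedup=on tank

And a rough worked version of the memory estimates above, assuming the commonly cited figure of roughly 320 bytes of RAM per de-dup table entry:

    1TB of data / 64KB average record size = ~16 million blocks
    ~16 million entries x ~320 bytes each  = ~5GB of RAM
    16KB volblocksize = 4x the entries     = ~20GB of RAM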

Gotchas:
  • There have been cases where a TrueNAS server using the ZFS de-dup feature was rebooted, and then could not import the pool with the de-dupped dataset(s). This is due to lack of memory for the de-dup table. The only fix is to add more memory, either physically, or by freeing up memory from other processes.
  • As more and more data is added, the server can get slower. At some point, it may be too slow for practical usage, thus requiring a hardware upgrade with both a faster CPU and more memory.
  • The ZFS manual page for zfsprops actually warns against using de-dup:
    Unless necessary, deduplication should not be enabled on a system. See the Deduplication section of zfsconcepts(7).
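
If you want to gauge the impact before committing, ZFS can simulate de-dup on existing data. A sketch, again assuming a pool named "tank"; check the zdb(8) and zpool-status(8) manual pages for your version:

    # Build a throw-away de-dup table from existing data and report the
    # ratio de-dup would achieve. Reads the whole pool, so it can take hours:
    zdb -S tank

    # On a pool already using de-dup, show de-dup table statistics, including
    # the entry counts to multiply by the per-entry RAM cost:
    zpool status -D tank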

Alternatives:
  • Use snapshots and plain file backups. Then, when you update with a program like RSync, it will not copy over pre-existing files that are exactly the same. This is a poor man's de-duplication, but it requires far fewer resources. (See the RSync notes below.)
  • Use a dedicated TrueNAS server for your de-duplicated datasets. It can then be designed with ZFS de-duplication in mind, without worrying about other uses being slowed or impacted by the de-duplication writes.

File level de-duplication:

There are programs that will do file level de-duplication on any file system, ZFS not required. File level de-duplication is cheap and super-easy, without any significant hardware requirements. None of these programs are automatic; the user must either initiate the process manually, or have it run from cron, as in the example below.
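
For example, a crontab entry along these lines would run a weekly pass. The script path is hypothetical, standing in for whichever file level de-dup tool you choose:

    # Run a file level de-dup script every Sunday at 03:00:
    0 3 * * 0 /root/bin/dedup-archive.sh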

RSync:
RSync can make each backup in a different directory, while hard-linking unchanged files to a previous backup instead of copying them again. See the --link-dest option in the manual page for RSync if interested.
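
A minimal sketch of that style of backup, with made-up paths and dates. Files unchanged since the previous backup become hardlinks into it, rather than fresh copies:

    rsync -a --link-dest=/backup/2024-01-01 /source/data/ /backup/2024-01-02/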

dupmerge:
Phil Karn's dupmerge is another example (look for "KA9Q dupmerge" to find the source code).

The code removes duplicate files and replaces every copy other than the first with a hardlink back to the first. It also relies on you not changing the contents of the files afterward, since all the hardlinked names share one copy of the data, so it is mostly useful for archival file access.
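
As a rough sketch of the same idea on any file system, assuming GNU coreutils and an /archive directory to scan, the following lists groups of identical files. You could then replace the extras with hardlinks (ln) after verifying the matches:

    # sha256sum prints a 64 character hash before each file name, so sorting
    # and comparing only the first 64 characters groups identical files:
    find /archive -type f -exec sha256sum {} + | sort | uniq -w64 --all-repeated=separate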


tl;dnr
To sum it up, the average TrueNAS user should avoid using ZFS' de-duplication feature. Many home & small business users who don't have iXsystems support contracts may not have designed their hardware and hardware upgrade cycle with ZFS de-duplication in mind. Thus, it can, and likely will, bite them, potentially very hard.

Note the wording is designed to suggest you don't use ZFS de-duplication. However, you can do what you want and/or need. This Resource was just meant to assist with some of the how and why things work, so you would not be as surprised.



Do you have any:
  • Technical corrections?
  • Improvements?
  • Syntactical / spelling updates?