Deduplication performance impact under low load

EnKo

Dabbler
Joined
Jan 9, 2022
Messages
32
By now it is clear that deduplication is a contentious feature. The question is: what is the impact of deduplication on a system under low load?

In fact, I have one write user and maybe up to 10 read users at the same time on a dataset. The system is not sufficient for the amount of stored data, but the tasks are not user-critical (streaming). Do you have any references regarding the user experience?

Moreover, what is the impact on user-critical tasks that use non-deduplicated datasets?
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
Did you search for comments about deduplication? For one thing, it requires massive amounts of system memory. For another, it will take a huge toll on write performance.

As for your question, it is not possible to comment without a description of your system per the Forum Rules posted at the top of each page. However, if your system is close to the minimum specs for TrueNAS, then it will not be sufficient for running de-dup tasks.
 

EnKo

Dabbler
Joined
Jan 9, 2022
Messages
32
I intended this question to be more general and not specific to my case, since I was hoping for a more subjective answer based on your experiences. In my current configuration the system (small enterprise grade) runs with practically no load. But considering the recommendation of 3 to 5 GB of RAM per TB of deduplicated data, and a continuously growing set of data, there will be a point where the RAM is too small. So I am expecting a situation where the system is forced to work from swap, which will be the bottleneck, since the CPU will remain sufficient.

In my specific case I am using a 4-core 3.50 GHz CPU, 64 GB of RAM and HDDs (no SSD). I expect the amount of deduplicated data (original size) to be between 10 and 20 TB within the next few years. As long as the write rate does not drop to less than 25 % of the current rate and other datasets are not affected too much, I can live with it.
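Rough math on the rule of thumb mentioned above, applied to my own numbers (10 to 20 TB of data, 64 GB of RAM); this is only the 3 to 5 GB per TB guideline, not a measurement:

# Back-of-the-envelope check of the "3 to 5 GB RAM per TB deduplicated" rule
# applied to my expected data. Pure arithmetic, no ZFS involved.

ram_total_gb = 64
for data_tb in (10, 20):                  # expected deduplicated data (original size)
    for gb_per_tb in (3, 5):              # quoted rule of thumb
        need = data_tb * gb_per_tb
        print(f"{data_tb} TB at {gb_per_tb} GB/TB -> ~{need} GB for dedup, "
              f"{ram_total_gb - need} GB left for everything else")

At the upper end (20 TB at 5 GB per TB) that is already about 100 GB, more than the whole 64 GB, so the question is when, not if, the table stops fitting.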
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
expecting a situation where the system is forced to work from swap,

That is not what happens with ARC contents. They are simply evicted, and this then requires the re-fetching of the DDT data. When this happens at scale, it is like hitting a brick wall. You are suddenly thrashing through ARC metadata for every block. It is not fun, it is super-ugly, you will want to commit seppuku.
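To put a very rough number on why that hurts: one DDT entry per unique block has to be consulted on every write. The ~320 bytes per in-core entry below is just a commonly quoted ballpark and the 128 KiB block size is an assumption; your mileage will vary.

# Sketch: how many DDT entries 10 TiB of data implies, and roughly how much
# RAM they want. One entry per unique block; figures are ballpark only.

def ddt_estimate(data_bytes, block_bytes=128 * 1024, bytes_per_entry=320):
    entries = data_bytes // block_bytes
    return entries, entries * bytes_per_entry

entries, ram = ddt_estimate(10 * 1024**4)          # 10 TiB of 128 KiB blocks
print(f"~{entries:,} entries, ~{ram / 1024**3:.0f} GiB just for the table")

Every single one of those entries gets looked up on write; if it is not sitting in ARC, it has to come off the pool first, and that is the brick wall.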

The system is not sufficient for the amount of stored data, but the tasks are not user-critical (streaming).

Streaming ... what, exactly? Only certain types of data are amenable to significant benefits from deduplication. Storing VM images or uncompressed backups is an example. If you are streaming video and you are hoping that

/mnt/yourpool/Video/Incoming/somemovie.mp4
/mnt/yourpool/Video/Movies/somemovie.mp4

happen to deduplicate because they're the same file contents, well, yes, that should deduplicate, but the better solution is to remove one of them, or use hardlinks, so that the filenames point to the same file data. Hardlink-based "deduplication" is virtually free in the UNIX environment, you just have to analyze your filesystem, which burns up some I/O, but the clever implementations work on file size and only then check file contents, hardlinking them if they are the same. Us old-timers like Phil Karn's dupmerge tool, but several other more recent options exist. You can search for "dupmerge" on the forum to find some other threads where people chime in with alternative tools. This is not quite as awesome as dedup, but it comes without the terrible ARC requirements.
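If you want to see the shape of the approach, here is a minimal sketch of the size-first, contents-second idea in Python. It is an illustration of the technique, not Phil Karn's actual tool, so treat the behavior as approximate and try it on scratch data first:

# Group files by size, hash only files that share a size, then hardlink
# exact duplicates so all names point at the same on-disk data.
import hashlib
import os
from collections import defaultdict

def merge_duplicates(root):
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue                          # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
        for dupes in by_hash.values():
            keeper, *others = dupes
            for other in others:
                os.unlink(other)              # swap the copy for a hardlink
                os.link(keeper, other)

Remember that hardlinks only work within a single filesystem/dataset, and that all the names end up pointing at the same writable data, so edit one and you have edited them all.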

other datasets are not affected too much, I can live with it.

The I/O load is placed on the pool as a whole, so, other datasets can definitely be affected.
 

EnKo

Dabbler
Joined
Jan 9, 2022
Messages
32
Thank you very much, this answers my question perfectly. Since I have the problem of similar, but not identical, files, I am not sure I would benefit much from file analysis tools. But I was already looking into this and I will have a look at the tools you mentioned.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Dedup would only be able to deduplicate where blocks of data are exactly identical.
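A toy example of what "exactly identical" means at the block level, since you mentioned similar-but-not-identical files; made-up data, and assuming the default 128 KiB recordsize:

# Even a one-byte insertion at the front of a file shifts every later block,
# so none of the fixed-boundary blocks match the original any more.
RECORD = 128 * 1024                           # assumed default recordsize

original = bytes(range(256)) * 4096           # 1 MiB of sample data
edited = b"\x00" + original                   # same content, shifted by one byte

def blocks(data, size=RECORD):
    return [data[i:i + size] for i in range(0, len(data), size)]

matching = sum(a == b for a, b in zip(blocks(original), blocks(edited)))
print(f"identical blocks: {matching} of {len(blocks(original))}")   # 0 of 8

So two files that are merely similar will usually share nothing at all as far as the DDT is concerned.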
 