Deduplication questions

pedz

Dabbler
Joined
Jan 29, 2022
Messages
35
I have the Mini X+ with 32G of memory. The user interface for deduplication really seems to discourage using it because it is “memory intensive,” so I chose not to enable it.

I am currently going through a car load (literally) of old disk drives, sending their contents to the NAS to store. I am sure there is a ton of duplicate files. What I’ve done in the past is write a program to find duplicate files. One feature it had is that I could easily find the full path to all of the duplicates, so if I found a file that I no longer wanted, I could delete it along with all of its duplicates and thus free up the space.

The first question is, is 32G enough memory to safely run the deduplication process? The NAS currently has 40TB of raidz2 storage but it has no VMs or other load.

The other question is, does the built in deduplication system offer a way to tell the user where the duplicates are so I can delete those too? I get the idea that it finds duplicate blocks, not duplicate files but I’m only guessing about that.

The last question is about building my own system to find duplicates. This is a BSD system, and while I’ve not looked around, I assume I can find packages to add for a simple database, etc. If nothing else, I can build a VM and have it be the place that runs the app I create. It is relatively simple (but slow) to find duplicate files; a rough sketch of what I have in mind is below.
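Something like this, hashing everything and grouping files by digest (the root path is just an example, and on 40TB it would obviously take a long time):

Code:
#!/usr/bin/env python3
# Brute-force duplicate finder: hash every file and group by digest.
# Simple but slow. The root path below is only an example.
import hashlib
import os
from collections import defaultdict

ROOT = "/mnt/tank/old-drives"  # example path, adjust to the real dataset

def sha256_of(path, bufsize=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

groups = defaultdict(list)
for dirpath, _dirs, names in os.walk(ROOT):
    for name in names:
        full = os.path.join(dirpath, name)
        try:
            groups[sha256_of(full)].append(full)
        except OSError:
            pass  # unreadable file, skip it

# Print every group of identical files with their full paths,
# so any copy (or all of them) can be deleted by hand.
for digest, paths in groups.items():
    if len(paths) > 1:
        print(digest)
        for p in paths:
            print("    " + p)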
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
My feeling is "don't" - but do read this post. If you enable it with the hardware you have, the chance of getting badly burnt on performance is quite high.
 

pedz

Dabbler
Joined
Jan 29, 2022
Messages
35
Off topic -- I just got a "Keeps coming back" award. Giggle... sorta like a bad odor?!?!? :eek:
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The first question is, is 32G enough memory to safely run the deduplication process?
No. Furthermore, deduplication runs inline and will impact newly written data only - it won't do anything to sniff out duplicate files in your existing data. It's also very memory- and I/O-intensive, as shown in the resource from @Stilez linked by @c77dk above.

The other question is, does the built in deduplication system offer a way to tell the user where the duplicates are so I can delete those too? I get the idea that it finds duplicate blocks, not duplicate files but I’m only guessing about that.
Nope. You're accurate that it only works at the "block level" (at the record level, if we want to be exact here), so your final question leads you to the solution:

The last question is about building my own system to find duplicates. This is a BSD system, and while I’ve not looked around, I assume I can find packages to add for a simple database, etc. If nothing else, I can build a VM and have it be the place that runs the app I create. It is relatively simple (but slow) to find duplicate files.
TrueNAS is a BSD system (or Linux, if you happen to be running SCALE) but should be thought of as more of an appliance - build whatever duplicate-hunting solution you want to use in a jail or VM and run it there to avoid causing lasting impact on the FreeNAS/TrueNAS OS itself. If you have a sufficiently fast network pipe, running it on a separate machine entirely will also keep it from doing too much "writing back" to the array while it builds its index.

I'd suggest something as simple as a system that compares based on extension and file size first, possibly with a secondary match on "last modified time," before going on to anything more computationally intensive like a pure hashing match. From your array's perspective it would look like someone doing a recursive ls -al, which would be a long-running job but not a particularly heavy one. Dump the results into a DB, have it look for cases where extensions and file sizes match, and see what you capture with a first pass.
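A very rough sketch of that first pass, assuming Python and SQLite are available wherever you end up running it (the paths and table layout here are just examples, not anything TrueNAS ships with):

Code:
#!/usr/bin/env python3
# Metadata-first pass: walk the share, record extension + size + mtime
# into SQLite, then report the groups that collide.
import os
import sqlite3

ROOT = "/mnt/tank/incoming"   # hypothetical dataset mount point
DB = "dupes.sqlite"

conn = sqlite3.connect(DB)
conn.execute("""CREATE TABLE IF NOT EXISTS files
                (path TEXT PRIMARY KEY, ext TEXT, size INTEGER, mtime INTEGER)""")

for dirpath, _dirs, names in os.walk(ROOT):
    for name in names:
        full = os.path.join(dirpath, name)
        try:
            st = os.stat(full)
        except OSError:
            continue  # broken link, permissions, etc.
        ext = os.path.splitext(name)[1].lower()
        conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                     (full, ext, st.st_size, int(st.st_mtime)))
conn.commit()

# First pass: extension + size collisions are the only candidates worth
# hashing later.
query = """SELECT ext, size, COUNT(*) AS n FROM files
           GROUP BY ext, size HAVING n > 1 ORDER BY size DESC"""
for ext, size, count in conn.execute(query):
    print(f"{count} candidates: ext={ext or '(none)'} size={size}")

Only the groups that survive that query would need an actual hash comparison on a second pass.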
 