Deduplication questions

pedz

Dabbler
Joined
Jan 29, 2022
Messages
35
I have the Mini X+ with 32G of memory. The user interface for deduplication really seems to discourage using it because it is “memory intensive,” so I chose not to enable it.

I am currently going through a car load (literally) of old disk drives, sending their contents to the NAS to store. I am sure there is a ton of duplicate files. What I’ve done in the past is write a program to find duplicate files. One feature it had is that I could easily find the full path to all of the duplicates, so if I found a file that I no longer wanted, I could delete it along with all of its duplicates and thus free up the space.

The first question is, is 32G enough memory to safely run the deduplication process? The NAS currently has 40TB of raidz2 storage but it has no VMs or other load.

The other question is, does the built in deduplication system offer a way to tell the user where the duplicates are so I can delete those too? I get the idea that it finds duplicate blocks, not duplicate files but I’m only guessing about that.

The last question is about building my own system to find duplicates. This is a BSD system, and while I’ve not looked around, I assume I can find packages to add for a simple database, etc. If nothing else, I can build a VM and have it be the place that runs the app I create. It is relatively simple (but slow) to find duplicate files; a rough sketch of what I have in mind is below.
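Something like this, hashing everything and grouping files by digest (the root path is just an example, and on 40TB it would obviously take a long time):

Code:
#!/usr/bin/env python3
# Brute-force duplicate finder: hash every file and group by digest.
# Simple but slow. The root path below is only an example.
import hashlib
import os
from collections import defaultdict

ROOT = "/mnt/tank/old-drives"  # example path, adjust to the real dataset

def sha256_of(path, bufsize=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

groups = defaultdict(list)
for dirpath, _dirs, names in os.walk(ROOT):
    for name in names:
        full = os.path.join(dirpath, name)
        try:
            groups[sha256_of(full)].append(full)
        except OSError:
            pass  # unreadable file, skip it

# Print every group of identical files with their full paths,
# so any copy (or all of them) can be deleted by hand.
for digest, paths in groups.items():
    if len(paths) > 1:
        print(digest)
        for p in paths:
            print("    " + p)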
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
My feeling is "don't" - but do read this post. If you enable it with the hardware you have, the chance of getting badly burnt on performance is quite high.
 

pedz

Dabbler
Joined
Jan 29, 2022
Messages
35
Off topic -- I just got a "Keeps coming back" award. Giggle... sorta like a bad odor?!?!? :eek:
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The first question is, is 32G enough memory to safely run the deduplication process?
No. Furthermore, deduplication runs inline and will impact newly written data only - it won't do anything to sniff out duplicate files in your existing data. It's also very memory- and I/O-intensive, as shown in the resource from @Stilez linked by @c77dk above.

The other question is, does the built in deduplication system offer a way to tell the user where the duplicates are so I can delete those too? I get the idea that it finds duplicate blocks, not duplicate files but I’m only guessing about that.
Nope. You're accurate that it only works at the "block level" (at the record level, if we want to be exact here), so your final question leads you to the solution:

The last question is about building my own system to find duplicates. This is a BSD system, and while I’ve not looked around, I assume I can find packages to add for a simple database, etc. If nothing else, I can build a VM and have it be the place that runs the app I create. It is relatively simple (but slow) to find duplicate files.
TrueNAS is a BSD system (or Linux, if you happen to be running SCALE) but should be thought of as more of an appliance - build whatever duplicate-hunting solution you want to use in a jail or VM and run it there to avoid causing lasting impact on the FreeNAS/TrueNAS OS itself. If you have a sufficiently fast network pipe, running it on a separate machine entirely will also keep it from doing too much "writing back" to the array while it builds its index.

I'd suggest something as simple as a system that compares based on extension and file size first, possibly with a secondary match on "last modified time," before going on to anything more computationally intensive like a pure hashing match. From your array's perspective it would look like someone doing a recursive ls -al, which would be a long-running job but not a particularly heavy one. Dump the results into a DB, have it look for cases where extensions and file sizes match, and see what you capture with a first pass.
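A very rough sketch of that first pass, assuming Python and SQLite are available wherever you end up running it (the paths and table layout here are just examples, not anything TrueNAS ships with):

Code:
#!/usr/bin/env python3
# Metadata-first pass: walk the share, record extension + size + mtime
# into SQLite, then report the groups that collide.
import os
import sqlite3

ROOT = "/mnt/tank/incoming"   # hypothetical dataset mount point
DB = "dupes.sqlite"

conn = sqlite3.connect(DB)
conn.execute("""CREATE TABLE IF NOT EXISTS files
                (path TEXT PRIMARY KEY, ext TEXT, size INTEGER, mtime INTEGER)""")

for dirpath, _dirs, names in os.walk(ROOT):
    for name in names:
        full = os.path.join(dirpath, name)
        try:
            st = os.stat(full)
        except OSError:
            continue  # broken link, permissions, etc.
        ext = os.path.splitext(name)[1].lower()
        conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                     (full, ext, st.st_size, int(st.st_mtime)))
conn.commit()

# First pass: extension + size collisions are the only candidates worth
# hashing later.
query = """SELECT ext, size, COUNT(*) AS n FROM files
           GROUP BY ext, size HAVING n > 1 ORDER BY size DESC"""
for ext, size, count in conn.execute(query):
    print(f"{count} candidates: ext={ext or '(none)'} size={size}")

Only the groups that survive that query would need an actual hash comparison on a second pass.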
 