Sizing a dedicated dedup vdev?

ajgnet

Explorer
Joined
Jun 16, 2020
Messages
65
I have ~60TB of photos & videos (with about 25% duplicates) stored in different folders from different users. Our machine has 256 GB RAM. I was curious about the option in TrueNAS 12 to specify a dedicated dedup vdev, as I'm thinking this could free up RAM.

Could I theoretically take an extra SSD and allocate this as a dedicated dedup vdev? Would I need redundancy for this, like a mirror?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hey @ajgnet,

as I'm thinking this could free up RAM.

Quite the opposite! Dedup is a huge resource hog and will consume a lot of RAM.

The best thing would be to remove your duplicates yourself. You can use a hash function to identify them, then either delete the extra copies or replace them with links.
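
For example, here is a minimal sketch with standard tools (the /mnt/tank/media path is just a placeholder; sha256sum assumes GNU coreutils, and on FreeBSD/TrueNAS CORE "sha256 -r" should give similar "hash path" output):

# Hash every file, then print the groups of files that share a checksum
find /mnt/tank/media -type f -exec sha256sum {} + | sort > /tmp/hashes.txt
awk '{print $1}' /tmp/hashes.txt | uniq -d > /tmp/dup-hashes.txt
grep -F -f /tmp/dup-hashes.txt /tmp/hashes.txt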

Another option is to put the common data in one dataset and then clone that dataset, since clones share their blocks with the original until the files are changed.
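
Roughly like this (hypothetical dataset names; the clones consume almost no extra space until someone modifies a file):

# One master dataset, snapshotted once, then cloned per user
zfs snapshot tank/media@master
zfs clone tank/media@master tank/users/alice-media
zfs clone tank/media@master tank/users/bob-media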

There are many ways to reduce the space consumed by duplicated data, but FreeNAS's ZFS-level dedup is almost never a good one.

Maybe @HoneyBadger has more input for you... He has written a lot about dedup in general.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Does a dedup vdev not move the deduplication tables from RAM to a vdev?

Nope.

vDevs are building blocks used to create a pool. The pool itself is then managed by ZFS.

Dedup is done at the ZFS level, not at the vDev level.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Short answer: Don't do it.

Does a dedup vdev not move the deduplication tables from RAM to a vdev?

No, it moves them from the general pool devices to a vdev. RAM is volatile and anything updated there needs to be committed to disk, whether it's general capacity vdevs or dedicated DDT vdevs.

The DDT gets cached into RAM for performance reasons, which is why running out of RAM and having your DDT reads hitting spinning disks is so punishing. It's not so much "saves RAM" but rather "makes the consequences of running out less catastrophic." Having to do a bunch of random reads against an SSD that isn't doing anything other than handling DDT lookups/updates is far better than having to interleave those random reads with your regular data I/O against the capacity vdevs; but both are still much, much slower than going through RAM.

Regarding redundancy, "special" vdevs of any type in OpenZFS are considered root vdevs of the pool - losing them results in either inaccessible data (for things like a small_blocks vdev) or a total pool failure (metadata/ddt) so you should be looking at two-way mirrors at a minimum; three-way mirror wouldn't be unreasonable.
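
For reference, adding a mirrored dedup vdev to an existing pool looks roughly like this (pool and device names are placeholders):

# Two-way mirrored dedup vdev; add a third device for a three-way mirror
zpool add YourPoolNameHere dedup mirror /dev/ada1 /dev/ada2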

For your setup, at an average indexing cost of 5G/1T (assuming 64K records) your 60T of data is going to be looking for 300G of RAM just to hold DDTs. (That's not accounting for RAM needed for other metadata or actual ARC for MRU/MFU.) Even if you devoted 3/4 of your RAM (192G) to hold DDTs, you'd still have 108G of potential lookups that would hit the SSD rather than RAM. If you're unlucky and your recordsize is smaller, you could use significantly more RAM - halving your average recordsize doubles your RAM footprint.
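
The back-of-the-envelope version of that math (the 5G-per-1T figure is a rule of thumb at 64K records, not an exact number):

# Rough DDT sizing: ~5 GB of table per TB of pool data at 64K average recordsize
DATA_TB=60; GB_PER_TB=5; RAM_FOR_DDT_GB=192
echo "Estimated DDT size: $((DATA_TB * GB_PER_TB)) GB"                    # 300 GB
echo "Spill past RAM:     $((DATA_TB * GB_PER_TB - RAM_FOR_DDT_GB)) GB"   # 108 GB
# Halving the average recordsize (64K -> 32K) roughly doubles the estimate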

Could it be tolerable if you use an SSD as a dedup vdev? Possibly. But you're going to lose a lot of performance for only 25% space savings - not just from giving up 192G or so of your potential ARC to do housekeeping, but from needing to scan through a table of that size to check for duplicate hits when you flush transaction groups to disk. The worst-case scenario is writing unique data (which is 75% of your data): you check the whole table, get no results, and then you have to write the new data to the pool, the metadata for the new data, and a new record to the DDT (so you can dedup against it in the future).

You can run a simulation on the pool (warning - this will generate a decent chunk of I/O) by using the following command:

zdb -U /data/zfs/zpool.cache -S YourPoolNameHere

You'll get a simulated deduplication run and a summary line at the bottom showing the estimated savings.

@Heracles has it correct in that the best way is to identify the duplicates "above" ZFS at the file level. Hash the files or use some other means to identify and tag the duplicates.

With multiple users, some duplication is inevitable, but I'm curious what got you to the point of having ~15T of your 60T of data be fully redundant. If there's a "shared photo/video repository", you could set each user up with read-only access to it and encourage them to simply open the files from there when they need to review them. If it's a read/write collaborative effort, then grant r/w access to shared folders based on department or role, and encourage users to keep their home folder/mapped drive for truly "100% personal" work.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thanks so much for your clear answer. I learned a lot.
Happy to share the knowledge. The switch to TrueNAS and OpenZFS is going to buck a lot of the old "rules of thumb" for sure, so I'm hoping to be able to build some resources for this in advance of its release.
 

ajgnet

Explorer
Joined
Jun 16, 2020
Messages
65
I see in the man page for zpool "...by default this includes all metadata, the indirect blocks of user data, and any deduplication tables. The class can also be provisioned to accept small file blocks."

Does this mean that if I add a "special" vdev to my pool, the DDT tables are stored there by default? Or would I need an additional dedicated dedup vdev?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Or would I need an additional dedicated dedup vdev?

I think that the point is more that you should not do dedup at all...
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Does this mean that if I add a "special" vdev to my pool, the DDT tables are stored there by default? Or would I need an additional dedicated dedup vdev?
Correct - they will be stored on the "special" vdev by default, but if you add a separate vdev of type "dedup" they will go there instead. Overflow of a special vdev spills back to the main pool. I'm not sure how it would work if you added a generic "special" first, turned on dedup, and then added "dedup" - but I assume it wouldn't migrate the DDTs, and you'd end up with your DDT spanning both the "special" and "dedup" vdevs until all records on the former were updated.

I'm a big fan of special for meta, but it doesn't mean dedup is viable yet. ;)
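
A sketch of that setup, with placeholder pool/device names (special_small_blocks is optional and defaults to 0, i.e. off):

# Mirrored "special" vdev for metadata
zpool add YourPoolNameHere special mirror /dev/nvd0 /dev/nvd1
# Optionally let small file blocks (e.g. 32K and under) land on the special vdev too
zfs set special_small_blocks=32K YourPoolNameHere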
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Expect zero migration of existing, written (meta)data, as that would require BPR (block pointer rewrite). The same goes for the overflow scenario: even if you increase the amount of available metadata storage, new blocks will probably use it, but the ones that overflowed will stay overflowed.
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
Short answer: Don't do it.



GREAT advice!!

So, after some begging, a friend said he was going to help me set up a FreeNAS build -- then he stopped helping (after I'd already ordered the equipment, etc.) because he finally got a girlfriend. As a consequence, I screwed up in EXACTLY the way you're discussing above: I enabled dedup willy-nilly, because FreeNAS 10 (I think) had no warnings about the performance impact, and there were so many config choices that looking up each one would've taken an eternity.

Anyway, now I have probably 30GB of my data deduped, and even disabling dedup (as you know) doesn't stop the pain.
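
(If I understand right, something like this should show what's still sitting in that table -- pool name is a placeholder:)

zpool status -D YourPoolNameHere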


Thing is -- my ARC (RAM) is 48GB ... and it wasn't even using half of it (according to the overview -- but maybe that's misleading).

I'd be ELATED to throw an array of NVMe SSDs at this, and probably 128GB of RAM (if not more), at LEAST to get my data out of jail! We're talking 1MB/s -- if not kB/s -- to read any of the data that was written while dedup was enabled!

I also have a terrible CPU in this machine -- because, you know, QNAPs use hamsters for their compute power! How was I supposed to know!?

The 4x Dell T320s have the E5-2403 v2, a 1.8GHz 4-core (no Hyper-Threading, no Turbo), lol.

I'm GOING to replace them ... unless it's not even going to help at all.

I was going to get either the E5-2450 v2 or the E5-2470 v2:

E5-2450 v2: 2.5GHz (3.3GHz turbo), 8c/16t (which costs all of about $55)

... or, if more cores would be better than an extra 100MHz (and worth paying twice as much for them):

E5-2470 v2: 2.4GHz (3.2GHz turbo), 10c/20t


I also have a Radian RMS-200/8G, and I know that both of those things will do VERY LITTLE, if ANYTHING, to mitigate my issue ...

I'm hoping I can find an Optane drive to hold my DDT (dedup table -- spelled out not for you, but for others reading). I might be able to swing a pair of 905Ps, which I could use as mirrored special vdevs dedicated to small (4K) blocks after my dedup cleanup project...?
 