Deduplication with big record size

MatMK · Aug 2, 2023

Hello. I am not sure if this is the right place for the thread, but couldn't find better category.

I am a beginner with both TrueNAS and ZFS. As probably a lot of people before me, I am toying with the idea of de-duplication. And as a lot of people before me, I am getting recommended from everywhere to stay away from it.

However, I fail to understand one thing. When learning about the topic, I stumbled upon an interesting reddit comment, where the person argues that de-duplication does not require crazy hardware, if configured correctly. Specifically, that the default record size should be increased. From my (limited) knowledge, it seems like the logic is correct (my explanation later). If true, I keep wondering why this isn't more often recommended on this forum.

I tried to explain this to myself as follows: (Please take this with a grain of salt - I am not presenting this as a correct information and I fully expect being corrected if I am wrong):

The dedup table takes 320B for every block it has to housekeep. For a block size of 128 KiB, that results in 2.5 GiB per TiB of data.

The problem is, that would work if all blocks were exactly 128 KiB (and each would be unique).

What causes it to increase:
If we theoretically had sizes of all the files divisible by the block size, there wouldn't be any increase. But in the real world, that is hardly the case. Because ZFS sets only the maximum record size and leaves it variable, there will be a lot of smaller records, which will also have to be managed with the same fixed overhead as whole full records. This increases the dedup table size, by a percentage that hugely depends on how big files you keep in your system. For a lot of small files, there will be a lot of small records, and the table size will rise quickly. For big files, it will rise only marginally (just for the record that holds the last bytes that did not fit).

What causes it to decrease:
The table has to store a new entry only for each unique record. If there is a lot of duplicates (as should be on a dedup-enabled system), the same records will be found as duplicates of already existing ones, therefore not saved in the table, and the table size will not increase. Let's say our data is on the system exactly twice. This would reduce the RAM requirement by half.

The "rule of thumb(s)" found on the internet recommend from 1 - 5 GiB of RAM per TiB of data. This would align with my understanding of the topic that I mentioned above. 1 GiB would be for use cases where the dedup ratio is 2.5x, whereas 5 GiB tries to be on the safe side by taking small files into account.

This all I would understand.

But now let's say that we would increase the block size from 128 KiB to 1 MiB.
This has both downsides and upsides.

Downsides:
The de-duplication does not work as efficiently as it could in terms of saved space. There will be less duplicates found, because even for a change of one bit, the whole record has to be stored separately. Let's say I would have two identical giant text files the size of 1 MiB. Because they are duplicate, they take only 1 MiB of storage. Now I change one character in one of the files. If the block size would be 128 KiB, only 1/8 of the file would need to be separated, making the total storage used 1.1 MiB. But with the record size of 1 MiB, the whole record would have to be copied over, and no space would be saved. The files are now taking 2 MiB.

But...

Upsides:
The dedup table takes a fixed amount of 320B for every block, no matter if the block is 128 KiB or 1 MiB. This means, that for perfectly filled blocks, the dedup table overhead decreases 8 times as well! Let's say I have 4 MiB file. If the record size is 128 KiB, the table contains 32 entries just for this file. With the record size of 1 MiB however, it only has to store 4.
When recalculating the formula, this would mean that the RAM size required is only 320 MiB per TiB! This no longer sounds that unachievable.
Both the small files problem and the reduction by duplicates still apply here of course. The small files are especially problematic for my calculation here. Given that the record size is so big, there will be a lot more of them by definition and could increase our table size significantly. Still, it should be a lot less than when used the default record size, that is if your data is not filled with only small files.
Another upside of this is 8x less hashes for the CPU. The chunks the CPU would have to hash are bigger mind you, but there is not the overhead of initializing each record and especially writing it out.

My question is:
Is there some hidden problem that I didn't account for in my calculations? Why is nobody talking about this? Sure, you have less saved space. But if your duplicate data are mostly large, would there be a problem? RAM requirements would be a lot more feasible this way, and this was always the main/only argument against it.

I get that for some use cases, the duplicate records that need to be accounted for are only a small chunks of otherwise different files. There, increasing the record size would make the dedup practically useless as it would be unable to find the similarities and would not save any space.
Also, not every dataset is fit for such big record size, as I understand. Database files or similar that need to be frequrently rewritten in small parts would suffer in performance, as the storage always has to read/write the whole 1 MiB.

But for a use case where our duplicates have large chunks of data in common or are even the same exact files (let's say, assets for a program, manual copy-backups of files, ISOs or same video files and photos), the increase in block size would not lower de-duplication efficiency, and would in fact increase the performance because of all the resources saved. CPU wouldn't have to work that hard, not so many accesses to the table and hashing. Not mentioning of course the massive RAM usage decrease, which would in turn increase performance as it could be used for ZFS.

I think that for home users for example, the second use case is lot more fitting, and could be done with reasonably equipped systems (which can't be said for the default record size, as everyone rightly suggests).

Thank you for reading this and even more so if you decide to explain or discuss it with me below.

Arwen · Aug 3, 2023

Good review of your options. Can't really make suggestions.

I did write up something that can help new users with understanding ZFS de-dup;

Resource - ZFS de-Duplication - Or why you shouldn't use de-dup

The TrueNAS forums occasionally have people who come across ZFS de-duplicate, and want to investigate its use. Or think it is a good idea, and want to implement it. Here are some suggested configuration details: Understand that you need CPU...

www.truenas.com

It is meant to discourage casual users of TrueNAS from using ZFS de-dup. But, in your case it might have some useful information. If you find something to be added, removed or updated, simply post to the Resources' discussion thread, not here.

sretalla · Aug 3, 2023

Just keep in mind that although you may be able to skimp on hardware and still do dedup, the penalty for underestimating the requirement even a little, is an almost completely unusable system (which requires hardware upgrades just to continue working).

Worst case would be some kind of data/pool loss as a result (ZFS can get unpredictable when starved of resources), so be prepared with appropriate backups if you will play with it and care about your data.

Ericloewe · Aug 3, 2023

MatMK said:
However, I fail to understand one thing. When learning about the topic, I stumbled upon an interesting reddit comment, where the person argues that de-duplication does not require crazy hardware, if configured correctly. Specifically, that the default record size should be increased. From my (limited) knowledge, it seems like the logic is correct (my explanation later). If true, I keep wondering why this isn't more often recommended on this forum.

It's a "Technically yes, but no" sort of situation.

Yes, if you're exclusively storing large blocks and you're using recordsize=1M, you do literally reduce the size of the DDT by 8 times. That's not a super realistic scenario and if it were realistic, it would still be better to handle dedup at the application layer (i.e. run a dedup tool to find duplicates and do something about it).
People typically want dedup thinking of VMs, where n VMs end up sharing a substantial part of their disk images - but that usage is as far from large blocks as you can get and 16K might be a reasonable recordsize, at which point you're looking at an increase by 8 times of the size of the DDT, relative to 128K.

MatMK said:
My question is:
Is there some hidden problem that I didn't account for in my calculations? Why is nobody talking about this? Sure, you have less saved space. But if your duplicate data are mostly large, would there be a problem? RAM requirements would be a lot more feasible this way, and this was always the main/only argument against it.

Dedup is also computationally slow, beyond any considerations of large tables. You need cryptographic hashes for all blocks, which is inherently slower than the default fletcher checksum. On top of that, the code itself is not super optimal and the abstractions are inherently slowish. It's workable for lower-performance and large block sizes, but then you're back at "why put up with it if you can just run a tool to replace duplicates with symlinks?" with no real upsides besides it being automatic and transparent if done by ZFS.

MatMK said:
But for a use case where our duplicates have large chunks of data in common or are even the same exact files (let's say, assets for a program, manual copy-backups of files, ISOs or same video files and photos), the increase in block size would not lower de-duplication efficiency, and would in fact increase the performance because of all the resources saved. CPU wouldn't have to work that hard, not so many accesses to the table and hashing. Not mentioning of course the massive RAM usage decrease, which would in turn increase performance as it could be used for ZFS.

Conceptually yes, but the devil is in the details. If you're doing it file-by-file, you're suddenly dealing with mountains of small files, probably. If you're tarballing things, you need to ensure that they're the same on all backups. Backed up a home directory into a multi-GB .tar.gz and then the user edited a byte? Congratulations, the whole archive is now probably different and doesn't dedup.

MatMK · Aug 3, 2023

Ericloewe said:
Yes, if you're exclusively storing large blocks and you're using recordsize=1M, you do literally reduce the size of the DDT by 8 times. That's not a super realistic scenario

I know, that's what I mentioned too. You can never get even the large blocks to fit perfectly, so reducing by 8 times is out of the question. But the reduction of, let's say, 5 times? That could be realistic, and still a great improvement in terms of memory usage that makes de-duplication feasible on many more computers.

Ericloewe said:
if it were realistic, it would still be better to handle dedup at the application layer (i.e. run a dedup tool to find duplicates and do something about it).

Ericloewe said:
It's workable for lower-performance and large block sizes, but then you're back at "why put up with it if you can just run a tool to replace duplicates with symlinks?" with no real upsides besides it being automatic and transparent if done by ZFS.

But the upsides of it being transparent are huge. As far as I know, there is no real alternative on the application layer. Sure, you can replace files with symlinks, but then you edit one file and the rest follows (not useful for backups, terrible if you forget about it). What would really be useful would be a tool that makes use of the CoW mechanism of the filesystem, and merges the duplicates only internally, transparently. Something like the functionality of snapshots, but done on existing files. It could even work offline, not on the fly. If someone knows about such a thing, please let me know, would love to use it.

Ericloewe said:
Dedup is also computationally slow, beyond any considerations of large tables.

Would it be a problem on modern hardware? If speed of let's say 1 or 2.5 Gbit would be enough, couldn't basically any modern CPU handle it?

Ericloewe said:
On top of that, the code itself is not super optimal and the abstractions are inherently slowish.

If so, couldn't it be improved in future releases? I doubt that it is by design.

Ericloewe said:
Backed up a home directory into a multi-GB .tar.gz and then the user edited a byte? Congratulations, the whole archive is now probably different and doesn't dedup.

You are right of course, this would work only on identical files. Already compressed (or encrypted) files would not count as duplicates, if changed only partially. But that is the inherent problem with de-duplication, not tied to a big record size. Worst case? The file would be saved separately, as if you didn't enable de-duplication at all.

I am only saying, if the cost of it being enabled would be low enough that you couldn't tell a difference (like with the bigger record size), there is no real downside of using it. It's like the situation with ZFS compression, that is almost always recommended to be used, as far as I know. You have a file that is uncompressable? In worst case, it just doesn't save any space, as if it would be disabled.

Etorix · Aug 3, 2023

MatMK said:
I know, that's what I mentioned too. You can never get even the large blocks to fit perfectly, so reducing by 8 times is out of the question. But the reduction of, let's say, 5 times? That could be realistic, and still a great improvement in terms of memory usage that makes de-duplication feasible on many more computers.

For your calculations, you really should assume "5 GB RAM per deduped TB", not "1-5 GB"—the lower end here is either unrealistic or already assumes the largest possible record size.
So, in the best case, you may end up with 1 GB/TB with 1 M records (and lower dedup efficiency…). This comes on top of the recommended minimum of 16 GB RAM to run TrueNAS (with 1 GbE networking…). Current HDDs are well over 10 TB, the sweet spot for TB/$ should be around 16-18 TB HDDs. So even the smallest 2-way mirror (not really safe or recommended at these sizes) would push minimal RAM to the 32 GB mark (if the calculations were right, and I certainly do not assume so!). Any raidz2 pushes the minimum RAM to at least 64 GB. "Feasible on many more computers", really?

The numbers just do not work the way you'd want.

MatMK said:
If so, couldn't it be improved in future releases? I doubt that it is by design.

It could probably be improved, but dedup being of little practical use, optimising the dedup code has to be low priority. And code optimisation would NOT come with a decrease in memory requirements: These ARE by design.

Ericloewe · Aug 3, 2023

MatMK said:
Would it be a problem on modern hardware? If speed of let's say 1 or 2.5 Gbit would be enough, couldn't basically any modern CPU handle it?

"It depends", it's more of a hit to IOPS.

MatMK said:
If so, couldn't it be improved in future releases? I doubt that it is by design.

Sure, and there are proposals. But it's a bunch of work and nobody has really said "we're pushing this forward".

MatMK said:
Something like the functionality of snapshots, but done on existing files. It could even work offline, not on the fly. If someone knows about such a thing, please let me know, would love to use it.

Block cloning allows for such a thing, but makes no attempt to handle the management/discovery/deduplication part of the problem - it's just the required infrastructure.

MatMK said:
I am only saying, if the cost of it being enabled would be low enough that you couldn't tell a difference (like with the bigger record size), there is no real downside of using it. It's like the situation with ZFS compression, that is almost always recommended to be used, as far as I know. You have a file that is uncompressable? In worst case, it just doesn't save any space, as if it would be disabled.

The performance impact is very different, though. LZ4 compression in particular is effectively free, if you're waiting at all for disk I/O.

Heracles · Aug 3, 2023

The only purpose of dedup is to save space.

If you consider the costs and risks associated with dedup, you will quickly see that it is way higher than the cost of storage space itself. If you need more space, do not try to get it by doing dedup. Get actual drives as many and as big drives as you need. That is million time better.

--With more drives, you can do more vDevs and with more vDevs, you have more speed.
--Without dedup, all your RAM is used for caching actual data, so again more speed.
--Without the crypto checksum, you have more speed once again.
--Whatever space dedup would have been able to save because of duplicate information, compression would have save a high proportion of it, transparently, at a much lower cost and much lower risk.

The costs and risks of dedup vs its benefits make it useless for 99.99% of people. Again, its only goal is to save you space. Why would you pay that much in RAM, CPU and risks when the price tag for drives is that low compare to these ?

winnielinnie · Aug 3, 2023

Since the technical side has been assessed and explained thoroughly by the above posts, let me give you something to consider from a non-technical perspective. The “human” perspective, if you will:

Can you find examples of home users of ZFS that use dedup who are benefitting from it without complications or regrets? If so, what do they share in common?

MatMK · Aug 3, 2023

Etorix said:
So, in the best case, you may end up with 1 GB/TB with 1 M records (and lower dedup efficiency…). This comes on top of the recommended minimum of 16 GB RAM to run TrueNAS (with 1 GbE networking…). Current HDDs are well over 10 TB, the sweet spot for TB/$ should be around 16-18 TB HDDs. So even the smallest 2-way mirror (not really safe or recommended at these sizes) would push minimal RAM to the 32 GB mark (if the calculations were right, and I certainly do not assume so!). Any raidz2 pushes the minimum RAM to at least 64 GB. "Feasible on many more computers", really?

The numbers just do not work the way you'd want.

Sorry, but I think you are wrong in your understanding of what increases the table size. De-duplication in ZFS is enabled on dataset level, and doesn't have anything to do with the configuration of the hard drive pool. Should there be a single drive, two drives mirrored, three drives in RAIDZ2 - they will all have the same table size. Dataset size matters, redundancy does not. It doesn't make sense to multiply the requirements by the number of actual drives in the pool.

Etorix said:
code optimisation would NOT come with a decrease in memory requirements: These ARE by design.

I didn't claim that memory requirements would decrease. What I was talking about was the CPU usage, in reaction to the code being allegedly poorly optimized. Of course the memory usage is by design, the table has to housekeep all the data. Abstraction overhead (which results in more CPU instructions in critical parts of code) on the other hand is something that can be reduced, if developers decide to invest time in more cleverly refactoring the code.

MatMK · Aug 3, 2023

Ericloewe said:
Block cloning allows for such a thing, but makes no attempt to handle the management/discovery/deduplication part of the problem - it's just the required infrastructure.

Sounds really interesting! Would be great if it became more integrated in the system after it matures. A cron script to find and link duplicates transparently every now and then would sure be a way to take the benefits without the costs of a real-time dedup.

MatMK · Aug 3, 2023

Heracles said:
--With more drives, you can do more vDevs and with more vDevs, you have more speed.
--Without dedup, all your RAM is used for caching actual data, so again more speed.
--Without the crypto checksum, you have more speed once again.

But what if I don't need ludicrous speed? If the NAS would mostly be used as a place where you put your files and backups, instead of a continues working directory, you wouldn't care that much for few less IOPS or MB/s, if it would save you space. It all depends on the usage of course, is all I am saying.

MatMK · Aug 3, 2023

winnielinnie said:
let me give you something to consider from a non-technical perspective. The “human” perspective, if you will:

Can you find examples of home users of ZFS that use dedup who are benefitting from it without complications or regrets? If so, what do they share in common?

I am not in this community long enough to know many people that use ZFS, yet alone enable dedup. As I said, I am only a beginner.
I see your point however. It is true that if everything that the user would do would be to store new documents, pictures, etc., de-duplication would be practically useless for them. But if we consider more advanced users, I think that there sure could be valid usecases.
As if it would be without regrets, that would entirely depend of what they would expect the functionality to be like, and what speed they would require.

HoneyBadger · Aug 3, 2023

Ericloewe said:
Sure, and there are proposals. But it's a bunch of work and nobody has really said "we're pushing this forward".

It's not quite verbatim, but how does "the work is underway" rank on that?

Fast ZFS Dedup - Sponsorship Request

It’s no secret that ZFS and deduplication have had performance issues in the past. Use cases that demand high read performance have aligned well with deduplication, given the compressed and deduplicated data is only stored once in ZFS read cache. Sustained write performance however has presented...

www.truenas.com

It's definitely a lot of work, though.

@MatMK you've raised a lot of excellent technical discussion here; as I'm presently on holiday I'll have to get back to this when I have access to a full keyboard and my test lab again to give you a proper response, but for now note that the "5GB per TB" rough guidance is based on an average recordsize of 64K; you can scale the memory footprint accordingly from there. Also, if you haven't read it already, check out the resource from @Stilez regarding the heavy amount of I/O required to support the permanent copies of the DDT (creating an IOPS bottleneck that extra RAM doesn't solve) here:

My experiments in building a home server capable of handling fast + consistent deduplication

AIM: To help people looking at deduplication on TrueNAS 12+, what I've found on the way making it work on mine. On sustained mixed loads, such as 50GB+ file copies and multiple transfers, using TrueNAS 12 with a deduped pool and default config...

www.truenas.com

Heracles · Aug 3, 2023

MatMK said:
But what if I don't need ludicrous speed?

No matter how or what you count : dedup is nothing but a waste.

No need for speed ? Go with a single RaidZ2 or RaidZ3 vdev. You will have good protection and save maximum space.
That is still not enough space ? Go for more or bigger drives…

In all cases, dedup is never the way to get more space and trying to achieve that is its only purpose. That leaves dedup as a completely useless feature for 99.999% of users, including you.

Even should you wish to waste resources on purpose, do it with things that will not jeopardize your data. Create some jails and deploy services instead. You will learn more, more useful stuff in a safe way for your data.

MatMK said:
I think that there sure could be valid usecases

No, there are none. Half a dozen of highly experienced users told you.

You can go and read the discussion section of @Ericloewe resources about ZFS. We talked about that at the very beginning.

HoneyBadger · Aug 3, 2023

Heracles said:
No, there are none. Half a dozen of highly experienced users told you.

Only the Sith and math majors deal in absolutes.

While dedup may not (currently) be the answer for almost every user who asks, there is the occasional situation where it's extremely useful and works well - I can cite personal experience on this matter.

Heracles · Aug 3, 2023

HoneyBadger said:
there is the occasional situation where it's extremely useful and works well

When people are able to understand its deep working, able to adjust tunables and do their own debug and reverse engineering… Then maybe yes. But for anyone just starting with TrueNAS, do not touch it before learning more about many other things…

Arwen · Aug 4, 2023

I do have to agree with @Heracles about anyone just starting with TrueNAS, (and by implication, ZFS), should not touch ZFS de-dup. This is why I wrote that De-Dup Resource, to discourage new or casual users from using that ZFS feature.

Anyone thinking about using ZFS de-duplication, should have a clear, precise understanding of ZFS architecture, command line usage, debugging, backups, and the over all what the server is intended to do. Then, top it off with ZFS de-dup knowledge. With out much of that, the potential to loose data is higher.

Now the original poster, @MatMK, may very well be on the way to achieving that knowledge. So I will stand on the sidelines of that part of this discussion.

Etorix · Aug 4, 2023

MatMK said:
Sorry, but I think you are wrong in your understanding of what increases the table size. De-duplication in ZFS is enabled on dataset level, and doesn't have anything to do with the configuration of the hard drive pool.

Indeed, it doesn't have anything to do with pool geometry, but it does have a lot to do with pool capacity—if you enable dedup on the entire pool, as I had implicitly assumed. Thanks for calling out the implicit assumption and giving me the opportunity to restate my point: Dedup of whole pools of large modern hard drives just require large, consumer-unfriendly, amounts of RAM.

You may restrict dedup to particular datasets, with quotas to ensure that these datasets will not grow too large.
At a base 5 GB/TB, with default record size, you'll still quickly need lots of RAM for a frustratingly small dataset.
You may further set a large record size to lower memory requirements, but this also lower the efficiency of dedup. So you end up with a relatively small dataset (only a few TB before you do push up RAM requirements) which should hold only large files (to go with the large record size) and low deduplication efficiency. How useful is that???
You can throw 64 GB LRDIMM modules at it. You can throw an Optane metadata L2ARC at it. But eventually you'd be better just throwing more drives and/or bigger drives at a non-dedup rather than making dedup work acceptably with exotic hardware.

Take it from @Heracles: For at least 99.9% of users and use cases, dedup just isn't worth it. It takes up way too much ressources for gains which are way too small.
If you think you're in the 0.01%, good for you. But think hard—and then think again.

MatMK said:
But what if I don't need ludicrous speed?

We are not talking about merely "slow" here: We are talking about the NAS becoming so sluggish it becomes at best useless and at worse dangerous for the data it holds!
For instance: A scrub on a backup system taking days to complete instead of mere hours—which then means that the scrub interferes with daily replication tasks to this backup system. (Real personal experience… and that was with "only" 64 GB RAM on said backup system, and dedup enabled only on select datasets with strict quotas—but default record size because duplicates include lots of small files.)

Important Announcement for the TrueNAS Community.

Deduplication with big record size

Cadet

MVP

Powered by Neutrality

Server Wrangler

Cadet

Wizard

Server Wrangler

Wizard

MVP

Cadet

Cadet

Cadet

Cadet

actually does care

Wizard

actually does care

Wizard

MVP

Wizard

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Deduplication with big record size"

Similar threads