Is Deduplication advised for cloud storage?

mpyusko

Dabbler
Joined
Jul 5, 2019
Messages
49
The machine is a purpose build for a NextCloud Server. The logic is,
  • users potentially having copies of the same file
  • NextCloud revision history
  • decent hardware capabilities
LZ4 Compression, no encryption. TrueNAS Core 12
  • Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (6 cores/12 threads)
  • 48 GB DDR3 ECC RAM (currently only 12 GB while I wait for the rest to come in)
  • 500 GB L2ARC (970 EVO Plus NVMe)
  • 4x 4 TB Toshiba X300 CMR (currently 2x 4TB X300 and 2x 1TB P300 while I wait for the rest to come in)
  • Dual GigE LAN with LAGG
  • RAIDZ
I'm not sure if I'll add a ZIL/SLOG or not. The majority of the burden will be Read (Sync and Antivirus) with occasional Write once the initial data is loaded. The ARC/L2ARC is critical, though. NextCloud will be running via the plugin install on the server, not in a VM.

I've been using FreeNAS for a few years, on several servers. I wouldn't call myself a guru, more of a power user. All the docs I've read are murky about exactly when deduplication becomes a benefit, and about the point at which the server's hardware makes it a non-issue. Because the array is for archival storage rather than running a stack of VM OS VHDs with all their heavy I/O, I'm thinking this is a textbook scenario where deduplication is acceptable. Space is the primary concern, especially since NextCloud can maintain revision history, which I believe could benefit from deduplication. Currently the NextCloud data runs on a VHD connected to a VM on a separate hypervisor, so I'm mixing its storage with other live VMs. By moving it to a separate server in a native jail, I hope to increase performance across the board. I would like to think my specs are more than enough to enable it.

Thoughts?

(BTW.... I'm roughing it all out right now, but when the server actually goes "live", all the parts will be in and it will be fully upgraded.)
Edit: I should point out that, to maintain optimum performance, 9 TB will be my data cap, though I don't plan on reaching that for quite some time.
 
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Read (Sync and Antivirus)
Maybe you should look into a metadata VDEV if the bulk of the slowness will actually be the metadata traversal/comparison. You may still benefit from L2ARC if your data is mostly static as you suggest, but with only 48GB of RAM, the forum would recommend upgrading that to 64GB rather than L2ARC first.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Is Deduplication advised

No.*

*In your specific use case it may provide some benefit, if the files being copied/stored as revisions span multiple records and only change slightly, so let's dive in. Anyone else hitting this from a Google search: stop right now and read on to make sure you can actually benefit before you paint yourself into a corner.

Because ZFS deduplicates at the record level, not the file level, your returns will depend on how big the files are and how much of each file is modified. A small Word document of a few dozen KB will likely be stored in a single record (the default dataset recordsize is 128K max, and that's after compression), so any change to it (including an internal "last modified" timestamp) ends up writing a whole new record as the new file revision. No match, no dedup. You still pay the ~320 byte cost in RAM for the DDT hash on that record. If you're handling larger files (several MB) that span records, and your revisions aren't extensive - e.g., an 8-record file where you modify one record's worth of data - you write pointers for 7 records, store 1 new one, and pay 8x 320 bytes of RAM to do so. Storage is cheap; RAM less so.
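To put rough numbers on that, here's a back-of-envelope sketch of the DDT RAM cost for the 9 TB cap mentioned above, assuming the default 128K recordsize and the ~320 bytes per entry quoted here (the real figures shift with compression, small files, and the actual per-entry size on your pool):

```python
# Back-of-envelope DDT RAM estimate (assumptions: 9 TiB of data, 128 KiB average
# record size, ~320 bytes of in-core cost per unique record, every record unique).
data_bytes = 9 * 1024**4        # 9 TiB data cap from this thread
avg_record_bytes = 128 * 1024   # default recordsize, worst case after compression
ddt_entry_bytes = 320           # approximate per-entry cost quoted above

records = data_bytes / avg_record_bytes
ddt_ram_gib = records * ddt_entry_bytes / 1024**3

print(f"~{records / 1e6:.0f} million records -> ~{ddt_ram_gib:.1f} GiB of RAM for the DDT")
# prints: ~75 million records -> ~22.5 GiB of RAM for the DDT
```

Lots of small files pull the average record size down, which pushes the record count and the RAM number up quickly.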

Personally, I'd switch to larger drives (and more of them) instead of risking things with deduplication, and use ZFS snapshots to handle "previous versions". However, if you already have a chunk of data with those "multiple revisions", you could set up a test pool, copy the data in with dedup=on, and see what kind of results you get, as well as the impact on your performance/writes.
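If you do run that test-pool experiment, the quickest sanity check is the pool's dedupratio property and the DDT statistics. A minimal sketch (the pool name "testpool" is a placeholder for whatever you call your scratch pool):

```python
# Check dedup results on a scratch pool; "testpool" is a placeholder name.
# Thin wrapper around the standard zpool commands - run with rights to query the pool.
import subprocess

POOL = "testpool"

# Overall ratio achieved so far (1.00x means dedup found nothing to share)
ratio = subprocess.check_output(
    ["zpool", "list", "-H", "-o", "dedupratio", POOL], text=True
).strip()
print(f"dedupratio for {POOL}: {ratio}")

# DDT entry counts and in-core/on-disk sizes, once dedup=on data has been written
subprocess.run(["zpool", "status", "-D", POOL], check=True)
```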

Maybe you should look into a metadata VDEV if the bulk of the slowness will actually be the metadata traversal/comparison. You may still benefit from L2ARC if your data is mostly static as you suggest, but with only 48GB of RAM, the forum would recommend upgrading that to 64GB rather than L2ARC first.

Metadata (dedup specifically) on SSD will be beneficial if it turns out that dedup is the answer here, but it doesn't alleviate the need for gobs of RAM. Given the Westmere processor, it's likely a 3-channel setup with 6 DIMM slots, so 48 GB is the easy 6x8 GB configuration. 6x16 GB is better, but 16 GB DIMMs, even DDR3 RDIMMs, are still a fair bit more expensive than 8 GB ones. If dedup isn't being used, that's no issue; but if the attempt will be made, definitely find a way to get even more RAM than that.
 


Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
We run dedup on some "special" FreeNAS boxes, which are used as a kind of WORM archive.
There we see a benefit of 4.8x in storage savings.
We run maximum compression as well, so it's doubly CPU and memory hungry.
Hitting this unit with a VM move to the archive will use 80% of the 20 cores / 40 threads.
So yes, you can save a lot of storage, but at CPU and memory expense.
Choose wisely.
 

mpyusko

Dabbler
Joined
Jul 5, 2019
Messages
49
The 48 GB of RAM is a hardware limit.
The 4 LFF drives are a hardware limit.
The OS is running on a SATA SSD in the optical bay.
The L2ARC NVMe is running off the PCIe riser.

What is being stored?
  • Photo/Video files (camera uploads that are likely unique to each user)
  • Documents/Graphics/etc.
  • The ISO repository for Xen. NextCloud allows sharing "external storage", which can be a folder on the local machine like /mnt/RAIDZ/iso (which is shared via SMB/NFS to the Xen hypervisors). This makes it easy to add/remove ISOs in the repo.
  • Daily database backups (multi-gig, another "external storage"); files older than 30 days are automatically deleted. Archives grow from zero MB to multiple GB each day.
  • Daily website backups (multi-gig, another "external storage"); files older than 30 days are automatically deleted. Archives grow from zero MB to multiple GB each day.
  • Revision and deletion history for all files in both internal and external storage. ("Unlimited" until the quota is full, then the oldest copies/files are removed to make room for new data, until the minimum history is hit and then the quota is a hard limit.)
Down the road, I plan to replace the drives with larger ones as my needs require. Right now all I need is 4 TB. The 9 TB practical limit, with or without dedupe, won't be reached for a couple of years, at which point the hardware will likely need to be replaced anyway. I'm looking at 5 years; then the secondary (this machine) gets retired, the current primary becomes the new secondary, and I deploy a new primary.

So the question is, would Deduplication be appropriate for this scenario?

It seems the answer is a resounding, no.

I already loaded the server with about 35 GB of test data, and the write performance seemed to easily handle the ~200 MB/s coming at it from two different sources on my LAN. Of course both the ARC and L2ARC are full. (At this point only 12 GB of RAM is installed while I wait for the rest.) Dedupe is ON, but I can switch it off since this is still on the test bench and I'm only working with copies of the data. (Real copies of the actual data that will be stored on it, not simulated test data. I can wipe it and reload it if necessary.) The CPU did not seem to be struggling at all, although it seems that with dedupe the issue is more memory related.

Did I read somewhere that the deduplication hash table can actually be placed on a disk, like an NVMe, instead of in RAM? Something like, it has to be a mirror though (Optane recommended)? This confuses me a bit because RAM is volatile, so the hash table would disappear with every reboot, right? I think I read that only new data written after deduplication is turned on gets added to the hash table; existing data is not hashed. Now, I can see that a mirror would benefit read speed for the hash table, but is that all? I don't recall reading about a copy of the hash table being stored on hardware and then loaded into RAM at boot. If the hash table exceeds RAM, then it spills to L2ARC, or it basically stalls hashing new data if the L2ARC is exceeded/not present. The L2ARC gets wiped between boots too. But again, I don't recall anything about the hash table being persistently stored on a disk. Am I missing something somewhere?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Did I read somewhere that the deduplication hash table can actually be placed on a disk, like an NVMe, instead of in RAM? Something like, it has to be a mirror though (Optane recommended)? This confuses me a bit because RAM is volatile, so the hash table would disappear with every reboot, right? I think I read that only new data written after deduplication is turned on gets added to the hash table; existing data is not hashed. Now, I can see that a mirror would benefit read speed for the hash table, but is that all? I don't recall reading about a copy of the hash table being stored on hardware and then loaded into RAM at boot. If the hash table exceeds RAM, then it spills to L2ARC, or it basically stalls hashing new data if the L2ARC is exceeded/not present. The L2ARC gets wiped between boots too. But again, I don't recall anything about the hash table being persistently stored on a disk. Am I missing something somewhere?

Here's the core challenge with deduplication: the hash table (often called the "ddt", for DeDuplication Table) is considered metadata and is a permanent part of the pool. A copy exists in your ARC (and L2ARC if present), but that only helps for reads against the ddt (when new writes come in, they need to be checked for a match). Whenever new data is written - whether it's "hash matched a previous record, write a reference pointer" or "hash doesn't match, write the new hash and also the data" - that metadata needs to be updated. Up until 12.0/OpenZFS 0.8+, that permanent copy of the ddt lives on the main data vdevs, and if those are spinning disks, that means a lot of small random I/O, which kills performance.

In 12.0/OpenZFS 0.8+ you can mark the ddt as special data and use the special vdev type (mirrors strongly recommended, Optane is excellent because it doesn't suffer from the same loss of performance under concurrent R/W) to hold it - but the best performance on ddt lookups still comes when the entire table fits into RAM. Optane is fast, sure, but it's not RAM-fast.
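For reference, attaching a mirrored special-class vdev to an existing pool is a one-liner in 12.0/OpenZFS 0.8+. The sketch below only does a dry run (-n), so nothing is actually changed; the pool and /dev/nvd* device names are placeholders. Keep in mind that a special vdev added for real to a pool with RAIDZ data vdevs can't be removed later.

```python
# Dry-run preview of adding a mirrored special vdev (metadata, optionally DDT) to a pool.
# "testpool" and the /dev/nvd* device names are placeholders for this sketch.
# The -n flag makes zpool print the resulting layout without modifying the pool.
import subprocess

POOL = "testpool"
DEVICES = ["/dev/nvd0", "/dev/nvd1"]  # two NVMe devices to mirror

subprocess.run(["zpool", "add", "-n", POOL, "special", "mirror", *DEVICES], check=True)

# A dedicated dedup-only vdev uses the "dedup" class the same way:
# subprocess.run(["zpool", "add", "-n", POOL, "dedup", "mirror", *DEVICES], check=True)
```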

@Stilez has penned up a couple of excellent resources that summarize experiences with deduplication and using Optane for special vdevs; they're long but worth the read.



The hash table is always stored on the data or special vdevs; if it exceeds available RAM, you start getting bottlenecked by how fast your permanent ddt reads/writes are. With Optane, that bottleneck might be 300-400 MB/s. With a regular SSD, maybe 100 MB/s. Without any SSD? Maybe 3-4 MB/s, if you're lucky.

L2ARC can be made persistent in 12.0 as well, but again that only affects the reads. Writing updates to the ddt is painfully slow on spinning disk.

12.0/special vdevs can make it possible. It's still not a panacea and definitely not something that should be enabled for everyone or without paying careful attention to your build.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
What is being stored?
  • Photo/Video files (camera uploads that are likely unique to each user)
  • Documents/Graphics/etc.
  • The ISO repository for Xen. NextCloud allows sharing "external storage", which can be a folder on the local machine like /mnt/RAIDZ/iso (which is shared via SMB/NFS to the Xen hypervisors). This makes it easy to add/remove ISOs in the repo.
  • Daily database backups (multi-gig, another "external storage"); files older than 30 days are automatically deleted. Archives grow from zero MB to multiple GB each day.
  • Daily website backups (multi-gig, another "external storage"); files older than 30 days are automatically deleted. Archives grow from zero MB to multiple GB each day.
  • Revision and deletion history for all files in both internal and external storage. ("Unlimited" until the quota is full, then the oldest copies/files are removed to make room for new data, until the minimum history is hit and then the quota is a hard limit.)
Down the road, I plan to replace the drives with larger ones as my needs require. Right now all I need is 4 TB. The 9 TB practical limit, with or without dedupe, won't be reached for a couple of years, at which point the hardware will likely need to be replaced anyway. I'm looking at 5 years; then the secondary (this machine) gets retired, the current primary becomes the new secondary, and I deploy a new primary.

So the question is, would Deduplication be appropriate for this scenario?

It seems the answer is a resounding, no.

I already loaded the server with about 35 GB of test data, and the write performance seemed to easily handle the ~200 MB/s coming at it from two different sources on my LAN. Of course both the ARC and L2ARC are full. (At this point only 12 GB of RAM is installed while I wait for the rest.) Dedupe is ON, but I can switch it off since this is still on the test bench and I'm only working with copies of the data. (Real copies of the actual data that will be stored on it, not simulated test data. I can wipe it and reload it if necessary.) The CPU did not seem to be struggling at all, although it seems that with dedupe the issue is more memory related.

Did I read somewhere that the deduplication hash table can actually be placed on a disk, like an NVMe, instead of in RAM? Something like, it has to be a mirror though (Optane recommended)? This confuses me a bit because RAM is volatile, so the hash table would disappear with every reboot, right? I think I read that only new data written after deduplication is turned on gets added to the hash table; existing data is not hashed. Now, I can see that a mirror would benefit read speed for the hash table, but is that all? I don't recall reading about a copy of the hash table being stored on hardware and then loaded into RAM at boot. If the hash table exceeds RAM, then it spills to L2ARC, or it basically stalls hashing new data if the L2ARC is exceeded/not present. The L2ARC gets wiped between boots too. But again, I don't recall anything about the hash table being persistently stored on a disk. Am I missing something somewhere?
@HoneyBadger has written some useful information - and thanks for the mention :)

I'd like to add this:

Technically, the dedup (hash) table is part of how data is stored in the pool. ZFS keeps a hashed list of blocks to allow easy identification of duplicates; in simple terms, to find an actual block of data on disk, ZFS uses the DDT as an extra step. It's not a cache or an optional extra - copies of it only live in RAM temporarily. It's as much a part of the pool as the dataset layout, the snapshot info, the pointers to files, or the file date/time metadata. For practical purposes you can think of ZFS handling dedup metadata the same way it handles all of that, if it helps.

You ask "is dedup appropriate?". The answer always comes back to: what are your constraints and compromises? Dedup exists to get around disk space issues at the cost of extra processing. Dedup will always slow your system down, because it involves extra hashing and checking, and considerable amounts of small-block (4K) I/O. If you have limited money or limited hardware/connectors, you might have to use dedup because the extra disks would be too expensive. If you don't have a problem with disk space, you may not need it.

These are some questions to ask to assess whether dedup is right for *you* (see the rough sketch after this list):
  1. How much disk space do you actually *need* now? How much will you need in 3-5 years (depending on how far ahead you plan for future stuff)?
  2. How much of that space will dedup actually save? (How much of that data is, or will be, actually duplicated?)
  3. Can you use something like rsync (built in), which allows you to do incremental backups of your data? If you don't need infinite backups, or you can use incremental rather than full backups, you won't use as much disk space over time, so dedup won't show many benefits or be as worthwhile.
  4. Can you, or will you, buy larger disks to store your near-future storage needs undeduped?
  5. Is speed/simplicity, or money, the limiting factor here? Do you need to use dedup to make it affordable/practical?
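To make questions 1, 2 and 4 concrete, here's a rough sketch that turns an assumed dedup ratio into "space saved vs. DDT RAM needed". Every input is an assumption you'd replace with your own numbers (for example, the ratio reported by a dedup=on test pool or a zdb -S simulation):

```python
# Rough dedup cost/benefit sketch. All inputs are assumptions - replace them with
# your own measurements (e.g. the ratio from a dedup=on test pool or `zdb -S`).
data_tib = 9.0           # expected data to store (question 1)
dedup_ratio = 1.3        # assumed achievable ratio (question 2); 1.0 means no savings
avg_record_kib = 128     # average record size after compression
ddt_entry_bytes = 320    # approximate in-core cost per unique record

stored_tib = data_tib / dedup_ratio
saved_tib = data_tib - stored_tib

unique_records = (stored_tib * 1024**4) / (avg_record_kib * 1024)
ddt_ram_gib = unique_records * ddt_entry_bytes / 1024**3

print(f"space saved : ~{saved_tib:.2f} TiB")
print(f"DDT RAM cost: ~{ddt_ram_gib:.1f} GiB")
# With these example numbers: ~2.08 TiB saved for ~17.3 GiB of DDT RAM -
# weigh that against simply buying a larger disk (question 4).
```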
 