Replicated Archive Servers w/Deduplication

Status
Not open for further replies.

yaplej

Dabbler
Joined
Jun 17, 2015
Messages
22
Hello there,

We need to put together an archival solution for our backups. Originally we were going to use Microsoft Storage Spaces, but started to look at alternative solutions after finding some serious shortcomings with Storage Spaces as a long-term archiving repository.

The main thing is we need two systems that will two-way replicate/sync between each other so if we drop files on either one they get replicated to the other for offsite, redundant backups. We would like to use deduplication because it's going to store very large archive files basically forever. These will likely never be modified, but some of the large files could possibly be deleted at some point.

Our hardware is:
2x Dell PowerEdge R820s
  • 4x E5-4640 0 @ 2.40GHz
  • 128GB RAM
  • 2x ST9300653SS 300GB 15k (boot)
  • 6x SM843 480GB SSD (zil/l2arc?)
  • 2x Dell PowerVault MD1200
  • 24x WD4001FYYG 4TB 7k disks

There are a few options for creating vdevs with this many drives. Going with 2-3 RAIDZ2 vdevs per enclosure seems like a good idea, though there are also posts leaning towards a single RAIDZ3 vdev per enclosure. It seems like if you had 4x MD1200s you could do RAIDZ2 vdevs that span all 4 of the enclosures and have enclosure redundancy... Our Compellent SAN is also based on ZFS and uses RAIDZ2 + 1 hot spare per enclosure. We are trying to decide if we actually need the spare if FreeNAS is unable to automatically initiate a "resilvering"? Odd term, but I'm sure it makes sense to someone. :) It would be easy enough to keep a few of these disks on-site for staff to swap should we need to rebuild one of them.
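
For illustration, the 2x RAIDZ2-per-enclosure idea would look roughly like this at the pool level (device names are just placeholders, and I realize FreeNAS would normally build this through the volume manager GUI rather than raw commands):

    # two 6-disk RAIDZ2 vdevs from the first MD1200 (placeholder device names)
    zpool create archive \
        raidz2 da0 da1 da2 da3 da4 da5 \
        raidz2 da6 da7 da8 da9 da10 da11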
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
I'm not sure about two-way replication. As far as I know, replication is more of a one-way concept, as any changes made on the destination between replications will be rolled back to the common point.

You could get fancy with clones and whatnot, but perhaps you could use rsync instead?

You could definitely do what you want if you relaxed the requirement to upload to either server.

Regarding hot, warm and cold spares: I regard a hot spare as an anti-pattern with RAIDZ2. If you had a drive fail in your RAIDZ2, the correct approach would be to let the latest replication finish (to refresh your backup), then replace the disk with a drive you've burnt in with badblocks testing. Might as well pull that off a shelf, leave it on top of the machine, etc.
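
For the burn-in, something along these lines works, assuming badblocks is available on your build (it's a destructive write test, so only run it on a blank spare; the device name is just an example):

    # four-pass destructive write/verify test on the spare before it goes into service
    badblocks -ws -b 4096 /dev/da12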
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Conservative estimate on dedupe is 5GB of RAM per TB of disk.

It's very rare that the disk saving from dedupe offsets the ram cost.
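
If you want a feel for whether your data would even dedupe well before committing the RAM, you can simulate it against a pool that already holds a sample of the data (read-only, but slow on a big pool; the pool name is an example):

    # simulate dedup and print the projected DDT histogram and dedup ratio
    zdb -S archive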
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
The main thing is we need two systems that will two-way replicate/sync between each other so if we drop files on either one they get replicated to the other for offsite, redundant backups.
As long as you don't expect to work on the same data, this will work. If you have Sites A & B, you can send A to A' at site B and send B to B' at site A.
It's very rare that the disk saving from dedupe offsets the ram cost.
Agree with this. And the worst part is that if the dedupe table grows too large and there isn't enough RAM, you could potentially be unable to mount the pool.
4x E5-4640 0 @ 2.40GHz
This is complete overkill. You could get by with 1 (maybe 2, depending on your RAM config) CPUs.
if you had 4x MD1200s you could do RAIDZ2 vdevs that span all 4 of the enclosures and have enclosure redundancy
You could, but you would lose half your capacity to parity, since you would effectively have a 4 disk RAIDZ2 vdev, no?
We are trying to decide if we actually need the spare if FreeNAS is unable to automatically initiate a "resilvering"?
Having a spare helps if you are remote. It also helps to have a spare installed so that you can initiate a disk replacement before a drive completely fails (I experience "slow" failures more often than hard failures).
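
With a spare already sitting in the chassis, kicking off the replacement of a sick-but-not-yet-dead drive is a one-liner (placeholder pool and device names; the GUI's disk replacement does the same thing):

    # resilver the failing disk's data onto the installed spare, then detach the old disk
    zpool replace archive da7 da24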
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
You could, but you would lose half your capacity to parity, since you would effectively have a 4 disk RAIDZ2 vdev, no?

Well, if you used an 8-way Z2, you can still lose a quarter of the vdev and be good. Just need to make sure that 2 drives of each vdev are in each enclosure.

A similar thing applies to controllers. If possible, try to ensure vdevs are split amongst controllers so a controller failure won't wipe out a vdev.
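
e.g. in the hypothetical 4-enclosure case, an 8-way Z2 vdev built from two disks per shelf would look something like this (purely illustrative device names):

    # one 8-way RAIDZ2 vdev: da0/da1 from shelf 1, da12/da13 from shelf 2,
    # da24/da25 from shelf 3, da36/da37 from shelf 4
    zpool create archive raidz2 da0 da1 da12 da13 da24 da25 da36 da37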
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Well, if you used an 8-way Z2, you can still lose a quarter of the vdev and be good. Just need to make sure that 2 drives of each vdev are in each enclosure.
Good point! :smile:
 

yaplej

Dabbler
Joined
Jun 17, 2015
Messages
22
Yes, I was thinking rsync for the replication piece, not actual ZFS replication. We can use rsync to send only and never purge any files. If we need to purge something, we would have to log in on both systems and delete the files before the systems run rsync again. I knew that ZFS replication would not work, as that only allows data to be written to one system and replicated to a remote system.
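
Something like this on a schedule from each side is what I had in mind (paths and hostnames are placeholders); with no --delete, the sync itself never purges anything:

    # push new/changed archive files to the other system; never delete on the destination
    rsync -av /mnt/archive/ backup@system-b:/mnt/archive/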

To use deduplication it looks like we would need about 480GB of RAM to deduplicate the 96TB. That's based on raw capacity, not capacity after losses to RAIDZ2/3 parity. Seems excessively high (the common consensus about ZFS deduplication?). What about using the SSDs as an alternative to RAM? Is it too much of a performance hit for deduplication to spill to the SSDs? I was thinking the 6 SSDs would be arranged as two 3-way mirrors. We would only get 960GB out of it, but that's not terrible.

It seems very odd that so many are of the opinion that ZFS deduplication is just not worth it. It is disappointing that there is not some form of deduplication that works in the real world. Perhaps post-process? Our goal is to use these systems for archiving the monthly backup exports from our primary backup system. We expect that the files will be similar from month to month. They will probably also get used as a target for other backup jobs, but those will be very small in comparison to the monthly export from the primary backup system. We don't need huge IOPS or throughput, as the primary backup system will likely be the bottleneck. Even though both will have 10 Gbps network interfaces, we have already seen that the primary backup system will not be able to create the archive and export it to the archive server fast enough to bottleneck the R820 running FreeNAS.

Given that performance is not our real goal, would using the SSDs for deduplication provide "good enough" performance that it would make sense to enable deduplication, since its primary use is archival rather than live data?

The R820 also has two SAS controllers, each connecting to an I/O module on the 1st MD1200 and daisy-chained to the 2nd MD1200.
 
Last edited:

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
What about using the SSDs as an alternative to RAM
This seems like a good compromise - create a large L2ARC. Although with the GUI, I don't think you can create striped mirrors for the L2ARC, so you would need to decide which approach is better (manual config via CLI or just a stripe using the GUI).

As for the rest, my only suggestion would be to plan a decent amount of time to test the performance (you can even monitor the dedupe stats to see if it's worth it). ZFS replication in both directions is possible and is quick, as long as you don't need to sync changes to the same data (which it doesn't sound like you need to do). It sounds like you have a backup system at site A and another backup system at site B, and you want each set of data stored both locally and remotely. Is that correct or no?
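
For reference, adding a striped L2ARC from the CLI is just (pool and device names are examples):

    # add the SSDs as L2ARC cache devices (used as a stripe)
    zpool add archive cache da14 da15 da16 da17 da18 da19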
 

yaplej

Dabbler
Joined
Jun 17, 2015
Messages
22
It sounds like you have a backup system at site A and another backup system at site B, and you want each set of data stored both locally and remotely. Is that correct or no?

Exactly. Once the data is written it should be static (no modification). It might be read or deleted after that point. We are probably going to create a "writable" share that is used to ingest the archives and backup files and then move them on a schedule to a read-only folder that is replicated between the two systems. Unless we can figure out how to make a share writable but deny modification to existing files. We really don't want the archives to be deletable/modifiable via any of the shares. Once something is archived it's there unless we explicitly go into the archive server and delete it. These systems would be totally independent of our existing network and backup solution, and documentation for them (such as credentials and encryption keys) would be kept separate and isolated from the existing network. It's not offline backup, but it's pretty close.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
and then move them on a schedule to a read-only folder that is replicated between the two systems.
I'm envisioning a dataset which is shared (and writable), which then gets snapshotted and replicated (both locally and remotely) to a read-only and unshared dataset. This would greatly limit the exposure to being able to change data, since there wouldn't be a mechanism for a user or system to connect to the written data.
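
Roughly like this, with placeholder dataset and snapshot names (in practice the GUI's periodic snapshot and replication tasks handle it):

    # snapshot the writable ingest dataset...
    zfs snapshot archive/ingest@2016-10-01
    # ...replicate it to an unshared dataset (locally here; add ssh for the remote copy)
    zfs send archive/ingest@2016-10-01 | zfs recv archive/vault
    # keep the replica read-only
    zfs set readonly=on archive/vault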
 

yaplej

Dabbler
Joined
Jun 17, 2015
Messages
22
We still want to maintain an option to purge data, just not from a remote share. Assuming the credentials used to back up to the system were compromised, they would not be able to delete the archived backups. Upon reflection, it might be ideal for those credentials to not even be able to read the existing archived files. That part is more flexible once we get the thing set up, so I want to focus on getting the hardware/vdev/pool options sorted out first.

There are a few details that I want to figure out with my department:
  1. Do we want to have a designated spare in the system? (1 spare per enclosure per disk type, 2x WD4001FYYG + 1x SM843; we have staff that can be on-site in short order to swap a drive)
  2. What RAIDZ level and disk count do we want for each vdev? (4x RAIDZ2 [4+2] or 2x RAIDZ3 [9+3], possibly some other variation if using a spare?)
  3. Should we add a ZIL drive or wait on it? (2x SM1625/PM1633)
  4. With deduplication enabled, should the L2ARC be 2-way or 3-way mirrored?
  5. Replication: 2-way replication A->B, B->A (rsync vs. ZFS replication; the consensus is ZFS is faster, provided it will re-sync data that has only been deleted from one system)
What would be ideal: 4x smaller RAIDZ2 vdevs or 2x larger RAIDZ3 vdevs? I'm thinking that RAIDZ2 is totally acceptable in our situation. Given that a 6-disk RAIDZ2 seems to perform similarly to a 12-disk RAIDZ3, 4x RAIDZ2 vdevs would offer better theoretical performance with minimal capacity loss. A single larger RAIDZ2 vdev would net the most capacity but about 1/2 the write performance of splitting it into two vdevs.

https://calomel.org/zfs_raid_speed_capacity.html
12x 4TB, 2 striped 6x raidz2, 30.1 TB, w=638MB/s , rw=105MB/s , r=990MB/s
12x 4TB, raidz2 (raid6), 37.4 TB, w=317MB/s , rw=98MB/s , r=1065MB/s
12x 4TB, raidz3 (raid7), 33.6 TB, w=452MB/s , rw=105MB/s , r=840MB/s
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
In your mind, separate the share (SMB/NFS) from the dataset. You can share via SMB or NFS to the backup server, and then replicate that dataset to other places that are not shared. Replication uses the root account over SSH, so it's unlikely that the root private key will become compromised. Using the snapshot management in the GUI, you can purge whatever snapshot you want as the root user. Nothing is available via the share to do that.

It sounds like you could get away with 12 disk RAIDZ3 and no hot spares.
  • Should we add a ZIL drive or wait on it? (2x SM1625/PM1633)
What protocol are you going to use? If you aren't doing sync writes, then the SLOG isn't needed.
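You can check what the datasets are actually being asked to do; a SLOG only ever gets used for synchronous writes (dataset name is a placeholder):

    # show whether sync writes are requested/forced/disabled on the backup dataset
    zfs get sync archive/ingest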
A single larger RAIDZ2 vdev would net the most capacity but about 1/2 the write performance of splitting it into two vdevs.
Sequential throughput will be similar (it's pretty much the sum of the throughput of the data disks). Having 2 vdevs would provide double the IOPS (~300 vs ~150), which probably aren't relevant in your workload scenario.

As for throughput, again I would suggest doing some testing since you will likely start hitting the networking and/or backup system throughput ceiling. I think the good news is that you have the basic building blocks and you can put them together whichever way best meets the need.
 

yaplej

Dabbler
Joined
Jun 17, 2015
Messages
22
I really appreciate your insight on this setup/build.

Looking through the ZFS Primer.
https://doc.freenas.org/9.3/zfsprimer.html

It has some statements about the L2ARC size not exceeding 5x your memory. I think with deduplication enabled, the availability of the L2ARC would be fairly important to maintaining a consistent level of performance. The 3-way mirror for the L2ARC might be overkill, but would help ensure a level of availability similar to the RAIDZ3 (tolerating two drive failures per stripe). That would give us ~960GB of L2ARC, but that is more than 5x memory (640GB for 128GB of memory) as recommended in the ZFS Primer. Given that the 96TB system would need around 480GB for the DDT, we would have more than enough ARC + L2ARC.

However, I was unable to find any commands for creating any type of mirrored L2ARC. In fact, it appears you cannot mirror cache devices. Perhaps this is just outdated information, but I have yet to find anything that indicates otherwise. That is unfortunate.
http://constantin.glez.de/blog/2011...stions-about-flash-memory-ssds-and-zfs#mirror
http://www.eall.com.br/blog/?p=1844

Lots of theoretical stuff right now just to wrap my head around what is or is not possible. Perhaps we will do some testing soon and see how things go.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Mirroring the L2ARC is silly. It's supposed to be cached data, and if it is corrupted or the entire device goes away, ZFS is supposed to fall back to simply fetching the data again.

Striping it, on the other hand, gets you a faster L2ARC and more of it.

Unless you're running iSCSI or NFS, you probably don't need a SLOG.

I'd suggest 2x 6x4TB in RAIDZ2. IMO the gain in IOPS over Z3 is worth it, assuming you'll have people using this as a file server and not just a data dump.

So. You have two servers. In different locations? With a different set of users using a different set of files?

Or is there a primary and a secondary?
 

yaplej

Dabbler
Joined
Jun 17, 2015
Messages
22
I could then just add all 6x SM843 disks as cache and let ZFS deal with failures. If a particular cache disk fails, it should just re-cache that data onto another cache disk. That would put my L2ARC way over the 5x memory recommended in the ZFS Primer. The problem is, if deduplication is using the L2ARC heavily, there could be a noticeable performance hit until all that data is re-cached. By mirroring we could avoid having to re-warm/re-cache the deduplication data.

We will dump files to both systems and replicate everything between the two. We want a single dataset: any files placed on System A get copied to System B, and any files put on System B get copied to System A. If a file is deleted off System A, it gets copied back from System B in the next sync. The only way to permanently delete something is to log in on both systems and delete the file simultaneously. I've been skimming the manual and online but haven't figured out how ZFS replication would handle something like that. Rsync could work and would only sync new/missing files.

These are not systems users will be accessing. These are long term backup/archive/DR/cryptolocker protection.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
I wonder if a union mount is a thing in FreeNAS.

If it were, you could basically have two separate repositories and merge them at mount time.

Otherwise, rsync between two repositories is what you want. ZFS send/receive will always make the receiver match the sender, including removing remote changes which have been made since the last replication. So you would need two reciprocal repositories and replicate each one in a different direction. Hence the union mount idea.
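
To be clear about why send/receive can't do the bidirectional thing: the receiving side has to be forced back to the last common snapshot before it will accept the stream, e.g. (dataset, snapshot and host names are placeholders):

    # incremental replication; -F rolls the destination back to the last common
    # snapshot, discarding anything that changed there since
    zfs send -i archive/repo-a@prev archive/repo-a@now | ssh system-b zfs recv -F archive/repo-a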
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
We want a single dataset: any files placed on System A get copied to System B, and any files put on System B get copied to System A. If a file is deleted off System A, it gets copied back from System B in the next sync.
This isn't possible with ZFS. You can't share a single dataset back and forth. You can easily have a subdataset for A and another one for B. You can also prevent a deletion on System A (and/or B) and you can also have snapshots of your backups so that even if someone/something modified a backup, you would still be able to clone and/or roll back.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
But you can use rsync to mirror changes
 

yaplej

Dabbler
Joined
Jun 17, 2015
Messages
22
This isn't possible with ZFS. You can't share a single dataset back and forth. You can easily have a subdataset for A and another one for B. You can also prevent a deletion on System A (and/or B) and you can also have snapshots of your backups so that even if someone/something modified a backup, you would still be able to clone and/or roll back.

I think we were talking about two different things. I was just talking about the data (the large backup export files in particular), not specifically a dataset (the ZFS object). Forgive me if I have been using the wrong terms here. Each system would have its own zpool, dataset, and filesystem mount for that dataset. Files (those large backup export files) would live on that dataset, and we would just rsync from one dataset's mount point to the remote system's dataset mount point (specifically, an rsync push would probably be best). That seems to make sense because I can access the files on my datasets from the shell, so rsync between these mount points should also work.

However, I think (and hope) I now understand what you were proposing: having two separate datasets, one for each system (A-to-B and B-to-A). Writes to System A would be made to its (A-to-B) dataset and then replicated to System B's (A-to-B) dataset. Deleting a file would have to happen on System A, and that deletion would be replicated to System B. The mirror image would be true for writes to System B: they would be written to the (B-to-A) dataset and then replicated to System A's (B-to-A) dataset.

It would not have the ability to automatically re-sync the data back from System B; it would just be deleted off System A. We could use a snapshot to go back and recover the file if needed. However, that would not give us the "must purposely delete the file off both systems to fully delete archived files" behavior. With rsync, the next sync would put the file back on the system it was deleted from (assuming the file was successfully synced initially).
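
So on each box the layout would be something like this (names are just placeholders), with System B being the mirror image:

    # on System A
    zfs create archive/a-to-b   # written locally, replicated out to System B
    zfs create archive/b-to-a   # replica received from System B, kept read-only locally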
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Yes. I think you want to use rsync. "Replication" is a specific ZFS term in this context, and it's one-way.
 