Best way to replicate very large data pools over long distance

mauzilla

Dabbler
Joined
Jul 7, 2022
Messages
17
I am trying to set up replication of 4 different datasets (all children of another dataset) to a remote server that is about 800km away. Each dataset is about 1-1.5TB in size, so we're effectively looking at about 5TB of initial data; after that it will just be snapshot deltas.

The source is located in our datacentre, but the remote is 800km away at our office, which unfortunately is prone to unexpected internet and power issues. Although we have inverter-backed power, our ISP has daily "connection issues". These outages are short (maybe a couple of minutes), but they interrupt the initial snapshot transfer, and from what I can see there is no way to "continue where you left off": the task returns an error, and rerunning it seems to simply start fresh (with the previous snapshot data still present). That means replicating 1.5TB over 100Mbps, which takes about 50 hours, is becoming quite a challenge. Once the initial seed is done, our daily replication is much smaller and manageable, but getting the initial data across is proving less easy. It is worth noting that I am replicating 4 separate datasets instead of 1 to "chunk" the data being transferred, in the hope of getting it across.

I tried replicating the 1st dataset yesterday and got to about 300GB before we had an unexpected power cut (I don't quite know why the server rebooted as there was power, but the outage ran to 5 hours, which meant our batteries had to be switched off to preserve them). The task stopped with an error (as expected), but starting it again this morning a) seems to have left the previous snapshot attempt's data in place and b) started from scratch. I am letting it run to see how far we get, but it may be time to look at other options.
  1. What options (if any) are there to rerun the previous task so that it continues where it left off? I am using SSH (without netcat, as I ran into post-SSH connection errors, likely because I had not configured port forwarding for the netcat service).
  2. I intend to create 4 different tasks, but the data is all located under a parent dataset on the same pool. I assume I could get away with a single task, but I don't know how the snapshots are created and fear I would end up with a single 5TB snapshot to transfer instead of 1-1.5TB at a time (see the rough sketch after this list for what I'm hoping happens). Further to this, if the task fails after transferring 1 of the 4 individual snapshots, will it be able to continue where it left off, i.e. not redoing the completed transfer but only the failed one?
  3. If this does not work out, I could probably get temporary storage plugged into the cabinet at the DC, replicate to external storage, ship the data to us and import it here. I am, however, not sure how this would work or whether it is even possible.
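
To illustrate what I'm hoping for in point 2: as far as I understand, a recursive snapshot of the parent does not produce one 5TB snapshot, it creates an individual snapshot on each child dataset, and each of those could then be sent on its own. A rough CLI sketch of what I have in mind (pool and dataset names are just placeholders):

    # Recursive snapshot: creates tank/data@seed plus tank/data/ds1@seed, ds2@seed, ...
    zfs snapshot -r tank/data@seed

    # Each child snapshot can then be sent as its own stream, so a failure only affects that stream
    zfs send tank/data/ds1@seed | ssh backup@remote zfs recv -s remotepool/data/ds1

If that is right, then even a single recursive snapshot could still be transferred in per-dataset chunks, and a completed child would not need to be resent if a later one fails. Please correct me if I've misunderstood how the GUI tasks handle this.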

Hoping someone who has run into a similar issue has some insights to help a brother out :)
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
Make a temporary local pool.
Replicate to it.
Pull the disks and FedEx them to the remote server.
Import the pool on the remote server.
!!!
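
In raw ZFS terms (outside the GUI), the idea is roughly the following; pool and dataset names are placeholders, so adapt them to your layout:

    # At the DC: build a pool on the temporary disks and seed it locally
    zpool create temppool <temp-disks>
    zfs snapshot -r tank/data@seed
    zfs send -R tank/data@seed | zfs recv -F temppool/data

    # Export cleanly before pulling the disks and shipping them
    zpool export temppool

    # At the office: import the shipped pool and replicate into the real destination
    zpool import temppool
    zfs send -R temppool/data@seed | zfs recv -F destpool/data

    # From then on, only incrementals from @seed onwards have to cross the WAN
    zfs send -R -i @seed tank/data@daily1 | ssh office zfs recv -F destpool/data

The key point is to keep the @seed snapshot on the source until the first incremental has completed, since it is the common base every later send relies on.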
 
Joined
Oct 22, 2019
Messages
3,641
our ISP is prone to having daily "connection issues" which although is not long (maybe a couple of minutes) this leads to the initial snapshot transfer to be interrupted, and from what I can see there is no way to "continue where you left off" as the actual task returns as an error and rerunning the task seems to simply start fresh
What options (if any) are there to get the previous task "rerun" but to continue where you left off?

You can use ZFS' built-in "resume token".

I give examples (and explanations) in another thread of a similar issue:

Using resume tokens is not as intuitive as you would think, so read the thread carefully.
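
The short version, using a hypothetical destination dataset remotepool/data/ds1: an interrupted receive leaves a receive_resume_token property on the destination, and you feed that token back into zfs send instead of starting the stream over.

    # On the destination: read the token left behind by the interrupted receive
    zfs get -H -o value receive_resume_token remotepool/data/ds1

    # On the source: resume the stream from that token
    zfs send -t <token-value> | ssh backup@remote zfs recv -s remotepool/data/ds1

    # If you ever want to throw away a partial receive instead of resuming it
    zfs recv -A remotepool/data/ds1

Note that the destination has to be receiving with -s for the token to be saved in the first place, which is what the question below about the GUI comes down to.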


I always thought that the GUI's Replication Tasks automatically enabled resume tokens under the hood?
 