I am trying to set up snapshot replication over a high-latency VPN between Europe and Asia, but the connection keeps dropping before the first snapshot is transferred successfully, and each retry restarts the transfer from the beginning.
Sending server:
FreeNAS 9.10.2-U6 (virtualized, VMware ESXi 6.7)
12 GB RAM, 2 vCPU (Xeon E3-1226 v3)
Internet connectivity is a symmetrical 250/250 Mbps link
Receiving server:
TrueNAS 12.0-U8 (virtualized, VMware ESXi 7.0)
16 GB RAM, 2 vCPU (Pentium Gold 5405U)
Internet connectivity is a symmetrical 1000/1000 Mbps link (shared across a large building; speeds typically reach 500+ Mbps)
A copy of pfSense runs on each local virtualization host, and the two sites are linked with a WireGuard VPN tunnel. RTT latency is around 265 ms.
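(For context on why throughput is limited on this path: TCP throughput is bounded by window size divided by round-trip time, so even a hypothetical 2 MiB window over a 265 ms RTT tops out around 2 MiB / 0.265 s, roughly 63 Mbps, regardless of link speed.)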
Transfer speed from the sending to the receiving server fluctuates quite a bit, between 15 and 60 Mbps. With the sending side's TCP congestion control switched to H-TCP (net.inet.tcp.cc.algorithm=htcp), the transfer speed averages around 25 Mbps.
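In case it helps anyone reproduce this, switching the congestion control algorithm on a FreeBSD-based sender looks roughly like the sketch below; on FreeNAS/TrueNAS the persistent equivalent is a loader tunable plus a sysctl tunable in the GUI, so treat this as a sketch rather than the exact steps I used:

# load the H-TCP module and select it at runtime
kldload cc_htcp
sysctl net.inet.tcp.cc.algorithm=htcp
# list the congestion control algorithms currently available
sysctl net.inet.tcp.cc.available
# to persist across reboots:
#   /boot/loader.conf  -> cc_htcp_load="YES"
#   /etc/sysctl.conf   -> net.inet.tcp.cc.algorithm=htcp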
The first snapshot the replication task attempts to synchronize is 58 GB, sent without transfer compression (I had to disable compression because of the version difference between the sending and receiving systems). Repeatedly checking "zfs list" on the receiving server, I see the initial replication reach roughly 1-2 GB before the connection is reset and the transfer restarts from the beginning.
/var/log/debug.log on the sending server shows only that the connection was closed by the remote host (I assume some hiccup due to the poor link and high latency), and about 20 seconds before that the throughput as seen from pfSense had already dropped to zero:
Feb 24 04:38:01 freenas autorepl.py: [tools.autorepl:157] Replication result: Connection to truenas.localdomain closed by remote host.
Failed to write to stdout: Broken pipe
"zfs recv" has an option "-s" that creates a token for use with "zfs send" to resume from a partially received state, but I do not see that this is used by the replication function available in the GUI on either system.
As a trial, I am manually sending this initial snapshot using the resume functionality. After each partial failure I fetch the latest token on the receiving side with something like "/sbin/zfs get -H -o value receive_resume_token <target_dataset>", then pass that token to "zfs send -t <token>".
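Put together, the manual workflow looks roughly like this; the sender-side dataset name tank/data@manual-snap1 is just a placeholder, while the target dataset and host names are the ones from my setup:

# initial full send; -s on the receiving side keeps resumable state if the link drops
zfs send tank/data@manual-snap1 | ssh truenas.localdomain zfs recv -s mypool/mydataset/backup

# after a drop: fetch the resume token from the receiving side...
TOKEN=$(ssh truenas.localdomain /sbin/zfs get -H -o value receive_resume_token mypool/mydataset/backup)

# ...and resume the interrupted stream from the sender
zfs send -t "$TOKEN" | ssh truenas.localdomain zfs recv -s mypool/mydataset/backup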
For those curious, attempting a normal (non-resume) send to a destination that already holds partial state from "zfs recv -s" gives this error:
cannot receive new filesystem stream: destination mypool/mydataset/backup contains partially-complete state from "zfs receive -s".
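(Side note: if you want to discard that partial state and start the stream from scratch instead of resuming, the receiving side can abort the interrupted receive with "zfs recv -A", which deletes the saved state:)

/sbin/zfs recv -A mypool/mydataset/backup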
If I try to resume with an incorrect (old) token for "zfs send -t", the error is not very helpful:
cannot receive resume stream: kernel modules must be upgraded to receive this stream.
Finally, my question:
Is there any way (via the GUI or otherwise) to configure the automatic ZFS replication so that interrupted snapshot transfers can be resumed, or is there some other built-in way to mitigate this situation?
Helpful resources:
Resuming ZFS send (oshogbo.vexillium.org)
openzfs.github.io
openzfs.github.io
openzfs.github.io