Replication Task Never Finishes

mrbackup

Hello, I am having some issues getting my backup/ZFS-replication setup working.

My idea of a backup scheme is illustrated in the picture below.

[Attachment: Sites-Backup - Help (1).jpg]

I have a server (SERVER) at Site 1, and two backup servers (BACKUP-1 and BACKUP-2) at Site 2. SERVER and BACKUP-2 each have their own SSH key to BACKUP-1; BACKUP-1 has no access to the other servers. I set it up this way so that if someone unwelcome gains access to SERVER, they cannot reach BACKUP-2; if someone gets access to BACKUP-1, they cannot reach either of the other servers; and if someone gets access to BACKUP-2, they cannot reach SERVER. I thought this configuration was a neat way of limiting the potential damage if one server is breached, but please let me know if you have any better ideas.

My problem is with the replication task defined on BACKUP-2 to pull from BACKUP-1, which does not seem to work (at least not reliably). I'll explain my setup first and elaborate on the problem at the end.


Setup:

All servers (SERVER, BACKUP-1, BACKUP-2) are running TrueNAS Core v12.0-U8.1.

The reason for not upgrading to 13 is that I use Proxmox to communicate with TrueNAS (SERVER at Site 1), and Proxmox is, to my understanding, not compatible with the iSCSI implementation in TrueNAS Core v13. Upgrading only BACKUP-1 and BACKUP-2 to v13 is not possible either, because TrueNAS does not support replication between v13 and v12. Please correct me if I am wrong!

Pools

SERVER has the following pool setup:
Code:
/
    tank        
        dir-1    
        dir-2    
        dir-3    

BACKUP-1 has the following pool setup:
Code:
/
    tank            
        backup-1
            dir-1           # by replication
            dir-2           # by replication
            dir-3           # by replication

BACKUP-2 has the following pool setup:
Code:
/
    tank            
        backup-2   
            dir-1           # by replication
            dir-2           # by replication
            dir-3           # by replication


dir-1, dir-2, and dir-3 on SERVER each contain different ZVOLs.
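
For reference, this is roughly how I verify the dataset/ZVOL layout from a shell on each box (output omitted here):

Code:
# list all datasets and ZVOLs under the pool, with their type
zfs list -r -o name,type,used tank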

Periodic Snapshot Tasks

SERVER has the following periodic snapshot task:
Dataset: tank/
Recursive: Yes
Schedule: Every hour
Allow taking empty snapshots: Yes

BACKUP-1 and BACKUP-2 have no periodic snapshot tasks.
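
For what it's worth, the snapshots created by the task on SERVER can be listed with something like:

Code:
# show the most recent automatic snapshots on the source
zfs list -t snapshot -o name,creation -r tank | tail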

Replication Tasks

On SERVER [RT1] (PUSH FROM SERVER TO BACKUP-1):
Code:
Direction: PUSH
Transport: SSH

Stream compression: Disabled
Limit: 50 MiB/s
Allow Blocks Larger than 128KB: Yes
Allow Compressed WRITE Records: Yes

Source: /tank/dir-1, /tank/dir-2, /tank/dir-3
Recursive: Yes
Include Dataset Properties: Yes
(Almost) Full Filesystem Replication: No

Destination: /tank/backup-1/
Destination Dataset Read-only Policy: Set
Encryption: No
Synchronize Destination Snapshots With Source: Yes


On BACKUP-2 [RT2] (PULL FROM BACKUP-1 TO BACKUP-2):
Code:
Direction: PULL
Transport: SSH

Stream compression: Disabled
Limit: None (1 GB/s cable)
Allow Blocks Larger than 128KB: Yes
Allow Compressed WRITE Records: Yes

Source: /tank/backup-1/dir-1, /tank/backup-1/dir-2, /tank/backup-1/dir-3
Recursive: N/A
Include Dataset Properties: N/A
(Almost) Full Filesystem Replication: Yes

Destination: /tank/backup-2/
Destination Dataset Read-only Policy: Set
Encryption: No
Synchronize Destination Snapshots With Source: Yes

Problem

Now, the problem is that RT1 works perfectly, while there are a lot of problems with RT2.

1: It does not seem like “Synchronize Destination Snapshots With Source” has much effect. BACKUP-2 was down for a while, so RT2 was not being run. After getting BACKUP-2 up again and starting RT2, errors like the ones below started to pop up:

Error: Refusing to overwrite data
Error: Target dataset 'backup-2' does not have snapshots but has data (29071913 bytes used) and replication from scratch is not allowed. Refusing to overwrite existing data

I thought enabling “Synchronize Destination Snapshots With Source” would allow overwriting existing data in /tank/backup-2/?
I have not seen any of the same errors connected to RT1.
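
In case it is useful for diagnosing this, here is a sketch of what I run on BACKUP-2 to confirm whether the target really holds data but no snapshots (dataset names as in my setup above):

Code:
# how much data is on the old target dataset?
zfs list -o name,used,referenced tank/backup-2
# does it have any snapshots to use as an incremental base?
zfs list -t snapshot -r tank/backup-2 | head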

Anyway, I circumvented this problem by creating a new dataset in /tank/ on BACKUP-2.

2: After creating the new dataset, everything was working fine for a while, until recently.

The problem started after we added a new ZVOL on SERVER under tank/dir-2/.
RT1 picked up this new ZVOL without any problems, and the replication task is still working as intended.
RT2 picked it up once and finished its replication. The next time RT2 ran, however, the task never completed. It was able to replicate all snapshots of /tank/backup-1/dir-1, but freezes/hangs on snapshots of /tank/backup-1/dir-2.

The GUI reports that the task is running, but it never seems to complete (it would hang like this forever):
[Attachment: image.png]

So I thought disabling, enabling, and starting RT2 again would hopefully solve the problem. However, all it seems to do is move the replication task one snapshot forward before it freezes again (notice the timestamp has moved one hour forward, which matches the snapshot schedule on SERVER):
[Attachment: image (1).png]

This behaviour continued when I disabled, enabled, and started the replication task five more times, until I gave up on this approach.
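
Since the task advances exactly one snapshot per restart before freezing, I started to suspect that a partially received stream is being left behind on the target. This is a sketch of what I intend to check on BACKUP-2 (dataset name as in my setup above):

Code:
# a resume token on the target would indicate an interrupted, resumable zfs recv
zfs get -r receive_resume_token tank/backup-2/dir-2

# if a token is present, the partial receive can be aborted before retrying
# (note: this discards the partially received data)
zfs receive -A tank/backup-2/dir-2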

Investigating further, I wanted to make sure that all replication task processes were killed before I tried to start the task again. The Task Manager reported quite a lot of processes running, although no replication task was enabled (ignore the failed tasks; that's just me experimenting):
[Attachment: image (2).png]

Opening a shell to find these processes and kill them, I ran:

Code:
root@backup-2[~]# ps -A | grep zfs
16  -  DL    0:09.28 [zfskern]
227  -  Ss    0:00.03 /usr/local/sbin/zfsd
1466  -  Is    0:00.01 sh -c exec 3>&1; eval $(exec 4>&1 >&3 3>&-; { /usr/local/bin/ssh -i /tmp/tmpzca3ano_ -o UserKnownHostsFile=/tmp/tmpaclvz49q -o StrictHostKeyChecking=yes -o BatchMode=yes -o ConnectTimeout=10 -p22000 root@10.0.15.10 'sh -c '"'"'(zfs send -V -R -w -i tank/backup-1/dir-2@auto-2023-03-11_09-00 -L -c tank/backup-1/dir-2@auto-2023-03-11_10-00 & PID=$!; echo "zettarepl: zfs send PID is $PID" 1>&2; wait $PID)'"'"'' 4>&-; echo "pipestatus0=$?;" >&4; } | { zfs recv -s -F -x mountpoint -x sharenfs -x sharesmb tank/backup-2/dir-2 4>&-; echo "pipestatus1=$?;" >&4; }); [ $pipestatus0 -ne0 ] && exit $pipestatus0; [ $pipestatus1 -ne 0 ] && exit $pipestatus1; exit 0
1467  -  I     0:00.00 sh -c exec 3>&1; eval $(exec 4>&1 >&3 3>&-; { /usr/local/bin/ssh -i /tmp/tmpzca3ano_ -o UserKnownHostsFile=/tmp/tmpaclvz49q -o StrictHostKeyChecking=yes -o BatchMode=yes -o ConnectTimeout=10 -p22000 root@10.0.15.10 'sh -c '"'"'(zfs send -V -R -w -i tank/backup-1/dir-2@auto-2023-03-11_09-00 -L -c tank/backup-1/dir-2@auto-2023-03-11_10-00 & PID=$!; echo "zettarepl: zfs send PID is $PID" 1>&2; wait $PID)'"'"'' 4>&-; echo "pipestatus0=$?;" >&4; } | { zfs recv -s -F -x mountpoint -x sharenfs -x sharesmb tank/backup-2/dir-2 4>&-; echo "pipestatus1=$?;" >&4; }); [ $pipestatus0 -ne0 ] && exit $pipestatus0; [ $pipestatus1 -ne 0 ] && exit $pipestatus1; exit 0
1468  -  I     0:00.00 sh -c exec 3>&1; eval $(exec 4>&1 >&3 3>&-; { /usr/local/bin/ssh -i /tmp/tmpzca3ano_ -o UserKnownHostsFile=/tmp/tmpaclvz49q -o StrictHostKeyChecking=yes -o BatchMode=yes -o ConnectTimeout=10 -p22000 root@10.0.15.10 'sh -c '"'"'(zfs send -V -R -w -i tank/backup-1/dir-2@auto-2023-03-11_09-00 -L -c tank/backup-1/dir-2@auto-2023-03-11_10-00 & PID=$!; echo "zettarepl: zfs send PID is $PID" 1>&2; wait $PID)'"'"'' 4>&-; echo "pipestatus0=$?;" >&4; } | { zfs recv -s -F -x mountpoint -x sharenfs -x sharesmb tank/backup-2/dir-2 4>&-; echo "pipestatus1=$?;" >&4; }); [ $pipestatus0 -ne0 ] && exit $pipestatus0; [ $pipestatus1 -ne 0 ] && exit $pipestatus1; exit 0
1469  -  I     0:00.00 sh -c exec 3>&1; eval $(exec 4>&1 >&3 3>&-; { /usr/local/bin/ssh -i /tmp/tmpzca3ano_ -o UserKnownHostsFile=/tmp/tmpaclvz49q -o StrictHostKeyChecking=yes -o BatchMode=yes -o ConnectTimeout=10 -p22000 root@10.0.15.10 'sh -c '"'"'(zfs send -V -R -w -i tank/backup-1/dir-2@auto-2023-03-11_09-00 -L -c tank/backup-1/dir-2@auto-2023-03-11_10-00 & PID=$!; echo "zettarepl: zfs send PID is $PID" 1>&2; wait $PID)'"'"'' 4>&-; echo "pipestatus0=$?;" >&4; } | { zfs recv -s -F -x mountpoint -x sharenfs -x sharesmb tank/backup-2/dir-2 4>&-; echo "pipestatus1=$?;" >&4; }); [ $pipestatus0 -ne0 ] && exit $pipestatus0; [ $pipestatus1 -ne 0 ] && exit $pipestatus1; exit 0
1470  -  R     0:57.27 /usr/local/bin/ssh -i /tmp/tmpzca3ano_ -o UserKnownHostsFile=/tmp/tmpaclvz49q -o StrictHostKeyChecking=yes -o BatchMode=yes -o ConnectTimeout=10 -p22000 root@10.0.15.10 sh -c '(zfs send -V -R -w -i tank/backup-1/dir-2@auto-2023-03-11_09-00 -L -c tank/backup-1/dir-2@auto-2023-03-11_10-00 & PID=$!; echo "zettarepl: zfs send PID is $PID" 1>&2; wait $PID)'
1471  -  R     0:18.36 zfs recv -s -F -x mountpoint -x sharenfs -x sharesmb tank/backup-2/dir-2


I then killed processes 1466 to 1471. The Task Manager was, however, still reporting a replication.run process, so I restarted the system as well (using the GUI). After the restart, the Task Manager did not report any tasks running, and neither did ps -A | grep zfs.
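
For completeness, this is roughly how I located the leftover processes before killing them (a sketch):

Code:
# find any replication-related processes still hanging around
pgrep -fl 'zfs (send|recv)|zettarepl'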

So, full of hope, I started the replication task again, hoping it would finish this time (if not successfully, then at least with an error). However, the task replicated /tank/backup-1/dir-1 just fine, but when it got to /tank/backup-1/dir-2, it moved the timestamp one hour forward and has been stuck there ever since. I even repeated this entire process and created a new (identical) replication task, in case it was something related to the ID of the previous replication task, but that did not work either.
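
One thing I am considering, to narrow down whether this is a middleware issue or a plain zfs send/recv issue, is running the same pipe by hand from a shell on BACKUP-2, based on the command zettarepl runs above. A rough sketch (the key path is a placeholder, and the snapshot names would need to be adjusted to the current ones):

Code:
# manually replicate one incremental snapshot of dir-2 from BACKUP-1,
# outside the middleware, to see whether the raw send/recv also stalls
ssh -i /path/to/key -p 22000 root@10.0.15.10 \
    "zfs send -R -w -L -c -i tank/backup-1/dir-2@auto-2023-03-11_09-00 tank/backup-1/dir-2@auto-2023-03-11_10-00" \
  | zfs recv -s -F -x mountpoint -x sharenfs -x sharesmb tank/backup-2/dir-2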

Now, I could probably “solve” this by creating a new dataset on BACKUP-2 and a new replication task, but that would require transferring a lot of data, and it does not seem like a reliable solution, as the same problem might just pop up there as well.

I would greatly appreciate it if anyone has any idea of what is going on, whether there is something I am doing wrong or have completely misunderstood, or just some general thoughts about how I can narrow this problem down.

Thanks!