"replication has stuck" after upgrading pool to latest version. Job aborts exactly after 1 hour.

Morpheus187

Hello

I have a setup of two backup servers that synchronize a dataset using the normal "Replication Tasks".

Server A sends snapshots to server B. This worked flawlessly until I chose to upgrade the ZFS pools on both machines to the latest version. The system kept asking me to do that, so I did it yesterday. Both systems are running "TrueNAS-12.0-U8.1".

(By upgrading, I mean upgrading the pool feature level that is offered after a major release update.)

The next day I got a mail "Replication "TANK/BACKUP_Application -> root@[IP_of_server_B]:TANK/BACKUP_Application" failed: Replication has stuck.."

I tried to restart the replication via the GUI, but the same problem appeared again. I then deleted the latest snapshot on the receiving side; the replication task successfully synced that one, but failed again on the latest snapshot.

I also noticed that syncing snapshots that were taken with the older ZFS pool version runs at line speed (1 Gbit/s), but the latest snapshot, which always fails, only uses about 2-4 Mbit/s and is very slow. Watching the progress in the GUI confirms this, and it aborts at about 20-25%.

The log file shows the following:

Code:
[2022/04/29 08:11:58] INFO     [Thread-97444] [zettarepl.paramiko.replication_task__task_6] Connected (version 2.0, client OpenSSH_8.4-hpn14v15)
[2022/04/29 08:11:58] INFO     [Thread-97444] [zettarepl.paramiko.replication_task__task_6] Authentication (publickey) successful!
[2022/04/29 08:11:58] INFO     [replication_task__task_6] [zettarepl.replication.run] Resuming replication for destination dataset 'TANK/BACKUP_Application'
[2022/04/29 08:11:58] INFO     [replication_task__task_6] [zettarepl.replication.run] For replication task 'task_6': doing push from 'TANK/BACKUP_Application' to 'TANK/BACKUP_Application' of snapshot=None incremental_base=None receive_resume_token='1-f62a7f883-100-789c636064000310a501c49c50360710a715e5e7a69766a63040819bfb43b9c41529960a40363b92bafca4acd4e41206861b052c607518f26969c5a9250c7000926743924faa2c492d06d21f54f6f262d35f920f71c572d315abb84f31f17b20c97382e5f31273531918421cfdbcf59d1c9dbd4303e20332f253f3322b1c124b4bf2758d0c8c8c0c4c8c2cf50c8040d73405e1140600602b2613' encryption=False
[2022/04/29 09:11:59] WARNING  [replication_task__task_6.monitor] [zettarepl.replication.process_runner] Stopping stuck replication process
[2022/04/29 09:11:59] WARNING  [replication_task__task_6] [zettarepl.replication.run] For task 'task_6' at attempt 1 recoverable replication error StuckReplicationError('Replication has stuck')
[2022/04/29 09:11:59] ERROR    [replication_task__task_6] [zettarepl.replication.run] Failed replication task 'task_6' after 1 retries


Looking at the log file, it's noteworthy that the process is stopped exactly 1 hour after it starts.
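
To double-check the timing, here is a minimal sketch that computes the gap between the "Resuming replication" line and the "Stopping stuck replication process" line. The helper name `elapsed_seconds` is just something made up for this post, and it assumes every log line starts with a bracketed timestamp in the format shown above.

```python
from datetime import datetime

# Timestamp layout at the start of each log line, e.g. [2022/04/29 08:11:58]
LOG_TIME_FORMAT = "%Y/%m/%d %H:%M:%S"


def elapsed_seconds(start_line: str, end_line: str) -> int:
    """Return the seconds between the leading bracketed timestamps of two log lines."""

    def parse(line: str) -> datetime:
        # Take the text between the first "[" and the first "]".
        stamp = line[line.index("[") + 1 : line.index("]")]
        return datetime.strptime(stamp, LOG_TIME_FORMAT)

    return int((parse(end_line) - parse(start_line)).total_seconds())


start = "[2022/04/29 08:11:58] INFO     [replication_task__task_6] [zettarepl.replication.run] Resuming replication for destination dataset 'TANK/BACKUP_Application'"
stop = "[2022/04/29 09:11:59] WARNING  [replication_task__task_6.monitor] [zettarepl.replication.process_runner] Stopping stuck replication process"

print(elapsed_seconds(start, stop))  # 3601
```

So the monitor stops the process one hour and one second after the replication resumes.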


TLDR:

Snapshot that was created with the OLD version of the ZFS pool -> synced successfully at 1 Gbit/s
Snapshot that was created with the NEW version of the ZFS pool -> synced at 4 Mbit/s, aborts after exactly 1 hour (3601 seconds) with the "Replication has stuck" error
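
For a sense of scale, here is a rough sketch of how much data each rate can move in one hour. It assumes the rate stays constant, that the figures are decimal megabits per second (1 Mbit = 1,000,000 bits), and that the numbers I see in the GUI reflect the real throughput; the function name `bytes_transferred` is only for illustration.

```python
def bytes_transferred(rate_mbit_per_s: float, seconds: int) -> float:
    """Bytes moved at a constant rate, using decimal megabits (assumption)."""
    return rate_mbit_per_s * 1_000_000 * seconds / 8


HOUR = 3600  # seconds

# 2 and 4 Mbit/s are the slow rates observed; 1000 Mbit/s is the 1 Gbit line speed.
for rate in (2, 4, 1000):
    gigabytes = bytes_transferred(rate, HOUR) / 1e9
    print(f"{rate} Mbit/s for one hour = {gigabytes:.1f} GB")
```

Under those assumptions the failing snapshot moves only about 0.9 to 1.8 GB before it is stopped, whereas the same hour at line speed would move about 450 GB.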


Is there anything I can do or try to fix the issue? The next step would probably be to delete all snapshots and try it again, or even delete the remote dataset and sync everything from scratch. I would like to avoid this because it's around 30 TB of data.
 