Long-standing replication failing with "Replication has stuck.." error after 4 hours

VulcanRidr · Nov 13, 2023

I am having a problem with replication on a dataset. This replication was set up at least three years ago, but three weeks ago, a combination of several snapshots that were not deleted (and were holding a bunch of data on the dataset), and several new largish projects that were added to the dataset, gave me a perfect storm of additional data, which filled up said dataset. I was able to get the researchers and project leads to migrate data elsewhere, as well as finding, identifying, and deleting these two-year-old snapshots that should have been deleted long ago (retention is 8 weeks), and now the dataset is sitting at 71% utilization. However ever since the dataset filled up, I have not been able to successfully replicate snapshots to the backup NAS box. I have the following error showing up in the alerts gui:

Code:

* Replication "fastpool/Torque_fast - bigpool/snapshots/hazor/Torque_Fast" failed: Replication has stuck..

I looked in /var/log/zettarepl.log, and this is what I saw from this morning's attempt to replicate the dataset in question:

Code:

[2023/11/13 00:00:01] INFO     [Thread-37] [zettarepl.paramiko.replication_task__task_11] Connected (version 2.0, client OpenSSH_8.8-hpn14v15)
[2023/11/13 00:00:01] INFO     [Thread-37] [zettarepl.paramiko.replication_task__task_11] Authentication (publickey) successful!
[2023/11/13 00:00:02] INFO     [replication_task__task_11] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: [('bigpool/snapshots/hazor/Torque_Fast', 'auto-2023-09-18_00-00')]
[2023/11/13 00:00:02] INFO     [replication_task__task_11] [zettarepl.snapshot.destroy] On <Shell(<SSH Transport(root@meron.edge.noblis.org)>)> for dataset 'bigpool/snapshots/hazor/Torque_Fast' destroying snapshots {'auto-2023-09-18_00-00'}
[2023/11/13 00:00:04] INFO     [replication_task__task_11] [zettarepl.replication.run] Resuming replication for destination dataset 'bigpool/snapshots/hazor/Torque_Fast'
[2023/11/13 00:00:04] INFO     [replication_task__task_11] [zettarepl.replication.run] For replication task 'task_11': doing push from 'fastpool/Torque_fast' to 'bigpool/snapshots/hazor/Torque_Fast' of snapshot=None incremental_base=None receive_resume_token='1-14ed5f74ff-120-789c636064000310a501c49c50360710a715e5e7a69766a63040c1a11ff97779371e8b5600b2d991d4e52765a5269730307cd3971202a9c3904f4b2b4e05ca3334548ac1e4d990e4932a4b528b81f40d9f3b3f59b1e82fc987b8e2dd1cd3d3130d74a33c90e439c1f27989b9a940f726169714e4e7e7e887e4171596a6c683f80e89a525f9ba460646c6ba8606ba8696f10606ba060610ff80ece16640f83f393fb7a028b5b8383f9b010e0041e730f0' encryption=False
[2023/11/13 01:00:06] WARNING  [replication_task__task_11.monitor] [zettarepl.replication.process_runner] Stopping stuck replication process
[2023/11/13 01:00:06] WARNING  [replication_task__task_11] [zettarepl.replication.run] For task 'task_11' at attempt 1 recoverable replication error StuckReplicationError('Replication has stuck')
[2023/11/13 01:00:06] INFO     [replication_task__task_11] [zettarepl.replication.run] After recoverable error sleeping for 1 seconds
[2023/11/13 01:00:07] INFO     [replication_task__task_11] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2023/11/13 01:00:07] INFO     [replication_task__task_11] [zettarepl.replication.run] Resuming replication for destination dataset 'bigpool/snapshots/hazor/Torque_Fast'
[2023/11/13 01:00:07] INFO     [replication_task__task_11] [zettarepl.replication.run] For replication task 'task_11': doing push from 'fastpool/Torque_fast' to 'bigpool/snapshots/hazor/Torque_Fast' of snapshot=None incremental_base=None receive_resume_token='1-1399cf3220-120-789c636064000310a501c49c50360710a715e5e7a69766a63040c1a11ff97779371e8b5600b2d991d4e52765a5269730307cd3971202a9c3904f4b2b4e05ca3334548ac1e4d990e4932a4b528b81f48149777eb262d15f920f71c5bb39a6a7271ae8467920c97382e5f312735381ee4d2c2e29c8cfcfd10fc92f2a2c4d8d07f11d124b4bf2758d0c8c8c750d0d740d2de30d0c740d0c20fe01d9c3cd80f07f727e6e41516a71717e36031c00005c39311e' encryption=False
[2023/11/13 01:00:07] WARNING  [replication_task__task_11] [zettarepl.replication.run] For task 'task_11' at attempt 2 recoverable replication error RecoverableReplicationError('resume token contents:\nnvlist version: 0\n\tfromguid = 0x5bc6b10ddd6ff8c2\n\tobject = 0x121a2ff6\n\toffset = 0x16798000\n\tbytes = 0x5f9dc92c0\n\ttoguid = 0x5a2d3091cb359cee\n\ttoname = fastpool/Torque_fast@auto-2023-10-19_00-00\n\tcompressok = 1\ncannot receive resume stream: destination bigpool/snapshots/hazor/Torque_Fast contains partially-complete state from "zfs receive -s".\nwarning: cannot send \'fastpool/Torque_fast@auto-2023-10-19_00-00\': signal received')
[2023/11/13 01:00:07] INFO     [replication_task__task_11] [zettarepl.replication.run] After recoverable error sleeping for 2 seconds
[2023/11/13 01:00:10] INFO     [replication_task__task_11] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2023/11/13 01:00:10] INFO     [replication_task__task_11] [zettarepl.replication.run] Resuming replication for destination dataset 'bigpool/snapshots/hazor/Torque_Fast'
[2023/11/13 01:00:10] INFO     [replication_task__task_11] [zettarepl.replication.run] For replication task 'task_11': doing push from 'fastpool/Torque_fast' to 'bigpool/snapshots/hazor/Torque_Fast' of snapshot=None incremental_base=None receive_resume_token='1-1399cf3220-120-789c636064000310a501c49c50360710a715e5e7a69766a63040c1a11ff97779371e8b5600b2d991d4e52765a5269730307cd3971202a9c3904f4b2b4e05ca3334548ac1e4d990e4932a4b528b81f48149777eb262d15f920f71c5bb39a6a7271ae8467920c97382e5f312735381ee4d2c2e29c8cfcfd10fc92f2a2c4d8d07f11d124b4bf2758d0c8c8c750d0d740d2de30d0c740d0c20fe01d9c3cd80f07f727e6e41516a71717e36031c00005c39311e' encryption=False
[2023/11/13 02:00:10] WARNING  [replication_task__task_11.monitor] [zettarepl.replication.process_runner] Stopping stuck replication process
[2023/11/13 02:00:10] WARNING  [replication_task__task_11] [zettarepl.replication.run] For task 'task_11' at attempt 3 recoverable replication error StuckReplicationError('Replication has stuck')
[2023/11/13 02:00:10] INFO     [replication_task__task_11] [zettarepl.replication.run] After recoverable error sleeping for 4 seconds
[2023/11/13 02:00:15] INFO     [replication_task__task_11] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2023/11/13 02:00:15] INFO     [replication_task__task_11] [zettarepl.replication.run] Resuming replication for destination dataset 'bigpool/snapshots/hazor/Torque_Fast'
[2023/11/13 02:00:15] INFO     [replication_task__task_11] [zettarepl.replication.run] For replication task 'task_11': doing push from 'fastpool/Torque_fast' to 'bigpool/snapshots/hazor/Torque_Fast' of snapshot=None incremental_base=None receive_resume_token='1-13a3214bf2-120-789c636064000310a501c49c50360710a715e5e7a69766a63040c1a11ff97779371e8b5600b2d991d4e52765a5269730307cd3971202a9c3904f4b2b4e05ca3334548ac1e4d990e4932a4b528b81f48a1b777eb262d15f920f71c5bb39a6a7271ae8467920c97382e5f312735381ee4d2c2e29c8cfcfd10fc92f2a2c4d8d07f11d124b4bf2758d0c8c8c750d0d740d2de30d0c740d0c20fe01d9c3cd80f07f727e6e41516a71717e36031c0000768b314c' encryption=False
[2023/11/13 03:00:16] WARNING  [replication_task__task_11.monitor] [zettarepl.replication.process_runner] Stopping stuck replication process
[2023/11/13 03:00:16] WARNING  [replication_task__task_11] [zettarepl.replication.run] For task 'task_11' at attempt 4 recoverable replication error StuckReplicationError('Replication has stuck')
[2023/11/13 03:00:16] INFO     [replication_task__task_11] [zettarepl.replication.run] After recoverable error sleeping for 8 seconds
[2023/11/13 03:00:25] INFO     [replication_task__task_11] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2023/11/13 03:00:25] INFO     [replication_task__task_11] [zettarepl.replication.run] Resuming replication for destination dataset 'bigpool/snapshots/hazor/Torque_Fast'
[2023/11/13 03:00:25] INFO     [replication_task__task_11] [zettarepl.replication.run] For replication task 'task_11': doing push from 'fastpool/Torque_fast' to 'bigpool/snapshots/hazor/Torque_Fast' of snapshot=None incremental_base=None receive_resume_token='1-146f5f6ed6-120-789c636064000310a501c49c50360710a715e5e7a69766a63040c1a11ff97779371e8b5600b2d991d4e52765a5269730307cd3971202a9c3904f4b2b4e05ca3334548ac1e4d990e4932a4b528b81f404b9bb3f59b1e82fc987b8e2dd1cd3d3130d74a33c90e439c1f27989b9a940f726169714e4e7e7e887e4171596a6c683f80e89a525f9ba460646c6ba8606ba8696f10606ba060610ff80ece16640f83f393fb7a028b5b8383f9b010e00fe60307b' encryption=False
[2023/11/13 04:00:26] WARNING  [replication_task__task_11.monitor] [zettarepl.replication.process_runner] Stopping stuck replication process
[2023/11/13 04:00:26] WARNING  [replication_task__task_11] [zettarepl.replication.run] For task 'task_11' at attempt 5 recoverable replication error StuckReplicationError('Replication has stuck')
[2023/11/13 04:00:26] ERROR    [replication_task__task_11] [zettarepl.replication.run] Failed replication task 'task_11' after 5 retries

So from what I am reading, it should not be trying to resume a receive...Is there a way to clear that flag? I am thinking that it is a remnant from the full dataset. Also note that I have rebooted both ends of the replication task, and both are running Core and are on 13.0u5.3.

Anyone have a suggestion? Or am I barking up the wrong tree?

Thanks.

VulcanRidr · Nov 14, 2023

Not marking this solved, since I merely did a workaround, more doing this for anyone in the future that runs across this post. What I ended up doing was to set up a separate zfs send/receive, then updated the replication task in the gui to point to the new target, and verified it was working. I will keep the old store until it is outside the retention period, then destroy those snapshots.

Important Announcement for the TrueNAS Community.

Long-standing replication failing with "Replication has stuck.." error after 4 hours

VulcanRidr

Explorer

VulcanRidr

Explorer

Similar threads

Important Announcement for the TrueNAS Community.

Long-standing replication failing with "Replication has stuck.." error after 4 hours

VulcanRidr

Explorer

VulcanRidr

Explorer

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Long-standing replication failing with "Replication has stuck.." error after 4 hours"

Similar threads