Replication task stuck after a few minutes - how to diagnose?

jmruc

Cadet
Joined
Dec 26, 2022
Messages
4
I have a PULL replication task in TrueNAS-13.0-U3.1 which runs for a few minutes before getting totally stuck and not making any progress. It still hasn't gotten past the initial import of a 60 GB dataset. The network graph and the used storage show that it transfers data for anywhere from several seconds to several minutes before stalling. If I leave it for a while I get a "Stopping stuck replication process" and it is marked as failed. How do I diagnose what the problem is?

The full log is:
Code:
[2022/12/27 12:38:48] INFO     [Thread-5] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.4p1)
[2022/12/27 12:38:48] INFO     [Thread-5] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) successful!
[2022/12/27 12:38:53] INFO     [replication_task__task_1] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2022/12/27 12:38:53] INFO     [replication_task__task_1] [zettarepl.replication.run] Resuming replication for destination dataset 'tank/backup/releases/photoprism/volumes/pvc-8af53ee9-0479-47af-95fb-151da4fcb107'
[2022/12/27 12:38:53] INFO     [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing pull from 'hoard/ix-applications/releases/photoprism/volumes/pvc-8af53ee9-0479-47af-95fb-151da4fcb107' to 'tank/backup/releases/photoprism/volumes/pvc-8af53ee9-0479-47af-95fb-151da4fcb107' of snapshot=None incremental_base=None receive_resume_token='1-1a5f1d5005-158-789c6d905d4ec3300cc783c4d784c419b8406852124adf38c4dea3247558b6768e92741ac7e08923c04590380b4f1c81b6a031c12c5b7fcb3fdb924d8ec864a35c0d71f6939f0f8166093613226e8fa79e7fdcb9047924973b7ebac7cd638634e8f6f9faf3e4c07cc687de3784cc9e64f5f6f2f1febac767135feb0e0859a08e4de1b75487d07aabb3c7752a22b4a013a4222c3063883e75c506dbbe1b4b1b4befb49337003565a2aaa9a8b4a3b5748672c91b2d9c359c55f77f9652a3edaa0f741e7b9823b6aa6465a9f8e05c71a19854acfcbee382fcfecd621722a4842bb2b32ff2a7492d' encryption=False
[2022/12/27 12:38:53] INFO     [replication_task__task_1] [zettarepl.paramiko.replication_task__task_1.sftp] [chan 73] Opened sftp connection (server version 3)
[2022/12/27 12:38:53] INFO     [replication_task__task_1] [zettarepl.transport.ssh_netcat] Automatically chose connect address '192.168.68.68'
[2022/12/27 13:58:54] WARNING  [replication_task__task_1.monitor] [zettarepl.replication.process_runner] Stopping stuck replication process
[2022/12/27 13:58:54] WARNING  [replication_task__task_1] [zettarepl.replication.run] For task 'task_1' at attempt 1 recoverable replication error StuckReplicationError('Replication has stuck')
[2022/12/27 13:58:54] ERROR    [replication_task__task_1] [zettarepl.replication.run] Failed replication task 'task_1' after 1 retries
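
One way to quantify the stall from a log like the one above is to parse the timestamps and look for the longest silent gap between consecutive lines. The following is only a minimal sketch: it assumes every line starts with a `[YYYY/MM/DD HH:MM:SS]` timestamp followed by a level, as in the excerpt, and the helper names `parse` and `longest_gap` are made up for illustration. The two sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Assumed line layout, based on the excerpt: "[timestamp] LEVEL   rest of line"
LINE_RE = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\]\s+(\w+)\s+(.*)$")

def parse(lines):
    """Return (timestamp, level, message) tuples for lines that match."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S")
            events.append((ts, m.group(2), m.group(3)))
    return events

def longest_gap(events):
    """Return (gap, event_before, event_after) for the longest silence, or None."""
    best = None
    for prev, cur in zip(events, events[1:]):
        gap = cur[0] - prev[0]
        if best is None or gap > best[0]:
            best = (gap, prev, cur)
    return best

# Two consecutive lines copied from the log in this post.
sample = """
[2022/12/27 12:38:53] INFO     [replication_task__task_1] [zettarepl.transport.ssh_netcat] Automatically chose connect address '192.168.68.68'
[2022/12/27 13:58:54] WARNING  [replication_task__task_1.monitor] [zettarepl.replication.process_runner] Stopping stuck replication process
"""

gap, before, after = longest_gap(parse(sample.splitlines()))
print(f"Longest silence: {gap}, ending at a {after[1]} line")
```

For this log the longest silence is the 1 hour 20 minutes between choosing the connect address and the "Stopping stuck replication process" warning, so the log alone says nothing about what happened inside the transfer. That points towards watching the transport (network path, SSH, netcat) rather than zettarepl itself.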


The only way to make any progress is to restart the machine and start the task again, OR wait for a long time until it fails and start it again.

The TrueNAS-13.0-U3.1 machine has 8 GB RAM, but the CPU usage never gets over 3-4%, the services memory never goes above 2 GB, and the ZFS cache never goes above 200 MB.

The remote machine is running TrueNAS-SCALE-22.12.0, and it has an AMD Ryzen 5 PRO 4650G processor with 16 GB ECC memory. It also shows no increase in CPU, memory, cache, disk reads, or network activity.

Here's the replication task config:
1672144481245.png
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
Intel NICs ?
Try it with NETCAT turned off.
 

jmruc

Cadet
Joined
Dec 26, 2022
Messages
4
I'm not really sure about the NICs, but the machines are connected using Tailscale.
Turning NETCAT on or off makes no difference to the problem; the transfer rate is just lower when it's off.
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177

Pare the layers away until you have the simplest configuration.

Posting your full hardware details would get more responses.
 

jmruc

Cadet
Joined
Dec 26, 2022
Messages
4
Thanks for the link! I never had any idea how much it can affect things.

The VM is on a laptop using Intel(R) Wi-Fi 6 AX200 160MHz
The TrueNAS SCALE is using GigaLAN Intel® I225V
I already left some info about the other hardware in the original post.

It might really be the combo of being on WiFi and going over Tailscale. I tried opening an SSH port and going over the internet (I know, living dangerously and all that), and the remainder of the replication failed only once before finally finishing. The only error was, I think, network related:

Code:
[2022/12/27 23:00:50] WARNING  [replication_task__task_1.stdout_copy] [zettarepl.transport.local] [shell:1] [async_exec:5094] Copying stdout from <_io.TextIOWrapper name=20 encoding='utf8'> failed: ValueError('I/O operation on closed file')
[2022/12/27 23:00:50] WARNING  [replication_task__task_1] [zettarepl.replication.run] For task 'task_1' at attempt 1 recoverable replication error RecoverableReplicationError('Timeout in head()')
[2022/12/27 23:00:50] ERROR    [replication_task__task_1] [zettarepl.replication.run] Failed replication task 'task_1' after 1 retries


I'll try the whole thing again tomorrow, but for now I'm closing the SSH port and giving it a rest for a bit. I'll update again if I find anything interesting.
 

jmruc

Cadet
Joined
Dec 26, 2022
Messages
4
I can confirm - it is because of Tailscale. Repeating the whole thing over the internet (no VPN) had no issues. Trying it over Tailscale took less than a minute for the task to get stuck. I'll look into OpenVPN just for that, but I have a workaround already.
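
For anyone wanting to compare paths the same way, a rough sketch of measuring "time until stuck" from periodic byte-counter readings is below. It is not part of TrueNAS or zettarepl: the function name `time_to_stall` is hypothetical, the readings would have to come from whatever counter you can sample (for example the numbers behind the network graph), and the sample values are invented purely for illustration.

```python
def time_to_stall(samples, idle_seconds):
    """Find when a transfer stopped making progress.

    samples: list of (seconds_since_start, cumulative_bytes), sorted by time.
    Returns the time of the last observed progress once the counter has
    stayed flat for at least idle_seconds, or None if it never stalls.
    """
    last_change, last_bytes = samples[0]
    for t, b in samples[1:]:
        if b != last_bytes:
            # Counter moved, so the transfer was still alive at time t.
            last_change, last_bytes = t, b
        elif t - last_change >= idle_seconds:
            return last_change
    return None

# Invented readings, one every 10 seconds (illustration only).
stalling_path = [(0, 0), (10, 50), (20, 120), (30, 120), (40, 120), (50, 120)]
healthy_path = [(0, 0), (10, 50), (20, 120), (30, 200), (40, 260), (50, 330)]

print(time_to_stall(stalling_path, idle_seconds=30))
print(time_to_stall(healthy_path, idle_seconds=30))
```

Running the same replication over each path and comparing the result gives a number to go with the "stuck in under a minute" impression, which makes it easier to tell whether a change (VPN, WiFi vs. wired, netcat on or off) actually helped.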

Thanks again @Alecmascot for pointing me in the right direction!
 