Replication Tasks error: kex_exchange_identification: Connection closed by remote host.

nokia88

Cadet · Joined Jul 18, 2018 · Messages: 4
I recently updated my two FreeNAS boxes from version 11.1 to version 11.3-U5.
Version 11.3-U5 comes with a new replication engine, so I replaced all the legacy replication tasks by recreating them with the wizard, using SSH transport mode.

With the new replication tasks, errors appear at random and the affected replication task changes to state ERROR. The errors clear themselves after some time. I only see these errors on POOL1 in Box 1, the pool with the 30 datasets / replication tasks.

Setup:
Box 1 has 2 pools. POOL1 has 30 datasets, POOL2 has 7 datasets.
Box 2 is the opposite: POOL1 has 7 datasets and POOL2 has 30 datasets.

Both boxes replicate to each other:

Box 1 POOL1 to Box 2 POOL2
Box 2 POOL1 to Box 1 POOL2

Periodic snapshots are taken every hour, and the replication tasks run on the same hourly schedule.

Replication is sent over a dedicated 10 Gbit NIC. To force FreeNAS to send its replication data over this dedicated NIC, I did the following:
  • bind the SSH service only to the 10GbE NIC
  • use the 10GbE NIC's IP address as the host in the SSH connection used by the replication tasks
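
A quick way to check whether sshd answers on the dedicated address is to read the identification line the server sends right after the TCP connection opens. The kex_exchange_identification error means the connection was closed before that exchange completed. The following is only a minimal sketch: the function name read_ssh_banner and the address 192.0.2.10 are hypothetical placeholders, not anything from FreeNAS.

```python
import socket


def read_ssh_banner(host, port=22, timeout=5.0):
    """Return the first line the server sends, or None if the peer
    closed the connection before a complete line arrived."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            data = b""
            # Read until the first newline, with a small cap on the size.
            while b"\n" not in data and len(data) < 1024:
                chunk = sock.recv(256)
                if not chunk:  # peer closed the connection early
                    return None
                data += chunk
    except ConnectionResetError:
        return None
    first_line = data.split(b"\n", 1)[0]
    return first_line.decode("ascii", errors="replace").strip()


# Example call (placeholder address, replace with the 10GbE IP of the other box):
# print(read_ssh_banner("192.0.2.10"))
```

Running this in a loop around the top of the hour, when the replication tasks start, would show whether the server sometimes closes connections before sending its banner.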
The replication task settings I'm using are:
  • direction = push
  • transport = ssh
  • Run automatically
  • Destination dataset read only policy = ignore
  • snapshot retention policy = same as source
  • Stream compression = disabled
  • Allow blocks larger than 128KB
  • Allow compressed write records
  • Number of retries for failed replications = 5
Details of the error log: (example for dataset1)

Code:
[2021/05/13 03:00:11] INFO     [Thread-7395] [zettarepl.paramiko.replication_task__task_41] Connected (version 2.0, client OpenSSH_8.0-hpn14v15)
[2021/05/13 03:00:11] INFO     [Thread-7395] [zettarepl.paramiko.replication_task__task_41] Authentication (publickey) successful!
[2021/05/13 03:00:12] INFO     [replication_task__task_41] [zettarepl.replication.run] For replication task 'task_41': doing push from 'POOL1/dataset1' to 'POOL2/dataset1' of snapshot='auto-20210513.0300-2w' incremental_base='auto-20210513.0200-2w' receive_resume_token=None
[2021/05/13 03:00:13] ERROR    [replication_task__task_41] [zettarepl.replication.run] For task 'task_41' unhandled replication error ExecException(141, 'kex_exchange_identification: Connection closed by remote host\n')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 143, in run_replication_tasks
    run_replication_task_part(replication_task, source_dataset, src_context, dst_context, observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 215, in run_replication_task_part
    run_replication_steps(step_templates, observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 399, in run_replication_steps
    replicate_snapshots(step_template, incremental_base, snapshots, observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 454, in replicate_snapshots
    run_replication_step(step_template.instantiate(incremental_base=incremental_base, snapshot=snapshot), observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 517, in run_replication_step
    ReplicationProcessRunner(process, monitor).run()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/process_runner.py", line 33, in run
    raise self.process_exception
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/process_runner.py", line 37, in _wait_process
    self.replication_process.wait()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/ssh.py", line 136, in wait
    self.async_exec.wait()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/async_exec_tee.py", line 100, in wait
    raise ExecException(exit_event.returncode, self.output)
zettarepl.transport.interface.ExecException: kex_exchange_identification: Connection closed by remote host
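
To see whether the failures cluster around particular tasks or hours, the log can be tallied with a short script. This is a sketch that assumes the line format shown in the excerpt above. The log path /var/log/zettarepl.log in the usage comment is an assumption, and count_kex_errors is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches ERROR lines like the one in the excerpt and captures
# the date, the hour, and the task name.
LINE_RE = re.compile(
    r"^\[(?P<date>\d{4}/\d{2}/\d{2}) (?P<hour>\d{2}):\d{2}:\d{2}\]\s+ERROR\s+"
    r".*?For task '(?P<task>[^']+)' unhandled replication error "
    r".*kex_exchange_identification"
)


def count_kex_errors(lines):
    """Count kex_exchange_identification errors per (task, date, hour)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match["task"], match["date"], match["hour"])] += 1
    return counts


# Example usage (the path is an assumption, adjust as needed):
# with open("/var/log/zettarepl.log", errors="replace") as fh:
#     for key, n in sorted(count_kex_errors(fh).items()):
#         print(key, n)
```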


I changed the MaxStartups setting to 100 in /etc/ssh/sshd_config on both machines, but that didn't help.
I'm wondering whether 30 datasets are too many. This wasn't a problem with the legacy replication.
The data on both boxes is mostly office data.
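
For reference, sshd_config(5) describes MaxStartups in the form start:rate:full: once there are "start" unauthenticated connections, new ones are refused with a probability of rate percent, rising linearly to 100% at "full". The sketch below follows that description to estimate the refusal probability. It is an illustration, not sshd's own code, and the function name drop_probability is hypothetical. The defaults 10:30:100 are the OpenSSH defaults, and a single value such as 100 is modelled as start = full = 100.

```python
def drop_probability(unauthenticated, start=10, rate=30, full=100):
    """Estimated chance (0.0 to 1.0) that sshd refuses a new connection,
    given the current number of unauthenticated connections."""
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full or rate == 100:
        return 1.0
    # Linear increase from rate% at `start` to 100% at `full`.
    percent = rate + (100 - rate) * (unauthenticated - start) / (full - start)
    return percent / 100


# With the defaults, 30 connections still in the handshake phase:
print(drop_probability(30))
# With a single value "MaxStartups 100" (start = full = 100):
print(drop_probability(30, start=100, rate=100, full=100))
```

If many tasks open their SSH connections at the same moment, this gives a rough idea of how likely a refusal would be under a given MaxStartups value.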

Any advice?

Kind regards.
 