I updated my 2 FreeNAS boxes to version 11.3-U5 recently. They were running version 11.1 before.
Version 11.3-U5 came with a new replication engine, so I replaced all the legacy replication tasks by reconfiguring them through the wizard with SSH transport mode.
With the new replication tasks, errors appear at random and the affected task changes to the ERROR state; the errors clear themselves after some time. I only see these errors on POOL1 of Box 1, the pool with the 30 datasets / replication tasks.
Setup:
Box 1 has 2 POOLS. POOL 1 has 30 datasets, POOL 2 has 7 datasets.
Box 2 is the opposite: POOL 2 has 7 datasets and POOL 1 has 30 datasets.
Both boxes replicate to each other:
Box 1 POOL1 to Box 2 POOL2
Box 2 POOL1 to Box 1 POOL2
Periodic snapshots are taken every hour, and the replication tasks run on the same hourly schedule.
Replication is sent over a dedicated 10 Gbit NIC. To force FreeNAS to send its replication data over this dedicated NIC, I did the following:
- bind only the 10GbE NIC to the SSH service
- use the 10GbE NIC's IP address as the host in the SSH connection used by the replication tasks
The replication tasks are configured with:
- Direction = push
- Transport = SSH
- Run Automatically = enabled
- Destination Dataset Read-only Policy = ignore
- Snapshot Retention Policy = same as source
- Stream Compression = disabled
- Allow Blocks Larger than 128KB = enabled
- Allow Compressed WRITE Records = enabled
- Number of retries for failed replications = 5
Code:
[2021/05/13 03:00:11] INFO [Thread-7395] [zettarepl.paramiko.replication_task__task_41] Connected (version 2.0, client OpenSSH_8.0-hpn14v15)
[2021/05/13 03:00:11] INFO [Thread-7395] [zettarepl.paramiko.replication_task__task_41] Authentication (publickey) successful!
[2021/05/13 03:00:12] INFO [replication_task__task_41] [zettarepl.replication.run] For replication task 'task_41': doing push from 'POOL1/dataset1' to 'POOL2/dataset1' of snapshot='auto-20210513.0300-2w' incremental_base='auto-20210513.0200-2w' receive_resume_token=None
[2021/05/13 03:00:13] ERROR [replication_task__task_41] [zettarepl.replication.run] For task 'task_41' unhandled replication error ExecException(141, 'kex_exchange_identification: Connection closed by remote host\n')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 143, in run_replication_tasks
    run_replication_task_part(replication_task, source_dataset, src_context, dst_context, observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 215, in run_replication_task_part
    run_replication_steps(step_templates, observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 399, in run_replication_steps
    replicate_snapshots(step_template, incremental_base, snapshots, observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 454, in replicate_snapshots
    run_replication_step(step_template.instantiate(incremental_base=incremental_base, snapshot=snapshot), observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 517, in run_replication_step
    ReplicationProcessRunner(process, monitor).run()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/process_runner.py", line 33, in run
    raise self.process_exception
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/process_runner.py", line 37, in _wait_process
    self.replication_process.wait()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/ssh.py", line 136, in wait
    self.async_exec.wait()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/async_exec_tee.py", line 100, in wait
    raise ExecException(exit_event.returncode, self.output)
zettarepl.transport.interface.ExecException: kex_exchange_identification: Connection closed by remote host
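As far as I understand, kex_exchange_identification errors mean the server closed the connection before the SSH key exchange completed, which can happen when many connections arrive at once. One thing I can do is watch how many simultaneous SSH connections a replication run opens, using FreeBSD's sockstat (the IPs below are placeholders; the last two lines just demonstrate the counting on a captured sample line):

```shell
# On each box, while replication is running:
#
#   sockstat -4l | grep sshd          # which address sshd listens on
#                                     # (should be only the 10GbE IP)
#   sockstat -4c | grep ':22' | wc -l # concurrent SSH connections
#
# The counting itself, shown on one captured sockstat line:
sample="root sshd 1234 4 tcp4 192.168.10.1:22 192.168.10.2:50111"
echo "$sample" | grep -c ':22'   # prints 1
```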
I have increased the MaxStartups setting to 100 in /etc/ssh/sshd_config on both machines, but that didn't help.
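For reference, this is the line I set on both boxes. OpenSSH also accepts a start:rate:full form that drops connections probabilistically instead of all at once; I have not tried that variant, the values below are only an illustration:

```
# /etc/ssh/sshd_config (both boxes)
MaxStartups 100

# Untried alternative: start dropping 30% of new unauthenticated
# connections at 100, and drop all of them at 200:
#MaxStartups 100:30:200
```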
I'm wondering if 30 datasets are too many? This wasn't a problem with the legacy replication.
The data on both boxes is mostly office data.
Any advice?
Kind regards.