Replication error...but no error in log?

leonardorame

Contributor
Joined
Jun 30, 2018
Messages
106
Hi, I'm trying to run a Replication Task from one server to another via SSH, with everything configured from the 11.3-U1 user interface.

The replication finishes with an error, but it's not clear what the error is. Can you help me determine what it is?

When I click the red error button (see the screenshot below), the error message is this:

screenshot_1.png


Code:
Error

.

Logs

[2020/03/28 16:08:32] DEBUG    [replication_task__task_3] [zettarepl.transport.local.shell.1.async_exec.83119] Running ['zfs', 'get', '-H', '-p', 'type', 'Datos/BD/Postgresql']
[2020/03/28 16:08:32] DEBUG    [replication_task__task_3] [zettarepl.transport.local.shell.1.async_exec.83119] Success: 'Datos/BD/Postgresql\ttype\tvolume\t-\n'
[2020/03/28 16:08:32] DEBUG    [replication_task__task_3] [zettarepl.transport.base_ssh.root@192.168.1.102.shell.967] Connecting...
[2020/03/28 16:08:32] DEBUG    [Thread-1220] [zettarepl.paramiko.replication_task__task_3] starting thread (client mode): 0x21c098d0
[2020/03/28 16:08:32] DEBUG    [Thread-1220] [zettarepl.paramiko.replication_task__task_3] Local version/idstring: SSH-2.0-paramiko_2.7.1
[2020/03/28 16:08:32] DEBUG    [Thread-1220] [zettarepl.paramiko.replication_task__task_3] Remote version/idstring: SSH-2.0-OpenSSH_8.0-hpn14v15
[2020/03/28 16:08:32] INFO     [Thread-1220] [zettarepl.paramiko.replication_task__task_3] Connected (version 2.0, client OpenSSH_8.0-hpn14v15)
[2020/03/28 16:08:32] DEBUG    [Thread-1220] [zettarepl.paramiko.replication_task__task_3] kex algos:['curve25519-sha256', 'curve25519-sha256@libssh.org', 'ecdh-sha2-nistp256', 'ecdh-sha2-nistp384', 'ecdh-sha2-nistp521', 'diffie-hellman-group-exchange-sha256', 'diffie-hellman-group16-sha512', 'diffie-hellman-group18-sha512', 'diffie-hellman-group14-sha256', 'diffie-hellman-group14-sha1'] server key:['rsa-sha2-512', 'rsa-sha2-256', 'ssh-rsa', 'ecdsa-sha2-nistp256', 'ssh-ed25519'] client encrypt:['chacha20-poly1305@openssh.com', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'aes128-gcm@openssh.com', 'aes256-gcm@openssh.com', 'aes128-cbc', 'none'] server encrypt:['chacha20-poly1305@openssh.com', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'aes128-gcm@openssh.com', 'aes256-gcm@openssh.com', 'aes128-cbc', 'none'] client mac:['umac-64-etm@openssh.com', 'umac-128-etm@openssh.com', 'hmac-sha2-256-etm@openssh.com', 'hmac-sha2-512-etm@openssh.com', 'hmac-sha1-etm@openssh.com', 'umac-64@openssh.com', 'umac-128@openssh.com', 'hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1'] server mac:['umac-64-etm@openssh.com', 'umac-128-etm@openssh.com', 'hmac-sha2-256-etm@openssh.com', 'hmac-sha2-512-etm@openssh.com', 'hmac-sha1-etm@openssh.com', 'umac-64@openssh.com', 'umac-128@openssh.com', 'hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1'] client compress:['none'] server compress:['none'] client lang:[''] server lang:[''] kex follows?False
[2020/03/28 16:08:32] DEBUG    [Thread-1220] [zettarepl.paramiko.replication_task__task_3] Kex agreed: ecdh-sha2-nistp256
[2020/03/28 16:08:32] DEBUG    [Thread-1220] [zettarepl.paramiko.replication_task__task_3] HostKey agreed: ssh-ed25519
... 43 more lines ...
[2020/03/28 16:08:32] DEBUG    [replication_task__task_3] [zettarepl.paramiko.replication_task__task_3] [chan 3] Max packet in: 32768 bytes
[2020/03/28 16:08:32] DEBUG    [Thread-1220] [zettarepl.paramiko.replication_task__task_3] [chan 3] Max packet out: 32768 bytes
[2020/03/28 16:08:32] DEBUG    [Thread-1220] [zettarepl.paramiko.replication_task__task_3] Secsh channel 3 opened.
[2020/03/28 16:08:32] DEBUG    [Thread-1220] [zettarepl.paramiko.replication_task__task_3] [chan 3] Sesch channel 3 request ok
[2020/03/28 16:08:32] DEBUG    [replication_task__task_3] [zettarepl.transport.base_ssh.root@192.168.1.102.shell.967.async_exec.83125] Waiting for exit status
[2020/03/28 16:08:32] DEBUG    [Thread-1220] [zettarepl.paramiko.replication_task__task_3] [chan 3] EOF received (3)
[2020/03/28 16:08:32] DEBUG    [Thread-1220] [zettarepl.paramiko.replication_task__task_3] [chan 3] EOF sent (3)
[2020/03/28 16:08:32] DEBUG    [replication_task__task_3] [zettarepl.transport.base_ssh.root@192.168.1.102.shell.967.async_exec.83125] Success: 'datos/Postgresql\treceive_resum....15057717e36031c0000375c255b\t-\n'
[2020/03/28 16:08:32] INFO     [replication_task__task_3] [zettarepl.replication.run] Resuming replication for destination dataset 'datos/Postgresql'
[2020/03/28 16:08:32] INFO     [replication_task__task_3] [zettarepl.replication.run] For replication task 'task_3': doing push from 'Datos/BD/Postgresql' to 'datos/Postgresql' of snapshot=None incremental_base=None receive_resume_token='1-12037eef19-f8-789c6360640003....6e015057717e36031c0000375c255b'
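Since almost everything in that excerpt is DEBUG output from paramiko, it may help to filter the log down to the lines above DEBUG level. Below is a minimal sketch in Python, assuming the line layout shown above ([timestamp] LEVEL [thread] [logger] message). The function name non_debug_lines and the SAMPLE list are only for illustration; the two sample lines are copied from the excerpt.

```python
import re

# Line layout assumed from the excerpt above:
# [timestamp] LEVEL [thread] [logger] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+"
    r"\[(?P<thread>[^\]]+)\]\s+\[(?P<logger>[^\]]+)\]\s+(?P<msg>.*)$"
)


def non_debug_lines(lines):
    """Yield (timestamp, level, logger, message) for parsed lines that are not DEBUG."""
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and m.group("level") != "DEBUG":
            yield m.group("ts"), m.group("level"), m.group("logger"), m.group("msg")


# Two lines copied from the log excerpt, used here as sample input.
SAMPLE = [
    "[2020/03/28 16:08:32] DEBUG    [replication_task__task_3] "
    "[zettarepl.transport.local.shell.1.async_exec.83119] "
    "Running ['zfs', 'get', '-H', '-p', 'type', 'Datos/BD/Postgresql']",
    "[2020/03/28 16:08:32] INFO     [replication_task__task_3] "
    "[zettarepl.replication.run] "
    "Resuming replication for destination dataset 'datos/Postgresql'",
]

for ts, level, logger, msg in non_debug_lines(SAMPLE):
    print(ts, level, logger, msg)
```

To run it against the real log, pass an open file object instead of SAMPLE. The log location is an assumption here (on 11.3 it is probably /var/log/zettarepl.log).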
 

leonardorame

I deleted the dataset from the target server, then manually started the replication (by clicking the RUN NOW button on Replication Tasks), but it rebooted the target server again. This time I was running top to try to see something interesting, but I don't see anything special. Here's the screenshot:

screenshot_3.png


I've also read (https://www.ixsystems.com/community/threads/had-an-unscheduled-system-reboot.82582/) that SMART can be disabled on the SSD boot drive. I did that, but the server is still rebooting while replicating files.
 

leonardorame

This morning I ran gstat, top, and top -m io to watch I/O, CPU, and memory at the moment of the reboot. Here's the screenshot:

1585474327717.png
 

leonardorame

The last test I did was running dd if=/dev/zero of=/mnt/datos/dd.file for more than 10 minutes. It created a ~50 GB file without issues. gstat again showed those red %busy numbers above 100, but the system didn't reboot.

I'm starting to think this is a NIC problem.
 

leonardorame

The NIC is this:

Code:
rgephy0: <RTL8251 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: b4:2e:99:6f:71:98


Since I know Realtek NICs are not recommended, I changed the link from 1000baseT to 100baseTX to see if this helps. Tomorrow I'll also add a PCIe Intel NIC.
 

leonardorame

Well, it only took a couple of minutes for the server to reboot, so the change to 100baseTX didn't help. The last option is changing the NIC.
 

leonardorame

Today the folks at the remote site where the server is running added a PCIe Intel NIC. The replication is now running smoothly; let's see if it finishes.

Code:
%  pciconf -lv | grep -A1 -B3 network
igb0@pci0:3:0:0:        class=0x020000 card=0x10a78086 chip=0x10a78086 rev=0x02 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82575EB Gigabit Network Connection'
    class      = network
    subclass   = ethernet
igb1@pci0:3:0:1:        class=0x020000 card=0x10a78086 chip=0x10a78086 rev=0x02 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82575EB Gigabit Network Connection'
    class      = network
    subclass   = ethernet
re0@pci0:4:0:0: class=0x020000 card=0xe0001458 chip=0x816810ec rev=0x16 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
 

leonardorame

The folks at the remote site replaced the original RAM with a new module, this time with only 4 GB, and it rebooted again when the replication started.
 

leonardorame

One interesting thing I found is that the reboot only happens when I replicate a dataset that doesn't exist on the target server.
 

leonardorame

Today the machine has 12 GB of RAM and I'm sending a 25 GB dataset. I'm monitoring with top and watching free RAM decrease dramatically while the ARC grows.
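For anyone who wants numbers instead of watching top, here is a rough Python sketch that reads the ARC size and free memory through sysctl. It is only a sketch under assumptions: the sysctl names below are the ones I'd expect on a FreeBSD-based system, and the helper names (parse_sysctl, memory_summary, sample) are made up for this example.

```python
import subprocess

# sysctl names assumed for a FreeBSD-based system.
ARC_SIZE = "kstat.zfs.misc.arcstats.size"  # ARC size in bytes
FREE_PAGES = "vm.stats.vm.v_free_count"    # free memory in pages
PAGE_SIZE = "hw.pagesize"                  # bytes per page


def parse_sysctl(text):
    """Turn 'name: value' lines into a dict of integers."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            values[name.strip()] = int(value.strip())
    return values


def memory_summary(values):
    """Return (arc_mib, free_mib) computed from parsed sysctl values."""
    mib = 1024 * 1024
    arc_mib = values[ARC_SIZE] / mib
    free_mib = values[FREE_PAGES] * values[PAGE_SIZE] / mib
    return arc_mib, free_mib


def sample():
    """Run sysctl once and return (arc_mib, free_mib)."""
    out = subprocess.run(
        ["sysctl", ARC_SIZE, FREE_PAGES, PAGE_SIZE],
        capture_output=True, text=True, check=True,
    ).stdout
    return memory_summary(parse_sysctl(out))
```

Calling sample() every few seconds during a replication and appending the result to a file would leave a record of how far free memory dropped, even if the machine reboots.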
 

leonardorame

Well, it hasn't rebooted since we added more RAM. Let's watch it for a couple more hours, but the lack of RAM could be the cause.
 