ZFS Replication Fails

Status
Not open for further replies.

JimKusz

Dabbler
Joined
Sep 10, 2018
Messages
19
Hello:

I have two identical FreeNAS systems, about 100TB in capacity, that have been in production for several years now. The master has one main export, which is exported via SMB to about 30 windows clients. Permissions were set as "Windows" at creation time, but were also set as guest allowed / guest-only initially. The backup system was a zfs replication destination, and the data and replicas were kept up to date and all was good for several years. While not relevant to this situation, the backup system is located in a 2nd building connected via point to point microwave network connection, and there are employees in the 2nd building (but all their computers still connect back to the master in building A -- there are no users of the 2nd system).

Recently management decided to migrate to an individual users model on all workstations, so master was set up with domain controller rights (there are NO windows servers), and users were added and permissions on the SMB export were changed via windows clients with administrator account. That is all working.

However, shortly after we set up domain controller and permissions, replication to the slave started failing. Logs are not giving anything useful. The replication goes to 100%, sits there for about an hour or two, then fails. For a period afterward, replication attempts fail with dataset is busy error, then is suddenly able to replicate again, and the cycle repeats.

I'm now a full week behind on replication, the same snapshot has been attempted this entire time. I've asked on IRC, but nobody has any suggestions on how to proceed. Here is a snapshot of the debug log catching a full cycle and the beginnings of a second one (I've omitted a chunk in the middle of identical repeating, as indicated by ... ):

Code:
Sep 10 08:47:22 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:47:22 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 08:47:22 storage /autorepl.py: [tools.autorepl:615] Relaunching autorepl
Sep 10 08:47:27 storage /autorepl.py: [tools.autorepl:170] Skipping check if autorepl is running.
Sep 10 08:47:27 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 08:47:27 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-72606
Sep 10 08:47:27 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 08:47:28 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 08:47:30 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
Sep 10 08:47:31 storage /autorepl.py: [tools.autorepl:150] Replication result: cannot receive incremental stream: dataset is busy
Failed to write to stdout: Broken pipe
Sep 10 08:47:31 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:47:31 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 08:47:31 storage /autorepl.py: [tools.autorepl:625] Autosnap replication finished
Sep 10 08:47:41 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:47:44 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:48:15 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 08:48:15 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-85621
Sep 10 08:48:15 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 08:48:16 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 08:48:16 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
Sep 10 08:48:17 storage /autorepl.py: [tools.autorepl:150] Replication result: cannot receive incremental stream: dataset is busy
Failed to write to stdout: Broken pipe
Sep 10 08:48:17 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:48:17 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 08:48:17 storage /autorepl.py: [tools.autorepl:625] Autosnap replication finished
Sep 10 08:48:45 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:49:09 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 08:49:09 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-85773
Sep 10 08:49:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 08:49:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 08:49:10 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
Sep 10 08:49:10 storage /autorepl.py: [tools.autorepl:150] Replication result: cannot receive incremental stream: dataset is busy
Failed to write to stdout: Broken pipe
Sep 10 08:49:10 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:49:10 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 08:49:10 storage /autorepl.py: [tools.autorepl:625] Autosnap replication finished
Sep 10 08:49:45 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:50:09 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 08:50:09 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-85947
Sep 10 08:50:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 08:50:10 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 08:50:10 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
Sep 10 08:50:11 storage /autorepl.py: [tools.autorepl:150] Replication result: cannot receive incremental stream: dataset is busy
Failed to write to stdout: Broken pipe
Sep 10 08:50:11 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:50:11 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 08:50:11 storage /autorepl.py: [tools.autorepl:625] Autosnap replication finished
Sep 10 08:50:45 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:51:09 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 08:51:09 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-86109
Sep 10 08:51:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 08:51:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 08:51:10 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
Sep 10 08:51:10 storage /autorepl.py: [tools.autorepl:150] Replication result: cannot receive incremental stream: dataset is busy
Failed to write to stdout: Broken pipe
Sep 10 08:51:10 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:51:10 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 08:51:10 storage /autorepl.py: [tools.autorepl:625] Autosnap replication finished
Sep 10 08:51:46 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:52:09 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 08:52:09 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-86269
Sep 10 08:52:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 08:52:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 08:52:10 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
.......
Sep 10 10:16:10 storage /autorepl.py: [tools.autorepl:150] Replication result: cannot receive incremental stream: dataset is busy
Failed to write to stdout: Broken pipe
Sep 10 10:16:10 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 10:16:10 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 10:16:10 storage /autorepl.py: [tools.autorepl:625] Autosnap replication finished
Sep 10 10:16:24 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 10:17:09 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 10:17:09 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-1851
Sep 10 10:17:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 10:17:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 10:17:09 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-201809
03.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatch
er $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=ye
s -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
Sep 10 10:17:25 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 10:18:04 storage /autosnap.py: [tools.autosnap:614] Autorepl running, skip destroying snapshots
Sep 10 10:18:09 storage /autorepl.py: [tools.autorepl:172] Checking if process 1851 is still alive
Sep 10 10:18:09 storage /autorepl.py: [tools.autorepl:176] Process 1851 still working, quitting
Sep 10 10:18:25 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 10:19:04 storage /autosnap.py: [tools.autosnap:614] Autorepl running, skip destroying snapshots
Sep 10 10:19:09 storage /autorepl.py: [tools.autorepl:172] Checking if process 1851 is still alive
Sep 10 10:19:09 storage /autorepl.py: [tools.autorepl:176] Process 1851 still working, quitting
Sep 10 10:19:25 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 10:20:04 storage /autosnap.py: [tools.autosnap:614] Autorepl running, skip destroying snapshots
Sep 10 10:20:09 storage /autorepl.py: [tools.autorepl:172] Checking if process 1851 is still alive
Sep 10 10:20:09 storage /autorepl.py: [tools.autorepl:176] Process 1851 still working, quitting
Sep 10 10:20:26 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''


How do I fix this?

thanks!
--Jim
 

JimKusz

Dabbler
Joined
Sep 10, 2018
Messages
19
So I just went over to the backup system and noticed all the drive lights are blinking like they're quite active, but there are no processes or shares or anything else that should cause that other than replication or scrubs (and it doesn't appear that a scrub is running). So perhaps there's something in this snapshot that takes a long time to apply? Perhaps its timing out before it finishes?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Can you give us more details? Dataset structure on both ends and what is being replicated to where would be useful to know.
 
Status
Not open for further replies.
Top