Hello:
I have two identical FreeNAS systems, about 100 TB in capacity, that have been in production for several years. The master has one main dataset, exported via SMB to about 30 Windows clients. Permissions were set to "Windows" at creation time, but the share was also initially set to guest-allowed / guest-only. The backup system is a ZFS replication destination, and the data and replicas were kept up to date and all was good for several years. While probably not relevant to this situation, the backup system is located in a second building connected via a point-to-point microwave network link, and there are employees in that building (but all their computers still connect back to the master in building A -- there are no users of the second system).
Recently, management decided to migrate to an individual-user model on all workstations, so the master was set up as a domain controller (there are NO Windows servers), users were added, and permissions on the SMB export were changed from Windows clients using the administrator account. That is all working.
However, shortly after we set up the domain controller and permissions, replication to the slave started failing. The logs are not giving anything useful. Replication goes to 100%, sits there for an hour or two, then fails. For a period afterward, replication attempts fail with a "dataset is busy" error; then it is suddenly able to replicate again, and the cycle repeats.
I'm now a full week behind on replication; the same snapshot has been attempted this entire time. I've asked on IRC, but nobody has any suggestions on how to proceed. Here is an excerpt of the debug log capturing a full cycle and the beginning of a second one (I've omitted a chunk of identical repetition in the middle, as indicated by ...):
Code:
Sep 10 08:47:22 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:47:22 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 08:47:22 storage /autorepl.py: [tools.autorepl:615] Relaunching autorepl
Sep 10 08:47:27 storage /autorepl.py: [tools.autorepl:170] Skipping check if autorepl is running.
Sep 10 08:47:27 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 08:47:27 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-72606
Sep 10 08:47:27 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 08:47:28 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 08:47:30 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
Sep 10 08:47:31 storage /autorepl.py: [tools.autorepl:150] Replication result: cannot receive incremental stream: dataset is busy Failed to write to stdout: Broken pipe
Sep 10 08:47:31 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:47:31 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 08:47:31 storage /autorepl.py: [tools.autorepl:625] Autosnap replication finished
Sep 10 08:47:41 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:47:44 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:48:15 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 08:48:15 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-85621
Sep 10 08:48:15 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 08:48:16 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 08:48:16 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
Sep 10 08:48:17 storage /autorepl.py: [tools.autorepl:150] Replication result: cannot receive incremental stream: dataset is busy Failed to write to stdout: Broken pipe
Sep 10 08:48:17 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:48:17 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 08:48:17 storage /autorepl.py: [tools.autorepl:625] Autosnap replication finished
Sep 10 08:48:45 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:49:09 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 08:49:09 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-85773
Sep 10 08:49:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 08:49:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 08:49:10 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
Sep 10 08:49:10 storage /autorepl.py: [tools.autorepl:150] Replication result: cannot receive incremental stream: dataset is busy Failed to write to stdout: Broken pipe
Sep 10 08:49:10 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:49:10 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 08:49:10 storage /autorepl.py: [tools.autorepl:625] Autosnap replication finished
Sep 10 08:49:45 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:50:09 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 08:50:09 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-85947
Sep 10 08:50:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 08:50:10 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 08:50:10 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
Sep 10 08:50:11 storage /autorepl.py: [tools.autorepl:150] Replication result: cannot receive incremental stream: dataset is busy Failed to write to stdout: Broken pipe
Sep 10 08:50:11 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:50:11 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 08:50:11 storage /autorepl.py: [tools.autorepl:625] Autosnap replication finished
Sep 10 08:50:45 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:51:09 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 08:51:09 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-86109
Sep 10 08:51:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 08:51:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 08:51:10 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
Sep 10 08:51:10 storage /autorepl.py: [tools.autorepl:150] Replication result: cannot receive incremental stream: dataset is busy Failed to write to stdout: Broken pipe
Sep 10 08:51:10 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:51:10 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 08:51:10 storage /autorepl.py: [tools.autorepl:625] Autosnap replication finished
Sep 10 08:51:46 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 08:52:09 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 08:52:09 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-86269
Sep 10 08:52:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 08:52:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 08:52:10 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
.......
Sep 10 10:16:10 storage /autorepl.py: [tools.autorepl:150] Replication result: cannot receive incremental stream: dataset is busy Failed to write to stdout: Broken pipe
Sep 10 10:16:10 storage /autorepl.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 10:16:10 storage /autorepl.py: [tools.autorepl:246] cys/jails/resilio_1 -> 192.168.1.12:cys-bkup replication not enabled
Sep 10 10:16:10 storage /autorepl.py: [tools.autorepl:625] Autosnap replication finished
Sep 10 10:16:24 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 10:17:09 storage /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 10 10:17:09 storage /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-1851
Sep 10 10:17:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/HumanResources
Sep 10 10:17:09 storage /autorepl.py: [tools.autorepl:291] Checking dataset cys/cys-export
Sep 10 10:17:09 storage /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i cys/cys-export@auto-20180903.2000-2w cys/cys-export@auto-20180904.0800-2w | /usr/local/bin/lz4c | /usr/local/bin/throttle -K 70000 | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -ononeenabled=yes -ononeswitch=yes -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.1.12 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'cys-bkup' && echo Succeeded"
Sep 10 10:17:25 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 10:18:04 storage /autosnap.py: [tools.autosnap:614] Autorepl running, skip destroying snapshots
Sep 10 10:18:09 storage /autorepl.py: [tools.autorepl:172] Checking if process 1851 is still alive
Sep 10 10:18:09 storage /autorepl.py: [tools.autorepl:176] Process 1851 still working, quitting
Sep 10 10:18:25 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 10:19:04 storage /autosnap.py: [tools.autosnap:614] Autorepl running, skip destroying snapshots
Sep 10 10:19:09 storage /autorepl.py: [tools.autorepl:172] Checking if process 1851 is still alive
Sep 10 10:19:09 storage /autorepl.py: [tools.autorepl:176] Process 1851 still working, quitting
Sep 10 10:19:25 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
Sep 10 10:20:04 storage /autosnap.py: [tools.autosnap:614] Autorepl running, skip destroying snapshots
Sep 10 10:20:09 storage /autorepl.py: [tools.autorepl:172] Checking if process 1851 is still alive
Sep 10 10:20:09 storage /autorepl.py: [tools.autorepl:176] Process 1851 still working, quitting
Sep 10 10:20:26 storage /alert.py: [ws4py:360] Closing message received (1000) 'b'''
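In case it helps, here is what I'm planning to run on the backup box to find out what is keeping the destination busy. This is just a guess at likely culprits, not something I've confirmed; the dataset names are taken from the log above, so adjust for your own layout:

```shell
#!/bin/sh
# Diagnostics for a "dataset is busy" zfs receive, to be run on the
# backup system (192.168.1.12). Guarded so it exits cleanly on a host
# where the ZFS tools aren't installed.
if ! command -v zfs >/dev/null 2>&1; then
    echo "zfs not found; run this on the FreeNAS backup system"
    exit 0
fi

# Is a receive from an earlier attempt still running and holding the dataset?
ps auxww | grep '[z]fs receive'

# Is the replica mounted (or shared) on the receiver? A dataset that is
# mounted and in use can refuse an incoming incremental stream.
zfs get -r mounted,mountpoint cys-bkup/cys-export

# Any user holds pinning snapshots on the destination?
zfs list -H -t snapshot -o name -r cys-bkup/cys-export | while read -r snap; do
    zfs holds "$snap"
done
```

If a stale receive process shows up, killing it (or rebooting the backup system) should release the dataset and let the next replication attempt through.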
How do I fix this?
thanks!
--Jim