Replication Interruption

Status
Not open for further replies.

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
Hi Guys,

I have set up FreeNAS 8.3.0-RELEASE-p1 x64 and have upgraded ZFS to the latest version.

I have had replication set up and running for some time. Last night, replication took longer than usual and ran into business hours the following morning (the replication schedule currently allows replication 24 hours a day). To stop it, I amended our firewall to drop packets to the destination FreeNAS at the gateway.
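(For context, on a pf-based gateway a rule of roughly this shape does the job - the address below is a placeholder and our actual firewall syntax may differ.)

# Illustrative only: drop outbound SSH traffic headed to the destination FreeNAS.
# 203.0.113.10 stands in for the destination's address; adjust port/interface to suit.
block drop out quick proto tcp from any to 203.0.113.10 port 22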

This worked, but when I lifted the rules, the replication came back with the following errors:

Apr 4 19:01:10 FreeNAS-Push autorepl.py: [common.pipesubr:42] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 mybox.no-ip.org "zfs list -Hr -o name -S creation -t snapshot -d 1 backup | head -n 1 | cut -d@ -f2"
Apr 4 19:01:11 FreeNAS-Push autorepl.py: [tools.autorepl:307] Remote and local mismatch after replication: data@auto-20130403.1938-6m vs data@auto-20130402.1938-6m
Apr 4 19:01:11 FreeNAS-Push autorepl.py: [common.pipesubr:42] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 mybox.no-ip.org "zfs list -Ho name -t snapshot backup | head -n 1 | cut -d@ -f2"
Apr 4 19:01:12 FreeNAS-Push autorepl.py: [tools.autorepl:323] Replication of data@auto-20130402.1938-6m failed with cannot receive new filesystem stream: destination has snapshots (eg. backup@auto-20130330.1938-6m) must destroy them to overwrite it
warning: cannot send 'data@auto-20130213.1746-6m': Broken pipe
warning: cannot send 'data@auto-20130214.1746-6m': Broken pipe
warning: cannot send 'data@auto-20130215.1746-6m': Broken pipe
warning: cannot send 'data@auto-20130216.1746-6m': Broken pipe
warning: cannot send 'data@auto-20130217.1746-6m': Broken pipe
warning: cannot send 'data@auto-20130218.1746-6m': Broken pipe
warning: cannot send 'data@auto-20130219.1746-6m': Broken pipe
warning: cannot send 'data@auto-20130220.1746-6m': Broken pipe
warning: cannot send 'data@auto-20130221.1746-6m': Broken pipe
warning: cannot send 'data@auto-20130222.1746-6m': Broken pipe
warning: cannot send 'data@auto-20130223.1746-6m': Broken pipe
warning: cannot send 'data@auto-20130224.1746-6m': Broken pipe
warning: cannot send 'data@
Apr 4 19:01:21 FreeNAS-Push autorepl.py: [tools.autorepl:264] Creating backup on remote system
Apr 4 19:01:21 FreeNAS-Push autorepl.py: [common.pipesubr:49] Executing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 mybox.no-ip.org /sbin/zfs create -o readonly=on -p backup
Apr 4 19:01:21 FreeNAS-Push autorepl.py: [common.pipesubr:49] Executing: (/sbin/zfs send -R data@auto-20130402.1938-6m | /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 mybox.no-ip.org "/sbin/zfs receive -F -d backup && echo Succeeded.") > /tmp/repl-13932 2>&1​

I figured that, since the snapshot it was trying to replicate when I stopped it was "20130403", I would delete those snapshots on the remote end. But I accidentally deleted "20130402" first, so I continued on and deleted "20130403" as well.
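(For the record, the deletions were done on the destination box; from a shell the equivalent would be something like the following, with snapshot names taken from my logs.)

# On the destination FreeNAS (snapshot names are examples from my setup):
zfs destroy backup@auto-20130402.1938-6m
zfs destroy backup@auto-20130403.1938-6m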

Now when I replicate I get this error:

Apr 4 19:22:10 FreeNAS-Push autorepl.py: [tools.autorepl:264] Creating backup on remote system
Apr 4 19:22:10 FreeNAS-Push autorepl.py: [common.pipesubr:49] Executing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 mybox.no-ip.org /sbin/zfs create -o readonly=on -p backup
Apr 4 19:22:10 FreeNAS-Push autorepl.py: [common.pipesubr:49] Executing: (/sbin/zfs send data@auto-20130402.1938-6m | /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 mybox.no-ip.org "/sbin/zfs receive -F -d backup && echo Succeeded.") > /tmp/repl-15639 2>&1
Apr 4 19:22:11 FreeNAS-Push autorepl.py: [common.pipesubr:42] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 mybox.no-ip.org "zfs list -Hr -o name -S creation -t snapshot -d 1 backup | head -n 1 | cut -d@ -f2"
Apr 4 19:22:12 FreeNAS-Push autorepl.py: [tools.autorepl:307] Remote and local mismatch after replication: data@auto-20130401.1938-6m vs data@auto-20130402.1938-6m
Apr 4 19:22:12 FreeNAS-Push autorepl.py: [common.pipesubr:42] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 mybox.no-ip.org "zfs list -Ho name -t snapshot backup | head -n 1 | cut -d@ -f2"
Apr 4 19:22:12 FreeNAS-Push autorepl.py: [tools.autorepl:323] Replication of data@auto-20130402.1938-6m failed with cannot receive new filesystem stream: destination has snapshots (eg. backup@auto-20130330.1938-6m) must destroy them to overwrite it warning: cannot send 'data@auto-20130402.1938-6m': Broken pipe​

I also tried unticking the "Recursively replicate" option, but this has not helped.


What are your thoughts? Shouldn't FreeNAS be able to pick itself up when the data stream is interrupted during replication?


Thanks, Richard
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
This is not a great option for me - the remote FreeNAS is physically 50 km away from the main FreeNAS. I've had to do this before and it's a pain.

Surely there is a way to fix the replication when it falls into a state like this? I would have thought FreeNAS could handle an interruption to the replication process without having to wipe and replicate all over again.

Thanks, Richard
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
Do you mean the "Initialize remote side for once" option?

If so, that option will completely wipe the remote end, and it will then need to replicate all snapshots again. This is not possible over the link I replicate across (its speed and data quota are only suitable for replicating the delta). I would need to relocate the remote box to be local in order to do this.

Is there no way to fix the issue without having to wipe and start over again? It's not a great option if every time the replication is interrupted you need to do a fresh wipe.


Thanks, Richard
 

William Grzybowski

Wizard
iXsystems
Joined
May 27, 2011
Messages
1,754
Well, AFAIK that happened because you deleted the wrong snapshot, not because it was interrupted ;)

I don't know of any workaround, maybe someone else does.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I don't think the problem was that replication was interrupted. I think the problem is that you manually deleted snapshots.
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
I deleted an older snapshot, which should not cause an issue (its data just rolls into the newer snapshot so everything stays consistent). I then deleted the latest snapshot as well, so instead of the remote end missing only the latest snapshot, it is now missing the latest and the day before.

I can't see why this would cause an issue with replication. Replication should just see that two days are missing and replicate both, right? The scenario is much like the remote FreeNAS being offline for two days and then coming back - FreeNAS should be able to handle this.
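(As a sanity check, comparing the snapshot lists on both ends shows exactly which ones are missing. These are illustrative commands, reusing the key and hostname from my logs.)

# On the push side: snapshot names local to "data"
zfs list -Hr -o name -t snapshot -d 1 data | cut -d@ -f2 | sort > /tmp/local-snaps.txt
# Snapshot names the destination has under "backup"
ssh -i /data/ssh/replication mybox.no-ip.org "zfs list -Hr -o name -t snapshot -d 1 backup" | cut -d@ -f2 | sort > /tmp/remote-snaps.txt
# Anything flagged with "<" exists locally but is missing on the remote end
diff /tmp/local-snaps.txt /tmp/remote-snaps.txt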

The problem is that the replication was interrupted. The first log excerpt I posted above is purely from after interrupting the replication. It looped through those errors for hours after I lifted the firewall rule. I then proceeded to manually remove snapshots, which caused the second set of errors.

I only proceeded with the second step of deleting snapshots because the replication was still erroring after the connection was restored.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I don't know. I've had replication interrupted before without any problems. I've never tried deleting snapshots, which is why I thought that might be the cause of your issues. If I get bored in the next couple of days I might have to test this.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I went looking to see if I could find anything that would explain the error you got. There's very little out there that explains it well, or explains it enough to fit your situation. I still think the problem was manually deleting snapshots, instead of letting the system delete them oldest-to-newest or deleting only the ones it knew were incomplete.

I believe the issue is that your two servers are out of sync. The source server has expectations that aren't being met by the destination (the last successfully completed snapshot is missing), so it has no idea what reference point it should start from. Snapshots build on the previous one, so now that you are two behind (one of which the destination no longer has) the source server is basically erroring out and leaving it as your problem to deal with, since it doesn't want to make bad assumptions and do something that makes things even worse. I believe the source server's ZFS replication only keeps track of the last successful snapshot. Since you deleted that one by accident, the source server is not sure what to think; it's expecting data where there is none. If there's one lesson I've learned about syncing (regardless of how it's done), it's that syncing tasks need to handle partial syncs properly and recover from them with no intervention (unless you chose settings to force intervention on error). I think if you had left it alone you wouldn't have the issue you are having. I'm convinced it isn't a bug, though.
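(For reference only - and note that doing sends by hand can confuse the GUI's bookkeeping - the manual version of that "reference point" idea is to find the newest snapshot both sides still have and send incrementally from there. A rough sketch using the names from your logs:)

# Newest snapshot the destination still has (the same query autorepl runs):
ssh -i /data/ssh/replication mybox.no-ip.org "zfs list -Hr -o name -S creation -t snapshot -d 1 backup | head -n 1"
# Suppose that prints backup@auto-20130401.1938-6m and the push side still has
# data@auto-20130401.1938-6m; then the next snapshot can be sent incrementally:
zfs send -R -i data@auto-20130401.1938-6m data@auto-20130402.1938-6m | ssh -i /data/ssh/replication mybox.no-ip.org "zfs receive -F -d backup"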

Some people have said they deleted and restored snapshots using rsync, so that may be an option for you. I'm not sure how much bigger/smaller/faster/slower the backup will be, but it sounds better than scrapping the whole replication and starting over from scratch.

If you interrupted the replication task because it rolled into daylight hours, I can understand the need to do that. But why not set a throttle speed or pay for faster internet? Ultimately, by effectively cancelling the snapshot that was in progress, the whole snapshot would have had to be sent again the following night along with the even newer snapshot, and that would definitely have run into day-shift hours because one snapshot alone did. Now you'd have two snapshots' worth of data to send. So in essence, you were pretty much screwed the second you were in a situation where you had to cancel it, unless you waited until the weekend, with even more snapshots queued, and hoped they'd all replicate through before Monday morning.

I thought about trying to test this, but it would require me to set up two VMs, set up fake data, etc. It's a lot of work when I'm 99% sure the issue is the manual deletion of snapshots.
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
Thanks cyberjock for the detailed response.

I didn't delete the snapshots initially - this is key to note. I only deleted them because I was stuck in a loop of errors beforehand (the first log excerpt I posted reflects the errors I was getting in that loop, prior to deleting the snapshots).

I can, and am likely to, set up a bandwidth limit on the replication - but I thought FreeNAS would be developed enough to handle an interruption to the stream mid-replication. If the product cannot handle this, it doesn't give much confidence that FreeNAS's replication functionality is well developed. From what I have read, people opt not to use the GUI and instead replicate manually from the CLI, because the GUI is lacking.

Anyway, is this a bug? What is the next step from here?


Cheers, Richard
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I've seen interrupted replications in the past. A little cleanup, or resetting the replication to start all over, will fix it; but if you do the cleanup wrong, you're left with the "reset replication" solution, which is a bit of a pain. Upon seeing this thread earlier, my impression was "...and that's what happens when you biff the cleanup." With that said, it would be nice to have some guidance for users on how to fix that, and if the guidance is sufficiently simple, maybe FreeNAS should do it automatically.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Here's a thread that caught my eye and made me think of this one: http://forums.freenas.org/showthread.php?10948-quot-freenas-state-quot-and-other-attributes

If you read that, it looks like snapshots have three "states": NEW, LATEST, and -. A new snapshot that hasn't been replicated is NEW. Once it has been successfully replicated it changes to LATEST. When the next NEW snapshot is replicated and relabeled LATEST, the old LATEST becomes -.
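(If I'm reading it right, freenas:state is just a ZFS user property on the snapshots, so something like this on the push box should show it for your "data" dataset:)

# Show each snapshot's replication state (NEW, LATEST or -) on the push side
zfs list -r -t snapshot -d 1 -o name,freenas:state data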

I do find it interesting, though, that you said you didn't delete snapshots until you had already gotten other errors. I don't get "good vibes" from how you blocked the snapshot from replicating. While I can understand why you chose to drop packets going one direction, I'm not sure that was wise; I probably would have blocked the packets in both directions. I'm not a wizard at replication though. I understand the theory and have set up snapshots and replication only once or twice, and only in a very simple fashion, so my gut feelings could be totally off.

This issue is interesting to me. It's always interesting to see odd limitations of services that are considered to be robust. You just never know when a small tidbit of information will come in handy later.
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
I've seen interrupted replications in the past. A little cleanup or resetting the replications to start all over will fix it

For knowledge's sake, the first excerpt of the log I posted is what happened after replication was interrupted. What would the clean-up process be in this case?


I am just beside myself that it can't be fixed up from where it is. There are 200 or so snapshots already on the remote NAS; it's just the last 4 or 5 snapshots that aren't there. It sounds like the replication logic in FreeNAS needs to be fixed - the 200 or so older snapshots will be identical to those on the push side, so FreeNAS should be able to pick itself up from this state and just replicate the 4 or 5 snapshots that are missing from the remote NAS.

It sounds like I may need to bring the NAS back into the office, but it would be great to understand what the "proper fix" is, based on the logs I posted, in the event the replication is interrupted in future. That is a likely scenario.


Cheers, Richard
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I think the correct fix was to delete only the incomplete snapshot. I find it weird that ZFS replication can't resend an incomplete snapshot without user interaction, but I'm not a ZFS replication wiz. Once you had accidentally deleted the latest completed snapshot, I don't think there are any recovery options except to rsync the latest completed snapshot across again. It sounds like that's what you wanted to do, but you deleted the wrong one.

Have you tried using rsync to restore the deleted completed snapshot? Quite a few people have mentioned restoring snapshots using rsync. It might save you the trouble of having to bring the remote server back to the local site and redoing the complete backup. I'm not sure how much data you have, but it could take anywhere from a few hours to a few days depending on the data and your server speeds.
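(The rough idea, as I understand it: snapshot contents are exposed read-only under the hidden .zfs/snapshot directory, so you can rsync them across. The paths below are guesses based on your dataset names, the destination dataset is created readonly=on by autorepl so it would need readonly=off first, and note that this copies file data only - it does not recreate the ZFS snapshot itself on the far end.)

# Illustrative only; adjust paths, user and dataset names to your systems.
rsync -a /mnt/data/.zfs/snapshot/auto-20130402.1938-6m/ root@mybox.no-ip.org:/mnt/backup/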
 

iceflatline

Cadet
Joined
Oct 13, 2014
Messages
1
I'm not a current user of FreeNAS, but I encountered this problem while performing incremental replications from one FreeBSD system to another. The solution is to manually send the missing snapshot(s) incrementally to the destination. For example, say you have the following snapshots on your source system:

pool_0/dataset_0@snap-20141009
pool_0/dataset_0@snap-20141010
pool_0/dataset_0@snap-20141011

Now say that snap-20141012 was taken at the source but never successfully sent incrementally to the destination. The next time the replication script runs (in this example, the next day) it will take another snapshot on the source side, but it will not be able to send it incrementally because snap-20141012 is missing on the destination. To fix this, manually do an incremental send from the command line using snap-20141011 and snap-20141012:

zfs send -R -i pool_0/dataset_0@snap-20141011 pool_0/dataset_0@snap-20141012 | ssh some-user@destination zfs receive -vudF pool_1

Then do another from snap-20141012 to snap-20141013, and so on, until you're caught up and both source and destination have the same snapshots:

zfs send -R -i pool_0/dataset_0@snap-20141012 pool_0/dataset_0@snap-20141013 | ssh some-user@destination zfs receive -vudF pool_1

At that point the script should take over successfully.
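(If several snapshots are missing, a small loop over consecutive pairs saves typing each send by hand. A rough sketch - adjust the dataset, destination and snapshot list to your own systems:)

#!/bin/sh
# Send each consecutive incremental until the destination is caught up.
prev="snap-20141011"
for snap in snap-20141012 snap-20141013 snap-20141014; do
    zfs send -R -i "pool_0/dataset_0@${prev}" "pool_0/dataset_0@${snap}" | ssh some-user@destination zfs receive -vudF pool_1 || exit 1
    prev="$snap"
done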
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
First, you just replied to a thread from May 2013.

Second, you do NOT want to do replication via the CLI. You will fubar the autorepl.py code.

Third, the next time a zfs replication task runs, it will automagically send anything that has failed in the past.

Sorry iceflatline, but FreeBSD stuff doesn't necessarily apply to FreeNAS, which is basically an appliance and has limitations in its functionality. Thanks for trying though.
 