Replication stopped working after PULL pool rebuilt

toadman · Aug 30, 2015

I'm running 9.3.1 on two servers, one is used to backup the main server. (HW listed below.) For about a year I've replicated 4 datasets daily between the two. I was running out of space on the backup server, so I destroyed the pool, added capacity, and created a new pool with the same name.

I then turned on replication again and ticked "initialize once" in each of the replication tasks. They all gave me a failure message.

So I went ahead and started over. I deleted all of the replication tasks and set them back up again. I double checked the public key from PUSH was in the root of PULL, etc.

It's still not working. However, from the command line I am able to properly execute sending a snapshot over to backup with this:

Code:

zfs send pool0/userdata@auto-20150101.0100-1y | ssh -i /data/ssh/replication backupserver.freenas.lan zfs receive backup0/userdata@auto-20150101.0100-1y

So I'm wondering if something goofed in whatever configuration the replication script is looking at? Any ideas?

Main server (PUSH):
freenas 9.3.1 (updated via 9.3 STABLE)
AMD 955
ASUS m4A89GTD motherboard
16GB ECC memory
LSI 9211-8i HBA
8 disks configured in 4 mirrored pairs
2x intel gigabit nics
2x realtek nics (not used)

Backup server (PULL)
freenas 9.3.1 (updated via 9.3 STABLE)
vm running under esxi 6.0
8GB ECC memory
4 vCPUs (AMD 1100T) (normally runs with only 2)
AMD SB950 SATA controller passed through running AHCI
6 3TB WD RED in RAIDZ2
2x intel gigabit nics

Robert Trevellyan · Aug 30, 2015

toadman said:
I'm running 9.3.1 on two servers

There have been several reports of problems with replication in 9.3.1.

toadman · Aug 30, 2015

Robert Trevellyan said:
There have been several reports of problems with replication in 9.3.1.

Ugh. Figures. Thanks for the reply, at least I know it's apparently not just me. :)

What seems to have worked for me today is to manually send the first snapshot for a given dataset via command line. When this completes I'm able to then create a replication task for that dataset, but I don't "initialize once" when creating the replication task since the first snapshot is already at the PULL system. The replication script then takes over and successfully replicates the remaining snapshots for that dataset. Well, it worked on the first dataset I tried anyway. I'm now trying the same procedure with a dataset containing zvols and a recursively replicated dataset. We'll see what happens.

So my guess is that something is goofed in the script when trying to do the initialize once part with a new replication task. I haven't had the time to dive into the script itself to see if something doesn't make sense or specifically diagnose which step sees the problem. Priority at this point is getting the backups rebuilt, however I have to do so in the short term. Then I can move to diagnosing. :)

Robert Trevellyan · Aug 31, 2015

Have you considered reverting to a previous 9.3-STABLE release before reinitializing your backups?

toadman · Aug 31, 2015

Yes, good suggestion, and that would be quite easy too. At this point I'm still waiting on a large dataset to finish the initial send with the above mentioned manual zfs send.

I'll definitely play around with a roll back and other other diags when this finishes.

toadman · Aug 31, 2015

One other thing of note, while the replications tasks are working after I manually send the first snapshot, the "Last snapshot send to remote side" field is not being updated. I can confirm it's properly matching the snapshots on both PUSH and PULL though. I'll file a bug on this if I can't get it corrected.

Johnny Fartpants · Aug 31, 2015

Try ps -A |grep zfs on your primary server as you may fine your old replication task is still stuck. If so kill -9 the process.

Sent from my iPhone using Tapatalk

rogerh · Sep 1, 2015

toadman said:
One other thing of note, while the replications tasks are working after I manually send the first snapshot, the "Last snapshot send to remote side" field is not being updated. I can confirm it's properly matching the snapshots on both PUSH and PULL though. I'll file a bug on this if I can't get it corrected.

https://bugs.freenas.org/issues/11130

More worryingly; a) the new replication task process doesn't destroy stale snapshots on the destination by design, and b) snapshot tasks being replicated by the new replication process seem to be unable to continue to delete stale snapshots on the sending server.

toadman · Sep 1, 2015

Johnny Fartpants said:
Try ps -A |grep zfs on your primary server as you may fine your old replication task is still stuck. If so kill -9 the process

Just have zfsd running, so no old replication...

Code:

fileserver# ps -A |grep zfs
    5 ??  DL  13:54.87 [zfskern]
6046  ??  S    0:00.02 /sbin/zfsd -d zfsd
44645  0  S+   0:00.00 grep zfs
fileserver#

toadman · Sep 1, 2015

rogerh said:
https://bugs.freenas.org/issues/11130

More worryingly; a) the new replication task process doesn't destroy stale snapshots on the destination by design, and b) snapshot tasks being replicated by the new replication process seem to be unable to continue to delete stale snapshots on the sending server.

Thanks for the bug number. I'll watch how that develops. I haven't noticed (a) yet, but that's because the system hasn't dropped any snapshots on PUSH yet. I have an easy work around for this one I think, so less worried about it.

I was a little confused by what you meant by (b). (I do have my own script that deletes stale snapshots, so maybe I wouldn't see this one either. I would simply use this same script on PULL if needed.)

rogerh · Sep 1, 2015

toadman said:
Thanks for the bug number. I'll watch how that develops. I haven't noticed (a) yet, but that's because the system hasn't dropped any snapshots on PUSH yet. I have an easy work around for this one I think, so less worried about it.

I was a little confused by what you meant by (b). (I do have my own script that deletes stale snapshots, so maybe I wouldn't see this one either. I would simply use this same script on PULL if needed.)

This is just a GUI bug. a) is purely cosmetic. Replication tasks that were running before updating to 9.3.1 seem to carry on running, just don't tell you in the GUI where they have got to. b) relates to a non-fatal error message that is peculiar to my system, as I am replicating to zfsonlinux.

More troublesome is the failure to set up new replication tasks through the GUI, not clear who this affects yet: https://bugs.freenas.org/issues/11230

But if you are using your own scripts to replicate you shouldn't have any problem.

toadman · Sep 1, 2015

Yes, I ran into 11230. I had several scripts that worked on transition to 9.3.1, but I had to rebuild my PULL pool, effectively creating a new target. That's when things broke. I couldn't get the existing scripts to "initialize once" again. The work around is to send the first snapshot manually, then setup a new replication task and the rest of the snapshots go just fine.

I'll know in a few days if the snapshots are not deleting on the PULL side.

Important Announcement for the TrueNAS Community.

Replication stopped working after PULL pool rebuilt

toadman

Guru

Robert Trevellyan

Pony Wrangler

toadman

Guru

Robert Trevellyan

Pony Wrangler

toadman

Guru

toadman

Guru

Johnny Fartpants

Guru

rogerh

Guru

toadman

Guru

toadman

Guru

rogerh

Guru

toadman

Guru

Similar threads

Important Announcement for the TrueNAS Community.

Replication stopped working after PULL pool rebuilt

Guru

Pony Wrangler

Guru

Pony Wrangler

Guru

Guru

Guru

Guru

Guru

Guru

Guru

Guru

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Replication stopped working after PULL pool rebuilt"

Similar threads