A second replication task throws erroneous errors

Status
Not open for further replies.

praecorloth

Contributor
Joined
Jun 2, 2011
Messages
159
Hey everyone,

I've got two FreeNAS boxen I'm using to demo replication. The issue I'm running into is that with one dataset, FreeNAS reports a failure and emails out an alert, but no failure actually occurred, and on the next run the error clears itself. First, some specs.

FreeNAS1
FreeNAS 9.3-STABLE-201605170422
AMD Phenom II x4 820
8GB non-ECC memory
Three 7200RPM SATA disks I pulled out of a bin to make a RAIDZ1
Some onboard Realtek based gigabit NIC

FreeNAS2
FreeNAS 9.3-STABLE-201605170422
Xeon E3-1220
16GB ECC memory
Six 7200RPM SATA drives that were purchased to make a RAIDZ2
Nice Intel gigabit NIC

On to the nitty-gritty. On FreeNAS1 we have the RAIDZ1 which hosts two datasets. We'll call them CheetahZFS and ProxmoxVMs. Both of these hold VMs, but only the ProxmoxVMs has any real changing data on it. The CheetahZFS dataset was replicated to FreeNAS1 from yet a third FreeNAS box that is no longer around.

When CheetahZFS replicates a snapshot over to FreeNAS2, FreeNAS1 throws an error stating that the replication has failed. I see in the Replication Tasks that it's showing a failure, I get an email that a failure has occurred, I check FreeNAS2 and the snapshot is there. The next minute when autorepl.py runs again, the error is cleared.

I ended up enabling debug in autorepl.py to get a better understanding of what was going on. I was hoping for a log file specific to replication, but I haven't found one. I did, however, pull these commands out of the messages file.

autorepl.py: [common.pipesubr:71] Popen()ing: /sbin/zfs list -H -t snapshot -p -o name,creation -r -d 1 "FreeNAS1/CheetahZFS"

Followed by

autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -l CheetahRep -p 22 192.168.0.2 "zfs list -H -t snapshot -p -o name,creation -d 1 -r 'Repository/CustomerCheetah/CheetahZFS'"

So the first command I imagine checks the snapshots on FreeNAS1, and the second command checks the snapshots on FreeNAS2. Happy. So for grins, I create a manual snapshot, I wait for the page to reload after naming the snapshot, I switch to a console window on FreeNAS1 with that command ready and waiting, and hit enter. In that short amount of time, I see my manual snapshot on FreeNAS2. A short while later I receive the error. The next minute, autorepl.py runs again, and the error is cleared.
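For anyone following along, the comparison those two commands feed can be mimicked by hand. This is just an illustrative sketch with made-up snapshot names, not autorepl.py's actual logic: sort the two name lists and use comm to see which snapshots exist on the source but not yet on the target.

```shell
# Illustrative only: fake snapshot names standing in for the output of the
# two "zfs list -H -t snapshot" commands above.
local_snaps="auto-20160716.1500-2w
test8-20160716
test9-20160716"
remote_snaps="auto-20160716.1500-2w
test8-20160716"

printf '%s\n' "$local_snaps" | sort > /tmp/local_snaps.txt
printf '%s\n' "$remote_snaps" | sort > /tmp/remote_snaps.txt

# comm -23 prints lines unique to the first (sorted) input, i.e. snapshots
# present on the source but missing on the target.
missing=$(comm -23 /tmp/local_snaps.txt /tmp/remote_snaps.txt)
echo "$missing"    # test9-20160716
rm -f /tmp/local_snaps.txt /tmp/remote_snaps.txt
```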

This happens with every hourly snapshot that is automatically generated. What I find most peculiar are the failure emails. Check this out.

Replication FreeNAS1/CheetahZFS -> 192.168.0.2:Repository/CustomerCheetah failed: Failed: FreeNAS1/CheetahZFS (test8-20160716->test9-20160716)

Specifically I'm looking at the names of the snapshots. What we see in the parentheses at the end of the message is previous snapshot -> current snapshot. Is it trying to replicate out of order or something? I don't know.
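If I had to guess, that previous->current pair looks like the bounds of an incremental send, i.e. something shaped like the following (the exact flags are my assumption, not lifted from autorepl.py):

```shell
# Hypothetical reconstruction of the incremental the email seems to describe:
# send only the changes between test8 and test9, piped to the remote box.
zfs send -i FreeNAS1/CheetahZFS@test8-20160716 \
    FreeNAS1/CheetahZFS@test9-20160716 | \
  ssh -i /data/ssh/replication CheetahRep@192.168.0.2 \
    "zfs receive -F Repository/CustomerCheetah/CheetahZFS"
```

An incremental like that fails if the target's copy of the "previous" snapshot doesn't match the source's, which is why the pairing caught my eye.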

And to throw the final monkey into this wrench: the other dataset I mentioned, ProxmoxVMs? Its snapshots are created in the same manner, replication is set up the same way, to the same target, and it works flawlessly.

The only difference I can come up with is that CheetahZFS came to FreeNAS1 from another FreeNAS box. It was replicated to FreeNAS1. The way I did it was unorthodox as well, but it worked exactly the way I thought it should. It was my attempt at seeding replication via an external USB drive. The drive was plugged into the now-defunct FreeNAS box, which replicated the dataset to localhost, onto that USB drive. I exported the drive's pool, unplugged it, and brought it to FreeNAS1 (never underestimate the bandwidth of a station wagon full of tapes), plugged it in, imported it, and replicated to localhost once more.
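In case anyone wants the shape of that seeding dance, it boils down to something like this (pool and snapshot names are made up, and in practice the GUI does this through a replication task to localhost rather than raw commands):

```shell
# On the old (now defunct) box: full send of the dataset onto a pool that
# lives on the USB drive.
zfs snapshot tank/CheetahZFS@seed
zfs send tank/CheetahZFS@seed | zfs receive usbpool/CheetahZFS

# Detach the drive cleanly, sneakernet it over, and reattach it.
zpool export usbpool
# ...station wagon full of tapes...
zpool import usbpool

# On FreeNAS1: replicate from the USB pool onto the local pool.
zfs send usbpool/CheetahZFS@seed | zfs receive FreeNAS1/CheetahZFS
```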


I'm going to test along the obvious lines of whether or not my replication seeding is the source of these errors. It just seemed like an odd one that I should throw past the FreeNAS experts. Thoughts and opinions?
 

praecorloth
Sadly, not yet. I've given up on troubleshooting the instance of the issue that I ran into. I feel like I probably borked things worse when I stayed up playing with this one night.

What I'm going to do now is try to replicate (ha ha, see what I did there?) the issue. So I've got the dataset from one FreeNAS box seeded to an external drive and transferred over to a second FreeNAS box. I just need to get some time to import the volume, replicate the dataset to the local drives, and then set up replication to the third FreeNAS box.
 

praecorloth
Wow. A month later, and I'm finally able to get back to this. First things first: I've run through my documentation for replication with dedicated users once more, to make sure that part was solid. Once I verified that, I went and verified the problem itself. I could recreate the problem using my dedicated user and the seeded dataset. I then verified that I could use the same user to replicate a new dataset from FreeNAS02 to FreeNAS03.

Then I got to thinking about how I got here in the first place, and recalled some events that didn't strike me as relevant before, but now seem like they could be the source of my trouble.

The seeded dataset was originally meant to be a proof of concept, to see how (and if) the process worked. Here's the process:
1. Dataset on FreeNAS01 is created, and replicated to an external drive.
2. External drive is exported, moved physically to FreeNAS02, and imported.
3. Dataset is replicated from the external drive to a pool on FreeNAS02.
4. FreeNAS01 is then set up to replicate to FreeNAS02, without having to replicate the initial image over the filthy, unwashed Internet.

Once we confirmed this worked, the project was marked as successful, and the replication between FreeNAS01 and FreeNAS02 was terminated. I feel this is important. Some time later, in an effort to test our replication documentation, as well as give a new tech experience in setting up replication, we had the new tech go through the documentation step by step. But he needed a dataset to replicate. Enter the seeded dataset.

This is the moment when we first encountered the emails stating that the dataset had failed to copy over. But as I recall (and cannot prove at this point, so I am working on recreating the scenario), a number of the snapshots copied over without complaint. It was only new snapshots that had issues.

It strikes me that something very important has changed, and maybe I'm doing something silly that should just never be done.

The original way I was thinking about this was:
Dataset lives on FreeNAS01, replicates to FreeNAS02, replicates to FreeNAS03.

But how it actually ends up working is:
Dataset starts on FreeNAS01, replicates to FreeNAS02, connection between FreeNAS01 and FreeNAS02 terminates after some time, new data and snapshots are generated on FreeNAS02, replicates to FreeNAS03.

I feel like that's not an impossible scenario for FreeNAS to encounter in the wild. Imagine someone's FreeNAS box is replicating to an offsite box. Building burns down. Offsite replication FreeNAS box becomes production. Now they want to replicate to a new offsite FreeNAS box. But ultimately how this is actually working is important to troubleshooting, and I'm not sure my previous stream of consciousness represented it accurately.

While I'm in troubleshooting mode, are there any suggestions on logs I could look at, commands that could capture more data, etc.? I have free rein of FreeNAS02 and FreeNAS03 for a couple of weeks.
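One thing I'm planning to check myself, in case it helps anyone else: every ZFS snapshot carries a guid property that survives send/receive, so comparing guids of the supposedly-common snapshot on both boxes should show whether the chains have actually diverged (dataset and snapshot names here are hypothetical):

```shell
# On FreeNAS02: guid of the snapshot both sides are supposed to share.
zfs get -H -o value guid tank02/SeededDataset@common-snap

# Same snapshot as it exists on FreeNAS03; if the guids differ, the
# datasets have diverged and incremental sends will fail.
ssh -i /data/ssh/replication rep@freenas03 \
  "zfs get -H -o value guid tank03/SeededDataset@common-snap"
```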
 

praecorloth
Update on this. I've started encountering the same errors with a clean dataset started on FreeNAS02, replicating to FreeNAS03. So now I'm down to thinking this is a misconfiguration somewhere on one of the NAS boxen, and not a problem with the 1->2, 2->3 replication chain.
 