praecorloth
Contributor
Joined: Jun 2, 2011
Messages: 159
Hey everyone,
I've got two FreeNAS boxen I'm using to demo replication. The issue I'm running into is that with one dataset, FreeNAS reports a failure and emails out a failure alert, even though nothing actually failed, and on the next run the failure clears. First, some specs.
FreeNAS1
FreeNAS 9.3-STABLE-201605170422
AMD Phenom II x4 820
8GB non-ECC memory
Three 7200RPM SATA disks I pulled out of a bin to make a RAIDZ1
Some onboard Realtek based gigabit NIC
FreeNAS2
FreeNAS 9.3-STABLE-201605170422
Xeon E3-1220
16GB ECC memory
Six 7200RPM SATA drives that were purchased to make a RAIDZ2
Nice Intel gigabit NIC
On to the nitty-gritty. On FreeNAS1 we have the RAIDZ1, which hosts two datasets; we'll call them CheetahZFS and ProxmoxVMs. Both of these hold VMs, but only ProxmoxVMs has any data that really changes. The CheetahZFS dataset was replicated to FreeNAS1 from yet a third FreeNAS box that is no longer around.
When CheetahZFS replicates a snapshot over to FreeNAS2, FreeNAS1 throws an error stating that the replication has failed. I see in the Replication Tasks that it's showing a failure, I get an email that a failure has occurred, I check FreeNAS2 and the snapshot is there. The next minute when autorepl.py runs again, the error is cleared.
I ended up enabling debug in autorepl.py to get a better understanding of what was going on. I was hoping for a replication-specific log file, but I haven't found one. I did, however, get these commands out of the messages file.
autorepl.py: [common.pipesubr:71] Popen()ing: /sbin/zfs list -H -t snapshot -p -o name,creation -r -d 1 "FreeNAS1/CheetahZFS"
Followed by
autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -l CheetahRep -p 22 192.168.0.2 "zfs list -H -t snapshot -p -o name,creation -d 1 -r 'Repository/CustomerCheetah/CheetahZFS'"
So the first command, I imagine, checks the snapshots on FreeNAS1, and the second checks the snapshots on FreeNAS2. Happy. So for grins, I create a manual snapshot, wait for the page to reload after naming it, switch to a console window on FreeNAS1 with the second (remote) command ready and waiting, and hit enter. In that short amount of time, I already see my manual snapshot on FreeNAS2. A short while later I receive the error. The next minute, autorepl.py runs again, and the error is cleared.
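Boiled down to shell commands, that test looks roughly like this (the snapshot was actually taken from the GUI, the name here is just an example, and the diff at the end is my own ad-hoc check, not anything autorepl.py does):

# take a manual snapshot of the source dataset
zfs snapshot FreeNAS1/CheetahZFS@manual-test-1
# list snapshot names on the local (FreeNAS1) side
zfs list -H -t snapshot -o name -r -d 1 FreeNAS1/CheetahZFS | sed 's/.*@//' | sort > /tmp/local.snaps
# list snapshot names on the remote (FreeNAS2) side, over the replication key
ssh -i /data/ssh/replication -o BatchMode=yes -l CheetahRep 192.168.0.2 \
  "zfs list -H -t snapshot -o name -d 1 -r Repository/CustomerCheetah/CheetahZFS" | sed 's/.*@//' | sort > /tmp/remote.snaps
# if the two sides are in sync, this comes back empty
diff /tmp/local.snaps /tmp/remote.snaps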
This happens with every hourly snapshot that is automatically generated. What I find most peculiar are the failure emails. Check this out.
Replication FreeNAS1/CheetahZFS -> 192.168.0.2:Repository/CustomerCheetah failed: Failed: FreeNAS1/CheetahZFS (test8-20160716->test9-20160716)
Specifically I'm looking at the names of the snapshots. What we see in the parentheses at the end of the message is previous snapshot -> current snapshot. Is it trying to replicate out of order or something? I don't know.
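If snapshots really were going out in the wrong order, the creation timestamps ought to show it. Something like this (just a sanity check on my side) should confirm that test8 really predates test9:

zfs get -H -p creation FreeNAS1/CheetahZFS@test8-20160716 FreeNAS1/CheetahZFS@test9-20160716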
And to throw the final monkey into this wrench, the other dataset I mentioned, ProxmoxVMs? Snapshots are created in the same manner, replication is set up the same way, to the same target, and it works flawlessly.
The only difference I can come up with is that CheetahZFS came to FreeNAS1 from another FreeNAS box; it was replicated to FreeNAS1. The way I did it was unorthodox as well, but the way I thought it should work and the way it actually executed matched perfectly. It was my attempt at seeding replication via an external USB drive. The drive was plugged into the now-defunct FreeNAS box, which replicated to localhost, onto that USB drive. I exported the drive, unplugged it, and brought it to FreeNAS1 (never underestimate the bandwidth of a station wagon full of tapes), plugged it in, imported it, and replicated to localhost once more.
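For anyone curious, the seeding boiled down to roughly this. The pool names below are made up, and I actually used the GUI's replication-to-localhost tasks rather than raw send/receive, so treat it as a sketch rather than the exact commands I ran:

# on the old (now-defunct) box: seed a pool living on the USB drive
zfs snapshot OldPool/CheetahZFS@seed
zfs send OldPool/CheetahZFS@seed | zfs receive USBSeed/CheetahZFS
zpool export USBSeed
# physically move the drive to FreeNAS1, then:
zpool import USBSeed
zfs send USBSeed/CheetahZFS@seed | zfs receive FreeNAS1/CheetahZFS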
I'm going to test along the obvious lines of whether or not my replication seeding is the source of these errors. It just seemed like an odd one that I should throw past the FreeNAS experts. Thoughts and opinions?