Replication stopped for ZFS Volume?

Status
Not open for further replies.

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
Hi Guys,

I have an issue where I have 3 ZFS volumes (/dev/zvol/data/image1, /dev/zvol/data/image2, /dev/zvol/data/image3) sitting on top of a RAID-Z volume (/mnt/data).

I have a replication schedule setup to replicate /mnt/data to the remote FreeNAS to /mnt/backupData.


All was working well until a week or so ago, when I noticed that /dev/zvol/data/image2 stopped replicating. The Used column on the PUSH box shows "0" against the last image, yet I can see newer images of /dev/zvol/data/image1 and /dev/zvol/data/image3 on the PUSH... very weird.
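For what it's worth, the same Used figures can be cross-checked from the shell, assuming the GUI is just reporting each snapshot's used space (dataset name as above):

Code:
# Per-snapshot used space and creation time for everything under the replicated pool
zfs list -t snapshot -r -o name,used,creation data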


What should I do to resolve / troubleshoot this?


Cheers, joo
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Nope. Just nobody has the answer, I guess. Some things are too complex to troubleshoot in a forum setting.

Also, you've provided no information that would even give a clue as to what is wrong. Your thread is nothing more than the proverbial "My car won't start... what is broken?", in which case the answer is to fix it yourself or take it to a mechanic. With no useful information, error messages or otherwise, there could be a long, long list of possible problems.
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
There are no error messages that I can see in the console. Is there a more detailed log I can provide?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
There might be something in /var/log or /var/tmp.
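For example, something along these lines from the PUSH shell should surface any replication-related messages (log path assumed to be the stock FreeNAS one; adjust if yours differs):

Code:
# Recent snapshot and replication messages from the system log
grep -i autosnap /var/log/messages | tail -n 50
grep -i autorepl /var/log/messages | tail -n 50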
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
OK, I went digging around again, and now I'm seeing errors coming up in the console on PUSH. Below is an excerpt of what I'm seeing.

Code:
May 26 20:17:01 FreeNAS-Push autorepl.py: [common.pipesubr:42] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 remote-freenas.no-ip.org "zfs list -Hr -o name -S creation -t snapshot -d 1 backup | head -n 1 | cut -d@ -f2"
May 26 20:17:01 FreeNAS-Push autorepl.py: [tools.autorepl:307] Remote and local mismatch after replication: data@auto-20130515.1939-6m vs data@auto-20130514.1939-6m
May 26 20:17:01 FreeNAS-Push autorepl.py: [common.pipesubr:42] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 remote-freenas.no-ip.org "zfs list -Ho name -t snapshot backup | head -n 1 | cut -d@ -f2"
May 26 20:17:02 FreeNAS-Push autorepl.py: [tools.autorepl:323] Replication of data@auto-20130514.1939-6m failed with cannot receive new filesystem stream: destination has snapshots (eg. backup@auto-20130226.1916-6m) must destroy them to overwrite it warning: cannot send 'data@auto-20130213.1746-6m': Broken pipe warning: cannot send 'data@auto-20130214.1746-6m': Broken pipe warning: cannot send 'data@auto-20130215.1746-6m': Broken pipe warning: cannot send 'data@auto-20130216.1746-6m': Broken pipe warning: cannot send 'data@auto-20130217.1746-6m': Broken pipe warning: cannot send 'data@auto-20130218.1746-6m': Broken pipe warning: cannot send 'data@auto-20130219.1746-6m': Broken pipe warning: cannot send 'data@auto-20130220.1746-6m': Broken pipe warning: cannot send 'data@auto-20130221.1746-6m': Broken pipe warning: cannot send 'data@auto-20130222.1746-6m': Broken pipe warning: cannot send 'data@auto-20130223.1746-6m': Broken pipe warning: cannot send 'data@auto-20130224.1746-6m': Broken pipe warning: cannot send 'data@
May 26 20:17:19 FreeNAS-Push autorepl.py: [tools.autorepl:264] Creating backup on remote system
May 26 20:17:19 FreeNAS-Push autorepl.py: [common.pipesubr:49] Executing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 remote-freenas.no-ip.org /sbin/zfs create -o readonly=on -p backup
May 26 20:17:19 FreeNAS-Push autorepl.py: [common.pipesubr:49] Executing: (/sbin/zfs send -R data@auto-20130514.1939-6m | /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 remote-freenas.no-ip.org "/sbin/zfs receive -F -d backup && echo Succeeded.") > /tmp/repl-45970 2>&1
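
Reading that log, it looks like autorepl.py first asks PULL for its newest snapshot, sees that it doesn't match what PUSH expects (auto-20130515 vs auto-20130514), and then falls back to a full "zfs send -R", which fails because the destination dataset already has snapshots. The failing comparison can be reproduced by hand; the two commands below are lifted from the log above (same key, host, and dataset names, with -q dropped so any SSH errors are visible):

Code:
# On PUSH: the newest local snapshot of the replicated dataset
zfs list -H -o name -S creation -t snapshot -d 1 data | head -n 1

# On PUSH: ask PULL for its newest snapshot of the destination, the same way autorepl.py does
/usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -p 22 remote-freenas.no-ip.org \
  "zfs list -Hr -o name -S creation -t snapshot -d 1 backup | head -n 1 | cut -d@ -f2"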


I also ran zfs list -Ht snapshot -o name,freenas:state on both PUSH and PULL; below are the results (condensed where the sequence is consistent).

PUSH
data@auto-20130213.1746-6m -
data@auto-20130214.1746-6m -
<this continues on in sequence>
data@auto-20130513.1939-6m -
data@auto-20130514.1939-6m NEW
data@auto-20130515.1939-6m NEW
<this continues on in sequence>
data@auto-20130525.1943-6m NEW
data@auto-20130526.1943-6m NEW
data/rogue@auto-20130213.1746-6m -
data/rogue@auto-20130214.1746-6m -
<this continues on in sequence>
data/rogue@auto-20130327.1938-6m -
data/rogue@auto-20130328.1938-6m -
data/rogue@auto-20130329.1938-6m NEW
data/rogue@auto-20130330.1938-6m NEW
<this continues on in sequence>
data/rogue@auto-20130525.1943-6m NEW
data/rogue@auto-20130526.1943-6m NEW
data/server2008@server2008-15.02.2013 -
data/server2008@auto-20130516.1939-6m NEW
data/server2008@auto-20130517.1939-6m NEW
<this continues on in sequence>
data/server2008@auto-20130525.1943-6m NEW
data/server2008@auto-20130526.1943-6m NEW
data/squid@auto-20130213.1746-6m -
data/squid@auto-20130214.1746-6m -
<this continues on in sequence>
data/squid@auto-20130327.1938-6m -
data/squid@auto-20130328.1938-6m -
data/squid@auto-20130329.1938-6m NEW
data/squid@auto-20130330.1938-6m NEW
<this continues on in sequence>
data/squid@auto-20130525.1943-6m NEW
data/squid@auto-20130526.1943-6m NEW

PULL
backup@auto-20130213.1746-6m -
backup@auto-20130214.1746-6m -
<this continues on in sequence>
backup@auto-20130513.1939-6m -
backup@auto-20130514.1939-6m -
backup@auto-20130515.1939-6m NEW
backup/rogue@auto-20130213.1746-6m -
backup/rogue@auto-20130214.1746-6m -
<this continues on in sequence>
backup/rogue@auto-20130430.1939-6m -
backup/rogue@auto-20130501.1939-6m -
backup/server2008@server2008-15.02.2013 -
backup/server2008@auto-20130419.1938-6m -
backup/server2008@auto-20130420.1938-6m -
<this continues on in sequence>
backup/server2008@auto-20130513.1939-6m -
backup/server2008@auto-20130514.1939-6m -
backup/squid@auto-20130213.1746-6m -
backup/squid@auto-20130214.1746-6m -
<this continues on in sequence>
backup/squid@auto-20130513.1939-6m -
backup/squid@auto-20130514.1939-6m -

The full dumps of these listings are online at http://pastebin.com/mY8bYfK9 (PULL) and http://pastebin.com/ZHRyRsHf (PUSH).
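Side note on those listings: PULL's newest snapshot of the top-level dataset is auto-20130515.1939-6m, while PUSH has carried on to auto-20130526.1943-6m, and incremental replication can only resume from a snapshot that still exists on both sides. A rough way to confirm the newest common snapshot (dataset names as in this thread; the auto-YYYYMMDD.HHMM names happen to sort chronologically):

Code:
# On PUSH: snapshot names (the part after @) of the top-level dataset, sorted
zfs list -H -o name -t snapshot -d 1 data | cut -d@ -f2 | sort > /tmp/push-snaps

# On PULL (or via the replication SSH key from PUSH): the same for the destination dataset
zfs list -H -o name -t snapshot -d 1 backup | cut -d@ -f2 | sort > /tmp/pull-snaps

# Names present on both sides; the last one listed is the newest common snapshot,
# i.e. the only possible base for an incremental send
comm -12 /tmp/push-snaps /tmp/pull-snaps | tail -n 5

As long as that common snapshot is not destroyed on either side, a resync should not have to start from scratch.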


Cheers, joo
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
PS: This is very similar to the issue I had previously.

What I did last time was physically relocate the PULL to the office, wipe it, and resync from scratch. Now that it's happened again, I'm trying not to touch anything (last time I deleted some older snapshots, and supposedly that made my setup unrecoverable, forcing me to do a full resync).

Here is the original thread - http://forums.freenas.org/showthread.php?12205-Replication-Interruption

Cheers, joo
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
.. I'm starting to hear my echo again :/

I'm leaving my replication in this state so I can identify and work through a fix. Last time I had this issue, I resynced and all was good until it came up again.

I'd appreciate it if anyone can help get a quick resolution on this.


Cheers, joo
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Are you doing replication tasks every 6 minutes?

If those are 6-minute increments, that's really frequent. My first guess is that the frequency is causing overlap between the snapshots and the replication tasks. In effect, the replication task isn't finishing before the next snapshot (and its replication task) runs, and that's causing problems. Either increase the bandwidth between the two machines, or try something less frequent like 30 minutes or an hour.
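One way to sanity-check the overlap theory on PUSH (process names are the ones that appear in the logs earlier in this thread):

Code:
# Is a replication from the previous run still in flight when the next one fires?
ps auxww | grep -E 'autorepl|zfs send' | grep -v grep

# Recent replication activity and failures from the system log
grep autorepl /var/log/messages | tail -n 20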
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
Thanks for the response, cyberjock.

I have periodic snapshots set up for once a day, not every 6 minutes.

It was replicating fine for weeks. How can I resolve the issue where it's at now, so that I can restart replication and get it back on track?


cheers, joo
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
I've got nothing. Just like before, the only idea I have is to delete all the snapshots on the backup server and start over. Naturally, I know this isn't a good plan for you, since it involves physically relocating your server.

I'm not sure what your problem is in particular, or why you and you alone keep having this weird issue. I'm wondering if it's related to the network performance (or reliability) between the two servers. It's very odd that you are the only one with this issue.

If I were in your shoes, I might set up the primary server to ping the backup server 24x7 for a week and see if you have high packet loss or something. Something is just Not Right about your situation. Unfortunately, I don't know what that "something" is, and I don't have any good ideas. Maybe a RAM test on both machines is in order too?
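A rough way to run that ping test from the FreeNAS shell (hostname taken from the logs earlier in the thread; one packet per second for 24 hours, repeated as needed):

Code:
# Ping PULL once a second for a day; ping prints a loss summary when the count completes
nohup ping -c 86400 remote-freenas.no-ip.org > /var/tmp/ping-pull.log 2>&1 &

# The next day, the last lines show packets transmitted / received / % packet loss
tail -n 5 /var/tmp/ping-pull.log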

Edit: There's a thread around here somewhere where someone wrote a script for replication tasks. I can't remember exactly what it did, but it was supposed to be pretty good. I'd see if you can find that thread and whether that script could help you.
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
Thanks, cyberjock.

I had a dig around through the forum and couldn't locate it. Can I trouble you to have a look when you get a moment?

I'm not sure what the issue is. I'll take it in, wipe it, and sync it again. I'm just conscious that this is the second time I've had to do this, and it's damn heavy. Either way, I've gone without a replicated backup for two months now, so it's about time.

I'd really appreciate it if you could find that script. I'm hesitant to keep using the GUI-based replication for a long period, as this issue may just resurface. Switching over to the script sounds like a more controllable approach.

Cheers, joo
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525

Kernel

Cadet
Joined
Aug 20, 2013
Messages
1
I had a similar problem.
My PUSH server had been offline due to a faulty USB pendrive, and replication tasks failed over the weekend.
I replaced the USB drive, recovered the configuration, and everything looked OK... everything but replication.
It was not working at all: the first server (mainly CIFS shares) was not replicating to the second one, and the second server (an NFS share for virtual machine vdisks) was not replicating to its counterpart.

I copied and pasted the command lines from the logs, trying to understand what was wrong; in the first server's logs I found this command:

/usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 192.168.10.13 "zfs list -Ho name -t snapshot datastor/replica/shire | head -n 1 | cut -d@ -f2"

I tried copying and pasting the command as-is, and I got nothing but the prompt, while I was expecting a list of the remote side's snapshots.
On the remote side I tried issuing the zfs command that it would have received via ssh:

zfs list -Ho name -t snapshot datastor/replica/shire

and I got a list of snapshots; the remote side looked ok.

Then I tried issuing the command:
/usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -p 22 192.168.10.13

on the "PUSH" server: it's a little bit different from the one I found in the log because I removed the "-q" switch to see any error message and... TA-DA! I was not able to log on the remote side!
I checked the public keys involved and I found that they were missing (maybe because of my USB drive replacement) so I simply re-followed the steps to set up replication and copied the keys over again.
My servers (two HP Proliant MicroServer using FreeNas 8.3.1-p2) were stuck for about five minutes (but the services were working correctly, only SSH and web interface were unusable) but data is now flowing from one server to the other without any problem.
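For anyone who wants to run the same check, it roughly boils down to this (the key path is the stock FreeNAS one, the host is from Roberto's example, and the authorized_keys location assumes the replication key was added to the root account on PULL):

Code:
# On PUSH: is the replication key still there, and does key-only login work?
ls -l /data/ssh/replication
/usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -p 22 192.168.10.13 "hostname"

# On PULL: the PUSH public key must be present in root's authorized_keys
grep -c 'ssh-' /root/.ssh/authorized_keys

BatchMode=yes makes ssh fail immediately instead of prompting for a password, so a missing or rejected key shows up right away.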

Hope this helps!
Ciao,
Roberto
 

Sir.Robin

Guru
Joined
Apr 14, 2012
Messages
554
Have you tried initializing the other side?
 

Sir.Robin

Guru
Joined
Apr 14, 2012
Messages
554

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,402
Sir.Robin said: Was it my proposal that was funny, or the Darth vs Clint?

Darth vs Clint, I believe. Darth, Kernel, appears to have resolved his replication issue.

Kernel said: My servers (two HP ProLiant MicroServers running FreeNAS 8.3.1-p2) were stuck for about five minutes (the services kept working; only SSH and the web interface were unusable), but data is now flowing from one server to the other without any problem.
 