Replication stopped for ZFS Volume?

Status
Not open for further replies.

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
Hi Guys,

I have an issue where I have 3 ZFS volumes (/dev/zvol/data/image1, /dev/zvol/data/image2, /dev/zvol/data/image3) sitting on top of a RAID-Z volume (/mnt/data).

I have a replication schedule setup to replicate /mnt/data to the remote FreeNAS to /mnt/backupData.


All was working well until a week or so ago, when I noticed that /dev/zvol/data/image2 stopped replicating. The Used column on the PUSH box shows "0" against the last image, yet I can see newer images of /dev/zvol/data/image1 and /dev/zvol/data/image3 on the PUSH... very weird.
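For what it's worth, the same Used figures can be cross-checked from the shell, assuming the GUI is just reporting each snapshot's used space (dataset name as above):

Code:
# Per-snapshot used space and creation time for everything under the replicated pool
zfs list -t snapshot -r -o name,used,creation data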


What should I do to resolve / troubleshoot this?


Cheers, joo
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Nope. Just nobody has the answer, I guess. Some things are too complex to troubleshoot in a forum setting.

Also, you've provided no information that would even give a clue as to what is wrong. Your thread is nothing more than the proverbial "My car won't start... what is broken?", in which case the answer is to fix it yourself or take it to a mechanic. With no useful information, error messages or otherwise, there could be a long, long list of possible problems.
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
There are no error messages that I can see in the console. Is there a more detailed log I can provide?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
There might be something in /var/log or /var/tmp.
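For example, something along these lines from the PUSH shell should surface any replication-related messages (log path assumed to be the stock FreeNAS one; adjust if yours differs):

Code:
# Recent snapshot and replication messages from the system log
grep -i autosnap /var/log/messages | tail -n 50
grep -i autorepl /var/log/messages | tail -n 50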
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
OK, I went digging around again, and now I'm seeing errors coming up in the console on PUSH. Below is an excerpt of what I'm seeing.

Code:
May 26 20:17:01 FreeNAS-Push autorepl.py: [common.pipesubr:42] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 remote-freenas.no-ip.org "zfs list -Hr -o name -S creation -t snapshot -d 1 backup | head -n 1 | cut -d@ -f2"
May 26 20:17:01 FreeNAS-Push autorepl.py: [tools.autorepl:307] Remote and local mismatch after replication: data@auto-20130515.1939-6m vs data@auto-20130514.1939-6m
May 26 20:17:01 FreeNAS-Push autorepl.py: [common.pipesubr:42] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 remote-freenas.no-ip.org "zfs list -Ho name -t snapshot backup | head -n 1 | cut -d@ -f2"
May 26 20:17:02 FreeNAS-Push autorepl.py: [tools.autorepl:323] Replication of data@auto-20130514.1939-6m failed with cannot receive new filesystem stream: destination has snapshots (eg. backup@auto-20130226.1916-6m) must destroy them to overwrite it warning: cannot send 'data@auto-20130213.1746-6m': Broken pipe warning: cannot send 'data@auto-20130214.1746-6m': Broken pipe warning: cannot send 'data@auto-20130215.1746-6m': Broken pipe warning: cannot send 'data@auto-20130216.1746-6m': Broken pipe warning: cannot send 'data@auto-20130217.1746-6m': Broken pipe warning: cannot send 'data@auto-20130218.1746-6m': Broken pipe warning: cannot send 'data@auto-20130219.1746-6m': Broken pipe warning: cannot send 'data@auto-20130220.1746-6m': Broken pipe warning: cannot send 'data@auto-20130221.1746-6m': Broken pipe warning: cannot send 'data@auto-20130222.1746-6m': Broken pipe warning: cannot send 'data@auto-20130223.1746-6m': Broken pipe warning: cannot send 'data@auto-20130224.1746-6m': Broken pipe warning: cannot send 'data@
May 26 20:17:19 FreeNAS-Push autorepl.py: [tools.autorepl:264] Creating backup on remote system
May 26 20:17:19 FreeNAS-Push autorepl.py: [common.pipesubr:49] Executing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 remote-freenas.no-ip.org /sbin/zfs create -o readonly=on -p backup
May 26 20:17:19 FreeNAS-Push autorepl.py: [common.pipesubr:49] Executing: (/sbin/zfs send -R data@auto-20130514.1939-6m | /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 remote-freenas.no-ip.org "/sbin/zfs receive -F -d backup && echo Succeeded.") > /tmp/repl-45970 2>&1
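
Reading that log, it looks like autorepl.py first asks PULL for its newest snapshot, sees that it doesn't match what PUSH expects (auto-20130515 vs auto-20130514), and then falls back to a full "zfs send -R", which fails because the destination dataset already has snapshots. The failing comparison can be reproduced by hand; the two commands below are lifted from the log above (same key, host, and dataset names, with -q dropped so any SSH errors are visible):

Code:
# On PUSH: the newest local snapshot of the replicated dataset
zfs list -H -o name -S creation -t snapshot -d 1 data | head -n 1

# On PUSH: ask PULL for its newest snapshot of the destination, the same way autorepl.py does
/usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -p 22 remote-freenas.no-ip.org \
  "zfs list -Hr -o name -S creation -t snapshot -d 1 backup | head -n 1 | cut -d@ -f2"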


I also ran zfs list -Ht snapshot -o name,freenas:state on both PUSH and PULL; below are the results (condensed where the sequence is consistent).

PUSH
data@auto-20130213.1746-6m -
data@auto-20130214.1746-6m -
<this continues on in sequence>
data@auto-20130513.1939-6m -
data@auto-20130514.1939-6m NEW
data@auto-20130515.1939-6m NEW
<this continues on in sequence>
data@auto-20130525.1943-6m NEW
data@auto-20130526.1943-6m NEW
data/rogue@auto-20130213.1746-6m -
data/rogue@auto-20130214.1746-6m -
<this continues on in sequence>
data/rogue@auto-20130327.1938-6m -
data/rogue@auto-20130328.1938-6m -
data/rogue@auto-20130329.1938-6m NEW
data/rogue@auto-20130330.1938-6m NEW
<this continues on in sequence>
data/rogue@auto-20130525.1943-6m NEW
data/rogue@auto-20130526.1943-6m NEW
data/server2008@server2008-15.02.2013 -
data/server2008@auto-20130516.1939-6m NEW
data/server2008@auto-20130517.1939-6m NEW
<this continues on in sequence>
data/server2008@auto-20130525.1943-6m NEW
data/server2008@auto-20130526.1943-6m NEW
data/squid@auto-20130213.1746-6m -
data/squid@auto-20130214.1746-6m -
<this continues on in sequence>
data/squid@auto-20130327.1938-6m -
data/squid@auto-20130328.1938-6m -
data/squid@auto-20130329.1938-6m NEW
data/squid@auto-20130330.1938-6m NEW
<this continues on in sequence>
data/squid@auto-20130525.1943-6m NEW
data/squid@auto-20130526.1943-6m NEW

PULL
backup@auto-20130213.1746-6m -
backup@auto-20130214.1746-6m -
<this continues on in sequence>
backup@auto-20130513.1939-6m -
backup@auto-20130514.1939-6m -
backup@auto-20130515.1939-6m NEW
backup/rogue@auto-20130213.1746-6m -
backup/rogue@auto-20130214.1746-6m -
<this continues on in sequence>
backup/rogue@auto-20130430.1939-6m -
backup/rogue@auto-20130501.1939-6m -
backup/server2008@server2008-15.02.2013 -
backup/server2008@auto-20130419.1938-6m -
backup/server2008@auto-20130420.1938-6m -
<this continues on in sequence>
backup/server2008@auto-20130513.1939-6m -
backup/server2008@auto-20130514.1939-6m -
backup/squid@auto-20130213.1746-6m -
backup/squid@auto-20130214.1746-6m -
<this continues on in sequence>
backup/squid@auto-20130513.1939-6m -
backup/squid@auto-20130514.1939-6m -

The full dumps of these listings are online at http://pastebin.com/mY8bYfK9 (PULL) and http://pastebin.com/ZHRyRsHf (PUSH).
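Side note on those listings: PULL's newest snapshot of the top-level dataset is auto-20130515.1939-6m, while PUSH has carried on to auto-20130526.1943-6m, and incremental replication can only resume from a snapshot that still exists on both sides. A rough way to confirm the newest common snapshot (dataset names as in this thread; the auto-YYYYMMDD.HHMM names happen to sort chronologically):

Code:
# On PUSH: snapshot names (the part after @) of the top-level dataset, sorted
zfs list -H -o name -t snapshot -d 1 data | cut -d@ -f2 | sort > /tmp/push-snaps

# On PULL (or via the replication SSH key from PUSH): the same for the destination dataset
zfs list -H -o name -t snapshot -d 1 backup | cut -d@ -f2 | sort > /tmp/pull-snaps

# Names present on both sides; the last one listed is the newest common snapshot,
# i.e. the only possible base for an incremental send
comm -12 /tmp/push-snaps /tmp/pull-snaps | tail -n 5

As long as that common snapshot is not destroyed on either side, a resync should not have to start from scratch.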


Cheers, joo
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
PS: This is very similar to the issue I had previously.

What I did last time was physically relocate the PULL to the office, wipe it, and resync from scratch. Now that it's happened again, I'm trying not to touch anything (last time I deleted some older snapshots, and supposedly that made my setup unrecoverable, forcing me to do a full resync).

Here is the original thread - http://forums.freenas.org/showthread.php?12205-Replication-Interruption

Cheers, joo
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
.. I'm starting to hear my echo again :/

I'm leaving my replication in this state so I can identify and work through a fix. Last time I had this issue, I resynced and all was good until it came up again.

I'd appreciate it if anyone can help get a quick resolution on this.


Cheers, joo
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Are you doing replication tasks every 6 minutes?

If those are 6-minute increments, that's really frequent. My first guess is that the frequency is causing overlap between the snapshots and the replication tasks. In effect, the replication task isn't finishing before the next snapshot (and its replication task) runs, and that's causing problems. Either increase the bandwidth between the two machines, or try something less frequent like 30 minutes or an hour.
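One way to sanity-check the overlap theory on PUSH (process names are the ones that appear in the logs earlier in this thread):

Code:
# Is a replication from the previous run still in flight when the next one fires?
ps auxww | grep -E 'autorepl|zfs send' | grep -v grep

# Recent replication activity and failures from the system log
grep autorepl /var/log/messages | tail -n 20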
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
Thanks for the response, cyberjock.

I have periodic snapshots set up for once a day, not every 6 minutes.

It was replicating fine for weeks. How can I resolve the issue where it's at now, so that I can restart replication and get it back on track?


cheers, joo
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
I've got nothing. Just like before, the only idea I have is to delete all the snapshots on the backup server and start over. Naturally, I know this isn't a good plan for you, since it involves physically relocating your server.

I'm not sure what your problem is in particular, or why you and you alone keep having this weird issue. I'm wondering if it's related to the network performance (or reliability) between the two servers. It's very odd that you are the only one with this issue.

If I were in your shoes, I might set up the primary server to ping the backup server 24x7 for a week and see if you have high packet loss or something. Something is just Not Right about your situation. Unfortunately, I don't know what that "something" is, and I don't have any good ideas. Maybe a RAM test on both machines is in order too?
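A rough way to run that ping test from the FreeNAS shell (hostname taken from the logs earlier in the thread; one packet per second for 24 hours, repeated as needed):

Code:
# Ping PULL once a second for a day; ping prints a loss summary when the count completes
nohup ping -c 86400 remote-freenas.no-ip.org > /var/tmp/ping-pull.log 2>&1 &

# The next day, the last lines show packets transmitted / received / % packet loss
tail -n 5 /var/tmp/ping-pull.log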

Edit: There's a thread around here somewhere where someone wrote a script for replication tasks. I can't remember exactly what it did, but it was supposed to be pretty good. I'd see if you can find that thread and whether that script could help you.
 

joobie

Dabbler
Joined
Apr 4, 2013
Messages
27
Thanks, cyberjock.

I had a dig around through the forum and couldn't locate it. Can I trouble you to have a look when you get a moment?

I'm not sure what the issue is. I'll take it in, wipe it, and sync it again. I'm just conscious that this is the second time I've had to do this, and it's damn heavy. Either way, I've gone without a replicated backup for two months now, so it's about time.

I'd really appreciate it if you could find that script. I'm hesitant to keep using the GUI-based replication for a long period, as this issue may just resurface. Switching over to the script sounds like a more controllable approach.

Cheers, joo
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525

Kernel

Cadet
Joined
Aug 20, 2013
Messages
1
I had a similar problem.
My PUSH server had been offline due to a faulty USB pendrive, and replication tasks failed over the weekend.
I replaced the USB drive, recovered the configuration, and everything looked OK... everything but replication.
It was not working at all: the first server (mainly CIFS shares) was not replicating to the second one, and the second server (an NFS share for virtual machine vdisks) was not replicating to its counterpart.

I copied and pasted the command lines from the logs, trying to understand what was wrong; in the first server's logs I found this command:

/usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 192.168.10.13 "zfs list -Ho name -t snapshot datastor/replica/shire | head -n 1 | cut -d@ -f2"

I tried copying and pasting the command as-is, and I got nothing but the prompt, while I was expecting a list of the remote side's snapshots.
On the remote side I tried issuing the zfs command that it would have received via ssh:

zfs list -Ho name -t snapshot datastor/replica/shire

and I got a list of snapshots; the remote side looked ok.

Then I tried issuing the command:
/usr/bin/ssh -c arcfour256,arcfour128,blowfish-cbc,aes128-ctr,aes192-ctr,aes256-ctr -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -p 22 192.168.10.13

on the "PUSH" server: it's a little bit different from the one I found in the log because I removed the "-q" switch to see any error message and... TA-DA! I was not able to log on the remote side!
I checked the public keys involved and I found that they were missing (maybe because of my USB drive replacement) so I simply re-followed the steps to set up replication and copied the keys over again.
My servers (two HP Proliant MicroServer using FreeNas 8.3.1-p2) were stuck for about five minutes (but the services were working correctly, only SSH and web interface were unusable) but data is now flowing from one server to the other without any problem.
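For anyone who wants to run the same check, it roughly boils down to this (the key path is the stock FreeNAS one, the host is from Roberto's example, and the authorized_keys location assumes the replication key was added to the root account on PULL):

Code:
# On PUSH: is the replication key still there, and does key-only login work?
ls -l /data/ssh/replication
/usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -p 22 192.168.10.13 "hostname"

# On PULL: the PUSH public key must be present in root's authorized_keys
grep -c 'ssh-' /root/.ssh/authorized_keys

BatchMode=yes makes ssh fail immediately instead of prompting for a password, so a missing or rejected key shows up right away.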

Hope this helps!
Ciao,
Roberto
 

Sir.Robin

Guru
Joined
Apr 14, 2012
Messages
554
Have you tried initializing the other side?
 

Sir.Robin

Guru
Joined
Apr 14, 2012
Messages
554

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,402
Sir.Robin said: Was it my proposal that was funny, or the Darth vs Clint?

Darth vs Clint, I believe. Darth, Kernel, appears to have resolved his replication issue.

Kernel said: My servers (two HP ProLiant MicroServers running FreeNAS 8.3.1-p2) were stuck for about five minutes (the services kept working; only SSH and the web interface were unusable), but data is now flowing from one server to the other without any problem.
 