ZFS replication and NTP


Daniel Nakamura

Dabbler
Joined
Oct 22, 2013
Messages
10
Howdy all,

My server received an NTP correction which basically put it about a day behind, so some replication tasks didn't happen today. The ZFS snapshots are now out of sync by about a day. Now that NTP is updated and the time is correct, does ZFS automagically run the old replication tasks that didn't happen earlier? Do I need to run the replication tasks by hand? Or do I need to re-initialize the replication task from the very beginning?

messages:Mar 7 13:30:08 H ntpd[7291]: time reset -0.201820 s
messages:Mar 8 19:47:41 H ntpd[7291]: time correction of 40797 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
You must have a wicked Daylight Savings Time swing wherever you live.

Seriously though... did it advance the time or retard the time? I can't tell by your log messages. If it retarded the time I would GUESS that it will still trigger things per the scheduled time. Not sure what would happen if it advanced past your scheduled triggers.
 

Daniel Nakamura

Dabbler
Joined
Oct 22, 2013
Messages
10
scurrier said:
You must have a wicked Daylight Savings Time swing wherever you live.

Seriously though... did it advance the time or retard the time? I can't tell by your log messages. If it retarded the time I would GUESS that it will still trigger things per the scheduled time. Not sure what would happen if it advanced past your scheduled triggers.

I'm sorry, I can't recall exactly what the time difference was. I do remember the screen said Sunday when it was definitely Monday. My question is more about ZFS and how it handles this condition where NTP gets messed up and ZFS replication falls out of sync. I want to avoid re-initializing 2 TB of data; if it must be done, then I should start it sooner rather than later. Only syncing the missing pieces is probably the best solution, but if the system isn't reliable enough to handle this error case on its own, then I should file a bug report so it can get fixed eventually.
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
I'm confused.

Replication utilizes ZFS snapshots, which are incremental. There shouldn't be any data sent that you haven't already sent or wouldn't have to send eventually anyway.

I doubt that ZFS relies on the time to keep snapshots from being "out of sync." By "out of sync" I mean internally confused about which snapshot is more recent, or some other problem like that.

If you want to make sure that you have a recent replication backup, then why not just trigger it again manually?

Or check the log and see if it replicated as expected?
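
Roughly speaking, an incremental send only carries the blocks that changed between two snapshots, so a catch-up should be small. A made-up example (the pool, dataset, and host names here are just placeholders, not anything from your setup):

zfs send tank/data@snap1 | ssh backuphost zfs receive -d backup                        # initial full send
zfs send -i tank/data@snap1 tank/data@snap2 | ssh backuphost zfs receive -d backup     # incremental: only the delta between snap1 and snap2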
 

Daniel Nakamura

Dabbler
Joined
Oct 22, 2013
Messages
10
The main server is happily making new snapshots and destroying old ones. However, when I do a zfs list -t snapshot on the remote replication server, the last snapshot is dated 20140307 at 9 am. There haven't been any new snapshots since then.

Backup/dataset/rsync@auto-20140306.0500-4w 8K - 32.5G -
Backup/dataset/rsync@auto-20140306.0900-4w 8K - 32.5G -
Backup/dataset/rsync@auto-20140306.1300-4w 352K - 32.5G -
Backup/dataset/rsync@auto-20140306.1700-4w 8K - 32.6G -
Backup/dataset/rsync@auto-20140306.2100-4w 8K - 32.6G -
Backup/dataset/rsync@auto-20140307.0500-4w 8K - 32.6G -
Backup/dataset/rsync@auto-20140307.0900-4w 8K - 32.6G -

Regarding the logs, the main server shows some replication activity going on, but it doesn't look like there are any errors.
Mar 10 15:56:01 Main autosnap.py: [tools.autosnap:57] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW Main@auto-20140310.1556-4w
Mar 10 15:56:02 Main autorepl.py: [common.pipesubr:57] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -l root -p 222 192.168.2.2 "zfs list -Hr -o name -t snapshot -d 1 Backup | tail -n 1 | cut -d@ -f2"
Mar 10 15:56:02 Main autorepl.py: [common.pipesubr:71] Executing: (/sbin/zfs send -R -I Main@auto-20140310.1156-4w Main@auto-20140310.1556-4w | /bin/dd obs=1m | /bin/dd obs=1m | /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -l root -p 222 192.168.2.2 "/sbin/zfs receive -F -d Backup && echo Succeeded.") > /tmp/repl-55715 2>&1
Mar 10 15:56:16 Main autorepl.py: [common.pipesubr:57] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -l root -p 222 192.168.2.2 "zfs list -Hr -o name -t snapshot -d 1 Backup | tail -n 1 | cut -d@ -f2"
Mar 10 15:56:16 Main autorepl.py: [common.pipesubr:71] Executing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -l root -p 222 192.168.2.2 "/sbin/zfs inherit -r freenas:state Backup"
Mar 10 15:56:30 Main autorepl.py: [common.pipesubr:71] Executing: /sbin/zfs inherit freenas:state Main@auto-20140310.1156-4w
Mar 10 15:56:30 Main autorepl.py: [common.pipesubr:71] Executing: /sbin/zfs set freenas:state=LATEST Main@auto-20140310.1556-4w
Mar 10 16:00:01 Main autosnap.py: [tools.autosnap:57] Popen()ing: /sbin/zfs snapshot -r Main/dataset/admin@auto-20140310.1600-2w

In the same time period, the replication server has absolutely nothing in the logs.
 

Daniel Nakamura

Dabbler
Joined
Oct 22, 2013
Messages
10
I took one of the "Executing" lines and copied and pasted it into a terminal. Now that showed something:

cannot receive incremental stream: most recent snapshot of Backup/dataset/Rsync does not match incremental source

It looks like I need to replicate all the missing snapshots so the replication server matches the main server. Once that happens, the system should take over and replicate like normal. I guess I answered my own question: the system does not replicate old snapshots on its own. That really shouldn't be the case.
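
For anyone hitting the same thing, a quick way to see where the two sides diverged is to compare the newest snapshot name on each box. Roughly what I did (the host, port, and key path are from my setup as shown in the log above, so adjust for yours):

zfs list -Hr -t snapshot -o name -d 1 Main | tail -n 1
ssh -i /data/ssh/replication -p 222 root@192.168.2.2 "zfs list -Hr -t snapshot -o name -d 1 Backup | tail -n 1"

The first command shows the latest local snapshot, the second shows the latest one the backup box actually has; everything in between is what has to be sent by hand.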
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
Sorry, I think you're pushing beyond what I know now. I still don't really understand what the problem is. Hopefully someone else will chime in with some help.
 

milkman_dan

Cadet
Joined
Mar 11, 2014
Messages
5
My experience with the scripting in FreeNAS' replication setup is that it is pretty delicate, mostly because of the logic it uses to determine the relative states of the two hosts and what exists on the remote side. The script, as you can see above, is creating the wrong incremental stream, so it's failing. Since all of this runs off cron, I'm guessing that's where it got confused.

Have you tried manually constructing a zfs send command with the -I switch and forcing the replication to occur? That has worked for me a couple of times, so long as I didn't delete the replication task in the webGUI. (I don't know why that matters, but deleting the replication task entry and recreating it with the exact same parameters will basically force a full replication to occur so that local knows remote's state; I learned this the hard way over the last couple of weeks.) The scripts really only seem to care about the latest snapshot, i.e., whatever local has that remote doesn't since the last snapshot task fired off. If you watch the log output, you'll see a zfs set freenas:state line where the local machine is changing an attribute on the dataset, setting "LATEST" or "NEW" or "-". So if you can get your remote machine within one snap of the local box, the scripts should pick it up from there and resume normal ops.
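
If you want to see which snapshot the local scripts currently consider current, something along these lines should show it (freenas:state is the property visible in your log; the pool name is yours to fill in):

zfs get -r freenas:state Main | grep -E 'LATEST|NEW'     # which snapshots the local box has tagged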

On the other hand, I've also seen the script correctly create an incremental stream and catch up all the missing snaps on remote, and I'm not sure what the rhyme or reason is for that.

As one more note, when constructing the incremental stream manually, remember to send everything from the snapshot that is the latest on your remote through the latest on your local box. From your listing above, that looks like it would be

Backup/dataset/rsync@auto-20140307.0900-4w and whatever the latest is on local.
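
In other words, something roughly like this (I'm reusing the snapshot, pool, port, and key names visible in your log and listing above; double-check the exact names with zfs list before running, and keep the -R/-I pair so the descendant datasets come along):

zfs send -R -I Main@auto-20140307.0900-4w Main@auto-20140310.1556-4w | \
    ssh -i /data/ssh/replication -p 222 root@192.168.2.2 "zfs receive -F -d Backup"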

Good luck. I've been wrestling with the automated replication in FreeNAS for the past couple of weeks. It seems okay for rack-to-rack in a datacenter, but over a WAN it's a little too fragile for my liking.
 

Daniel Nakamura

Dabbler
Joined
Oct 22, 2013
Messages
10
Regarding my situation, I was able to send the snapshots over from my main machine to the backup using the zfs send command. I wrote a little script to run zfs send for every missing snapshot, since it had been several days since the last automated snapshots were sent. This morning the automated replication process took over, and everything looks good for now.

I wrote a little shell script to compare the local and remote snapshots. It will email me if there is a difference so I can manually do a zfs send if necessary.
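
Nothing fancy, just something along these lines (the mail address is a placeholder, and the host, port, and key path are specific to my setup):

#!/bin/sh
# Compare the newest snapshot name on the main and backup servers and mail me if they differ.
LOCAL=$(zfs list -Hr -t snapshot -o name -d 1 Main | tail -n 1 | cut -d@ -f2)
REMOTE=$(ssh -i /data/ssh/replication -p 222 root@192.168.2.2 \
    "zfs list -Hr -t snapshot -o name -d 1 Backup | tail -n 1 | cut -d@ -f2")
if [ "$LOCAL" != "$REMOTE" ]; then
    echo "Replication is behind: local=$LOCAL remote=$REMOTE" | \
        mail -s "ZFS replication check" admin@example.com
fi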
 