Replication stops every Sunday

Status
Not open for further replies.

Eric WK

Cadet
Joined
Aug 8, 2013
Messages
1
Hi forum gurus,

We are new to FreeNAS and have a rather strange replication problem that we can't quite get our heads around.

FreeNAS-8.3.1-RELEASE-p2-x64 (r12686+b770da6_dirty)

Two-server configuration running on FreeBSD 8.3-RELEASE-p7. vm1 is the master and vm2 is the slave. Snapshots are created on vm1 and then replicated to vm2.

ZFS Periodic Snapshot Task configuration: Monday to Friday, 03:00 to 23:45, every 15 minutes, non-recursive. A pretty generic setup. Since we replicate every 15 minutes, we have around 35 snapshots per day.

On weekdays, replication works fine, with no delays. However, on Sunday for the past 2 weeks, vm1 stopped replicating at around 2pm. Searching through the system logs, I found only about 3 to 4 messages with 'zfs send' events. So the replication lag keeps growing and growing. I do not think a scrub is in progress, as I found no record of one in the log files.
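A minimal sketch of how the 'zfs send' events in a log could be counted per hour, to see exactly when replication tails off on Sunday. It assumes the log lines start with a classic syslog timestamp such as "Aug 11 14:02:11"; the function name and that prefix format are illustrative assumptions, not something taken from FreeNAS itself.

```python
import re
from collections import Counter

# Assumed line prefix: "<Mon> <day> <HH>:<MM>:<SS> ", as written by classic syslog.
SYSLOG_PREFIX = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}\s")


def count_zfs_send_per_hour(lines):
    """Count lines mentioning 'zfs send', keyed by (month, day, hour)."""
    counts = Counter()
    for line in lines:
        if "zfs send" not in line:
            continue
        match = SYSLOG_PREFIX.match(line)
        if match is None:
            continue  # line does not carry the assumed timestamp prefix
        month, day, hour = match.groups()
        counts[(month, int(day), int(hour))] += 1
    return counts
```

Feeding it the lines of the system log (for example `count_zfs_send_per_hour(open(path))`, with `path` pointing at whichever log file holds the replication messages) gives one count per hour. Hours with no entry at all are the ones where no send was logged.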

When the replication lag happens, I can sometimes get replication to start briefly by going into the ZFS Replication settings window and adjusting the time or date at which replication should happen. That sometimes kick-starts another zfs send, but shortly afterwards it stops again.

Because replication does not run as it does on weekdays, a big replication lag builds up. Sometimes it is Monday night or Tuesday morning before the lag has fully recovered. Replication then runs fine again until the next Sunday.
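One way to put a number on the lag is to compare the newest snapshot on each side. The sketch below assumes the snapshots follow an `auto-YYYYMMDD.HHMM-<retention>` naming pattern (for example `tank/data@auto-20130811.1400-2w`); that pattern, the dataset name and the function names are assumptions for illustration, so adjust the parsing if the names on your systems look different.

```python
from datetime import datetime


def snapshot_time(name):
    """Parse the timestamp out of an assumed 'dataset@auto-YYYYMMDD.HHMM-...' name.

    Returns None for names that do not follow that pattern.
    """
    _, _, snap = name.partition("@")
    parts = snap.split("-")
    if len(parts) < 2 or parts[0] != "auto":
        return None
    try:
        return datetime.strptime(parts[1], "%Y%m%d.%H%M")
    except ValueError:
        return None


def replication_lag(source_names, target_names):
    """Return (lag, pending): the time gap between the newest snapshot on each
    side, and how many source snapshots are newer than the target's newest.

    Returns None if either side has no parseable snapshot names.
    """
    src = sorted(t for t in map(snapshot_time, source_names) if t is not None)
    dst = sorted(t for t in map(snapshot_time, target_names) if t is not None)
    if not src or not dst:
        return None
    newest_dst = dst[-1]
    pending = sum(1 for t in src if t > newest_dst)
    return src[-1] - newest_dst, pending
```

The two name lists could come from running `zfs list -H -t snapshot -o name` on vm1 and on vm2. A lag that keeps growing between runs, with `pending` rising, matches the behaviour described above.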

This has been happening for the past 2 weeks. I checked and found no cron job that would cause replication to run only 3 or 4 times on Sunday.

What we did find was that vm2 had over 9000 snapshots. They were old snapshots that had never been cleaned up; it seems there is no automatic process that removes old snapshots from the system. So we removed them manually. After that, replication resumed for a while, but then stopped at a particular snapshot. I saw in the logs that zfs ran through the same snapshot repeatedly. I went ahead and deleted the stuck snapshot, and all of a sudden replication resumed again.
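For the manual cleanup, a rough sketch of how the old snapshots on vm2 could be picked out by age before destroying anything. It only builds a list and deletes nothing. It assumes snapshot names embed their creation time as `@auto-YYYYMMDD.HHMM-`; that pattern and the function name are assumptions for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed naming pattern: "<dataset>@auto-YYYYMMDD.HHMM-<retention>".
AUTO_SNAPSHOT = re.compile(r"@auto-(\d{8}\.\d{4})-")


def snapshots_older_than(names, now, max_age):
    """Return the names whose embedded timestamp is older than max_age.

    Names that do not match the assumed pattern (manual snapshots, for
    example) are skipped rather than guessed at.
    """
    old = []
    for name in names:
        match = AUTO_SNAPSHOT.search(name)
        if match is None:
            continue
        try:
            taken = datetime.strptime(match.group(1), "%Y%m%d.%H%M")
        except ValueError:
            continue  # digits present but not a valid date
        if now - taken > max_age:
            old.append(name)
    return old
```

For example, `snapshots_older_than(names, datetime.now(), timedelta(days=14))` with `names` taken from `zfs list -H -t snapshot -o name` returns the candidates. It is worth reviewing that list by hand before passing any of the names to `zfs destroy`.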

So I would like to know whether anyone out there has seen the same problem: replication stops running on Sunday and takes a long time to catch up.

Thanks in advance.
Eric
 