replication too slow?

Status
Not open for further replies.

George Kyriazis

Dabbler
Joined
Sep 3, 2013
Messages
42
Hi there,

I have 2 freenas systems configured as follows:

FreeNAS-8.3.1-RELEASE-p2-x64
1 mirrored pool (2x4TB) on each system
1/2 hour snapshots scheduled on PUSH (expiring after 2 days)
daily and weekly snapshots
zfs replication from PUSH to PULL
filesystem ~8% full.

Machines are geographically separated on a corporate network, with bandwidth between them of ~5-15Mb/sec.

ZFS replication takes too long to complete; in fact, my replication is backlogged. The network is not the bottleneck.

I can divide the replication of each snapshot into 3 observable phases:
1. Network activity. PUSH does a zfs send, PULL does a zfs receive, and everything works great. This phase takes about 15 minutes.
2. No network activity. PUSH has exited zfs send, but PULL is still executing zfs receive. zpool iostat indicates that PULL is still performing disk activity. This also takes about 15 minutes.
3. zfs inherit is executed after zfs send/receive. This one takes about 15 minutes, too, with disk activity on PULL.

Phases 2 and 3 push the replication time of each snapshot past half an hour, so my replication is lagging behind. This didn't always happen; it started recently.

Any way you cut it, ~45 minutes per snapshot, of which only 1/3 is network traffic, seems like too much overhead for replication.

I'd like other people's feedback on whether this timing makes sense. What is zfs doing in the "dead" receive period? Why does zfs inherit take so long?
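To put a number on "lagging behind", here is a rough sketch of the backlog arithmetic using only the figures from this post (a snapshot every 30 minutes, about 45 minutes to replicate each one). The function name and the simplifications (back-to-back replication, constant duration, no snapshot expiry) are assumptions made for illustration, not how FreeNAS schedules replication.

```python
# Figures taken from the post: a snapshot every 30 minutes,
# roughly 45 minutes to replicate each one (3 phases x ~15 minutes).
SNAPSHOT_INTERVAL_MIN = 30
REPLICATION_TIME_MIN = 45


def backlog_after(hours, interval_min=SNAPSHOT_INTERVAL_MIN,
                  replication_min=REPLICATION_TIME_MIN):
    """Return how many snapshots are still waiting after `hours` hours.

    Simplifying assumptions (not from the post): replication runs back to
    back with no idle time, every snapshot takes the same time to
    replicate, and no snapshot expires in the meantime.
    """
    total_min = hours * 60
    created = total_min // interval_min
    # Replication can never get ahead of snapshot creation.
    replicated = min(created, total_min // replication_min)
    return created - replicated


if __name__ == "__main__":
    for hours in (6, 24, 48):
        print(f"after {hours:2d} h: {backlog_after(hours)} snapshots behind")
```

Under these assumptions the backlog grows by one snapshot every 90 minutes, so replication can only catch up if the per-snapshot time drops below the 30-minute interval.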

Thanks!

George
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That does sound weird though. How much RAM do the two systems have?
 

George Kyriazis

Dabbler
Joined
Sep 3, 2013
Messages
42
Both systems have 8GB. While PUSH is using the full amount, PULL seems to be only using 4GB, though. (as viewed by the "Reporting" tab of the UI).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
8GB is the minimum for ZFS. I'd try adding more RAM and see if that helps. I will say that snapshots and replication do not take more than 30 minutes on my test systems. In fact, they take less than 5 minutes (typically small amounts of data transferred over Gb LAN).
 
Joined
Oct 15, 2013
Messages
2
There is a patch on GitHub that aims to resolve this by removing the -r flag from zfs inherit during replication. I am seeing similar behavior and changed my autorepl.py accordingly, only to find that it is not zfs inherit -r that consumes the time; rather, PULL needs some time to finish the receive. At that point "zfs receive" has already returned, but the dataset seems to be blocked until the snapshot has been fully processed. So while it looks as if zfs inherit -r requires a lot of I/O, it is really just waiting for the dataset to be "ready" for further processing.

Therefore, zfs inherit is not to blame for this.
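One way to check this on your own PULL system is to time the command with and without -r over separate replication cycles. The sketch below is only a timing helper: `timed_run` and `inherit_cmd` are hypothetical names, and the property and dataset are placeholders you would have to take from your own autorepl.py. Note that zfs inherit changes property state, so only time the exact command that replication already runs.

```python
import subprocess
import sys
import time


def inherit_cmd(dataset, prop, recursive):
    """Build the argv for `zfs inherit [-r] <prop> <dataset>`.

    `prop` and `dataset` are placeholders; use the values that your
    replication script actually passes.
    """
    argv = ["zfs", "inherit"]
    if recursive:
        argv.append("-r")
    return argv + [prop, dataset]


def timed_run(argv):
    """Run a command and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    proc = subprocess.run(argv, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)
    return proc.returncode, time.monotonic() - start


if __name__ == "__main__":
    # Harmless self-check that does not touch ZFS at all.
    code, elapsed = timed_run([sys.executable, "-c", "pass"])
    print(f"exit={code} elapsed={elapsed:.3f}s")
```

If both variants take about as long when run right after a receive, that would support the explanation above: the time goes into waiting for the dataset, not into the recursion.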
 
Joined
Oct 15, 2013
Messages
2
Having upgraded PULL from 4GB to 16GB of RAM, I still see about 2 minutes of full disk activity after the snapshot has been transferred. The pattern looks like this:

                                              capacity     operations     bandwidth
pool                                        alloc   free   read  write   read  write
------------------------------------------  -----  -----  -----  -----  -----  -----
backupstorage                               4.96T  3.21T      0    567  3.99K  4.36M
gptid/1e95b636-bba2-11e1-b617-d027886c597e  1.66T  1.07T      0    194      0  1.45M
gptid/1f047fe1-bba2-11e1-b617-d027886c597e  1.66T  1.06T      0    179  3.99K  1.45M
gptid/32ffbcdc-df8d-11e2-85aa-b8975a47bfe2  1.64T  1.08T      0    193      0  1.45M
------------------------------------------  -----  -----  -----  -----  -----  -----
So this seems not to depend on system memory, but rather on the write performance of the disks and/or the storage adapter.
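As a rough cross-check of the figures above, dividing write bandwidth by write operations gives the average size of each write. This is a minimal sketch: the helper names are made up for illustration, it assumes the K/M suffixes are powers of 1024, and it is based on a single sample, so treat the result as indicative only.

```python
# Assumption: size suffixes in the iostat output are powers of 1024.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Convert a figure such as '4.36M' or '0' to bytes."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return float(text[:-1]) * UNITS[text[-1].upper()]
    return float(text)


def avg_write_kib(write_ops, write_bandwidth):
    """Average size of one write in KiB (0.0 when there are no writes)."""
    ops = float(write_ops)
    if ops == 0:
        return 0.0
    return parse_size(write_bandwidth) / ops / 1024


# (write operations, write bandwidth) copied from the sample above.
SAMPLE = {
    "backupstorage": ("567", "4.36M"),
    "gptid/1e95b636...": ("194", "1.45M"),
    "gptid/1f047fe1...": ("179", "1.45M"),
    "gptid/32ffbcdc...": ("193", "1.45M"),
}

if __name__ == "__main__":
    for name, (ops, bandwidth) in SAMPLE.items():
        print(f"{name:20s} {avg_write_kib(ops, bandwidth):5.1f} KiB per write")
```

For this sample the average comes out at roughly 8 KiB per write, both for the pool and for each disk. Many small writes at a few hundred operations per second would fit the idea that the disks, not memory, are the limiting factor.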
 