replication too slow?

George Kyriazis · Sep 3, 2013

Hi there,

I have 2 freenas systems configured as follows:

FreeNAS-8.3.1-RELEASE-p2-x64
1 mirrored pool (2x4TB) on each system
1/2 hour snapshots scheduled on PUSH (expiring after 2 days)
daily and weekly snapshots
zfs replication from PUSH to PULL
filesystem ~8% full.

The machines are geographically separated on a corporate network, with bandwidth between them of ~5-15 Mb/sec.

ZFS replication takes too long to complete; in fact, my replication is backlogged. The network is not the bottleneck.

I can divide the replication of each snapshot into 3 observable phases:
1. Network activity. PUSH does a zfs send, PULL does a zfs receive, and everything works great. This phase takes about 15 minutes.
2. No network activity. PUSH has exited zfs send, but PULL is still executing zfs receive. zpool iostat indicates that PULL is still performing disk activity. This also takes about 15 minutes.
3. zfs inherit is executed after zfs send/receive. This one takes about 15 minutes, too, with disk activity on PULL.

Phases 2 and 3 bring the time for one snapshot replication to more than half an hour, so my replication is lagging behind. This didn't always happen; it started recently.

Any way you cut it, ~45 minutes per snapshot, of which only 1/3 is network traffic, seems like too much overhead for replication.
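To put rough numbers on the backlog, here is a minimal sketch. It assumes each of the three phases takes a flat 15 minutes, that a new snapshot is taken every 30 minutes, and that snapshots are replicated one at a time, back to back. The function name `backlog_growth_minutes` is made up for illustration.

```python
def backlog_growth_minutes(snapshots, interval_min=30, phase_min=(15, 15, 15)):
    """Return how many minutes the backlog grows over `snapshots` snapshots.

    Assumption: replication is strictly sequential, so every snapshot that
    takes longer than the snapshot interval adds the difference to the lag.
    """
    per_snapshot = sum(phase_min)                  # 45 minutes with the figures above
    extra_per_snapshot = per_snapshot - interval_min
    return max(0, extra_per_snapshot) * snapshots


if __name__ == "__main__":
    # One day of half-hourly snapshots is 48 snapshots.
    print(backlog_growth_minutes(48))  # 720 minutes, i.e. 12 hours per day
```

Under these assumptions the replication falls a further 12 hours behind for every day of snapshots, which matches the "backlogged" symptom. If only the network phase counted (15 minutes), the function returns 0 and replication would keep up.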

I'd like to get other people's feedback on whether this timing makes sense. What is zfs doing in the "dead" receive period? Why does zfs inherit take so long?

Thanks!

George

cyberjock · Sep 3, 2013

That does sound weird though. How much RAM do the two systems have?

George Kyriazis · Sep 3, 2013

Both systems have 8GB. While PUSH is using the full amount, PULL seems to be only using 4GB, though. (as viewed by the "Reporting" tab of the UI).

cyberjock · Sep 3, 2013

8GB is the minimum for ZFS. I'd try adding more RAM and see if that helps. I will say that snapshots and replication do not take more than 30 minutes on my test systems. In fact, they take less than 5 minutes (typically small amounts of data transferred over a Gb LAN).

Florian Bruckner · Oct 15, 2013

There is a patch on GitHub that aims to resolve this by removing the -r flag from zfs inherit during replication. I am seeing similar behavior and changed my autorepl.py accordingly, only to find out that it is not zfs inherit -r that consumes the time; rather, PULL needs some time to complete the receive. Even after "zfs receive" has returned, the dataset seems to be blocked until the snapshot has been processed properly. So while it looks as if zfs inherit -r requires a lot of I/O, it is really just waiting for the dataset to be "ready" for further processing.

Therefore, zfs inherit is not to blame for this.
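One way to check this kind of claim is to time each command separately instead of timing the whole replication run. The sketch below is a generic timing helper, not code from autorepl.py; the name `time_command` is hypothetical, and the demo runs a harmless Python no-op so that it works on any machine.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv, wait for it to exit, and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    result = subprocess.run(argv, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode, time.monotonic() - start


if __name__ == "__main__":
    # Stand-in command. On PULL you would pass the zfs command you want to
    # measure instead, for example a cheap metadata query on the target
    # dataset issued right after the receive has returned.
    code, seconds = time_command([sys.executable, "-c", "pass"])
    print(f"exit={code} elapsed={seconds:.2f}s")
```

If a cheap command against the freshly received dataset takes minutes right after the receive and returns quickly later, that would support the idea that the wait comes from the dataset being busy rather than from zfs inherit itself.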

Florian Bruckner · Nov 3, 2013

Having upgraded PULL from 4GB to 16GB of RAM, I still see about 2 minutes of full disk activity after the snapshot has been transferred. The pattern looks like this:

                                                capacity     operations     bandwidth
pool                                          alloc   free   read  write   read  write
--------------------------------------------  -----  -----  -----  -----  -----  -----
backupstorage                                 4.96T  3.21T      0    567  3.99K  4.36M
  gptid/1e95b636-bba2-11e1-b617-d027886c597e  1.66T  1.07T      0    194      0  1.45M
  gptid/1f047fe1-bba2-11e1-b617-d027886c597e  1.66T  1.06T      0    179  3.99K  1.45M
  gptid/32ffbcdc-df8d-11e2-85aa-b8975a47bfe2  1.64T  1.08T      0    193      0  1.45M
--------------------------------------------  -----  -----  -----  -----  -----  -----
So this seems to depend not on system memory, but rather on the write performance of the disks and/or the storage adapter.
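As a rough cross-check of the figures above, the average size of each write can be worked out from the operations and bandwidth columns. This is a sketch only: the helper names are made up, and it assumes the K/M/G/T suffixes are powers of 1024.

```python
# Assumption: zpool iostat suffixes are binary multiples (K = 1024 bytes, ...).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Convert a value such as '4.36M' or '0' to a number of bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def avg_write_kib(write_ops, write_bandwidth):
    """Average size of one write operation, in KiB."""
    return parse_size(write_bandwidth) / write_ops / 1024


if __name__ == "__main__":
    print(round(avg_write_kib(567, "4.36M"), 1))  # pool total: about 7.9 KiB
    print(round(avg_write_kib(194, "1.45M"), 1))  # first disk: about 7.7 KiB
```

Writes averaging under 8 KiB each would be consistent with the disks being busy with many small writes rather than with bulk streaming, though that is an interpretation of a single sample, not something the output proves.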
