Hi all,
For a while now we have been seeing strange issues with replication between two FreeNAS servers, both on FreeNAS-9.3-STABLE-201602031011. The debug.log on the sending server is full of messages like this, for all datasets:
nas# zcat -f debug.log.0.bz2 | less
Mar 3 00:00:01 nas autosnap.py: [tools.autosnap:71] Popen()ing: /sbin/zfs list -t snapshot -H -o name
Mar 3 00:00:37 nas autorepl.py: [tools.autorepl:233] Autosnap replication started
Mar 3 00:00:37 nas autorepl.py: [tools.autorepl:234] temp log file: /tmp/repl-43055
Mar 3 00:00:37 nas autorepl.py: [tools.autorepl:304] Checking dataset ourdata/civicrm
Mar 3 00:00:37 nas autorepl.py: [common.pipesubr:71] Popen()ing: /sbin/zfs list -H -t snapshot -p -o name,creation -r "ourdata/civicrm"
Mar 3 00:00:37 nas autorepl.py: [tools.autorepl:329] ds = civicrm, remotefs = new
Mar 3 00:00:39 nas autorepl.py: [tools.autorepl:337] Unable to create remote dataset new: cannot create 'new/civicrm': dataset already exists
Mar 3 00:00:39 nas autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 8346 remote.duckdns.org "zfs list -H -t snapshot -p -o name,creation -r 'new/civicrm'"
Mar 3 00:00:40 nas autorepl.py: [tools.autorepl:304] Checking dataset ourdata/dc2
Mar 3 00:00:40 nas autorepl.py: [common.pipesubr:71] Popen()ing: /sbin/zfs list -H -t snapshot -p -o name,creation -r "ourdata/dc2"
Mar 3 00:00:42 nas autorepl.py: [tools.autorepl:329] ds = dc2, remotefs = new
Mar 3 00:00:43 nas autorepl.py: [tools.autorepl:337] Unable to create remote dataset new: cannot create 'new/dc2': dataset already exists
Mar 3 00:00:43 nas autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 8346 remote.duckdns.org "zfs list -H -t snapshot -p -o name,creation -r 'new/dc2'"
Mar 3 00:00:45 nas autorepl.py: [tools.autorepl:304] Checking dataset ourdata/dc3
Mar 3 00:00:45 nas autorepl.py: [common.pipesubr:71] Popen()ing: /sbin/zfs list -H -t snapshot -p -o name,creation -r "ourdata/dc3"
Mar 3 00:00:47 nas autorepl.py: [tools.autorepl:329] ds = dc3, remotefs = new
Mar 3 00:00:48 nas autorepl.py: [tools.autorepl:337] Unable to create remote dataset new: cannot create 'new/dc3': dataset already exists
Mar 3 00:00:48 nas autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 8346 remote.duckdns.org "zfs list -H -t snapshot -p -o name,creation -r 'new/dc3'"
This keeps happening all night long, over and over again, for every dataset.
We are also seeing huge amounts of data (400 to 500 GB) being transferred between the two machines. So perhaps the data is transferred but then somehow discarded, and a new replication attempt is made.
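In case it helps with diagnosis, here is a sketch of how one could check whether the two sides still share a common snapshot (if they don't, replication has to fall back to a full send every time, which would match the huge transfers we see). The snapshot names below are made-up placeholders; on the real systems the two lists would come from `zfs list -H -t snapshot -o name -r ourdata/civicrm` on the sender and the equivalent command over ssh on the receiver:

```shell
#!/usr/bin/env bash
# Find the newest snapshot name present on BOTH sides.
# Placeholder lists; replace with real `zfs list` output from each machine.
local_snaps='auto-20160301
auto-20160302
auto-20160303'
remote_snaps='auto-20160228
auto-20160301'

# comm needs sorted input; -12 suppresses lines unique to either side,
# leaving only the names that appear in both lists.
common=$(comm -12 <(sort <<<"$local_snaps") <(sort <<<"$remote_snaps") | sort | tail -n 1)
echo "newest common snapshot: ${common:-NONE}"
```

With the placeholder data this prints `newest common snapshot: auto-20160301`; if it prints NONE for a dataset, an incremental send is impossible and every run would retransmit everything.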
About the replication config: both "Recursively replicate child dataset's snapshots" and "Delete stale snapshots on remote system" are checked.
Does anyone have an idea how to solve this? Replication between these machines had been working for more than a year; this only started happening fairly recently.
On a related note: our debug.log / messages.log only go back a few days. Is there a reliable way to keep more logs? (I understand all changes should be done through the GUI, but I can't find where to adjust this.)
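We suspect the short retention comes from newsyslog rotation. An illustrative entry of the kind we were considering is below (count and size values are our guesses, not applied, since FreeBSD-style config files can be regenerated by FreeNAS and changes are supposed to go through the GUI):

```
# /etc/newsyslog.conf (illustrative only; may be overwritten by FreeNAS)
# logfile            mode count size when flags
/var/log/debug.log    644    30 1000    * JC
```

If there is a supported GUI setting for this, or a remote syslog option, pointers would be appreciated.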
Thanks