Replication status shows "Running" but no zfs send processes.

cmh

Explorer
Joined
Jan 7, 2013
Messages
75
I've got several replications running from my FreeNAS system (currently 9.10.2) to a CentOS system running current ZFS on Linux. I had a drive failure on the Linux system, so I'm redoing all the transfers. It's taking some time as there are several TB of data, but it was working well until this afternoon.

Looking in the "replication tasks" tab of storage, I see the largest replication job listed as "Running", but when I go to the command line and look for any zfs send processes to see the progress, I see nothing. Looking on the target system, I see no inbound ssh connections or any related zfs receive processes. I had checked earlier in the day and saw that the zfs send process showed 87%, so initially thought it had completed, but not all the snapshots have been sent.

Source snapshots:

Code:
[root@nas /var/log]# zfs list -t snapshot -r sto/dcp
NAME                            USED  AVAIL  REFER  MOUNTPOINT
sto/dcp@auto-20161213.2107-2w  2.87M      -   873G  -
sto/dcp@auto-20161214.0907-2w  2.87M      -   873G  -
sto/dcp@auto-20161214.2107-2w  2.87M      -   873G  -
sto/dcp@auto-20161215.0907-2w      0      -   873G  -
sto/dcp@auto-20161215.2107-2w      0      -   873G  -
sto/dcp@auto-20161216.0907-2w   120K      -   873G  -
sto/dcp@auto-20161216.2107-2w   184K      -   887G  -
sto/dcp@auto-20161217.0907-2w    96K      -   887G  -
sto/dcp@auto-20161217.2107-2w    96K      -   887G  -
sto/dcp@auto-20161218.0907-2w      0      -   887G  -
sto/dcp@auto-20161218.2107-2w      0      -   887G  -
sto/dcp@auto-20161219.0907-2w      0      -   887G  -
sto/dcp@auto-20161219.2107-2w      0      -   887G  -
sto/dcp@auto-20161220.0907-2w  3.53M      -   887G  -
sto/dcp@auto-20161221.0012-2w    96K      -   887G  -
sto/dcp@auto-20161221.1212-2w    96K      -   887G  -
sto/dcp@auto-20161222.0012-2w    96K      -   887G  -
sto/dcp@auto-20161222.1212-2w    96K      -   887G  -
sto/dcp@auto-20161223.0119-2w    96K      -   887G  -
sto/dcp@auto-20161223.1319-2w    88K      -   887G  -
sto/dcp@auto-20161224.0119-2w      0      -   887G  -
sto/dcp@auto-20161224.1319-2w      0      -   887G  -
sto/dcp@auto-20161225.0119-2w  3.96M      -   887G  -
sto/dcp@auto-20161225.1319-2w  2.80M      -   887G  -
sto/dcp@auto-20161226.0119-2w   255M      -   892G  -
sto/dcp@auto-20161226.1319-2w   184K      -   900G  -
sto/dcp@auto-20161227.0119-2w  3.15M      -   901G  -
sto/dcp@auto-20161227.1319-2w  4.08M      -   901G  -
sto/dcp@auto-20161228.0119-2w    96K      -   901G  -
sto/dcp@auto-20161228.1319-2w    96K      -   901G  -
sto/dcp@auto-20161229.0119-2w      0      -   901G  -
sto/dcp@auto-20161229.1319-2w      0      -   901G  -


Only one target snapshot seems to have made it over:

Code:
8-ewok:~> sudo zfs list -t snapshot -r bak/nas/dcp
NAME                                USED  AVAIL  REFER  MOUNTPOINT
bak/nas/dcp@auto-20161213.2107-2w      0      -   873G  -
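
For anyone comparing the two listings by hand, here is a minimal sketch of how the gap could be found automatically. The helper name missing_snapshots is hypothetical, and it assumes each side's snapshot names have been saved to a file, one per line. It compares only the part after the "@", which is safe here because there are no child datasets.

```shell
# Hypothetical helper: print snapshots named in the first listing but absent from the second.
# Each input file holds one snapshot name per line, for example the output of
#   zfs list -H -o name -t snapshot -r sto/dcp
missing_snapshots() {
    src_list=$1
    dst_list=$2
    src_tmp=$(mktemp) || return 1
    dst_tmp=$(mktemp) || { rm -f "$src_tmp"; return 1; }
    # Keep only the part after '@' so that different dataset paths compare equal.
    sed 's/.*@//' "$src_list" | sort -u > "$src_tmp"
    sed 's/.*@//' "$dst_list" | sort -u > "$dst_tmp"
    # comm -23 prints the lines that appear only in the first sorted file.
    comm -23 "$src_tmp" "$dst_tmp"
    rm -f "$src_tmp" "$dst_tmp"
}
```

Saving the source listing to src.txt and the target listing to dst.txt (both file names are placeholders) and running missing_snapshots src.txt dst.txt would print every snapshot that has not yet arrived on the target.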


I opened the replication task in the FreeNAS UI and closed it, no change. Opened it again and unchecked "enabled", saved it, still no change. It shows "Running" under status, but it's not. No autorepl processes are active.

Not much regarding replication in the syslog, just this:

Code:
Dec 29 01:19:05 nas autosnap.py: [tools.autosnap:615] Autorepl running, skip destroying snapshots
Dec 29 01:19:06 nas autorepl.py: [tools.autorepl:184] Checking if process 18333 is still alive
Dec 29 01:19:06 nas autorepl.py: [tools.autorepl:188] Process 18333 still working, quitting
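
Since the log says autorepl believes process 18333 is still working, one thing worth checking is whether that PID really exists and what it is. This is only a sketch: the function names are made up for illustration, and it does not explain why the status was stuck.

```shell
# Read log lines on stdin and print the PID from any
# "Checking if process N is still alive" message.
autorepl_pid() {
    sed -n 's/.*Checking if process \([0-9][0-9]*\) is still alive.*/\1/p'
}

# Succeed if a process with the given PID exists.
# kill -0 delivers no signal; it only performs the existence/permission check,
# so run it as root to avoid a false "not alive" caused by missing permissions.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}
```

Assuming the syslog is in /var/log/messages, something like pid=$(grep autorepl /var/log/messages | autorepl_pid | tail -n 1) followed by pid_alive "$pid" && ps -p "$pid" -o pid,etime,command would show whether the process autorepl is waiting on is actually a replication job.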


There are two replication tasks which show "Succeeded" and "Up to date" in the status, and three more which are blank as they still need to be run.

Recursively replicate is checked (but there are no child datasets) and delete stale snapshots is unchecked. No limits, no dedicated user, and fast ciphers selected.

Any ideas what may have happened, or what to do to get it right?

Thanks!
 

cmh

Update - since posting, I reloaded the UI and it now shows sending in the status with percentage.

Guessing maybe the resend was triggered by an automatic snapshot?

No indication why it stopped before, but I guess interrupted transfer? Dunno - but it's working again. Was going to delete this post but thought it might be helpful to someone in the future.
 

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
Interesting. It wasn't until today that I started to look into using replication tasks. Mine gets stuck at 'Running' and doesn't progress any further. Both the FreeNAS machine and the remote system (a basic FreeBSD install) show their respective zfs send/recv commands in the process list. It will sit like that and won't do anything else. Maybe something is screwed up in 9.10.2? I need to do more investigating and maybe file a bug report.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Replication from 9.10.1-U4 to 9.10.2 has been working fine for me, but I've been scared of updating the main box. Don't have the time to debug it anytime soon...
 

cmh

m0nkey_ said:
Interesting. It wasn't until today that I started to look into using replication tasks. Mine gets stuck at 'Running' and doesn't progress any further. Both the FreeNAS machine and the remote system (a basic FreeBSD install) show their respective zfs send/recv commands in the process list. It will sit like that and won't do anything else. Maybe something is screwed up in 9.10.2? I need to do more investigating and maybe file a bug report.

To be clear, earlier in the day it had shown "Running" and the appropriate zfs send/receive processes were running. My concern came later, when those processes were gone but the status still said "Running" even though it obviously wasn't.

I admit I wasn't paying close attention, replication is usually fire and forget for me with this setup - it had worked successfully to the same target before the single backup drive failed. I think it generally said "Running" during the transfer of the two earlier datasets. I just looked again and it's got detailed stats: "Sending sto/dcp@auto... (53%)"

Also, if you have a replication running, doing a pgrep -lf sending on the source box will show that same info:

Code:
2-NAS:~$ pgrep -lf sending
42740 zfs: sending sto/dcp@auto-20161229.1319-2w (53%: 522329213880/968336023864)


That's a really cool feature, and nice to see that info integrated into the UI.
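
As a small illustration of what can be done with that process title, here is a sketch that works out how many bytes are still to be sent. The function name remaining_bytes is hypothetical, and it assumes the title keeps the "(NN%: sent/total)" layout shown in the output above.

```shell
# Read `pgrep -lf sending` output on stdin and print total minus sent, in bytes,
# for each line that carries a "(NN%: sent/total)" suffix.
remaining_bytes() {
    sed -n 's|.*(\([0-9]*\)%: \([0-9]*\)/\([0-9]*\)).*|\2 \3|p' |
    while read -r sent total; do
        echo $((total - sent))
    done
}
```

Running pgrep -lf sending | remaining_bytes on the source box during a transfer would print one number per active zfs send; for the line above that comes to 446006809984 bytes, roughly 415 GiB.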
 

George Kyriazis

Dabbler
Joined
Sep 3, 2013
Messages
42
Old thread, but adding some more info, in case somebody hits this:

11.2-U5 replicating to Ubuntu 18.04.3. I got the same situation: Replication status showed as "Running", but no zfs processes running on either source or target machine.

Looked at the /var/log/debug.log file, which showed the replication command that was attempted. I cut and pasted it to the command line, and the culprit was that "lz4c" was not installed on the target machine (Ubuntu).

Installed using "apt install liblz4-tool" and replication started right away.
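
A quick way to check for this ahead of time is sketched below. The helper name have_cmd is made up for illustration; it only tests whether a command is on the PATH of the machine it runs on.

```shell
# Succeed if the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}
```

On the target, have_cmd lz4c || echo "lz4c is missing" would flag the problem. From the FreeNAS side the same test can be run over ssh, for example ssh backup-host 'command -v lz4c', where backup-host is a placeholder for the target's hostname.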
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
That seems rather weird, as all of ZFS' compression bits are in the ZFS kernel module.
 

Ericloewe

Ah, that's much less weird. But, if you're using LZ4 on both pools (or generally the same compression on both pools) you can just send the compressed blocks. It's the -c option.
 

George Kyriazis

Hmm.. "-c" option to what? This was initiated from the GUI Replication Task, which has no room for extra options. In any case, I don't have the same compression on both pools. Since destination is a backup, I have gzip-9 on it.
 