Replication failing

Status
Not open for further replies.

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
My jails are on a single-SSD pool, and I have a local replication task (which was far more complicated to set up than it had any reason to be, but that's a separate issue) running to back them up to my main, RAIDZ2 pool. The snapshot task runs hourly, and the replication follows the snapshot. I couldn't say for sure which FreeNAS version I was running when I set it up, but it's been running without issue for months, including since I upgraded to 11.0-U4 (which looks to have been a couple of months ago).

In the last few days, I've been getting a lot of failures without much information about why. I'll get one email saying,
Code:
Hello,
   The replication failed for the local ZFS ssdpool/jail/boinc while attempting to
   apply incremental send of snapshot auto-20171208.1600-1w -> auto-20171208.1700-1w to localhost

...and another saying,
Code:
Replication ssdpool -> localhost:tank/ssd_backup failed: Failed: ssdpool/jail/urbackup (auto-20171208.1600-1w->auto-20171208.1700-1w)

The system log (/var/log/messages) doesn't seem to have anything relevant (though it has a ton of spam from collectd). How can I figure out what's going on here?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
How is the capacity of the drives doing? By chance are you running out of space? I honestly would have expected you to look at that before you posted but since you didn't specify if you have enough free space, just thought I'd ask.

EDIT: Do you have a quota set for your destination dataset?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I'd hope I would have thought of checking capacity, but a perfectly fair question--there's about 15 TB free on the destination pool. No quota or reservation on the destination.
 

garm

Wizard
Joined
Aug 19, 2017
Messages
1,556
Could it be that there is a missing snapshot in the chain? That you at some point removed one async? I believe that the FreeNAS replication task assumes continuity.
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,477
Could it be that there is a missing snapshot in the chain? That you at some point removed one async? I believe that the FreeNAS replication task assumes continuity.

This was my thought too. I had this error type once and I never was able to pinpoint it but it happened after I had manually deleted some snapshots.
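
For anyone wanting to test the missing-snapshot theory: an incremental send needs a starting snapshot that exists on both sides. Below is a minimal Python sketch, assuming you have collected the snapshot names for the source and destination datasets (for example from `zfs list -H -t snapshot -o name -s creation`). The function name, the destination path, and the sample lists are made up for illustration and are not taken from the actual system. It compares names only, so a match is a hint rather than proof.

```python
def latest_common_snapshot(source_snaps, dest_snaps):
    """Return the newest snapshot label present on both sides, or None.

    Both arguments are lists of full snapshot names ("dataset@label"),
    ordered oldest to newest.
    """
    # Compare only the part after "@", since the dataset paths differ.
    dest_labels = {name.split("@", 1)[1] for name in dest_snaps}
    for name in reversed(source_snaps):
        label = name.split("@", 1)[1]
        if label in dest_labels:
            return label
    return None


# Hypothetical example data: the 1600 snapshot was deleted on the source.
source = [
    "ssdpool/jail/boinc@auto-20171208.1500-1w",
    "ssdpool/jail/boinc@auto-20171208.1700-1w",
]
dest = [
    "tank/ssd_backup/jail/boinc@auto-20171208.1500-1w",
    "tank/ssd_backup/jail/boinc@auto-20171208.1600-1w",
]
print(latest_common_snapshot(source, dest))
```

If the result is `None`, or is older than the snapshot the task is trying to send from, that would be consistent with a broken chain.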
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Could it be that there is a missing snapshot in the chain?
That could be it--I did delete some snapshots around that time, though not the most recent. So, how to address? Nuke the destination dataset, and let the replication run from scratch again?
 

garm

Wizard
Joined
Aug 19, 2017
Messages
1,556
Honestly, I have no clue. I have never set up the native replication task; so far I have never had a target that didn't require encrypted backups.
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,477
That could be it--I did delete some snapshots around that time, though not the most recent. So, how to address? Nuke the destination dataset, and let the replication run from scratch again?

If I am remembering correctly, that is how I fixed it.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Yes, destroying the destination dataset, recreating it, and re-enabling the task seems to have solved the problem. Thanks!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I think this can be solved with some flags for zfs send, but I'm not sure the end result is any different.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Yes, destroying the destination dataset, recreating it, and re-enabling the task seems to have solved the problem. Thanks!
Well, it appeared to have solved the problem, but now the warnings are back. Hmmm...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
How long does an individual replication take?
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,477
Are your destination datasets set to read-only? I had some odd behavior a while ago where I got similar error messages. When I toggled read-only to "off", the errors stopped.

I know this doesn't make sense necessarily, and I believe that replicated datasets default to read-only, but when I flipped read-only off, the problem went away. Possibly a bug?

EDIT: I've now investigated a bit and remember more precisely what happened in my case. For some reason, the replication process had flipped the replicated datasets on the destination to "read-only". I just checked my backup FreeNAS box and they are not set to read-only now (as they should be). So it seems that read-only is not the default but if you check your destination datasets and they are read-only, that might be your problem.
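
The read-only state can also be checked from a shell instead of the GUI. Here is a small Python sketch, assuming the input text comes from something like `zfs get -H -r -t filesystem -o name,value readonly tank/ssd_backup`, which prints tab-separated lines with no header. The helper name and the sample output are illustrative only.

```python
def readonly_datasets(zfs_get_output):
    """Return the dataset names whose readonly value is "on".

    Expects tab-separated "name<TAB>value" lines, one per dataset.
    """
    flagged = []
    for line in zfs_get_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, value = line.split("\t")
        if value.strip() == "on":
            flagged.append(name)
    return flagged


# Hypothetical sample output, not from the system in this thread.
sample = "tank/ssd_backup\toff\ntank/ssd_backup/jail\ton\n"
print(readonly_datasets(sample))
```

An empty list would mean none of the listed datasets is read-only.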
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
How long does an individual replication take?
How would I check that? Though it certainly should be less than the interval between snapshots (one hour), as the total space used on the SSD pool is under 100 GB.
Are your destination datasets set to read-only?
I wouldn't think so, as I didn't set it to be so, but I can't find a place in the GUI to check that.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
How would I check that? Though it certainly should be less than the interval between snapshots (one hour), as the total space used on the SSD pool is under 100 GB.
The most elegant solution that comes to my mind right now would be to check login/logout times for the replication user on the destination.
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,477
I wouldn't think so, as I didn't set it to be so, but I can't find a place in the GUI to check that

Simple. Log in to the GUI on your destination box, then click the Storage tab; one of the columns indicates read-only status.

EDIT: Another thought. Do you have your replication set up to hit the top level of your destination box? Or did you create a dataset on your destination that is the target of the replication?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Simple. Log in to the GUI on your destination box, then click the Storage tab; one of the columns indicates read-only status.
Ah, don't I feel silly--I was looking on the "edit dataset" screen. No, the dataset isn't read-only.
Or did you create a dataset on your destination that is the target of the replication?
It's a dataset. I'm replicating ssdpool to tank/ssd_backup.
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,477
Hmm, then I'm out of ideas. Those are the couple of situations that have caused odd errors for replication tasks on my machine. Strange that it would all of a sudden stop working. Something had to have changed. Maybe try thinking back to that day and see if you changed anything between the last successfully sent snapshot and the one that failed.

Sorry, wish I could help more.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Well, I was about to file a bug on this, and then 11.1 released. If the problem persists after I upgrade (which will probably be a little while, to be on the safe side), I'll still probably do that.
 