Local snapshot retention task fails when remote host is unavailable

CJRoss

Contributor
Joined
Aug 7, 2017
Messages
139
I have a bunch of periodic snapshot and replication tasks configured with different naming schemes and retention timings. Today I noticed that none of my daily snapshots are being cleaned up. New ones are created, but nothing is being deleted.

Looking in the zettarepl log, I see that a retention task (logged as zettarepl.zettarepl) ran a few days ago and deleted the various daily snapshots. Since then, every zettarepl.zettarepl retention run reports "Local retention failed: error listing snapshots" for one remote host. The log then shows zettarepl connecting to the other remote hosts but not destroying any snapshots.

The fact that zettarepl can't list snapshots on that particular remote host is to be expected, as it's down for maintenance. What's unusual is that this appears to prevent all snapshot cleanup, including for snapshots that aren't associated with that server or with any replication task at all. I've taken other hosts down for maintenance and not had this happen.

I believe this has something to do with the fact that the replication task for the down host is a pull, while all of my other tasks are push replication. I'm temporarily bringing the down host back up to see whether snapshots are properly deleted tonight.

Can anyone else replicate this issue or is it a bug unique to my setup?
 
Joined
Oct 22, 2019
Messages
3,641
It could be a safety measure to prevent the destruction of base snapshots needed for an incremental replication to your server.

If zettarepl cannot confirm with both sides, then perhaps it defaults to skipping any destructions. Not sure why this isn't the case for your other (push) replication tasks.
 

CJRoss

That's probably the case and I have no problem with that. I expect it to error out when dealing with snapshots particular to the down host.

What appears to be a bug to me is that the error from the down host causes all snapshot deletions to fail, not just the ones tied to that host.
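
For illustration, the expected behavior could be sketched roughly as follows. This is a hypothetical Python model, not zettarepl's actual code: the names `ListingError`, `plan_deletions`, and `fake_list_remote`, along with the task and host names, are all made up for the example. It assumes each retention task is tied to at most one remote host, and that a listing failure should only block deletions for tasks tied to that host.

```python
from typing import Callable, Dict, List, Optional, Set


class ListingError(Exception):
    """Hypothetical error: a host's snapshots could not be listed (e.g. host is down)."""


def plan_deletions(
    local_expired: Dict[str, List[str]],
    task_remote: Dict[str, Optional[str]],
    list_remote: Callable[[str], Set[str]],
) -> Dict[str, List[str]]:
    """Return, per task, the expired local snapshots that may be destroyed.

    A task whose remote host cannot be listed is skipped on its own.
    Tasks tied to other hosts, or to no host at all, are still processed.
    """
    plan: Dict[str, List[str]] = {}
    for task, expired in local_expired.items():
        host = task_remote.get(task)
        if host is None:
            # Purely local task: no remote side to confirm with.
            plan[task] = list(expired)
            continue
        try:
            list_remote(host)  # only used here as a "can we see the other side?" check
        except ListingError:
            # Safety measure: keep everything for this task only.
            plan[task] = []
            continue
        plan[task] = list(expired)
    return plan


def fake_list_remote(host: str) -> Set[str]:
    """Stand-in for a real remote listing; 'down-host' simulates the host in maintenance."""
    if host == "down-host":
        raise ListingError(host)
    return set()


expired = {
    "daily-local": ["tank/a@daily-1"],
    "pull-from-down": ["tank/b@daily-1"],
    "push-to-up": ["tank/c@daily-1"],
}
remotes = {"pull-from-down": "down-host", "push-to-up": "up-host"}
plan = plan_deletions(expired, remotes, fake_list_remote)
```

In this model only the `pull-from-down` task keeps its expired snapshots; the local-only task and the push task to the reachable host are still cleaned up. The behavior reported in this thread looks more like the listing error escaping the loop and aborting the whole pass.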
 

CJRoss

I can confirm that having the machine down is what was causing the problem. With the host back up, all of the snapshots were deleted appropriately last night. What's interesting is that the local snapshots were deleted first and then all of the remote ones. After that, it created the new snapshots and then pushed those.

Can anyone else replicate this bug? Not sure if it's a zettarepl or TrueNAS issue.
 