Replication Task Skipping all Snapshots

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Team,

I am having a weird issue recently. This setup has been working fine since May 2023, but now all of a sudden the backup TrueNAS server is not replicating the snapshots.

I am getting this error for all snapshots that are on the main machine:

skipping snapshot ServerPool/MasterDataset/Projects@auto-2024-03-01_16-00 because it was created after the destination snapshot (auto-2024-03-01_14-00)

What could be the issue? I tried a scrub on both machines but it did not help. I have not changed any settings recently either.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
it looks like your snapshots are out of sync and you haven't set it to sync when out of sync (the default, as this can be destructive). it appears to be doing exactly as it's supposed to.

additionally, please post your hardware info.

based on your wording, are you using the backup machine to do a pull replication from the main machine?

what are your retention times on your snapshots? if your retention time is shorter than the time it takes to replicate everything, your snapshots will expire before they can be replicated, putting you out of sync.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Team,

I am having a weird issue recently. This setup has been working fine since May 2023, but now all of a sudden the backup TrueNAS server is not replicating the snapshots.

I am getting this error for all snapshots that are on the main machine:

skipping snapshot ServerPool/MasterDataset/Projects@auto-2024-03-01_16-00 because it was created after the destination snapshot (auto-2024-03-01_14-00)

What could be the issue? I tried a scrub on both machines but it did not help. I have not changed any settings recently either.
This message is only for info and was introduced a few years ago in one of the ZFS updates.
What this means is that the snapshots being replicated don't include the most recent ones.

In your case, it seems you are replicating up to

ServerPool/MasterDataset/Projects@auto-2024-03-01_14-00
but the most recent snapshot (which was taken 2 hours later) is:
ServerPool/MasterDataset/Projects@auto-2024-03-01_16-00
So "zfs send" lets you know upfront that the replication will not be complete due to the omitted recent snapshot. This will be taken care of during the next replication task.
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
it looks like your snapshots are out of sync and you haven't set it to sync when out of sync (the default, as this can be destructive). it appears to be doing exactly as it's supposed to.

additionally, please post your hardware info.

based on your wording, are you using the backup machine to do a pull replication from the main machine?

what are your retention times on your snapshots? if your retention time is shorter than the time it takes to replicate everything, your snapshots will expire before they can be replicated, putting you out of sync.
@artlessknave thank you for your reply. I have multiple snapshot schedules and the retention times are also different; a daily snapshot is retained for 3 months. But this was all working fine for the past 8 months. Now it's neither replicating nor rsyncing. Any idea how to fix this?
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
This message is only for info and was introduced a few years ago in one of the ZFS updates.
What this means is that the snapshots being replicated don't include the most recent ones.

In your case, it seems you are replicating up to

ServerPool/MasterDataset/Projects@auto-2024-03-01_14-00

but the most recent snapshot (which was taken 2 hours later) is:

ServerPool/MasterDataset/Projects@auto-2024-03-01_16-00

So "zfs send" lets you know upfront that the replication will not be complete due to the omitted recent snapshot. This will be taken care of during the next replication task.

@Apollo thank you for looking into this. Though I understand your logic, the replication is still not happening at the end of the day.
The last replication was on March 1; since then I am getting this error and both replication and rsync fail.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
A daily snapshot is retained for 3 months.
hmm. that certainly should be long enough, unless your replication is going over freaking dialup or something...

somehow, however, your snapshots are not in sync. there isn't enough info here for me to really speculate why; you will have to double-check the logic of your snapshot and replication tasks to see if you have a hole somewhere.
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
hmm. that certainly should be long enough, unless your replication is going over freaking dialup or something...

somehow, however, your snapshots are not in sync. there isn't enough info here for me to really speculate why; you will have to double-check the logic of your snapshot and replication tasks to see if you have a hole somewhere.
@artlessknave is there a way to delete all files and start over again?
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
@artlessknave is there a way to delete all files and start over again?
You could share the snapshot lists from the source and destination systems so we can have a look at the structure.
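For example, something like this on each machine would show the snapshot structure (source dataset path taken from your error message; on the backup box use the replication task's destination dataset instead):

zfs list -t snapshot -o name,creation -s creation -r ServerPool/MasterDataset/Projects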

Otherwise, edit your replication task and, under Destination, check "Replication from scratch".
This will delete all snapshots on the destination system and then replicate everything again.

If the destination system has snapshots but they do not have any data in common with the source snapshots, destroy all destination snapshots and do a full replication. Warning: enabling this option can cause data loss or excessive data transfer if the replication is misconfigured.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
This will delete all snapshots on the destination system and then replicate everything again.
more accurately, it will delete all snapshots on the destination that do not match the replication source and configuration, making the destination match the source based on the replication configuration. it will start at the oldest snapshot, then incrementally send the blocks for the following snapshots. if your replication source is large this can take a while, but if you do have any existing snapshot it can use, it will use that.

delete all files and start over again?
the terminology is inaccurate; there are no "files" within replication contexts, exactly. replication works entirely at the snapshot block level, based on time; it doesn't care about individual files, only snapshot-to-snapshot block changes. this is how it does differential replication efficiently.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
@Apollo thank you for looking into this. Though I understand your logic, the replication is still not happening at the end of the day.
The last replication was on March 1; since then I am getting this error and both replication and rsync fail.
You have another problem on your hands, and you are only seeing the tree that hides the forest behind it.

What is happening is as follows:
- During execution of the "zfs send" command, ZFS will perform a sanity check of the dataset and the snapshots it contains. (This is where you are warned about the last snapshot being skipped.)
- Once the sanity check has completed, it will indicate (when the "-vv" option is used) the size/amount of data it is expected to send.
- "zfs receive ..." will start if there are no apparent issues at the destination.

There are also the following behaviors:

- If a snapshot to be sent is already present at the destination, the snapshot is still going to be sent to the destination; however, upon completion of the transfer, the destination will indicate that the snapshot already exists, the transmitted data will be discarded, and you will see the following:

snap pool/Backup/zdataroot/Main_dataset@manual-2024-02-16_16-27 already exists; ignoring

What you really need to focus on are the actual failure conditions, such as:

cannot receive incremental stream: most recent snapshot of pool/Backup/zdataroot/.bhyve_containers does not match incremental source
or
cannot receive incremental stream: destination 'pool/Backup/zdataroot/dataset' does not exist
or
local fs pool/Backup/zdataroot/dataset does not have fromsnap (auto-2023-12-28_00-00 in stream); must have been deleted locally; ignoring

Which appear on the sending end as:

warning: cannot send 'pool/zdataroot/Main_dataset@manual-2024-02-16_16-27': Broken pipe
Which, curiously enough, is reported as a warning rather than an error, even though from this point on the replication has been cancelled/terminated.

To get this level of reporting detail, you need to use the -vv option in the zfs receive command, such as:
zfs receive -vv ...
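As a rough sketch of running the pull manually from the backup machine (host, destination dataset, and snapshot names are placeholders; the standard single -v flags already give useful detail):

ssh main-nas zfs send -v -I ServerPool/MasterDataset/Projects@auto-2024-03-01_14-00 ServerPool/MasterDataset/Projects@auto-2024-03-01_16-00 | zfs receive -v <destination-dataset>

The warnings and "cannot receive ..." errors listed above will then show up directly in your terminal.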

In summary, don't get fixated on this particular warning, as it is not the cause of your failing replication:
skipping snapshot ServerPool/MasterDataset/Projects@auto-2024-03-01_16-00 because it was created after the destination snapshot (auto-2024-03-01_14-00)
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
@urfrndsandy, before you do what @chuck32 and @artlessknave have suggested, I would rather you explore the root of the problem to find out exactly what is happening.

If you are running the replication via the "Replication Tasks" GUI, you can change your task's "Logging Level" in the "ADVANCED REPLICATION" section from "DEFAULT" to "DEBUG", save the change, and run the task.

(screenshot: the replication task's ADVANCED REPLICATION options showing the Logging Level field)


This is going to add all the details of the replication transactions, and the results are going to be saved in the following file:

/var/log/zettarepl.log
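Once the task has run with DEBUG logging, you can read the file from a shell, for example (you may need to be root or prefix the commands with sudo):

tail -n 200 /var/log/zettarepl.log
grep -iE 'error|warn' /var/log/zettarepl.log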
 
Last edited:

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
and both replication and rsync fail.
wait. you aren't trying to use replication AND rsync to the same location are you? if so, that would explain it being out of sync.
if you change 1 file at the destination, then source and destination will no longer match, and replication will fail.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
wait. you aren't trying to use replication AND rsync to the same location are you? if so, that would explain it being out of sync.
if you change 1 file at the destination, then source and destination will no longer match, and replication will fail.
It's an ambiguous statement from @urfrndsandy which needs clarifying, indeed.
If the issue is caused by rsync being applied to a replicated dataset, then it should be easy enough to do a rollback to the last snapshot on the destination.
Hence my suggestion to adjust the "Logging Level" in the replication task, as it would clearly indicate the root of the problem.
If rsync is really the issue, then an error along these lines should be reported:

dataset at destination has been modified...
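If that turns out to be the case, the rollback on the backup machine would look roughly like this (dataset and snapshot names are placeholders; it discards anything written to the dataset after that snapshot):

zfs rollback <destination-dataset>@<latest-replicated-snapshot>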
 
Last edited:

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Thanks @Apollo and @artlessknave. I think the first thing to do would be to get hold of the complete zettarepl debug log file.

However, I am not able to access this file from the GUI, from "mc", or from PuTTY. Over SSH in PuTTY it says
zsh: permission denied: /var/log/zettarepl.log
I also tried
chown root /var/log/zettarepl.log
but I still get the same error.

Any idea how to open or export this log file?
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Thanks @Apollo and @artlessknave. I think the first thing to do would be to get hold of the complete zettarepl debug log file.

However, I am not able to access this file from the GUI, from "mc", or from PuTTY. Over SSH in PuTTY it says

zsh: permission denied: /var/log/zettarepl.log

I also tried

chown root /var/log/zettarepl.log

but I still get the same error.

Any idea how to open or export this log file?
I was able to view it using "more", and here is the log:
[2024/03/06 00:00:00] INFO [Thread-9] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.8-hpn14v15)
[2024/03/06 00:00:00] INFO [Thread-9] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) successful!
[2024/03/06 00:00:01] INFO [replication_task__task_1] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2024/03/06 00:00:01] WARNING [replication_task__task_1] [zettarepl.replication.run] Discarding receive_resume_token for destination dataset 'ServerPool/MasterDataSet/Projects' as it is not supported in `replicate` mode
[2024/03/06 00:00:01] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing pull from 'ServerPool/MasterDataset/Projects' to 'ServerPool/MasterDataSet/Projects' of snapshot='auto-2024-03-01_13-00' incremental_base='auto-2024-03-01_12-00' receive_resume_token=None encryption=False
[2024/03/06 00:00:53] WARNING [replication_task__task_1] [zettarepl.replication.run] For task 'task_1' at attempt 1 recoverable replication error RecoverableReplicationError("skipping snapshot ServerPool/MasterDataset/Projects@auto-2024-03-01_14-00 because it was created after the destination snapshot (auto-2024-03-01_13-00)\nskipping snapshot

I feel I should delete the files and rsync again. Do I need to destroy the pool and recreate it?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
What do you mean by rsync again? You must not use replication tasks and rsync tasks at the same time for the same data.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
You need a snapshot task and a replication task and no rsync. You must not write to the replication destination by other means than the replication itself. ZFS replication and rsync are fundamentally different things working at entirely different layers of the filesystem "stack".
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
@Patrick M. Hausen I have two machines. I am running hourly snapshot tasks on the primary machine and then using a replication task to transfer them to the backup machine; however, I don't see the data when I do this. I have to rsync to see files on the backup system.

Can you please help me get this right?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
If the snapshots are there, the data is there. The datasets are simply not mounted so they cannot be accidentally written to. By rsync'ing over them you are destroying your snapshots.
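For example, on the backup machine you can confirm the data is there without mounting or writing anything (pool and dataset names are placeholders):

zfs list -t snapshot -r <backup-pool>/<destination-dataset> | tail
zfs get mounted,mountpoint <backup-pool>/<destination-dataset>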

Rule #1: a snapshot contains all the data, files, directories, ... everything ... present in the dataset at the time the snapshot was taken.
 