Replication Task Skipping all Snapshots

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Team,

I am having a weird issue recently. This setup has been working fine since May 2023, but now all of a sudden the backup TrueNAS server is not replicating the snapshots.

I am getting this error for all snapshots that are on the main machine:

skipping snapshot ServerPool/MasterDataset/Projects@auto-2024-03-01_16-00 because it was created after the destination snapshot (auto-2024-03-01_14-00)

What could be the issue? I tried a scrub on both machines but it did not help. I have not changed any settings recently either.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
it looks like your snapshots are out of sync and you haven't set it to sync when out of sync (the default, as this can be destructive). it appears to be doing exactly as it's supposed to.

additionally, please post your hardware info.

based on your wording, are you using the backup machine to do a pull replication from the main machine?

what are your retention times on your snapshots? if your retention time is shorter than the time it takes to replicate everything, your snapshots will expire before they can be replicated, putting you out of sync.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Team,

I am having a weird issue recently. This setup has been working fine since May 2023, but now all of a sudden the backup TrueNAS server is not replicating the snapshots.

I am getting this error for all snapshots that are on the main machine:

skipping snapshot ServerPool/MasterDataset/Projects@auto-2024-03-01_16-00 because it was created after the destination snapshot (auto-2024-03-01_14-00)

What could be the issue? I tried a scrub on both machines but it did not help. I have not changed any settings recently either.
This message is only for info and was introduced a few years ago in one of the ZFS updates.
What this means is that the snapshots being replicated don't include the most recent ones.

In your case, it seems you are replicating up to

ServerPool/MasterDataset/Projects@auto-2024-03-01_14-00
but the most recent snapshot (which was taken 2 hours later) is:
ServerPool/MasterDataset/Projects@auto-2024-03-01_16-00
So "zfs send" lets you know upfront that the replication will not be complete due to the omitted recent snapshot. This will be taken care of during the next replication task.
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
it looks like your snapshots are out of sync and you haven't set it to sync when out of sync (the default, as this can be destructive). it appears to be doing exactly as it's supposed to.

additionally, please post your hardware info.

based on your wording, are you using the backup machine to do a pull replication from the main machine?

what are your retention times on your snapshots? if your retention time is shorter than the time it takes to replicate everything, your snapshots will expire before they can be replicated, putting you out of sync.
@artlessknave thank you for your reply. I have multiple snapshot schedules and the retention times are also different; a daily snapshot is retained for 3 months. But this was all working fine for the past 8 months. Now it's neither replicating nor rsyncing. Any idea how to fix this?
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
This message is only for info and was introduced a few years ago in one of the ZFS updates.
What this means is that the snapshots being replicated don't include the most recent ones.

In your case, it seems you are replicating up to

ServerPool/MasterDataset/Projects@auto-2024-03-01_14-00

but the most recent snapshot (which was taken 2 hours later) is:

ServerPool/MasterDataset/Projects@auto-2024-03-01_16-00

So "zfs send" lets you know upfront that the replication will not be complete due to the omitted recent snapshot. This will be taken care of during the next replication task.

@Apollo thank you for looking into this. Though I understand your logic, the replication is still not happening at the end of the day.
The last replication was on March 1; since then I am getting this error and both replication and rsync fail.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
A daily snapshot is retained for 3 months.
hmm. that certainly should be long enough, unless your replication is going over freaking dialup or something...

somehow, however, your snapshots are not in sync. there isn't enough info here for me to really speculate why; you will have to double-check the logic of your snapshot and replication tasks to see if you have a hole somewhere.
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
hmm. that certainly should be long enough, unless your replication is going over freaking dialup or something...

somehow, however, your snapshots are not in sync. there isn't enough info here for me to really speculate why; you will have to double-check the logic of your snapshot and replication tasks to see if you have a hole somewhere.
@artlessknave is there a way to delete all files and start over again?
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
@artlessknave is there a way to delete all files and start over again?
You could share the snapshot lists from the source and destination systems so we can have a look at the structure.
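For example, something like this on each machine would show the snapshot structure (source dataset path taken from your error message; on the backup box use the replication task's destination dataset instead):

zfs list -t snapshot -o name,creation -s creation -r ServerPool/MasterDataset/Projects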

Otherwise, edit your replication task and, under Destination, check "Replication from scratch".
This will delete all snapshots on the destination system and then replicate everything again.

If the destination system has snapshots but they do not have any data in common with the source snapshots, destroy all destination snapshots and do a full replication. Warning: enabling this option can cause data loss or excessive data transfer if the replication is misconfigured.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
This will delete all snapshots on the destination system and then replicate everything again.
more accurately, it will delete all snapshots on the destination that do not match the replication source and configuration, making the destination match the source based on the replication configuration. it will start at the oldest snapshot, then incrementally send the blocks for the following snapshots. if your replication source is large this can take a while, but if you do have any existing snapshot it can use, it will use that.

delete all files and start over again?
the terminology is inaccurate; there are no "files" within replication contexts, exactly. replication works entirely at the snapshot block level, based on time; it doesn't care about individual files, only snapshot-to-snapshot block changes. this is how it does differential replication efficiently.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
@Apollo thank you for looking into this. Though I understand your logic, the replication is still not happening at the end of the day.
The last replication was on March 1; since then I am getting this error and both replication and rsync fail.
You have another problem on your hands, and you are only seeing the tree that hides the forest behind it.

What is happening is as follows:
- During execution of the "zfs send" command, ZFS will perform a sanity check of the dataset and the snapshots it contains. (This is where you are warned about the last snapshot being skipped.)
- Once the sanity check has completed, it will indicate (when the "-vv" option is used) the size/amount of data it is expected to send.
- "zfs receive ..." will start if there are no apparent issues at the destination.

There are also the following behaviors:

- If a snapshot to be sent is already present at the destination, the snapshot is still going to be sent to the destination; however, upon completion of the transfer, the destination will indicate that the snapshot already exists, the transmitted data will be discarded, and you will see the following:

snap pool/Backup/zdataroot/Main_dataset@manual-2024-02-16_16-27 already exists; ignoring

What you really need to focus on are the actual failure conditions, such as:

cannot receive incremental stream: most recent snapshot of pool/Backup/zdataroot/.bhyve_containers does not match incremental source
or
cannot receive incremental stream: destination 'pool/Backup/zdataroot/dataset' does not exist
or
local fs pool/Backup/zdataroot/dataset does not have fromsnap (auto-2023-12-28_00-00 in stream); must have been deleted locally; ignoring

Which appear on the sending end as:

warning: cannot send 'pool/zdataroot/Main_dataset@manual-2024-02-16_16-27': Broken pipe
Which, curiously enough, is reported as a warning rather than an error, even though from this point on the replication has been cancelled/terminated.

To get this level of reporting detail, you need to use the -vv option in the zfs receive command, such as:
zfs receive -vv ...
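As a rough sketch of running the pull manually from the backup machine (host, destination dataset, and snapshot names are placeholders; the standard single -v flags already give useful detail):

ssh main-nas zfs send -v -I ServerPool/MasterDataset/Projects@auto-2024-03-01_14-00 ServerPool/MasterDataset/Projects@auto-2024-03-01_16-00 | zfs receive -v <destination-dataset>

The warnings and "cannot receive ..." errors listed above will then show up directly in your terminal.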

In summary, don't get fixated on this particular warning, as it is not the cause of your failing replication:
skipping snapshot ServerPool/MasterDataset/Projects@auto-2024-03-01_16-00 because it was created after the destination snapshot (auto-2024-03-01_14-00)
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
@urfrndsandy, before you do what @chuck32 and @artlessknave have suggested, I would rather you explore the root of the problem to find out exactly what is happening.

If you are running the replication via the "Replication Tasks" GUI, you can change your task's "Logging Level" in the "ADVANCED REPLICATION" section from "DEFAULT" to "DEBUG", save the change, and run the task.

(screenshot: the replication task's ADVANCED REPLICATION options showing the Logging Level field)


This is going to add all the details of the replication transactions, and the results are going to be saved in the following file:

/var/log/zettarepl.log
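Once the task has run with DEBUG logging, you can read the file from a shell, for example (you may need to be root or prefix the commands with sudo):

tail -n 200 /var/log/zettarepl.log
grep -iE 'error|warn' /var/log/zettarepl.log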
 
Last edited:

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
and both replication and rsync fail.
wait. you aren't trying to use replication AND rsync to the same location are you? if so, that would explain it being out of sync.
if you change 1 file at the destination, then source and destination will no longer match, and replication will fail.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
wait. you aren't trying to use replication AND rsync to the same location are you? if so, that would explain it being out of sync.
if you change 1 file at the destination, then source and destination will no longer match, and replication will fail.
It's an ambiguous statement from @urfrndsandy which needs clarifying, indeed.
If the issue is caused by rsync being applied to a replicated dataset, then it should be easy enough to do a rollback to the last snapshot on the destination.
Hence my suggestion to adjust the "Logging Level" in the replication task, as it would clearly indicate the root of the problem.
If rsync is really the issue, then an error along these lines should be reported:

dataset at destination has been modified...
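If that turns out to be the case, the rollback on the backup machine would look roughly like this (dataset and snapshot names are placeholders; it discards anything written to the dataset after that snapshot):

zfs rollback <destination-dataset>@<latest-replicated-snapshot>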
 
Last edited:

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Thanks @Apollo and @artlessknave. I think the first thing to do would be to get hold of the complete zettarepl debug log file.

However, I am not able to access this file from the GUI, from "mc", or from PuTTY. Over SSH in PuTTY it says
zsh: permission denied: /var/log/zettarepl.log
I also tried
chown root /var/log/zettarepl.log
but I still get the same error.

Any idea how to open or export this log file?
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Thanks @Apollo and @artlessknave. I think the first thing to do would be to get hold of the complete zettarepl debug log file.

However, I am not able to access this file from the GUI, from "mc", or from PuTTY. Over SSH in PuTTY it says

zsh: permission denied: /var/log/zettarepl.log

I also tried

chown root /var/log/zettarepl.log

but I still get the same error.

Any idea how to open or export this log file?
I was able to view it using "more", and here is the log:
[2024/03/06 00:00:00] INFO [Thread-9] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.8-hpn14v15)
[2024/03/06 00:00:00] INFO [Thread-9] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) successful!
[2024/03/06 00:00:01] INFO [replication_task__task_1] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2024/03/06 00:00:01] WARNING [replication_task__task_1] [zettarepl.replication.run] Discarding receive_resume_token for destination dataset 'ServerPool/MasterDataSet/Projects' as it is not supported in `replicate` mode
[2024/03/06 00:00:01] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing pull from 'ServerPool/MasterDataset/Projects' to 'ServerPool/MasterDataSet/Projects' of snapshot='auto-2024-03-01_13-00' incremental_base='auto-2024-03-01_12-00' receive_resume_token=None encryption=False
[2024/03/06 00:00:53] WARNING [replication_task__task_1] [zettarepl.replication.run] For task 'task_1' at attempt 1 recoverable replication error RecoverableReplicationError("skipping snapshot ServerPool/MasterDataset/Projects@auto-2024-03-01_14-00 because it was created after the destination snapshot (auto-2024-03-01_13-00)\nskipping snapshot

I feel I should delete the files and rsync again. Do I need to destroy the pool and recreate it?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
What do you mean by rsync again? You must not use replication tasks and rsync tasks at the same time for the same data.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
You need a snapshot task and a replication task and no rsync. You must not write to the replication destination by other means than the replication itself. ZFS replication and rsync are fundamentally different things working at entirely different layers of the filesystem "stack".
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
@Patrick M. Hausen I have two machines. I am running hourly snapshot tasks on the primary machine and then using a replication task to transfer them to the backup machine; however, I don't see the data when I do this. I have to rsync to see files on the backup system.

Can you please help me get this right?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
If the snapshots are there, the data is there. The datasets are simply not mounted so they cannot be accidentally written to. By rsync'ing over them you are destroying your snapshots.
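For example, on the backup machine you can confirm the data is there without mounting or writing anything (pool and dataset names are placeholders):

zfs list -t snapshot -r <backup-pool>/<destination-dataset> | tail
zfs get mounted,mountpoint <backup-pool>/<destination-dataset>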

Rule #1: a snapshot contains all the data, files, directories, ... everything ... present in the dataset at the time the snapshot was taken.
 