Replication Task Skipping all Snapshots

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
If the snapshots are there the data is there. The datasets are simply not mounted so they cannot be accidentally written to. By "rsync'ing over them" you are destroying your snapshots.

Rule #1: a snapshot contains all the data, files, directories, ... everything ... present in the dataset at the time the snapshot was taken.
Thank you, I will start over. Now how do I delete the files in the pool? Do I need to create the pool again?
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Understood, I will start over again. Now how do I delete the files from the backup machine? Do I need to re-do the pools?
 
Joined
Oct 22, 2019
Messages
3,641
A ZFS backup (destination dataset + snapshots) should never be "used". It should remain a one-way destination. Nor should it ever receive data or replications from anywhere else except the one original source.

You can, however, safely "verify" the data exists by mounting the dataset as "read-only". You can also copy individual files/folders for restoration and recovery operations.

The only time such a backup can be "used" (i.e., "written to") is when it becomes the new source that no longer serves a backup role.
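For example, something along these lines is enough to have a look; the pool and dataset names are placeholders, so adjust them to your own backup:

# keep the replicated copy read-only so nothing can write to it by accident
zfs set readonly=on BackupPool/Backups/Projects
# mount it just to look around, then unmount when done
zfs mount BackupPool/Backups/Projects
ls /mnt/BackupPool/Backups/Projects
zfs unmount BackupPool/Backups/Projects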
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
I was able to view it using the "more" option and here is the log


I feel I should delete the files and rsync again. Do I need to destroy the pool and recreate it?

Thanks @Apollo and @artlessknave. I think the first thing to do would be to get hold of the complete zettarepl debug log file.

However, I am not able to access this file from the GUI "mc" nor from PuTTY. In PuTTY over SSH it says

I also tried

Still the same error.

Any idea how to open or export this log file?
You can use "MC" (midnight commander) to read the content of the log, but you need to be logged in as root or execute it with "sudo" privilege.
"mc" isn't great for this particular case as "zetarepl.py" will generate a huge amount of data. The one line log you provided doesn't tell the full story.

The way I do it is to use Bitvise. It has SFTP file explorer and I can directly download the file to my PC which I can then process with Notepad++.
If you want to retain some level of privacy (dataset name and such) then you could replace the string and then upload the log here.
Remember to set the logging level in the "Replication Task".
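If you prefer the command line over Bitvise, plain SSH works too; on TrueNAS CORE the zettarepl log usually lives at /var/log/zettarepl.log, but double-check the path on your version:

# read it on the server itself
sudo less /var/log/zettarepl.log
# or pull it down to your PC first (hostname is a placeholder), then open it in Notepad++
scp root@your-truenas-host:/var/log/zettarepl.log .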
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
@Patrick M. Hausen I have two machines. I am running hourly snapshot tasks on the primary machine and then using a replication task to transfer them to a backup machine; however, I don't see the data when I do this. I have to rsync to see files on the backup system.

Can you please help me get this right?
The datasets on the backup machine are most likely not mounted. However, you should be able to see the datasets in the web interface.
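If you want to confirm from the shell, something like this (pool/dataset names are placeholders) will show whether the replicated datasets are mounted and where:

# run on the backup machine
zfs get -r mounted,mountpoint YourBackupPool/YourDataset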
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
By "rsync'ing over them" you are destroying your snapshots.

Rule #1: a snapshot contains all the data, files, directories, ... everything ... present in the dataset at the time the snapshot was taken.
This is incorrect. Writing to the dataset after a snapshot already exists doesn't affect the content of the files up to the snapshot time. Only the live files in the dataset could be affected, and that will certainly cause replication to fail if the dataset state has changed since the last replication.
So, instead of deleting the dataset and all the files and snapshots it contains, a rollback is really all that is needed. This assumes the destination pool isn't creating snapshots of the dataset itself.
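A minimal sketch of that rollback, run on the backup machine with placeholder names:

# discard whatever rsync wrote to the live filesystem and return the dataset
# to the state of its latest replicated snapshot
zfs rollback BackupPool/SomeDataset@auto-2024-01-01_00-00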
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
This is incorrect.
I don't get your point. If I roll back a local snapshot, the dataset will be exactly in the state it was in when the snapshot was taken. Also, when I navigate to the hidden .zfs/snapshot directory, I will find all the data exactly as it was at the time the snapshot was taken.

Technically, of course, a snapshot is just a set of pointers and internal data structures, but I think it is fair from a user's point of view to say that the snapshot "contains" the data. If you send/receive the snapshot to a different pool, the destination dataset will contain all the data of the source at the time the snapshot was taken.
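To illustrate (names are placeholders), every snapshot exposes its complete file tree under that hidden directory, even while the live dataset keeps changing:

ls /mnt/SomePool/SomeDataset/.zfs/snapshot/auto-2024-01-01_00-00/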
 
Last edited:

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
I don't get your point. If I roll back a local snapshot, the dataset will be exactly in the state it was in when the snapshot was taken. Also, when I navigate to the hidden .zfs/snapshot directory, I will find all the data exactly as it was at the time the snapshot was taken.

Technically, of course, a snapshot is just a set of pointers and internal data structures, but I think it is fair from a user's point of view to say that the snapshot "contains" the data. If you send/receive the snapshot to a different pool, the destination dataset will contain all the data of the source at the time the snapshot was taken.
The part that I believe is incorrect is this particular sentence:
By "rsync'ing over them" you are destroying your snapshots.
The snapshots themselves are not affected.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Yes, but if the file system is modified, the syncing of snapshots stops there. I wasn't precise, granted.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Yes, but if the file system is modified, the syncing of snapshots stops there. I wasn't precise, granted.
I agree, and if the snapshot on the destination dataset was replicated from the source, then rolling back to that snapshot will revert the file system to its state at the time of the snapshot and ZFS will no longer complain.
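To find the right snapshot to roll back to, you can list the snapshots on both machines and pick the newest name that exists on both sides (dataset name is a placeholder):

zfs list -r -t snapshot -o name,creation -s creation SomePool/SomeDataset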
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
rsync OR replication, not both.
rsync can work well when replication isn't available, but generally replication will be superior.
rsync must read every file on both source and destination to determine changes, while replication simply works from the snapshots themselves.

Unless your whole system is lost, most restores are likely to be just a few files. If needed, you can get access to any snapshot's contents by cloning it outside your replication destination and then making the clone available however you want, like a CIFS share (see the sketch below). This way you are never touching the replication destination, which would risk breaking the sync of source to destination.
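A rough sketch of that restore path, with placeholder names:

# clone a snapshot of the replication destination to a separate dataset
zfs clone BackupPool/Projects@auto-2024-01-01_00-00 BackupPool/restore_tmp
# share /mnt/BackupPool/restore_tmp (e.g. via CIFS), copy out what you need,
# then destroy the clone; the replication destination itself was never touched
zfs destroy BackupPool/restore_tmp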
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Thank you everyone for sharing this very important information. I am deleting the dataset and will recreate it, then set up tasks for replication only. Hope this solves the issue.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Thank you everyone for sharing this very important information. I am deleting the dataset and will recreate it, then set up tasks for replication only. Hope this solves the issue.
If you can live with it, then go ahead.
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
If you can live with it, then go ahead.
[Screenshot: replication task in progress]


Looks like it's going to take a couple of days to complete. It's been running for a few hours now.

Once the replication is complete, I will confirm if the data is copied or not.
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
It finally came to the same point where the previous replication was failing; it seems like there is an error in a snapshot on the primary system.
Below is my attached log; can you please help me solve this?
[2024/03/16 01:52:13] INFO [Thread-12] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.8-hpn14v15)
[2024/03/16 01:52:13] INFO [Thread-12] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) successful!
[2024/03/16 01:52:14] INFO [replication_task__task_1] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2024/03/16 01:52:14] WARNING [replication_task__task_1] [zettarepl.replication.run] Discarding receive_resume_token for destination dataset 'ServerPool/MasterDataSet/Projects' as it is not supported in `replicate` mode
[2024/03/16 01:52:14] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing pull from 'ServerPool/MasterDataset/Projects' to 'ServerPool/MasterDataSet/Projects' of snapshot='auto-2024-03-02_00-00' incremental_base='auto-2024-03-01_00-00' receive_resume_token=None encryption=False
[2024/03/16 01:52:14] INFO [replication_task__task_1] [zettarepl.paramiko.replication_task__task_1.sftp] [chan 5] Opened sftp connection (server version 3)
[2024/03/16 01:52:14] INFO [replication_task__task_1] [zettarepl.transport.ssh_netcat] Automatically chose connect address '10.10.1.50'
[2024/03/16 01:52:42] ERROR [replication_task__task_1] [zettarepl.replication.run] For task 'task_1' unhandled replication error SshNetcatExecException(ExecException(1, 'checksum mismatch or incomplete stream.\nPartially received snapshot is saved.\nA resuming stream can be generated on the sending system by running:\n zfs send -t 1-1395eeb03f-128-789c636064000310a501c49c50360710a715e5e7a69766a630404176c1824af9bbafd72b00d9ec48eaf293b252934b18181a3a6dc0ea30e4d3d28a534b403253e0f26c48f2499525a9c540daa1f2831036fd25f910575c0fdcd26fbbb6d43200499e132c9f97989bcac0109c5a54965a14909f9fa3ef9b585c925ae492589208b4593fa0281fe4c26287c4d2927c5d230323135d03635d03a37803035d0303b07ddc0c887048cecf2d284a2d2ececf668003002a4033c6\n'), None)
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 181, in run_replication_tasks
retry_stuck_replication(
File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/stuck.py", line 18, in retry_stuck_replication
return func()
File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 182, in <lambda>
lambda: run_replication_task_part(replication_task, source_dataset, src_context, dst_context,
File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 279, in run_replication_task_part
run_replication_steps(step_templates, observer)
File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 637, in run_replication_steps
replicate_snapshots(step_template, incremental_base, snapshots, encryption, observer)
File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 720, in replicate_snapshots
run_replication_step(step, observer)
File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 797, in run_replication_step
ReplicationProcessRunner(process, monitor).run()
File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/process_runner.py", line 33, in run
raise self.process_exception
File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/process_runner.py", line 37, in _wait_process
self.replication_process.wait()
File "/usr/local/lib/python3.9/site-packages/zettarepl/transport/ssh_netcat.py", line 203, in wait
raise SshNetcatExecException(connect_exec_error, self.listen_exec_error) from None
zettarepl.transport.ssh_netcat.SshNetcatExecException: Passive side: checksum mismatch or incomplete stream.
Partially received snapshot is saved.
A resuming stream can be generated on the sending system by running:
zfs send -t 1-1395eeb03f-128-789c636064000310a501c49c50360710a715e5e7a69766a630404176c1824af9bbafd72b00d9ec48eaf293b252934b18181a3a6dc0ea30e4d3d28a534b403253e0f26c48f2499525a9c540daa1f2831036fd25f910575c0fdcd26fbbb6d43200499e132c9f97989bcac0109c5a54965a14909f9fa3ef9b585c925ae492589208b4593fa0281fe4c26287c4d2927c5d230323135d03635d03a37803035d0303b07ddc0c887048cecf2d284a2d2ececf668003002a4033c6
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
It finally came to the same point where the previous replication was failing; it seems like there is an error in a snapshot on the primary system.
Below is my attached log; can you please help me solve this?
So, from what I can see, you have the replication task set on the destination system as PULL.
You are also trying to perform a replication from a source which is the same as the destination, which of course you can't (unless under the condition in the paragraph below). In simple terms, you are trying to replicate the dataset to itself.
This is probably the reason the "Resume Token" isn't supported here, but I am not sure. It could be a ZFS version difference.

Can you tell me what your actual hardware setup for the replication is? If you have a source and destination connected via LAN, then there is no issue having the "ServerPool/MasterDataSet/Projects" name and structure the same at the source as well as at the destination.
I just want to rule out errors in the Replication Task configuration.
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
So, from what I can see, you have the replication task set on the destination system as PULL.
You are also trying to perform a replication from a source which is the same as the destination, which of course you can't (unless under the condition in the paragraph below). In simple terms, you are trying to replicate the dataset to itself.
This is probably the reason the "Resume Token" isn't supported here, but I am not sure. It could be a ZFS version difference.

Can you tell me what your actual hardware setup for the replication is? If you have a source and destination connected via LAN, then there is no issue having the "ServerPool/MasterDataSet/Projects" name and structure the same at the source as well as at the destination.
I just want to rule out errors in the Replication Task configuration.
@Apollo, yes, I have two machines with the same dataset structure, and the replication is stopping at a snapshot from a specific date. I even deleted all the snapshots for the failing date, and it is now failing for the next date.
Both systems are connected via LAN.

Below are the snapshots that are completed.
[Screenshot: list of completed snapshots]


I have a feeling there is some kind of corruption in a snapshot on the primary machine.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
@Apollo, yes, I have two machines with the same dataset structure, and the replication is stopping at a snapshot from a specific date. I even deleted all the snapshots for the failing date, and it is now failing for the next date.
Both systems are connected via LAN.

Below are the snapshots that are completed.
[Screenshot: list of completed snapshots]

I have a feeling there is some kind of corruption in a snapshot on the primary machine.
I don't believe in snapshot or dataset corruption; otherwise ZFS would be screaming.
What I don't understand is why there is a resume token issue at all.

Are you on CORE or SCALE? What TrueNAS version are both systems running?
Any quota?
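If you want to double-check, something like this on each machine will list any quotas set under the pool (pool name taken from your log):

zfs get -r quota,refquota ServerPool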
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Currently both machines are on CORE, version TrueNAS-13.0-U6.1. I just updated both machines hoping to resolve the issue.
I have not set any quota. But one other thing: the pool on the primary machine says "ONLINE (Unhealthy)", and I could not solve that either.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Then let's start looking at the pool. That might explain why the replication complains about the resume token.
In your log, there was mention of an error:
For task 'task_1' unhandled replication error SshNetcatExecException(ExecException(1, 'checksum mismatch or incomplete stream.\nPartially received snapshot is saved.
So a portion of the snapshot has been transmitted and saved, but it needs to be resumed, which zettarepl.py can't seem to work out.
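As a side note, if resuming turns out not to be possible, the partially received state can be discarded on the destination so the next run starts that snapshot cleanly; something like this should do it (dataset name taken from your log):

zfs receive -A ServerPool/MasterDataSet/Projects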

Can you post details of your unhealthy pool? Something like:

zpool status -v -x ServerPool
 