Replication failing b/c of new dataset?

altano

Dabbler
Joined
Jun 6, 2021
Messages
12
I have this error during replication:

cannot send tank@auto-2021-08-08_00-00 recursively: snapshot tank/root/seafile@auto-2021-08-08_00-00 does not exist

Full log:

Code:
[2021/08/18 14:00:00] INFO     [Thread-79] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.4-hpn14v15)
[2021/08/18 14:00:00] INFO     [Thread-79] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) successful!
[2021/08/18 14:00:01] INFO     [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing push from 'tank' to 'backup/holodeck4-tank' of snapshot='auto-2021-08-08_00-00' incremental_base='auto-2021-08-01_00-00' receive_resume_token=None encryption=False
[2021/08/18 14:00:02] WARNING  [replication_task__task_1] [zettarepl.replication.run] For task 'task_1' at attempt 1 recoverable replication error RecoverableReplicationError("cannot send tank@auto-2021-08-08_00-00 recursively: snapshot tank/root/seafile@auto-2021-08-08_00-00 does not exist\nwarning: cannot send 'tank@auto-2021-08-08_00-00': backup failed\ncannot receive: failed to read from stream")


The snapshot tank/root/seafile@auto-2021-08-08_00-00 does NOT exist because the seafile dataset was created on 8/11, after the 8/8 date for which the system expects to find a snapshot.

It seems like the recursive replication of tank@auto-2021-08-08_00-00 is blindly expecting all CURRENTLY EXISTING datasets to have snapshots going back in time to my first top-level pool snapshot, which isn't a correct assumption (replication would never work whenever there are new datasets).

How can I work around this bad assumption, short of deleting all snapshots that predate the new dataset?

This is my replication config:

[Screenshot: replication task configuration]
 

altano

Dabbler
Joined
Jun 6, 2021
Messages
12
Bump. I'm still stuck on this =\

If no one knows the answer, does anyone know a consultant who would be willing to help me out and isn't too expensive?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
How much control do you have over the target? Also, is the target connected by a fast network link, or is it remote? How large a volume of data are we talking about here?

I think that only the replication task needs to be re-created. On its first run, it should check the destination to see what exists and what doesn't, and restart from there. Not 100% sure about this, though...

I would recommend you run some tests here. Basically, re-create the entire problem with test data (a CLI sketch of the setup follows after this list):

--Create a new dataset at the top of your pool, named "testing_root"
--Under that dataset, create a second one, "testing_sub1"
--Put some data in each one. Of course, no need for large files, just small stuff.
--Create a recursive snapshot task from that root. Go fast and take snapshots every 5 minutes. Be sure to add more data between snapshots.
--Replicate it to your target
--Once you have a few replicated snapshots, create another dataset, "testing_sub2", under your root.
--When you receive your error, delete your replication task and re-create it.
--See whether the re-created task starts over from scratch or just re-syncs and continues from where you were.
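
If you prefer to set the test case up from the shell instead of the UI, something like this should do it (a rough sketch; the pool name "tank" and the snapshot names are placeholders, adjust to your own layout):

Code:
# create the test hierarchy (pool name "tank" is just an example)
zfs create tank/testing_root
zfs create tank/testing_root/testing_sub1

# put a little data in each dataset
dd if=/dev/urandom of=/mnt/tank/testing_root/file1 bs=1M count=1
dd if=/dev/urandom of=/mnt/tank/testing_root/testing_sub1/file1 bs=1M count=1

# take a recursive snapshot of the whole test tree
zfs snapshot -r tank/testing_root@test-0001

# later: add another child dataset, then snapshot again
zfs create tank/testing_root/testing_sub2
zfs snapshot -r tank/testing_root@test-0002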

Good luck,
 

altano

Dabbler
Joined
Jun 6, 2021
Messages
12
Hey @Heracles thanks for responding.

I have full control over the software of the target but none of the hardware (like a VPS at a datacenter).

The link is remote. Both sides are reasonably fast: I was getting ~30-60MB/s when it was working. The volume is ~45TB and I was able to transfer ~25TB before it totally hung for a few weeks.

I think that only the replication task needs to be re-created.
Great suggestion. I just tried this:

1) Disable the original task
2) Reboot the source host
3) Create a new task with the exact same configuration, EXCEPT I disabled "Synchronize Destination Snapshots With Source" (my snapshots already got wiped out once, back when I first set up the remote target server with 40TB already on the target disks, and I don't want a repeat of that)
4) Save, manually "Run Now" the task

The result (as far as I can tell from the logs) is that it ran for 7.5 hours and then produced this WARNING:


Code:
[2021/09/26 01:04:51] WARNING  [replication_task__task_2] [zettarepl.replication.run] For task 'task_2' at attempt 1 recoverable replication error RecoverableReplicationError("Warning: backup/holodeck4-tank/root/truenas-VMs/pbs: property 'mountpoint' does not apply to datasets of this type\nWarning: backup/holodeck4-tank/root/truenas-VMs/pbs: property 'sharesmb' does not apply to datasets of this type\nWarning: backup/holodeck4-tank/root/truenas-VMs/pbs: property 'sharenfs' does not apply to datasets of this type\ncannot receive incremental stream: destination 'backup/holodeck4-tank/root/videos' does not exist\nwarning: cannot send 'tank/root/red@auto-2021-09-13_18-00': signal received")
[2021/09/26 01:04:51] INFO     [replication_task__task_2] [zettarepl.replication.run] After recoverable error sleeping for 1 seconds
[2021/09/26 01:05:09] INFO     [replication_task__task_2] [zettarepl.replication.run] For replication task 'task_2': doing push from 'tank' to 'backup/holodeck4-tank' of snapshot='auto-2021-09-13_19-00' incremental_base='auto-2021-09-13_18-00' receive_resume_token=None encryption=False


There are no errors yet, but there is also no progress on the replication (I can verify there is only trivial traffic on the network link from the TrueNAS dashboard).
 

altano

Dabbler
Joined
Jun 6, 2021
Messages
12
I was hoping U6 (which came with some stalled replication bug fixes) would fix the issue but it didn't.

My replicated datasets continue to replicate just fine. The datasets that were never created on the target continue to not be created, with errors about the dataset not existing. Still not sure how to resolve this.
 

Forssux

Explorer
Joined
Mar 13, 2013
Messages
67
I wish I could help you; replicating 45TB is a lot. I already had trouble with a mere 6TB, as you can see in my post.
This is strange, as I think the whole point of TrueNAS is to keep data safe.

Maybe you can upvote the bug report from my post?
 

Forssux

Explorer
Joined
Mar 13, 2013
Messages
67
@altano so the replication task uses snapshots. This means that the filesystem needs to be unmounted for a brief moment to take the snapshot.
Which is good, because this way there are also snapshots of zvols, iocage, etc. That's because a snapshot works at the block level, not the file level.
Furthermore, I have now started with a smaller dataset, QDocumenten, as a test.
In TrueNAS the difficulty lies in the fact that the replication task can't recognize the snapshot that was taken.
I frequently get the error:
"most recent snapshot of QNonMedia-Backup does not match incremental source"

But I found out that one can send a snapshot manually with zfs send, see https://docs.oracle.com/cd/E26505_01/html/E37384/gbchx.html
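
For reference, a manual send looks roughly like this (only a sketch; the dataset and snapshot names are placeholders for my own setup):

Code:
# initial full send of a snapshot into the backup pool (both pools on the same box; names are examples)
zfs send QData/QDocumenten@snap1 | zfs recv -u QNonMedia-Backup/QDocumenten

# later, an incremental send covering the changes from snap1 to snap2
zfs send -i QData/QDocumenten@snap1 QData/QDocumenten@snap2 | zfs recv -u QNonMedia-Backup/QDocumenten

# for a remote target, pipe through ssh instead, e.g.:
# zfs send -i QData/QDocumenten@snap1 QData/QDocumenten@snap2 | ssh backuphost zfs recv -u backup/QDocumenten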

I start experimenting tomorrow with this.

Kind regards
Guy
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
This means that the filesystem needs to be unmounted for a brief moment to take the snapshot.
No. Snapshots can be taken from a live dataset or zvol any time.
 

altano

Dabbler
Joined
Jun 6, 2021
Messages
12
Yeah, seriously, what @Forssux? Snapshots don't require unmounting, and replication just uses snapshots. zettarepl already uses zfs send under the covers, so there is nothing for you to experiment with.

This thread is deteriorating. To anyone seeing this in the future: the tl;dr is that zettarepl/TrueNAS replication doesn't really work. Just go use something else.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I have replication tasks for VMs and jails applied to the parent dataset and with the "recursive" flag set. They pick up new jails and VMs as they come. I do not start at the top level "pool" dataset but one level below. I wonder why "recursive" is missing from your screenshot. Possibly a bug or something with that top level dataset. What does the corresponding snapshot task look like?

[Screenshot: replication task configuration]
 

Forssux

Explorer
Joined
Mar 13, 2013
Messages
67
No. Snapshots can be taken from a live dataset or zvol any time.

I'm no expert, of course... but the other day a test replication couldn't go through because the replication job couldn't unmount the part that needed to be replicated. That was the error message.

What would be nice is the ability to schedule it every ten minutes if you like, as in the screenshot below.
Also, when one replicates from QData/QDocumenten to another pool, let's say QNonMedia-Backup, it would be nice if a warning appeared that one is about to destroy xGB on QNonMedia-Backup.

What I had last night must be a bug...
One replication job set the pool read-only.
The second replication job assumed it went through and gave the green light... Assumed?
In one of the screenshots (the Pools view) you can tell something's wrong.
This is the actual "Job well done" message.
Luckily the command
Code:
zfs set readonly=off QNonMedia-Backup
sets everything back.


[Four screenshots attached]
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
A replication job sets the destination read-only. Correct. This must be the case, because otherwise subsequent incremental replications won't work. Just leave it that way. It's a backup. Being read-only is the point.
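
If you want to double-check this on a destination, a quick look at the property works (the dataset name here is just the one from this thread; adjust to yours):

Code:
# show the readonly property on the backup pool and its datasets
zfs get -r readonly QNonMedia-Backup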
 

Forssux

Explorer
Joined
Mar 13, 2013
Messages
67
A replication job sets the destination read-only. Correct. This must be the case, because otherwise subsequent incremental replications won't work. Just leave it that way. It's a backup. Being read-only is the point.

OK thanks for that clarification...

But here I have 2 pools in the same system in the basement.
QData is the live pool that is used every day.
QNonMedia-Backup is a pool on a single disk.

The plan is to replicate every 2 weeks to this drive and then put the drive upstairs in a drawer.
The replication task for QData/Test-guyf ran first and set the whole drive read-only.
The task for QData/Documenten ran next and got a green light, but nothing was actually transferred.

The strange thing for me is that in the screenshots of the pools it looked as if QData/Documenten was successful.
Unfortunately, MC told me otherwise, see the screenshots. Is this normal?

Kind regards
Guy
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Don't start at the top level dataset of the pool. Always create at least one level below that like e.g. QData/shares/Documenten and start your recursive replication at QData/shares. That's what works here and if I remember correctly there's a section in the TrueNAS documentation that you should not operate on the top level pool directly.

Other than that I have no idea. You can do a zfs list -t snap on both source and destination and compare.
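
For example (the dataset names below follow the ones used in this thread and are only illustrative):

Code:
# on the source
zfs list -t snap -o name -s creation -r QData/QDocumenten

# on the destination
zfs list -t snap -o name -s creation -r QNonMedia-Backup/QDocumenten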
 

Forssux

Explorer
Joined
Mar 13, 2013
Messages
67
Don't start at the top level dataset of the pool. Always create at least one level below that like e.g. QData/shares/Documenten and start your recursive replication at QData/shares. That's what works here and if I remember correctly there's a section in the TrueNAS documentation that you should not operate on the top level pool directly.

Other than that I have no idea. You can do a zfs list -t snap on both source and destination and compare.
Thanks for this fine advice. Now I just have to find a way to quickly move all my datasets under a new parent dataset one level down.
I didn't find this advice in the new TrueNAS Core docs, but I can surely see a benefit in my case.

Well, it appears I don't have to move anything; renaming is enough:
Code:
# note: the new parent dataset (QData/NonMedia) has to exist before renaming into it
zfs rename QData/urbackup QData/NonMedia/urbackup
zfs rename QData/guyf QData/NonMedia/guyf


I noticed your system specs and I can only dream of such nice servers...
In your experimental system, is TrueNAS SCALE running on top of ESXi and using TrueNAS SCALE as storage?

Thanks for helping
Guy
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
In your experimental system, is TrueNAS SCALE running on top of ESXi and using TrueNAS SCALE as storage?
The TN SCALE is running in ESXi with 2 NVMe drives passed through, but it does not provide storage to ESXi. I decided the 50% capacity penalty of iSCSI was not worth the trouble, so I have a single non-redundant SSD for ESXi and do backups to my TN CORE via NFS and GhettoVCB.
 
Joined
Oct 22, 2019
Messages
3,641
It seems like the recursive replication of tank@auto-2021-08-08_00-00 is blindly expecting all CURRENTLY EXISTING datasets to have snapshots going back in time to my first top-level pool snapshot, which isn't a correct assumption (replication would never work whenever there are new datasets).

How can I work around this bad assumption, short of deleting all snapshots that predate the new dataset?
This thread is deteriorating. To anyone seeing this in the future: the tl;dr is that zettarepl/TrueNAS replication doesn't really work. Just go use something else.

This is one (of multiple) reasons why I use homemade scripts and manually-invoked Cron Tasks, rather than the GUI's built-in Replication Tasks (the Periodic Snapshots and Replication Tasks use zettarepl under-the-hood).

If "-I" (big i), rather than "-i" (little i) was used in their code, you could do what you wanted to in your opening post. However, because of this bug report, you'll see we're stuck with it indefinitely, especially for TrueNAS Core:


I believe it wasn't really well designed from the start, but rather than re-design/re-code it from scratch, it's just going to stay this way. Plus, if there isn't a demand from business or corporate customers, it gets low (if any) priority.
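
To make the difference concrete, here is a rough sketch of the two forms of recursive incremental send (snapshot and dataset names come from the opening post, "backuphost" is a placeholder, and exact behavior can vary between OpenZFS versions):

Code:
# recursive incremental send with -i (lowercase): only the changes between the two named snapshots
zfs send -R -i tank@auto-2021-08-01_00-00 tank@auto-2021-08-08_00-00 | ssh backuphost zfs recv -F backup/holodeck4-tank

# recursive incremental send with -I (uppercase): the same range, but including all intermediate
# snapshots, which is the variant the post above says would have handled the new dataset
zfs send -R -I tank@auto-2021-08-01_00-00 tank@auto-2021-08-08_00-00 | ssh backuphost zfs recv -F backup/holodeck4-tank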
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I can just confirm that a recursive replication task configured in the UI picks up new datasets or drops deleted ones as they come and go. But I never started anything - sharing, snapshots, replication - at the top level dataset of a pool. Always one level beneath that one. I don't know but I strongly suspect this might come into play, here.

Other than that - please open an issue in JIRA and attach a debug file, so developers can have a look at your problem.
 

altano

Dabbler
Joined
Jun 6, 2021
Messages
12
This is one (of multiple) reasons why I use homemade scripts and manually-invoked Cron Tasks, rather than the GUI's built-in Replication Tasks (the Periodic Snapshots and Replication Tasks use zettarepl under-the-hood).

If "-I" (big i), rather than "-i" (little i) was used in their code, you could do what you wanted to in your opening post. However, because of this bug report, you'll see we're stuck with it indefinitely, especially for TrueNAS Core:


I believe it wasn't really well designed from the start, but rather than re-design/re-code it from scratch, it's just going to stay this way. Plus, if there isn't a demand from business or corporate customers, it gets low (if any) priority.
Yikes. I knew the second they added "Almost" to "Full Filesystem Replication" in the GUI that trouble was afoot. I understand the need to keep TrueNAS stable, but I honestly just don't understand how we're supposed to trust zettarepl when it falls over in such basic ways.

I can just confirm that a recursive replication task configured in the UI picks up new datasets or drops deleted ones as they come and go. But I never started anything - sharing, snapshots, replication - at the top level dataset of a pool. Always one level beneath that one. I don't know but I strongly suspect this might come into play, here.
I'm not sure top-level-or-not is part of the problem. I'm assuming what you're saying you have observed as working is this:
  1. Create a datasetGrandParent, and below it create datasetParent
  2. Create snapshotA
  3. Replicate datasetParent@snapshotA recursively for the first time
  4. Create datasetChild below datasetParent
  5. Create snapshotB
  6. Replicate datasetParent@snapshotB recursively/incrementally
  7. Should succeed?
I have a different set of steps that are leading to this issue:
  • Create a datasetGrandParent, and below it create datasetParent
  • Create snapshotA
  • Create datasetChild below datasetParent
  • Create snapshotB
  • Replicate datasetParent@snapshotB recursively for the first time
  • Errors out on trying to replicate datasetParent@snapshotA. Error will say that datasetChild@snapshotA doesn't exist (and it really doesn't exist, of course)
I wouldn't be surprised if everything works incrementally, but TrueNAS chokes trying to find the new datasetChild under the original snapshot name (e.g. datasetChild@snapshotA), which obviously doesn't exist because the dataset wasn't there to be snapshotted at that point. That's what blows up.
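For anyone who wants to reproduce this outside the GUI, my second list maps roughly onto the commands below (placeholder pool/dataset/host names; the final command only approximates what the error message in my first post suggests the middleware is running):

Code:
zfs create tank/datasetGrandParent
zfs create tank/datasetGrandParent/datasetParent
zfs snapshot -r tank/datasetGrandParent/datasetParent@snapshotA

# the child only exists from this point on, so it has no @snapshotA
zfs create tank/datasetGrandParent/datasetParent/datasetChild
zfs snapshot -r tank/datasetGrandParent/datasetParent@snapshotB

# first-ever recursive replication: sending the oldest snapshot recursively
# fails because datasetChild@snapshotA does not exist
zfs send -R tank/datasetGrandParent/datasetParent@snapshotA | ssh backuphost zfs recv backup/datasetParent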
Other than that - please open an issue in JIRA and attach a debug file, so developers can have a look at your problem.
I might have done that if I had known about TrueNAS' JIRA before I made this forum post. I'm an avid bug reporter, but honestly there's zero chance I'm going to try zettarepl again. Despite success in my local testing sandbox with small datasets, I haven't successfully replicated my 45TB dataset even once, after months of attempts and working around bugs like this one. Replication needs to be foolproof, and we're just too far away from that for me to trust it.
 