Replication destroys unaffiliated datasets on destination?

Joined
Jan 27, 2020
Messages
577
Something happened after a replication task ran: it somehow deleted two absolutely unaffiliated datasets, including their snapshots (logically), on the target pool.
I just don't really know why, or how the GUI did not warn me about it.
So this was the initial situation: my mainpool got into an unhealthy state because of some write errors (about 3 across 3 disks). smartctl -A on all disks looked fine minutes before I ran the troubled replication, which is why I ignored the unhealthy state. The layout looked like this:

  • mainpool (unhealthy)
    • media
    • backups
    • dump
    • nextcloud
  • jailpool (encrypted)
    • data1
    • data2
    • data3
I set up a replication task to replicate /mnt/jailpool -> /mnt/mainpool/backups, which would normally create the missing datasets below /mnt/mainpool/backups (data1 - data3), and the layout would look like this after the task finishes (a rough command-line sketch follows the layout):

  • mainpool (unhealthy)
    • media
    • backups
      • jailpool (encrypted)
        • data1
        • data2
        • data3
    • dump
    • nextcloud
  • jailpool (encrypted)
    • data1
    • data2
    • data3
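In command-line terms I assume the end result would be roughly equivalent to the following - just a sketch with a made-up snapshot name, not the literal commands the GUI task runs:

Code:
# Take a recursive snapshot of jailpool and all of its children (placeholder name),
# then send the whole tree and receive it as a new dataset under mainpool/backups.
zfs snapshot -r jailpool@manual-2022-03-02
zfs send -R jailpool@manual-2022-03-02 | zfs recv mainpool/backups/jailpool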
Here is how I fucked up:
I set up the task with the destination /mnt/mainpool instead of /mnt/mainpool/backups as I had intended.
After starting the task, I was warned that "encrypted datasets can only be replicated to another encrypted dataset", if I recall correctly. Because I had done a similar replication setup before, I was familiar with that warning, but it did not state that any data loss was imminent.

So the task started, and while it was stuck at 0% I immediately noticed two datasets missing from mainpool. The GUI showed the pool like this:
  • mainpool (unhealthy)
    • backups
    • nextcloud
zfs list confirmed what I was fearing: the datasets media and dump were gone right away, their snapshots were gone, and the free space on the pool went up significantly, which meant the data itself was gone.

I have trouble understanding how that could have happened. I read somewhere that replicating into the root of the pool could lead to data loss, but I really don't understand why. I was under the impression that replication always creates the datasets it replicates if they are not present at the destination, because it's really only snapshots that are transferred.
I somewhat hope that the scrub I started right after noticing the missing datasets will eventually bring them back... let's see.
 
Joined
Oct 22, 2019
Messages
3,641
What options did you configure under the Replication Task?

Did you select only "jailpool" or each child individually under "jailpool"?

How the unintentional destruction was described is unnerving. Your last-ditch attempt might be to restore from a backup, or to try to roll back to a previous transaction (via the "recovery import").
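If it comes to that, the CLI equivalent is presumably a rewind import - roughly like this, assuming the pool has been exported first (the -n flag only does a dry run):

Code:
# Check whether a rewind to an earlier transaction group is even possible (dry run).
zpool export mainpool
zpool import -F -n mainpool
# If the dry run looks viable, perform the actual recovery import,
# which discards the last few transaction groups.
zpool import -F mainpool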
 
Joined
Jan 27, 2020
Messages
577
Joined
Oct 22, 2019
Messages
3,641
I did select only the jailpool and ticked "recursive". And that's it, left it pretty basic.
But what options? Maybe a screenshot of the Replication Task (all of its options)?

I'm surprised it allowed you to continue without the warning/error that it cannot destroy a dataset with existing snapshots / children.

Could you elaborate on this? What is this "recovery import"?
It might not be possible (or allowed) at this point. It's reserved for emergencies, to try to restore your pool to a transaction from several minutes or so ago.

 
Joined
Jan 27, 2020
Messages
577
But what options? Maybe a screenshot of the Replication Task (all of its options)?
Unfortunately I deleted the task right after I saw that it had gone wrong, but basically this is how it was set up: I just set the source dataset to "jailpool" (without child datasets) and the destination to "mainpool", plus I ticked recursive.
I'm surprised it allowed you to continue without the warning/error that it cannot destroy a dataset with existing snapshots / children.
This really is what annoys me: there was definitely no warning about data loss shown.

recovery import
I believe that after I started scrubbing the pool, this is no longer an option.

EDIT:

I may have found a log covering the incident under /var/log/zettarepl.log

Code:
[2022/03/02 10:44:54] INFO     [MainThread] [zettarepl.scheduler.clock] Interrupted
[2022/03/02 10:44:54] INFO     [MainThread] [zettarepl.zettarepl] Scheduled tasks: [<Replication Task 'task_2'>]
[2022/03/02 10:44:55] WARNING  [replication_task__task_2] [zettarepl.replication.run] No incremental base for replication task 'task_2' on dataset 'jail_pool', destroying destination dataset
 
Last edited:
Joined
Jan 27, 2020
Messages
577
I just found out about zpool history - a wonderful command. And there is no record whatsoever of my destroyed datasets; it's not in the log, even though every ZFS operation is logged there... every snapshot, every dataset created, every dataset removed, everything is logged, and there is nothing about my missing datasets... strange
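For reference, this is the kind of query I mean, with the pool name substituted; the -i flag also includes the internally logged events:

Code:
# Show pool history including internal events (-i) in long format (-l),
# then filter for destroy operations.
zpool history -il mainpool | grep -i destroy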
 
Joined
Oct 22, 2019
Messages
3,641
What happens if you try to list your datasets using the command line / SSH terminal?

zfs list -o name,space -t filesystem -r mainpool
 
Joined
Jan 27, 2020
Messages
577
Neither of the missing datasets shows up there.
 
Joined
Oct 22, 2019
Messages
3,641
I don't know what to tell you. :frown:

I'm shocked it destroyed two unrelated datasets like that without even giving you a warning. Did you have a backup of them?
 
Joined
Jan 27, 2020
Messages
577
I'm shocked it destroyed two unrelated datasets like that without even giving you a warning. Did you have a backup of them?
I'm pretty upset about the whole thing too. I'm currently trying to restore some of the stuff; one of the two datasets was just a file dump (literally).
I'm really glad that it didn't hit the other datasets on that pool that hold really important data.
 
Joined
Jan 27, 2020
Messages
577
The scrub finished overnight and the datasets are still missing... :(
 

kirbyhi5

Cadet
Joined
Sep 8, 2017
Messages
6
Hello. I know this is old, but I just wanted to comment that this exact scenario happened to me. I wanted to copy a dataset to another pool.
I made a replication task moving pool1/iocage to pool2 with the recursive option checked. pool2 had 5 datasets on it. I made a mistake and forgot to put pool2/iocage as the destination. Immediately after running it, I saw that 2 child datasets from pool2 were gone. Snapshots of those datasets were also deleted. Nothing in /var/log/messages. It doesn't show them as unmounted. It's like they never existed. I spent hours trying to find out what happened.

zettarepl.log doesn't give me much information either. This is when the error occurred:

Code:
[2022/10/20 13:38:50] WARNING [replication_task__task_1] [zettarepl.replication.run] No incremental base for replication task 'task_1' on dataset 'NVMe2/iocage', destroying destination dataset
[2022/10/20 13:39:17] ERROR [replication_task__task_1] [zettarepl.replication.run] For task 'task_1' unhandled replication error ExecException(1, "cannot unmount '/mnt/kaban/main': pool or dataset is busy\ncannot unmount '/mnt/kaban/media': pool or dataset is busy\n")
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 181, in run_replication_tasks
    retry_stuck_replication(
  File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/stuck.py", line 18, in retry_stuck_replication
    return func()
  File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 182, in <lambda>
    lambda: run_replication_task_part(replication_task, source_dataset, src_context, dst_context,
  File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 279, in run_replication_task_part
    run_replication_steps(step_templates, observer)
  File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 542, in run_replication_steps
    step_template.dst_context.shell.exec(["zfs", "destroy", "-r", step_template.dst_dataset])
  File "/usr/local/lib/python3.9/site-packages/zettarepl/transport/interface.py", line 92, in exec
    return self.exec_async(args, encoding, stdout).wait(timeout)
  File "/usr/local/lib/python3.9/site-packages/zettarepl/transport/local.py", line 80, in wait
    raise ExecException(self.process.returncode, stdout)
zettarepl.transport.interface.ExecException: cannot unmount '/mnt/kaban/main': pool or dataset is busy
cannot unmount '/mnt/kaban/media': pool or dataset is busy

It looks like it was attempting to completely overwrite pool2 with pool1/iocage, but it failed prematurely because one of the datasets was busy and could not be unmounted, which thankfully caused it to error out. zpool history -il <poolname> >> history.log also shows what happened:

Code:
2022-10-20.13:39:02 [txg:6882420] destroy kaban/music (1609) (bptree, mintxg=1) [on truenas.arai]
2022-10-20.13:39:04 [txg:6882422] destroy kaban/image@AutoSnap.10-16-2022.12-00 (1446) [on truenas.arai]
2022-10-20.13:39:06 (2253ms) ioctl destroy_snaps


The two datasets that weren't busy were purged completely. Nearly 1 TB of data wiped in an instant, without any warning.
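Judging from the zfs destroy call in that traceback, the "destroying destination dataset" step appears to boil down to a plain recursive destroy of whatever sits at the destination path - roughly this, as my reconstruction rather than a command I actually ran:

Code:
# When the task's destination is the pool root itself, the destroyed
# "destination dataset" is the whole pool's dataset tree (reconstruction).
zfs destroy -r kaban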

There should be a huge warning message that creating a replication task to a pool with existing datasets without specifying a dataset name WILL DESTROY ALL DATASETS AND SNAPSHOTS in the destination pool. I can't believe a simple mistake like that can be so costly. Some of that data I will never be able to obtain again. I'm extremely upset.
 
Last edited:
Joined
Jan 27, 2020
Messages
577
This is cruel and unfortunate. I'm kinda scared of working with replication now; I learned my lesson the hard way. Godspeed to you!
 

salve

Cadet
Joined
Dec 27, 2022
Messages
5
The same thing just happened to me. I wanted to copy a dataset from pool 1 to pool 2, and the replication deleted all datasets and snapshots on pool 2! I'm on TrueNAS-SCALE-22.12.0 o_O
 
Joined
Oct 22, 2019
Messages
3,641
I don't use the GUI for replications. I'm a very simple person, and I just run a script to do occasional backups/replications on the command line.

There's too much ambiguity with the GUI, and it feels unpolished and uncertain (what will "actually" happen, what the tooltips say, the documentation, etc.).

I truly believe that many use-cases could be satisfied if iXsystems simply created a GUI wrapper around Syncoid. This would be useful for users who simply want to send backups to another location or an external drive. Syncoid is very straightforward: it creates a new snapshot, then transfers that snapshot to the destination. You can specify whether or not to include all the intermediate snapshots in between.
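For example, a minimal Syncoid invocation for a recursive local backup might look like this (the pool and dataset names are placeholders):

Code:
# Recursively replicate jailpool and its children to backuppool/jailpool.
# Syncoid takes its own sync snapshot on the source and sends it to the destination.
syncoid --recursive jailpool backuppool/jailpool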
 
Last edited:

salve

Cadet
Joined
Dec 27, 2022
Messages
5
There should be a huge warning message that creating a replication task to a pool with existing datasets without specifying a dataset name WILL DESTROY ALL DATASETS AND SNAPSHOTS in the destination pool. I can't believe a simple mistake like that can be so costly. Some of that data I will never be able to obtain again. I'm extremely upset.

WTF, I had missed that... I just did a test, and if I append the dataset's old name to the end of the destination path, it works. It is very dangerous that everything is deleted without warning... and all the snapshots too :(
 
Joined
Jan 27, 2020
Messages
577
I feel obliged to state that since this incident I have successfully replicated data with various tasks. Since then, I have also switched to SCALE, which works just the same when it comes to ZFS replication. Be cautious and fully aware of what you set up in the task settings, understand the implications of those settings, and have backups available - all of which I foolishly neglected when this happened to me.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
I don't use the GUI for replications. I'm a very simple person, and I just run a script to do occasional backups/replications on the command line.

There's too much ambiguity with the GUI, and it feels unpolished and uncertain (what will "actually" happen, what the tooltips say, the documentation, etc.).

I truly believe that many use-cases could be satisfied if iXsystems simply created a GUI wrapper around Syncoid. This would be useful for users who simply want to send backups to another location or an external drive. Syncoid is very straightforward: it creates a new snapshot, then transfers that snapshot to the destination. You can specify whether or not to include all the intermediate snapshots in between.
Well, the GUI is nice because it handles an extra step that would be a bit more cumbersome on the CLI, which is an unencrypted ZFS send through netcat.
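For completeness, by hand that pattern looks roughly like this (host, port, and snapshot names are placeholders, and the snapshot is assumed to already exist):

Code:
# On the receiving box: listen on a port and pipe the stream into zfs receive.
nc -l 8023 | zfs recv backuppool/jailpool

# On the sending box: stream the snapshot unencrypted across the LAN.
zfs send -R jailpool@manual-2023-01-01 | nc receiving-host 8023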
 