Replication destroys unaffiliated datasets on destination?

Joined
Jan 27, 2020
Messages
577
Something happened after a replication task ran: it somehow deleted two absolutely unaffiliated datasets, including their snapshots (logically), on the target pool.
I just don't really know why, or how the GUI did not warn me about it.
So this was the initial situation: my mainpool got into an unhealthy state because of some write errors (about 3 across 3 disks). smartctl -A on all disks looked fine minutes before I ran the troubled replication, which is why I ignored the unhealthy state. The layout looked like this:

  • mainpool (unhealthy)
    • media
    • backups
    • dump
    • nextcloud
  • jailpool (encrypted)
    • data1
    • data2
    • data3
I set up a replication task to replicate /mnt/jailpool -> /mnt/mainpool/backups, which would normally create the missing datasets below /mnt/mainpool/backups (data1 - data3), and the layout would look like this after the task finishes (a rough command-line sketch follows the layout):

  • mainpool (unhealthy)
    • media
    • backups
      • jailpool (encrypted)
        • data1
        • data2
        • data3
    • dump
    • nextcloud
  • jailpool (encrypted)
    • data1
    • data2
    • data3
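In command-line terms I assume the end result would be roughly equivalent to the following - just a sketch with a made-up snapshot name, not the literal commands the GUI task runs:

Code:
# Take a recursive snapshot of jailpool and all of its children (placeholder name),
# then send the whole tree and receive it as a new dataset under mainpool/backups.
zfs snapshot -r jailpool@manual-2022-03-02
zfs send -R jailpool@manual-2022-03-02 | zfs recv mainpool/backups/jailpool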
Here is how I fucked up:
I set up the task with the destination /mnt/mainpool instead of /mnt/mainpool/backups as I had intended.
After starting the task, I was warned that "encrypted datasets can only be replicated to another encrypted dataset", if I recall correctly. Because I had done a similar replication setup before, I was familiar with that warning, but it did not state that any data loss was imminent.

So the task started, and while it was stuck at 0% I immediately noticed two datasets missing from mainpool. The GUI showed the pool like this:
  • mainpool (unhealthy)
    • backups
    • nextcloud
zfs list confirmed what I was fearing: the datasets media and dump were gone right away, their snapshots were gone, and the free space on the pool went up significantly, which meant the data itself was gone.

I have trouble understanding how that could have happened. I read somewhere that replicating into the root of the pool could lead to data loss, but I really don't understand why. I was under the impression that replication always creates the datasets it replicates if they are not present at the destination, because it's really only snapshots that are transferred.
I somewhat hope that the scrub I started right after noticing the missing datasets will eventually bring them back... let's see.
 
Joined
Oct 22, 2019
Messages
3,641
What options did you configure under the Replication Task?

Did you select only "jailpool" or each child individually under "jailpool"?

How the unintentional destruction was described is unnerving. Your last-ditch attempt might be to restore from a backup, or to try to roll back to a previous transaction (via the "recovery import").
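If it comes to that, the CLI equivalent is presumably a rewind import - roughly like this, assuming the pool has been exported first (the -n flag only does a dry run):

Code:
# Check whether a rewind to an earlier transaction group is even possible (dry run).
zpool export mainpool
zpool import -F -n mainpool
# If the dry run looks viable, perform the actual recovery import,
# which discards the last few transaction groups.
zpool import -F mainpool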
 
Joined
Jan 27, 2020
Messages
577
Joined
Oct 22, 2019
Messages
3,641
I did select only the jailpool and ticked "recursive". And that's it, left it pretty basic.
But what options? Maybe a screenshot of the Replication Task (all of its options)?

I'm surprised it allowed you to continue without the warning/error that it cannot destroy a dataset with existing snapshots / children.

Could you elaborate on this? What is this "recovery import"?
It might not be possible (or allowed) at this point. It's reserved for emergencies, to try to restore your pool to a transaction from several minutes or so ago.

 
Joined
Jan 27, 2020
Messages
577
But what options? Maybe a screenshot of the Replication Task (all of its options)?
Unfortunately I deleted the task right after I saw that it had gone wrong, but basically this is how it was set up: I just set the source dataset to "jailpool" (without child datasets) and the destination to "mainpool", plus I ticked recursive.
I'm surprised it allowed you to continue without the warning/error that it cannot destroy a dataset with existing snapshots / children.
This really is what annoys me: there was definitely no warning about data loss shown.

recovery import
I believe that after I started scrubbing the pool, this is no longer an option.

EDIT:

I may have found a log covering the incident under /var/log/zettarepl.log

Code:
[2022/03/02 10:44:54] INFO     [MainThread] [zettarepl.scheduler.clock] Interrupted
[2022/03/02 10:44:54] INFO     [MainThread] [zettarepl.zettarepl] Scheduled tasks: [<Replication Task 'task_2'>]
[2022/03/02 10:44:55] WARNING  [replication_task__task_2] [zettarepl.replication.run] No incremental base for replication task 'task_2' on dataset 'jail_pool', destroying destination dataset
 
Last edited:
Joined
Jan 27, 2020
Messages
577
I just found out about zpool history - a wonderful command. And there is no record whatsoever of my destroyed datasets; it's not in the log, even though every ZFS operation is logged there... every snapshot, every dataset created, every dataset removed, everything is logged, and there is nothing about my missing datasets... strange
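For reference, this is the kind of query I mean, with the pool name substituted; the -i flag also includes the internally logged events:

Code:
# Show pool history including internal events (-i) in long format (-l),
# then filter for destroy operations.
zpool history -il mainpool | grep -i destroy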
 
Joined
Oct 22, 2019
Messages
3,641
What happens if you try to list your datasets using the command line / SSH terminal?

zfs list -o name,space -t filesystem -r mainpool
 
Joined
Jan 27, 2020
Messages
577
Neither of the missing datasets shows up there.
 
Joined
Oct 22, 2019
Messages
3,641
I don't know what to tell you. :frown:

I'm shocked it destroyed two unrelated datasets like that without even giving you a warning. Did you have a backup of them?
 
Joined
Jan 27, 2020
Messages
577
I'm shocked it destroyed two unrelated datasets like that without even giving you a warning. Did you have a backup of them?
I'm pretty upset about the whole thing too. I'm currently trying to restore some of the stuff; one of the two datasets was just a file dump (literally).
I'm really glad that it didn't hit the other datasets on that pool that hold really important data.
 
Joined
Jan 27, 2020
Messages
577
The scrub finished overnight and the datasets are still missing... :(
 

kirbyhi5

Cadet
Joined
Sep 8, 2017
Messages
6
Hello. I know this is old, but I just wanted to comment that this exact scenario happened to me. I wanted to copy a dataset to another pool.
I made a replication task moving pool1/iocage to pool2 with the recursive option checked. pool2 had 5 datasets on it. I made a mistake and forgot to put pool2/iocage as the destination. Immediately after running it, I saw that 2 child datasets from pool2 were gone. Snapshots of those datasets were also deleted. Nothing in /var/log/messages. It doesn't show them as unmounted. It's like they never existed. I spent hours trying to find out what happened.

zettarepl.log doesn't give me much information either. This is when the error occurred:

Code:
[2022/10/20 13:38:50] WARNING [replication_task__task_1] [zettarepl.replication.run] No incremental base for replication task 'task_1' on dataset 'NVMe2/iocage', destroying destination dataset
[2022/10/20 13:39:17] ERROR [replication_task__task_1] [zettarepl.replication.run] For task 'task_1' unhandled replication error ExecException(1, "cannot unmount '/mnt/kaban/main': pool or dataset is busy\ncannot unmount '/mnt/kaban/media': pool or dataset is busy\n")
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 181, in run_replication_tasks
    retry_stuck_replication(
  File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/stuck.py", line 18, in retry_stuck_replication
    return func()
  File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 182, in <lambda>
    lambda: run_replication_task_part(replication_task, source_dataset, src_context, dst_context,
  File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 279, in run_replication_task_part
    run_replication_steps(step_templates, observer)
  File "/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py", line 542, in run_replication_steps
    step_template.dst_context.shell.exec(["zfs", "destroy", "-r", step_template.dst_dataset])
  File "/usr/local/lib/python3.9/site-packages/zettarepl/transport/interface.py", line 92, in exec
    return self.exec_async(args, encoding, stdout).wait(timeout)
  File "/usr/local/lib/python3.9/site-packages/zettarepl/transport/local.py", line 80, in wait
    raise ExecException(self.process.returncode, stdout)
zettarepl.transport.interface.ExecException: cannot unmount '/mnt/kaban/main': pool or dataset is busy
cannot unmount '/mnt/kaban/media': pool or dataset is busy

It looks like it was attempting to completely overwrite pool2 with pool1/iocage, but it failed prematurely because one of the datasets was busy and could not be unmounted, which thankfully caused it to error out. zpool history -il <poolname> >> history.log also shows what happened:

Code:
2022-10-20.13:39:02 [txg:6882420] destroy kaban/music (1609) (bptree, mintxg=1) [on truenas.arai]
2022-10-20.13:39:04 [txg:6882422] destroy kaban/image@AutoSnap.10-16-2022.12-00 (1446) [on truenas.arai]
2022-10-20.13:39:06 (2253ms) ioctl destroy_snaps


The two datasets that weren't busy were purged completely. Nearly 1 TB of data wiped in an instant, without any warning.
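Judging from the zfs destroy call in that traceback, the "destroying destination dataset" step appears to boil down to a plain recursive destroy of whatever sits at the destination path - roughly this, as my reconstruction rather than a command I actually ran:

Code:
# When the task's destination is the pool root itself, the destroyed
# "destination dataset" is the whole pool's dataset tree (reconstruction).
zfs destroy -r kaban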

There should be a huge warning message that creating a replication task to a pool with existing datasets without specifying a dataset name WILL DESTROY ALL DATASETS AND SNAPSHOTS in the destination pool. I can't believe a simple mistake like that can be so costly. Some of that data I will never be able to obtain again. I'm extremely upset.
 
Last edited:
Joined
Jan 27, 2020
Messages
577
This is cruel and unfortunate. I'm kinda scared of working with replication now; I learned my lesson the hard way. Godspeed to you!
 

salve

Cadet
Joined
Dec 27, 2022
Messages
5
The same thing just happened to me. I wanted to copy a dataset from pool 1 to pool 2, and the replication deleted all datasets and snapshots on pool 2! I'm on TrueNAS-SCALE-22.12.0 o_O
 
Joined
Oct 22, 2019
Messages
3,641
I don't use the GUI for replications. I'm a very simple person, and I just run a script to do occasional backups/replications on the command line.

There's too much ambiguity with the GUI, and it feels unpolished and uncertain (what will "actually" happen, what the tooltips say, the documentation, etc.).

I truly believe that many use-cases could be satisfied if iXsystems simply created a GUI wrapper around Syncoid. This would be useful for users who simply want to send backups to another location or an external drive. Syncoid is very straightforward: it creates a new snapshot, then transfers that snapshot to the destination. You can specify whether or not to include all the intermediate snapshots in between.
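For example, a minimal Syncoid invocation for a recursive local backup might look like this (the pool and dataset names are placeholders):

Code:
# Recursively replicate jailpool and its children to backuppool/jailpool.
# Syncoid takes its own sync snapshot on the source and sends it to the destination.
syncoid --recursive jailpool backuppool/jailpool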
 
Last edited:

salve

Cadet
Joined
Dec 27, 2022
Messages
5
There should be a huge warning message that creating a replication task to a pool with existing datasets without specifying a dataset name WILL DESTROY ALL DATASETS AND SNAPSHOTS in the destination pool. I can't believe a simple mistake like that can be so costly. Some of that data I will never be able to obtain again. I'm extremely upset.

WTF, I had missed that... I just did a test, and if I append the dataset's old name to the end of the destination path, it works. It is very dangerous that everything is deleted without warning... and all the snapshots too :(
 
Joined
Jan 27, 2020
Messages
577
I feel obliged to state that since this incident I have successfully replicated data with various tasks. Since then, I have also switched to SCALE, which works just the same when it comes to ZFS replication. Be cautious and fully aware of what you set up in the task settings, understand the implications of those settings, and have backups available - all of which I foolishly neglected when this happened to me.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
I don't use the GUI for replications. I'm a very simple person, and I just run a script to do occasional backups/replications on the command line.

There's too much ambiguity with the GUI, and it feels unpolished and uncertain (what will "actually" happen, what the tooltips say, the documentation, etc.).

I truly believe that many use-cases could be satisfied if iXsystems simply created a GUI wrapper around Syncoid. This would be useful for users who simply want to send backups to another location or an external drive. Syncoid is very straightforward: it creates a new snapshot, then transfers that snapshot to the destination. You can specify whether or not to include all the intermediate snapshots in between.
Well, the GUI is nice because it handles an extra step that would be a bit more cumbersome on the CLI, which is an unencrypted ZFS send through netcat.
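For completeness, by hand that pattern looks roughly like this (host, port, and snapshot names are placeholders, and the snapshot is assumed to already exist):

Code:
# On the receiving box: listen on a port and pipe the stream into zfs receive.
nc -l 8023 | zfs recv backuppool/jailpool

# On the sending box: stream the snapshot unencrypted across the LAN.
zfs send -R jailpool@manual-2023-01-01 | nc receiving-host 8023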
 