SOLVED Rolling back multiple snapshots fails

Status
Not open for further replies.

deafen (Explorer, joined Jan 11, 2014, 71 messages)
I find myself in need of rolling back a couple of snapshots. Here's the relevant list:

Code:
tank/Archive@auto-20140907.1710-2w         0      -  1.35T  -
tank/Archive@auto-20140907.1715-2w         0      -  1.35T  -
tank/Archive@auto-20140907.1720-2w         0      -  1.35T  -
tank/Archive@auto-20140907.1725-2w         0      -  1.35T  -
tank/Archive@auto-20140907.1730-2w         0      -  1.35T  -


I need to roll back to the 1710 snapshot. Note that none of these use any data (the column with 0 is the "Used" column).

The GUI will let me roll back to the latest one. When I try to roll back further from the CLI, I get this:

Code:
[root@omega] ~# zfs rollback -r tank/Archive@auto-20140907.1725-2w
cannot destroy 'tank/Archive@auto-20140907.1730-2w': dataset is busy


I am 100% certain that nothing is using the dataset. In fact, I booted into single-user mode, imported the pool, and tried the command from there ... exactly the same result, "dataset is busy".

I have also tried rolling back to 1730 first and then rolling back (with -r) to 1725, but the result is exactly the same. The -R and -f options didn't help either.

Is there some flag that I need to tweak on the snapshot? Held, or latest, or something?
 

deafen
Yeah, it was a hold (specifically, one tagged "freenas:repl"). Once I released the hold on the later snapshot, I was able to roll back to the earlier one. Rinse and repeat.
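
For anyone searching later: `zfs holds` is how you can see the hold in the first place. A sketch (not an exact transcript), using one of the snapshot names from the listing above:

```shell
# List any user holds on the snapshot; output columns are NAME, TAG, TIMESTAMP.
zfs holds tank/Archive@auto-20140907.1730-2w
# If a hold tagged freenas:repl appears, release it, then retry the rollback:
zfs release freenas:repl tank/Archive@auto-20140907.1730-2w
zfs rollback -r tank/Archive@auto-20140907.1710-2w
```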
 

cyberjock (Inactive Account, joined Mar 25, 2012, 19,526 messages)
Can you provide any kind of instructions on how you did that? To be honest, I was going to ask a developer for the answer to your question because I didn't have the answer.
 

deafen
I had to do it for several datasets, but here's a typical example:

Code:
[root@omega] ~# zfs rollback -r tank/Media@auto-20140907.1710-2w
cannot destroy 'tank/Media@auto-20140907.1715-2w': dataset is busy
cannot destroy 'tank/Media@auto-20140907.1720-2w': dataset is busy
cannot destroy 'tank/Media@auto-20140907.1725-2w': dataset is busy
cannot destroy 'tank/Media@auto-20140907.1730-2w': dataset is busy
[root@omega] ~# zfs release freenas:repl tank/Media@auto-20140907.1715-2w
[root@omega] ~# zfs release freenas:repl tank/Media@auto-20140907.1720-2w
[root@omega] ~# zfs release freenas:repl tank/Media@auto-20140907.1725-2w
[root@omega] ~# zfs release freenas:repl tank/Media@auto-20140907.1730-2w
[root@omega] ~# zfs rollback -r tank/Media@auto-20140907.1710-2w
[root@omega] ~#
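
Since I had to do this for several datasets, the releases could be scripted instead of typed by hand. A rough sketch (untested as written, use at your own risk; tank/Media and the snapshot name are the examples from above):

```shell
# Release every freenas:repl hold on a dataset's snapshots, then roll back.
for snap in $(zfs list -H -r -d 1 -o name -t snapshot tank/Media); do
    # "zfs holds" prints a header line, then one line per hold:
    # NAME  TAG  TIMESTAMP. Check whether a freenas:repl hold exists.
    if zfs holds "$snap" | awk 'NR>1 {print $2}' | grep -q '^freenas:repl$'; then
        zfs release freenas:repl "$snap"
    fi
done
zfs rollback -r tank/Media@auto-20140907.1710-2w
```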


So that did what I wanted, but it ultimately didn't solve my underlying problem. See, the whole sordid story is like this:

I wanted to do some potentially destructive experiments on my primary server's pool, so I made sure my replication to the backup server was good, re-IP'd it to the primary's IP (after moving the primary to a new IP, natch), and started serving the data to clients from the backup server.

Once the experiments were done (and I had indeed mucked up the primary server's pool), I rebuilt the primary server and started replicating back to it from the backup server.

Once I was satisfied that the data was all intact, I turned off CIFS and NFS on the backup server and set up a rapid (5-minute) snapshot/replication to the primary for a final state sync. Then I swapped the IPs back and started serving to clients again from the primary.

So now I need to get backups going in the right direction again. Rather than resync all of the data (which would take ~40 hours), I figured I could just roll back any snapshots created after the last one that was replicated back to the primary, then set up the snapshots and replication in the right direction again. That's when I ran into this problem. And it makes sense: normally you don't want to be able to arbitrarily roll back a snapshot that has already been replicated, because then your replication target has newer data than your source - bad juju.

Now the issue is that when I create the snapshot/replication job on the primary, it (obviously) creates a new snapshot and tries to replicate it, but it complains about a mismatch between the PUSH snapshot (which it just created!) and the PULL snapshot.

Code:
Sep  8 07:55:01 delta autorepl.py: [common.pipesubr:58] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.99.35 "zfs list -Hr -o name -t snapshot -d 1 tank/VM | tail -n 1 | cut -d@ -f2"
Sep  8 07:55:01 delta autorepl.py: [tools.autorepl:410] Remote and local mismatch after replication: tank/VM: local=auto-20140908.0749-2w vs remote=auto-20140901.0900-2w
Sep  8 07:55:01 delta autorepl.py: [common.pipesubr:58] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 192.168.99.35 "zfs list -Ho name -t snapshot -d 1 tank/VM | tail -n 1 | cut -d@ -f2"
Sep  8 07:55:01 delta autorepl.py: [tools.autorepl:427] Replication of tank/VM@auto-20140908.0749-2w failed with cannot receive new filesystem stream: destination has snapshots (eg. tank/VM@auto-20140827.1525-2w) must destroy them to overwrite it Error 33 : Write error : cannot write compressed block


I haven't dug into the code yet to figure this out, and frankly I probably won't - this is all just to save having to resync 11 TB of data over 1GbE. And I know that I'm using the replication system in a way that isn't documented as such.
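
For what it's worth, the usual manual way around that "destination has snapshots" error is to roll the PULL side back to the last common snapshot and send an incremental from there. A sketch only, not something autorepl does for you; snapshot and host names are taken from the log above, and it assumes tank/VM@auto-20140901.0900-2w still exists on both machines:

```shell
# On the PULL side (192.168.99.35): discard anything newer than the
# last snapshot both sides have in common.
zfs rollback -r tank/VM@auto-20140901.0900-2w
# On the PUSH side: send the incremental range from the common snapshot
# up to the newest local snapshot; -F lets the receiver roll back if needed.
zfs send -I tank/VM@auto-20140901.0900-2w tank/VM@auto-20140908.0749-2w | \
    ssh 192.168.99.35 zfs recv -F tank/VM
```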

But this should be figured out, and documented. This scenario (catastrophic primary failure, fail over and serve from backup while restoring, fail back to primary and reestablish backups) isn't farfetched or esoteric by any means. I'll file a feature request for it.
 