What a disaster the replication task has done!

mathcrowd

Dabbler
I'm running a small startup team working on math education, and for the last three years we have stored all our files on Nextcloud, which mounts an NFS share from TrueNAS.

I'm setting up a new server today and want to migrate some datasets between the servers.

Unfortunately, I used a replication task.

Source setting:

Server A/Pool 1/{dataset a, dataset b, dataset c}

Target setting:

Server B/Pool 2

Pool 2 already contained older versions of datasets a, b, c, as well as datasets d, e, f.

So I started the replication task. After about 100 GB of data had been copied, I finally realized that all the datasets in Pool 2 had disappeared.

I quickly shut down the server (it seems the task cannot be stopped).
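
For anyone trying to picture what the task does: as far as I understand it, replication to a pool root boils down to a recursive send received with a forced rollback, roughly like the sketch below (pool, dataset, and snapshot names are just placeholders, not my real ones):

```
# Rough sketch of the kind of send/receive the replication performs.
# The -F on the receiving side forces the target to match the incoming
# stream, so existing datasets and snapshots on the target side can be
# rolled back or destroyed as part of the receive.
zfs send -R pool1/datasetA@auto-2023-04-18_00-00 | \
    ssh serverB zfs recv -s -F -x sharenfs -x sharesmb -x mountpoint pool2/datasetA
```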

----------------------------------------------------------------


Datasets d, e, f total about 30 TB; I think the data is still on my disks.

All my snapshots are gone.

---------------------------------------------------------------

I have no backup of this data.

I thought RAID 1 was safe enough (how silly I was), and I lacked the budget to buy more disks.

-----------------------------------------------------------------

From googling, I found that a txg rollback might save my life.

I have run `zpool history -il`,

but most of the entries are not for the related pool and have the wrong date (a day earlier).
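
For reference, `zpool history` accepts a pool name, so the output can be narrowed to just the affected pool and filtered for the destructive entries (which is what I try next):

```
# Long (-l) and internal (-i) history for one pool only; the internal
# entries also record middleware-driven destroys and their txg numbers
zpool history -il DataCenter | grep -Ei 'destroy|recv'
```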

Can anyone help? Thanks.
 

mathcrowd

Dabbler
Update:

I found the related command history.

But it's weird that datasets d, e, f (mentioned above, which were not the target) are not listed below.

So how can I find my missing datasets?

```
2023-04-18.21:51:01 zpool import 11985382682913978068 DataCenter
2023-04-18.21:51:10 zfs inherit -r DataCenter
2023-04-18.22:00:43 zfs destroy -r DataCenter/static
2023-04-18.22:11:05 zfs create -o aclmode=passthrough -o casesensitivity=sensitive -o org.freenas:description=video collection -o copies=1 -o org.truenas:managedby=192.168.30.50 -o quota=none -o refquota=none -o refreservation=none -o reservation=none -o xattr=sa -o special_small_blocks=0 DataCenter/video
2023-04-18.22:52:49 zfs recv -s -F -x sharenfs -x sharesmb -x mountpoint DataCenter/video
2023-04-18.22:52:50 zfs set readonly=on DataCenter/video
2023-04-18.22:52:50 zfs set readonly=on DataCenter/video
2023-04-18.22:52:50 zfs destroy DataCenter@auto-2023-04-05_00-00,auto-2023-04-06_00-00,auto-2023-04-07_00-00,auto-2023-04-08_00-00,auto-2023-04-09_00-00,auto-2023-04-10_00-00,auto-2023-04-11_00-00,auto-2023-04-12_00-00,auto-2023-04-13_00-00,auto-2023-04-14_00-00,auto-2023-04-15_00-00,auto-2023-04-17_00-00
2023-04-18.22:52:50 zfs destroy DataCenter/apt-cache@auto-2023-04-18_00-00
2023-04-18.22:52:50 zfs destroy DataCenter/calibre@auto-2023-04-18_00-00
2023-04-18.22:52:51 zfs destroy DataCenter/config@auto-2023-04-18_00-00
2023-04-18.22:52:51 zfs destroy DataCenter/gitlab@auto-2023-04-18_00-00
2023-04-18.22:52:51 zfs destroy DataCenter/tug@auto-2023-04-18_00-00
2023-04-18.22:54:21 zpool remove DataCenter /dev/gptid/531c733a-d53e-11ed-8c24-005056808ef7
2023-04-18.22:54:58 zfs destroy DataCenter/apt-cache@auto-2023-04-05_00-00,auto-2023-04-06_00-00,auto-2023-04-07_00-00,auto-2023-04-08_00-00,auto-2023-04-09_00-00,auto-2023-04-10_00-00,auto-2023-04-11_00-00,auto-2023-04-12_00-00,auto-2023-04-13_00-00,auto-2023-04-14_00-00,auto-2023-04-15_00-00
2023-04-18.22:54:58 zfs destroy DataCenter/calibre@auto-2023-04-05_00-00,auto-2023-04-06_00-00,auto-2023-04-07_00-00,auto-2023-04-08_00-00,auto-2023-04-09_00-00,auto-2023-04-10_00-00,auto-2023-04-11_00-00,auto-2023-04-12_00-00,auto-2023-04-13_00-00,auto-2023-04-14_00-00,auto-2023-04-15_00-00
2023-04-18.22:54:58 zfs destroy DataCenter/config@auto-2023-04-05_00-00,auto-2023-04-06_00-00,auto-2023-04-07_00-00,auto-2023-04-08_00-00,auto-2023-04-09_00-00,auto-2023-04-10_00-00,auto-2023-04-11_00-00,auto-2023-04-12_00-00,auto-2023-04-13_00-00,auto-2023-04-14_00-00,auto-2023-04-15_00-00
2023-04-18.22:54:58 zfs destroy DataCenter/davinci@auto-2023-04-05_00-00,auto-2023-04-06_00-00,auto-2023-04-07_00-00,auto-2023-04-08_00-00,auto-2023-04-09_00-00,auto-2023-04-10_00-00,auto-2023-04-11_00-00,auto-2023-04-12_00-00,auto-2023-04-13_00-00,auto-2023-04-14_00-00,auto-2023-04-15_00-00,auto-2023-04-17_00-00
2023-04-18.22:57:04 zfs destroy DataCenter/gitlab@auto-2023-04-05_00-00,auto-2023-04-06_00-00,auto-2023-04-07_00-00,auto-2023-04-08_00-00,auto-2023-04-09_00-00,auto-2023-04-10_00-00,auto-2023-04-11_00-00,auto-2023-04-12_00-00,auto-2023-04-13_00-00,auto-2023-04-14_00-00,auto-2023-04-15_00-00
2023-04-18.22:59:01 zfs destroy DataCenter/minio@auto-2023-04-05_00-00,auto-2023-04-06_00-00,auto-2023-04-07_00-00,auto-2023-04-08_00-00,auto-2023-04-09_00-00,auto-2023-04-10_00-00,auto-2023-04-11_00-00,auto-2023-04-12_00-00,auto-2023-04-13_00-00,auto-2023-04-14_00-00,auto-2023-04-15_00-00,auto-2023-04-17_00-00
2023-04-18.22:59:49 zfs destroy DataCenter/registry@auto-2023-04-05_00-00,auto-2023-04-06_00-00,auto-2023-04-07_00-00,auto-2023-04-08_00-00,auto-2023-04-09_00-00,auto-2023-04-10_00-00,auto-2023-04-11_00-00,auto-2023-04-12_00-00,auto-2023-04-13_00-00,auto-2023-04-14_00-00,auto-2023-04-15_00-00,auto-2023-04-17_00-00
2023-04-18.23:00:07 zfs destroy DataCenter/softwares@auto-2023-04-05_00-00,auto-2023-04-06_00-00,auto-2023-04-07_00-00,auto-2023-04-08_00-00,auto-2023-04-09_00-00,auto-2023-04-10_00-00,auto-2023-04-11_00-00,auto-2023-04-12_00-00,auto-2023-04-13_00-00,auto-2023-04-14_00-00,auto-2023-04-15_00-00,auto-2023-04-17_00-00
2023-04-18.23:00:07 zfs destroy DataCenter/tug@auto-2023-04-05_00-00,auto-2023-04-06_00-00,auto-2023-04-07_00-00,auto-2023-04-08_00-00,auto-2023-04-09_00-00,auto-2023-04-10_00-00,auto-2023-04-11_00-00,auto-2023-04-12_00-00,auto-2023-04-13_00-00,auto-2023-04-14_00-00,auto-2023-04-15_00-00
2023-04-18.23:13:04 zpool remove DataCenter /dev/gptid/53052171-d53e-11ed-8c24-005056808ef7
2023-04-18.23:24:51 zpool add DataCenter special /dev/gptid/f4186fa9-ddfb-11ed-a4c0-00505680d330
2023-04-18.23:32:37 zpool remove DataCenter /dev/gptid/454be33a-d0a8-11ed-aa8b-00505680d00f
2023-04-18.23:43:35 zpool attach DataCenter /dev/gptid/f4186fa9-ddfb-11ed-a4c0-00505680d330 /dev/gptid/a8a8513c-ddff-11ed-a4c0-00505680d330
2023-04-18.23:46:03 zpool remove DataCenterNone
2023-04-19.00:00:39 zfs recv -s -F -x sharenfs -x sharesmb -x mountpoint DataCenter/apt-cache
2023-04-19.00:00:49 zfs set readonly=on DataCenter/apt-cache
2023-04-19.00:01:04 zfs recv -s -F -x sharenfs -x sharesmb -x mountpoint DataCenter/apt-cache
2023-04-19.00:01:19 zfs recv -s -F -x sharenfs -x sharesmb -x mountpoint DataCenter/apt-cache
2023-04-19.00:01:38 zfs recv -s -F -x sharenfs -x sharesmb -x mountpoint DataCenter/apt-cache
2023-04-19.00:02:48 zfs recv -s -F -x sharenfs -x sharesmb -x mountpoint DataCenter/apt-cache
2023-04-19.00:03:07 zfs recv -s -F -x sharenfs -x sharesmb -x mountpoint DataCenter/apt-cache
2023-04-19.00:04:17 zfs recv -s -F -x sharenfs -x sharesmb -x mountpoint DataCenter/apt-cache
2023-04-19.00:10:40 zpool import 11985382682913978068 DataCenter
2023-04-19.00:10:40 zpool set cachefile=/data/zfs/zpool.cache DataCenter

```
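
To compare what the history shows with what the pool still reports, a read-only listing changes nothing and shows everything the pool still knows about:

```
# Every filesystem, volume and snapshot the pool still contains
zfs list -r -t all DataCenter
```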
 

mathcrowd

Dabbler
Update:

I finally found the destroy commands using `zpool history -i`.

To my despair, I was doing vdev remove/add operations while the replication task was running.

So, when I try:

Code:
zpool import -N -o readonly=on -F -T 17032157 DataCenter


I got "one or more devices is currently unavailable".

Original disk info:

1 vdisk in ESXi for the special vdev
2 physical disks in ESXi for the L2ARC cache
2 vdisks in ESXi for the log vdevs

What I've done:

1. Removed the L2ARC cache.
2. Added one of the removed physical disks as a special vdev.
3. Deleted the original vdisk special vdev.
4. Added the other physical disk as a mirror.
5. Removed the 2 log vdevs.
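
For reference, the state of these remove/add operations can be checked read-only, and as far as I know a top-level device removal can only be cancelled while its evacuation is still in progress (just a sketch):

```
# Read-only: shows the vdev layout, resilver state, and any in-progress
# or completed device removals/evacuations
zpool status -v DataCenter
# Cancels a top-level device removal, but only while it is still running
zpool remove -s DataCenter
```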

Is it possible to get my datasets back? It's really important to our team.
We've uploaded all of our work onto Nextcloud.
crying....
 
Forum member (joined Oct 22, 2019)
Wait, you removed a special vdev? That basically takes down the entire pool.

It's very difficult to follow your posts at this point in time.
 

NugentS

MVP
My read from that is that he has destroyed his target pool in several ways by flailing around doing the equivalent of pressing random buttons.
And no backup.
AND it's virtual.

I don't see any way back from that

"delete original vdisk special vdev" - thats fatal for whichever pool that used to be.
 

mathcrowd

Dabbler
Wait, you removed a special vdev? That basically takes down the entire pool.

It's very difficult to follow your posts at this point in time.
Actually, I replaced my special vdev with two physical disks which had been the L2ARC cache.

The whole process completed with no warnings or errors.

I can still import the pool, but all the datasets are lost.
 

mathcrowd

Dabbler
My read from that is that he has destroyed his target pool in several ways by flailing around doing the equivalent of pressing random buttons.
And no backup.
AND it's virtual.

I don't see any way back from that

"delete original vdisk special vdev" - thats fatal for whichever pool that used to be.
I still have the vdisk files; I'm only losing the L2ARC caching data.

I thought the metadata would be migrated to the new disks.

However, it made it impossible for me to roll back.
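
As far as I understand it, a completed device removal copies the data onto the remaining vdevs and leaves an indirect mapping behind, which would explain why a rewind to a txg from before the removal no longer matches the current vdev layout. The result can at least be inspected read-only:

```
# Completed removals show up in the pool status; zdb -C dumps the cached
# pool configuration, where removed top-level vdevs appear as type 'indirect'
zpool status -v DataCenter
zdb -C DataCenter
```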
 

mathcrowd

Dabbler
Actually, I replaced my special vdev with two physical disks which had been the L2ARC cache.

The whole process completed with no warnings or errors.

I can still import the pool, but all the datasets are lost.
Furthermore, I run ESXi on a 2-node vSAN, but only one node is running.

So I think the original vdisk can be recovered from the other node, as long as I don't sync the data between the two nodes.
 

Ericloewe

Server Wrangler
Moderator
I think you need to a) illustrate what is going on, b) boil this down to a list of specific actions taken, and c) stop touching the server because this is impossible to follow.

Frankly, your setup is a complete disaster and it's only gotten worse by carrying out seemingly random actions with no clear plan and certainly not a bit of input from someone knowledgeable.
 

Patrick M. Hausen

Hall of Famer
Back to the start - replication means replication. If you replicate dataset X on machine A to machine B, then dataset X on machine B will look exactly like the one on machine A. This happens on the block level. Everything that might have been on machine B in a dataset named X will be gone. If you don't have another backup, that's bad but fact. That's all that can be said from the TrueNAS point of view.

Now with the rest of your posts describing various actions on your part ... you seem to be running TrueNAS virtualized on virtual disk images. This is a configuration that is in itself strongly discouraged and known to be dangerous. That being said there might be a chance you can recover that data if e.g. VMFS snapshots are available.
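
If you want to check whether such snapshots exist, the ESXi shell can list them; something along these lines (the VM ID will differ on your host):

```
# List registered VMs with their IDs, then list any snapshots for one VM
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/snapshot.get 12   # replace 12 with the TrueNAS VM's ID
```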

But with even our most experienced regular @jgreco getting nausea from your rapid unordered retelling of events, to help you we will need a complete and precise description of your setup on the target machine to even begin thinking about useful suggestions.

Anyone providing technical support via some remote medium, be it a forum or the phone, needs to construct a complete mental image of the state of your machine. This is currently not possible from your posts alone. Please provide more structured information.

Kind regards, keeping my fingers crossed for your data,
Patrick
 

mathcrowd

Dabbler
I think you need to a) illustrate what is going on, b) boil this down to a list of specific actions taken, and c) stop touching the server because this is impossible to follow.

Frankly, your setup is a complete disaster and it's only gotten worse by carrying out seemingly random actions with no clear plan and certainly not a bit of input from someone knowledgeable.
A more simplified illustration:

I accidentally triggered the commands that destroyed all the datasets, without noticing.

And while the destroy commands were running, I was still replacing vdevs:

Original state:

* 4 data vdevs, each a mirrored group of 2 disks
* 1 special vdev, which is an ESXi VMDK object
* 2 log devs, which are ESXi VMDK objects

What I did:

1. Removed the 2 L2ARC cache vdevs (which are physical disks).
2. Added one of the disks mentioned above to the pool as a special vdev.
3. Deleted the original special vdev.
4. Added one more disk as a mirror in the special vdev.
5. Deleted the log devs.

While I was replacing the vdevs, the server was still running the replication task, which triggered the destroy commands.
About 900 GB of data was written to the pool, and the whole pool is about 30 TB.

6. I rebooted my server to stop the task.
7. I found the txg of the destroy command:

Code:
2023-04-18.23:13:41 [txg:17032225] destroy DataCenter/nextcloud (143) (bptree, mintxg=1)


8. But I cannot do the rollback after exporting the pool:
Code:
zpool import -N -o readonly=on -F -T 17032157 DataCenter

err: one or more devices is currently unavailable

9. I tried these settings and tried importing again:
Code:
set vfs.zfs.recover=1
set vfs.zfs.debug=1
sysctl vfs.zfs.spa.load_verify_metadata=0
sysctl vfs.zfs.spa.load_verify_data=0
zpool import -N -o readonly=on -f -R /mnt -F -T 17032157 DataCenter

Monitoring the dbgmsg:
Code:
spa.c:6249:spa_tryimport(): spa_tryimport: importing DataCenter, max_txg=17032157
spa_misc.c:419:spa_load_note(): spa_load($import, config trusted): LOADING
spa.c:8369:spa_async_request(): spa=$import async request task=1
spa.c:8369:spa_async_request(): spa=$import async request task=1
spa.c:8369:spa_async_request(): spa=$import async request task=1
spa.c:8369:spa_async_request(): spa=$import async request task=1
spa.c:8369:spa_async_request(): spa=$import async request task=1
spa.c:8369:spa_async_request(): spa=$import async request task=1
spa.c:8369:spa_async_request(): spa=$import async request task=1
spa.c:8369:spa_async_request(): spa=$import async request task=1
spa.c:8369:spa_async_request(): spa=$import async request task=1
spa.c:8369:spa_async_request(): spa=$import async request task=1
spa_misc.c:404:spa_load_failed(): spa_load($import, config untrusted): FAILED: no valid uberblock found
spa_misc.c:419:spa_load_note(): spa_load($import, config untrusted): UNLOADING
spa.c:6110:spa_import(): spa_import: importing DataCenter, max_txg=17032157 (RECOVERY MODE)
spa_misc.c:419:spa_load_note(): spa_load(DataCenter, config trusted): LOADING
spa_misc.c:404:spa_load_failed(): spa_load(DataCenter, config untrusted): FAILED: no valid uberblock found
spa_misc.c:419:spa_load_note(): spa_load(DataCenter, config untrusted): UNLOADING
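
As far as I can tell, the rewind can only succeed if an uberblock at or below that txg still survives in the device labels, which can be checked read-only with zdb (the device path below is only an example; the real gptid paths are the ones in the history above):

```
# Dump the label plus all surviving uberblocks on one pool member; each
# uberblock entry shows the txg it points to. If nothing at or below
# txg 17032157 survives, the import with -T 17032157 cannot find a
# valid uberblock, which matches the dbgmsg output above.
zdb -ul /dev/gptid/f4186fa9-ddfb-11ed-a4c0-00505680d330
```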


----------------------------------------------------------------------------------------------------

I am still keeping all my ESXi VMDK files.

I hope somebody can help.

I was so silly to have no backups.

It's a disaster for our team.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Honestly, your situation is unrecoverable, and a complete comedy of errors. Why didn't you stop digging when you realized you were in a hole?

The only hope you have of recovering your data is to use a ZFS rescue utility like Klennet ZFS Recovery. Honestly, you need a stable bare-metal Windows rescue environment to even attempt a rescue. Make multiple copies of your VMDKs, and keep a set untouched as your gold masters, used only as sources for copies.
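
From the ESXi shell, a VMDK can be cloned with vmkfstools, for example (datastore paths are placeholders for your own layout):

```
# Clone a source VMDK to a thin-provisioned backup copy before touching it
vmkfstools -i /vmfs/volumes/datastore1/truenas/special0.vmdk \
    /vmfs/volumes/datastore1/backup/special0-copy.vmdk -d thin
```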
 

mathcrowd

Dabbler
Honestly, your situation is unrecoverable, and a complete comedy of errors. Why didn't you stop digging when you realized you were in a hole?

The only hope you have of recovering your data is to use a ZFS rescue utility like Klennet ZFS Recovery. Honestly, you need a stable bare-metal Windows rescue environment to even attempt a rescue. Make multiple copies of your VMDKs, and keep a set untouched as your gold masters, used only as sources for copies.
Do you mean running the ZFS recovery on the data vdevs or on the special vdev?

Actually, I can import the pool, but barely anything remains on it.
 

Ericloewe

Server Wrangler
Moderator
I accidentally triggered the commands that destroyed all the datasets, without noticing.
How on Earth did that happen?

And while the destroy commands were running, I was still replacing vdevs:
What do those things have to do with each other?

1. Removed the 2 L2ARC cache vdevs (which are physical disks).
That's fine.

Added one of the disks mentioned above to the pool as a special vdev.
That's a lot less fine.

Deleted the original special vdev.
That is impossible to do on ZFS. If you mean you destroyed the virtual disks or something, that would destroy the pool.

Added one more disk as a mirror in the special vdev.
Uncontroversial

Deleted the log devs.
SLOG? Fine.

While I was replacing the vdevs, the server was still running the replication task, which triggered the destroy commands.
About 900 GB of data was written to the pool, and the whole pool is about 30 TB.
And you didn't get any serious warnings? I'm assuming you configured this from the GUI?
 

mathcrowd

Dabbler
I accidentally triggered the commands that destroyed all the datasets, without noticing.

How on Earth did that happen?

I just set the pool root as the target when setting up the replication task, so boom!!!!

Deleted the original special vdev.
That is impossible to do on ZFS. If you mean you destroyed the virtual disks or something, that would destroy the pool.

Since I had added one physical disk as a special vdev, TrueNAS allowed me to remove the former one, leaving one disk as the special vdev.
It actually took a while to replace the disk.

The ESXi vdisk file was kept, and it was backed up before the whole disaster happened.
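
In hindsight, I think the safer route would have been to mirror the new disk onto the existing special vdev, wait for the resilver, and only then detach the old vdisk, instead of a top-level device removal; a sketch with placeholder device names:

```
# Attach the new disk as a mirror of the existing special-vdev device,
# wait for the resilver to finish, then detach the old virtual disk
zpool attach DataCenter gptid/old-special-vdisk gptid/new-special-disk1
zpool status DataCenter            # wait until the resilver completes
zpool attach DataCenter gptid/new-special-disk1 gptid/new-special-disk2
zpool detach DataCenter gptid/old-special-vdisk
```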
 

mathcrowd

Dabbler
Honestly, your situation is unrecoverable, and a complete comedy of errors. Why didn't you stop digging when you realized you were in a hole?

The only hope you have of recovering your data is to use a ZFS rescue utility like Klennet ZFS Recovery. Honestly, you need a stable bare-metal Windows rescue environment to even attempt a rescue. Make multiple copies of your VMDKs, and keep a set untouched as your gold masters, used only as sources for copies.
I'm trying the software now. Should I point it at the new special vdev (physical disk, can be imported normally) or at the original vdev (ESXi VMDK)?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
I'm trying the software now. Should I point it at the new special vdev (physical disk, can be imported normally) or at the original vdev (ESXi VMDK)?
Sorry, I don't know. You'll have to ask Klennet support.
 