Any recovery after partial replication?

NonWonderDog
Cadet · Joined Sep 8, 2020 · Messages: 2
So I was upgrading the array in my home server from six drives in RAIDZ2 to eight. I replicated everything to a 16 TB external drive, recreated my pool, and started another replication task to move everything from the external back to my pool. (I have partial backups of everything important, but I only had one full copy during all of this. The only files potentially lost were downloaded/recorded media.)

Overnight I was getting slow I/O warnings over email and the external was sitting at 60 °C. This seemed not-so-good, so I stood the enclosure up on one side (it has feet on the short edge, so this is how the cooling was designed) and it immediately cooled by 5 degrees.

And then of course I accidentally tipped it on its side 20 minutes later. And it head crashed. No more spinup. No more data. (This is pretty poor shock resistance for an external drive, but then again I knew it was just an Exos X16 stuffed in an enclosure.) I was able to get it to spin up again with the freezer trick, but all it did was make some godawful scratching noises before stopping. It's probably toast.

But before that happened it had transferred all but about 1.2 TB of data to my storage pool. Several datasets fully transferred and are fine, but a big chunk of data is in one 5.2 TB dataset (which should be 6.4 TB, I think).

Even though the dataset shows 5.21 TB used, there are no files in it. Is there anything I can do to recover this and put some filenames to the partially transferred dataset?

I don't really understand how ZFS is structured physically or where metadata is stored in relation to data. If replication is block-by-block, did it simply fail before copying any metadata, leaving all my files as undifferentiated compressed gibberish? Or has it probably copied enough metadata to reconstruct something, and is just failing somewhere in the process? In short, is there any hope?

"zpool status" reports "errors: No known data errors."
 

NonWonderDog
Cadet · Joined Sep 8, 2020 · Messages: 2
I think I have a better understanding of what's going on here now. The dataset is busy, and there's a resume token, since the zfs receive -s (started by the replication task) never completed.
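In case it helps anyone searching later, this is roughly how I confirmed it; tank/media is standing in for my real dataset name:

    # Show the resume token left behind by the interrupted "zfs receive -s"
    zfs get -H -o value receive_resume_token tank/media

    # If the sending side still existed, the transfer could be resumed with:
    # zfs send -t <token> | zfs receive -s tank/media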

But it never *will* complete, because the sending pool exploded itself. So to clean this up I can apparently do a zfs receive -A, which will delete my 5.2 TB of in-progress transfers and let me move on with my new non-datahoarder life.
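The cleanup itself looks to be just the one-liner below (again, tank/media standing in for my dataset):

    # Abort the interrupted receive and discard the partial receive state
    zfs receive -A tank/media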

I don't expect much, but I still have to ask: is there any way to do anything with the partially received snapshot? Clone it as-is, finish it with empty data, etc.? I'll feel silly if I delete everything and then find out I could have done a zfs receive < /dev/zero.
 

dnut
Cadet · Joined Jul 19, 2022 · Messages: 3
I've had a similar experience. In my opinion, using a replication task to transfer large datasets over the internet isn't the best approach. I configured a disaster-recovery setup for a client (TrueNAS-to-TrueNAS geo-replication) involving over 40 TB of data, but I constantly hit internet disruptions before the replication could complete.

Consequently, I resorted to using rsync at the file level. While it wasn't as efficient, it proved to be more resilient. If anyone has better suggestions or strategies for this, I'd be keen to hear them.
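For reference, the invocation was along these lines; the paths and host here are examples, not the client's actual setup:

    # File-level sync that survives interruptions; rerunning it picks up where it left off.
    # --partial-dir keeps half-transferred files out of the live tree until they complete.
    rsync -aH --partial --partial-dir=.rsync-partial --info=progress2 \
        -e ssh /mnt/tank/data/ backup@dr-site:/mnt/tank/data/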
 