Recovering Data from Corrupted (?) ZFS Mirror Pool

Rvby1

This issue started in an Ubuntu Server VM, but I'm hoping (praying) that I can fix it using some tools in TrueNAS.

I was running a mirror pool on an Ubuntu Server VM, which itself was running inside Proxmox. The two 2TB drives were passed through to the VM and set up as a single mirror pool with one dataset inside it. Both drives were Seagate Barracuda Compute drives.
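
For context, the pool was created with something roughly like the commands below; the pool, dataset, and disk names here are just placeholders, not the exact ones I used.

Code:
# Create a two-disk mirror pool from the passed-through drives (placeholder disk IDs)
zpool create mediapool mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2
# Create the single dataset inside it
zfs create mediapool/media-vault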

In the last month or so, I started getting close to filling the pool, so I decided to invest in a separate NAS. I got it configured with 2x18TB drives in a mirror pool, all running through TrueNAS Scale. These were Seagate EXOS drives.

After some configuration and testing, I felt I was ready to move my data from the Ubuntu VM's ZFS pool over to the TrueNAS ZFS pool. I had the TrueNAS pool mounted via CIFS on my Ubuntu VM, and I was able to read from and write to it without any issues.
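
Roughly, the mount looked something like this; the share path, mountpoint, and credentials below are placeholders.

Code:
# Mount the TrueNAS SMB share on the Ubuntu VM (placeholder host, share, and user)
sudo mkdir -p /mnt/truenas-vault
sudo mount -t cifs //192.168.1.50/vault /mnt/truenas-vault -o username=myuser,uid=1000,gid=1000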

I decided to use rsync, as I thought it would preserve the hard links my Ubuntu ZFS pool was using. The command was something like... rsync -avH --update --progress. This command worked fine in my testing, and it copied data over without any issues.
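
Spelled out with placeholder source and destination paths, the transfer was roughly:

Code:
# -a archive mode, -v verbose, -H preserve hard links,
# --update skip files that are newer on the destination, --progress show per-file progress
rsync -avH --update --progress /mediapool/media-vault/ /mnt/truenas-vault/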

When I eventually ran it against my Ubuntu VM's media-vault to copy it over to my TrueNAS vault, everything worked fine for a few thousand files. But then I noticed that my SSH session had hung, and when I checked on the VM in Proxmox, it had an IO error status (little yellow triangle). I didn't understand why until I dug deeper.

Through some stupidity, I had misconfigured the LVM-thin pool and ended up over-allocating it. Something during the rsync process must have written data to the local machine's LV, which then filled up the entire thin pool. That starved all of the VMs on my machine and caused them to start failing.
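
In hindsight, keeping an eye on thin pool usage on the Proxmox host would have caught this; something like the following shows how full each LV and thin pool is (volume group and LV names will differ).

Code:
# Report size and data/metadata usage for each logical volume on the host
lvs -o lv_name,vg_name,lv_size,data_percent,metadata_percent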

I was able to resolve the starvation issue by backing up one of my VMs to my NAS and then deleting it. This let me reboot into my Ubuntu VM, but I immediately noticed some corruption: some of my folder permissions had changed, and almost all of the data on my Ubuntu VM's ZFS mirror pool was gone, save for a few random files.
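
For anyone curious, freeing the space amounted to something like this on the Proxmox host; the VM ID and storage name are placeholders.

Code:
# Back the VM up to NAS-backed storage, then remove it to free its thin LV
vzdump 101 --storage nas-backups --mode stop
qm destroy 101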

I went from almost 2TB, down to a couple of gigs.

I disconnected the ZFS drives without doing any sort of export and tried connecting them to my TrueNAS system, in the hope that the data was still there and just blocked by corruption. After connecting the drives to my motherboard's SATA ports and booting, TrueNAS spat out the following error in the notifications section:

Code:
Device: /dev/sde [SAT], 5 Offline uncorrectable sectors.
2024-01-17 01:02:57 (America/Los_Angeles)
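
That alert comes from SMART monitoring. The underlying attribute can be checked directly with smartctl; the device node will vary.

Code:
# Print full SMART health info, including the offline-uncorrectable counter, for the flagged disk
sudo smartctl -a /dev/sde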


And when I tried to just import it, this is the error that I got: https://pastebin.com/CuxyfgXX
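
For anyone in a similar spot, one low-risk thing to try is a read-only import, so that nothing gets written to the disks while you investigate; the pool name below is a placeholder.

Code:
# List pools that are visible to the system without importing them
zpool import
# Attempt a forced, read-only import so nothing is written during recovery
zpool import -o readonly=on -f mediapool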

Now, some of the data is on my NAS, but the majority of it is not. I'd like to recover this data if I can, as reobtaining a lot of it would be very time-consuming.

I'm currently running a scan with UFS Explorer in the hope that it can find something or fix whatever corruption happened, if that is indeed the case.

Anyone know if there is a way to get this pool to import correctly into TrueNAS Scale in the hopes that the data is there, and the Ubuntu VM is just messed up?

Anyone have any other advice? Programs I should try? Steps I should take? Anything would be appreciated!

Thanks!
 

Rvby1

Alright, so my solution to this, after freaking out about losing the data for hours, ended up being incredibly simple.

While I was scanning one of the disks with UFS Explorer, I decided to plug the disk that had been reporting the uncorrectable sectors back into my Ubuntu VM. I was expecting zpool status to tell me that the pool was degraded and that the drive had metadata issues... but it said that there were no pools available.

A bit of googling later, I found that I could run zpool import -a. After running it, the system suddenly saw the pool again. I also saw that the permissions on my vault dataset were correct again, and that all of the data in the dataset was visible!
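
In other words, the whole fix boiled down to the sequence below (pool name is a placeholder for mine).

Code:
# At first, no pools were listed
zpool status
# Scan attached devices and import any pools that are found
zpool import -a
# Confirm the pool and its dataset are back
zpool status
zfs list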

What I'm thinking is that the Ubuntu VM lost track of the zpool while it was having IO errors.

Strangely, the dataset still had a mountpoint on my machine; it just had the wrong permissions and almost none of my data inside. I'm not totally clear on why the mountpoint didn't disappear, or whether it was still there but simply appeared empty because the zpool wasn't imported.
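
If anyone hits the same thing, the dataset's mountpoint and mounted state can be checked like this; the dataset name is a placeholder.

Code:
# Check whether the dataset is actually mounted and where
zfs get mountpoint,mounted mediapool/media-vault
# Mount any ZFS datasets that aren't currently mounted
zfs mount -a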

Anyway, this has been resolved. Hopefully my solution above can help someone else out who thought they'd lost data!

And yes, I will be improving my backup system. That's part of why I invested in building the NAS, so that I could have more space for backups, haha. I was playing it a bit fast and loose with my media data, and I very nearly paid the price!
 