Cannot reliably restore replicated backups from second node

0x3d638c8e


Cannot reliably restore TrueNAS backups

I have two TrueNAS SCALE instances on separate machines. The main instance, nas, had to have its ZFS pool re-created and its backups restored from the second node, and this is causing me many issues.

Setup

Let me preface this by saying I'm new to ZFS (one of the reasons for choosing TrueNAS), coming from traditional ext4 + LUKS. I set up the original pool via the CLI and imported it into TrueNAS (I was trying out OMV before).

The main instance, nas, has a structure like

Code:
tank/enc/ # <- encrypted
tank/enc/backups
tank/enc/backups/timemachine
tank/enc/backups/windows
tank/enc/data
tank/enc/data/whatever
tank/downloads # <- unencrypted and not backed up


The backup node, backup, looks like this:

Code:
tank/enc/ # <- encrypted
tank/enc/backups
tank/enc/backups/nas # <- for backing up zfs data from the first node
tank/enc/backups/nas/backups/timemachine # <- mirrors the original node's structure, just 2 levels deeper
# <- note how there is no backups/windows, since there is nothing of value worth backing up redundantly
tank/enc/backups/nas/data
tank/enc/backups/nas/data/whatever


Automatic snapshots are set up on the lowest datasets in the tree for some of them, not recursively on their parents, since I do not want to back up e.g. some 2 TB of ultimately replaceable Windows data.

Others, e.g. tank/enc/data, have recursive snapshots configured, since I plan on backing up the whole dataset and its children. I am not sure whether that's a mistake or best practice.

Code:
#tank/enc/ 
#tank/enc/backups
tank/enc/backups/timemachine # <- snapshots here
tank/enc/backups/windows # <- snapshots here
tank/enc/data # <- recursive snapshots here
#tank/enc/data/whatever # <- snapshots inherited from parent
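
For what it's worth, which datasets actually end up with snapshots can be sanity-checked on the source with something like this (the dataset path is just an example):

Code:
# list all snapshots under a dataset, oldest first
zfs list -t snapshot -o name,used,creation -s creation -r tank/enc/data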


Automatic replication is set up along the same lines:

Code:
#tank/enc/ 
#tank/enc/backups
tank/enc/backups/timemachine # -> tank/enc/backups/nas/backups/timemachine 
#tank/enc/backups/windows # goes nowhere
tank/enc/data  # -> tank/enc/backups/nas/data
#tank/enc/data/whatever # not explicitly configured, part of parent


This all runs on a schedule.
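
As far as I understand it, those tasks boil down to a raw, recursive ZFS send of the encrypted datasets, conceptually something like this (snapshot name and exact flags are only an illustration, I'm not claiming this is literally what zettarepl runs):

Code:
# on nas: raw (-w) recursive (-R) send of the encrypted dataset, received resumable (-s) on backup
zfs snapshot -r tank/enc/data@manual-2023-02-12
zfs send -w -R tank/enc/data@manual-2023-02-12 | ssh backup zfs receive -s -F tank/enc/backups/nas/data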

Replication Restore Issues

For restoring, I first disabled SMB and auxiliary services and stopped automatic snapshotting and PUSH replication.

Then I (naively) re-created the nas pool, re-created the dataset structure from shell history via SSH, and clicked "Restore" on each automatic replication, but all of those failed in one of the following ways:
  • The task showing up as "Failed" (or "Error") for a split second, with this helpful message in the logs:
[screenshot of the error message]

  • "Unable to send encrypted dataset to existing unencrypted or unrelated dataset" <- this was the standard error, but it could be worked around by selecting "Do not copy file system properties" (see the encryption-property check sketched after this list).
  • Sometimes, selecting a pool on the backup node in the UI gives me a "[EFAULT] Failed to get group quota for tankX/dataset: [cannot get used/quota for tankX/dataset: unsupported version or feature ]" error (which appears to have no impact on anything?)
  • These two errors are in my search history, but I don't recall them occurring again:
    • Cannot receive new filesystem stream: encryption property 'encryption' cannot be set or excluded for raw streams
    • Encryption requested for destination dataset 'tank', but it already exists and is not encrypted
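
For the encryption-related errors above, comparing the encryption properties on both sides at least shows where the two hierarchies disagree, e.g.:

Code:
# run on both nas and backup; encryptionroot and keystatus are the interesting columns
zfs get -r -t filesystem encryption,encryptionroot,keystatus,keyformat tank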

The only way I've found to actually get some data across is to manually re-create a PULL task from the source for each dataset, from scratch, non-recursive, with "Copy file system properties" unselected, and start them one by one.
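
Given how long these one-off pulls take, resumable receives might at least avoid starting from zero after every interruption - assuming the receiving side was started with -s (dataset path and token are placeholders):

Code:
# on the destination, check whether a partial receive left a resume token
zfs get -H -o value receive_resume_token tank/enc/data
# if it prints a token instead of '-', feed it back to the sender to resume
ssh backup zfs send -t <token> | zfs receive -s tank/enc/data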

However, I am unsure how successful that is, since I am getting the following issues:
  • Every time I trigger a replication, the UI becomes unavailable (and so do new SSH connections); CPU usage hits 100% as soon as a task starts. Because of this, restoring the backups (if they worked at all, see below) would take days, since I'd have to babysit these jobs. I could exclude a CPU from the kernel's scheduler and manually assign it to Python or whatever is running the UI, but that seems ill-advised (same with messing with nice for sshd) - I've been warned to basically not touch the system outside of the UI...
  • I get no logs of previous attempts. I started five last night and got zero evidence of them this morning (clearly, most failed, since there's no data). I presume this makes sense with the system in a constant state of 100% CPU and the scheduler only giving time to the ZFS replication?
  • Because of that, everything is ALWAYS "Pending" (or "Running" for a brief moment before the UI locks up). "PENDING" does not seem like a valid state for a one-off run that's not on a schedule?
  • Frequently, I get false positives for locked datasets: child datasets appear locked despite their parent clearly being unlocked. Unlocking via the CLI confuses the poor thing even more. When I try to use the UI to unlock one, I get a "mountpoint or dataset is busy" message, which is a red herring in this context, since the dataset is clearly unlocked and mounted and hence cannot be unlocked and mounted again.
  • ConnectionResetError: [Errno 104] Connection reset by peer shows up occasionally in /var/log/zettarepl.log during replication, but it does not appear to kill the process; the same goes for ConnectionRefusedError: [Errno 111] Connection refused.
  • And last but not least, after letting some replication tasks run overnight, I came back this morning to find that only a small subset of the data had actually been copied (150 of ~700 GB for one dataset) and the other tasks had failed (though, again, I have no evidence of that). I have no confidence that the data is actually backed up (see the verification sketch after this list).
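
For what it's worth, the simplest sanity check for "did snapshot X actually make it over" seems to be comparing snapshot GUIDs on both ends - the GUID is preserved by send/receive, so matching GUIDs mean the receive completed (dataset and snapshot names below are just examples):

Code:
# on nas: find the newest snapshot of the dataset
zfs list -t snapshot -o name -s creation tank/enc/data | tail -n 1
# compare its guid on both machines; they must be identical
zfs get -H -o value guid tank/enc/data@auto-2023-02-12_00-00
ssh backup zfs get -H -o value guid tank/enc/backups/nas/data@auto-2023-02-12_00-00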

Hardware

FWIW:
  • nas
    • 4 CPUs. QEMU/Proxmox Hypervisor. Ryzen 2200G under the hood
    • 24GB DDR4-2666 RAM
    • 64 GB SCSI virtual disk, backed by an NVMe SSD running ZFS (standard Proxmox)
    • 6 drives in 3 different capacities, striped mirrors (I am aware of the redundancy implications). LSI HBA passthrough
  • backup
    • 4 CPUs. QEMU/Proxmox Hypervisor. i7 (Coffee Lake?) under the hood
    • 8GB DDR4-2666 RAM
    • 64 GB SCSI virtual disk, backed by an NVMe SSD running ZFS (standard Proxmox)
    • 1× 12 TB USB drive, SCSI passthrough. Native AIO


Appreciate any help. There's a good chance I've messed up the setup.
 

0x3d638c8e

It appears I cannot edit the OP, so apologies for the double post.

I had a replication fail on me just now (allegedly - again, no UI logs - but there is definitely not enough data), with `dmesg` (on the VM/TrueNAS side!) telling me `asyncio_loop` OOM'd, which I find peculiar. The hypervisor has 32 GB, 24 GiB of which go to the VM, leaving around 7 GiB for the host (for any host-related I/O buffering - keep in mind the LSI HBA is passed through, so all ZFS/disk I/O should be handled by the VM's kernel(?), leaving only the boot disk putting stress on the hypervisor).

Memory usage in the VM never exceeded 4 GiB. I'm going to go into the QEMU config, suspecting that memory ballooning might be responsible for part of the erratic behavior. I don't know enough about it to tell whether that makes sense - TrueNAS reported the whole 24 GiB as available in the UI after the kernel message, but I don't know what happens if the hypervisor assigns more memory while a process is running. I'd assume the kernel would be aware of the extra memory and let the replication process use it - but that's just a guess.
[screenshot attachment]
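
If ballooning does turn out to be the culprit, the plan is to pin the VM's memory on the Proxmox side, along these lines (the VM ID is a placeholder):

Code:
# on the Proxmox host: give the VM a fixed 24 GiB and disable the balloon device
qm set 100 --memory 24576 --balloon 0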
 