Is My Data Lost? A ZFS Encryption / Replication Remote Keys Unable to Be Loaded Question

fricker_greg

Explorer
Joined
Jun 4, 2016
Messages
71
Here's what I am trying now; it will take a little while, so I will have to get back to this.

From FrickNASty: zfs send -Rw FrickNASty/Encrypted/PhotoVideo@manual-oldest-snapshots | zfs recv -Fuv Yolen/RemoteBackups/PhotoVideoBackup

When I started this replication, it created the dataset PhotoVideoBackup, which seems to have the appropriate encryption root. Once this finishes, I will try to transfer Yolen/RemoteBackups/PhotoVideo -> Yolen/RemoteBackups/PhotoVideoBackup and see what happens, so the new dataset is up to date with a snapshot that can be used by the replication task. Will update after.
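For reference, here is roughly what I expect the verification and catch-up steps to look like once the full send finishes. This is just a sketch: @auto-newest is a placeholder for whatever newer snapshot both datasets end up sharing, not a real snapshot name.

Code:
# confirm the new dataset kept its own encryption root and key state after the raw receive
zfs get encryptionroot,keystatus Yolen/RemoteBackups/PhotoVideoBackup
# dry-run (-n) the catch-up raw incremental; @auto-newest is a placeholder
zfs send -Rw -nv -I @manual-oldest-snapshots Yolen/RemoteBackups/PhotoVideo@auto-newest
# if the dry run looks sane, do the real thing into the new dataset
zfs send -Rw -I @manual-oldest-snapshots Yolen/RemoteBackups/PhotoVideo@auto-newest | zfs recv -Fuv Yolen/RemoteBackups/PhotoVideoBackup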

[Attached screenshot: Screenshot 2023-06-01 at 6.12.47 PM.png]
 

fricker_greg

@winnielinnie

So I am still having issues. Even backing up locally, I am still getting system crashes when trying to run a ZFS replication task to a local dataset.

For example,

I set up a local task in the GUI under Replication Tasks, and it happily starts moving snapshots (around 19 TB worth, across ~700 snapshots):

FrickNASty/Encrypted/PhotoVideo -> Roshar/formove/PhotoVideo

It runs right up to the exact same point every time: as soon as it gets past @auto-2023-04-07_00-00, it crashes.
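To try to narrow this down outside the GUI, I am thinking of reproducing just the suspect increment by hand, something like the sketch below. The -w flag assumes the task is doing raw sends of the encrypted dataset, and @auto-2023-04-08_00-00 is a placeholder for whatever snapshot actually follows the last good one.

Code:
# dry-run just the increment that seems to trigger the crash
zfs send -w -nv -i @auto-2023-04-07_00-00 FrickNASty/Encrypted/PhotoVideo@auto-2023-04-08_00-00
# send the same increment to /dev/null to see whether the send side alone brings the box down
zfs send -w -i @auto-2023-04-07_00-00 FrickNASty/Encrypted/PhotoVideo@auto-2023-04-08_00-00 > /dev/null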
 

fricker_greg

This is happening on TrueNAS SCALE on my local machine. I tried doing a zfs recv -A Roshar/formove/PhotoVideo to get rid of the in-progress receive state, which succeeded. Then I rolled back Roshar/formove/PhotoVideo to an older snapshot (2023-04-01), which also succeeded, and deleted the snapshots between 2023-04-02 and 2023-06-01 on FrickNASty/Encrypted/PhotoVideo, thinking maybe one of those snapshots was somehow problematic. But as soon as I enable the replication task in the GUI, I get a system crash as it goes to send 2023-06-01. There is no output in the console or on my IPMI to help me diagnose it.
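For the record, the cleanup above amounts to roughly the following; the exact snapshot names are from memory, so treat them as illustrative rather than exact.

Code:
# abort the partially received (resumable) state on the destination
zfs recv -A Roshar/formove/PhotoVideo
# roll the destination back to the last known-good snapshot (-r also discards later snapshots there)
zfs rollback -r Roshar/formove/PhotoVideo@auto-2023-04-01_00-00
# delete the suspect range of snapshots on the source (range syntax: first%last)
zfs destroy FrickNASty/Encrypted/PhotoVideo@auto-2023-04-02_00-00%auto-2023-06-01_00-00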
 

fricker_greg

This was the same snapshot boundary at which my remote system would fail and restart back at the beginning of all of this, so I think this is actually the issue; but for the life of me, I cannot see why.
 

fricker_greg

I even tried making a new ZFS replication task based on hourly snapshots. I renamed the snapshot auto-2023-04-01_00-00 to hourly-auto-2023-04-01_00-00 to match the formatting of the new hourly snapshots and set up the replication task to replicate that. Nope, instant crash.
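The rename itself was just a one-liner along these lines (shown as a sketch; the target name has to match whatever the hourly task's naming scheme actually expects):

Code:
# rename the old snapshot so it matches the hourly task's naming scheme
zfs rename FrickNASty/Encrypted/PhotoVideo@auto-2023-04-01_00-00 FrickNASty/Encrypted/PhotoVideo@hourly-auto-2023-04-01_00-00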
 

fricker_greg

So I will send the newly named hourly-auto-2023-04-01_00-00 to Roshar/formove/PhotoVideo and see if I can layer the incrementals on top of that.

Nope, this caused a system crash. Let me try sending the next snapshot first instead, and just assume that something weird changed about the pool before that point.
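What I mean by layering, roughly, is a manual full send followed by one big incremental, sketched below. @auto-newest is a placeholder for the latest snapshot, and the -F on the first receive assumes the destination dataset already exists and can be overwritten.

Code:
# full raw send of the renamed base snapshot into the destination (overwriting what is there)
zfs send -w FrickNASty/Encrypted/PhotoVideo@hourly-auto-2023-04-01_00-00 | zfs recv -Fuv Roshar/formove/PhotoVideo
# then stack all intermediate snapshots on top of that base
zfs send -w -I @hourly-auto-2023-04-01_00-00 FrickNASty/Encrypted/PhotoVideo@auto-newest | zfs recv -uv Roshar/formove/PhotoVideo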
 
Joined
Oct 22, 2019
Messages
3,587
This is starting to sound like a (new or resurfaced) ZFS bug...

It's neither normal nor expected behavior.
 

fricker_greg

Yeah, it makes little sense, but I have reproduced this behavior multiple times. Is there a way I can monitor the logs around the crash to get a little more info? Unfortunately, looking at the console gives no info, and the video output gives nothing either.
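For what it's worth, here is where I am planning to look for crash data. SCALE is Debian-based, so these are generic Linux locations; I am not sure how much of this actually survives a hard crash on the appliance, or whether kdump is even configured.

Code:
# kernel messages from the previous boot, if the journal is persistent
journalctl -k -b -1
# the tail of the current kernel log, in case anything was captured around the crash window
dmesg -T | tail -n 200
# any kernel crash dumps, if kdump happens to be set up
ls -l /var/crash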
 

fricker_greg

OK, so the transfer from the later snapshots worked (in the GUI) and 18 TB of data were transferred, but I cannot access any files within the dataset, even as root on the host system. Looking at the dataset properties, this is the only dataset on which I use the special metadata vdev to store files smaller than 32K. That is a parameter I changed around fall of 2022, so I am not sure why it would cause an error come April. But the fact that every dataset not using the special metadata vdev for small files works fine makes me wonder whether this is really the issue.

So I added a special metadata vdev to Roshar and am transferring FrickNASty/Encrypted/PhotoVideo -> Roshar/formove/PhotoVideo again to see if this also causes the same crashes.

Interestingly, the replication task had the option "Replicate Dataset Properties" checked, and Yolen/RemoteBackups/PhotoVideo has special_small_blocks set to 32K despite there not being a special metadata vdev on that pool. So I want to try again now without copying the dataset properties. @winnielinnie, am I missing something stupid about ZFS replication from a pool with a special metadata vdev and small-file storage to a pool without one?

I have sufficient mirrors for my special devices.
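For completeness, adding the special vdev to Roshar looked roughly like this; the device names below are placeholders, and the special vdev is mirrored because losing it means losing the pool.

Code:
# add a mirrored special (metadata / small-block) vdev to the destination pool; /dev/sdX and /dev/sdY are placeholders
zpool add Roshar special mirror /dev/sdX /dev/sdY
# confirm the new vdev shows up
zpool status Roshar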
 
Joined
Oct 22, 2019
Messages
3,587
@winnielinnie, am I missing something stupid about ZFS replication from a pool with a special metadata vdev and small-file storage to a pool without one?
Looking at the dataset properties, this is the only dataset on which I use the special metadata vdev to store files smaller than 32K. That is a parameter I changed around fall of 2022, so I am not sure why it would cause an error come April. But the fact that every dataset not using the special metadata vdev for small files works fine makes me wonder whether this is really the issue.
It sounds like this is the issue. I've never used a special vdev, let alone any threshold for "small block special allocation", but perhaps the destination pool also needs a special vdev to house these small blocks. (Otherwise, the dataset property on the destination should be set to the default value of "0".)

This might be what you're witnessing.

It may be possible to exclude that dataset property in the Replication config window under "Properties Exclude".

In this case you would add special_small_blocks to be excluded.

Though, I'm not sure how safe this is, or what other implications might be involved. Nor am I sure this exclusion is even supported in a send/recv.

EDIT: I'm curious. What do you get with this for the dataset(s) in question, on both the source and destination pools:
Code:
zfs get special_small_blocks sourcepool/dataset
zfs get special_small_blocks destinationpool/dataset
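And if the 32K value did ride along with the replication stream, resetting it on the destination might look something like this. (Untested on my end, since I don't use special vdevs, so double-check before running it.)

Code:
# see what the destination datasets actually received
zfs get -r special_small_blocks Yolen/RemoteBackups
# reset the threshold on the destination back to the default of 0
zfs set special_small_blocks=0 Yolen/RemoteBackups/PhotoVideo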
 