SOLVED Full replication of encrypted pool crashes consistently

dxun

Explorer
For the last several days I've been trying (in vain) to successfully replicate a 6 TB encrypted pool to a target pool.

The source and target pools reside on separate machines, which:
- both run the same version of TrueNAS Core (12.0-U6.1)
- both use the same motherboard (Supermicro X10SLM+-F)
- both use an identical LSI SAS 2008 controller card (with identical BIOSes, in IT mode)
- run an almost-identical BIOS version (the "source" system is on `latest-1`, the "target" is on `latest`)
- have slightly different CPUs (Xeon E3-1231 v3 vs. E3-1220 v3)
- both use ECC RAM (32 GB vs. 24 GB)

The source pool is a 4x4 RAID-Z1 fusion pool and is being replicated to a 4x10 TB fusion pool (striped-mirror configuration).
The metadata vdevs on both ends are mirrored SSDs.

I am using the GUI exclusively to drive this replication, as I am not yet proficient in the arts of the ZFS CLI. My understanding (implied by the U6.1 version) is that this release is production-ready.
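(For context, my understanding is that the GUI task is doing something roughly equivalent to a raw ZFS send/receive along the lines of the sketch below - the pool, dataset, snapshot, and host names are placeholders, not my actual layout.)

```sh
# Rough CLI equivalent of an encrypted (raw) replication, as I understand it.
# All names below are placeholders.

# Take a recursive snapshot of the encrypted source dataset.
zfs snapshot -r tank/encrypted@manual-2021-11-27

# A raw send (-w) keeps the data encrypted on the wire, so the target can
# receive it without ever loading the key; -s saves a resume token on the
# receiving side in case the stream is interrupted.
zfs send -R -w tank/encrypted@manual-2021-11-27 | \
    ssh target-host zfs receive -s -F backup/encrypted
```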

The result so far is a (complicated) smorgasbord of crashes (kernel panics) on both ends.
My first thought was faulty RAM somewhere, but these machines have been working flawlessly for at least several years, so I am discounting that for now - a faulty RAM stick might turn up in one machine, but in both? At the same time? I am not sure I am that (un)lucky. I will run memtest86 on both machines very soon, but these consistent crashes are really frustrating: I keep getting to about 80-90% through the replication, only to hit a crash that ruins the target pool. I am currently doing one final replication run, and if that one crashes too, I am definitely falling back to a manual copy just to have at least one decent backup somewhere.

I was able to manually copy the pool's contents from one end to the other. I was also able to complete scrubs on both pools - no errors found. Spot checks playing media files (though inconclusive) showed no problems, even with the largest files (some media is > 50 GB).

I've logged two tickets on the TrueNAS JIRA (https://jira.ixsystems.com/browse/NAS-113477 and https://jira.ixsystems.com/browse/NAS-113491). These tickets go into a much deeper characterisation of what is being tried, what is crashing, and when, so I'd rather not rehash all of that here. Instead, I thought I'd collect some thoughts from the experienced people here - I really appreciate anyone spending the time looking into these. I tried to be as exhaustive/precise as possible, but the amount of detail might be a bit daunting.

If anyone is interested further, please don't hesitate to suggest probable causes or avenues to investigate.
 

quietday

Cadet
Similar issue here. Local replication tasks consistently fail near completion with the error:
`__init__() missing 1 required positional argument: 'warnings'`
Tail of the log with the error:
```
[2021/11/27 16:08:33] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing push from 'pool1/jail' to 'pool2/dir/jail' of snapshot='auto-2021-11-27_09-42' incremental_base=None receive_resume_token=None encryption=False
[2021/11/27 16:08:39] ERROR [replication_task__task_1] [zettarepl.replication.run] For task 'task_1' unhandled replication error TypeError("__init__() missing 1 required positional argument: 'warnings'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 189, in run_replication_tasks
    notify(observer, ReplicationTaskSuccess(replication_task.id))
TypeError: __init__() missing 1 required positional argument: 'warnings'
```
It's not at a scale we could manually verify. I need to better understand how 'replication' works.

zfs send/receive, rsync, etc. all complete as expected.
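For reference, this is roughly the manual pipeline that does complete for me (the snapshot name here is just an example, not my actual naming scheme):

```sh
# Manual local replication of the same dataset pair - completes without error.
# The snapshot name is only an example.
zfs snapshot -r pool1/jail@manual-test
zfs send -R pool1/jail@manual-test | zfs receive -F pool2/dir/jail
```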
 

dxun

Explorer
After a full week of troubleshooting and at least 10 failed replications (of which at least 70% left the target pool in an unusable state, and one even managed to almost render the source pool unusable as well), I was able to replicate the pool to another machine - but not in a single pass.
I was able to determine that the replication tasks seem to crash at particular points in the replication, which led me to suspect a few datasets that have some (or all) of the following:
  • nested datasets with different record sizes (parent dataset record size 1 MB, child dataset record size 16 kB)
  • long paths (potentially exceeding 255 characters)
  • "special" characters in filenames (examples include Japanese characters and a superscript "2")
I cannot claim decisively that any or all of these caused the crashes - but I did observe that I was able to transfer these datasets (in isolation) through replication tasks (both local and remote) after trimming the path lengths and removing the "special" characters from filenames (possibly from directory names as well, but I wasn't keeping close track while trimming). A rough sketch of the checks I used is below.
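For anyone who wants to look for the same characteristics in their own datasets, these are roughly the kinds of checks I ran (the pool and mount-point names are placeholders, not my actual layout):

```sh
# "tank" / /mnt/tank are placeholders - substitute your own pool.

# 1. Nested datasets with differing record sizes
zfs get -r -t filesystem recordsize tank

# 2. Paths longer than 255 characters
find /mnt/tank | awk 'length($0) > 255'

# 3. Paths containing non-ASCII ("special") characters
find /mnt/tank | LC_ALL=C grep '[^ -~]'
```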

After the replication was successful, I ran scrubs on all the pools that participated (there is one source pool [RAID-Z2] and two target pools [one striped mirror and one single-drive]) - no errors found on any of them. Manual spot checks find no problems on any of the three either.
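(Nothing fancy about the scrubs - just the standard commands, along these lines, with "tank" as a placeholder pool name:)

```sh
# Repeat for each pool that took part in the replication; "tank" is a placeholder.
zpool scrub tank
zpool status -v tank   # once finished, check the "scan:" line and the error columns
```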

One additional note - 4 passes of MemTest86 Pro (v9.2) came back clean on each machine. As I write this, I am running a full 32-pass test on each machine to fully exclude any memory issues.

I am at a loss to explain this. So far I cannot attribute it to any piece of hardware in these machines (you can see the source machine in my signature - the target is very close to it). In particular, the LSI controller is a tried-and-true SAS2008 in IT mode with 20.00.07 firmware. Most of the hardware is "server-grade" vintage 2014, which, in my mind, should make it rock-solid. The disks are either "NAS-grade" (the source drives) or "enterprise-grade" (the target drives, i.e. Exos 10 TB).

My distinct impression of encrypted replication so far is that it should not have left beta - and if it is targeting expert users only (i.e. requires CLI proficiency), that should be stated. The state of the GUI (as of 12.0-U6.1) is not helping at all, as it seems incomplete - for example, creating a PULL replication task and then switching to Advanced as a last step seems to lose the configuration choices from the previous steps, leaving the user scratching their head as to how to get the source pool data displayed. Aside from the crashes, I have seen dataset properties not being correctly propagated to the target machine (probably because of those crashes), cryptic error messages, and/or messages whose suggested actions lead to data loss.

A good portion of these issues can probably be attributed to the awkward interaction with the idiosyncrasies of native ZFS replication, which by itself has thrown off even experienced/expert users (see, for example, discussions of workaround workflows that involve creating additional root datasets to gain better control and/or more "intuitive" behaviour).
 

dxun

Explorer
Marking this thread as SOLVED, as there are indications this will be fixed in the upcoming TrueNAS release (U7).
See the comment(s) on the first JIRA ticket.
 