Sending from client PC (Syncoid) catastrophically crashes TrueNAS server!

Using an encrypted test dataset, I was able to run the first Syncoid replication from a client PC (Linux) to a TrueNAS 12 server (12.0-U5.1).
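
For context, the push looks something like this (placeholder host and dataset names, not my real ones; --sendoptions=w is shown only because raw sends are what keep a natively encrypted dataset encrypted end-to-end):

Code:
# Push the encrypted test dataset from the Linux client to the TrueNAS box over SSH.
# --sendoptions=w passes "-w" (raw send) to zfs send, so the data stays encrypted
# in transit and lands as a natively encrypted dataset on the server.
syncoid --sendoptions=w tank/testdata root@truenas.local:primary-pool/backups/testdata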

It completed successfully and, as expected, left matching base snapshots on the client and server sides.

E.g., @syncoid_linuxpc_2021-08-18:13:44:46-GMT-04:00

According to the syncoid documentation, any subsequent replication will:
  1. Create a new @syncoid-blahblah-timestamp snapshot
  2. Do an incremental send of the changes between the last common base snapshot and the newly created snapshot (roughly as sketched below)
  3. Safely prune any obsolete snapshots
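
In raw ZFS terms, an incremental run boils down to something like this (placeholder dataset names and timestamps; syncoid handles the snapshot naming and pruning itself, and whether it uses -i or -I depends on its options):

Code:
# 1. Take a new sync snapshot on the source
zfs snapshot tank/testdata@syncoid_linuxpc_2021-08-19:10:00:00-GMT-04:00
# 2. Raw incremental send from the last common snapshot to the new one
zfs send -w -I tank/testdata@syncoid_linuxpc_2021-08-18:13:44:46-GMT-04:00 \
        tank/testdata@syncoid_linuxpc_2021-08-19:10:00:00-GMT-04:00 \
    | ssh root@truenas.local zfs receive primary-pool/backups/testdata
# 3. Prune sync snapshots that are no longer needed as a common base
zfs destroy tank/testdata@syncoid_linuxpc_2021-08-18:13:44:46-GMT-04:00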

Sounds good.

Problem is, these replications are causing the entire TrueNAS server to catastrophically crash and reboot.

I'm getting flashbacks to a kernel panic bug of a similar nature: https://jira.ixsystems.com/browse/NAS-107636

Which log files can I review to figure out what triggers this crash? Every log file I read suddenly "skips" a few minutes with no indication of an abrupt crash. The very next lines show a normal bootup procedure.
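
So far I've only checked the obvious spots; treat these as guesses at where panic evidence would land on a FreeBSD-based TrueNAS Core box, since the exact paths may differ:

Code:
ls -l /var/crash/             # default FreeBSD savecore target for kernel dumps
ls -l /data/crash/            # location FreeNAS/TrueNAS has historically used for textdumps
grep -i panic /var/log/messages
dmesg | head -n 50            # only covers the current boot, so nothing from before the crash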

It happens every time I try to incrementally replicate from the client PC.
 
Now I'm seeing the pool marked as UNHEALTHY, yet this is the output I get from checking it with zpool status -v:

Code:
errors: Permanent errors have been detected in the following files:

        <0x67e>:<0x0>

What kind of "file" is <0x67e>:<0x0>?
 
Code:
  pool: primary-pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Wed Aug 18 14:24:12 2021
        839G scanned at 4.74G/s, 19.3G issued at 112M/s, 3.67T total
        0B repaired, 0.51% done, 09:30:28 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        primary-pool                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/c1867b4a-37e3-11eb-a3ec-b42e99aad10f  ONLINE       0     0     0
            gptid/c18a4369-37e3-11eb-a3ec-b42e99aad10f  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/3d77f32d-3813-11eb-a3ec-b42e99aad10f  ONLINE       0     0     0
            gptid/3d8d3508-3813-11eb-a3ec-b42e99aad10f  ONLINE       0     0     0


Currently running a scrub.

This is disturbing, considering that ZFS is meant to be a stalwart of resiliency. How could a failed replication task bring an entire pool to its knees?

No hardware errors, no checksum errors. I simply tried to replicate a test dataset from a client PC to the server.
 
UPDATE: After a scrub, the pool is considered "HEALTHY" and reports no errors.
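
For anyone who hits the same <0x...>:<0x0> entries, the recovery here was nothing more than a scrub and a re-check; zpool clear is listed only as the usual follow-up if stale entries linger (I didn't need it):

Code:
zpool scrub primary-pool        # full scrub; the pool came back HEALTHY afterwards
zpool status -v primary-pool    # the <0x67e>:<0x0> entry was gone
zpool clear primary-pool        # resets error counters if stale entries persist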

This tells me that the send/recv of the encrypted dataset may have created some artifact before or during the system crash, which zpool status then reported as a permanent "error".

A similar issue happened before with TrueNAS 12's built-in replication tasks and natively encrypted datasets. (Multiple users, multiple reports, multiple confirmations.)

I'm nervous to try this again, as it not only abruptly crashes the entire system but also requires a full scrub to get back up and running. Even for the sake of submitting a bug report and crash logs, it's not a risk I'm willing to take with my data. (Yes, I have backups, but I don't want to have to resort to restoring from them just to troubleshoot a bug or try to send/recv from a client PC.)

Perhaps there's something about native encryption that is triggering this (once again).
 