Kernel Panic when trying to destroy dataset

sammael

Explorer
Joined
May 15, 2017
Messages
76
Hi,
I have two TrueNAS SCALE systems, with "truenas1" replicating two datasets to "truenas2". This has worked without any issue for months and survived all the upgrades. Both are currently on 22.12.3.2.

Yesterday I woke up to "truenas2" being in a reboot loop. Long story short, I narrowed it down to, of all things, one replication task: as it tried to reconnect, it was crashing "truenas2". Since there is inexplicably no convenient way to stop a running replication task, I did what most of the threads I could find suggested and killed the zfs send / zfs receive processes on both machines. This stopped the crashing, but whenever I tried to manually start the replication it would crash "truenas2" again. Since the data is also backed up elsewhere, I decided to delete the snapshot task and replication task on "truenas1" and the dataset on "truenas2", and re-replicate it.
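For anyone who ends up here in the same situation, what I ran was along these lines (a sketch; the exact process names may differ between versions, so check the ps output before killing anything):

```shell
# List any running zfs send/receive processes, excluding this shell and awk
PIDS=$(ps ax -o pid= -o args= | awk -v self="$$" \
    '/zfs (send|recv|receive)/ && !/awk/ && $1 != self {print $1}')

# Polite SIGTERM first
for pid in $PIDS; do kill "$pid" 2>/dev/null; done
sleep 2

# SIGKILL only for anything still alive after the grace period
for pid in $PIDS; do
    if kill -0 "$pid" 2>/dev/null; then kill -9 "$pid"; fi
done
```

Note that a zfs destroy or receive stuck inside the kernel (state D, like the trace below) will ignore even SIGKILL; this only helps with processes that are still responsive.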

Here a host of issues began. Trying to delete the dataset produces a kernel panic, the zfs destroy process becomes stuck and unkillable, and the machine is unable to reboot or shut down (I left it sitting for an hour after issuing the shutdown command). Everything else works, but the panic keeps repeating roughly every 5 minutes:

Code:
[10392.560613] INFO: task txg_sync:3396 blocked for more than 1208 seconds.
[10392.560629]       Tainted: P           OE     5.15.107+truenas #1
[10392.560637] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10392.560642] task:txg_sync        state:D stack:    0 pid: 3396 ppid:     2 flags:0x00004000
[10392.560652] Call Trace:
[10392.560656]  <TASK>
[10392.560662]  __schedule+0x2f0/0x950
[10392.560674]  schedule+0x5b/0xd0
[10392.560681]  vcmn_err.cold+0x66/0x68 [spl]
[10392.560702]  ? spl_kmem_cache_alloc+0x36/0x100 [spl]
[10392.560718]  ? bt_grow_leaf+0xdc/0xe0 [zfs]
[10392.560841]  ? pn_free+0x30/0x30 [zfs]
[10392.561000]  ? zfs_btree_find_in_buf+0x59/0xb0 [zfs]
[10392.561118]  zfs_panic_recover+0x6d/0x90 [zfs]
[10392.561296]  range_tree_add_impl+0x168/0x570 [zfs]
[10392.561454]  ? mutex_lock+0xe/0x30
[10392.561461]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10392.561628]  ? list_head+0x9/0x30 [zfs]
[10392.561748]  metaslab_free_concrete+0x115/0x250 [zfs]
[10392.561902]  metaslab_free_impl+0xad/0xe0 [zfs]
[10392.562055]  metaslab_free+0x168/0x190 [zfs]
[10392.562212]  zio_free_sync+0xde/0xf0 [zfs]
[10392.562398]  dsl_scan_free_block_cb+0x66/0x1b0 [zfs]
[10392.562546]  bpobj_iterate_blkptrs+0x102/0x320 [zfs]
[10392.562664]  ? dsl_scan_free_block_cb+0x1b0/0x1b0 [zfs]
[10392.562810]  bpobj_iterate_impl+0x243/0x3a0 [zfs]
[10392.562928]  ? dsl_scan_free_block_cb+0x1b0/0x1b0 [zfs]
[10392.563074]  dsl_process_async_destroys+0x2cf/0x570 [zfs]
[10392.563221]  dsl_scan_sync+0x1dd/0x8e0 [zfs]
[10392.563369]  ? kfree+0x1fc/0x250
[10392.563375]  spa_sync_iterate_to_convergence+0x11f/0x1e0 [zfs]
[10392.563537]  spa_sync+0x2e9/0x5d0 [zfs]
[10392.563696]  txg_sync_thread+0x229/0x2a0 [zfs]
[10392.563866]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[10392.564034]  thread_generic_wrapper+0x59/0x70 [spl]
[10392.564052]  ? __thread_exit+0x20/0x20 [spl]
[10392.564067]  kthread+0x127/0x150
[10392.564074]  ? set_kthread_struct+0x50/0x50
[10392.564078]  ret_from_fork+0x22/0x30
[10392.564087]  </TASK>


I tried: multiple reboots (no change), disabling all apps and sharing services (none of which access the pool in question), scrubbing (no errors), crying (no effect), and deleting the files from inside the dataset to at least recover the space (no can do: read-only filesystem).

I have also discovered that, despite deleting the periodic snapshot task, the replication task, and all snapshots of the dataset in question from "truenas1" (and despite there NEVER having been a periodic snapshot task on "truenas2"), snapshots of the dataset keep appearing on "truenas2" between reboots. Trying to delete these produces a kernel panic as well; after a reboot they seem to be gone, but then a new one appears.

As the other replicated dataset is 26TB and I only have a 1G network, I'd rather find a solution other than destroying the pool altogether (which I think should "fix" it?).

Any suggestion welcome!
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
I agree trashing the pool will likely fix the problem, which is probably some sort of pool corruption, but keep that as a last resort.

A panic is usually the last-gasp message just before the kernel crashes. What you are seeing is a kernel stack backtrace from a blocked kernel task: the transaction group sync, which normally happens every 5 seconds.

I expect you have checked around these messages and that there aren't any other kernel messages near where this is happening, maybe indicating some other hardware or kernel software issue?

I also expect you have double-checked tn1 for replication tasks pushing to tn2. Hopefully you have also checked, just for completeness, whether there is a replication task on tn2 pulling data from tn1, and that there are no snapshot tasks on tn2 which might generate snapshots.

Seems like you need to stop the new snapshots appearing first; you don't need extra snapshots complicating things. You say there is a disabled/deleted replication task on tn1 that would send snaps to tn2, but you are still getting snaps appearing? Are the snaps normally replicated over ssh? Can you disable sshd on tn2? Can you disable the authentication key on tn2 that tn1 uses to send snaps? Does this stop the snaps appearing? If so, then it seems you have more work to track down the snaps on tn1.
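If it helps, one way to tell whether snapshots are still appearing on tn2 is to list them sorted by creation time and watch for new names between checks (the dataset name below is just an example):

```shell
# Example dataset name; substitute the real replication target on tn2
DATASET="backup/media"

if command -v zfs >/dev/null 2>&1; then
    # -H: no header, -s creation: oldest first, -r: include children;
    # rerun this after a few minutes and diff the output
    zfs list -H -t snapshot -o name,creation -s creation -r "$DATASET"
else
    echo "zfs not found; run this on the TrueNAS box itself"
fi
```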
 

sammael

Explorer
Joined
May 15, 2017
Messages
76
There weren't any other kernel messages apart from the hung task. I tried disabling the ssh service and removing the key, to no avail. There are no, nor have there ever been, tasks pushing from tn2 to tn1; it's all one way, from tn1 to tn2. tn2 is just a raidz2 target for backup, and it runs TrueNAS because I wanted to leverage ZFS replication, plus I'm used to it and run ~15 apps on each box on their own mirrored M.2 SSD pools.

I've seen some threads about replication tasks causing crashes/panics, on Reddit and here as well, and I have to confess that as a hobbyist homelab user, setting up a replication task was the most hostile, user-unfriendly thing I've experienced in TrueNAS yet. Not to mention how the auth keys seem to go missing every other upgrade, with replication failing and needing to be set up from scratch despite no data having changed on the source dataset. For comparison, my rsync tasks are literally "set and forget" and have been syncing data for years, even back from when "truenas1" was Core, not SCALE.

In the end I just destroyed the pool and recreated it with one dataset, for the sole use as the target of the replication task from "truenas1" (all the others had already been converted to rsync tasks; I just didn't want to deal with moving 26TB of data around, but that's moot now). Should that fail in the future, I'll just convert that one to rsync as well and forget ZFS replication exists.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
Problem solved it seems.

I agree replication setup can be convoluted: you have to do the SSH key setup, then the SSH connection setup, then the replication task setup, in that order.

I think ZFS replication is a better solution in general, but rsync works more easily for you, and it is your system.
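For what it's worth, a single push replication cycle boils down to roughly the following at the CLI (dataset and host names are examples; the GUI layers incremental streams, retention, and resume handling on top of this):

```shell
# Example names; on a real system the middleware manages these for you
SNAP="tank/media@manual-2023-07-19"

if command -v zfs >/dev/null 2>&1; then
    # -w sends a raw stream (source-side encryption survives intact);
    # -u receives the dataset without mounting it on the target
    zfs snapshot "$SNAP" &&
        zfs send -w "$SNAP" | ssh tn2 zfs receive -u backup/media || true
fi
```

Incremental runs would add `-i previous-snapshot` to the send; the point is just that the moving parts are snapshot, send, a transport, and receive.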
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Trying to delete the dataset produces kernel panic, the zfs destroy process becomes stuck and unkillable, and the machine is unable to reboot or shut down (I left it sitting for an hour after issuing the shutdown command.) Everything else works, but the panic keeps repeating about every 5 minutes-ish

Please do fill out a bug report (see "Report a Bug" at the top of the page) and post the issue number here if possible. It's important that ZFS avoid bringing the system down; pool corruption may be an exception, but even there it is desirable for the system to be as resilient as possible.
 

sammael

Explorer
Joined
May 15, 2017
Messages
76
@samarium, indeed, and as it is just a backup of my movies, not some critical production data, I just went with rsync, since I had already rsync'd 5TB into tn2 before I tried to re-replicate. But I forgot to encrypt the pool, and the source is encrypted, and I just don't have it in me to redo it again. I mean, it's supposed to be a backup, yet I had to wipe it, so that's no good. Rsync all the way for me now :)

@jgreco https://ixsystems.atlassian.net/browse/NAS-122883
 

ikarlo

Dabbler
Joined
Apr 21, 2021
Messages
18
Hi,
I have the same problem with the exact same setup (both systems, source and destination, are on TrueNAS SCALE 22.12.3.2):
ZFS panics on the target system while trying to destroy an old snapshot, whether manually or automatically through retention on the replication jobs.

After another reboot, TrueNAS tries to mount the pool without success, with the same kernel messages as Sammael's.

This is a big problem, please please find a fix as soon as possible.

Thanks
Carlo
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
Hi,
I have the same problem with the exact same setup (both systems, source and destination, are on TrueNAS SCALE 22.12.3.2):
ZFS panics on the target system while trying to destroy an old snapshot, whether manually or automatically through retention on the replication jobs.

After another reboot, TrueNAS tries to mount the pool without success, with the same kernel messages as Sammael's.

This is a big problem, please please find a fix as soon as possible.

Thanks
Carlo
You should comment on the Jira ticket if you have the same issue.
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
I'm also facing the same issue when pushing a replication from TN1 (22.12.3.2) to TN2 (Core 13.0-U5.2).
However, at the time of my initial push, my TN1 box also resets.

Is the jira ticket private? I cannot see the contents
 

ikarlo

Dabbler
Joined
Apr 21, 2021
Messages
18
I'm also facing the same issue when pushing a replication from TN1 (22.12.3.2) to TN2 (Core 13.0-U5.2).
However, at the time of my initial push, my TN1 box also resets.

Is the jira ticket private? I cannot see the contents

I don't know; since yesterday I can't read the ticket anymore.
 

sammael

Explorer
Joined
May 15, 2017
Messages
76
I can still read the ticket (it'd be weird if I couldn't, as I was the one who opened it) and there's nothing new, apart from someone being assigned to it.

I'm also surprised it doesn't get more coverage or a faster resolution; to me it seems like quite a serious issue. Indeed, in all my years of using TrueNAS since Core 9, this is the only issue where I "lost" data (I had to destroy the whole pool on the target TrueNAS; quotes because what I lost was a backup of the data, so in reality I didn't lose anything, but it's still troublesome).

I've pretty much lost all faith in ZFS replication (an absolutely unbelievable, massive pain to set up compared to rsync), and even if they "fix" it I'm sticking with rsync, which has never failed me like this.

edit: the Jira ticket lists the priority as Low, so /shrug
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'm also surprised it doesn't get more coverage or a faster resolution; to me it seems like quite a serious issue. Indeed, in all my years of using TrueNAS since Core 9, this is the only issue where I "lost" data (I had to destroy the whole pool on the target TrueNAS; quotes because what I lost was a backup of the data, so in reality I didn't lose anything, but it's still troublesome).

These things usually aren't simple problems, and rushing to find a solution without actually fully understanding what is going on is rarely advisable.

I've pretty much lost all faith in ZFS replication (an absolutely unbelievable, massive pain to set up compared to rsync), and even if they "fix" it I'm sticking with rsync, which has never failed me like this.

Replication is awesome for some specific use cases, as it is more easily able to take advantage of things like only copying modified blocks; this means it can handle block storage, snapshots, and other similarly tricky ZFS features. rsync can't do a lot of that, but on the other hand, rsync is a whole lot less fragile. If you're looking for file-based copies/backups, and can cope with things like snapshots on your own, then rsync definitely has a lot of upsides.

I find it easier to design file storage around rsync's relatively minor quirks. The place it can hurt you is with large files: rsync can't easily recognize that a file was moved, and may want to recopy the entire file.
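The rename problem is easy to demonstrate locally; this is just an illustrative sketch with throwaway directories, not a TrueNAS task:

```shell
# Two throwaway trees to stand in for source and destination
SRC=$(mktemp -d)
DST=$(mktemp -d)
printf 'x%.0s' $(seq 1 1000) > "$SRC/big.bin"

if command -v rsync >/dev/null 2>&1; then
    rsync -a "$SRC/" "$DST/"                 # first pass copies big.bin
    mv "$SRC/big.bin" "$SRC/renamed.bin"     # same content, new name
    # second pass deletes the old copy and re-sends the file in full,
    # because rsync matches files by path, not by content
    rsync -a --delete "$SRC/" "$DST/"
fi
```

With a 26TB media library of large, rarely renamed files this rarely bites; with datasets that get reorganized often, it can mean recopying a lot of data.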
 

ikarlo

Dabbler
Joined
Apr 21, 2021
Messages
18
Until the fix is released, what is the best way to restore replication without disabling encryption on the source dataset?

I'm thinking about disabling "Include dataset properties" in the replication settings.
That way, I believe, an unencrypted stream arrives at the destination, avoiding the panic.
What do you think?
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
Until the fix is released, what is the best way to restore replication without disabling encryption on the source dataset?

I'm thinking about disabling "Include dataset properties" in the replication settings.
That way, I believe, an unencrypted stream arrives at the destination, avoiding the panic.
What do you think?
This is what I am currently doing:

- Disabled "Include dataset properties"
- Enabled "Encryption" and set a key or passphrase
- Keep in mind the dataset has to be unlocked on the remote system after every boot unless the key is saved.

I had to destroy my target pool completely to get rid of the broken datasets.
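At the CLI level, my understanding is that this is roughly equivalent to the following (names are examples; a plain, non-raw stream received under a dataset that is itself encrypted gets re-encrypted with the target's own key):

```shell
# Example names; 'backup/encrypted' would be created on tn2 with its own key
SNAP="tank/media@auto-2023-07-19"
TARGET="backup/encrypted/media"

if command -v zfs >/dev/null 2>&1; then
    # No -p and no -w: source properties (including encryption settings)
    # are omitted, and the plaintext stream inherits encryption from the
    # target's encrypted parent dataset
    zfs send "$SNAP" | ssh tn2 zfs receive -u "$TARGET" || true
fi
```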
 

TempleHasFallen

Dabbler
Joined
Jan 27, 2022
Messages
34
Do you mean that the pool on the remote system can remain encrypted even if it contains unencrypted replicated datasets?
No, I'm referring to setting encryption in your replication job as such:
[screenshot: the Encryption options in the replication task settings]


Basically, it will encrypt the data with the set passphrase/key before it is written to the target disk.
On the first replication, the dataset on the target will show as unlocked; afterwards, you will have to unlock the dataset with the key/passphrase after each boot in order for future replications to work.
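From the shell, unlocking after a reboot looks something like this (the GUI's "Unlock" dialog does the same; the dataset name is an example):

```shell
DATASET="backup/encrypted/media"   # example name

if command -v zfs >/dev/null 2>&1; then
    # load-key prompts for the passphrase unless keylocation points to a
    # key file; mount then makes the dataset usable again
    zfs load-key "$DATASET" && zfs mount "$DATASET" || true
fi
```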
 

ikarlo

Dabbler
Joined
Apr 21, 2021
Messages
18
Ok, so I can recreate the target pool encrypted, with its own key, as before?
 