sammael
Hi,
I have two TrueNAS SCALE systems, with "truenas1" replicating two datasets to "truenas2". This has worked without any issue for months and survived all the upgrades. Both are currently on 22.12.3.2.
Yesterday I woke up to "truenas2" stuck in a reboot loop. Long story short, I narrowed it down to, of all things, one replication task: every time it tried to reconnect, it crashed "truenas2". Since there is inexplicably no convenient way to stop a running replication task, I did what most of the threads I could find suggested and killed the zfs send / zfs receive processes on both machines. That stopped the crashing, but whenever I tried to start the replication manually, it crashed "truenas2" again. Since the data is also backed up elsewhere, I decided to delete the periodic snapshot task and the replication task on "truenas1", delete the dataset on "truenas2", and re-replicate it from scratch.
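For reference, this is roughly how I found and killed the streams on each box (the PID is just a placeholder):
Code:
# list any running replication streams
ps -ef | grep -E 'zfs (send|recv|receive)' | grep -v grep
# then kill the matching PIDs (SIGKILL only if SIGTERM is ignored)
kill <PID>
kill -9 <PID>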
That deletion is where a host of issues began. Trying to delete the dataset produces a kernel panic, the zfs destroy process becomes stuck and unkillable, and the machine can neither reboot nor shut down (I left it sitting for an hour after issuing the shutdown command). Everything else works, but the panic keeps repeating roughly every 5 minutes:
Code:
[10392.560613] INFO: task txg_sync:3396 blocked for more than 1208 seconds.
[10392.560629] Tainted: P OE 5.15.107+truenas #1
[10392.560637] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10392.560642] task:txg_sync state:D stack: 0 pid: 3396 ppid: 2 flags:0x00004000
[10392.560652] Call Trace:
[10392.560656] <TASK>
[10392.560662] __schedule+0x2f0/0x950
[10392.560674] schedule+0x5b/0xd0
[10392.560681] vcmn_err.cold+0x66/0x68 [spl]
[10392.560702] ? spl_kmem_cache_alloc+0x36/0x100 [spl]
[10392.560718] ? bt_grow_leaf+0xdc/0xe0 [zfs]
[10392.560841] ? pn_free+0x30/0x30 [zfs]
[10392.561000] ? zfs_btree_find_in_buf+0x59/0xb0 [zfs]
[10392.561118] zfs_panic_recover+0x6d/0x90 [zfs]
[10392.561296] range_tree_add_impl+0x168/0x570 [zfs]
[10392.561454] ? mutex_lock+0xe/0x30
[10392.561461] ? __raw_spin_unlock+0x5/0x10 [zfs]
[10392.561628] ? list_head+0x9/0x30 [zfs]
[10392.561748] metaslab_free_concrete+0x115/0x250 [zfs]
[10392.561902] metaslab_free_impl+0xad/0xe0 [zfs]
[10392.562055] metaslab_free+0x168/0x190 [zfs]
[10392.562212] zio_free_sync+0xde/0xf0 [zfs]
[10392.562398] dsl_scan_free_block_cb+0x66/0x1b0 [zfs]
[10392.562546] bpobj_iterate_blkptrs+0x102/0x320 [zfs]
[10392.562664] ? dsl_scan_free_block_cb+0x1b0/0x1b0 [zfs]
[10392.562810] bpobj_iterate_impl+0x243/0x3a0 [zfs]
[10392.562928] ? dsl_scan_free_block_cb+0x1b0/0x1b0 [zfs]
[10392.563074] dsl_process_async_destroys+0x2cf/0x570 [zfs]
[10392.563221] dsl_scan_sync+0x1dd/0x8e0 [zfs]
[10392.563369] ? kfree+0x1fc/0x250
[10392.563375] spa_sync_iterate_to_convergence+0x11f/0x1e0 [zfs]
[10392.563537] spa_sync+0x2e9/0x5d0 [zfs]
[10392.563696] txg_sync_thread+0x229/0x2a0 [zfs]
[10392.563866] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[10392.564034] thread_generic_wrapper+0x59/0x70 [spl]
[10392.564052] ? __thread_exit+0x20/0x20 [spl]
[10392.564067] kthread+0x127/0x150
[10392.564074] ? set_kthread_struct+0x50/0x50
[10392.564078] ret_from_fork+0x22/0x30
[10392.564087] </TASK>
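Judging by the trace, the panic seems to happen while ZFS is trying to finish the interrupted destroy in the background (dsl_process_async_destroys is in the call stack). If it helps with diagnosis, I can watch how far the background freeing gets between panics with something like this (the pool name is a placeholder for mine):
Code:
# amount of data still queued to be freed by the async destroy
zpool get freeing tank
# general pool health / errors
zpool status -v tank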
I tried: multiple reboots (no change), disabling all apps and sharing services (none of which touch the pool in question), scrubbing (no errors), crying (no effect), and deleting the files inside the dataset to at least recover the space (no can do - read-only filesystem).
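For completeness, the read-only state could be confirmed with something like this (the dataset name is a placeholder; if I remember right, the replication task sets the destination read-only anyway):
Code:
zfs get readonly,mounted tank/replicated-dataset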
I have also discovered that, despite having deleted the periodic snapshot task, the replication task, and all snapshots of the dataset in question from "truenas1", and despite the fact there has NEVER been a periodic snapshot task on "truenas2", snapshots of the dataset keep appearing on "truenas2" between reboots. Trying to delete these produces a kernel panic as well; after a reboot they seem to be gone, but then a new one appears again.
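If it helps narrow things down, I can list the snapshots with creation times and check for holds after each reboot, roughly like this (names are placeholders):
Code:
zfs list -t snapshot -r -o name,creation,used tank/replicated-dataset
zfs holds tank/replicated-dataset@auto-2023-08-06_00-00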
As the other replicated dataset is 26 TB and I only have a 1G network, I'd rather find a solution other than destroying the pool altogether (which I think should "fix" it?).
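For context, even a best-case full re-send over that link would take a couple of days; rough back-of-envelope math (ignoring overhead and assuming the link stays saturated):
Code:
# 26 TB at ~1 Gbit/s ≈ 125 MB/s
echo $(( 26 * 1000**4 / (125 * 1000**2) ))   # ≈ 208000 seconds, i.e. ~2.4 days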
Any suggestions welcome!