Kernel Panic on NFS write

Thalhammer

Cadet
Joined
Mar 24, 2020
Messages
3
Hi,
Soooo I tried it, loved it, broke it.
Setup:
KVM virtual machine (q35)
Boot drive: SSD virtio disk
Data drives: 8x 8 TB connected to an LSI SAS2308 HBA (HBA passed through to the VM)
Memory: 16 GB ECC
CPU: 14 cores of a Xeon E5-2620 v4 @ 2.10 GHz
[ 1436.910692] VERIFY(tree->avl_numnodes > 0) failed
[ 1436.912523] PANIC at avl.c:750:avl_remove()
[ 1436.914103] Showing stack for process 42568
[ 1436.916038] CPU: 6 PID: 42568 Comm: ganesha.nfsd Tainted: P OE 5.9.0-1-amd64 #1 Debian 5.9.1-1
[ 1436.918773] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
[ 1436.922650] Call Trace:
[ 1436.924074] dump_stack+0x6b/0x88
[ 1436.925631] spl_panic+0xd4/0xfc [spl]
[ 1436.927342] ? _cond_resched+0x16/0x40
[ 1436.929103] ? _cond_resched+0x16/0x40
[ 1436.931032] ? queued_spin_unlock+0x5/0x10 [zfs]
[ 1436.932859] ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 1436.934819] ? txg_list_add+0x99/0xd0 [zfs]
[ 1436.936385] ? zilog_dirty+0x50/0xc0 [zfs]
[ 1436.938373] ? queued_spin_unlock+0x5/0x10 [zfs]
[ 1436.940247] ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 1436.942419] ? zil_itx_assign+0x1cc/0x320 [zfs]
[ 1436.943849] ? spl_kmem_cache_alloc+0x83/0x260 [spl]
[ 1436.945449] avl_remove+0x297/0x2b0 [zavl]
[ 1436.947393] zfs_rangelock_exit+0x142/0x230 [zfs]
[ 1436.949372] zfs_write+0x593/0xf10 [zfs]
[ 1436.951056] zpl_write_common_iovec+0xac/0x120 [zfs]
[ 1436.953044] zpl_iter_write_common+0x86/0xb0 [zfs]
[ 1436.954982] zpl_iter_write+0x4e/0x80 [zfs]
[ 1436.956251] do_iter_readv_writev+0x160/0x1d0
[ 1436.957576] do_iter_write+0x7c/0x1b0
[ 1436.959184] vfs_writev+0xa0/0xf0
[ 1436.960278] ? __kmalloc+0x10b/0x260
[ 1436.961430] __x64_sys_pwritev+0xad/0xf0
[ 1436.962808] do_syscall_64+0x33/0x80
[ 1436.963885] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1436.965257] RIP: 0033:0x7febee873ea0
[ 1436.966641] Code: 3c 24 48 89 4c 24 18 e8 5e fe f8 ff 4c 8b 54 24 18 8b 3c 24 45 31 c0 41 89 c1 8b 54 24 14 48 8b 74 24 08 b8 28 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 30 44 89 cf 48 89 04 24 e8 8c fe f8 ff 48 8b
[ 1436.971466] RSP: 002b:00007feb2e601ef0 EFLAGS: 00000246 ORIG_RAX: 0000000000000128
[ 1436.973712] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007febee873ea0
[ 1436.975772] RDX: 0000000000000001 RSI: 00007feaf01011c0 RDI: 000000000000001f
[ 1436.977764] RBP: 00007febbc004c70 R08: 0000000000000000 R09: 0000000000000000
[ 1436.979811] R10: 0000000050a00000 R11: 0000000000000246 R12: 00007feaf0101190
[ 1436.982169] R13: 00007feaf01011e0 R14: 00007febeeb27fb8 R15: 0000000000000000

This happens when writing to an NFS share on the pool at higher speeds (slow I/O seems to be fine, so maybe a race condition?).
The file being written at the time ends up unreadable (even after a reboot), and attempts to access it cause various funny things to happen.
Removing said file seems to clear up the weird behaviour and restore the pool to a functioning state.
A scrub does not find any errors (neither with the bad file in place nor after removing it). However, when the panic happens it hangs ganesha.nfsd completely, to the point where I have to hard-cycle the VM. The same pool and setup works fine with TrueNAS CORE (I only switched because I use only basic features and virtio networking is poor on FreeBSD). I am unsure whether this is a bug in OpenZFS, a combination of Ganesha + ZFS, or something specific to the build in TrueNAS SCALE; however, I have used ZFS on Proxmox (which is also Linux-based) for a while and never saw anything comparable.
For now I am back on TrueNAS CORE (because a slow virtual NAS is far better than a broken one), but I'd really like to switch to SCALE (network performance there is roughly 10x what I get on FreeBSD).
I did some googling but could not find anything related to the above trace.
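In case a concrete reproduction helps: below is a rough sketch of the kind of write load that seems to trigger it. It just hammers one file on the NFS-mounted dataset with large pwritev() calls (the same syscall visible at the top of the trace). The target path, chunk sizes, and totals are placeholders, not my exact setup.

#!/usr/bin/env python3
# Rough reproduction sketch: issue large positional vectored writes
# (pwritev) against a single file on the NFS-mounted dataset, similar to
# what ganesha.nfsd was doing when the panic fired.
import os

# Placeholder path on the NFS mount backed by the affected pool.
TARGET = "/mnt/nfs-test/stress.bin"

CHUNK = 1024 * 1024           # 1 MiB per iovec buffer
IOVCNT = 8                    # buffers per pwritev() call
TOTAL = 4 * 1024 ** 3         # stop after roughly 4 GiB

# Reuse the same random buffers for every call so the client spends its
# time on I/O rather than data generation.
buffers = [os.urandom(CHUNK) for _ in range(IOVCNT)]

fd = os.open(TARGET, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    offset = 0
    while offset < TOTAL:
        # os.pwritev() maps to the pwritev(2) syscall seen in the call trace.
        written = os.pwritev(fd, buffers, offset)
        offset += written
finally:
    os.close(fd)

Adjust the sizes, or run several copies in parallel, if a single writer does not push enough throughput to hit the high-speed case.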
I hope the above info helps some of the devs. If there are any ideas or questions, feel free to ask.

Sincerely,
Thalhammer
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,545
Thalhammer said: (original post quoted in full above)
We have a bug tracker at jira.ixsystems.com. If you can reproduce the issue on SCALE, I highly recommend filing a proper bug ticket.
 