Investigating system hangs on nfsd

osprey

Cadet
Joined
Dec 19, 2023
Messages
1
I'm experiencing a strange behaviour on one of my Truenas system.
The hardware is an HP DL380 Gen9 server, E5-2640v3 CPUs, 32GB ecc memory, 12x 14TB SAS disks plus two 240GB sata boot disks. The sas controller is an HP P840 configured in hba mode (suboptimal, but the driver is quite solid on linux environment)
The system is running TrueNAS-SCALE-22.12.4.2, Pool usage is at 69% and memory is not exausted.

Data disks are configured in a single pool with a single raidz3 vdev. The system exports the datasets as NFS shares.

The problem is that after some days of activity, the system hangs in a strange way. Looking at the logs I see some kernel rants about tasks blocked, usually related to nfsd or txg_sync, until the system stops to respond on web gui, ssh and of course nfs. The system is still pingable on the network, and if I plug a keyboard on usb port, the kernel reacts allocating a device (I see the kernel messages on console), but no input is accepted even on tty.

In the system logs nothing strange is reported except the hung tasks, and I have no kernel dumps in /data/crash

Here is the extract from kernel messages about the kernel hung tasks

Code:
Dec 11 06:24:30 XXXX.local kernel: INFO: task nfsd:8592 blocked for more than 120 seconds.
Dec 11 06:24:30 XXXX.local kernel:       Tainted: P           OE     5.15.131+truenas #1
Dec 11 06:24:30 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 11 06:24:30 XXXX.local kernel: task:nfsd            state:D stack:    0 pid: 8592 ppid:     2 flags:0x00004000
Dec 11 06:24:30 XXXX.local kernel: Call Trace:
Dec 11 06:24:30 XXXX.local kernel:  <TASK>
Dec 11 06:24:30 XXXX.local kernel:  __schedule+0x2f0/0x950
Dec 11 06:24:30 XXXX.local kernel:  schedule+0x5b/0xd0
Dec 11 06:24:30 XXXX.local kernel:  rwsem_down_write_slowpath+0x197/0x4e0
Dec 11 06:24:30 XXXX.local kernel:  reconnect_path+0xa9/0x2f0
Dec 11 06:24:30 XXXX.local kernel:  ? zpl_fh_to_dentry+0xcf/0xf0 [zfs]
Dec 11 06:24:30 XXXX.local kernel:  ? nfsd_proc_read+0x240/0x240 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  exportfs_decode_fh_raw+0xf6/0x300
Dec 11 06:24:30 XXXX.local kernel:  ? exp_find_key+0xc0/0x1e0 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  ? kmem_cache_free+0x267/0x290
Dec 11 06:24:30 XXXX.local kernel:  ? tcp_recvmsg_locked+0x203/0x970
Dec 11 06:24:30 XXXX.local kernel:  ? __kmalloc+0x159/0x400
Dec 11 06:24:30 XXXX.local kernel:  ? call_rcu+0xd8/0x6d0
Dec 11 06:24:30 XXXX.local kernel:  ? tomoyo_check_task_acl+0x40/0x40
Dec 11 06:24:30 XXXX.local kernel:  ? security_prepare_creds+0x47/0xa0
Dec 11 06:24:30 XXXX.local kernel:  fh_verify+0x390/0x6c0 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  nfsd_create+0x55/0x1a0 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  nfsd3_proc_mkdir+0xc9/0x160 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  nfsd_dispatch+0x153/0x240 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  svc_process_common+0x3cf/0x6c0 [sunrpc]
Dec 11 06:24:30 XXXX.local kernel:  ? nfsd_svc+0x350/0x350 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  svc_process+0xb7/0xf0 [sunrpc]
Dec 11 06:24:30 XXXX.local kernel:  nfsd+0xe8/0x140 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  ? nfsd_shutdown_threads+0x80/0x80 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  kthread+0x127/0x150
Dec 11 06:24:30 XXXX.local kernel:  ? set_kthread_struct+0x50/0x50
Dec 11 06:24:30 XXXX.local kernel:  ret_from_fork+0x22/0x30
Dec 11 06:24:30 XXXX.local kernel:  </TASK>
Dec 11 06:24:30 XXXX.local kernel: INFO: task nfsd:8593 blocked for more than 120 seconds.
Dec 11 06:24:30 XXXX.local kernel:       Tainted: P           OE     5.15.131+truenas #1
Dec 11 06:24:30 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 11 06:24:30 XXXX.local kernel: task:nfsd            state:D stack:    0 pid: 8593 ppid:     2 flags:0x00004000
Dec 11 06:24:30 XXXX.local kernel: Call Trace:
Dec 11 06:24:30 XXXX.local kernel:  <TASK>
Dec 11 06:24:30 XXXX.local kernel:  __schedule+0x2f0/0x950
Dec 11 06:24:30 XXXX.local kernel:  ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs]
Dec 11 06:24:30 XXXX.local kernel:  schedule+0x5b/0xd0
Dec 11 06:24:30 XXXX.local kernel:  schedule_timeout+0x88/0x140
Dec 11 06:24:30 XXXX.local kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Dec 11 06:24:30 XXXX.local kernel:  io_schedule_timeout+0x4c/0x80
Dec 11 06:24:30 XXXX.local kernel:  __cv_timedwait_common+0x128/0x160 [spl]
Dec 11 06:24:30 XXXX.local kernel:  ? finish_wait+0x90/0x90
Dec 11 06:24:30 XXXX.local kernel:  __cv_timedwait_io+0x15/0x20 [spl]
Dec 11 06:24:30 XXXX.local kernel:  zio_wait+0x109/0x220 [zfs]
Dec 11 06:24:30 XXXX.local kernel:  dmu_buf_hold_by_dnode+0x5f/0x90 [zfs]
Dec 11 06:24:30 XXXX.local kernel:  zap_get_leaf_byblk+0x6a/0x130 [zfs]
Dec 11 06:24:30 XXXX.local kernel:  zap_deref_leaf+0x92/0xe0 [zfs]
Dec 11 06:24:30 XXXX.local kernel:  fzap_cursor_retrieve+0x10f/0x270 [zfs]
Dec 11 06:24:30 XXXX.local kernel:  zap_cursor_retrieve+0x6a/0x2b0 [zfs]
Dec 11 06:24:30 XXXX.local kernel:  ? dmu_prefetch+0x18a/0x210 [zfs]
Dec 11 06:24:30 XXXX.local kernel:  zfs_readdir+0x12a/0x430 [zfs]
Dec 11 06:24:30 XXXX.local kernel:  ? aa_file_perm+0x114/0x4d0
Dec 11 06:24:30 XXXX.local kernel:  zpl_iterate+0x56/0x90 [zfs]
Dec 11 06:24:30 XXXX.local kernel:  iterate_dir+0x17c/0x1c0
Dec 11 06:24:30 XXXX.local kernel:  get_name+0x13b/0x1b0
Dec 11 06:24:30 XXXX.local kernel:  ? get_name+0x1b0/0x1b0
Dec 11 06:24:30 XXXX.local kernel:  reconnect_path+0x28c/0x2f0
Dec 11 06:24:30 XXXX.local kernel:  ? nfsd_proc_read+0x240/0x240 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  exportfs_decode_fh_raw+0xf6/0x300
Dec 11 06:24:30 XXXX.local kernel:  ? exp_find_key+0xc0/0x1e0 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  ? select_task_rq_fair+0x157/0x1220
Dec 11 06:24:30 XXXX.local kernel:  ? __kmalloc+0x159/0x400
Dec 11 06:24:30 XXXX.local kernel:  ? call_rcu+0xd8/0x6d0
Dec 11 06:24:30 XXXX.local kernel:  ? tomoyo_check_task_acl+0x40/0x40
Dec 11 06:24:30 XXXX.local kernel:  ? security_prepare_creds+0x47/0xa0
Dec 11 06:24:30 XXXX.local kernel:  fh_verify+0x390/0x6c0 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  ? svc_xprt_do_enqueue+0x11b/0x1b0 [sunrpc]
Dec 11 06:24:30 XXXX.local kernel:  nfsd_access+0x2b/0x130 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  nfsd3_proc_access+0x64/0xb0 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  nfsd_dispatch+0x153/0x240 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  svc_process_common+0x3cf/0x6c0 [sunrpc]
Dec 11 06:24:30 XXXX.local kernel:  ? nfsd_svc+0x350/0x350 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  svc_process+0xb7/0xf0 [sunrpc]
Dec 11 06:24:30 XXXX.local kernel:  nfsd+0xe8/0x140 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  ? nfsd_shutdown_threads+0x80/0x80 [nfsd]
Dec 11 06:24:30 XXXX.local kernel:  kthread+0x127/0x150
Dec 11 06:24:30 XXXX.local kernel:  ? set_kthread_struct+0x50/0x50
Dec 11 06:24:30 XXXX.local kernel:  ret_from_fork+0x22/0x30
Dec 11 06:24:30 XXXX.local kernel:  </TASK>
Dec 13 04:13:03 XXXX.local kernel: perf: interrupt took too long (2523 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
Dec 13 10:01:07 XXXX.local kernel: perf: interrupt took too long (3180 > 3153), lowering kernel.perf_event_max_sample_rate to 62750
Dec 13 22:07:06 XXXX.local kernel: perf: interrupt took too long (3994 > 3975), lowering kernel.perf_event_max_sample_rate to 50000
Dec 14 12:08:37 XXXX.local kernel: INFO: task txg_sync:4559 blocked for more than 120 seconds.
Dec 14 12:08:37 XXXX.local kernel: INFO: task txg_sync:4559 blocked for more than 120 seconds.
Dec 14 12:08:37 XXXX.local kernel:       Tainted: P           OE     5.15.131+truenas #1
Dec 14 12:08:37 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 14 12:08:37 XXXX.local kernel: task:txg_sync        state:D stack:    0 pid: 4559 ppid:     2 flags:0x00004000
Dec 14 12:08:37 XXXX.local kernel: Call Trace:
Dec 14 12:08:37 XXXX.local kernel:  <TASK>
Dec 14 12:08:37 XXXX.local kernel:  __schedule+0x2f0/0x950
Dec 14 12:08:37 XXXX.local kernel:  ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  schedule+0x5b/0xd0
Dec 14 12:08:37 XXXX.local kernel:  schedule_timeout+0x88/0x140
Dec 14 12:08:37 XXXX.local kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Dec 14 12:08:37 XXXX.local kernel:  io_schedule_timeout+0x4c/0x80
Dec 14 12:08:37 XXXX.local kernel:  __cv_timedwait_common+0x128/0x160 [spl]
Dec 14 12:08:37 XXXX.local kernel:  ? finish_wait+0x90/0x90
Dec 14 12:08:37 XXXX.local kernel:  __cv_timedwait_io+0x15/0x20 [spl]
Dec 14 12:08:37 XXXX.local kernel:  zio_wait+0x109/0x220 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  dbuf_hold_impl+0x459/0x640 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  dbuf_hold_level+0x2b/0x60 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  dmu_buf_hold_array_by_dnode+0x107/0x530 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  dmu_buf_hold_array.constprop.0+0x63/0xb0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  dmu_write+0x6a/0xd0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  ? dmu_buf_will_dirty_impl+0xed/0x100 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  space_map_write_intro_debug+0xab/0xe0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  space_map_write_impl+0x49/0x240 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  ? dbuf_find_dirty_eq+0x9/0x30 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  space_map_write+0x9a/0x190 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  metaslab_flush+0xed/0x330 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  ? spa_cleanup_old_sm_logs+0xb4/0x170 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  spa_flush_metaslabs+0x150/0x200 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  ? __cond_resched+0x16/0x50
Dec 14 12:08:37 XXXX.local kernel:  spa_sync_iterate_to_convergence+0x140/0x1e0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  spa_sync+0x2eb/0x5d0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  txg_sync_thread+0x229/0x2a0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  thread_generic_wrapper+0x59/0x70 [spl]
Dec 14 12:08:37 XXXX.local kernel:  ? __thread_exit+0x20/0x20 [spl]
Dec 14 12:08:37 XXXX.local kernel:  kthread+0x127/0x150
Dec 14 12:08:37 XXXX.local kernel:  ? set_kthread_struct+0x50/0x50
Dec 14 12:08:37 XXXX.local kernel:  ret_from_fork+0x22/0x30
Dec 14 12:08:37 XXXX.local kernel:  </TASK>
Dec 14 12:08:37 XXXX.local kernel: INFO: task nfsd:8592 blocked for more than 121 seconds.
Dec 14 12:08:37 XXXX.local kernel:       Tainted: P           OE     5.15.131+truenas #1
Dec 14 12:08:37 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 14 12:08:37 XXXX.local kernel: task:nfsd            state:D stack:    0 pid: 8592 ppid:     2 flags:0x00004000
Dec 14 12:08:37 XXXX.local kernel: Call Trace:
Dec 14 12:08:37 XXXX.local kernel:  <TASK>
Dec 14 12:08:37 XXXX.local kernel:  __schedule+0x2f0/0x950
Dec 14 12:08:37 XXXX.local kernel:  ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  schedule+0x5b/0xd0
Dec 14 12:08:37 XXXX.local kernel:  schedule_timeout+0x88/0x140
Dec 14 12:08:37 XXXX.local kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Dec 14 12:08:37 XXXX.local kernel:  io_schedule_timeout+0x4c/0x80
Dec 14 12:08:37 XXXX.local kernel:  __cv_timedwait_common+0x128/0x160 [spl]
Dec 14 12:08:37 XXXX.local kernel:  ? finish_wait+0x90/0x90
Dec 14 12:08:37 XXXX.local kernel:  __cv_timedwait_io+0x15/0x20 [spl]
Dec 14 12:08:37 XXXX.local kernel:  zio_wait+0x109/0x220 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  dmu_buf_hold+0x5f/0x90 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zap_lockdir+0x4e/0xc0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zap_lookup_norm+0x59/0xd0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zap_lookup+0x12/0x20 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zfs_match_find.constprop.0+0xf0/0x100 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zfs_dirent_lock+0x427/0x5a0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zfs_dirlook+0x88/0x2b0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zfs_lookup+0x24d/0x400 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zpl_lookup+0xc6/0x260 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  ? fill_pre_wcc+0xa7/0x1c0 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  __lookup_slow+0x88/0x150
Dec 14 12:08:37 XXXX.local kernel:  lookup_one_len+0x84/0x90
Dec 14 12:08:37 XXXX.local kernel:  nfsd_lookup_dentry+0xdb/0x450 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  ? svc_xprt_do_enqueue+0x11b/0x1b0 [sunrpc]
Dec 14 12:08:37 XXXX.local kernel:  nfsd_lookup+0x8b/0x150 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  ? nfsd_cache_lookup+0x585/0x680 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  nfsd3_proc_lookup+0xaf/0x110 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  nfsd_dispatch+0x153/0x240 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  svc_process_common+0x3cf/0x6c0 [sunrpc]
Dec 14 12:08:37 XXXX.local kernel:  ? nfsd_svc+0x350/0x350 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  svc_process+0xb7/0xf0 [sunrpc]
Dec 14 12:08:37 XXXX.local kernel:  nfsd+0xe8/0x140 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  ? nfsd_shutdown_threads+0x80/0x80 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  kthread+0x127/0x150
Dec 14 12:08:37 XXXX.local kernel:  ? set_kthread_struct+0x50/0x50
Dec 14 12:08:37 XXXX.local kernel:  ret_from_fork+0x22/0x30
Dec 14 12:08:37 XXXX.local kernel:  </TASK>
Dec 14 12:08:37 XXXX.local kernel: INFO: task nfsd:8593 blocked for more than 121 seconds.
Dec 14 12:08:37 XXXX.local kernel:       Tainted: P           OE     5.15.131+truenas #1
Dec 14 12:08:37 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 14 12:08:37 XXXX.local kernel: task:nfsd            state:D stack:    0 pid: 8593 ppid:     2 flags:0x00004000
Dec 14 12:08:37 XXXX.local kernel: Call Trace:
Dec 14 12:08:37 XXXX.local kernel:  <TASK>
Dec 14 12:08:37 XXXX.local kernel:  __schedule+0x2f0/0x950
Dec 14 12:08:37 XXXX.local kernel:  ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  schedule+0x5b/0xd0
Dec 14 12:08:37 XXXX.local kernel:  schedule_timeout+0x88/0x140
Dec 14 12:08:37 XXXX.local kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Dec 14 12:08:37 XXXX.local kernel:  io_schedule_timeout+0x4c/0x80
Dec 14 12:08:37 XXXX.local kernel:  __cv_timedwait_common+0x128/0x160 [spl]
Dec 14 12:08:37 XXXX.local kernel:  ? finish_wait+0x90/0x90
Dec 14 12:08:37 XXXX.local kernel:  __cv_timedwait_io+0x15/0x20 [spl]
Dec 14 12:08:37 XXXX.local kernel:  zio_wait+0x109/0x220 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  dmu_buf_hold_by_dnode+0x5f/0x90 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zap_get_leaf_byblk+0x6a/0x130 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zap_deref_leaf+0x92/0xe0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  fzap_lookup+0x8a/0x130 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  ? zap_hashbits+0xa/0x20 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  ? zap_hash+0x2a/0xd0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  ? zap_name_init_str+0x9f/0x170 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zap_lookup_impl+0x78/0x2c0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  ? zap_lockdir+0x90/0xc0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  ? dbuf_read+0x34a/0x580 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zap_lookup_norm+0x83/0xd0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zap_lookup+0x12/0x20 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zfs_match_find.constprop.0+0xf0/0x100 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zfs_dirent_lock+0x427/0x5a0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zfs_dirlook+0x88/0x2b0 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zfs_lookup+0x24d/0x400 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  zpl_lookup+0xc6/0x260 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  ? fill_pre_wcc+0xa7/0x1c0 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  __lookup_slow+0x88/0x150
Dec 14 12:08:37 XXXX.local kernel:  lookup_one_len+0x84/0x90
Dec 14 12:08:37 XXXX.local kernel:  nfsd_lookup_dentry+0xdb/0x450 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  ? svc_xprt_do_enqueue+0x11b/0x1b0 [sunrpc]
Dec 14 12:08:37 XXXX.local kernel:  nfsd_lookup+0x8b/0x150 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  ? nfsd_cache_lookup+0x585/0x680 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  nfsd3_proc_lookup+0xaf/0x110 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  nfsd_dispatch+0x153/0x240 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  svc_process_common+0x3cf/0x6c0 [sunrpc]
Dec 14 12:08:37 XXXX.local kernel:  ? nfsd_svc+0x350/0x350 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  svc_process+0xb7/0xf0 [sunrpc]
Dec 14 12:08:37 XXXX.local kernel:  nfsd+0xe8/0x140 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  ? nfsd_shutdown_threads+0x80/0x80 [nfsd]
Dec 14 12:08:37 XXXX.local kernel:  kthread+0x127/0x150
Dec 14 12:08:37 XXXX.local kernel:  ? set_kthread_struct+0x50/0x50
Dec 14 12:08:37 XXXX.local kernel:  ret_from_fork+0x22/0x30
Dec 14 12:08:37 XXXX.local kernel:  </TASK>
Dec 14 12:08:37 XXXX.local kernel: INFO: task rrdcached:9686 blocked for more than 121 seconds.
Dec 14 12:08:37 XXXX.local kernel:       Tainted: P           OE     5.15.131+truenas #1
Dec 14 12:08:37 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 14 12:08:37 XXXX.local kernel: task:rrdcached       state:D stack:    0 pid: 9686 ppid:     1 flags:0x00000000
Dec 14 12:08:37 XXXX.local kernel: Call Trace:
Dec 14 12:08:37 XXXX.local kernel:  <TASK>
Dec 14 12:08:37 XXXX.local kernel:  __schedule+0x2f0/0x950
Dec 14 12:08:37 XXXX.local kernel:  ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  schedule+0x5b/0xd0
Dec 14 12:08:37 XXXX.local kernel:  schedule_timeout+0x88/0x140
Dec 14 12:08:37 XXXX.local kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Dec 14 12:08:37 XXXX.local kernel:  io_schedule_timeout+0x4c/0x80
Dec 14 12:08:37 XXXX.local kernel:  __cv_timedwait_common+0x128/0x160 [spl]
Dec 14 12:08:37 XXXX.local kernel:  ? finish_wait+0x90/0x90
Dec 14 12:08:37 XXXX.local kernel:  __cv_timedwait_io+0x15/0x20 [spl]
Dec 14 12:08:37 XXXX.local kernel:  zio_wait+0x109/0x220 [zfs]
Dec 14 12:08:37 XXXX.local kernel:  dbuf_hold_impl+0x459/0x640 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dbuf_hold_level+0x2b/0x60 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_buf_hold_array_by_dnode+0x107/0x530 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_read_impl+0xa8/0x180 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_read+0x56/0xa0 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zfs_fillpage+0x7a/0x180 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zfs_getpage+0x50/0xf0 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zpl_readpage_common+0x29/0x50 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  filemap_read_page+0x36/0xf0
Dec 14 12:08:38 XXXX.local kernel:  filemap_fault+0x98d/0xa80
Dec 14 12:08:38 XXXX.local kernel:  __do_fault+0x36/0x120
Dec 14 12:08:38 XXXX.local kernel:  __handle_mm_fault+0xce7/0x1580
Dec 14 12:08:38 XXXX.local kernel:  handle_mm_fault+0xcf/0x2b0
Dec 14 12:08:38 XXXX.local kernel:  do_user_addr_fault+0x1be/0x670
Dec 14 12:08:38 XXXX.local kernel:  exc_page_fault+0x72/0x150
Dec 14 12:08:38 XXXX.local kernel:  asm_exc_page_fault+0x22/0x30
Dec 14 12:08:38 XXXX.local kernel: RIP: 0033:0x7fc1d5aa253d
Dec 14 12:08:38 XXXX.local kernel: RSP: 002b:00007fc1d1d9c8e0 EFLAGS: 00010212
Dec 14 12:08:38 XXXX.local kernel: RAX: 0000000000000000 RBX: 0000000000000082 RCX: 00007fc1d583e6e7
Dec 14 12:08:38 XXXX.local kernel: RDX: 0000000000000001 RSI: 00000000000244a8 RDI: 00007fc1ce54d000
Dec 14 12:08:38 XXXX.local kernel: RBP: 00007fc1d1d9cb50 R08: 0000000000000008 R09: 0000000000000000
Dec 14 12:08:38 XXXX.local kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007fc1c0001f90
Dec 14 12:08:38 XXXX.local kernel: R13: 00007fc1c0001d20 R14: 00007fc1ce54d000 R15: 0000000000000000
Dec 14 12:08:38 XXXX.local kernel:  </TASK>
Dec 14 12:08:38 XXXX.local kernel: INFO: task rrdcached:9687 blocked for more than 121 seconds.
Dec 14 12:08:38 XXXX.local kernel:       Tainted: P           OE     5.15.131+truenas #1
Dec 14 12:08:38 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 14 12:08:38 XXXX.local kernel: task:rrdcached       state:D stack:    0 pid: 9687 ppid:     1 flags:0x00000000
Dec 14 12:08:38 XXXX.local kernel: Call Trace:
Dec 14 12:08:38 XXXX.local kernel:  <TASK>
Dec 14 12:08:38 XXXX.local kernel:  __schedule+0x2f0/0x950
Dec 14 12:08:38 XXXX.local kernel:  ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  schedule+0x5b/0xd0
Dec 14 12:08:38 XXXX.local kernel:  schedule_timeout+0x88/0x140
Dec 14 12:08:38 XXXX.local kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Dec 14 12:08:38 XXXX.local kernel:  io_schedule_timeout+0x4c/0x80
Dec 14 12:08:38 XXXX.local kernel:  __cv_timedwait_common+0x128/0x160 [spl]
Dec 14 12:08:38 XXXX.local kernel:  ? finish_wait+0x90/0x90
Dec 14 12:08:38 XXXX.local kernel:  __cv_timedwait_io+0x15/0x20 [spl]
Dec 14 12:08:38 XXXX.local kernel:  zio_wait+0x109/0x220 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dbuf_hold_impl+0x459/0x640 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dbuf_hold_level+0x2b/0x60 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_buf_hold_array_by_dnode+0x107/0x530 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_read_impl+0xa8/0x180 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_read+0x56/0xa0 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zfs_fillpage+0x7a/0x180 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zfs_getpage+0x50/0xf0 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zpl_readpage_common+0x29/0x50 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  filemap_read_page+0x36/0xf0
Dec 14 12:08:38 XXXX.local kernel:  filemap_fault+0x98d/0xa80
Dec 14 12:08:38 XXXX.local kernel:  __do_fault+0x36/0x120
Dec 14 12:08:38 XXXX.local kernel:  __handle_mm_fault+0xce7/0x1580
Dec 14 12:08:38 XXXX.local kernel:  handle_mm_fault+0xcf/0x2b0
Dec 14 12:08:38 XXXX.local kernel:  do_user_addr_fault+0x1be/0x670
Dec 14 12:08:38 XXXX.local kernel:  exc_page_fault+0x72/0x150
Dec 14 12:08:38 XXXX.local kernel:  asm_exc_page_fault+0x22/0x30
Dec 14 12:08:38 XXXX.local kernel: RIP: 0033:0x7fc1d5aa253d
Dec 14 12:08:38 XXXX.local kernel: RSP: 002b:00007fc1d159b8e0 EFLAGS: 00010212
Dec 14 12:08:38 XXXX.local kernel: RAX: 0000000000000000 RBX: 0000000000000082 RCX: 00007fc1d583e6e7
Dec 14 12:08:38 XXXX.local kernel: RDX: 0000000000000001 RSI: 00000000000244a8 RDI: 00007fc1ce528000
Dec 14 12:08:38 XXXX.local kernel: RBP: 00007fc1d159bb50 R08: 0000000000000009 R09: 0000000000000000
Dec 14 12:08:38 XXXX.local kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007fc1c4001ff0
Dec 14 12:08:38 XXXX.local kernel: R13: 00007fc1c4001d40 R14: 00007fc1ce528000 R15: 0000000000000000
Dec 14 12:08:38 XXXX.local kernel:  </TASK>
Dec 14 12:08:38 XXXX.local kernel: INFO: task rrdcached:9688 blocked for more than 122 seconds.
Dec 14 12:08:38 XXXX.local kernel:       Tainted: P           OE     5.15.131+truenas #1
Dec 14 12:08:38 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 14 12:08:38 XXXX.local kernel: task:rrdcached       state:D stack:    0 pid: 9688 ppid:     1 flags:0x00000000
Dec 14 12:08:38 XXXX.local kernel: Call Trace:
Dec 14 12:08:38 XXXX.local kernel:  <TASK>
Dec 14 12:08:38 XXXX.local kernel:  __schedule+0x2f0/0x950
Dec 14 12:08:38 XXXX.local kernel:  ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  schedule+0x5b/0xd0
Dec 14 12:08:38 XXXX.local kernel:  schedule_timeout+0x88/0x140
Dec 14 12:08:38 XXXX.local kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Dec 14 12:08:38 XXXX.local kernel:  io_schedule_timeout+0x4c/0x80
Dec 14 12:08:38 XXXX.local kernel:  __cv_timedwait_common+0x128/0x160 [spl]
Dec 14 12:08:38 XXXX.local kernel:  ? finish_wait+0x90/0x90
Dec 14 12:08:38 XXXX.local kernel:  __cv_timedwait_io+0x15/0x20 [spl]
Dec 14 12:08:38 XXXX.local kernel:  zio_wait+0x109/0x220 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dbuf_hold_impl+0x459/0x640 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dbuf_hold_level+0x2b/0x60 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_buf_hold_array_by_dnode+0x107/0x530 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_read_impl+0xa8/0x180 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_read+0x56/0xa0 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zfs_fillpage+0x7a/0x180 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_read+0x56/0xa0 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zfs_fillpage+0x7a/0x180 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zfs_getpage+0x50/0xf0 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zpl_readpage_common+0x29/0x50 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  filemap_read_page+0x36/0xf0
Dec 14 12:08:38 XXXX.local kernel:  filemap_fault+0x98d/0xa80
Dec 14 12:08:38 XXXX.local kernel:  __do_fault+0x36/0x120
Dec 14 12:08:38 XXXX.local kernel:  __handle_mm_fault+0xce7/0x1580
Dec 14 12:08:38 XXXX.local kernel:  handle_mm_fault+0xcf/0x2b0
Dec 14 12:08:38 XXXX.local kernel:  do_user_addr_fault+0x1be/0x670
Dec 14 12:08:38 XXXX.local kernel:  exc_page_fault+0x72/0x150
Dec 14 12:08:38 XXXX.local kernel:  asm_exc_page_fault+0x22/0x30
Dec 14 12:08:38 XXXX.local kernel: RIP: 0033:0x7fc1d5aa253d
Dec 14 12:08:38 XXXX.local kernel: RSP: 002b:00007fc1d0d9a8e0 EFLAGS: 00010212
Dec 14 12:08:38 XXXX.local kernel: RAX: 0000000000000000 RBX: 0000000000000082 RCX: 00007fc1d583e6e7
Dec 14 12:08:38 XXXX.local kernel: RDX: 0000000000000001 RSI: 00000000000244a8 RDI: 00007fc1ce572000
Dec 14 12:08:38 XXXX.local kernel: RBP: 00007fc1d0d9ab50 R08: 0000000000000007 R09: 0000000000000000
Dec 14 12:08:38 XXXX.local kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007fc1b8002140
Dec 14 12:08:38 XXXX.local kernel: R13: 00007fc1b8001f40 R14: 00007fc1ce572000 R15: 0000000000000000
Dec 14 12:08:38 XXXX.local kernel:  </TASK>
Dec 14 12:08:38 XXXX.local kernel: INFO: task rrdcached:9689 blocked for more than 122 seconds.
Dec 14 12:08:38 XXXX.local kernel:       Tainted: P           OE     5.15.131+truenas #1
Dec 14 12:08:38 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 14 12:08:38 XXXX.local kernel: task:rrdcached       state:D stack:    0 pid: 9689 ppid:     1 flags:0x00000000
Dec 14 12:08:38 XXXX.local kernel: Call Trace:
Dec 14 12:08:38 XXXX.local kernel:  <TASK>
Dec 14 12:08:38 XXXX.local kernel:  __schedule+0x2f0/0x950
Dec 14 12:08:38 XXXX.local kernel:  ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  schedule+0x5b/0xd0
Dec 14 12:08:38 XXXX.local kernel:  schedule_timeout+0x88/0x140
Dec 14 12:08:38 XXXX.local kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Dec 14 12:08:38 XXXX.local kernel:  io_schedule_timeout+0x4c/0x80
Dec 14 12:08:38 XXXX.local kernel:  __cv_timedwait_common+0x128/0x160 [spl]
Dec 14 12:08:38 XXXX.local kernel:  ? finish_wait+0x90/0x90
Dec 14 12:08:38 XXXX.local kernel:  __cv_timedwait_io+0x15/0x20 [spl]
Dec 14 12:08:38 XXXX.local kernel:  zio_wait+0x109/0x220 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dbuf_hold_impl+0x459/0x640 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dbuf_hold_level+0x2b/0x60 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_buf_hold_array_by_dnode+0x107/0x530 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_read_impl+0xa8/0x180 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  dmu_read+0x56/0xa0 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zfs_fillpage+0x7a/0x180 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zfs_getpage+0x50/0xf0 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  zpl_readpage_common+0x29/0x50 [zfs]
Dec 14 12:08:38 XXXX.local kernel:  filemap_read_page+0x36/0xf0
Dec 14 12:08:38 XXXX.local kernel:  filemap_fault+0x98d/0xa80
Dec 14 12:08:38 XXXX.local kernel:  __do_fault+0x36/0x120
Dec 14 12:08:38 XXXX.local kernel:  __handle_mm_fault+0xce7/0x1580
Dec 14 12:08:38 XXXX.local kernel:  handle_mm_fault+0xcf/0x2b0
Dec 14 12:08:38 XXXX.local kernel:  do_user_addr_fault+0x1be/0x670
Dec 14 12:08:38 XXXX.local kernel:  exc_page_fault+0x72/0x150
Dec 14 12:08:38 XXXX.local kernel:  asm_exc_page_fault+0x22/0x30
Dec 14 12:08:38 XXXX.local kernel: RIP: 0033:0x7fc1d5aa253d
Dec 14 12:08:38 XXXX.local kernel: RSP: 002b:00007fc1d05998e0 EFLAGS: 00010212
Dec 14 12:08:38 XXXX.local kernel: RAX: 0000000000000000 RBX: 0000000000000082 RCX: 00007fc1d583e6e7
Dec 14 12:08:38 XXXX.local kernel: RDX: 0000000000000001 RSI: 00000000000244a8 RDI: 00007fc1ce503000
Dec 14 12:08:38 XXXX.local kernel: RBP: 00007fc1d0599b50 R08: 0000000000000006 R09: 0000000000000000
Dec 14 12:08:38 XXXX.local kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007fc1bc0020e0
Dec 14 12:08:38 XXXX.local kernel: R13: 00007fc1bc001de0 R14: 00007fc1ce503000 R15: 0000000000000000
Dec 14 12:08:38 XXXX.local kernel:  </TASK>
Dec 14 14:53:31 XXXX.local kernel: perf: interrupt took too long (4999 > 4992), lowering kernel.perf_event_max_sample_rate to 40000
Dec 16 03:25:30 XXXX.local kernel: perf: interrupt took too long (6250 > 6248), lowering kernel.perf_event_max_sample_rate to 32000
Dec 17 04:13:07 XXXX.local kernel: INFO: task txg_sync:4559 blocked for more than 120 seconds.
Dec 17 04:13:07 XXXX.local kernel:       Tainted: P           OE     5.15.131+truenas #1
Dec 17 04:13:07 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 17 04:13:07 XXXX.local kernel: task:txg_sync        state:D stack:    0 pid: 4559 ppid:     2 flags:0x00004000
Dec 17 04:13:07 XXXX.local kernel: Call Trace:
Dec 17 04:13:07 XXXX.local kernel:  <TASK>
Dec 17 04:13:07 XXXX.local kernel:  __schedule+0x2f0/0x950
Dec 17 04:13:07 XXXX.local kernel:  ? get_swap_device+0xc0/0xe0
Dec 17 04:13:07 XXXX.local kernel:  schedule+0x5b/0xd0
Dec 17 04:13:07 XXXX.local kernel:  schedule_timeout+0x88/0x140
Dec 17 04:13:07 XXXX.local kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Dec 17 04:13:07 XXXX.local kernel:  io_schedule_timeout+0x4c/0x80
Dec 17 04:13:07 XXXX.local kernel:  __cv_timedwait_common+0x128/0x160 [spl]
Dec 17 04:13:07 XXXX.local kernel:  ? finish_wait+0x90/0x90
Dec 17 04:13:07 XXXX.local kernel:  __cv_timedwait_io+0x15/0x20 [spl]
Dec 17 04:13:07 XXXX.local kernel:  zio_wait+0x109/0x220 [zfs]
Dec 17 04:13:07 XXXX.local kernel:  dsl_scan_sync+0x4d8/0x980 [zfs]
Dec 17 04:13:07 XXXX.local kernel:  ? kfree+0x1fc/0x250
Dec 17 04:13:07 XXXX.local kernel:  spa_sync_iterate_to_convergence+0x11f/0x1e0 [zfs]
Dec 17 04:13:07 XXXX.local kernel:  spa_sync+0x2eb/0x5d0 [zfs]
Dec 17 04:13:07 XXXX.local kernel:  txg_sync_thread+0x229/0x2a0 [zfs]
Dec 17 04:13:07 XXXX.local kernel:  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
Dec 17 04:13:07 XXXX.local kernel:  thread_generic_wrapper+0x59/0x70 [spl]
Dec 17 04:13:07 XXXX.local kernel:  ? __thread_exit+0x20/0x20 [spl]
Dec 17 04:13:07 XXXX.local kernel:  kthread+0x127/0x150
Dec 17 04:13:07 XXXX.local kernel:  ? set_kthread_struct+0x50/0x50
Dec 17 04:13:07 XXXX.local kernel:  ret_from_fork+0x22/0x30
Dec 17 04:13:07 XXXX.local kernel:  </TASK>
Dec 17 14:10:53 XXXX.local kernel: perf: interrupt took too long (7815 > 7812), lowering kernel.perf_event_max_sample_rate to 25500


The only other entry I've found that may point the finger against the disks subsystem is in /var/log/daemon.log

Code:
Dec 17 17:34:59 XXXX.local zed[4191761]: eid=18 class=deadman pool='Pool0' vdev=75be8983-065a-41a9-afff-0a5696e2832f size=4096 offset=343724634112 priority=4 err=0 flags=0x1808a0 delay=606728354ms bookmark=21009:0:0:138835
Dec 17 22:49:04 XXXX.local zed[2991386]: eid=19 class=deadman pool='Pool0' vdev=efbd3b04-7466-475f-83e3-19d4f299c29d size=4096 offset=12653207072768 priority=4 err=0 flags=0x1808a0 delay=625574278ms bookmark=21009:0:0:119190


I have sporadic errors on collectd, but I don't know if they may be relevant...

Code:
Dec 17 05:13:56 XXXX.local collectd[11253]: Traceback (most recent call last):
  File "/usr/local/lib/collectd_pyplugins/cputemp.py", line 21, in read
    with Client() as c:
  File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 326, in __init__
    self._ws.connect()
  File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 129, in connect
    rv = super(WSClient, self).connect()
  File "/usr/lib/python3/dist-packages/ws4py/client/__init__.py", line 222, in connect
    bytes = self.sock.recv(128)
socket.timeout: timed out
Dec 18 10:46:34 XXXX.local collectd[11253]: processes plugin: read_file_contents: Reading file "/proc/2519358/stat" failed.
Dec 18 20:42:54 XXXX.local collectd[11253]: processes plugin: read_file_contents: Reading file "/proc/981836/stat" failed.


The system ceased to be responsive around 22:00 of Dec 18 without spitting out other errors...

One strange fact is that I would expect to find some i/o errors on the tty console, but that does not happens. Also, the kernel seems not completely dead, since if I attach a usb device to the system, it gets detected and the driver attached...

Any hints or suggestion on how to debug this situation would be really appreciated.
 
Top