I'm experiencing a strange behaviour on one of my Truenas system.
The hardware is an HP DL380 Gen9 server, E5-2640v3 CPUs, 32GB ecc memory, 12x 14TB SAS disks plus two 240GB sata boot disks. The sas controller is an HP P840 configured in hba mode (suboptimal, but the driver is quite solid on linux environment)
The system is running TrueNAS-SCALE-22.12.4.2, Pool usage is at 69% and memory is not exausted.
Data disks are configured in a single pool with a single raidz3 vdev. The system exports the datasets as NFS shares.
The problem is that after some days of activity, the system hangs in a strange way. Looking at the logs I see some kernel rants about tasks blocked, usually related to nfsd or txg_sync, until the system stops to respond on web gui, ssh and of course nfs. The system is still pingable on the network, and if I plug a keyboard on usb port, the kernel reacts allocating a device (I see the kernel messages on console), but no input is accepted even on tty.
In the system logs nothing strange is reported except the hung tasks, and I have no kernel dumps in /data/crash
Here is the extract from kernel messages about the kernel hung tasks
The only other entry I've found that may point the finger against the disks subsystem is in /var/log/daemon.log
I have sporadic errors on collectd, but I don't know if they may be relevant...
The system ceased to be responsive around 22:00 of Dec 18 without spitting out other errors...
One strange fact is that I would expect to find some i/o errors on the tty console, but that does not happens. Also, the kernel seems not completely dead, since if I attach a usb device to the system, it gets detected and the driver attached...
Any hints or suggestion on how to debug this situation would be really appreciated.
The hardware is an HP DL380 Gen9 server, E5-2640v3 CPUs, 32GB ecc memory, 12x 14TB SAS disks plus two 240GB sata boot disks. The sas controller is an HP P840 configured in hba mode (suboptimal, but the driver is quite solid on linux environment)
The system is running TrueNAS-SCALE-22.12.4.2, Pool usage is at 69% and memory is not exausted.
Data disks are configured in a single pool with a single raidz3 vdev. The system exports the datasets as NFS shares.
The problem is that after some days of activity, the system hangs in a strange way. Looking at the logs I see some kernel rants about tasks blocked, usually related to nfsd or txg_sync, until the system stops to respond on web gui, ssh and of course nfs. The system is still pingable on the network, and if I plug a keyboard on usb port, the kernel reacts allocating a device (I see the kernel messages on console), but no input is accepted even on tty.
In the system logs nothing strange is reported except the hung tasks, and I have no kernel dumps in /data/crash
Here is the extract from kernel messages about the kernel hung tasks
Code:
Dec 11 06:24:30 XXXX.local kernel: INFO: task nfsd:8592 blocked for more than 120 seconds. Dec 11 06:24:30 XXXX.local kernel: Tainted: P OE 5.15.131+truenas #1 Dec 11 06:24:30 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 11 06:24:30 XXXX.local kernel: task:nfsd state:D stack: 0 pid: 8592 ppid: 2 flags:0x00004000 Dec 11 06:24:30 XXXX.local kernel: Call Trace: Dec 11 06:24:30 XXXX.local kernel: <TASK> Dec 11 06:24:30 XXXX.local kernel: __schedule+0x2f0/0x950 Dec 11 06:24:30 XXXX.local kernel: schedule+0x5b/0xd0 Dec 11 06:24:30 XXXX.local kernel: rwsem_down_write_slowpath+0x197/0x4e0 Dec 11 06:24:30 XXXX.local kernel: reconnect_path+0xa9/0x2f0 Dec 11 06:24:30 XXXX.local kernel: ? zpl_fh_to_dentry+0xcf/0xf0 [zfs] Dec 11 06:24:30 XXXX.local kernel: ? nfsd_proc_read+0x240/0x240 [nfsd] Dec 11 06:24:30 XXXX.local kernel: exportfs_decode_fh_raw+0xf6/0x300 Dec 11 06:24:30 XXXX.local kernel: ? exp_find_key+0xc0/0x1e0 [nfsd] Dec 11 06:24:30 XXXX.local kernel: ? kmem_cache_free+0x267/0x290 Dec 11 06:24:30 XXXX.local kernel: ? tcp_recvmsg_locked+0x203/0x970 Dec 11 06:24:30 XXXX.local kernel: ? __kmalloc+0x159/0x400 Dec 11 06:24:30 XXXX.local kernel: ? call_rcu+0xd8/0x6d0 Dec 11 06:24:30 XXXX.local kernel: ? tomoyo_check_task_acl+0x40/0x40 Dec 11 06:24:30 XXXX.local kernel: ? security_prepare_creds+0x47/0xa0 Dec 11 06:24:30 XXXX.local kernel: fh_verify+0x390/0x6c0 [nfsd] Dec 11 06:24:30 XXXX.local kernel: nfsd_create+0x55/0x1a0 [nfsd] Dec 11 06:24:30 XXXX.local kernel: nfsd3_proc_mkdir+0xc9/0x160 [nfsd] Dec 11 06:24:30 XXXX.local kernel: nfsd_dispatch+0x153/0x240 [nfsd] Dec 11 06:24:30 XXXX.local kernel: svc_process_common+0x3cf/0x6c0 [sunrpc] Dec 11 06:24:30 XXXX.local kernel: ? nfsd_svc+0x350/0x350 [nfsd] Dec 11 06:24:30 XXXX.local kernel: svc_process+0xb7/0xf0 [sunrpc] Dec 11 06:24:30 XXXX.local kernel: nfsd+0xe8/0x140 [nfsd] Dec 11 06:24:30 XXXX.local kernel: ? nfsd_shutdown_threads+0x80/0x80 [nfsd] Dec 11 06:24:30 XXXX.local kernel: kthread+0x127/0x150 Dec 11 06:24:30 XXXX.local kernel: ? set_kthread_struct+0x50/0x50 Dec 11 06:24:30 XXXX.local kernel: ret_from_fork+0x22/0x30 Dec 11 06:24:30 XXXX.local kernel: </TASK> Dec 11 06:24:30 XXXX.local kernel: INFO: task nfsd:8593 blocked for more than 120 seconds. Dec 11 06:24:30 XXXX.local kernel: Tainted: P OE 5.15.131+truenas #1 Dec 11 06:24:30 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 11 06:24:30 XXXX.local kernel: task:nfsd state:D stack: 0 pid: 8593 ppid: 2 flags:0x00004000 Dec 11 06:24:30 XXXX.local kernel: Call Trace: Dec 11 06:24:30 XXXX.local kernel: <TASK> Dec 11 06:24:30 XXXX.local kernel: __schedule+0x2f0/0x950 Dec 11 06:24:30 XXXX.local kernel: ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs] Dec 11 06:24:30 XXXX.local kernel: schedule+0x5b/0xd0 Dec 11 06:24:30 XXXX.local kernel: schedule_timeout+0x88/0x140 Dec 11 06:24:30 XXXX.local kernel: ? __bpf_trace_tick_stop+0x10/0x10 Dec 11 06:24:30 XXXX.local kernel: io_schedule_timeout+0x4c/0x80 Dec 11 06:24:30 XXXX.local kernel: __cv_timedwait_common+0x128/0x160 [spl] Dec 11 06:24:30 XXXX.local kernel: ? finish_wait+0x90/0x90 Dec 11 06:24:30 XXXX.local kernel: __cv_timedwait_io+0x15/0x20 [spl] Dec 11 06:24:30 XXXX.local kernel: zio_wait+0x109/0x220 [zfs] Dec 11 06:24:30 XXXX.local kernel: dmu_buf_hold_by_dnode+0x5f/0x90 [zfs] Dec 11 06:24:30 XXXX.local kernel: zap_get_leaf_byblk+0x6a/0x130 [zfs] Dec 11 06:24:30 XXXX.local kernel: zap_deref_leaf+0x92/0xe0 [zfs] Dec 11 06:24:30 XXXX.local kernel: fzap_cursor_retrieve+0x10f/0x270 [zfs] Dec 11 06:24:30 XXXX.local kernel: zap_cursor_retrieve+0x6a/0x2b0 [zfs] Dec 11 06:24:30 XXXX.local kernel: ? dmu_prefetch+0x18a/0x210 [zfs] Dec 11 06:24:30 XXXX.local kernel: zfs_readdir+0x12a/0x430 [zfs] Dec 11 06:24:30 XXXX.local kernel: ? aa_file_perm+0x114/0x4d0 Dec 11 06:24:30 XXXX.local kernel: zpl_iterate+0x56/0x90 [zfs] Dec 11 06:24:30 XXXX.local kernel: iterate_dir+0x17c/0x1c0 Dec 11 06:24:30 XXXX.local kernel: get_name+0x13b/0x1b0 Dec 11 06:24:30 XXXX.local kernel: ? get_name+0x1b0/0x1b0 Dec 11 06:24:30 XXXX.local kernel: reconnect_path+0x28c/0x2f0 Dec 11 06:24:30 XXXX.local kernel: ? nfsd_proc_read+0x240/0x240 [nfsd] Dec 11 06:24:30 XXXX.local kernel: exportfs_decode_fh_raw+0xf6/0x300 Dec 11 06:24:30 XXXX.local kernel: ? exp_find_key+0xc0/0x1e0 [nfsd] Dec 11 06:24:30 XXXX.local kernel: ? select_task_rq_fair+0x157/0x1220 Dec 11 06:24:30 XXXX.local kernel: ? __kmalloc+0x159/0x400 Dec 11 06:24:30 XXXX.local kernel: ? call_rcu+0xd8/0x6d0 Dec 11 06:24:30 XXXX.local kernel: ? tomoyo_check_task_acl+0x40/0x40 Dec 11 06:24:30 XXXX.local kernel: ? security_prepare_creds+0x47/0xa0 Dec 11 06:24:30 XXXX.local kernel: fh_verify+0x390/0x6c0 [nfsd] Dec 11 06:24:30 XXXX.local kernel: ? svc_xprt_do_enqueue+0x11b/0x1b0 [sunrpc] Dec 11 06:24:30 XXXX.local kernel: nfsd_access+0x2b/0x130 [nfsd] Dec 11 06:24:30 XXXX.local kernel: nfsd3_proc_access+0x64/0xb0 [nfsd] Dec 11 06:24:30 XXXX.local kernel: nfsd_dispatch+0x153/0x240 [nfsd] Dec 11 06:24:30 XXXX.local kernel: svc_process_common+0x3cf/0x6c0 [sunrpc] Dec 11 06:24:30 XXXX.local kernel: ? nfsd_svc+0x350/0x350 [nfsd] Dec 11 06:24:30 XXXX.local kernel: svc_process+0xb7/0xf0 [sunrpc] Dec 11 06:24:30 XXXX.local kernel: nfsd+0xe8/0x140 [nfsd] Dec 11 06:24:30 XXXX.local kernel: ? nfsd_shutdown_threads+0x80/0x80 [nfsd] Dec 11 06:24:30 XXXX.local kernel: kthread+0x127/0x150 Dec 11 06:24:30 XXXX.local kernel: ? set_kthread_struct+0x50/0x50 Dec 11 06:24:30 XXXX.local kernel: ret_from_fork+0x22/0x30 Dec 11 06:24:30 XXXX.local kernel: </TASK> Dec 13 04:13:03 XXXX.local kernel: perf: interrupt took too long (2523 > 2500), lowering kernel.perf_event_max_sample_rate to 79250 Dec 13 10:01:07 XXXX.local kernel: perf: interrupt took too long (3180 > 3153), lowering kernel.perf_event_max_sample_rate to 62750 Dec 13 22:07:06 XXXX.local kernel: perf: interrupt took too long (3994 > 3975), lowering kernel.perf_event_max_sample_rate to 50000 Dec 14 12:08:37 XXXX.local kernel: INFO: task txg_sync:4559 blocked for more than 120 seconds. Dec 14 12:08:37 XXXX.local kernel: INFO: task txg_sync:4559 blocked for more than 120 seconds. Dec 14 12:08:37 XXXX.local kernel: Tainted: P OE 5.15.131+truenas #1 Dec 14 12:08:37 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 14 12:08:37 XXXX.local kernel: task:txg_sync state:D stack: 0 pid: 4559 ppid: 2 flags:0x00004000 Dec 14 12:08:37 XXXX.local kernel: Call Trace: Dec 14 12:08:37 XXXX.local kernel: <TASK> Dec 14 12:08:37 XXXX.local kernel: __schedule+0x2f0/0x950 Dec 14 12:08:37 XXXX.local kernel: ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs] Dec 14 12:08:37 XXXX.local kernel: schedule+0x5b/0xd0 Dec 14 12:08:37 XXXX.local kernel: schedule_timeout+0x88/0x140 Dec 14 12:08:37 XXXX.local kernel: ? __bpf_trace_tick_stop+0x10/0x10 Dec 14 12:08:37 XXXX.local kernel: io_schedule_timeout+0x4c/0x80 Dec 14 12:08:37 XXXX.local kernel: __cv_timedwait_common+0x128/0x160 [spl] Dec 14 12:08:37 XXXX.local kernel: ? finish_wait+0x90/0x90 Dec 14 12:08:37 XXXX.local kernel: __cv_timedwait_io+0x15/0x20 [spl] Dec 14 12:08:37 XXXX.local kernel: zio_wait+0x109/0x220 [zfs] Dec 14 12:08:37 XXXX.local kernel: dbuf_hold_impl+0x459/0x640 [zfs] Dec 14 12:08:37 XXXX.local kernel: dbuf_hold_level+0x2b/0x60 [zfs] Dec 14 12:08:37 XXXX.local kernel: dmu_buf_hold_array_by_dnode+0x107/0x530 [zfs] Dec 14 12:08:37 XXXX.local kernel: dmu_buf_hold_array.constprop.0+0x63/0xb0 [zfs] Dec 14 12:08:37 XXXX.local kernel: dmu_write+0x6a/0xd0 [zfs] Dec 14 12:08:37 XXXX.local kernel: ? dmu_buf_will_dirty_impl+0xed/0x100 [zfs] Dec 14 12:08:37 XXXX.local kernel: space_map_write_intro_debug+0xab/0xe0 [zfs] Dec 14 12:08:37 XXXX.local kernel: space_map_write_impl+0x49/0x240 [zfs] Dec 14 12:08:37 XXXX.local kernel: ? dbuf_find_dirty_eq+0x9/0x30 [zfs] Dec 14 12:08:37 XXXX.local kernel: space_map_write+0x9a/0x190 [zfs] Dec 14 12:08:37 XXXX.local kernel: metaslab_flush+0xed/0x330 [zfs] Dec 14 12:08:37 XXXX.local kernel: ? spa_cleanup_old_sm_logs+0xb4/0x170 [zfs] Dec 14 12:08:37 XXXX.local kernel: spa_flush_metaslabs+0x150/0x200 [zfs] Dec 14 12:08:37 XXXX.local kernel: ? __cond_resched+0x16/0x50 Dec 14 12:08:37 XXXX.local kernel: spa_sync_iterate_to_convergence+0x140/0x1e0 [zfs] Dec 14 12:08:37 XXXX.local kernel: spa_sync+0x2eb/0x5d0 [zfs] Dec 14 12:08:37 XXXX.local kernel: txg_sync_thread+0x229/0x2a0 [zfs] Dec 14 12:08:37 XXXX.local kernel: ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] Dec 14 12:08:37 XXXX.local kernel: thread_generic_wrapper+0x59/0x70 [spl] Dec 14 12:08:37 XXXX.local kernel: ? __thread_exit+0x20/0x20 [spl] Dec 14 12:08:37 XXXX.local kernel: kthread+0x127/0x150 Dec 14 12:08:37 XXXX.local kernel: ? set_kthread_struct+0x50/0x50 Dec 14 12:08:37 XXXX.local kernel: ret_from_fork+0x22/0x30 Dec 14 12:08:37 XXXX.local kernel: </TASK> Dec 14 12:08:37 XXXX.local kernel: INFO: task nfsd:8592 blocked for more than 121 seconds. Dec 14 12:08:37 XXXX.local kernel: Tainted: P OE 5.15.131+truenas #1 Dec 14 12:08:37 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 14 12:08:37 XXXX.local kernel: task:nfsd state:D stack: 0 pid: 8592 ppid: 2 flags:0x00004000 Dec 14 12:08:37 XXXX.local kernel: Call Trace: Dec 14 12:08:37 XXXX.local kernel: <TASK> Dec 14 12:08:37 XXXX.local kernel: __schedule+0x2f0/0x950 Dec 14 12:08:37 XXXX.local kernel: ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs] Dec 14 12:08:37 XXXX.local kernel: schedule+0x5b/0xd0 Dec 14 12:08:37 XXXX.local kernel: schedule_timeout+0x88/0x140 Dec 14 12:08:37 XXXX.local kernel: ? __bpf_trace_tick_stop+0x10/0x10 Dec 14 12:08:37 XXXX.local kernel: io_schedule_timeout+0x4c/0x80 Dec 14 12:08:37 XXXX.local kernel: __cv_timedwait_common+0x128/0x160 [spl] Dec 14 12:08:37 XXXX.local kernel: ? finish_wait+0x90/0x90 Dec 14 12:08:37 XXXX.local kernel: __cv_timedwait_io+0x15/0x20 [spl] Dec 14 12:08:37 XXXX.local kernel: zio_wait+0x109/0x220 [zfs] Dec 14 12:08:37 XXXX.local kernel: dmu_buf_hold+0x5f/0x90 [zfs] Dec 14 12:08:37 XXXX.local kernel: zap_lockdir+0x4e/0xc0 [zfs] Dec 14 12:08:37 XXXX.local kernel: zap_lookup_norm+0x59/0xd0 [zfs] Dec 14 12:08:37 XXXX.local kernel: zap_lookup+0x12/0x20 [zfs] Dec 14 12:08:37 XXXX.local kernel: zfs_match_find.constprop.0+0xf0/0x100 [zfs] Dec 14 12:08:37 XXXX.local kernel: zfs_dirent_lock+0x427/0x5a0 [zfs] Dec 14 12:08:37 XXXX.local kernel: zfs_dirlook+0x88/0x2b0 [zfs] Dec 14 12:08:37 XXXX.local kernel: zfs_lookup+0x24d/0x400 [zfs] Dec 14 12:08:37 XXXX.local kernel: zpl_lookup+0xc6/0x260 [zfs] Dec 14 12:08:37 XXXX.local kernel: ? fill_pre_wcc+0xa7/0x1c0 [nfsd] Dec 14 12:08:37 XXXX.local kernel: __lookup_slow+0x88/0x150 Dec 14 12:08:37 XXXX.local kernel: lookup_one_len+0x84/0x90 Dec 14 12:08:37 XXXX.local kernel: nfsd_lookup_dentry+0xdb/0x450 [nfsd] Dec 14 12:08:37 XXXX.local kernel: ? svc_xprt_do_enqueue+0x11b/0x1b0 [sunrpc] Dec 14 12:08:37 XXXX.local kernel: nfsd_lookup+0x8b/0x150 [nfsd] Dec 14 12:08:37 XXXX.local kernel: ? nfsd_cache_lookup+0x585/0x680 [nfsd] Dec 14 12:08:37 XXXX.local kernel: nfsd3_proc_lookup+0xaf/0x110 [nfsd] Dec 14 12:08:37 XXXX.local kernel: nfsd_dispatch+0x153/0x240 [nfsd] Dec 14 12:08:37 XXXX.local kernel: svc_process_common+0x3cf/0x6c0 [sunrpc] Dec 14 12:08:37 XXXX.local kernel: ? nfsd_svc+0x350/0x350 [nfsd] Dec 14 12:08:37 XXXX.local kernel: svc_process+0xb7/0xf0 [sunrpc] Dec 14 12:08:37 XXXX.local kernel: nfsd+0xe8/0x140 [nfsd] Dec 14 12:08:37 XXXX.local kernel: ? nfsd_shutdown_threads+0x80/0x80 [nfsd] Dec 14 12:08:37 XXXX.local kernel: kthread+0x127/0x150 Dec 14 12:08:37 XXXX.local kernel: ? set_kthread_struct+0x50/0x50 Dec 14 12:08:37 XXXX.local kernel: ret_from_fork+0x22/0x30 Dec 14 12:08:37 XXXX.local kernel: </TASK> Dec 14 12:08:37 XXXX.local kernel: INFO: task nfsd:8593 blocked for more than 121 seconds. Dec 14 12:08:37 XXXX.local kernel: Tainted: P OE 5.15.131+truenas #1 Dec 14 12:08:37 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 14 12:08:37 XXXX.local kernel: task:nfsd state:D stack: 0 pid: 8593 ppid: 2 flags:0x00004000 Dec 14 12:08:37 XXXX.local kernel: Call Trace: Dec 14 12:08:37 XXXX.local kernel: <TASK> Dec 14 12:08:37 XXXX.local kernel: __schedule+0x2f0/0x950 Dec 14 12:08:37 XXXX.local kernel: ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs] Dec 14 12:08:37 XXXX.local kernel: schedule+0x5b/0xd0 Dec 14 12:08:37 XXXX.local kernel: schedule_timeout+0x88/0x140 Dec 14 12:08:37 XXXX.local kernel: ? __bpf_trace_tick_stop+0x10/0x10 Dec 14 12:08:37 XXXX.local kernel: io_schedule_timeout+0x4c/0x80 Dec 14 12:08:37 XXXX.local kernel: __cv_timedwait_common+0x128/0x160 [spl] Dec 14 12:08:37 XXXX.local kernel: ? finish_wait+0x90/0x90 Dec 14 12:08:37 XXXX.local kernel: __cv_timedwait_io+0x15/0x20 [spl] Dec 14 12:08:37 XXXX.local kernel: zio_wait+0x109/0x220 [zfs] Dec 14 12:08:37 XXXX.local kernel: dmu_buf_hold_by_dnode+0x5f/0x90 [zfs] Dec 14 12:08:37 XXXX.local kernel: zap_get_leaf_byblk+0x6a/0x130 [zfs] Dec 14 12:08:37 XXXX.local kernel: zap_deref_leaf+0x92/0xe0 [zfs] Dec 14 12:08:37 XXXX.local kernel: fzap_lookup+0x8a/0x130 [zfs] Dec 14 12:08:37 XXXX.local kernel: ? zap_hashbits+0xa/0x20 [zfs] Dec 14 12:08:37 XXXX.local kernel: ? zap_hash+0x2a/0xd0 [zfs] Dec 14 12:08:37 XXXX.local kernel: ? zap_name_init_str+0x9f/0x170 [zfs] Dec 14 12:08:37 XXXX.local kernel: zap_lookup_impl+0x78/0x2c0 [zfs] Dec 14 12:08:37 XXXX.local kernel: ? zap_lockdir+0x90/0xc0 [zfs] Dec 14 12:08:37 XXXX.local kernel: ? dbuf_read+0x34a/0x580 [zfs] Dec 14 12:08:37 XXXX.local kernel: zap_lookup_norm+0x83/0xd0 [zfs] Dec 14 12:08:37 XXXX.local kernel: zap_lookup+0x12/0x20 [zfs] Dec 14 12:08:37 XXXX.local kernel: zfs_match_find.constprop.0+0xf0/0x100 [zfs] Dec 14 12:08:37 XXXX.local kernel: zfs_dirent_lock+0x427/0x5a0 [zfs] Dec 14 12:08:37 XXXX.local kernel: zfs_dirlook+0x88/0x2b0 [zfs] Dec 14 12:08:37 XXXX.local kernel: zfs_lookup+0x24d/0x400 [zfs] Dec 14 12:08:37 XXXX.local kernel: zpl_lookup+0xc6/0x260 [zfs] Dec 14 12:08:37 XXXX.local kernel: ? fill_pre_wcc+0xa7/0x1c0 [nfsd] Dec 14 12:08:37 XXXX.local kernel: __lookup_slow+0x88/0x150 Dec 14 12:08:37 XXXX.local kernel: lookup_one_len+0x84/0x90 Dec 14 12:08:37 XXXX.local kernel: nfsd_lookup_dentry+0xdb/0x450 [nfsd] Dec 14 12:08:37 XXXX.local kernel: ? svc_xprt_do_enqueue+0x11b/0x1b0 [sunrpc] Dec 14 12:08:37 XXXX.local kernel: nfsd_lookup+0x8b/0x150 [nfsd] Dec 14 12:08:37 XXXX.local kernel: ? nfsd_cache_lookup+0x585/0x680 [nfsd] Dec 14 12:08:37 XXXX.local kernel: nfsd3_proc_lookup+0xaf/0x110 [nfsd] Dec 14 12:08:37 XXXX.local kernel: nfsd_dispatch+0x153/0x240 [nfsd] Dec 14 12:08:37 XXXX.local kernel: svc_process_common+0x3cf/0x6c0 [sunrpc] Dec 14 12:08:37 XXXX.local kernel: ? nfsd_svc+0x350/0x350 [nfsd] Dec 14 12:08:37 XXXX.local kernel: svc_process+0xb7/0xf0 [sunrpc] Dec 14 12:08:37 XXXX.local kernel: nfsd+0xe8/0x140 [nfsd] Dec 14 12:08:37 XXXX.local kernel: ? nfsd_shutdown_threads+0x80/0x80 [nfsd] Dec 14 12:08:37 XXXX.local kernel: kthread+0x127/0x150 Dec 14 12:08:37 XXXX.local kernel: ? set_kthread_struct+0x50/0x50 Dec 14 12:08:37 XXXX.local kernel: ret_from_fork+0x22/0x30 Dec 14 12:08:37 XXXX.local kernel: </TASK> Dec 14 12:08:37 XXXX.local kernel: INFO: task rrdcached:9686 blocked for more than 121 seconds. Dec 14 12:08:37 XXXX.local kernel: Tainted: P OE 5.15.131+truenas #1 Dec 14 12:08:37 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 14 12:08:37 XXXX.local kernel: task:rrdcached state:D stack: 0 pid: 9686 ppid: 1 flags:0x00000000 Dec 14 12:08:37 XXXX.local kernel: Call Trace: Dec 14 12:08:37 XXXX.local kernel: <TASK> Dec 14 12:08:37 XXXX.local kernel: __schedule+0x2f0/0x950 Dec 14 12:08:37 XXXX.local kernel: ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs] Dec 14 12:08:37 XXXX.local kernel: schedule+0x5b/0xd0 Dec 14 12:08:37 XXXX.local kernel: schedule_timeout+0x88/0x140 Dec 14 12:08:37 XXXX.local kernel: ? __bpf_trace_tick_stop+0x10/0x10 Dec 14 12:08:37 XXXX.local kernel: io_schedule_timeout+0x4c/0x80 Dec 14 12:08:37 XXXX.local kernel: __cv_timedwait_common+0x128/0x160 [spl] Dec 14 12:08:37 XXXX.local kernel: ? finish_wait+0x90/0x90 Dec 14 12:08:37 XXXX.local kernel: __cv_timedwait_io+0x15/0x20 [spl] Dec 14 12:08:37 XXXX.local kernel: zio_wait+0x109/0x220 [zfs] Dec 14 12:08:37 XXXX.local kernel: dbuf_hold_impl+0x459/0x640 [zfs] Dec 14 12:08:38 XXXX.local kernel: dbuf_hold_level+0x2b/0x60 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_buf_hold_array_by_dnode+0x107/0x530 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_read_impl+0xa8/0x180 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_read+0x56/0xa0 [zfs] Dec 14 12:08:38 XXXX.local kernel: zfs_fillpage+0x7a/0x180 [zfs] Dec 14 12:08:38 XXXX.local kernel: zfs_getpage+0x50/0xf0 [zfs] Dec 14 12:08:38 XXXX.local kernel: zpl_readpage_common+0x29/0x50 [zfs] Dec 14 12:08:38 XXXX.local kernel: filemap_read_page+0x36/0xf0 Dec 14 12:08:38 XXXX.local kernel: filemap_fault+0x98d/0xa80 Dec 14 12:08:38 XXXX.local kernel: __do_fault+0x36/0x120 Dec 14 12:08:38 XXXX.local kernel: __handle_mm_fault+0xce7/0x1580 Dec 14 12:08:38 XXXX.local kernel: handle_mm_fault+0xcf/0x2b0 Dec 14 12:08:38 XXXX.local kernel: do_user_addr_fault+0x1be/0x670 Dec 14 12:08:38 XXXX.local kernel: exc_page_fault+0x72/0x150 Dec 14 12:08:38 XXXX.local kernel: asm_exc_page_fault+0x22/0x30 Dec 14 12:08:38 XXXX.local kernel: RIP: 0033:0x7fc1d5aa253d Dec 14 12:08:38 XXXX.local kernel: RSP: 002b:00007fc1d1d9c8e0 EFLAGS: 00010212 Dec 14 12:08:38 XXXX.local kernel: RAX: 0000000000000000 RBX: 0000000000000082 RCX: 00007fc1d583e6e7 Dec 14 12:08:38 XXXX.local kernel: RDX: 0000000000000001 RSI: 00000000000244a8 RDI: 00007fc1ce54d000 Dec 14 12:08:38 XXXX.local kernel: RBP: 00007fc1d1d9cb50 R08: 0000000000000008 R09: 0000000000000000 Dec 14 12:08:38 XXXX.local kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007fc1c0001f90 Dec 14 12:08:38 XXXX.local kernel: R13: 00007fc1c0001d20 R14: 00007fc1ce54d000 R15: 0000000000000000 Dec 14 12:08:38 XXXX.local kernel: </TASK> Dec 14 12:08:38 XXXX.local kernel: INFO: task rrdcached:9687 blocked for more than 121 seconds. Dec 14 12:08:38 XXXX.local kernel: Tainted: P OE 5.15.131+truenas #1 Dec 14 12:08:38 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 14 12:08:38 XXXX.local kernel: task:rrdcached state:D stack: 0 pid: 9687 ppid: 1 flags:0x00000000 Dec 14 12:08:38 XXXX.local kernel: Call Trace: Dec 14 12:08:38 XXXX.local kernel: <TASK> Dec 14 12:08:38 XXXX.local kernel: __schedule+0x2f0/0x950 Dec 14 12:08:38 XXXX.local kernel: ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs] Dec 14 12:08:38 XXXX.local kernel: schedule+0x5b/0xd0 Dec 14 12:08:38 XXXX.local kernel: schedule_timeout+0x88/0x140 Dec 14 12:08:38 XXXX.local kernel: ? __bpf_trace_tick_stop+0x10/0x10 Dec 14 12:08:38 XXXX.local kernel: io_schedule_timeout+0x4c/0x80 Dec 14 12:08:38 XXXX.local kernel: __cv_timedwait_common+0x128/0x160 [spl] Dec 14 12:08:38 XXXX.local kernel: ? finish_wait+0x90/0x90 Dec 14 12:08:38 XXXX.local kernel: __cv_timedwait_io+0x15/0x20 [spl] Dec 14 12:08:38 XXXX.local kernel: zio_wait+0x109/0x220 [zfs] Dec 14 12:08:38 XXXX.local kernel: dbuf_hold_impl+0x459/0x640 [zfs] Dec 14 12:08:38 XXXX.local kernel: dbuf_hold_level+0x2b/0x60 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_buf_hold_array_by_dnode+0x107/0x530 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_read_impl+0xa8/0x180 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_read+0x56/0xa0 [zfs] Dec 14 12:08:38 XXXX.local kernel: zfs_fillpage+0x7a/0x180 [zfs] Dec 14 12:08:38 XXXX.local kernel: zfs_getpage+0x50/0xf0 [zfs] Dec 14 12:08:38 XXXX.local kernel: zpl_readpage_common+0x29/0x50 [zfs] Dec 14 12:08:38 XXXX.local kernel: filemap_read_page+0x36/0xf0 Dec 14 12:08:38 XXXX.local kernel: filemap_fault+0x98d/0xa80 Dec 14 12:08:38 XXXX.local kernel: __do_fault+0x36/0x120 Dec 14 12:08:38 XXXX.local kernel: __handle_mm_fault+0xce7/0x1580 Dec 14 12:08:38 XXXX.local kernel: handle_mm_fault+0xcf/0x2b0 Dec 14 12:08:38 XXXX.local kernel: do_user_addr_fault+0x1be/0x670 Dec 14 12:08:38 XXXX.local kernel: exc_page_fault+0x72/0x150 Dec 14 12:08:38 XXXX.local kernel: asm_exc_page_fault+0x22/0x30 Dec 14 12:08:38 XXXX.local kernel: RIP: 0033:0x7fc1d5aa253d Dec 14 12:08:38 XXXX.local kernel: RSP: 002b:00007fc1d159b8e0 EFLAGS: 00010212 Dec 14 12:08:38 XXXX.local kernel: RAX: 0000000000000000 RBX: 0000000000000082 RCX: 00007fc1d583e6e7 Dec 14 12:08:38 XXXX.local kernel: RDX: 0000000000000001 RSI: 00000000000244a8 RDI: 00007fc1ce528000 Dec 14 12:08:38 XXXX.local kernel: RBP: 00007fc1d159bb50 R08: 0000000000000009 R09: 0000000000000000 Dec 14 12:08:38 XXXX.local kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007fc1c4001ff0 Dec 14 12:08:38 XXXX.local kernel: R13: 00007fc1c4001d40 R14: 00007fc1ce528000 R15: 0000000000000000 Dec 14 12:08:38 XXXX.local kernel: </TASK> Dec 14 12:08:38 XXXX.local kernel: INFO: task rrdcached:9688 blocked for more than 122 seconds. Dec 14 12:08:38 XXXX.local kernel: Tainted: P OE 5.15.131+truenas #1 Dec 14 12:08:38 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 14 12:08:38 XXXX.local kernel: task:rrdcached state:D stack: 0 pid: 9688 ppid: 1 flags:0x00000000 Dec 14 12:08:38 XXXX.local kernel: Call Trace: Dec 14 12:08:38 XXXX.local kernel: <TASK> Dec 14 12:08:38 XXXX.local kernel: __schedule+0x2f0/0x950 Dec 14 12:08:38 XXXX.local kernel: ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs] Dec 14 12:08:38 XXXX.local kernel: schedule+0x5b/0xd0 Dec 14 12:08:38 XXXX.local kernel: schedule_timeout+0x88/0x140 Dec 14 12:08:38 XXXX.local kernel: ? __bpf_trace_tick_stop+0x10/0x10 Dec 14 12:08:38 XXXX.local kernel: io_schedule_timeout+0x4c/0x80 Dec 14 12:08:38 XXXX.local kernel: __cv_timedwait_common+0x128/0x160 [spl] Dec 14 12:08:38 XXXX.local kernel: ? finish_wait+0x90/0x90 Dec 14 12:08:38 XXXX.local kernel: __cv_timedwait_io+0x15/0x20 [spl] Dec 14 12:08:38 XXXX.local kernel: zio_wait+0x109/0x220 [zfs] Dec 14 12:08:38 XXXX.local kernel: dbuf_hold_impl+0x459/0x640 [zfs] Dec 14 12:08:38 XXXX.local kernel: dbuf_hold_level+0x2b/0x60 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_buf_hold_array_by_dnode+0x107/0x530 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_read_impl+0xa8/0x180 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_read+0x56/0xa0 [zfs] Dec 14 12:08:38 XXXX.local kernel: zfs_fillpage+0x7a/0x180 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_read+0x56/0xa0 [zfs] Dec 14 12:08:38 XXXX.local kernel: zfs_fillpage+0x7a/0x180 [zfs] Dec 14 12:08:38 XXXX.local kernel: zfs_getpage+0x50/0xf0 [zfs] Dec 14 12:08:38 XXXX.local kernel: zpl_readpage_common+0x29/0x50 [zfs] Dec 14 12:08:38 XXXX.local kernel: filemap_read_page+0x36/0xf0 Dec 14 12:08:38 XXXX.local kernel: filemap_fault+0x98d/0xa80 Dec 14 12:08:38 XXXX.local kernel: __do_fault+0x36/0x120 Dec 14 12:08:38 XXXX.local kernel: __handle_mm_fault+0xce7/0x1580 Dec 14 12:08:38 XXXX.local kernel: handle_mm_fault+0xcf/0x2b0 Dec 14 12:08:38 XXXX.local kernel: do_user_addr_fault+0x1be/0x670 Dec 14 12:08:38 XXXX.local kernel: exc_page_fault+0x72/0x150 Dec 14 12:08:38 XXXX.local kernel: asm_exc_page_fault+0x22/0x30 Dec 14 12:08:38 XXXX.local kernel: RIP: 0033:0x7fc1d5aa253d Dec 14 12:08:38 XXXX.local kernel: RSP: 002b:00007fc1d0d9a8e0 EFLAGS: 00010212 Dec 14 12:08:38 XXXX.local kernel: RAX: 0000000000000000 RBX: 0000000000000082 RCX: 00007fc1d583e6e7 Dec 14 12:08:38 XXXX.local kernel: RDX: 0000000000000001 RSI: 00000000000244a8 RDI: 00007fc1ce572000 Dec 14 12:08:38 XXXX.local kernel: RBP: 00007fc1d0d9ab50 R08: 0000000000000007 R09: 0000000000000000 Dec 14 12:08:38 XXXX.local kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007fc1b8002140 Dec 14 12:08:38 XXXX.local kernel: R13: 00007fc1b8001f40 R14: 00007fc1ce572000 R15: 0000000000000000 Dec 14 12:08:38 XXXX.local kernel: </TASK> Dec 14 12:08:38 XXXX.local kernel: INFO: task rrdcached:9689 blocked for more than 122 seconds. Dec 14 12:08:38 XXXX.local kernel: Tainted: P OE 5.15.131+truenas #1 Dec 14 12:08:38 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 14 12:08:38 XXXX.local kernel: task:rrdcached state:D stack: 0 pid: 9689 ppid: 1 flags:0x00000000 Dec 14 12:08:38 XXXX.local kernel: Call Trace: Dec 14 12:08:38 XXXX.local kernel: <TASK> Dec 14 12:08:38 XXXX.local kernel: __schedule+0x2f0/0x950 Dec 14 12:08:38 XXXX.local kernel: ? dbuf_read_impl.constprop.0+0x284/0x380 [zfs] Dec 14 12:08:38 XXXX.local kernel: schedule+0x5b/0xd0 Dec 14 12:08:38 XXXX.local kernel: schedule_timeout+0x88/0x140 Dec 14 12:08:38 XXXX.local kernel: ? __bpf_trace_tick_stop+0x10/0x10 Dec 14 12:08:38 XXXX.local kernel: io_schedule_timeout+0x4c/0x80 Dec 14 12:08:38 XXXX.local kernel: __cv_timedwait_common+0x128/0x160 [spl] Dec 14 12:08:38 XXXX.local kernel: ? finish_wait+0x90/0x90 Dec 14 12:08:38 XXXX.local kernel: __cv_timedwait_io+0x15/0x20 [spl] Dec 14 12:08:38 XXXX.local kernel: zio_wait+0x109/0x220 [zfs] Dec 14 12:08:38 XXXX.local kernel: dbuf_hold_impl+0x459/0x640 [zfs] Dec 14 12:08:38 XXXX.local kernel: dbuf_hold_level+0x2b/0x60 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_buf_hold_array_by_dnode+0x107/0x530 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_read_impl+0xa8/0x180 [zfs] Dec 14 12:08:38 XXXX.local kernel: dmu_read+0x56/0xa0 [zfs] Dec 14 12:08:38 XXXX.local kernel: zfs_fillpage+0x7a/0x180 [zfs] Dec 14 12:08:38 XXXX.local kernel: zfs_getpage+0x50/0xf0 [zfs] Dec 14 12:08:38 XXXX.local kernel: zpl_readpage_common+0x29/0x50 [zfs] Dec 14 12:08:38 XXXX.local kernel: filemap_read_page+0x36/0xf0 Dec 14 12:08:38 XXXX.local kernel: filemap_fault+0x98d/0xa80 Dec 14 12:08:38 XXXX.local kernel: __do_fault+0x36/0x120 Dec 14 12:08:38 XXXX.local kernel: __handle_mm_fault+0xce7/0x1580 Dec 14 12:08:38 XXXX.local kernel: handle_mm_fault+0xcf/0x2b0 Dec 14 12:08:38 XXXX.local kernel: do_user_addr_fault+0x1be/0x670 Dec 14 12:08:38 XXXX.local kernel: exc_page_fault+0x72/0x150 Dec 14 12:08:38 XXXX.local kernel: asm_exc_page_fault+0x22/0x30 Dec 14 12:08:38 XXXX.local kernel: RIP: 0033:0x7fc1d5aa253d Dec 14 12:08:38 XXXX.local kernel: RSP: 002b:00007fc1d05998e0 EFLAGS: 00010212 Dec 14 12:08:38 XXXX.local kernel: RAX: 0000000000000000 RBX: 0000000000000082 RCX: 00007fc1d583e6e7 Dec 14 12:08:38 XXXX.local kernel: RDX: 0000000000000001 RSI: 00000000000244a8 RDI: 00007fc1ce503000 Dec 14 12:08:38 XXXX.local kernel: RBP: 00007fc1d0599b50 R08: 0000000000000006 R09: 0000000000000000 Dec 14 12:08:38 XXXX.local kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007fc1bc0020e0 Dec 14 12:08:38 XXXX.local kernel: R13: 00007fc1bc001de0 R14: 00007fc1ce503000 R15: 0000000000000000 Dec 14 12:08:38 XXXX.local kernel: </TASK> Dec 14 14:53:31 XXXX.local kernel: perf: interrupt took too long (4999 > 4992), lowering kernel.perf_event_max_sample_rate to 40000 Dec 16 03:25:30 XXXX.local kernel: perf: interrupt took too long (6250 > 6248), lowering kernel.perf_event_max_sample_rate to 32000 Dec 17 04:13:07 XXXX.local kernel: INFO: task txg_sync:4559 blocked for more than 120 seconds. Dec 17 04:13:07 XXXX.local kernel: Tainted: P OE 5.15.131+truenas #1 Dec 17 04:13:07 XXXX.local kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 17 04:13:07 XXXX.local kernel: task:txg_sync state:D stack: 0 pid: 4559 ppid: 2 flags:0x00004000 Dec 17 04:13:07 XXXX.local kernel: Call Trace: Dec 17 04:13:07 XXXX.local kernel: <TASK> Dec 17 04:13:07 XXXX.local kernel: __schedule+0x2f0/0x950 Dec 17 04:13:07 XXXX.local kernel: ? get_swap_device+0xc0/0xe0 Dec 17 04:13:07 XXXX.local kernel: schedule+0x5b/0xd0 Dec 17 04:13:07 XXXX.local kernel: schedule_timeout+0x88/0x140 Dec 17 04:13:07 XXXX.local kernel: ? __bpf_trace_tick_stop+0x10/0x10 Dec 17 04:13:07 XXXX.local kernel: io_schedule_timeout+0x4c/0x80 Dec 17 04:13:07 XXXX.local kernel: __cv_timedwait_common+0x128/0x160 [spl] Dec 17 04:13:07 XXXX.local kernel: ? finish_wait+0x90/0x90 Dec 17 04:13:07 XXXX.local kernel: __cv_timedwait_io+0x15/0x20 [spl] Dec 17 04:13:07 XXXX.local kernel: zio_wait+0x109/0x220 [zfs] Dec 17 04:13:07 XXXX.local kernel: dsl_scan_sync+0x4d8/0x980 [zfs] Dec 17 04:13:07 XXXX.local kernel: ? kfree+0x1fc/0x250 Dec 17 04:13:07 XXXX.local kernel: spa_sync_iterate_to_convergence+0x11f/0x1e0 [zfs] Dec 17 04:13:07 XXXX.local kernel: spa_sync+0x2eb/0x5d0 [zfs] Dec 17 04:13:07 XXXX.local kernel: txg_sync_thread+0x229/0x2a0 [zfs] Dec 17 04:13:07 XXXX.local kernel: ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] Dec 17 04:13:07 XXXX.local kernel: thread_generic_wrapper+0x59/0x70 [spl] Dec 17 04:13:07 XXXX.local kernel: ? __thread_exit+0x20/0x20 [spl] Dec 17 04:13:07 XXXX.local kernel: kthread+0x127/0x150 Dec 17 04:13:07 XXXX.local kernel: ? set_kthread_struct+0x50/0x50 Dec 17 04:13:07 XXXX.local kernel: ret_from_fork+0x22/0x30 Dec 17 04:13:07 XXXX.local kernel: </TASK> Dec 17 14:10:53 XXXX.local kernel: perf: interrupt took too long (7815 > 7812), lowering kernel.perf_event_max_sample_rate to 25500
The only other entry I've found that may point the finger against the disks subsystem is in /var/log/daemon.log
Code:
Dec 17 17:34:59 XXXX.local zed[4191761]: eid=18 class=deadman pool='Pool0' vdev=75be8983-065a-41a9-afff-0a5696e2832f size=4096 offset=343724634112 priority=4 err=0 flags=0x1808a0 delay=606728354ms bookmark=21009:0:0:138835 Dec 17 22:49:04 XXXX.local zed[2991386]: eid=19 class=deadman pool='Pool0' vdev=efbd3b04-7466-475f-83e3-19d4f299c29d size=4096 offset=12653207072768 priority=4 err=0 flags=0x1808a0 delay=625574278ms bookmark=21009:0:0:119190
I have sporadic errors on collectd, but I don't know if they may be relevant...
Code:
Dec 17 05:13:56 XXXX.local collectd[11253]: Traceback (most recent call last): File "/usr/local/lib/collectd_pyplugins/cputemp.py", line 21, in read with Client() as c: File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 326, in __init__ self._ws.connect() File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 129, in connect rv = super(WSClient, self).connect() File "/usr/lib/python3/dist-packages/ws4py/client/__init__.py", line 222, in connect bytes = self.sock.recv(128) socket.timeout: timed out Dec 18 10:46:34 XXXX.local collectd[11253]: processes plugin: read_file_contents: Reading file "/proc/2519358/stat" failed. Dec 18 20:42:54 XXXX.local collectd[11253]: processes plugin: read_file_contents: Reading file "/proc/981836/stat" failed.
The system ceased to be responsive around 22:00 of Dec 18 without spitting out other errors...
One strange fact is that I would expect to find some i/o errors on the tty console, but that does not happens. Also, the kernel seems not completely dead, since if I attach a usb device to the system, it gets detected and the driver attached...
Any hints or suggestion on how to debug this situation would be really appreciated.