Networking issues after Cobia upgrade

Joined
Nov 20, 2023
Messages
4
Hello everyone,

I'm new here, so pardon me if I've left out any information; just let me know and I'll share it.
Furthermore, I tried searching the forums but did not find any similar issue.

So to start with a hopefully-brief history, I've been running TrueNAS Scale for almost 2 years now on an old PC, incrementally improving it where needed. I've also been running ~3 VMs, all Ubuntu-based with little to no load (Plex, Ghost, and a monitoring one running Prometheus and Cloudflare Tunnels). Because of that, the system was also configured with a br0 bridge interface for VM access.

Now to the present: since the Cobia upgrade I've been having strange issues that I can't definitively pin on the upgrade itself, yet I have no other solid explanation either. After a few hours of running perfectly fine, SCALE loses all connectivity: I can't access the UI, can't ping it (No route to host), and can't mount NFS shares. Furthermore, SCALE has no metrics for the period from the "crash" until the restart.

The only way to recover is a hard power-off of the whole system.

In trying to fix it, I stopped all VMs and removed the bridge NIC altogether (I wanted to migrate the VMs anyway), hoping the issue was the bridge NIC behaving erratically. However, this did not fix the issue.

I also searched the system logs, the journal, and dmesg, but did not find anything related.
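
One way to locate the moment logging stopped is to look for the longest silence in a journal export that spans the crash and the reboot. The sketch below is only an illustration: it assumes every line starts with a syslog-style `Mon DD HH:MM:SS` stamp with English month names, as in this thread. The helper names `parse_stamp` and `largest_gap` and the `year` default are made up here, since these stamps carry no year.

```python
from datetime import datetime


def parse_stamp(line, year=2023):
    """Parse the leading 'Nov 19 23:00:01' stamp; the year is an assumption."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def largest_gap(lines, year=2023):
    """Return (gap, line_before, line_after) for the longest silence, or None."""
    best = None
    for before, after in zip(lines, lines[1:]):
        gap = parse_stamp(after, year) - parse_stamp(before, year)
        if best is None or gap > best[0]:
            best = (gap, before, after)
    return best


# Three lines taken from the journal excerpt in this thread.
sample = [
    "Nov 20 00:50:00 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...",
    "Nov 20 01:00:01 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...",
    "Nov 20 01:10:01 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...",
]
gap, before, after = largest_gap(sample)
print(gap)  # 0:10:01
```

On an export that includes the first lines after the reboot, the largest gap should bracket the outage, and `before` is the last thing the system managed to log.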

Journalctl logs from the 2 hours before the last crash, which happened at Nov 20 01:12:
Code:
Nov 19 23:00:01 warehouse CRON[16412]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nov 19 23:00:01 warehouse CRON[16413]: (root) CMD (PATH="/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/root/bin" midclt call cloudsync.sync 1 > /dev/null 2> /dev/null)
Nov 19 23:00:01 warehouse CRON[16412]: pam_unix(cron:session): session closed for user root
Nov 19 23:00:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 19 23:00:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 23:00:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 19 23:10:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 19 23:10:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 23:10:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 19 23:15:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 19 23:15:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 19 23:15:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 19 23:15:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 19 23:15:32 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Nov 19 23:15:32 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 37
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 37
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 74 to 75
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 37
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 74 to 75
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 74 to 75
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 37
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 74 to 75
Nov 19 23:15:49 warehouse nscd[16801]: 16801 monitoring file `/etc/hosts` (1)
Nov 19 23:15:49 warehouse nscd[16801]: 16801 monitoring directory `/etc` (2)
Nov 19 23:15:49 warehouse nscd[16801]: 16801 monitoring file `/etc/resolv.conf` (3)
Nov 19 23:15:49 warehouse nscd[16801]: 16801 monitoring directory `/etc` (2)
Nov 19 23:15:49 warehouse nscd[16801]: 16801 monitoring file `/etc/nsswitch.conf` (4)
Nov 19 23:15:49 warehouse nscd[16801]: 16801 monitoring directory `/etc` (2)
Nov 19 23:17:01 warehouse CRON[16860]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nov 19 23:17:01 warehouse CRON[16861]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Nov 19 23:17:01 warehouse CRON[16860]: pam_unix(cron:session): session closed for user root
Nov 19 23:20:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 19 23:20:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 23:20:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 19 23:30:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 19 23:30:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 23:30:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 19 23:40:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 19 23:40:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 23:40:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 19 23:45:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 19 23:45:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 19 23:45:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 19 23:45:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 19 23:45:32 warehouse smartd[4155]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 64
Nov 19 23:45:32 warehouse smartd[4155]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 36
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 64
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 36
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 64
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 36
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 64
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 36
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdf [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 65
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdf [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 35
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdf [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 65
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdf [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 35
Nov 19 23:45:34 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 65
Nov 19 23:45:34 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 35
Nov 19 23:45:34 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 65
Nov 19 23:45:34 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 35
Nov 19 23:50:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 19 23:50:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 23:50:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 00:00:01 warehouse CRON[17882]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nov 20 00:00:01 warehouse CRON[17883]: (root) CMD (PATH="/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/root/bin" midclt call cloudsync.sync 1 > /dev/null 2> /dev/null)
Nov 20 00:00:01 warehouse CRON[17882]: pam_unix(cron:session): session closed for user root
Nov 20 00:00:20 warehouse systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Nov 20 00:00:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 00:00:20 warehouse systemd[1]: Starting logrotate.service - Rotate log files...
Nov 20 00:00:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 00:00:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 00:00:20 warehouse systemd[1]: logrotate.service: Deactivated successfully.
Nov 20 00:00:20 warehouse systemd[1]: Finished logrotate.service - Rotate log files.
Nov 20 00:00:20 warehouse systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Nov 20 00:00:20 warehouse systemd[1]: Finished dpkg-db-backup.service - Daily dpkg database backup service.
Nov 20 00:07:20 warehouse systemd[1]: Starting sysstat-summary.service - Generate a daily summary of process accounting...
Nov 20 00:07:21 warehouse systemd[1]: sysstat-summary.service: Deactivated successfully.
Nov 20 00:07:21 warehouse systemd[1]: Finished sysstat-summary.service - Generate a daily summary of process accounting.
Nov 20 00:10:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 00:10:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 00:10:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 00:15:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 20 00:15:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 20 00:15:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 20 00:15:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 20 00:15:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 65
Nov 20 00:15:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 35
Nov 20 00:15:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 65
Nov 20 00:15:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 35
Nov 20 00:15:51 warehouse nscd[18427]: 18427 monitoring file `/etc/hosts` (1)
Nov 20 00:15:51 warehouse nscd[18427]: 18427 monitoring directory `/etc` (2)
Nov 20 00:15:51 warehouse nscd[18427]: 18427 monitoring file `/etc/resolv.conf` (3)
Nov 20 00:15:51 warehouse nscd[18427]: 18427 monitoring directory `/etc` (2)
Nov 20 00:15:51 warehouse nscd[18427]: 18427 monitoring file `/etc/nsswitch.conf` (4)
Nov 20 00:15:51 warehouse nscd[18427]: 18427 monitoring directory `/etc` (2)
Nov 20 00:17:01 warehouse CRON[18450]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nov 20 00:17:01 warehouse CRON[18451]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Nov 20 00:17:01 warehouse CRON[18450]: pam_unix(cron:session): session closed for user root
Nov 20 00:20:00 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 00:20:00 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 00:20:00 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 00:30:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 00:30:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 00:30:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 00:40:00 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 00:40:00 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 00:40:00 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 00:45:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 20 00:45:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 20 00:45:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 20 00:45:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 20 00:47:42 warehouse systemd[1]: Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...
Nov 20 00:47:42 warehouse systemd[1]: fstrim.service: Deactivated successfully.
Nov 20 00:47:42 warehouse systemd[1]: Finished fstrim.service - Discard unused blocks on filesystems from /etc/fstab.
Nov 20 00:50:00 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 00:50:00 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 00:50:00 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 01:00:01 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 01:00:01 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 01:00:01 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 01:00:01 warehouse CRON[19574]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nov 20 01:00:01 warehouse CRON[19575]: (root) CMD (PATH="/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/root/bin" midclt call cloudsync.sync 1 > /dev/null 2> /dev/null)
Nov 20 01:00:02 warehouse CRON[19574]: pam_unix(cron:session): session closed for user root
Nov 20 01:10:01 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 01:10:01 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 01:10:01 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.


1700484761279.png


Also, I believe you can ignore the /dev/sdg disk errors, as they've been there for about 2 months now, give or take. I plan to either swap that single disk or replace the whole vdev with a larger-capacity one, but haven't decided yet which approach to take.
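
Since most of the excerpt above is routine noise (the known /dev/sdg warnings, sysstat runs, cron sessions), a small filter can make whatever is left easier to spot. This is a rough sketch, not a recommended tool: the pattern list and the name `drop_noise` are assumptions chosen to match the lines in this thread and should be adjusted to taste.

```python
import re

# Assumed "routine" patterns, based only on the journal excerpt in this thread.
NOISE_PATTERNS = [
    re.compile(r"sysstat-collect\.service"),
    re.compile(r"smartd\[\d+\]: Device: /dev/sdg "),
    re.compile(r"pam_unix\(cron:session\)"),
]


def drop_noise(lines, patterns=NOISE_PATTERNS):
    """Return only the lines that match none of the noise patterns."""
    return [ln for ln in lines if not any(p.search(ln) for p in patterns)]


# Three lines taken from the journal excerpt in this thread.
sample = [
    "Nov 20 00:45:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors",
    "Nov 20 00:47:42 warehouse systemd[1]: Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...",
    "Nov 20 00:50:00 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...",
]
remaining = drop_noise(sample)
for line in remaining:
    print(line)
```

Only the fstrim line survives in this sample; on a full export the same filter would leave the less frequent events to read through by hand.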

I'd appreciate any ideas you have that I can try.
Thanks!
 
Joined
Nov 20, 2023
Messages
4
Posting here instead of editing the post, as it might make this easier for other people to find if they need it.

I think I managed to solve it by disabling Service Announcement completely:
  • NetBIOS-NS
  • mDNS
  • WS-Discovery

With all of the above set to false, I've had an uptime of almost 5 days now, which is an improvement over the couple of hours I had for the past week. We'll see how that behaves.

Therefore, for anyone hitting similar freezes, there's no harm in trying this.
Furthermore, if any maintainers get to read this, let me know if you want me to share any more information and what it should be; I'll be glad to provide it if it helps the development of TrueNAS.

Cheers!
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
FWIW, I run with `mDNS, WS-DISCOVERY` on and SCALE 23.10.0.1 is stable.
 
Joined
Nov 20, 2023
Messages
4
Hey @Yorick,

It seems that was not the issue after all, yet, weirdly enough, I had around an entire week with no issue whatsoever.

On the latest crash, with the debug kernel, I'm seeing the following logs though:
Code:
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu:     0-...!: (13 GPs behind) idle=6468/0/0x0 softirq=4096457/4096458 fqs=0 (false positive?)
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu:     3-...!: (4 ticks this GP) idle=5900/0/0x0 softirq=4704918/4704920 fqs=0 (false positive?)
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu:     6-...!: (0 ticks this GP) idle=10b0/0/0x0 softirq=3498093/3498093 fqs=0 (false positive?)
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu:     8-...!: (0 ticks this GP) idle=14d0/0/0x0 softirq=4116752/4116752 fqs=0 (false positive?)
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu:     11-...!: (13 GPs behind) idle=c6b0/0/0x0 softirq=5149632/5149632 fqs=0 (false positive?)
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu:     14-...!: (0 ticks this GP) idle=94f8/0/0x0 softirq=3335767/3335767 fqs=0 (false positive?)
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:     (detected by 5, t=5268 jiffies, g=34057997, q=217 ncpus=16)
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: Sending NMI from CPU 5 to CPUs 0:
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: Sending NMI from CPU 5 to CPUs 3:
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: Sending NMI from CPU 5 to CPUs 6:
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: Sending NMI from CPU 5 to CPUs 8:
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: Sending NMI from CPU 5 to CPUs 11:
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: Sending NMI from CPU 5 to CPUs 14:
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu: rcu_preempt kthread timer wakeup didn't happen for 20159 jiffies! g34057997 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu:     Possible timer handling issue on cpu=14 timer-softirq=1085698
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu: rcu_preempt kthread starved for 20165 jiffies! g34057997 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=14
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu: RCU grace-period kthread stack dump:
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: task:rcu_preempt     state:I stack:0     pid:16    ppid:2      flags:0x00004000
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: Call Trace:
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:  <TASK>
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:  __schedule+0x351/0xa20
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:  ? rcu_gp_cleanup+0x480/0x480
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:  schedule+0x5d/0xe0
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:  schedule_timeout+0x94/0x150
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:  rcu_gp_fqs_loop+0x141/0x4c0
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:  rcu_gp_kthread+0xd0/0x190
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:  kthread+0xe9/0x110
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:  ? kthread_complete_and_exit+0x20/0x20
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:  ret_from_fork+0x22/0x30
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel:  </TASK>
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu: Stack dump where RCU GP kthread last ran:
Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: Sending NMI from CPU 5 to CPUs 14:
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: rcu:     2-....: (5247 ticks this GP) idle=c4a4/1/0x4000000000000000 softirq=4419267/4419267 fqs=1050
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:     (t=5250 jiffies g=34058001 q=18544 ncpus=16)
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: CPU: 2 PID: 124 Comm: khugepaged Tainted: P           OE      6.1.55-debug+truenas #2
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 PC MATE(MS-7A34), BIOS A.N5 07/18/2022
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: RIP: 0010:smp_call_function_many_cond+0xee/0x2f0
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: Code: 63 d0 e8 75 df 3c 00 3b 05 2f f1 ac 01 73 25 48 63 d0 49 8b 36 48 03 34 d5 c0 da 41 82 8b 56 08 83 e2 01 74 0a f3 90 8b 4e 08 <83> e1 01 75 f6 83 c0 01 eb c1 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: RSP: 0018:ffffc900005e3c08 EFLAGS: 00000202
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000011
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: RDX: 0000000000000001 RSI: ffff8893ee8385e0 RDI: ffff88811038dd08
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffff88811038dbd8
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: R10: ffff88894c8e2000 R11: 0000000000000000 R12: 0000000000000002
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: R13: 0000000000000001 R14: ffff8893ee8b2080 R15: ffff8893eebc0000
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: FS:  0000000000000000(0000) GS:ffff8893ee880000(0000) knlGS:0000000000000000
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: CR2: 00007f5ee80016d8 CR3: 0000000163594000 CR4: 00000000003506e0
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: Call Trace:
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  <IRQ>
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? rcu_dump_cpu_stacks+0xc8/0x100
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? rcu_sched_clock_irq.cold+0x69/0x2fb
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? sched_slice+0x87/0x140
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? kvm_emulate_wbinvd_noskip.part.0+0xa0/0xa0 [kvm]
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? raw_notifier_call_chain+0x44/0x60
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? timekeeping_update+0xdd/0x130
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? account_process_tick+0xd2/0x170
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? update_process_times+0x77/0xb0
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? tick_sched_handle+0x22/0x60
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? tick_sched_timer+0x6f/0x80
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? tick_sched_do_timer+0xa0/0xa0
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? __hrtimer_run_queues+0x112/0x2b0
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? hrtimer_interrupt+0xfe/0x220
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? __sysvec_apic_timer_interrupt+0x7f/0x170
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? sysvec_apic_timer_interrupt+0x99/0xc0
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  </IRQ>
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  <TASK>
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? smp_call_function_many_cond+0xee/0x2f0
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? smp_call_function_many_cond+0xcb/0x2f0
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? mm_take_all_locks+0x210/0x210
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? mm_take_all_locks+0x210/0x210
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  smp_call_function+0x39/0x70
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  collapse_huge_page+0x5ba/0x1470
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? psi_group_change+0x145/0x360
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  hpage_collapse_scan_pmd+0x5b7/0x7f0
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  khugepaged+0x4fc/0x970
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? collapse_pte_mapped_thp+0x5d0/0x5d0
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  kthread+0xe9/0x110
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ? kthread_complete_and_exit+0x20/0x20
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  ret_from_fork+0x22/0x30
Nov 30 18:13:27 warehouse.stefanmuraru.local kernel:  </TASK>
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: rcu:     2-....: (21001 ticks this GP) idle=c4a4/1/0x4000000000000000 softirq=4419267/4419267 fqs=4203
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:     (t=21070 jiffies g=34058001 q=19282 ncpus=16)
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: CPU: 2 PID: 124 Comm: khugepaged Tainted: P           OE      6.1.55-debug+truenas #2
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 PC MATE(MS-7A34), BIOS A.N5 07/18/2022
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: RIP: 0010:smp_call_function_many_cond+0xeb/0x2f0
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: Code: 7e 08 48 63 d0 e8 75 df 3c 00 3b 05 2f f1 ac 01 73 25 48 63 d0 49 8b 36 48 03 34 d5 c0 da 41 82 8b 56 08 83 e2 01 74 0a f3 90 <8b> 4e 08 83 e1 01 75 f6 83 c0 01 eb c1 48 83 c4 30 5b 5d 41 5c 41
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: RSP: 0018:ffffc900005e3c08 EFLAGS: 00000202
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: RDX: 0000000000000001 RSI: ffff8893ee8385e0 RDI: ffff88811038dd08
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffff88811038dbd8
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: R10: ffff88894c8e2000 R11: 0000000000000000 R12: 0000000000000002
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: R13: 0000000000000001 R14: ffff8893ee8b2080 R15: ffff8893eebc0000
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: FS:  0000000000000000(0000) GS:ffff8893ee880000(0000) knlGS:0000000000000000
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: CR2: 00007f5ee80016d8 CR3: 0000000163594000 CR4: 00000000003506e0
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel: Call Trace:
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  <IRQ>
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? rcu_dump_cpu_stacks+0xc8/0x100
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? rcu_sched_clock_irq.cold+0x69/0x2fb
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? sched_slice+0x87/0x140
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? perf_event_task_tick+0x64/0x370
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? nohz_balance_exit_idle+0x16/0xc0
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? account_process_tick+0xd2/0x170
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? update_process_times+0x77/0xb0
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? tick_sched_handle+0x22/0x60
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? tick_sched_timer+0x6f/0x80
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? tick_sched_do_timer+0xa0/0xa0
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? __hrtimer_run_queues+0x112/0x2b0
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? hrtimer_interrupt+0xfe/0x220
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? __sysvec_apic_timer_interrupt+0x7f/0x170
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? sysvec_apic_timer_interrupt+0x99/0xc0
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  </IRQ>
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  <TASK>
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? smp_call_function_many_cond+0xeb/0x2f0
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? smp_call_function_many_cond+0xcb/0x2f0
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? mm_take_all_locks+0x210/0x210
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? mm_take_all_locks+0x210/0x210
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  smp_call_function+0x39/0x70
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  collapse_huge_page+0x5ba/0x1470
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? psi_group_change+0x145/0x360
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  hpage_collapse_scan_pmd+0x5b7/0x7f0
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  khugepaged+0x4fc/0x970
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? collapse_pte_mapped_thp+0x5d0/0x5d0
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  kthread+0xe9/0x110
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ? kthread_complete_and_exit+0x20/0x20
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  ret_from_fork+0x22/0x30
Nov 30 18:14:30 warehouse.stefanmuraru.local kernel:  </TASK>
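
For anyone comparing traces like the one above across several crashes, it can help to boil each one down to which CPUs were reported as stalled and which task was running on the self-detected CPU. The sketch below is a hedged illustration: the two regular expressions are written against the line shapes in this thread only, and the name `summarize_stalls` is hypothetical.

```python
import re

# "rcu:     0-...!: (13 GPs behind)" -> CPU number of a per-CPU stall line.
STALLED_CPU = re.compile(r"rcu:\s+(\d+)-\S+: \(")
# "CPU: 2 PID: 124 Comm: khugepaged Tainted: ..." -> CPU, PID, task name.
RUNNING_TASK = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")


def summarize_stalls(lines):
    """Return (sorted stalled CPU numbers, sorted (cpu, pid, comm) tuples)."""
    cpus = set()
    tasks = set()
    for line in lines:
        match = STALLED_CPU.search(line)
        if match:
            cpus.add(int(match.group(1)))
        match = RUNNING_TASK.search(line)
        if match:
            tasks.add((int(match.group(1)), int(match.group(2)), match.group(3)))
    return sorted(cpus), sorted(tasks)


# Four lines taken from the kernel log in this thread.
sample = [
    "Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu:     0-...!: (13 GPs behind) idle=6468/0/0x0 softirq=4096457/4096458 fqs=0 (false positive?)",
    "Nov 30 18:13:05 warehouse.stefanmuraru.local kernel: rcu:     11-...!: (13 GPs behind) idle=c6b0/0/0x0 softirq=5149632/5149632 fqs=0 (false positive?)",
    "Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: rcu:     2-....: (5247 ticks this GP) idle=c4a4/1/0x4000000000000000 softirq=4419267/4419267 fqs=1050",
    "Nov 30 18:13:27 warehouse.stefanmuraru.local kernel: CPU: 2 PID: 124 Comm: khugepaged Tainted: P           OE      6.1.55-debug+truenas #2",
]
cpus, tasks = summarize_stalls(sample)
print(cpus)   # [0, 2, 11]
print(tasks)  # [(2, 124, 'khugepaged')]
```

If the same task name (here `khugepaged`) shows up in every crash, that is at least a hint about where to look; it does not by itself prove the cause.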


To me this is starting to look like a kernel issue, and tbh I would personally try downgrading to an older kernel version, in the hope that the cause is a recent kernel change that does not behave well with my hardware.
Does anyone know whether this is possible?

I am not ruling out the CPU or other hardware starting to fail, yet I seriously doubt it at this point and don't want to start blindly throwing money at it before I've established that it is actually the problem.

Thanks!
 

gegtor

Explorer
Joined
Sep 16, 2017
Messages
99
Hello, my system on TrueNAS-SCALE-23.10.0.1 recently also started having similar issues, where it crashes after a few days of uptime.
This system never had issues in the past, and the crashes appeared after the update from SCALE 22.

I will run a memtest on this system and follow up soon.

Crash screen:
Screenshot 2023-12-15 at 17.18.30.png
 
Joined
Nov 20, 2023
Messages
4
Hey @gegtor,

I, for one, managed to solve it by reinstalling the whole TrueNAS Scale host and restoring from a backup config file. I can't say it will work for you, but it's worth a shot.

For now I'm at 6 days and counting; hope I won't jinx it. It's pretty weird though; you can't blame this sort of issue on anyone.
 

gegtor

Explorer
Joined
Sep 16, 2017
Messages
99
The memtest result was good, no faults.
Today this system hung again, this time without the RCU error; it was just stuck on the menu screen.

I will try to reinstall the OS, but it's weird.
 

erlend_oyen

Dabbler
Joined
Sep 5, 2023
Messages
17
Similar issue with 23.x:
rcu_preempt detected stalls on CPUs/tasks

Running with an AMD Ryzen 5 PRO 4650GE and ECC memory; no errors on memtest.

i have now disabled every auto OC feature, power saving, and c-states on the MB to see if it helps, also adjusted the memory speed from 3200 to 2666 to

I am starting to wonder if it's crashing when one or more Docker applications are starting.
 

gegtor

Explorer
Joined
Sep 16, 2017
Messages
99
I upgraded to SCALE-23.10.1 today; maybe it will help.

My system never had such issues in the past and started doing this after the upgrade to Cobia, so something definitely went wrong there.
 

gegtor

Explorer
Joined
Sep 16, 2017
Messages
99
Maybe something related to AMD broke when SCALE switched over to Linux kernel 6.x.
I never had any issues while I was on SCALE 22.x.x.
 

erlend_oyen

Dabbler
Joined
Sep 5, 2023
Messages
17
After testing x number of BIOS settings related to the CPU, I have now been able to boot the system without crashing or any exception messages during boot.
Core config is now set to 1+1, i.e. 2 cores; "great" performance.
On ASRock it's named Core Control: TWO (1+1).
The other options, FOUR (2+2) and Auto, do not work.


I experimented with:
  • SMT disabled/enabled
  • IOMMU disabled/enabled
  • all forms of auto OC
  • setting all clock/timing settings manually
but this did not help.

I have also disabled suspend to RAM (S3), deep sleep, PCIe device power on,
RTC alarm power on, by OS,
and global C-state control.


The only setting that seems to be an effective "fix" is the core config:
Core Control TWO (1+1)

I have ordered a newer, non-PRO Ryzen CPU to see if that changes anything.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Interesting. Is that specific to your APU / mobile Ryzen, @erlend_oyen ?
 