Debian VM freezes after several hours

wuxia

Dabbler
Joined
Jan 7, 2016
Messages
49
I have a strange case of a Debian installation working for several hours and then becoming unresponsive and freezing. I can't find anything of interest in the Debian logs. The installation runs some Docker apps and has 12GB of RAM available, of which it uses only 5-6GB at most. How can I debug the reason for the freeze? I'm running a FreeBSD VM alongside it that works like a charm, and the same Docker apps ran on the same Debian version non-virtualised with uptimes of months.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,907
Please have a look at the forum rules (linked in red at the top of the web pages) to see what information is required to answer questions like this.
 

wuxia

Dabbler
Joined
Jan 7, 2016
Messages
49
Sorry, I'm running TrueNAS-13.0-U3.1 on X10SDV-4C-TLN2F (Xeon D-1521) with 32GB RAM.

My VM settings:
Virtual CPUs: 1
Cores: 4
Threads: 1
Memory Size: 12.00 GiB
Virtual NIC: Intel e82545 (e1000)
Virtual Disk: AHCI - 50GiB ZVol

My VM has not frozen yet, but I found this in /var/log/messages, which seems relevant:
[ 813.077332] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 813.077731] rcu: 3-...!: (0 ticks this GP) idle=be4/0/0x0 softirq=31861/31861 fqs=0 (false positive?)
[ 813.077768] (detected by 2, t=5252 jiffies, g=59585, q=1518)
[ 813.077771] Sending NMI from CPU 2 to CPUs 3:
[ 813.077803] NMI backtrace for cpu 3 skipped: idling at native_safe_halt+0xe/0x20
[ 813.077943] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 813.077945] clocksource: 'hpet' wd_now: 2c87528c wd_last: 178404e2 mask: ffffffff
[ 813.077946] clocksource: 'tsc' cs_now: 529d486ad81c cs_last: 529284fe2a94 mask: ffffffffffffffff
[ 813.077948] tsc: Marking TSC unstable due to clocksource watchdog
[ 813.078774] rcu: rcu_sched kthread starved for 5252 jiffies! g59585 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=3
[ 813.078791] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 813.078814] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 813.078816] rcu: RCU grace-period kthread stack dump:
[ 813.078854] sched_clock: Marking unstable (812569742337, 497610293)<-(813080613760, -1823243)
[ 813.067441] task:rcu_sched state:I stack: 0 pid: 12 ppid: 2 flags:0x00004000
[ 813.067444] Call Trace:
[ 813.067454] __schedule+0x282/0x880
[ 813.067458] schedule+0x46/0xb0
[ 813.067460] schedule_timeout+0x8b/0x150
[ 813.067464] ? __next_timer_interrupt+0x110/0x110
[ 813.067469] rcu_gp_kthread+0x51b/0xbb0
[ 813.067472] ? rcu_cpu_kthread+0x190/0x190
[ 813.067476] kthread+0x11b/0x140
[ 813.067479] ? __kthread_bind_mask+0x60/0x60
[ 813.067483] ret_from_fork+0x22/0x30
[ 813.092517] clocksource: Checking clocksource tsc synchronization from CPU 3.
[ 813.092575] clocksource: Switched to clocksource hpet
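For what it's worth, the log itself suggests booting with 'tsc=unstable'. A minimal sketch of how that could be tried, assuming a standard Debian GRUB setup (whether it actually addresses the bhyve clock skew is another matter):

# /etc/default/grub -- add tsc=unstable to the kernel command line, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet tsc=unstable"
update-grub
reboot
# after reboot, check which clocksource the guest settled on
cat /sys/devices/system/clocksource/clocksource0/current_clocksource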
 

wuxia

Dabbler
Joined
Jan 7, 2016
Messages
49
The freezes continue while the FreeBSD VM stays stable. This is the last message I found in /var/log/messages:

Jan 2 01:40:10 *** kernel: [31928.266310] (t=8151 jiffies g=2780713 q=3534)
Jan 2 01:40:10 *** kernel: [31928.266519] task:rcu_sched state:I stack: 0 pid: 12 ppid: 2 flags:0x00004000
Jan 2 01:40:10 *** kernel: [31928.266523] Call Trace:
Jan 2 01:40:10 *** kernel: [31928.266559] __schedule+0x282/0x880
Jan 2 01:40:10 *** kernel: [31928.266567] ? __switch_to_asm+0x3a/0x60
Jan 2 01:40:10 *** kernel: [31928.266592] schedule+0x46/0xb0
Jan 2 01:40:10 *** kernel: [31928.266594] schedule_timeout+0x8b/0x150
Jan 2 01:40:10 *** kernel: [31928.266598] ? __next_timer_interrupt+0x110/0x110
Jan 2 01:40:10 *** kernel: [31928.266602] rcu_gp_kthread+0x51b/0xbb0
Jan 2 01:40:10 *** kernel: [31928.266605] ? rcu_cpu_kthread+0x190/0x190
Jan 2 01:40:10 *** kernel: [31928.266609] kthread+0x11b/0x140
Jan 2 01:40:10 *** kernel: [31928.266633] ? __kthread_bind_mask+0x60/0x60
Jan 2 01:40:10 *** kernel: [31928.266635] ret_from_fork+0x22/0x30
Jan 2 01:40:10 *** kernel: [31928.266649] NMI backtrace for cpu 3
Jan 2 01:40:10 *** kernel: [31928.266676] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.10.0-19-amd64 #1 Debian 5.10.149-2
Jan 2 01:40:10 *** kernel: [31928.266677] Hardware name: FreeBSD BHYVE/BHYVE, BIOS 13.0 11/10/2020
Jan 2 01:40:10 *** kernel: [31928.266678] Call Trace:
Jan 2 01:40:10 *** kernel: [31928.266681] <IRQ>
Jan 2 01:40:10 *** kernel: [31928.266687] dump_stack+0x6b/0x83
Jan 2 01:40:10 *** kernel: [31928.266691] nmi_cpu_backtrace.cold+0x32/0x69
Jan 2 01:40:10 *** kernel: [31928.266718] ? lapic_can_unplug_cpu+0x80/0x80
Jan 2 01:40:10 *** kernel: [31928.266729] nmi_trigger_cpumask_backtrace+0xdf/0xf0
Jan 2 01:40:10 *** kernel: [31928.266758] rcu_dump_cpu_stacks+0xa2/0xd4
Jan 2 01:40:10 *** kernel: [31928.266761] rcu_sched_clock_irq.cold+0x1ff/0x3d6
Jan 2 01:40:10 *** kernel: [31928.266764] update_process_times+0x8c/0xc0
Jan 2 01:40:10 *** kernel: [31928.266769] tick_sched_handle+0x22/0x60
Jan 2 01:40:10 *** kernel: [31928.266773] tick_sched_timer+0x80/0xc0
Jan 2 01:40:10 *** kernel: [31928.266775] ? tick_do_update_jiffies64.part.0+0xc0/0xc0
Jan 2 01:40:10 *** kernel: [31928.266799] __hrtimer_run_queues+0x12a/0x280
Jan 2 01:40:10 *** kernel: [31928.266802] hrtimer_interrupt+0x110/0x2c0
Jan 2 01:40:10 *** kernel: [31928.266805] __sysvec_apic_timer_interrupt+0x5f/0xe0
Jan 2 01:40:10 *** kernel: [31928.266810] asm_call_irq_on_stack+0x12/0x20
Jan 2 01:40:10 *** kernel: [31928.266812] </IRQ>
Jan 2 01:40:10 *** kernel: [31928.266816] sysvec_apic_timer_interrupt+0x72/0x80
Jan 2 01:40:10 *** kernel: [31928.266840] asm_sysvec_apic_timer_interrupt+0x12/0x20
Jan 2 01:40:10 *** kernel: [31928.266844] RIP: 0010:native_safe_halt+0xe/0x20
Jan 2 01:40:10 *** kernel: [31928.266846] Code: 00 f0 80 48 02 20 48 8b 00 a8 08 75 c0 e9 77 ff ff ff cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 0f 00 2d a6 39 51 00 fb f4 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00
Jan 2 01:40:10 *** kernel: [31928.266848] RSP: 0018:ffffbc074009fef0 EFLAGS: 00000246
Jan 2 01:40:10 *** kernel: [31928.266851] RAX: ffffffffb4cf6410 RBX: 0000000000000003 RCX: ffff940879db0a40
Jan 2 01:40:10 *** kernel: [31928.266853] RDX: 00000000015bbe9a RSI: ffffbc074009fe88 RDI: 00001d0244b73f00
Jan 2 01:40:10 *** kernel: [31928.266854] RBP: ffff9407c02997c0 R08: 0000000000000001 R09: 0000000000040800
Jan 2 01:40:10 *** kernel: [31928.266856] R10: 0000000000040800 R11: 0000000000000000 R12: 0000000000000000
Jan 2 01:40:10 *** kernel: [31928.266857] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Jan 2 01:40:10 *** kernel: [31928.266882] ? __sched_text_end+0x6/0x6
Jan 2 01:40:10 *** kernel: [31928.266889] default_idle+0xa/0x20
Jan 2 01:40:10 *** kernel: [31928.266892] default_idle_call+0x3c/0xd0
Jan 2 01:40:10 *** kernel: [31928.266896] do_idle+0x20c/0x2b0
Jan 2 01:40:10 *** kernel: [31928.266922] cpu_startup_entry+0x19/0x20
Jan 2 01:40:10 *** kernel: [31928.266925] secondary_startup_64_no_verify+0xb0/0xbb


Any idea how to debug this?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
Difficult :wink: People have been experiencing unexplainable freezes when overprovisioning CPU cores. I don't know if 4 cores for the VM on a 4-core machine qualifies, but could you try setting the number of cores to 2, just as a test?
 

Seganku

Cadet
Joined
Feb 6, 2023
Messages
1
I had this same issue running Home Assistant OS as a VM on Proxmox. I had configured 4 cores on my 4-core machine, and reducing it to 2 cores appears to have solved the issue. Thanks! (Sorry for replying as a non-TrueNAS user, but I hoped the additional information might prove useful. This behaviour may not be unique to TrueNAS and could come down to some underlying shared virtualization code, or simply overprovisioned CPU cores.)

Thank you!
 

Micko

Cadet
Joined
Mar 10, 2020
Messages
7
I'm experiencing the same issue on TrueNAS Core
Version: TrueNAS-13.0-U3.1
64GB ECC - Xeon D-1521

I'm running just an Alpine Linux VM on bhyve, but after 3-4 hours the stalls begin.

[52416.873298] watchdog: BUG: soft lockup - CPU#4 stuck for 9384s! [kworker/4:3:3969]
[52416.873309] Modules linked in: tcp_diag udp_diag inet_diag veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_tables x_tables br_netfilter bridge stp llc overlay nfs lockd grace sunrpc fscache netfs nls_utf8 nls_cp437 vfat fat mousedev ipv6 af_packet psmouse serio_raw pcspkr efi_pstore intel_rapl_msr input_leds intel_rapl_common crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl evdev button efivarfs dm_mod virtio_net net_failover failover virtio_blk crc32_pclmul virtio_pci virtio_pci_modern_dev virtio_ring virtio simpledrm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea cec drm i2c_core drm_panel_orientation_quirks agpgart loop ext4 crc32c_generic crc32c_intel crc16 mbcache jbd2 usb_storage usbcore usb_common sd_mod t10_pi scsi_mod
[52416.873384] CPU: 4 PID: 3969 Comm: kworker/4:3 Tainted: G L 5.15.86-0-lts #1-Alpine
[52416.873388] Hardware name: FreeBSD BHYVE/BHYVE, BIOS 13.0 11/10/2020
[52416.873391] Workqueue: events netstamp_clear
[52416.873399] RIP: 0010:smp_call_function_many_cond+0x117/0x2f0
[52416.873406] Code: ff 48 89 de e8 aa 91 37 00 3b 05 f8 96 1f 01 89 c7 73 21 48 63 c7 49 8b 0e 48 03 0c c5 c0 ba 0d 98 8b 41 08 a8 01 74 0a f3 90 <8b> 51 08 83 e2 01 75 f6 eb cd 48 83 c4 48 5b 5d 41 5c 41 5d 41 5e
[52416.873408] RSP: 0018:ffffbe8b807cfd90 EFLAGS: 00000202
[52416.873411] RAX: 0000000000000011 RBX: ffffa453fbd2d208 RCX: ffffa453fbcf1f60
[52416.873413] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000003
[52416.873414] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[52416.873415] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[52416.873417] R13: 0000000000031f60 R14: ffffa453fbd2d200 R15: 0000000000000100
[52416.873419] FS: 0000000000000000(0000) GS:ffffa453fbd00000(0000) knlGS:0000000000000000
[52416.873421] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[52416.873422] CR2: 000000c00014b010 CR3: 000000011920a006 CR4: 00000000003706e0
[52416.873424] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[52416.873426] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[52416.873427] Call Trace:
[52416.873429] <TASK>
[52416.873431] ? add_nops+0xd0/0xd0
[52416.873437] on_each_cpu_cond_mask+0x19/0x30
[52416.873441] text_poke_bp_batch+0xa9/0x290
[52416.873445] text_poke_finish+0x1b/0x30
[52416.873449] arch_jump_label_transform_apply+0x16/0x30
[52416.873453] static_key_enable_cpuslocked+0x6f/0xa0
[52416.873458] static_key_enable+0x16/0x20
[52416.873461] process_one_work+0x1d0/0x360
[52416.873465] worker_thread+0x4d/0x3b0
[52416.873468] ? queue_work_node+0x100/0x100
[52416.873471] kthread+0x118/0x140
[52416.873475] ? set_kthread_struct+0x60/0x60
[52416.873478] ret_from_fork+0x22/0x30
[52416.873484] </TASK>

The VM is running a 5.x kernel:
alpinedocker01:~# uname -r
5.15.86-0-lts

I have already tried switching the disk (4x4TB RAIDZ2) and the NIC to VirtIO, but nothing changes.
Even adding/removing cores from the VM didn't help (it's now set to 4 CPUs with 2 cores).


Any suggestion to avoid this behaviour? I still have access to the VM via SSH, so if you need other logs I'll be glad to provide them.
Thanks
Ric
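(For reference, one way to make sure the trace is captured even if the VM eventually locks up hard is to stream the kernel log over that SSH session to another machine. A minimal sketch, assuming the kernel messages land in /var/log/messages as in the posts above; the hostname is just a placeholder:)

# run from another machine; keeps a copy of the guest kernel log off the VM
ssh root@alpinedocker01 'tail -f /var/log/messages' | tee alpine-kernel.log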
 

RandomLegend

Dabbler
Joined
Nov 1, 2021
Messages
34
Sorry to bump this thread, but I'm here with the same issue.

I have TrueNAS Core running on a Xeon E3-1246 v3 (4 cores, 8 threads) with 16GB ECC RAM.

I then have an Ubuntu VM and gave it 2 cores, 4 threads and 4GB RAM.

The VM can run for months without issues, but as soon as I strain it with some bulk work that's CPU-intensive, it gets stuck on a CPU as well...

The CPU usage shown in TrueNAS is always very low (usually below 10%), even while the VM is crashing.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
If you specified 1 CPU, 2 cores and 4 threads in the VM settings, that's actually 1 × 2 × 4 = 8 threads in total... too many for a VM on a host system that only has 8 threads itself.
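Under the hood those three fields become the bhyve CPU topology, and the counts multiply. Roughly, with illustrative values rather than a full bhyve command line:

# 1 socket x 2 cores x 4 threads = 8 vCPUs presented to the guest (entire 8-thread host)
bhyve -c sockets=1,cores=2,threads=4 ...
# dropping to 2 threads leaves 4 vCPUs, so the host keeps some headroom
bhyve -c sockets=1,cores=2,threads=2 ...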
 

seanm

Guru
Joined
Jun 11, 2018
Messages
570
I have successfully applied @gusev.vitaliy's patch against a local copy of the TrueNAS source code, re-compiled, and replaced /boot/kernel/vmm.ko with my build, and it works in the sense that bhyve still works. Now let's see if my Ubuntu VMs can stay up for more than a few days...
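(In rough outline the module swap goes something like the sketch below; the checkout path and patch file name are placeholders, the TrueNAS build tree may need extra setup, and the built vmm.ko can end up under an obj directory depending on the environment:)

# apply the patch to a local checkout of the TrueNAS/FreeBSD source
cd /usr/src
patch -p1 < ~/vmm-fix.patch
# rebuild just the vmm kernel module
make -C sys/modules/vmm
# keep a copy of the shipped module, install the rebuilt one, reboot
cp /boot/kernel/vmm.ko /boot/kernel/vmm.ko.orig
cp sys/modules/vmm/vmm.ko /boot/kernel/vmm.ko
shutdown -r now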
 