Debian VM freezes after several hours

wuxia

Dabbler
Joined
Jan 7, 2016
Messages
49
I have a strange case of a Debian installation working for several hours and then becoming unresponsive and freezing. I can't find anything of interest in the Debian logs. The installation runs some Docker apps and has 12GB of RAM available, of which it uses only 5-6GB at most. How can I debug the reason for the freeze? I'm running a FreeBSD VM alongside it that works like a charm, and the same Docker apps ran on the same Debian version non-virtualised with uptimes of months.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,907
Please have a look at the forum rules (linked in red at the top of the web pages) to see what information is required to answer questions like this.
 

wuxia

Dabbler
Joined
Jan 7, 2016
Messages
49
Sorry, I'm running TrueNAS-13.0-U3.1 on X10SDV-4C-TLN2F (Xeon D-1521) with 32GB RAM.

My VM settings:
Virtual CPUs: 1
Cores: 4
Threads: 1
Memory Size: 12.00 GiB
Virtual NIC: Intel e82545 (e1000)
Virtual Disk: AHCI - 50GiB ZVol

My VM has not frozen yet, but I found this in /var/log/messages, which seems relevant:
[ 813.077332] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 813.077731] rcu: 3-...!: (0 ticks this GP) idle=be4/0/0x0 softirq=31861/31861 fqs=0 (false positive?)
[ 813.077768] (detected by 2, t=5252 jiffies, g=59585, q=1518)
[ 813.077771] Sending NMI from CPU 2 to CPUs 3:
[ 813.077803] NMI backtrace for cpu 3 skipped: idling at native_safe_halt+0xe/0x20
[ 813.077943] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 813.077945] clocksource: 'hpet' wd_now: 2c87528c wd_last: 178404e2 mask: ffffffff
[ 813.077946] clocksource: 'tsc' cs_now: 529d486ad81c cs_last: 529284fe2a94 mask: ffffffffffffffff
[ 813.077948] tsc: Marking TSC unstable due to clocksource watchdog
[ 813.078774] rcu: rcu_sched kthread starved for 5252 jiffies! g59585 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=3
[ 813.078791] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 813.078814] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 813.078816] rcu: RCU grace-period kthread stack dump:
[ 813.078854] sched_clock: Marking unstable (812569742337, 497610293)<-(813080613760, -1823243)
[ 813.067441] task:rcu_sched state:I stack: 0 pid: 12 ppid: 2 flags:0x00004000
[ 813.067444] Call Trace:
[ 813.067454] __schedule+0x282/0x880
[ 813.067458] schedule+0x46/0xb0
[ 813.067460] schedule_timeout+0x8b/0x150
[ 813.067464] ? __next_timer_interrupt+0x110/0x110
[ 813.067469] rcu_gp_kthread+0x51b/0xbb0
[ 813.067472] ? rcu_cpu_kthread+0x190/0x190
[ 813.067476] kthread+0x11b/0x140
[ 813.067479] ? __kthread_bind_mask+0x60/0x60
[ 813.067483] ret_from_fork+0x22/0x30
[ 813.092517] clocksource: Checking clocksource tsc synchronization from CPU 3.
[ 813.092575] clocksource: Switched to clocksource hpet
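For what it's worth, the log itself suggests booting with 'tsc=unstable'. A minimal sketch of how that could be tried, assuming a standard Debian GRUB setup (whether it actually addresses the bhyve clock skew is another matter):

# /etc/default/grub -- add tsc=unstable to the kernel command line, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet tsc=unstable"
update-grub
reboot
# after reboot, check which clocksource the guest settled on
cat /sys/devices/system/clocksource/clocksource0/current_clocksource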
 

wuxia

Dabbler
Joined
Jan 7, 2016
Messages
49
The freezes continue while the FreeBSD VM stays stable. This is the last message I found in /var/log/messages:

Jan 2 01:40:10 *** kernel: [31928.266310] (t=8151 jiffies g=2780713 q=3534)
Jan 2 01:40:10 *** kernel: [31928.266519] task:rcu_sched state:I stack: 0 pid: 12 ppid: 2 flags:0x00004000
Jan 2 01:40:10 *** kernel: [31928.266523] Call Trace:
Jan 2 01:40:10 *** kernel: [31928.266559] __schedule+0x282/0x880
Jan 2 01:40:10 *** kernel: [31928.266567] ? __switch_to_asm+0x3a/0x60
Jan 2 01:40:10 *** kernel: [31928.266592] schedule+0x46/0xb0
Jan 2 01:40:10 *** kernel: [31928.266594] schedule_timeout+0x8b/0x150
Jan 2 01:40:10 *** kernel: [31928.266598] ? __next_timer_interrupt+0x110/0x110
Jan 2 01:40:10 *** kernel: [31928.266602] rcu_gp_kthread+0x51b/0xbb0
Jan 2 01:40:10 *** kernel: [31928.266605] ? rcu_cpu_kthread+0x190/0x190
Jan 2 01:40:10 *** kernel: [31928.266609] kthread+0x11b/0x140
Jan 2 01:40:10 *** kernel: [31928.266633] ? __kthread_bind_mask+0x60/0x60
Jan 2 01:40:10 *** kernel: [31928.266635] ret_from_fork+0x22/0x30
Jan 2 01:40:10 *** kernel: [31928.266649] NMI backtrace for cpu 3
Jan 2 01:40:10 *** kernel: [31928.266676] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.10.0-19-amd64 #1 Debian 5.10.149-2
Jan 2 01:40:10 *** kernel: [31928.266677] Hardware name: FreeBSD BHYVE/BHYVE, BIOS 13.0 11/10/2020
Jan 2 01:40:10 *** kernel: [31928.266678] Call Trace:
Jan 2 01:40:10 *** kernel: [31928.266681] <IRQ>
Jan 2 01:40:10 *** kernel: [31928.266687] dump_stack+0x6b/0x83
Jan 2 01:40:10 *** kernel: [31928.266691] nmi_cpu_backtrace.cold+0x32/0x69
Jan 2 01:40:10 *** kernel: [31928.266718] ? lapic_can_unplug_cpu+0x80/0x80
Jan 2 01:40:10 *** kernel: [31928.266729] nmi_trigger_cpumask_backtrace+0xdf/0xf0
Jan 2 01:40:10 *** kernel: [31928.266758] rcu_dump_cpu_stacks+0xa2/0xd4
Jan 2 01:40:10 *** kernel: [31928.266761] rcu_sched_clock_irq.cold+0x1ff/0x3d6
Jan 2 01:40:10 *** kernel: [31928.266764] update_process_times+0x8c/0xc0
Jan 2 01:40:10 *** kernel: [31928.266769] tick_sched_handle+0x22/0x60
Jan 2 01:40:10 *** kernel: [31928.266773] tick_sched_timer+0x80/0xc0
Jan 2 01:40:10 *** kernel: [31928.266775] ? tick_do_update_jiffies64.part.0+0xc0/0xc0
Jan 2 01:40:10 *** kernel: [31928.266799] __hrtimer_run_queues+0x12a/0x280
Jan 2 01:40:10 *** kernel: [31928.266802] hrtimer_interrupt+0x110/0x2c0
Jan 2 01:40:10 *** kernel: [31928.266805] __sysvec_apic_timer_interrupt+0x5f/0xe0
Jan 2 01:40:10 *** kernel: [31928.266810] asm_call_irq_on_stack+0x12/0x20
Jan 2 01:40:10 *** kernel: [31928.266812] </IRQ>
Jan 2 01:40:10 *** kernel: [31928.266816] sysvec_apic_timer_interrupt+0x72/0x80
Jan 2 01:40:10 *** kernel: [31928.266840] asm_sysvec_apic_timer_interrupt+0x12/0x20
Jan 2 01:40:10 *** kernel: [31928.266844] RIP: 0010:native_safe_halt+0xe/0x20
Jan 2 01:40:10 *** kernel: [31928.266846] Code: 00 f0 80 48 02 20 48 8b 00 a8 08 75 c0 e9 77 ff ff ff cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 0f 00 2d a6 39 51 00 fb f4 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00
Jan 2 01:40:10 *** kernel: [31928.266848] RSP: 0018:ffffbc074009fef0 EFLAGS: 00000246
Jan 2 01:40:10 *** kernel: [31928.266851] RAX: ffffffffb4cf6410 RBX: 0000000000000003 RCX: ffff940879db0a40
Jan 2 01:40:10 *** kernel: [31928.266853] RDX: 00000000015bbe9a RSI: ffffbc074009fe88 RDI: 00001d0244b73f00
Jan 2 01:40:10 *** kernel: [31928.266854] RBP: ffff9407c02997c0 R08: 0000000000000001 R09: 0000000000040800
Jan 2 01:40:10 *** kernel: [31928.266856] R10: 0000000000040800 R11: 0000000000000000 R12: 0000000000000000
Jan 2 01:40:10 *** kernel: [31928.266857] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Jan 2 01:40:10 *** kernel: [31928.266882] ? __sched_text_end+0x6/0x6
Jan 2 01:40:10 *** kernel: [31928.266889] default_idle+0xa/0x20
Jan 2 01:40:10 *** kernel: [31928.266892] default_idle_call+0x3c/0xd0
Jan 2 01:40:10 *** kernel: [31928.266896] do_idle+0x20c/0x2b0
Jan 2 01:40:10 *** kernel: [31928.266922] cpu_startup_entry+0x19/0x20
Jan 2 01:40:10 *** kernel: [31928.266925] secondary_startup_64_no_verify+0xb0/0xbb


Any idea how to debug this?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
Difficult :wink: People have been experiencing unexplainable freezes when overprovisioning CPU cores. I don't know if 4 cores for the VM on a 4-core machine qualifies, but could you try setting the number of cores to 2, just as a test?
 

Seganku

Cadet
Joined
Feb 6, 2023
Messages
1
I had this same issue running Home Assistant OS as a VM on Proxmox. I had configured 4 cores on my 4-core machine, and reducing it to 2 cores appears to have solved the issue. Thanks! (Sorry for replying as a non-TrueNAS user, but I hoped the additional information might prove useful. This behaviour may not be unique to TrueNAS and could come down to some underlying shared virtualization code, or simply overprovisioned CPU cores.)

Thank you!
 

Micko

Cadet
Joined
Mar 10, 2020
Messages
7
I'm experiencing the same issue on TrueNAS Core
Version: TrueNAS-13.0-U3.1
64GB ECC - Xeon D-1521

I'm running just an Alpine Linux VM on bhyve, but after 3-4 hours the stalls begin.

[52416.873298] watchdog: BUG: soft lockup - CPU#4 stuck for 9384s! [kworker/4:3:3969]
[52416.873309] Modules linked in: tcp_diag udp_diag inet_diag veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_tables x_tables br_netfilter bridge stp llc overlay nfs lockd grace sunrpc fscache netfs nls_utf8 nls_cp437 vfat fat mousedev ipv6 af_packet psmouse serio_raw pcspkr efi_pstore intel_rapl_msr input_leds intel_rapl_common crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl evdev button efivarfs dm_mod virtio_net net_failover failover virtio_blk crc32_pclmul virtio_pci virtio_pci_modern_dev virtio_ring virtio simpledrm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea cec drm i2c_core drm_panel_orientation_quirks agpgart loop ext4 crc32c_generic crc32c_intel crc16 mbcache jbd2 usb_storage usbcore usb_common sd_mod t10_pi scsi_mod
[52416.873384] CPU: 4 PID: 3969 Comm: kworker/4:3 Tainted: G L 5.15.86-0-lts #1-Alpine
[52416.873388] Hardware name: FreeBSD BHYVE/BHYVE, BIOS 13.0 11/10/2020
[52416.873391] Workqueue: events netstamp_clear
[52416.873399] RIP: 0010:smp_call_function_many_cond+0x117/0x2f0
[52416.873406] Code: ff 48 89 de e8 aa 91 37 00 3b 05 f8 96 1f 01 89 c7 73 21 48 63 c7 49 8b 0e 48 03 0c c5 c0 ba 0d 98 8b 41 08 a8 01 74 0a f3 90 <8b> 51 08 83 e2 01 75 f6 eb cd 48 83 c4 48 5b 5d 41 5c 41 5d 41 5e
[52416.873408] RSP: 0018:ffffbe8b807cfd90 EFLAGS: 00000202
[52416.873411] RAX: 0000000000000011 RBX: ffffa453fbd2d208 RCX: ffffa453fbcf1f60
[52416.873413] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000003
[52416.873414] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[52416.873415] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[52416.873417] R13: 0000000000031f60 R14: ffffa453fbd2d200 R15: 0000000000000100
[52416.873419] FS: 0000000000000000(0000) GS:ffffa453fbd00000(0000) knlGS:0000000000000000
[52416.873421] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[52416.873422] CR2: 000000c00014b010 CR3: 000000011920a006 CR4: 00000000003706e0
[52416.873424] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[52416.873426] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[52416.873427] Call Trace:
[52416.873429] <TASK>
[52416.873431] ? add_nops+0xd0/0xd0
[52416.873437] on_each_cpu_cond_mask+0x19/0x30
[52416.873441] text_poke_bp_batch+0xa9/0x290
[52416.873445] text_poke_finish+0x1b/0x30
[52416.873449] arch_jump_label_transform_apply+0x16/0x30
[52416.873453] static_key_enable_cpuslocked+0x6f/0xa0
[52416.873458] static_key_enable+0x16/0x20
[52416.873461] process_one_work+0x1d0/0x360
[52416.873465] worker_thread+0x4d/0x3b0
[52416.873468] ? queue_work_node+0x100/0x100
[52416.873471] kthread+0x118/0x140
[52416.873475] ? set_kthread_struct+0x60/0x60
[52416.873478] ret_from_fork+0x22/0x30
[52416.873484] </TASK>

The VM is running a 5.x kernel:
alpinedocker01:~# uname -r
5.15.86-0-lts

I have already tried switching the disk (4x4TB RAIDZ2) and the NIC to VirtIO, but nothing changes.
Even adding/removing cores from the VM didn't help (it's now set to 4 CPUs with 2 cores).


Any suggestion to avoid this behaviour? I still have access to the VM via SSH, so if you need other logs I'll be glad to provide them.
Thanks
Ric
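(For reference, one way to make sure the trace is captured even if the VM eventually locks up hard is to stream the kernel log over that SSH session to another machine. A minimal sketch, assuming the kernel messages land in /var/log/messages as in the posts above; the hostname is just a placeholder:)

# run from another machine; keeps a copy of the guest kernel log off the VM
ssh root@alpinedocker01 'tail -f /var/log/messages' | tee alpine-kernel.log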
 

RandomLegend

Dabbler
Joined
Nov 1, 2021
Messages
34
Sorry to bump this thread, but I'm here with the same issue.

I have TrueNAS Core running on a Xeon E3-1246 v3 (4 cores, 8 threads) with 16GB ECC RAM.

I then have an Ubuntu VM and gave it 2 cores, 4 threads and 4GB RAM.

The VM can run for months without issues, but as soon as I strain it with some bulk work that's CPU-intensive, it gets stuck on a CPU as well...

The CPU usage shown in TrueNAS is always very low (usually below 10%), even while the VM is crashing.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
If you specified 1 CPU, 2 cores and 4 threads in the VM settings, that's actually 1 × 2 × 4 = 8 threads in total... too many for a VM on a host system that only has 8 threads itself.
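Under the hood those three fields become the bhyve CPU topology, and the counts multiply. Roughly, with illustrative values rather than a full bhyve command line:

# 1 socket x 2 cores x 4 threads = 8 vCPUs presented to the guest (entire 8-thread host)
bhyve -c sockets=1,cores=2,threads=4 ...
# dropping to 2 threads leaves 4 vCPUs, so the host keeps some headroom
bhyve -c sockets=1,cores=2,threads=2 ...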
 

seanm

Guru
Joined
Jun 11, 2018
Messages
570
I have successfully applied @gusev.vitaliy's patch against a local copy of the TrueNAS source code, re-compiled, and replaced /boot/kernel/vmm.ko with my build, and it works in the sense that bhyve still works. Now let's see if my Ubuntu VMs can stay up for more than a few days...
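(In rough outline the module swap goes something like the sketch below; the checkout path and patch file name are placeholders, the TrueNAS build tree may need extra setup, and the built vmm.ko can end up under an obj directory depending on the environment:)

# apply the patch to a local checkout of the TrueNAS/FreeBSD source
cd /usr/src
patch -p1 < ~/vmm-fix.patch
# rebuild just the vmm kernel module
make -C sys/modules/vmm
# keep a copy of the shipped module, install the rebuilt one, reboot
cp /boot/kernel/vmm.ko /boot/kernel/vmm.ko.orig
cp sys/modules/vmm/vmm.ko /boot/kernel/vmm.ko
shutdown -r now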
 