Ubuntu 22.04 LTS VM constantly crashes despite having plenty of resources.

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
This has been annoying me for several months now. I use an Ubuntu VM to run several Docker containers on my TrueNAS 13.0-U2 server.
I've given it 12GB of RAM, 4C/8T of my 8C/16T CPU and 150GB of hard drive space. I do not see a reason why it should crash. Yet it does.
Sometimes it runs perfectly smooth for weeks and sometimes it crashes after a single day.

VM.PNG

This is what it usually displays when the VM isn't responding any more.

I've considered switching to SCALE to see whether it's a bhyve problem. Has anyone else encountered this?
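In case it helps anyone debug this, the full soft-lockup stack traces are worth saving before the guest gets reset, since they name the task the CPU was stuck in. A minimal sketch, assuming a systemd-based guest like Ubuntu 22.04 (the grep pattern matches the standard kernel watchdog message):

```shell
# Make the journal persistent so kernel messages survive the forced reset
# (no-op if /var/log/journal already exists):
#   sudo mkdir -p /var/log/journal && sudo systemctl restart systemd-journald

# After the next freeze and reset, pull the traces from the previous boot:
#   journalctl -k -b -1 | grep -i -B 2 -A 20 'soft lockup'

# The watchdog line itself looks like this; the bracketed task name is the
# process the stuck CPU was running:
echo 'watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [dockerd:1234]'
```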
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
Has anyone found a solution for this? The Ubuntu VM crashes pretty much every other day with these errors. I've tried giving it even more RAM, more cores, fewer cores, etc., but the behaviour persists. I have no idea why it constantly claims to be starved for CPU; the TrueNAS machine itself runs without any errors for months between reboots.
VM crash2.PNG
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Maybe have a read of this thread... it seems related to your issue and has some suggestions on how to handle it.

 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
It would also help to know, as per the forum rules, what hardware this is running on.
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
Well, today the problem went from annoying to deal-breaking for me.

My hardware:
Supermicro X10SRH-CLN4F
Onboard LSI SAS 3008 controller crossflashed to IT mode; the board runs the most recent BIOS and IPMI versions.
Intel Xeon E5-2620 v4
128GB DDR4 registered ECC 2133MHz Hynix RAM
Boot drive: 128GB Kingston SATA SSD
4 pools for data:
8x 8TB Seagate IronWolf, RAIDZ2
8x 18TB Toshiba, RAIDZ2
8x 12TB shucked WD drives, RAIDZ2
6x 12TB shucked WD drives, RAIDZ2
1 pool for apps, zvols etc.:
1TB M.2 NVMe SSD via a PCIe-to-M.2 adapter card
ASUS XG-C100C 10Gbit/s Ethernet card
2x Dell PERC H310 flashed to IT mode
Corsair HX850i PSU
Plugged into a UPS.
TrueNAS-13.0-U4 is currently running on it.

Both TrueNAS CORE itself and a bunch of jails and plugins (Plex, MineOS, Syncthing, HandBrake) run rock solid without any errors for months at a time.

1679506977687.png

The Ubuntu Server VM, on the other hand, does not. I have tried changing the vCPU count (2-8) and the RAM size (4GB-16GB), and I have manually increased the swap size within Ubuntu, as suggested as one of the solutions. All to no avail.
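For reference, the in-guest swap increase was done along these lines, via a swap file. This is a sketch; the size and path here are illustrative, not the exact values used, and the privileged activation steps are shown as comments:

```shell
# Sketch of adding a swap file inside an Ubuntu guest. Real usage would
# target something like /swapfile2 with a size such as 4G; a small file in
# /tmp is used here so the sketch runs without root.
SWAPFILE=/tmp/swapfile.demo
fallocate -l 16M "$SWAPFILE" 2>/dev/null \
    || dd if=/dev/zero of="$SWAPFILE" bs=1M count=16 status=none
chmod 600 "$SWAPFILE"        # swap files must not be world-readable
mkswap "$SWAPFILE"           # write the swap signature

# Activation and persistence (root required):
#   sudo swapon /swapfile2
#   echo '/swapfile2 none swap sw 0 0' | sudo tee -a /etc/fstab
#   swapon --show            # verify the new swap is active
rm -f "$SWAPFILE"
```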

Sometimes the VM runs for a week before throwing the "soft lockup" errors; sometimes it happens within hours of starting it. Neither the RAM, nor the swap, nor the disk space of the zvol is anywhere near full utilization.
htop within the VM with all my containers running.
1679509399573.png


So far I have simply restarted the VM whenever it crashed, which was very annoying but manageable.
Today, when I tried to power off the VM (in the frozen state it no longer reacts to soft shutdowns), the UI got stuck at "Please wait" for 10 minutes before I manually reloaded the page (powering off the VM usually takes a few seconds).

The UI then showed the VM as being off, which is good.
However, when I try to turn it back on, I get this error:
1679507490015.png

Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 139, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1247, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1152, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 979, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/vm.py", line 1598, in start
    self.vms[vm['name']].start(vm_data=vm)
KeyError: 'udocker'

After googling these errors, I found a thread describing a similar problem on SCALE where the workaround was to restart middlewared.
Which I did via:
service middlewared restart

However, the error persists, and I can only start the VM again after rebooting the TrueNAS CORE machine itself, which takes forever and is not good for the health of my 30 spinning disks.
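One thing worth trying before a full host reboot is tearing down the stuck bhyve instance by hand. This is a sketch, not an official procedure; the instance name 1_udocker is inferred from the KeyError above (bhyve instances on CORE are usually named <id>_<vmname>) and should be verified first:

```shell
# On the TrueNAS CORE host, as root. Active bhyve instances appear as
# entries under /dev/vmm:
ls /dev/vmm 2>/dev/null || echo "no bhyve instances found"

# Destroy the hung instance (replace 1_udocker with the name listed above),
# then restart the middleware so it drops the stale VM state:
if command -v bhyvectl >/dev/null 2>&1; then
    bhyvectl --vm=1_udocker --destroy
    service middlewared restart
else
    echo "bhyvectl not available on this system"
fi
```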

So does anyone have a fix/workaround/whatever to restart the VM without restarting the host system? Or, better yet, a way to prevent the bloody VM from freezing in the first place?
At this point I'm considering getting a second server merely for the VM workloads, which would be a huge waste of money and electricity, since the VM runs alright on my CORE machine (until it freezes).
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Your NIC does not work well with FreeBSD/TrueNAS CORE. Can you try switching to an on-board NIC and see what that does?
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
Your NIC does not work well with FreeBSD/TrueNAS CORE. Can you try switching to an on-board NIC and see what that does?
I forgot to mention that I configured that NIC as a direct connection to my workstation on another subnet (works great, no issues there). Neither the VM nor the host system itself uses it; the only thing happening on it is me accessing SMB shares from my PC.
Plus, I already had the VM freezes before I installed that NIC, so I think we can safely rule it out.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
so I think we can safely rule it out.
I don't dispute that your argument sounds logical. However, experience shows two things over and over again: 1) sometimes a seemingly illogical change solves a problem; 2) changing one variable, even if it seems unrelated, often helps to gain additional insight into the root cause.

In that light it could really help. Plus you have installed an update of TrueNAS somewhere in the process.
 

adrianwi

Guru
Joined
Oct 15, 2013
Messages
1,231
It doesn't help you, but I have 3 VMs running Ubuntu 22.04 LTS (and one running 20.04!), and they are all rock solid and have been running for 25 days since my last reboot. I don't think I've ever seen one crash in the years I've run them.
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
In that light it could really help. Plus you have installed an update of TrueNAS somewhere in the process.
Multiple minor and major updates of both the host OS and the VM OS.

The VM freezes started when I was still using TrueNAS CORE 12 and Ubuntu Server 20.04 LTS in a different VM (which also froze, which is why I switched to an actual Raspberry Pi to deploy Pi-hole instead of using that freezing VM).

At that time my motherboard ran a different BIOS version and a different IPMI version, the LSI 3008 controller and the two Dell PERC H310 controllers were on different firmware versions, I had no 10Gbit/s card in the system, the zvol was on another pool, etc., and yet the problem was the same.

I can remove the 10Gbit/s card, but I think it's insanely unlikely to be responsible for the problem, since it was not yet present when the problem started occurring and the problem did not change at all after I installed it.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
ASUS XG-C100C 10GBit/s Ethernet card
I would put a fairly strong bet on the Aquantia chip here being the cause of at least some of the instability, but if we regard that as understood, I'll move on to other options.

I would suggest changing the core count to 2 and upping the threads to 2 (still giving you 4 threads at the end).

Are you running virtio for the NIC and disk in the VM?
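For reference, that can be verified from inside the guest, roughly like this. A sketch assuming standard Ubuntu tooling (lspci from pciutils and ethtool may need installing):

```shell
# Inside the Ubuntu guest: confirm the paravirtual (virtio) drivers are in use.
lspci 2>/dev/null | grep -i virtio || true   # expect Virtio network/block devices
lsblk -d -o NAME,SIZE 2>/dev/null || true    # virtio disks show up as vda, vdb, ...

# A virtio NIC reports its driver as virtio_net:
#   ethtool -i <interface> | grep '^driver'
# expected line: driver: virtio_net
```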
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
What are you using to keep time in sync on the VM? The one thing I noticed when I used to run VMs on TN is that the time tends to drift and wander around. I don't think that would result in a soft lockup, but food for thought...
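A quick way to check the sync state inside a systemd-based guest, as a sketch (chrony users would look at `chronyc tracking` instead):

```shell
# Inside the guest: is NTP enabled, and is the clock currently synchronized?
timedatectl show -p NTP -p NTPSynchronized 2>/dev/null \
    || echo "timedatectl not available"

# Turn systemd-timesyncd on if it was off:
#   sudo timedatectl set-ntp true
```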
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
I would put a fairly strong bet on the Aquantia chip here being the cause of at least some of the instability, but if we regard that as understood, I'll move on to other options.

I would suggest changing the core count to 2 and upping the threads to 2 (still giving you 4 threads at the end).

Are you running virtio for the NIC and disk in the VM?
I removed the network card and, as I expected, the VM still froze after half a day of uptime.

I've also tried a core count of 2 with 2 threads each, which also didn't prevent another freeze a few hours after starting the VM up again.

Both the disk and the NIC are set to VirtIO.
 

vidx

Dabbler
Joined
Oct 16, 2021
Messages
40
I'm also facing the same issue. I have a 1-core, 1-thread Ubuntu 22.04 LTS VM with around 30 Docker containers that ran fine for a very long time. After adding Nextcloud and Plex to the containers, I increased the VM to 2 cores and 2 threads today. That's when the soft lockup CPU bug started appearing, at times disabling the VM altogether. I read somewhere that changing the NIC from e1000 to VirtIO helps, and it did somewhat: the VM ran for a few hours without issue, but then the soft lockup message came up again.

Is there a solution for this? Or is a switch to SCALE preferred?
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
I'm also facing the same issue. I have a 1-core, 1-thread Ubuntu 22.04 LTS VM with around 30 Docker containers that ran fine for a very long time. After adding Nextcloud and Plex to the containers, I increased the VM to 2 cores and 2 threads today. That's when the soft lockup CPU bug started appearing, at times disabling the VM altogether. I read somewhere that changing the NIC from e1000 to VirtIO helps, and it did somewhat: the VM ran for a few hours without issue, but then the soft lockup message came up again.

Is there a solution for this? Or is a switch to SCALE preferred?
I never found a solution; instead I set up an old PC of mine with Proxmox, and it now runs the exact same Ubuntu Server 22.04 LTS VM with the exact same stuff as on TrueNAS CORE (I copied everything over), with the small difference that it does not crash. A waste of electricity, but I was tired of having no solution and random but frequent crashes. Good luck to you, and do tell if you try SCALE and it ends up working for you.
 

vidx

Dabbler
Joined
Oct 16, 2021
Messages
40
I never found a solution; instead I set up an old PC of mine with Proxmox, and it now runs the exact same Ubuntu Server 22.04 LTS VM with the exact same stuff as on TrueNAS CORE (I copied everything over), with the small difference that it does not crash. A waste of electricity, but I was tired of having no solution and random but frequent crashes. Good luck to you, and do tell if you try SCALE and it ends up working for you.
I took the plunge and upgraded CORE to SCALE. It was one hell of a ride, because the VM had no network after the upgrade despite my assigning a bridge NIC. It took me many hours to resolve the problem, but that's another story.

Conclusion: it worked and is looking stable at the moment.
 

lilomaster

Cadet
Joined
May 30, 2022
Messages
9
A couple of weeks ago, I upgraded my hardware from an AMD FX-8350 to an Intel Xeon E5-2670 v3. I didn't experience this issue with the AMD hardware, but as soon as I upgraded to the Intel Xeon processor, I started experiencing the very same issue described in this post, although not to the extent that I cannot restart the VM anymore. The motherboard I'm using still features a Realtek NIC (just like the old motherboard with the AMD processor), and I tried both the VirtIO and the e1000 virtual NICs for the VM. Might it be anything related to Intel or Xeon processors or chipsets? I don't want to upgrade to SCALE because I feel CORE is much more mature, and many things work more straightforwardly in CORE compared to SCALE (a friend of mine installed SCALE and has gone through hours of troubleshooting to set up things that are very straightforward with CORE). Hopefully this information helps to troubleshoot the issue.
 

Wafflez19

Cadet
Joined
May 25, 2020
Messages
3
I've also been seeing the same issues since around 13.0-U2; I don't recall the issue being present in the 13.0 release or U1. I ended up moving most of my services off Ubuntu VMs and onto Windows. More resource use, but I just got tired of trying different settings (cores, threads, different image versions and types, storage and network settings, etc.). I still need to kick one Ubuntu VM every couple of days, if I'm lucky. I've noticed that the harder the VM works, the faster it locks up. It's a bit of a blur, but I'm pretty sure I had v3 processors in when the issue started.

TrueNAS-13.0-U5.2
R730XD
RAM: 768.00 GB
CPU E5-2697A v4
BRCM 10G/GbE 2+2P 57800 rNDC
24 x 600 GB 12G SAS
 