Ubuntu 22.04 LTS VM constantly crashes despite having plenty of resources.

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
This has been annoying me for several months now. I use an Ubuntu VM to run several Docker containers on my TrueNAS 13.0-U2 server.
I've given it 12GB of RAM, 4C/8T of my 8C/16T CPU and 150GB of hard drive space. I do not see a reason why it should crash. Yet it does.
Sometimes it runs perfectly smooth for weeks and sometimes it crashes after a single day.

VM.PNG

This is what it usually displays when the VM isn't responding any more.

I've considered switching to SCALE to see whether it's a bhyve problem. Has anyone else encountered this?
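In case it helps anyone debug this, the full soft-lockup stack traces are worth saving before the guest gets reset, since they name the task the CPU was stuck in. A minimal sketch, assuming a systemd-based guest like Ubuntu 22.04 (the grep pattern matches the standard kernel watchdog message):

```shell
# Make the journal persistent so kernel messages survive the forced reset
# (no-op if /var/log/journal already exists):
#   sudo mkdir -p /var/log/journal && sudo systemctl restart systemd-journald

# After the next freeze and reset, pull the traces from the previous boot:
#   journalctl -k -b -1 | grep -i -B 2 -A 20 'soft lockup'

# The watchdog line itself looks like this; the bracketed task name is the
# process the stuck CPU was running:
echo 'watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [dockerd:1234]'
```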
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
Has anyone found a solution for this? The Ubuntu VM crashes pretty much every other day with these errors. I've tried giving it even more RAM, more cores, fewer cores, etc., but the behaviour persists. I have no idea why it constantly claims to be starved for CPU; the TrueNAS machine itself runs without any errors for months between reboots.
VM crash2.PNG
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Maybe have a read of this thread... it seems related to your issue and has some suggestions on how to handle it.

 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
It would also help to know, as per the forum rules, what hardware this is running on.
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
Well, today the problem went from annoying to deal-breaking for me.

My hardware:
Supermicro X10SRH-CLN4F
Onboard LSI SAS 3008 controller crossflashed to IT mode; the board runs the most recent BIOS and IPMI versions.
Intel Xeon E5-2620 v4
128GB DDR4 registered ECC 2133MHz Hynix RAM
Boot drive: 128GB Kingston SATA SSD
4 pools for data:
8x 8TB Seagate IronWolf, RAIDZ2
8x 18TB Toshiba, RAIDZ2
8x 12TB shucked WD drives, RAIDZ2
6x 12TB shucked WD drives, RAIDZ2
1 pool for apps, zvols etc.:
1TB M.2 NVMe SSD via a PCIe-to-M.2 adapter card
ASUS XG-C100C 10Gbit/s Ethernet card
2x Dell PERC H310 flashed to IT mode
Corsair HX850i PSU
Plugged into a UPS.
TrueNAS-13.0-U4 is currently running on it.

Both TrueNAS CORE itself and a bunch of jails and plugins (Plex, MineOS, Syncthing, HandBrake) run rock solid without any errors for months at a time.

1679506977687.png

The Ubuntu Server VM, on the other hand, does not. I have tried changing the vCPU count (2-8) and the RAM size (4GB-16GB), and I have manually increased the swap size within Ubuntu, as suggested as one of the solutions. All to no avail.
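For reference, the in-guest swap increase was done along these lines, via a swap file. This is a sketch; the size and path here are illustrative, not the exact values used, and the privileged activation steps are shown as comments:

```shell
# Sketch of adding a swap file inside an Ubuntu guest. Real usage would
# target something like /swapfile2 with a size such as 4G; a small file in
# /tmp is used here so the sketch runs without root.
SWAPFILE=/tmp/swapfile.demo
fallocate -l 16M "$SWAPFILE" 2>/dev/null \
    || dd if=/dev/zero of="$SWAPFILE" bs=1M count=16 status=none
chmod 600 "$SWAPFILE"        # swap files must not be world-readable
mkswap "$SWAPFILE"           # write the swap signature

# Activation and persistence (root required):
#   sudo swapon /swapfile2
#   echo '/swapfile2 none swap sw 0 0' | sudo tee -a /etc/fstab
#   swapon --show            # verify the new swap is active
rm -f "$SWAPFILE"
```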

Sometimes the VM runs for a week before throwing the "soft lockup" errors; sometimes it happens within hours of starting it. Neither the RAM, nor the swap, nor the disk space of the zvol is anywhere near full utilization.
htop within the VM with all my containers running.
1679509399573.png


So far I have simply restarted the VM whenever it crashed, which was very annoying but manageable.
Today, when I tried to power off the VM (in the frozen state it no longer reacts to soft shutdowns), the UI got stuck at "Please wait" for 10 minutes before I manually reloaded the page (powering off the VM usually takes a few seconds).

The UI then showed the VM as being off, which is good.
However, when I try to turn it back on, I get this error:
1679507490015.png

Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 139, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1247, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1152, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 979, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/vm.py", line 1598, in start
    self.vms[vm['name']].start(vm_data=vm)
KeyError: 'udocker'

After googling these errors, I found a thread describing a similar problem on SCALE where the workaround was to restart middlewared.
Which I did via:
service middlewared restart

However, the error persists, and I can only start the VM again after rebooting the TrueNAS CORE machine itself, which takes forever and is not good for the health of my 30 spinning disks.
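One thing worth trying before a full host reboot is tearing down the stuck bhyve instance by hand. This is a sketch, not an official procedure; the instance name 1_udocker is inferred from the KeyError above (bhyve instances on CORE are usually named <id>_<vmname>) and should be verified first:

```shell
# On the TrueNAS CORE host, as root. Active bhyve instances appear as
# entries under /dev/vmm:
ls /dev/vmm 2>/dev/null || echo "no bhyve instances found"

# Destroy the hung instance (replace 1_udocker with the name listed above),
# then restart the middleware so it drops the stale VM state:
if command -v bhyvectl >/dev/null 2>&1; then
    bhyvectl --vm=1_udocker --destroy
    service middlewared restart
else
    echo "bhyvectl not available on this system"
fi
```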

So does anyone have a fix/workaround/whatever to restart the VM without restarting the host system? Or, better yet, a way to prevent the bloody VM from freezing in the first place?
At this point I'm considering getting a second server merely for the VM workloads, which would be a huge waste of money and electricity, since the VM runs alright on my CORE machine (until it freezes).
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Your NIC does not work well with FreeBSD/TrueNAS CORE. Can you try switching to an on-board NIC and see what that does?
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
Your NIC does not work well with FreeBSD/TrueNAS CORE. Can you try switching to an on-board NIC and see what that does?
I forgot to mention that I configured that NIC as a direct connection to my workstation on another subnet (works great, no issues there). Neither the VM nor the host system itself uses it; the only thing happening on it is me accessing SMB shares from my PC.
Plus, I already had the VM freezes before I installed that NIC, so I think we can safely rule it out.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
so I think we can safely rule it out.
I don't dispute that your argument sounds logical. However, experience shows two things over and over again: 1) sometimes a seemingly illogical change solves a problem; 2) changing one variable, even if it seems unrelated, often helps to gain additional insight into the root cause.

In that light it could really help. Plus you have installed an update of TrueNAS somewhere in the process.
 

adrianwi

Guru
Joined
Oct 15, 2013
Messages
1,231
It doesn't help you, but I have 3 VMs running Ubuntu 22.04 LTS (and one running 20.04!), and they are all rock solid and have been running for 25 days since my last reboot. I don't think I've ever seen one crash in the years I've run them.
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
In that light it could really help. Plus you have installed an update of TrueNAS somewhere in the process.
Multiple minor and major updates of both the host OS and the VM OS.

The VM freezes started when I was still using TrueNAS CORE 12 and Ubuntu Server 20.04 LTS in a different VM (which also froze, which is why I switched to an actual Raspberry Pi to deploy Pi-hole instead of using that freezing VM).

At that time my motherboard ran a different BIOS version and a different IPMI version, the LSI 3008 controller and the two Dell PERC H310 controllers were on different firmware versions, I had no 10Gbit/s card in the system, the zvol was on another pool, etc., and yet the problem was the same.

I can remove the 10Gbit/s card, but I think it's insanely unlikely to be responsible for the problem, since it was not yet present when the problem started occurring and the problem did not change at all after I installed it.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
ASUS XG-C100C 10GBit/s Ethernet card
I would put a fairly strong bet on the Aquantia chip here being the cause of at least some of the instability, but if we regard that as understood, I'll move on to other options.

I would suggest changing the core count to 2 and upping the threads to 2 (still giving you 4 threads at the end).

Are you running virtio for the NIC and disk in the VM?
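For reference, that can be verified from inside the guest, roughly like this. A sketch assuming standard Ubuntu tooling (lspci from pciutils and ethtool may need installing):

```shell
# Inside the Ubuntu guest: confirm the paravirtual (virtio) drivers are in use.
lspci 2>/dev/null | grep -i virtio || true   # expect Virtio network/block devices
lsblk -d -o NAME,SIZE 2>/dev/null || true    # virtio disks show up as vda, vdb, ...

# A virtio NIC reports its driver as virtio_net:
#   ethtool -i <interface> | grep '^driver'
# expected line: driver: virtio_net
```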
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
What are you using to keep time in sync on the VM? The one thing I noticed when I used to run VMs on TN is that the time tends to drift and wander around. I don't think that would result in a soft lockup, but food for thought...
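A quick way to check the sync state inside a systemd-based guest, as a sketch (chrony users would look at `chronyc tracking` instead):

```shell
# Inside the guest: is NTP enabled, and is the clock currently synchronized?
timedatectl show -p NTP -p NTPSynchronized 2>/dev/null \
    || echo "timedatectl not available"

# Turn systemd-timesyncd on if it was off:
#   sudo timedatectl set-ntp true
```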
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
I would put a fairly strong bet on the Aquantia chip here being the cause of at least some of the instability, but if we regard that as understood, I'll move on to other options.

I would suggest changing the core count to 2 and upping the threads to 2 (still giving you 4 threads at the end).

Are you running virtio for the NIC and disk in the VM?
I removed the network card and, as I expected, the VM still froze after half a day of uptime.

I've also tried a core count of 2 with 2 threads each, which also didn't prevent another freeze a few hours after starting the VM up again.

Both the disk and the NIC are set to VirtIO.
 

vidx

Dabbler
Joined
Oct 16, 2021
Messages
40
I'm also facing the same issue. I have a 1-core, 1-thread Ubuntu 22.04 LTS VM with around 30 Docker containers that ran fine for a very long time. After adding Nextcloud and Plex to the containers, I increased the VM to 2 cores and 2 threads today. That's when the soft lockup CPU bug started appearing, at times disabling the VM altogether. I read somewhere that changing the NIC from e1000 to VirtIO helps, and it did somewhat: the VM ran for a few hours without issue, but then the soft lockup message came up again.

Is there a solution for this? Or is a switch to SCALE preferred?
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
I'm also facing the same issue. I have a 1-core, 1-thread Ubuntu 22.04 LTS VM with around 30 Docker containers that ran fine for a very long time. After adding Nextcloud and Plex to the containers, I increased the VM to 2 cores and 2 threads today. That's when the soft lockup CPU bug started appearing, at times disabling the VM altogether. I read somewhere that changing the NIC from e1000 to VirtIO helps, and it did somewhat: the VM ran for a few hours without issue, but then the soft lockup message came up again.

Is there a solution for this? Or is a switch to SCALE preferred?
I never found a solution; instead I set up an old PC of mine with Proxmox, and it now runs the exact same Ubuntu Server 22.04 LTS VM with the exact same stuff as on TrueNAS CORE (I copied everything over), with the small difference that it does not crash. A waste of electricity, but I was tired of having no solution and random but frequent crashes. Good luck to you, and do tell if you try SCALE and it ends up working for you.
 

vidx

Dabbler
Joined
Oct 16, 2021
Messages
40
I never found a solution; instead I set up an old PC of mine with Proxmox, and it now runs the exact same Ubuntu Server 22.04 LTS VM with the exact same stuff as on TrueNAS CORE (I copied everything over), with the small difference that it does not crash. A waste of electricity, but I was tired of having no solution and random but frequent crashes. Good luck to you, and do tell if you try SCALE and it ends up working for you.
I took the plunge and upgraded CORE to SCALE. It was one hell of a ride, because the VM had no network after the upgrade despite my assigning a bridge NIC. It took me many hours to resolve the problem, but that's another story.

Conclusion: it worked and is looking stable at the moment.
 

lilomaster

Cadet
Joined
May 30, 2022
Messages
9
A couple of weeks ago, I upgraded my hardware from an AMD FX-8350 to an Intel Xeon E5-2670 v3. I didn't experience this issue with the AMD hardware, but as soon as I upgraded to the Intel Xeon processor, I started experiencing the very same issue described in this post, although not to the extent that I cannot restart the VM anymore. The motherboard I'm using still features a Realtek NIC (just like the old motherboard with the AMD processor), and I tried both the VirtIO and the e1000 virtual NICs for the VM. Might it be anything related to Intel or Xeon processors or chipsets? I don't want to upgrade to SCALE because I feel CORE is much more mature, and many things work more straightforwardly in CORE compared to SCALE (a friend of mine installed SCALE and has gone through hours of troubleshooting to set up things that are very straightforward with CORE). Hopefully this information helps to troubleshoot the issue.
 

Wafflez19

Cadet
Joined
May 25, 2020
Messages
3
I've also been seeing the same issues since around 13.0-U2; I don't recall the issue being present in the 13.0 release or U1. I ended up moving most of my services off Ubuntu VMs and onto Windows. More resource use, but I just got tired of trying different settings (cores, threads, different image versions and types, storage and network settings, etc.). I still need to kick one Ubuntu VM every couple of days, if I'm lucky. I've noticed that the harder the VM works, the faster it locks up. It's a bit of a blur, but I'm pretty sure I had v3 processors in when the issue started.

TrueNAS-13.0-U5.2
R730XD
RAM: 768.00 GB
CPU E5-2697A v4
BRCM 10G/GbE 2+2P 57800 rNDC
24 x 600 GB 12G SAS
 