Virtualization actions slow since hardware upgrade (haproxy backend has no server available)

NoxiousPluK · Jun 18, 2023

Today I did the following steps (in order):
- Upgrade TrueNAS Scale from 22.02.4 to 22.12.3 (everything seemed OK after the upgrade, my VMs started fine)
- Upgrade my motherboard (AsRock B450 Pro4) BIOS from 1.50 to 1.80
- Upgrade my motherboard BIOS from 1.80 to 8.02
- Upgrade my CPU from Ryzen 7 1700X to Ryzen 7 5800X
- Upgrade my RAM from 4x16GB DDR4-2166 to 2x32GB DDR4-3200

I double-checked that hardware virtualization was enabled and enabled SR-IOV (since my NIC supports it) and IOMMU (just in case I wanted to experiment with hardware mapping in the future).

TrueNAS seemed to boot fine, but the Virtualization page in the GUI got stuck on seemingly infinite loading (I gave it ~5-10 minutes, but no result).
To debug, I disabled SR-IOV and IOMMU again and rebooted, but no result.

I went through a bunch of logfiles, but the only seemingly relevant thing I could find was this in /var/log/messages:

Code:

Jun 18 09:59:27 truenas middlewared[2202]: libvirt: QEMU Driver error : Domain not found: no domain with matching name '1_gateway'
Jun 18 09:59:28 truenas middlewared[2202]: libvirt: QEMU Driver error : Domain not found: no domain with matching name '2_dc01'
Jun 18 09:59:29 truenas middlewared[2202]: libvirt: QEMU Driver error : Domain not found: no domain with matching name '3_akpublicrust'
Jun 18 09:59:29 truenas middlewared[2202]: libvirt: QEMU Driver error : Domain not found: no domain with matching name '6_hass'
Jun 18 09:59:30 truenas middlewared[2202]: libvirt: QEMU Driver error : Domain not found: no domain with matching name '7_pufferpanel'
Jun 18 09:59:30 truenas middlewared[2202]: libvirt: QEMU Driver error : Domain not found: no domain with matching name '8_webserver'
Jun 18 09:59:31 truenas middlewared[2202]: libvirt: QEMU Driver error : Domain not found: no domain with matching name '10_pihole2'
Jun 18 09:59:31 truenas middlewared[2202]: libvirt: QEMU Driver error : Domain not found: no domain with matching name '11_bothost'

These being my VM names.
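For anyone wanting to list the affected domains without reading the log by eye, here is a minimal sketch. It assumes the libvirt domain names follow an `<id>_<name>` pattern, which is only inferred from the lines above; the sample text is two lines copied from that excerpt.

```python
import re

# Two sample lines copied from the /var/log/messages excerpt above.
LOG = """\
Jun 18 09:59:27 truenas middlewared[2202]: libvirt: QEMU Driver error : Domain not found: no domain with matching name '1_gateway'
Jun 18 09:59:31 truenas middlewared[2202]: libvirt: QEMU Driver error : Domain not found: no domain with matching name '11_bothost'
"""

# Assumption: domain names look like "<numeric id>_<vm name>".
PATTERN = re.compile(r"Domain not found: no domain with matching name '(\d+)_([^']+)'")


def missing_domains(text):
    """Return (id, name) pairs for every 'Domain not found' line in text."""
    return [(int(m.group(1)), m.group(2)) for m in PATTERN.finditer(text)]


print(missing_domains(LOG))
```

Run against the full log, this gives one entry per VM that middlewared asked libvirt about and did not find.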

After ~10-15 minutes, the page suddenly worked, but all VMs (including those with autostart enabled) were stopped, and hovering over the State-switch shows a tooltip that only states 'ERROR'.

I tried to enable one of the VMs and it took about 10 minutes (with a spinning 'Please wait' prompt) for it to start, but it did end up starting.
At the exact moment that the VM did start, I got this message in my open shell:

Code:

Broadcast message from systemd-journald@truenas (Sun 2023-06-18 12:53:09 CEST):

haproxy[149277]: backend be_13 has no server available!


Broadcast message from systemd-journald@truenas (Sun 2023-06-18 12:53:10 CEST):

haproxy[149277]: backend be_29 has no server available!

2023 Jun 18 12:53:09 truenas backend be_13 has no server available!
2023 Jun 18 12:53:10 truenas backend be_29 has no server available!

Broadcast message from systemd-journald@truenas (Sun 2023-06-18 12:53:10 CEST):

haproxy[149277]: backend be_33 has no server available!

2023 Jun 18 12:53:10 truenas backend be_33 has no server available!

Broadcast message from systemd-journald@truenas (Sun 2023-06-18 12:53:10 CEST):

haproxy[149277]: backend be_38 has no server available!

2023 Jun 18 12:53:10 truenas backend be_38 has no server available!

Broadcast message from systemd-journald@truenas (Sun 2023-06-18 12:53:10 CEST):

haproxy[149277]: backend be_47 has no server available!


Broadcast message from systemd-journald@truenas (Sun 2023-06-18 12:53:11 CEST):

haproxy[149277]: backend be_52 has no server available!

2023 Jun 18 12:53:10 truenas backend be_47 has no server available!
2023 Jun 18 12:53:11 truenas backend be_52 has no server available!

I managed to start two more VMs this way (giving similar messages), and on a fourth attempt to start another one, I got (after ~5 minutes) this error in the GUI:

Code:

CallError

[EFAULT] Failed to connect to libvirt

More info...

 Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/supervisor/supervisor.py", line 172, in start
    if self.domain.create() < 0:
  File "/usr/lib/python3/dist-packages/libvirt.py", line 1353, in create
    raise libvirtError('virDomainCreate() failed')
libvirt.libvirtError: Cannot recv data: Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/vm_lifecycle.py", line 46, in start
    await self.middleware.run_in_thread(self._start, vm['name'])
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1261, in run_in_thread
    return await self.run_in_executor(self.thread_pool_executor, method, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1258, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/vm_supervisor.py", line 68, in _start
    self.vms[vm_name].start(vm_data=self._vm_from_name(vm_name))
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/supervisor/supervisor.py", line 181, in start
    raise CallError('\n'.join(errors))
middlewared.service_exception.CallError: [EFAULT] Cannot recv data: Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 204, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1344, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1378, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1246, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/vm_lifecycle.py", line 48, in start
    if (await self.middleware.call('vm.get_instance', id))['status']['state'] != 'RUNNING':
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/vms.py", line 100, in extend_context
    self._check_setup_connection()
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/connection.py", line 71, in _check_setup_connection
    self._check_connection_alive()
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/connection.py", line 66, in _check_connection_alive
    raise CallError('Failed to connect to libvirt')
middlewared.service_exception.CallError: [EFAULT] Failed to connect to libvirt
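The traceback above is a chain of three exceptions, and the GUI only shows the last one. A small sketch for pulling out the final line of each link, so the first (underlying) error is easy to spot. The sample text is an abbreviated copy of the traceback above, and `exception_chain` is a hypothetical helper name, not part of middlewared.

```python
SEPARATOR = "During handling of the above exception, another exception occurred:"


def exception_chain(traceback_text):
    """Return the last line of each chained traceback, oldest first."""
    chain = []
    for part in traceback_text.split(SEPARATOR):
        lines = [ln.strip() for ln in part.splitlines() if ln.strip()]
        if lines:
            chain.append(lines[-1])
    return chain


# Abbreviated copy of the traceback above (first and last links only).
SAMPLE = """\
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/libvirt.py", line 1353, in create
    raise libvirtError('virDomainCreate() failed')
libvirt.libvirtError: Cannot recv data: Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/plugins/vm/connection.py", line 66, in _check_connection_alive
    raise CallError('Failed to connect to libvirt')
middlewared.service_exception.CallError: [EFAULT] Failed to connect to libvirt
"""

chain = exception_chain(SAMPLE)
print("first error:", chain[0])
print("last error: ", chain[-1])
```

Here the first link is the libvirt "Connection reset by peer" error; the "Failed to connect to libvirt" message follows from it.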

I have not been able to start any more VMs since; every suggestion is very welcome!

NoxiousPluK · Jun 18, 2023

I've noticed a few apparmor lines in my dmesg:

Code:

[ 5203.472301] audit: type=1400 audit(1687087938.840:46): apparmor="DENIED" operation="capable" profile="libvirtd" pid=6717 comm="rpc-worker" capability=39  capname="bpf"
[ 5203.489300] audit: type=1400 audit(1687087938.856:47): apparmor="DENIED" operation="capable" profile="libvirtd" pid=6717 comm="rpc-worker" capability=38  capname="perfmon"
[ 6570.949274] audit: type=1400 audit(1687089306.316:48): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-73fa8e15-8fbf-48a1-bca7-2a27f47d63cb" pid=462370 comm="apparmor_parser"
[ 6571.016978] audit: type=1400 audit(1687089306.384:49): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-4248b4ac-9420-45d9-8774-53c1b510881c" pid=462373 comm="apparmor_parser"
[ 6571.085621] audit: type=1400 audit(1687089306.452:50): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-73fa8e15-8fbf-48a1-bca7-2a27f47d63cb" pid=462376 comm="apparmor_parser"
[ 6571.154236] audit: type=1400 audit(1687089306.520:51): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-4248b4ac-9420-45d9-8774-53c1b510881c" pid=462384 comm="apparmor_parser"
[ 6571.224383] audit: type=1400 audit(1687089306.592:52): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-73fa8e15-8fbf-48a1-bca7-2a27f47d63cb" pid=462388 comm="apparmor_parser"
[ 6571.294310] audit: type=1400 audit(1687089306.660:53): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-4248b4ac-9420-45d9-8774-53c1b510881c" pid=462392 comm="apparmor_parser"
[ 6571.365844] audit: type=1400 audit(1687089306.732:54): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-73fa8e15-8fbf-48a1-bca7-2a27f47d63cb" pid=462396 comm="apparmor_parser"
[ 6571.436875] audit: type=1400 audit(1687089306.804:55): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-4248b4ac-9420-45d9-8774-53c1b510881c" pid=462468 comm="apparmor_parser"

Not sure if these could be relevant to the problem at hand. After trying and retrying I've managed to start one more VM, and once it's started it runs entirely fine (and the hardware upgrade is quite noticeable!).
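To separate actual denials from routine profile loads in output like the above, a rough sketch follows. It only reads the quoted `key="value"` fields of each audit line and makes no claim about whether the denials cause the slow VM starts. The three sample lines are copied from the dmesg excerpt; `parse_audit` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches quoted fields such as apparmor="DENIED" or capname="bpf".
FIELD = re.compile(r'(\w+)="([^"]*)"')


def parse_audit(line):
    """Return the quoted key/value fields of one audit line as a dict."""
    return dict(FIELD.findall(line))


# Three lines copied from the dmesg excerpt above.
LINES = [
    '[ 5203.472301] audit: type=1400 audit(1687087938.840:46): apparmor="DENIED" operation="capable" profile="libvirtd" pid=6717 comm="rpc-worker" capability=39  capname="bpf"',
    '[ 5203.489300] audit: type=1400 audit(1687087938.856:47): apparmor="DENIED" operation="capable" profile="libvirtd" pid=6717 comm="rpc-worker" capability=38  capname="perfmon"',
    '[ 6570.949274] audit: type=1400 audit(1687089306.316:48): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-73fa8e15-8fbf-48a1-bca7-2a27f47d63cb" pid=462370 comm="apparmor_parser"',
]

records = [parse_audit(line) for line in LINES]
verdicts = Counter(r["apparmor"] for r in records)
denied_caps = [r["capname"] for r in records if r["apparmor"] == "DENIED"]

print(dict(verdicts))
print(denied_caps)
```

In the excerpt above, only the first two lines are denials (capabilities `bpf` and `perfmon` for the `libvirtd` profile); the rest are `STATUS` entries for profile loads and replacements.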

NoxiousPluK · Jun 18, 2023

It turns out that the VNC service is also not working.

Visiting http://192.168.0.20/vm/display/52/vnc.html?path=vm/display/52/&autoconnect=1 gives me

503 Service Unavailable

No server is available to handle this request.
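The display URL contains the number 52, and one of the haproxy broadcasts earlier in this thread mentions `be_52`. A short sketch for checking that match, assuming (not confirmed here) that the number in `be_N` is the same ID used in the `/vm/display/N/` path:

```python
import re
from urllib.parse import urlparse


def display_id(url):
    """Extract the numeric ID from a /vm/display/<id>/ URL path, or None."""
    match = re.search(r"/vm/display/(\d+)/", urlparse(url).path)
    return int(match.group(1)) if match else None


def backend_ids(log_text):
    """Collect the N of every 'backend be_N has no server available' line."""
    return {int(n) for n in re.findall(r"backend be_(\d+) has no server available", log_text)}


URL = "http://192.168.0.20/vm/display/52/vnc.html?path=vm/display/52/&autoconnect=1"

# Two of the haproxy lines from the first post.
LOG = """\
haproxy[149277]: backend be_13 has no server available!
haproxy[149277]: backend be_52 has no server available!
"""

vid = display_id(URL)
print(vid, vid in backend_ids(LOG))
```

If the assumption holds, a 503 from this URL would simply mean that haproxy has no running display server behind backend `be_52`.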

joeschmuck · Jun 18, 2023

NoxiousPluK said:
Today I did the following steps (in order):
- Upgrade TrueNAS Scale from 22.02.4 to 22.12.3 (everything seemed OK after the upgrade, my VMs started fine)
- Upgrade my motherboard (AsRock B450m Pro4) BIOS from 1.50 to 1.80
- Upgrade my motherboard BIOS from 1.80 to 8.02
- Upgrade my CPU from Ryzen 7 1700X to Ryzen 7 5800X
- Upgrade my RAM from 4x16GB DDR4-2166 to 2x32GB DDR4-3200

In the future, upgrade one item at a time and fully test the system out. As of right now you can only say that TrueNAS Scale worked after its upgrade. If you are unsure how the system operated after the TrueNAS upgrade, can you roll back your system to the earlier version and then test it out? Unfortunately you likely cannot roll the BIOS back; I don't know whether that is possible for your motherboard. I suspect you needed the BIOS updates to support the CPU change.

Did you test the system out after each upgrade?

I would inspect the BIOS. A BIOS update can really change how the motherboard operates. Try a factory reset, then check that all the BIOS settings look appropriate and give that a try. Maybe you have already done that.

NoxiousPluK · Jun 18, 2023

joeschmuck said:
In the future, upgrade one item at a time and fully test the system out. As of right now you can only say that TrueNAS Scale worked after its upgrade. If you are unsure how the system operated after the TrueNAS upgrade, can you roll back your system to the earlier version and then test it out? Unfortunately you likely cannot roll the BIOS back; I don't know whether that is possible for your motherboard. I suspect you needed the BIOS updates to support the CPU change.

Did you test the system out after each upgrade?

I would inspect the BIOS. A BIOS update can really change how the motherboard operates. Try a factory reset, then check that all the BIOS settings look appropriate and give that a try. Maybe you have already done that.

The trouble is that I needed the BIOS upgrade to be able to use the new CPU, and I didn't want to run the old CPU with the upgraded BIOS because that is against the manufacturer's recommendations, so I did not have much of a choice.

Performance does not seem to be an issue and everything is blazingly fast when it works. It just seems stuck on some odd time-out when starting VMs.

I did reset the BIOS to full defaults, but sadly that did not make a difference (except that it reset the RAM speed, so I re-enabled the 3200MHz profile).

I can try running the old CPU + RAM, but it is not recommended for this BIOS version and may cause issues according to AsRock. The CPU is known to be good (came out of my desktop after upgrading to AM5 last week) but the RAM is brand new (I did run a memory test to validate its stability and waited for two passes, so it should be alright).

Thanks for the response!

NoxiousPluK · Jun 21, 2023

If anyone has any idea, please share it; if there's anything I can test, debug, or check, let me know. Preferably without reboots, since it takes hours to bring the services back up, but every suggestion is welcome.

joeschmuck · Jun 22, 2023

Sorry you haven't had any replies in this thread.

My first thing to say is, please verify the BIOS version number you flashed into your motherboard. You stated above that it was version 8.02. But the ASRock website only has version 5.70 for this model, both R1 and R2 versions. Did you grab the wrong BIOS and flash it into your motherboard? Did you download the BIOS from here? https://www.asrock.com/MB/AMD/B450M Pro4 R2.0/index.asp#BIOS

Assuming you are good with the BIOS, continue reading.

joeschmuck said:
can you roll back your system to the earlier version and then test it out?

1. Have you tried this, roll back to 22.02.4 using the GUI?

2. You could install your old RAM and see if that fixes it. This is actually a good option. It will rule out the new RAM with certainty.

3. Are you running on bare metal? If you are running TrueNAS in a VM, stop and run it on bare metal to see how it's working.

I really hope you are using a valid BIOS, if not, I'd flash to the official BIOS and try again.

NoxiousPluK · Jun 22, 2023

joeschmuck said:
Sorry you haven't had any replies in this thread.

My first thing to say is, please verify the BIOS version number you flashed into your motherboard. You stated above that it was version 8.02. But the ASRock website only has version 5.70 for this model, both R1 and R2 versions. Did you grab the wrong BIOS and flash it into your motherboard? Did you download the BIOS from here? https://www.asrock.com/MB/AMD/B450M Pro4 R2.0/index.asp#BIOS

Assuming you are good with the BIOS, continue reading.

1. Have you tried this, roll back to 22.02.4 using the GUI?

2. You could install your old RAM and see if that fixes it. This is actually a good option. It will rule out the new RAM with certainty.

3. Are you running on bare metal? If you are running TrueNAS in a VM, stop and run it on bare metal to see how it's working.

I really hope you are using a valid BIOS, if not, I'd flash to the official BIOS and try again.

Apologies, I double-checked and I have the B450 Pro4 (without the M); https://www.asrock.com/mb/amd/b450 pro4/index.asp#BIOS
I updated my post accordingly.

1. I wasn't aware this was possible, but I see the option now under System > Boot, I will try this soon (tm) when I can spare a few hours of downtime (since it takes that long to get the VMs back up) - Thanks!
2. Will also check that; the old RAM does have the wrong speed for this CPU, but it should still work. I don't experience performance issues once things get going (it feels more like something being stuck on a time-out, rather than a performance issue, because everything else is fast and VM performance is great), but it won't hurt to try
3. Indeed running bare metal

Thanks! Some things for me to try out :)

joeschmuck · Jun 22, 2023

If rolling back to the previous TrueNAS boot environment works, then I'd back up your configuration files, perform a clean installation of the current TrueNAS version, and then restore the configuration files. You can remove the 120GB SSD and use a USB flash drive if you like, so the SSD stays intact. And remember, once you fix the problem, your VMs should start up normally.

thirdgen89gta · Sep 17, 2023

I know this is a slightly older post, but I recently started getting this error message. I didn't upgrade TrueNAS Scale, though; I had some ZVOLs that got deleted from the CLI with zfs destroy, and the message suddenly popped up afterwards.

The symptoms are a little similar: IO inside guests seems slow or laggy. With a disk benchmark I can still achieve fairly high numbers from the SSD pools, in the 4-5GB/s range, but nowhere near the usual 15-17GB/s range. However, it's things like trying to open the Start Menu in a guest and typing into the Start Search that have laggy results.

I've tried to find out where this is coming from, but I'm getting a ton of false hits on Google for HAProxy that don't really relate to TrueNAS on the host side.

Code:

Broadcast message from systemd-journald@asgard (Sun 2023-09-17 21:15:02 CDT):

haproxy[5507]: backend be_37 has no server available!


Broadcast message from systemd-journald@asgard (Sun 2023-09-17 21:15:08 CDT):

haproxy[1997947]: backend be_231 has no server available!


Broadcast message from systemd-journald@asgard (Sun 2023-09-17 21:15:08 CDT):

haproxy[1997947]: backend be_335 has no server available!

# uname -a
Linux asgard 5.15.107+truenas #1 SMP Tue Jul 25 00:05:02 UTC 2023 x86_64 GNU/Linux

Server is an AMD Epyc 24-core Zen 3 with 256GB RAM, and the VMs are hosted on a RAIDZ1 pool of 3x 4TB PCIe Gen4 SSDs.

Code:

OS Version:TrueNAS-SCALE-22.12.3.3
Product:Super Server
Model:AMD EPYC 7443P 24-Core Processor
Memory:252 GiB

chuck32 · Oct 29, 2023

I have a similar issue. I only saw the messages when I left the shell open, and they only seem to appear when I'm fiddling with VMs (I think when starting or exiting them).

They do not appear in /var/log/messages.

Edit: I also have the audit / apparmor messages in my log.

thirdgen89gta said:
However, it's things like trying to open the Start Menu in a guest and typing into the Start Search that have laggy results.

I didn't measure IO.

VMs are a bit sluggish for me, but I only started trying VMs with a GUI yesterday and I have no benchmark for how fluid they should be. If I recall correctly, they didn't feel any more sluggish than when I first set up the server and live-booted Lubuntu directly. If pressed, I'd say the performance is as expected.

Also, starting and stopping VMs works as usual. If I hadn't left the shell open I probably wouldn't have noticed; this time I'm running badblocks, so I see the messages whenever I tmux attach to the current run.

If I'm content with VM performance (for the lack of comparison though), are these messages of any concern?

chuck32 · Oct 30, 2023

Update

I'm rather confident that I identified the context:

I have the following setup:

VM     Started   ID shown when clicking DISPLAY
VM1    Yes       99
VM2    No        112
VM3    Yes       106
VM4    No        91
VM5    Yes       28
VM6    No        didn't check
VM7    No        didn't check

Starting VM2 results in:

Code:

haproxy[1181898]: backend be_40 has no server available!
haproxy[1181898]: backend be_91 has no server available!
haproxy[1181898]: backend be_126 has no server available!

Then starting VM4 results in:

Code:

haproxy[1181898]: backend be_112 has no server available!
haproxy[1200117]: backend be_40 has no server available!
haproxy[1200117]: backend be_126 has no server available!

Then starting VM6 results in:

Code:

haproxy[1213661]: backend be_112 has no server available!
haproxy[1213661]: backend be_126 has no server available!

Not sure why the message persisted for 112 (I'm sure I killed the VMs after completing this survey and I got the correct numbers).

However I would deduce that the message appears during VM start up and references all stopped VMs. Not sure if this was apparent for everyone but me.
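That deduction can be checked against the three message sets above with simple set arithmetic. The sketch below assumes the number in `be_N` is the VM's display ID, which is exactly the hypothesis being tested, and uses only the lines posted above.

```python
import re


def backends(log_text):
    """Collect the N of every 'backend be_N has no server available' line."""
    return {int(n) for n in re.findall(r"backend be_(\d+) has no server available", log_text)}


# Messages seen when starting VM2, VM4 and VM6, copied from above.
ON_VM2_START = """\
haproxy[1181898]: backend be_40 has no server available!
haproxy[1181898]: backend be_91 has no server available!
haproxy[1181898]: backend be_126 has no server available!
"""
ON_VM4_START = """\
haproxy[1181898]: backend be_112 has no server available!
haproxy[1200117]: backend be_40 has no server available!
haproxy[1200117]: backend be_126 has no server available!
"""
ON_VM6_START = """\
haproxy[1213661]: backend be_112 has no server available!
haproxy[1213661]: backend be_126 has no server available!
"""

s2, s4, s6 = backends(ON_VM2_START), backends(ON_VM4_START), backends(ON_VM6_START)

# Backends that stop being reported once the next VM has been started.
print("gone after starting VM4:", sorted(s2 - s4))
print("gone after starting VM6:", sorted(s4 - s6))
# Backends that newly appear, i.e. the unexplained case.
print("new after starting VM4:", sorted(s4 - s2))
```

Backend 91 drops out when VM4 (display ID 91) is started, which fits the deduction, and 40 drops out when VM6 is started, which would make 40 a plausible ID for VM6 (it was not checked in the table). The reappearance of 112 is the one entry this does not explain.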

thirdgen89gta · Oct 31, 2023

I’m considering blowing away my TrueNAS config completely and starting fresh.

I did reimage the OS and restore my config, but the issue followed.

So I may save my current config, blow it away, then set up a VM and see if the shell messages disappear and performance improves. If it does, I’ll redo my ACLs and shares, and stand up the VMs from scratch by reusing the ZVOLs. As long as I record their current MAC addresses, they’ll even have the same IPs.

If the issue follows, then I’ll restore the config. I did upgrade to Bluefin, but had to roll back due to a PCIe passthrough issue preventing me from attaching my GPU to the Plex VM.

chuck32 · Oct 31, 2023

Please keep us posted on the progress. I'm planning on rearranging one of my pools anyway, so starting completely from scratch wouldn't be that much of a struggle. Assuming I don't have to redo my VM pool ;)
