SOLVED Formatting Virtio drive in Win 11 causes assertion error in VM

kaulex

Dabbler
Joined
Jan 25, 2022
Messages
16
EDITED after a resolution was found on Jan 31st:

To cut a long story short - please read the last post first:

Windows 11 runs fine on bhyve as long as you install onto an AHCI volume with 512-byte block size first. Also make sure you only use virtio drivers 0.1.185 (0.1.208, 0.1.215-1 and 0.1.215-2 did not work reliably).

After installation, however, in my case a
Code:
service middlewared restart

was required to get the system to the desired behavior.

The root cause of the system becoming unresponsive is still unclear; it might have been a resource (network or disk) that was not fully released during the reboot/restart of the Windows 11 VM.
_____


Currently I am testing the capabilities of TrueNAS 12.0-U7 to run Windows VMs, compared to running VMs in Proxmox on ZFS. For various use cases, potential users of the installation want to run Windows 7 (legacy software), Windows 10 (current version) or Windows 11 (future version).

Status is as follows:
(1) Installing Windows 10 works on a virtio VM when the virtio-win-0.1.185.iso is attached as a CDROM during the Windows 10 installation. In this case, even selecting a virtio disk during VM set-up works.

(2) The situation is different for installation attempts of Windows 11, however. Here an AHCI VM set-up with a block size of 512 is required (running on a 4k pool). The virtio versions that ship drivers native to Windows 11 are 0.1.215-2, 0.1.215-1 and 0.1.208. As of now, I have only tested 0.1.215-2 more thoroughly.

The good news: after many failed attempts, the installation can be completed by following a strict approach.

What works?

Following the suggestions provided in this thread:

I followed the instructions given by Patrick over multiple posts, which I summarize here as two stages - pre-installation and post-installation:

Setup before successful installation, during/after VM set-up:
(0) Try AHCI, but make sure to set the virtual disk blocksize to 512. Patrick: "This alone has fixed all crashes with Windows 10 for me."
(0.1) In case you already have a disk attached, change the blocksize of the emulated disk device from "default" to "512". VM > Devices > Disk > Edit.
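If you want to double-check the configured sector size from the shell instead of the UI, something along these lines should work; note that I am assuming the middleware exposes vm.query / vm.device.query as in current TrueNAS CORE releases, and the exact attribute names may differ between versions:
Code:
# Dump all VM definitions including their devices; the DISK entries carry
# the logical/physical sector size attributes set in the UI.
midclt call vm.query | python3 -m json.tool

# Or list only the devices.
midclt call vm.device.query | python3 -m json.tool
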

After successful installation using AHCI:
(1) Patrick: "You can switch to virtio after installation.

  • install virtio drivers - easy once you got a VM up and running
  • create an additional disk of type virtio, couple of Megabytes is enough
  • partition and format the new disk and make sure it's working in Windows
  • then shutdown and switch your main disk from AHCI to virtio
  • boot, VM should come up fine
  • shutdown, remove small new disk, boot
  • enjoy"
I am stuck at the stage "partition and format the new disk".
After attaching the small virtio volume and installing the most recent virtio drivers (0.1.215-2), formatting the small disk in Windows leads to
- disconnect of RDP,
- disconnect of VNC,
- the VM being stuck in limbo (Windows events are still being written),
- a state where only a power-off through the UI makes the VM restartable (Stop/Restart does not work) - see the shell sketch below for the equivalent of that power-off.
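For reference, the UI "Power Off" boils down to destroying the libvirt domain. A minimal sketch of doing the same from the TrueNAS shell, assuming the standard libvirt bhyve URI; the domain name (here 1_Win11) is a placeholder, take it from the list output:
Code:
# Show the domains known to libvirt's bhyve driver and their state.
virsh -c "bhyve:///system" list --all

# Hard power-off of the stuck guest (same effect as the UI "Power Off").
virsh -c "bhyve:///system" destroy 1_Win11
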

It seems that formatting the small disk with NTFS triggers an error, which shows up as the following assertion in the VM log:
Code:
Assertion failed: (n >= 2 && n <= BLOCKIF_IOV_MAX + 2), function pci_vtblk_proc, file /truenas-releng/freenas/_BE/os/usr.sbin/bhyve/pci_virtio_block.c, line 278.
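As far as I understand it, the assertion means the guest submitted a virtio-block request with more scatter/gather segments than bhyve's block backend supports (BLOCKIF_IOV_MAX), which would fit the observation that only the newer virtio drivers trigger it. For anyone who wants to check their own VM log, a sketch of where to look, assuming the default libvirt bhyve log location on TrueNAS CORE (the path may differ on other versions):
Code:
# One log file per domain, named after the VM.
ls /var/log/libvirt/bhyve/
grep -n "Assertion failed" /var/log/libvirt/bhyve/*.log
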

As this is an error related to disk handling, here are some details regarding my set-up:
I run a ZFS pool with two disks using ashift=12. The VMs are installed in a dataset.
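To double-check the storage layer from the TrueNAS shell, this is roughly what I use (pool and dataset names are placeholders, adjust to your own layout; the volblocksize check only applies if the VM disk is zvol-backed):
Code:
# Confirm the pool really uses 4K sectors internally (ashift=12).
zdb -C tank | grep ashift

# List the datasets/zvols holding the VM disks.
zfs list -r -t all tank/vms

# For zvol-backed disks: the volblocksize is independent of the 512-byte
# logical sector size that bhyve presents to the guest.
zfs get volblocksize tank/vms/win11-disk0
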

Any suggestions to resolve the error are appreciated, as are questions that help to clarify the root cause.
 
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
having installed the most recent virtio drivers 0.1.215-2
I've seen other threads (search on virtio drivers) saying that 0.1.185 is the latest version that's working... versions after that one seem to be problematic.
 

kaulex

Dabbler
Joined
Jan 25, 2022
Messages
16
Thank you.

As far as I could tell during installation, there is no Windows 11 driver bundled in 0.1.185.

According to a tutorial, which I also followed in Proxmox, the Windows 10 drivers as shipped in 0.1.208 might also work.

I will run a test installing/downgrading to 0.1.185 after the Win 11 installation and report back here.
 

kaulex

Dabbler
Joined
Jan 25, 2022
Messages
16
Just a short update.


Some progress was made:

I installed a fresh Win 11 VM via AHCI + block size 512. Block size 512 is mandatory, otherwise the behavior is erratic, leading to freezes and crashes not only of the guest, but also of the TrueNAS host.

What works with 0.1.185 is attaching and formatting the small virtio disk without a crash. It was also possible to change the main disk from AHCI/512 to virtio (still using a 512 block size, not autosize, see below).

So to sum up: don't use any version newer than 0.1.185. The developer confirmed that some settings of the virtio driver are causing the errors in bhyve.


Now a new phenomenon (with different root cause?) arises, however:

The behaviour of the disks seems to be OK now - I can format the small disk without a crash. I then shut the VM down from the guest side, using the "Shut down" option.

When shutting the VM down from the guest system (Win 11), it seems to shut down fine. After changing the disk parameters and restarting the VM, the system starts, but in some cases I saw the following erratic behaviour:

The host system runs into issues afterwards, leading to UI timeouts. When checking the console (on the attached screen), I see error messages with socket timeouts while reading out the disk temperatures.

Any ideas what could be causing this irritating behaviour, which I can only resolve by a restart via the console?
Could this be related to power-management S-states and Windows?
 

kaulex

Dabbler
Joined
Jan 25, 2022
Messages
16
Here another update.

When shutting down or restarting Win 11 from within the guest, in some cases the VM starts up again as expected. In other cases - root cause not yet fully understood - not only the VM goes down, but also the network on the host.

The TrueNAS host is then not accessible via UI or ssh (no route to host). Existing ssh connections receive no more data.

When looking at the screen attached to the server, I see a Python 3.9 error about disktemp.py's socket timing out as direct console output. When accessing the shell and running top, I see that
- bhyve is gone,
- libvirt is at ~100% and
- another Python 3.9 process is in state "uwait" at about 100% load.

When exiting the shell and rebooting, the PID of the Python process seems to belong to the ZFS daemon...
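To pin down which daemon such a busy python3.9 process actually belongs to before rebooting, this is what I would run from the local console (the PID 1234 is just an example, take it from top):
Code:
# Show the full command line behind the busy PID.
ps -p 1234 -o pid,ppid,user,command

# FreeBSD: dump the kernel stack to see what the process is blocked on
# while sitting in the "uwait" state.
procstat -kk 1234
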

So it looks as if the VM's I/O handling is blocking zfsd from functioning. Why this causes the network to break down is a puzzle to me - my guess is: no disk, no system...

Any ideas what is causing this?
Any clue how to fix it?

P.S: Update
After waiting ~15 mins, both the host and the guest system came back again.

Could it be that the Windows installation ran some kind of "chkdsk" internally before booting up again?

As there is no event-log information from within Windows between 4:02pm and 4:17pm, I assume the guest machine was keeping itself and the host busy for those 15 minutes...

Still no clue or indication, however, as to why the host was not accessible via the network during that time.

EDIT: for the sake of completeness - I found the corresponding error messages, which were displayed on the console, in the file /var/log/messages:

Code:
Jan 30 16:00:12 truenas kernel: igb0: link state changed to DOWN
Jan 30 16:00:12 truenas kernel: vnet0: promiscuous mode disabled
Jan 30 16:00:12 truenas kernel: tap0: Ethernet address: AA:BB:CC:EE:FF:00
Jan 30 16:00:12 truenas kernel: tap0: changing name to 'vnet0'
Jan 30 16:00:13 truenas kernel: vnet0: promiscuous mode enabled
Jan 30 16:00:13 truenas kernel: bridge0: link state changed to DOWN
Jan 30 16:00:13 truenas kernel: vnet0: link state changed to UP
Jan 30 16:00:13 truenas kernel: bridge0: link state changed to UP
Jan 30 16:00:17 truenas 1 2022-01-30T16:00:17.338686+01:00 truenas.local dhclient 2261 - - New IP Address (igb0): 192.168.1.2
Jan 30 16:00:17 truenas 1 2022-01-30T16:00:17.339094+01:00 truenas.local dhclient 2262 - - New Subnet Mask (igb0): 255.255.255.0
Jan 30 16:00:17 truenas 1 2022-01-30T16:00:17.339444+01:00 truenas.local dhclient 2264 - - New Broadcast Address (igb0): 192.168.1.255
Jan 30 16:00:17 truenas 1 2022-01-30T16:00:17.339781+01:00 truenas.local dhclient 2265 - - New Routers (igb0): 192.168.1.1
Jan 30 16:00:17 truenas kernel: igb0: link state changed to UP
Jan 30 16:03:00 truenas kernel: vnet0: link state changed to DOWN
Jan 30 16:03:00 truenas kernel: igb0: link state changed to DOWN
Jan 30 16:06:22 truenas 1 2022-01-30T16:06:22.682275+01:00 truenas.local collectd 1458 - - Traceback (most recent call last):
  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 62, in read
    with Client() as c:
  File "/usr/local/lib/python3.9/site-packages/middlewared/client/client.py", line 283, in __init__
    self._ws.connect()
  File "/usr/local/lib/python3.9/site-packages/middlewared/client/client.py", line 124, in connect
    rv = super(WSClient, self).connect()
  File "/usr/local/lib/python3.9/site-packages/ws4py/client/__init__.py", line 223, in connect
    bytes = self.sock.recv(128)
socket.timeout: timed out
Jan 30 16:08:04 truenas 1 2022-01-30T16:08:04.793839+01:00 truenas.local ntpd 1354 - - no peer for too long, server running free now
Jan 30 16:11:22 truenas 1 2022-01-30T16:11:22.686015+01:00 truenas.local collectd 1458 - - Traceback (most recent call last):
  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 62, in read
    with Client() as c:
  File "/usr/local/lib/python3.9/site-packages/middlewared/client/client.py", line 283, in __init__
    self._ws.connect()
  File "/usr/local/lib/python3.9/site-packages/middlewared/client/client.py", line 124, in connect
    rv = super(WSClient, self).connect()
  File "/usr/local/lib/python3.9/site-packages/ws4py/client/__init__.py", line 223, in connect
    bytes = self.sock.recv(128)
socket.timeout: timed out
 
Last edited:

kaulex

Dabbler
Joined
Jan 25, 2022
Messages
16
Just to sum up the oddities - I wanted to start a different VM via the UI on the system which had come back from the "unreachable" state as described above. When pushing the "Start" button or the "On/Off" toggle switch, I get the following error message:

Code:
Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 138, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self,
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1224, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1128, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 979, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/vm.py", line 1599, in start
    self.vms[vm['name']].start(vm_data=vm)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/vm.py", line 166, in start
    if self.domain.isActive():
  File "/usr/local/lib/python3.9/site-packages/libvirt.py", line 1709, in isActive
    raise libvirtError('virDomainIsActive() failed')
libvirt.libvirtError: internal error: client socket is closed


So the unclean exit also affects the TrueNAS daemon middlewared. Maybe this info helps to identify the root cause or a solution. I am unsure whether the issue is linked to Win 11 in particular or to bhyve VM behaviour under TrueNAS in general.
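The "client socket is closed" message suggests middlewared lost its connection to libvirtd when bhyve went away. A quick sketch for checking whether libvirt itself is still healthy, assuming the bhyve URI that TrueNAS CORE uses:
Code:
# Is libvirtd still running?
pgrep -lf libvirtd

# Ask libvirt's bhyve driver directly for the domain list; if this hangs or
# errors out, restarting middlewared alone may not be enough.
virsh -c "bhyve:///system" list --all
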
 

kaulex

Dabbler
Joined
Jan 25, 2022
Messages
16
Here are a few more lines of log file for people who might be interested in what can be done to resolve the issue:

These are the last 40 lines of middlewared.log, which basically describe the issue of the non-working button:
Code:
[2022/01/31 19:54:02] (DEBUG) middlewared.status():1699 - Failed to retrieve VM status for Win11
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/vm.py", line 1697, in status
    return self.vms[vm['name']].status()
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/vm.py", line 120, in status
    'state': 'STOPPED' if not domain.isActive() else 'RUNNING',
  File "/usr/local/lib/python3.9/site-packages/libvirt.py", line 1709, in isActive
    raise libvirtError('virDomainIsActive() failed')
libvirt.libvirtError: internal error: client socket is closed
[2022/01/31 19:54:02] (DEBUG) middlewared.status():1699 - Failed to retrieve VM status for WIn11_virtio
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/vm.py", line 1697, in status
    return self.vms[vm['name']].status()
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/vm.py", line 120, in status
    'state': 'STOPPED' if not domain.isActive() else 'RUNNING',
  File "/usr/local/lib/python3.9/site-packages/libvirt.py", line 1709, in isActive
    raise libvirtError('virDomainIsActive() failed')
libvirt.libvirtError: internal error: client socket is closed
[2022/01/31 19:54:02] (WARNING) application.call_method():174 - Exception while calling vm.start(*[5])
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 138, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self,
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1224, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1128, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 979, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/vm.py", line 1599, in start
    self.vms[vm['name']].start(vm_data=vm)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/vm.py", line 166, in start
    if self.domain.isActive():
  File "/usr/local/lib/python3.9/site-packages/libvirt.py", line 1709, in isActive
    raise libvirtError('virDomainIsActive() failed')
libvirt.libvirtError: internal error: client socket is closed
[2022/01/31 19:54:03] (DEBUG) middlewared.logger.CrashReporting.report():112 - Sending a crash report...
[2022/01/31 19:54:03] (DEBUG) urllib3.connectionpool._get_conn():266 - Resetting dropped connection: sentry.ixsystems.com
[2022/01/31 19:54:03] (DEBUG) urllib3.connectionpool._make_request():428 - https://sentry.ixsystems.com:443 "POST /api/2/store/ HTTP/1.1" 200 41
 

kaulex

Dabbler
Joined
Jan 25, 2022
Messages
16
OK, an update from my side.

After searching the forum, I found various root causes for an unresponsive UI.

In most cases using
Code:
service middlewared stop
service middlewared start


was resolving the issue.
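Once middlewared is back, a quick check that it is actually answering again before returning to the UI (midclt ships with TrueNAS; core.ping simply returns "pong" - treat the exact call as an assumption for your version):
Code:
midclt call core.ping
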

After trying this, I now have
- Windows 11 being able to start/stop via UI (stop triggers "Shutdown")
- Restart from within Windows 11 works
- Shutdown from within Windows 11 works

So my verdict:
Win 11 on TrueNAS 12.0-U7 via bhyve works.

I recommend, however, restarting middlewared once as described in this post after setting up the Windows 11 virtual machine and switching over from AHCI to virtio as described above, to avoid erratic behavior.
 
Last edited:

kaulex

Dabbler
Joined
Jan 25, 2022
Messages
16
Just for the sake of completeness, in case someone runs into similar issues, here is the /var/log/messages extract from a clean start & shutdown of the VM, compared to the faulty one added to post #5 a couple of minutes ago:
Code:
Jan 31 20:26:31 truenas kernel: igb0: link state changed to UP
Jan 31 20:27:17 truenas 1 2022-01-31T20:27:17.480268+01:00 truenas.local rtsold 979 - - <rtsock_input_ifannounce> interface tap0 removed
Jan 31 20:27:17 truenas kernel: tap0: Ethernet address: AA:BB:CC:DD:EE:FF:00
Jan 31 20:27:17 truenas kernel: tap0: changing name to 'vnet4'
Jan 31 20:27:17 truenas kernel: igb0: link state changed to DOWN
Jan 31 20:27:17 truenas kernel: bridge0: link state changed to DOWN
Jan 31 20:27:17 truenas kernel: vnet4: promiscuous mode enabled
Jan 31 20:27:17 truenas kernel: vnet4: link state changed to UP
Jan 31 20:27:17 truenas kernel: bridge0: link state changed to UP
Jan 31 20:27:22 truenas 1 2022-01-31T20:27:22.095373+01:00 truenas.local dhclient 74516 - - New IP Address (igb0): 192.168.1.2
Jan 31 20:27:22 truenas 1 2022-01-31T20:27:22.095758+01:00 truenas.local dhclient 74517 - - New Subnet Mask (igb0): 255.255.255.0
Jan 31 20:27:22 truenas 1 2022-01-31T20:27:22.096157+01:00 truenas.local dhclient 74518 - - New Broadcast Address (igb0): 192.168.1.255
Jan 31 20:27:22 truenas 1 2022-01-31T20:27:22.096541+01:00 truenas.local dhclient 74519 - - New Routers (igb0): 192.168.1.1
Jan 31 20:27:22 truenas kernel: igb0: link state changed to UP
Jan 31 20:36:24 truenas 1 2022-01-31T20:36:24.044216+01:00 truenas.local rtsold 979 - - <rtsock_input_ifannounce> interface vnet4 removed
Jan 31 20:36:24 truenas kernel: vnet4: link state changed to DOWN
Jan 31 20:36:24 truenas kernel: igb0: link state changed to DOWN
Jan 31 20:36:24 truenas kernel: vnet4: promiscuous mode disabled
Jan 31 20:36:28 truenas 1 2022-01-31T20:36:28.562746+01:00 truenas.local dhclient 74923 - - New IP Address (igb0): 192.168.1.2
Jan 31 20:36:28 truenas 1 2022-01-31T20:36:28.563143+01:00 truenas.local dhclient 74924 - - New Subnet Mask (igb0): 255.255.255.0
Jan 31 20:36:28 truenas 1 2022-01-31T20:36:28.563516+01:00 truenas.local dhclient 74925 - - New Broadcast Address (igb0): 192.168.1.255
Jan 31 20:36:28 truenas 1 2022-01-31T20:36:28.563856+01:00 truenas.local dhclient 74926 - - New Routers (igb0): 192.168.1.1
Jan 31 20:36:28 truenas kernel: igb0: link state changed to UP


The main difference is: in this clean run, the network interface igb0 comes up again. In the broken case (post #5 in the code section added after editing) it did not.

Root cause: still unknown.
Most likely cause: user error / an undefined state due to virtio driver issues on the guest VM.

SUMMARY:
Win 11 installation works with
(1) Windows 10 drivers taken from version 0.1.185,
(2) if AHCI installation with block size 512 is done first and
(3) switch over to virtio disk handling is done after successful installation and
(4) successful test formatting using a small 2nd virtio disk added to the VM and
(5) a successful reboot in full virtio configuration (both drives) before
(6) detaching the 2nd volume.

Then a reboot, restart and shutdown of the Win 11 installation with a single virtio device works.

Note:
The virtio disk sector size is still 512.
 

kaulex

Dabbler
Joined
Jan 25, 2022
Messages
16
For the sake of completeness:

Loss of network connection can be encountered if the option "Disable Hardware Offloading" is not checked for the network interface used for VM traffic. Here is a section from the manual which describes this for the initial set-up of plugins:

[attached screenshot: manual section recommending "Disable Hardware Offloading" on the interface]



The same seems to be true for VMs (check e.g. this reply).

SUMMARY:
Make sure to check this option when first setting up a VM on top of a physical network adapter. Otherwise the start-up and shut-down of the VM can cause a loss of network connection as described above.
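For reference, the same effect can be tried ad hoc from the shell; this is only a sketch with the standard FreeBSD offload flags on my igb0, and it does not persist across reboots, so the UI checkbox remains the proper fix:
Code:
# Temporarily disable hardware offloading on the NIC used by the bridge.
ifconfig igb0 -txcsum -rxcsum -tso4 -tso6 -lro -vlanhwtso

# Confirm the offload flags are gone from the options line.
ifconfig igb0
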
 
