Nested ESXi Hypervisor in TrueNAS SCALE

TempleHasFallen (Dabbler · Joined Jan 27, 2022 · Messages: 34)
This post documents a lot of the troubleshooting that was done, and shares what I have learned, while trying to get an ESXi hypervisor to run nested as a VM inside TrueNAS SCALE.

Short backstory:
I am in the process of upgrading and migrating from TrueNAS Core (storage) plus two hypervisors (compute, connected via 10GbE to the Core box) to a single machine to lower power consumption. The TrueNAS box is an X10DRi with dual E5-2680 v3s and 256GB of DDR4 ECC, while the hypervisors are Z8NA-D6s with dual X5675s and 64GB of DDR3 ECC each. Intel X540-T2s provide the connectivity between the boxes.
The goal is to keep the VMs in an ESXi-compatible format and mount the NFS datastore directly on the ESXi VM, so that if more compute is ever required, additional hypervisors can simply be spun up and VMs migrated over by mounting the same NFS share.
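For reference, once the datastore is exported over NFS, mounting it on any additional ESXi host is a one-liner. A minimal sketch (the host address, export path, and datastore name below are placeholders, not my actual setup):
Code:
# Mount the TrueNAS NFS export as an ESXi datastore (NFSv3 shown;
# for NFSv4.1 the equivalent namespace is "esxcli storage nfs41")
esxcli storage nfs add --host 10.0.0.10 --share /mnt/tank/vmstore --volume-name nfs-vmstore

# Confirm the datastore is mounted
esxcli storage nfs list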

For testing purposes, one of the hypervisors was used with SCALE 22.02.01 as well as Bluefin (but the latter brought up a *lot* of bugs).

Main issues faced:
  • The ESXi 6.5 installer PSODs regardless of hardware configuration (tried various CPU modes, NIC and disk configs)
  • The ESXi 6.7 installer PSODs regardless of hardware configuration (tried various CPU modes, NIC and disk configs)
  • ESXi 7.0.3(a-f) does not support the e1000 NIC used by KVM. It appears to be virtualized as an Intel 82540EM, which is deprecated in 7.0
    • Preloading the ISO with community drivers did not resolve the issue.
    • e1000 support was supposedly added in 7.0.3f, but it does not work with SCALE KVM's e1000.
    • Booting with the option "preferVMklinux=true" on either the custom or the standard ISO made no difference
    • VirtIO NICs are not supported at all by ESXi, and I could not find any way to make them work
  • Passthrough issues with NICs: the dedicated card did not want to pass through to the VM at all (IOMMU enabled, SR-IOV enabled)
    • The VM fails to start when trying to attach one of the two onboard NICs (separate entries in lspci)
      • Code:
        group 24 is not viable Please ensure all devices within the iommu_group are bound to their vfio bus driver
        • A ticket already seems to be open (though with little activity) that should eventually allow isolating/blacklisting PCI devices other than GPUs, which appears to be the cause of this issue; see the sketch after this list for checking what else shares the group.
    • The VM fails to start when trying to attach the dedicated X540-T2 (either both ports together or either of them separately)
      • Code:
        failed to setup container for group : Failed to set iommu for container: Operation not permitted
    • For passing through either the dedicated or the onboard NICs, I also tried setting custom kernel options and rebooting (downstream and multifunction together and separately), since the errors suggested it may be an IOMMU grouping issue:
      • Code:
        midclt call system.advanced.update '{"kernel_extra_options": "pcie_acs_override=downstream,multifunction"}'
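For reference, the "group is not viable" error usually means another device in the same IOMMU group is still bound to its host driver. A minimal shell sketch for checking that (group 24 is taken from the error above; the sysfs paths are standard):
Code:
# List every PCI device in IOMMU group 24 and the driver it is bound to.
# All of them must be bound to vfio-pci for the group to be viable.
for dev in /sys/kernel/iommu_groups/24/devices/*; do
    addr=$(basename "$dev")
    drv="none"
    [ -e "$dev/driver" ] && drv=$(basename "$(readlink "$dev/driver")")
    printf '%s  driver=%s  ' "$addr" "$drv"
    lspci -nns "$addr"
done
After applying the pcie_acs_override option and rebooting, `cat /proc/cmdline` should show the flag if it actually made it onto the kernel command line.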
The main issue seems to be that SCALE's e1000 is explicitly supported only on ESXi 6.7, which I can't even get to boot; on ESXi 7.0.3 I can't get the installer to see a network card or pass a physical one through.

After changing hardware to an old consumer system (i7-3970X, DDR3), I was able to pass through the X540-T2 without a problem. This suggests the failure is hardware-related and may not occur on the target system; however, it also means that I would effectively have to physically route all the datastore NFS traffic from one X540-T2 attached to the VM to another attached to TrueNAS.
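For anyone comparing platforms, a quick sanity check of how a board groups its devices (a generic sketch, not specific to either box) can be run before trying passthrough:
Code:
# Confirm the IOMMU is enabled at all
dmesg | grep -i -e DMAR -e IOMMU | head

# Show how many devices land in each IOMMU group; large mixed groups
# are what trigger the "group is not viable" error during passthrough
for g in /sys/kernel/iommu_groups/*; do
    echo "group $(basename "$g"): $(ls "$g/devices" | wc -l) device(s)"
done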

Alternatively, is there any way to add or modify the KVM options provided for adapters to allow vmxnet3 adapters, or to edit the VM configuration directly? There seems to be a ticket already open about this too, but it has basically zero activity as well. I understand that the vmxnet3 adapter libraries are not present at all in SCALE 22.02 (so it's not only a matter of having the option available via the GUI/CLI). Modifying the adapter model type to vmxnet3 via virsh made no difference either (not to mention it effectively breaks the GUI's communication with libvirt); see the sketch below.
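For anyone who wants to double-check the vmxnet3 situation on their own system, a rough sketch (the VM name below is an example, and whether qemu-system-x86_64 is on the PATH may vary by SCALE build):
Code:
# Check whether the bundled QEMU ships a vmxnet3 device model at all;
# per the above, it is not present in SCALE 22.02
qemu-system-x86_64 -device help | grep -i vmxnet

# Manually switching the NIC model in the domain XML, e.g.
#   <model type='e1000'/>  ->  <model type='vmxnet3'/>
# made no difference here, and hand-edited XML breaks the GUI/libvirt sync
virsh edit esxi-test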

If anyone has any idea how to make the e1000 driver work on ESXi 7.0.3, knows how to get VirtIO to work, or has any other suggestions, please chime in.
 

diogen (Explorer · Joined Jul 21, 2022 · Messages: 72)
I am in the process of upgrading and migrating from TrueNAS Core (storage) plus two hypervisors (compute, connected via 10GbE to the Core box) to a single machine to lower power consumption. The TrueNAS box is an X10DRi with dual E5-2680 v3s and 256GB of DDR4 ECC, while the hypervisors are Z8NA-D6s with dual X5675s and 64GB of DDR3 ECC each. Intel X540-T2s provide the connectivity between the boxes.
Just to clarify: you are trying to use a TrueNAS box as a hypervisor to run a VMware hypervisor (ESXi) as a VM on it...? Isn't this backwards?
 

TempleHasFallen (Dabbler · Joined Jan 27, 2022 · Messages: 34)
Just to clarify: you are trying to use a TrueNAS box as a hypervisor to run a VMware hypervisor (ESXi) as a VM on it...? Isn't this backwards?
That's correct.
From a practical standpoint, since the NVMe and SSD datastores are on the TrueNAS box (not to mention tens of shares via SMB, iSCSI, etc., as well as tens of services), it would be a big disadvantage not to have TN running on bare metal (hotplug abilities). Additionally, as I mentioned, TN will keep its current role of supporting multiple external hypervisors on demand, as it has until now, so I don't want to virtualize it.
 

christopherl (Cadet · Joined Jun 3, 2017 · Messages: 3)
You can retain hotplug abilities by passing through the entire controller rather than the individual disks.
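For example, something along these lines identifies a controller and the disks behind it before handing it to a VM (the PCI address below is just a placeholder):
Code:
# Find the SATA/NVMe controllers in the system
lspci -nn | grep -i -e 'sata' -e 'non-volatile'

# See which block devices hang off a given controller (address is an example),
# so you don't accidentally pass through the controller holding the boot pool
ls -l /sys/block | grep '0000:00:17.0'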
 