Passthrough NIC high host CPU

Big-T · May 7, 2023

Hi, I have Truenas Scale 22.12.2 running as a host on a Ryzen 5800x (8 cores/16 threads). Running as a Virtual Machine, I have OPNsense 23.1.7_3 (based on FreeBSD). I gave it 4 cores/8 threads, which is surely overkill and I expect it to be mostly idle. I've successfully passed through two Intel I210 1GB NICs as PCI devices to the VM, and the configuration is working properly (I use a separate NIC for communication with the Truenas host).

The issue that I'm seeing is, when a high volume of traffic communicates with the VM, I get a much higher CPU usage on the host than I would expect. Running iperf3 (and again, these are only 1gb nics) results in a small virtual cpu usage bump, but on the host I see 200% usage- two full cores- being used by the qemu-system-x86 process. I might expect this if the networking were being shared with the host, but these NICs are being passed through directly as PCI devices.

Any suggestions?

jgreco · May 7, 2023

Big-T said:
The issue that I'm seeing is, when a high volume of traffic communicates with the VM, I get a much higher CPU usage on the host than I would expect.

So... you've assigned a large number of cores to the VM and then you're shocked that the host is like very busy when there's traffic?

What's happening here is that KVM is doing what you asked. It's assigning four cores. If four cores are not available, NO cores are scheduled for the VM. If four cores ARE available, ALL four cores are assigned. This means that even if you are just doing a simple operation that requires one tenth of one core, all four cores need to be scheduled, and are unavailable to the host platform. This is a common error; do not allocate resources based on what you THINK you need, but rather what is OBSERVED to be needed. Those unnecessary cores just end up hanging around useless in the VM.

sretalla · May 8, 2023

Big-T said:
I gave it 4 cores/8 threads

Did you actually put those numbers in the config?

1 CPU, 4 cores, 8 threads is actually 32 threads.

As recommended above already, you should start with a lower number first.

Perhaps, 1, 2, 2.

Also consider in the guest OS if you have enabled hardware offload for the network. Those NICs you have can do that and should take CPU load off.

jgreco · May 8, 2023

sretalla said:
Those NICs you have can do that and should take CPU load off.

That's not correct. What happens is every time there is an interrupt, all four virtual CPU's allocated to the VM have to be free so that they may assigned to the VM. This may not be easily managed on a machine with only eight CPU's, as it means that if there are more than four active tasks on the host (remembering that ZFS is notoriously piggy, middleware is notoriously piggy, userland daemons such as SMB are notoriously piggy, ...) then your VM scheduler may defer and not assign a timeslice due to resource unavailability.

The problem is that even the simplest trite packet must be handed off to the VM guest. So any network I/O ultimately results in an interrupt and the allocation of four cores of CPU *just* to handle the trite packet. There is no useful CPU offload; you would actually do better without CPU offload because at least the CPU would be doing some meaningful work doing that packet processing work.

If instead you are allocating four cores and you get a timeslice of CPU and there is only the briefest of brief amount of work to do to process the trite packet, you then either exit the VM timeslice or you wait around for the timeslice to end. Historically this has been handled inefficiently, involving some amount of useless CPU spin. On four CPU's. For the remainder of the timeslice.

Big-T said:
on the host I see 200% usage- two full cores- being used

And worse, you need to airquote the word "used" because all that CPU time is being wasted.

Big-T · May 8, 2023

So to clarify, when I say I assigned 4 cores/8 threads I meant 4 cores with 2 threads each, for a total of 8- half that of the host. The host is mostly idle.

Interrupt handling I understand, but time spent processing interrupts should also be included in the VM's cpu utilization report. I expect there is some overhead for the host, but with PCI passthrough it should simply the overhead of mapping the doorbell BAR into the client memory space; the host needn't process the MSI itself.

I have tried reducing the number of cores to 1 CPU, 2 cores, 2 threads each as suggested and the problem remains.

If an iperf3 test at 1gb line speed takes 2 full Zen 3 threads on the host, does a 10gb NIC pci passthrough require 20? This isn't just "the way interrupts work", something is not working properly here.

jgreco · May 8, 2023

Big-T said:
Interrupt handling I understand, but time spent processing interrupts should also be included in the VM's cpu utilization report.

It's not time spent processing interrupts. It's time spent doing absolutely nothing, because the need to do the work on one CPU dragged three more useless cores along with it. They are not processing interrupts. They are doing NOTHING. But they still need to be scheduled, because the VM has 4 cores assigned.

Big-T said:
time spent processing interrupts should also be included in the VM's cpu utilization report.

Again, it is not "time spent processing interrupts". I have no idea how KVM actually gathers accounting statistics but I can easily imagine that this isn't what they expect and that this isn't optimized.

Big-T said:
If an iperf3 test at 1gb line speed takes 2 full Zen 3 threads on the host, does a 10gb NIC pci passthrough require 20?

No, there's nothing there that would cause that, obviously.

Big-T said:
This isn't just "the way interrupts work", something is not working properly here.

I agree, but I believe the problem is that you're not really understanding what's going on. I'm a VMware guy so I've been dealing with this kind of thing for a long time, and I'm not really that interested in figuring out the exact specifics going on in KVM, which I view as a crappy off-brand virtualization platform that is relatively immature. I do expect that if you dropped your utilization to a single vCPU (one core/one thread) that you would probably see your observed utilization drop to 100% utilization (one core/one thread) or something along those lines.

But all this stuff is generally difficult in a virtualization environment where you're trying to handle packet-heavy tasks in a guest VM, especially via PCIe passthru. From a certain perspective, VM's just aren't good at this job, and you end up wasting some CPU.

Big-T · May 8, 2023

Given your experience with VMware, what kind of hit do you expect under these circumstances? Have you run multiple virtualized servers that handle a high amount of bandwidth? Do you see high host CPU usage in those cases? Is it simply the cost of doing business in a VM?

jgreco · May 8, 2023

Big-T said:
Given your experience with VMware, what kind of hit do you expect under these circumstances? Have you run multiple virtualized servers that handle a high amount of bandwidth? Do you see high host CPU usage in those cases? Is it simply the cost of doing business in a VM?

You can absolutely cause VMware to drag unnecessary CPU cores along for the ride; see for example

https://kb.vmware.com/s/article/1005362

There are lots of discussions about how oversubscription and overallocation interact in general; a lot of it is sort of garbage based on someone's personal experience in a particular kind of environment. This is in part because this stuff is relatively difficult.

I kind of like the answer in this SF question:

Oversubscription vs Over allocation of vCPU resources in vSphere

I'm trying to understand over allocation of vCPU compared to over subscription of vCPU. For the purposes of the question lets assume the host that I will be using has a 16 core Xeon(2.1Ghz). Base...

serverfault.com

It's a good beginner's level explainer, but there are others. VMware reports these things a bit differently, but the underlying problems are similar.

Maximizing VMware Performance and CPU Utilization - Longitude

Overcommitting CPU allows you to maximize use of host CPU resources, but make sure to monitor overcommitted hosts for CPU use and VMware CPU wait metrics.

www.heroix.com

Our infrastructure here is highly virtualized, but some of that is managed by using layer 2 switching and multiple interfaces on a VM to reduce the routing loads. We're a full BGP, DFZ ASN so we have PPS, route table, and firewall pressures to consider, and my interpretation after a quarter of a century of this is that you're better off not routing what you don't have to. At the end of the day, yes, it's the cost of virtualization. Check out what your consolidation ratios are, and appreciate whatever you've saved versus your bare metal baseline. Virtualization isn't magic, alas.

Important Announcement for the TrueNAS Community.

Passthrough NIC high host CPU

Big-T

Cadet

jgreco

Resident Grinch

sretalla

Powered by Neutrality

jgreco

Resident Grinch

Big-T

Cadet

jgreco

Resident Grinch

Big-T

Cadet

jgreco

Resident Grinch

Oversubscription vs Over allocation of vCPU resources in vSphere

Maximizing VMware Performance and CPU Utilization - Longitude

Similar threads