Sporadic crash of Windows 10 VM - qemu-system-x86_64: vfio: Unable to power on device, stuck in D3, GPU passthrough

zubo100

Dabbler
Joined
Aug 25, 2022
Messages
19
Hi everyone!

I switched from Synology to TrueNAS just a few weeks ago, so I am still learning. So far I have managed to get everything working (apps, ingress, certificates, ...), but there is one issue I cannot find a solution for. I have a Windows 10 virtual machine on my TrueNAS SCALE installation:
1668789351238.png


I have an ASRock Z390M-ITX/ac with an Intel i7-8700K and 2x 32 GB of Crucial memory. The installation itself was not a problem: I have VT-x and VT-d enabled in the BIOS, and PCI passthrough for my Asus ROG GTX 1080 enabled (GPU isolated and set for passthrough).

1668789382268.png

1668789406949.png


The problem is that sporadically (while gaming over Steam Link) the entire PC crashes... The screen simply goes black, and the machine becomes unreachable. I cannot connect over RDP, Parsec, or TeamViewer, and I cannot even ping it. The issue is sporadic: sometimes it happens after 2-3 minutes, sometimes not even after 6 hours. I assume it is caused by the PCI passthrough, for the following reasons:

1. After the crash, I cannot start the VM because I get the following error:
1668789933582.png

where of course the device 0000:01:00.0 is my NVIDIA GPU.

2. If I restart the entire server after this error, the NVIDIA GPU is not recognized at all and is not available for GPU isolation in the settings.
3. The only way I can get it working again is to shut the server down completely (not restart) and then turn it on again.

Is there any advice someone can give me? I read about reset problems with AMD GPUs, but this is an NVIDIA GTX. Also, my problem does not occur when I restart or shut down the VM (I can restart it and turn it on/off several times with no problems); it occurs while I am using it. Again, completely sporadic.
The only thing the crashes have in common is this entry in the VM log:
2022-11-18T16:27:34.525794Z qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
2022-11-18T16:27:34.545705Z qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
2022-11-18T16:27:35.574698Z qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
2022-11-18T16:27:35.575669Z qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
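The "stuck in D3" message suggests the kernel cannot bring the card back out of its low-power state. For anyone who wants to poke at this from the TrueNAS shell, here is a sketch (plain Linux sysfs, nothing TrueNAS-specific; 0000:01:00.0 is the GPU address from the error above, adjust as needed) that checks the reported power state and tries a soft remove/rescan cycle, which sometimes re-initializes a card without a full power cycle:

```shell
#!/bin/sh
# Hypothetical sketch: inspect a passed-through GPU's power state and try a
# soft remove/rescan. Run as root on the host; adjust GPU to your address.
GPU=0000:01:00.0
DEV=/sys/bus/pci/devices/$GPU

if [ -e "$DEV/power_state" ]; then
    # D0 = powered on; D3hot/D3cold match the "stuck in D3" error above.
    echo "power state: $(cat "$DEV/power_state")"
fi

# Soft-remove the device, then ask the kernel to re-enumerate the PCI bus.
if [ -w "$DEV/remove" ]; then
    echo 1 > "$DEV/remove"
    sleep 2
    echo 1 > /sys/bus/pci/rescan
fi
```

If the rescan brings the card back in D0, the VM can usually be started again without rebooting the whole server; if not, a full power-off is still needed.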


What I tried:
1. Setting the CPU from Host Passthrough to Host Model
2. Enabling the "Hide from MSR" setting for the GPU passthrough (I read that it sets the hypervisor vendor ID for KVM, or something like that)
3. Enabling Above 4G Decoding in the BIOS
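One more check that is worth doing before changing further settings: confirm that the GPU and its HDMI audio function sit in their own IOMMU group, with no unrelated devices alongside them. A minimal sketch (plain Linux, runnable from the TrueNAS shell):

```shell
#!/bin/sh
# Sketch: print every IOMMU group and the devices in it. Ideally the GPU
# (0000:01:00.0) and its audio function (0000:01:00.1) share a group with
# nothing else; extra devices in the group can make passthrough unstable.
for dev in /sys/kernel/iommu_groups/*/devices/*; do
    [ -e "$dev" ] || continue   # skip if no IOMMU groups exist on this host
    group=$(basename "$(dirname "$(dirname "$dev")")")
    echo "group $group: $(basename "$dev")"
done
```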

Can anyone please help? Any ideas or recommendations are very much appreciated. I apologize if this is a known issue; I tried to find it here, but was not successful. I read about people solving this issue by supplying a GPU BIOS file, but I have no idea how to do that in TrueNAS SCALE. If you need any more information, please just ask (or tell me how I can get it) and I gladly will.

Thanks for any kind of help.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Libvirt error 127 means the passed-through card didn't reinitialize after the VM crash, so it still believes it's up and tied to the previous VM boot. Typically this is resolved with a BIOS update to the host.
 

zubo100

Dabbler
Joined
Aug 25, 2022
Messages
19
Libvirt error 127 means the passed-through card didn't reinitialize after the VM crash, so it still believes it's up and tied to the previous VM boot. Typically this is resolved with a BIOS update to the host.
Hi! I found some people solving similar issues with a BIOS update as you suggested, although mostly on MSI mainboards. My mainboard is an ASRock, but I wanted to try that anyway; unfortunately, the mainboard BIOS is already on the newest available version. Or do you mean a BIOS update of the GPU? I also found a few pages where people claim to have solved the issue by dumping the GPU BIOS, "cleaning its header", and then providing the cleaned GPU BIOS directly to the host software (this appears to work for Proxmox and Unraid). Is there some way to specify the GPU BIOS in TrueNAS?
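For reference, the dump step itself is plain Linux and should work from the TrueNAS shell; whether SCALE's VM settings can then attach the resulting ROM file is the open question. A sketch, assuming the GPU is at 0000:01:00.0, is not the active console GPU, and that no driver is blocking ROM reads (the output path is just an example):

```shell
#!/bin/sh
# Hypothetical sketch: dump the GPU's video BIOS via the sysfs rom file.
GPU=0000:01:00.0
ROM=/sys/bus/pci/devices/$GPU/rom
OUT=/root/gtx1080-vbios.rom   # example output path

if [ -e "$ROM" ]; then
    echo 1 > "$ROM"        # enable expansion ROM reads
    cat "$ROM" > "$OUT"    # copy the ROM image out
    echo 0 > "$ROM"        # disable reads again
    echo "dumped $(wc -c < "$OUT") bytes to $OUT"
fi
```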
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
I'm not aware of anything in TrueNAS SCALE.
 

zubo100

Dabbler
Joined
Aug 25, 2022
Messages
19
I'm not aware of anything in TrueNAS SCALE.
Is there anything else you can recommend? Any other ideas? What can I do? This is the third GPU I have tried. The first was an NVIDIA GTX 760 (not supported for VMs), the second a Radeon RX 5700 XT (black screen issue), and now a GTX 1080. The only one that works at all is the GTX 1080, but it causes the problem I mentioned, and right now I have no idea how to fix this... Any help is very much appreciated.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Is there anything else you can recommend? Any other ideas? What can I do? This is the third GPU I have tried. The first was an NVIDIA GTX 760 (not supported for VMs), the second a Radeon RX 5700 XT (black screen issue), and now a GTX 1080. The only one that works at all is the GTX 1080, but it causes the problem I mentioned, and right now I have no idea how to fix this... Any help is very much appreciated.
I'm afraid not. SCALE is still in beta. I'd submit a bug report, along with debugs. There may be an upstream KVM patch the TrueNAS devs can pull in.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Is there anything else you can recommend? Any other ideas? What can I do? This is the third GPU I have tried. The first was an NVIDIA GTX 760 (not supported for VMs), the second a Radeon RX 5700 XT (black screen issue), and now a GTX 1080. The only one that works at all is the GTX 1080, but it causes the problem I mentioned, and right now I have no idea how to fix this... Any help is very much appreciated.
Look for instructions on how to update the GPU firmware/BIOS... you probably have to do it from the Linux CLI.
 

zubo100

Dabbler
Joined
Aug 25, 2022
Messages
19
Look for instructions on how to update the GPU firmware/BIOS... you probably have to do it from the Linux CLI.
I took a look into that... Unfortunately, the BIOS version of the GPU is also the newest available. At the moment I am trying to figure out whether it is a problem with the graphics card itself or just with the virtualisation. I will probably put the GPU in a Windows PC and test it for a few hours (or days) to see whether the GPU itself is causing the problem. I will post an update when I am done testing.
 

zubo100

Dabbler
Joined
Aug 25, 2022
Messages
19
Hi again! So, I tested the GPU separately in a Windows 10 PC (with a different mainboard, CPU, and RAM) and the problem occurs there as well. It is sporadic and random, so basically the same behaviour as in the VM on TrueNAS. The conclusion is that the card itself has a problem. It is a second-hand GTX 1080, almost 7 years old, with apparently no maintenance ever, so I will try to clean it and replace the thermal pads and thermal compound; maybe that will help. Anyway, it looks like my problem is not related to TrueNAS itself, so thank you all for the advice, and feel free to close this topic.
 

cru

Cadet
Joined
Feb 21, 2023
Messages
2
I have the same issue, only I'm running 8 GPUs in this box. The server lasts for a couple of hours and then dies, just like you mentioned.
 

cru

Cadet
Joined
Feb 21, 2023
Messages
2
Also, there is a bug where you can't remove GPUs from a VM while it is stopped, to determine whether it's the video card that is causing the issue. Once they are selected there is no going back... you can only create another VM or, I imagine, delete the PCI devices manually.



1677044932309.png
 

zubo100

Dabbler
Joined
Aug 25, 2022
Messages
19
In my case it was a hardware issue on the graphics card. I had it repaired and have had no issues since then. If you have the chance, do what I did: simply test the GPU in a completely different PC.
 