No boot with 2 video cards after upgrade to SCALE 23.10.1

jengle

Dabbler
Joined
Jan 4, 2023
Messages
26
After upgrading to 23.10.1, my VM (Debian) became inaccessible: I could no longer select the video card in the VM's GPU configuration. Also, at boot I got no video from the VisionTEK 4350 for console access. (I almost always connect over the LAN with PuTTY and rarely use the console; it's on a switch.)

I had:
* GT710 SL (1 GB) - that I added to have a 2nd card for the VM
* Old VisionTEK 4350 - basic video card used for a console

I replaced the VisionTEK 4350 with an RTX2060 that was recently retired, and got the following error (console connected to the RTX2060).

vfio: module verification failed: signature and/or required key missing
VFIO - User Level meta-driver version: 0.3

I get the same error message with the RTX2060 and the GT710 SL.

With only the RTX2060 installed, the console works, but attaching the RTX2060 to the VM fails with "At least 1 GPU is required by the host for its functions" and "With your selection, no GPU is available for the host to consume".

For some reason, after upgrading it seems I can only have 1 video card now. Current TrueNAS SCALE version: TrueNAS-SCALE-23.10.1.3

I waited 20 minutes and the console was still not working and TrueNAS did not boot. It only boots with a single GPU.
 

jengle

I tried to get a login for the bug reporting system and have verified my credentials (Google), but I still can't submit a bug. I've had no response on Discord or here. Two video cards worked fine before upgrading to TrueNAS SCALE 23.10.1. What now?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hey @jengle

While the console might stop output, are you still able to get to the webUI when it gives you the vfio error?

If you can, and can generate a debug file (System -> Advanced -> Save Debug) then we can get a Jira ticket filed.
 

jengle

Thanks for the reply @HoneyBadger. It's not booting past the vfio errors: no ping response, no web UI. I can try again on my bench, then remove the 2nd video card and boot it in the office. Is there a log file I can send?

I downloaded a rescue Linux image and put it on a USB stick so I can boot it and use lspci to verify the video cards before doing the single-card TrueNAS boot. However, the 2 older cards worked fine on the prior version of TrueNAS SCALE. I won't be able to do that until tomorrow, though.
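
For anyone following along, the lspci check from a rescue stick could look roughly like the sketch below. It assumes lspci (from pciutils) is present on the rescue image; the helper name filter_gpus is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper: keep only the lspci lines that describe graphics devices.
# Reads `lspci -nn` output on stdin, so it also works on saved output.
filter_gpus() {
    grep -Ei 'VGA compatible controller|3D controller|Display controller'
}

# Typical use on the rescue system (not run automatically here):
#   lspci -nn | filter_gpus
#   lspci -nnk | grep -A3 -Ei 'VGA compatible controller'   # also shows "Kernel driver in use"
```

If both cards show up here but the TrueNAS boot still hangs, that points away from a hardware detection problem and toward the host configuration.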
 

jengle

Hi @HoneyBadger - sorry for the delay. The monitor on my workbench died and I had to order a new one. Back in business again to test.

Original configuration, tested again (this worked prior to the TrueNAS SCALE upgrade):

ASUS GeForce 710 with VisionTEK AMD/ATI RV710
After upgrade: console hangs with the VFIO error. Booting the rescue Linux shows both cards in lspci.

Also tested:
ASUS GeForce 710 with RTX2070: VFIO error as well. Booting the rescue Linux shows both cards in lspci.

RTX2070 and VisionTEK: boots to the TrueNAS console, but no web UI even after waiting 1 hour. The TrueNAS SCALE console shows both cards in lspci.

Hardware: ASRock B550M Pro4 motherboard with AMD Ryzen 5 3600, 64 GB.

Any ideas on how to get some good information to you? Right now I only have the RTX2070 installed so I can use the server, which means I cannot start the VM. The system starts with only the ASUS GeForce 710, but then Jellyfin cannot start. With the RTX2070 installed all my apps work, including Jellyfin, but there is no VM.
 

Attachments: IMG_3538.jpg (117.9 KB), rv70andRTX2070.txt (3.9 KB)

HoneyBadger

Hey @jengle - sorry to hear your monitor went out.

I'm thinking an incorrect device is stuck in the VFIO passthrough somehow. My suggestion would be to remove the passthrough GT710 from the VM and then check the output of the following queries

Code:
midclt call system.advanced.config | jq
cat /proc/cmdline


For the former we're mostly looking for an empty isolated_gpu_pci_ids but I'd also like to see what it thinks kernel_extra_options should be vs. what it's actually doing with /proc/cmdline
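
Note that the two lines above are separate commands, each run on its own. As a rough sketch of how the comparison could be narrowed down, the helper below splits the kernel command line into one option per line and keeps only the passthrough-related ones. The helper name passthrough_opts and the match pattern are assumptions made for illustration.

```shell
# Hypothetical helper: print one kernel option per line, keeping only those
# related to IOMMU / VFIO passthrough. Defaults to /proc/cmdline.
passthrough_opts() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -Ei 'iommu|vfio|pci-stub'
}

# The two queries are separate commands, run one after the other:
#   midclt call system.advanced.config | jq '{isolated_gpu_pci_ids, kernel_extra_options}'
#   passthrough_opts
```

The jq filter only trims the config output down to the two fields mentioned above; the full output works just as well.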
 

jengle

Thanks for the reply @HoneyBadger. Check the jpg: with 2 GPU cards installed, my Linux boot stops at the vfio error. No web UI, no console, no network. Running the midclt command you listed above, I get the error "cat/0 is not defined":

jengle@tatooine:~$ midclt call system.advanced.config | jq cat /proc/cmdline
jq: error: cat/0 is not defined at <top-level>, line 1:
cat
jq: 1 compile error
Not authenticated

root@tatooine[~]# midclt call system.advanced.config | jq cat /proc/cmdline
jq: error: cat/0 is not defined at <top-level>, line 1:
cat
jq: 1 compile error
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe

Jeff

Also, there is no GPU defined in the VM. It was removed during the update.

Jeff
 

jengle

My bad! This is with the 2070 only. Since TrueNAS hangs with 2 GPUs and Jellyfin won't start with the 710, the 2070 is the card installed, and Jellyfin is working.

root@tatooine[~]# cat /proc/cmdline
BOOT_IMAGE=/ROOT/23.10.1.3@/boot/vmlinuz-6.1.63-production+truenas root=ZFS=boot-pool/ROOT/23.10.1.3 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
root@tatooine[~]#

Code:
{
  "id": 1,
  "consolemenu": true,
  "serialconsole": false,
  "serialport": "ttyS0",
  "serialspeed": "9600",
  "powerdaemon": false,
  "swapondrive": 2,
  "overprovision": null,
  "traceback": true,
  "advancedmode": false,
  "autotune": false,
  "debugkernel": false,
  "uploadcrash": true,
  "anonstats": true,
  "anonstats_token": "",
  "motd": "Welcome to TrueNAS",
  "boot_scrub": 7,
  "fqdn_syslog": false,
  "sed_user": "USER",
  "sysloglevel": "F_INFO",
  "syslogserver": "",
  "syslog_transport": "UDP",
  "kdump_enabled": false,
  "isolated_gpu_pci_ids": [
    "0000:06:00.0"
  ],
  "kernel_extra_options": "",
  "syslog_tls_certificate": null,
  "syslog_tls_certificate_authority": null,
  "consolemsg": false
}
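
To see which card actually sits at the isolated address, something along these lines might help. It is only a sketch: the helper name isolated_ids is hypothetical, and it assumes the only PCI addresses in the config output are the isolated_gpu_pci_ids entries.

```shell
# Hypothetical helper: pull PCI addresses (domain:bus:slot.func) out of the
# system.advanced.config JSON read from stdin. Assumes the only such
# addresses in that output are the isolated_gpu_pci_ids entries.
isolated_ids() {
    grep -oE '[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]'
}

# Possible use: show which device sits at each isolated address.
#   midclt call system.advanced.config | isolated_ids | while read -r id; do
#       lspci -nnk -s "$id"
#   done
```

If the device reported at that address is not the card meant for the VM, that would fit the idea of a stale entry left over from before the hardware swap.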
 

HoneyBadger

Code:
 "isolated_gpu_pci_ids": [
    "0000:06:00.0"
  ],

Let's see if we can blank that out:

midclt call system.advanced.update '{ "isolated_gpu_pci_ids": [] }'

and then reboot your system, see if it goes away - then plug your GT710 back in.
 

jengle

I seem to be missing something @HoneyBadger:

as root:
root@tatooine[~]# midclt call system.advanced.update '{ "isolated_gpu_pci_ids": [] '}
zsh: parse error near `}'
root@tatooine[~]#


As an admin

jengle@tatooine:~$ sudo midclt call system.advanced.update '{ "isolated_gpu_pci_ids": [] }'
{"id": 1, "consolemenu": true, "serialconsole": false, "serialport": "ttyS0", "serialspeed": "9600", "powerdaemon": false, "swapondrive": 2, "overprovision": null, "traceback": true, "advancedmode": false, "autotune": false, "debugkernel": false, "uploadcrash": true, "anonstats": true, "anonstats_token": "", "motd": "Welcome to TrueNAS", "boot_scrub": 7, "fqdn_syslog": false, "sed_user": "USER", "sysloglevel": "F_INFO", "syslogserver": "", "syslog_transport": "UDP", "kdump_enabled": false, "isolated_gpu_pci_ids": [], "kernel_extra_options": "", "syslog_tls_certificate": null, "syslog_tls_certificate_authority": null, "consolemsg": false}
jengle@tatooine:~$
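
The zsh parse error in the first attempt appears to come from the closing quote sitting before the closing brace ('} rather than }'), which leaves a bare } outside the quoted string. One way to make that kind of slip easier to spot, sketched here, is to put the payload in a variable first. The variable name payload is just illustrative.

```shell
# Keep the whole JSON document inside one pair of single quotes.
payload='{ "isolated_gpu_pci_ids": [] }'

# Print it back to confirm the shell sees a single, complete argument.
printf '%s\n' "$payload"

# Then pass it as one quoted argument (not run here):
#   sudo midclt call system.advanced.update "$payload"
```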
 

HoneyBadger

There we go, sudo access was the missing piece. You can see in the output from the second command that it's blanked out the list.

Reboot, see if it sticks (by doing the midclt call system.advanced.config | jq line again and looking for the empty array) then shut down and install the GT710.
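
The "see if it sticks" check could be scripted roughly as below. This is a sketch under assumptions: the helper name isolation_cleared is hypothetical, and it does a plain text match, so it expects the compact single-line JSON that midclt prints rather than the jq-reformatted version.

```shell
# Hypothetical helper: succeed only if the config JSON on stdin shows an
# empty isolated_gpu_pci_ids list. A plain text match, so it assumes the
# compact single-line form that midclt prints (not the jq-reformatted one).
isolation_cleared() {
    grep -Eq '"isolated_gpu_pci_ids": *\[ *\]'
}

# Possible use after the reboot:
#   midclt call system.advanced.config | isolation_cleared && echo "list is empty"
```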
 

jengle

Thanks @HoneyBadger. I now boot with the 2070 and GT710. When I try to assign a GPU to the VM I get:

Method 'get_pci_ids_for_gpu_isolation' not found in 'device'

for either GPU

I see this is an issue for others with Cobia, but didn't find an official response/guide yet.

Jeff
 

HoneyBadger

Hey @jengle

It looks like this should be fixed in 23.10.2 with the resolution of bug https://ixsystems.atlassian.net/browse/NAS-125803 - we hope to have 23.10.2 ready for next week.

It is possible to run a Docker instance of the webUI with the fix by following instructions in that bug ticket, but that is a little more involved.
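
To check after updating whether a system is on 23.10.2 or later, a version comparison like the following sketch could be used. The helper name version_at_least is hypothetical, it relies on GNU sort's -V option, and it assumes the version string has the form shown earlier in this thread (TrueNAS-SCALE-23.10.1.3).

```shell
# Hypothetical helper: succeed if a version string such as
# "TrueNAS-SCALE-23.10.1.3" is at least the given minimum (e.g. 23.10.2).
# Relies on GNU sort's -V (version sort).
version_at_least() {
    have=${1#TrueNAS-SCALE-}   # drop the product prefix if present
    want=$2
    # The minimum sorts first (or ties) only when have >= want.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Possible use (assumes `midclt call system.version` reports the string in that form):
#   version_at_least "$(midclt call system.version)" 23.10.2 && echo "23.10.2 or newer"
```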
 

jengle

Thank you again @HoneyBadger. I’ll wait for the official fix since it’s coming out soon.
 

jengle

UPDATE from 23.10.2: I can now select the GPU in the VM settings, but when I save, the change is not actually applied. Neither GPU can be assigned to the VM, regardless of whether it is isolated or not (either card can now be selected for isolation, and that setting sticks).
 

HoneyBadger

Hey @jengle

The best thing to do here would be to use the "Report a Bug" link, and indicate that your problem hasn't been resolved by the fix in 23.10.2 - you may even be able to do this directly from within the SCALE UI, through System -> General -> File Ticket. Check off the option to include a Debug file with the ticket.
 

jengle

Thanks again @HoneyBadger. Figured out how to submit a case.
Jeff
 