Nvidia GPU not appearing for use with SCALE

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I can't speak to that ticket specifically, but reviewing the linked thread, it might be a similar issue if there are still artifacts left over from a previous passthrough setup and/or the middleware isn't in sync with the on-disk modprobe configuration.

Do you have a file at /etc/modprobe.d/vfio.conf, and if so, what are its contents?
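For example, something along these lines from a shell should show whether the file exists and what's in it (the ls is just to see what other modprobe config is present):

Code:
# list the modprobe config directory, then print vfio.conf if it exists
ls -l /etc/modprobe.d/
cat /etc/modprobe.d/vfio.conf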

@Sparx if you're open to a DM I can see if I have availability to troubleshoot this real-time with you.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Please let me and anyone else know the result and hopefully a fix :)

For @Sparx it seems like the middleware and the kernel weren't on the same page as far as the vfio-pci usage.

Big thanks to @Sparx for swinging the mallet as we played a game of Whack-A-Mole to sort this out!

Important Edit - These instructions were written for resetting GPU/VFIO claims in TrueNAS SCALE 22.12 "Bluefin" and may no longer be valid for future versions, as GPU mapping may have changed.

First off, check to see if your device is claimed by the vfio-pci driver by investigating the "Kernel driver in use" part of the lspci -v output:

Code:
0b:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
        DeviceName: pciPassthru0
        Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
        Physical Slot: 192
        Flags: bus master, fast devsel, latency 248, IRQ 19
        Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e4000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Kernel driver in use: vfio-pci ### this line indicates the passthrough driver has claimed your GPU
        Kernel modules: nouveau, nvidia_current_drm, nvidia_current

By contrast, this is what it should look like when the host NVIDIA driver has claimed the card:

Code:
0b:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
        DeviceName: pciPassthru0
        Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
        Physical Slot: 192
        Flags: bus master, fast devsel, latency 248, IRQ 19
        Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e4000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Kernel driver in use: nvidia ### this is the desired state - the host NVIDIA driver owns the GPU
        Kernel modules: nouveau, nvidia_current_drm, nvidia_current
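If you just want the relevant lines, something like the following should work (the 0b:00.0 bus address is from the example above; substitute your own card's address from the first column of lspci output):

Code:
# show only the device, driver, and module lines for the card at 0b:00.0
lspci -k -s 0b:00.0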

If you have the vfio-pci driver in use, but you have no isolated GPU (and have rebooted) then you probably have a stuck file somewhere.

Warning - This procedure doesn't preserve any existing passthroughs; it resets all GPUs to host-owned. No VMs were running while we worked through this. Fixing it requires mucking about with kernel options, manually deleting files, and multiple reboots. Have backups, and obligatory warning:

Ensure you've removed all isolated GPUs from the webUI; reboot the system if any changes were necessary.
Open a root shell (SSH, use sudo -s or prepend commands with sudo to elevate if needed) on your TrueNAS SCALE machine.

Back up the contents of the following files (save them to your pool, copy them as a local text file, wherever you'd like):

Code:
/boot/initramfs_config.json
/etc/initramfs-tools/modules
/etc/modules
/etc/modprobe.d/kvm.conf
/etc/modprobe.d/nvidia.conf
/etc/modprobe.d/vfio.conf


These files will all likely contain the PCI vendor and device IDs of a passthrough GPU (e.g. 10DE:1BB3) as well as references to vfio. Once you've backed up the content, remove these files.
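As a rough sketch of the backup-and-remove step (the destination /mnt/tank/vfio-backup is just an example path; point it at your own pool):

Code:
# back up each passthrough-related file (preserving its path), then remove it;
# files that don't exist on your system (e.g. a missing nvidia.conf) are skipped
mkdir -p /mnt/tank/vfio-backup
for f in /boot/initramfs_config.json \
         /etc/initramfs-tools/modules \
         /etc/modules \
         /etc/modprobe.d/kvm.conf \
         /etc/modprobe.d/nvidia.conf \
         /etc/modprobe.d/vfio.conf; do
    [ -f "$f" ] && cp -v --parents "$f" /mnt/tank/vfio-backup/ && rm -v "$f"
done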

Run update-initramfs -k all -u - a number of grep errors will likely be logged. Reboot once more, and you should have your nvidia-smi functionality back.
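After that reboot, one quick way to confirm the GPU is back on the host driver should be something like:

Code:
# expect "Kernel driver in use: nvidia" on the GPU, loaded nvidia modules, and a working nvidia-smi
lspci -k | grep -EA3 'VGA|3D controller'
lsmod | grep nvidia
nvidia-smi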

Note that this isn't a fix for the "two identical GPUs in a system" issue - that one's a bit more complex - nor will it do anything to enable a pre-Maxwell NVIDIA GPU in Bluefin.
 
Last edited:

Potui Emitti

Cadet
Joined
Mar 1, 2023
Messages
2
I just created this account to let you know that this worked for me as well. I had the same issue where
Code:
lspci | grep VGA
would show my GPU (NVIDIA GTX 1050), but the system would throw errors at boot and when executing
Code:
nvidia-smi
. This happened with both TrueNAS SCALE Bluefin 22.12.0 and 22.12.1.

To add something new: when I logged into my machine as another user and switched to root, the suggested solution threw errors when executing
Code:
update-initramfs -k all -u
just as described. However, when I used a directly attached console and opened a Linux shell (
Code:
7) Open Linux Shell
), no errors occurred.
 

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
HoneyBadger said:
[quoting the VFIO reset instructions above]
Like some other users I have no nvidia.conf. Any thoughts on what to try?

Code:
cat /etc/modprobe.d/nvidia.conf
cat: /etc/modprobe.d/nvidia.conf: No such file or directory


Code:
find /lib/modules/5.15.79+truenas/ -type f -name '*.ko' | grep nvidia
/lib/modules/5.15.79+truenas/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
/lib/modules/5.15.79+truenas/kernel/drivers/usb/typec/altmodes/typec_nvidia.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-peermem.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-uvm.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-drm.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-modeset.ko


lsmod | grep nvidia returns nothing

Code:
ls -l /etc/modprobe.d/
total 24
-rw-r--r-- 1 root root 154 Dec 20  2019 amd64-microcode-blacklist.conf
-rw-r--r-- 1 root root 127 Feb 12  2021 dkms.conf
-rw-r--r-- 1 root root 154 Jul  4  2022 intel-microcode-blacklist.conf
-rw-r--r-- 1 root root 379 Feb  9  2021 mdadm.conf
-rw-r--r-- 1 root root 101 Aug 24  2022 nvdimm-security.conf
lrwxrwxrwx 1 root root  53 Feb 17 17:43 nvidia-blacklists-nouveau.conf -> /etc/alternatives/glx--nvidia-blacklists-nouveau.conf
-rw-r--r-- 1 root root 260 Jan  6  2021 nvidia-kernel-common.conf
 

Ukjent1

Cadet
Joined
Feb 26, 2023
Messages
5
HoneyBadger said:
[quoting the VFIO reset instructions above]
Thanks a lot @HoneyBadger
I have "Kernel driver in use: vfio-pci" and no Isolated GPU Devices, and have done several reboots after I have removed the Isolated GPU.

If things go sideways after removing the files and the final reboot, how do I revert the process? Is it as simple as logging back in to TrueNAS, copying the files back to their locations, running update-initramfs -k all -u, and doing another reboot to get back to where I started?
I am fairly new to TrueNAS and Linux, so I just want to be sure that I have a way back to my data if everything goes bad :)

Again, thanks a lot for your effort @Sparx and @HoneyBadger :)

I will test this later, when I get it confirmed that I have a safe way back :)
 

Sparx

Contributor
Joined
Apr 18, 2017
Messages
107
@mgoulet65, was your issue related to double GPUs? This wasn't a fix for that, I think. It's for some sort of upgrade issue from Angelfish.
 

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
@mgoulet65, was your issue related to double GPUs? This wasn't a fix for that, I think. It's for some sort of upgrade issue from Angelfish.
No. I just have never (Angelfish through current) been able to load a driver...thus nvidia-smi returns nothing.

Code:
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
 

Sparx

Contributor
Joined
Apr 18, 2017
Messages
107
@mgoulet65 ... so that's probably different, since my setup was working in Angelfish. What's your GPU?
Is it the K20c stated in your specs? That doesn't work with drivers newer than 460.106 on Linux(?)
 

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
@mgoulet65 ... so that's probably different, since my setup was working in Angelfish. What's your GPU?
Is it the K20c stated in your specs? That doesn't work with drivers newer than 460.106 on Linux(?)

Yes that's the one. Frankly I am not equipped to diagnose driver issues :-( I was hoping for guidance from this group.
 

Sparx

Contributor
Joined
Apr 18, 2017
Messages
107
Yeah, maybe it's easier to just put something newer in.
 

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
What would you recommend? Something sure-fire?
 
HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
If things go sideways after removing the files and the final reboot, how do I revert the process? Is it as simple as logging back in to TrueNAS, copying the files back to their locations, running update-initramfs -k all -u, and doing another reboot to get back to where I started?
I am fairly new to TrueNAS and Linux, so I just want to be sure that I have a way back to my data if everything goes bad :)
The files being changed are only used in relation to GPU passthrough, and only change the modules loaded in the kernel at boot-time. Even if the boot device was completely unusable, your TrueNAS data disks could be imported to a fresh install.

I will suggest that you take a configuration backup through the UI, especially if you are using pool or dataset encryption.
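If you do need to roll back, a minimal sketch (assuming you saved copies into a directory like the example /mnt/tank/vfio-backup above, with their paths preserved) would be to copy the files back, rebuild the initramfs, and reboot:

Code:
# restore the backed-up passthrough config files to their original locations
cp -v /mnt/tank/vfio-backup/boot/initramfs_config.json /boot/
cp -v /mnt/tank/vfio-backup/etc/initramfs-tools/modules /etc/initramfs-tools/
cp -v /mnt/tank/vfio-backup/etc/modules /etc/
cp -v /mnt/tank/vfio-backup/etc/modprobe.d/*.conf /etc/modprobe.d/
# rebuild the initramfs and reboot for the changes to take effect
update-initramfs -k all -u
reboot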

What would you recommend? Something sure-fire?
What's the desired use case for the GPU at home? The K20c was a pretty beefy card back in its day (around a GTX780 if I recall correctly) but if you just need to transcode a bit of video, then you won't need that sort of horsepower. Depending on the number of concurrent streams, something as small as a Quadro P400 could be enough.
 
Last edited:

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
The files being changed are only used in relation to GPU passthrough, and only change the modules loaded in the kernel at boot-time. Even if the boot device was completely unusable, your TrueNAS data disks could be imported to a fresh install.

I will suggest that you take a configuration backup through the UI, especially if you are using pool or dataset encryption.


What's the desired use case for the GPU at home? The K20c was a pretty beefy card back in its day (around a GTX780 if I recall correctly) but if you just need to transcode a bit of video, then you won't need that sort of horsepower.

Yes, it is to add HW transcoding to Plex and/or Jellyfin (once I decide which way I want to go). Is there a diagnostic path I should be on for the K20c, or should I change out to some known working GPU? If the latter, which?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Yes, it is to add HW transcoding to Plex and/or Jellyfin (once I decide which way I want to go). Is there a diagnostic path I should be on for the K20c, or should I change out to some known working GPU? If the latter, which?
The NVIDIA driver was changed between Angelfish (470.103.01) and Bluefin (515.65.01), which removed support for Kepler-based GPUs, including your K20c.
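If you want to double-check which driver branch a given install ships, one way (assuming the DKMS module path shown earlier in the thread) would be to query the kernel module directly, or ask nvidia-smi once the driver is loaded:

Code:
# print the version of the packaged NVIDIA kernel module
modinfo /lib/modules/$(uname -r)/updates/dkms/nvidia-current.ko | grep -i ^version
# or, with the driver loaded and working:
nvidia-smi --query-gpu=driver_version --format=csv,noheader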

I've seen the following page referenced - it allows you to select source and destination resolutions for transcoding, as well as filtering out GPUs by generation. A Quadro P400 will work fine for a home server that's transcoding one or two streams, but if you have a heavier workload and/or are using 4K source material you may need something stronger. Note that NVIDIA has introduced a limit of 3 concurrent NVENC streams on consumer and entry-level prosumer cards as well - if you're going beyond that, look for a card with unlimited stream support.

 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694

gsrcrxsi

Explorer
Joined
Apr 15, 2018
Messages
86
Can you, or are there plans to, allow for some kind of driver version selection by the user? Sure, it would require a reboot to change the driver, but that's better than forcing a user to choose between major TN release versions just for a different NVIDIA driver.

I would recommend three versions to choose from: ~340 or 390 for ancient legacy cards, ~470 for middle-aged cards, and the latest driver version (currently the 525 branch) to support the newest cards and CUDA features. That should cover pretty much everyone, and major TN software releases would only need to focus on updating whatever the latest stable version is.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Can you, or are there plans to, allow for some kind of driver version selection by the user? Sure, it would require a reboot to change the driver, but that's better than forcing a user to choose between major TN release versions just for a different NVIDIA driver.

I would recommend three versions to choose from: ~340 or 390 for ancient legacy cards, ~470 for middle-aged cards, and the latest driver version (currently the 525 branch) to support the newest cards and CUDA features. That should cover pretty much everyone, and major TN software releases would only need to focus on updating whatever the latest stable version is.

No plans, unfortunately... AFAIK, we would have to build three separate images. There would need to be separate software trains to allow easy updates. That's a lot of extra cost and complexity.

We are following your advice with the latest TrueNAS release (22.12), only focusing on the latest stable version. NVIDIA has the power to support the older GPUs if they wish (but they are probably happy to force an update).

You can still pass through old GPUs to VMs.
Otherwise, it's time to update the GPU (even if it's an eBay unit that is 5 years old).
 

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
The files being changed are only used in relation to GPU passthrough, and only change the modules loaded in the kernel at boot-time. Even if the boot device was completely unusable, your TrueNAS data disks could be imported to a fresh install.

I will suggest that you take a configuration backup through the UI, especially if you are using pool or dataset encryption.


What's the desired use case for the GPU at home? The K20c was a pretty beefy card back in its day (around a GTX780 if I recall correctly) but if you just need to transcode a bit of video, then you won't need that sort of horsepower. Depending on the number of concurrent streams, something as small as a Quadro P400 could be enough.
I ended up adding a newer NVIDIA GPU (P400) and everything "just worked." Thanks.
 