Nvidia GPU not appearing for use with SCALE

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I can't speak to that ticket specifically, but reviewing the linked thread, it might be a similar issue if there are still artifacts left over from a previous passthrough setup and/or the middleware isn't in sync with the on-disk modprobe configuration.

Do you have a file at /etc/modprobe.d/vfio.conf, and if so, what are its contents?
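For example, something along these lines from a shell should show whether the file exists and what's in it (the ls is just to see what other modprobe config is present):

Code:
# list the modprobe config directory, then print vfio.conf if it exists
ls -l /etc/modprobe.d/
cat /etc/modprobe.d/vfio.conf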

@Sparx if you're open to a DM I can see if I have availability to troubleshoot this real-time with you.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Please let me and anyone else know the result and hopefully a fix :)

For @Sparx it seems like the middleware and the kernel weren't on the same page as far as the vfio-pci usage.

Big thanks to @Sparx for swinging the mallet as we played a game of Whack-A-Mole to sort this out!

Important Edit - These instructions were written for resetting GPU/VFIO claims in TrueNAS SCALE 22.12 "Bluefin" and may no longer be valid for future versions, as GPU mapping may have changed.

First off, check to see if your device is claimed by the vfio-pci driver by investigating the "Kernel driver in use" part of the lspci -v output:

Code:
0b:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
        DeviceName: pciPassthru0
        Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
        Physical Slot: 192
        Flags: bus master, fast devsel, latency 248, IRQ 19
        Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e4000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Kernel driver in use: vfio-pci ### this line indicates the passthrough driver has claimed your GPU
        Kernel modules: nouveau, nvidia_current_drm, nvidia_current

By contrast, this is what it should look like when the host NVIDIA driver has claimed the card:

Code:
0b:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
        DeviceName: pciPassthru0
        Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
        Physical Slot: 192
        Flags: bus master, fast devsel, latency 248, IRQ 19
        Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e4000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Kernel driver in use: nvidia ### this is the desired state - the host NVIDIA driver owns the GPU
        Kernel modules: nouveau, nvidia_current_drm, nvidia_current
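If you just want the relevant lines, something like the following should work (the 0b:00.0 bus address is from the example above; substitute your own card's address from the first column of lspci output):

Code:
# show only the device, driver, and module lines for the card at 0b:00.0
lspci -k -s 0b:00.0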

If you have the vfio-pci driver in use, but you have no isolated GPU (and have rebooted) then you probably have a stuck file somewhere.

Warning - This procedure doesn't preserve any existing passthroughs; it resets all GPUs to host-owned. No VMs were running while we worked through this. Fixing it requires mucking about with kernel options, manually deleting files, and multiple reboots. Have backups, and obligatory warning:

Ensure you've removed all isolated GPUs from the webUI; reboot the system if any changes were necessary.
Open a root shell (SSH, use sudo -s or prepend commands with sudo to elevate if needed) on your TrueNAS SCALE machine.

Back up the contents of the following files (save them to your pool, copy them as a local text file, wherever you'd like):

Code:
/boot/initramfs_config.json
/etc/initramfs-tools/modules
/etc/modules
/etc/modprobe.d/kvm.conf
/etc/modprobe.d/nvidia.conf
/etc/modprobe.d/vfio.conf


These files will all likely contain the PCI vendor and device IDs of a passthrough GPU (e.g. 10DE:1BB3) as well as references to vfio. Once you've backed up the content, remove these files.
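As a rough sketch of the backup-and-remove step (the destination /mnt/tank/vfio-backup is just an example path; point it at your own pool):

Code:
# back up each passthrough-related file (preserving its path), then remove it;
# files that don't exist on your system (e.g. a missing nvidia.conf) are skipped
mkdir -p /mnt/tank/vfio-backup
for f in /boot/initramfs_config.json \
         /etc/initramfs-tools/modules \
         /etc/modules \
         /etc/modprobe.d/kvm.conf \
         /etc/modprobe.d/nvidia.conf \
         /etc/modprobe.d/vfio.conf; do
    [ -f "$f" ] && cp -v --parents "$f" /mnt/tank/vfio-backup/ && rm -v "$f"
done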

Run update-initramfs -k all -u - a number of grep errors will likely be logged. Reboot once more, and you should have your nvidia-smi functionality back.
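After that reboot, one quick way to confirm the GPU is back on the host driver should be something like:

Code:
# expect "Kernel driver in use: nvidia" on the GPU, loaded nvidia modules, and a working nvidia-smi
lspci -k | grep -EA3 'VGA|3D controller'
lsmod | grep nvidia
nvidia-smi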

Note that this isn't a fix for the "two identical GPUs in a system" issue - that one's a bit more complex - nor will it do anything to enable a pre-Maxwell NVIDIA GPU in Bluefin.
 
Last edited:

Potui Emitti

Cadet
Joined
Mar 1, 2023
Messages
2
I just created this account to let you know that this worked for me as well. I had the same issue where
Code:
lspci | grep VGA
would show my GPU (NVIDIA GTX 1050), but the system would throw errors at boot and when executing
Code:
nvidia-smi
. This happened with both TrueNAS SCALE Bluefin 22.12.0 and 22.12.1.

To add something new: when I logged into my machine as another user and switched to root, the suggested solution threw errors when executing
Code:
update-initramfs -k all -u
just as described. However, when I used a directly attached console and opened a Linux shell (
Code:
7) Open Linux Shell
), no errors occurred.
 

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
HoneyBadger said:
[quoting the VFIO reset instructions above]
Like some other users I have no nvidia.conf. Any thoughts on what to try?

Code:
cat /etc/modprobe.d/nvidia.conf
cat: /etc/modprobe.d/nvidia.conf: No such file or directory


Code:
find /lib/modules/5.15.79+truenas/ -type f -name '*.ko' | grep nvidia
/lib/modules/5.15.79+truenas/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
/lib/modules/5.15.79+truenas/kernel/drivers/usb/typec/altmodes/typec_nvidia.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-peermem.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-uvm.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-drm.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-modeset.ko


lsmod | grep nvidia returns nothing

Code:
ls -l /etc/modprobe.d/
total 24
-rw-r--r-- 1 root root 154 Dec 20  2019 amd64-microcode-blacklist.conf
-rw-r--r-- 1 root root 127 Feb 12  2021 dkms.conf
-rw-r--r-- 1 root root 154 Jul  4  2022 intel-microcode-blacklist.conf
-rw-r--r-- 1 root root 379 Feb  9  2021 mdadm.conf
-rw-r--r-- 1 root root 101 Aug 24  2022 nvdimm-security.conf
lrwxrwxrwx 1 root root  53 Feb 17 17:43 nvidia-blacklists-nouveau.conf -> /etc/alternatives/glx--nvidia-blacklists-nouveau.conf
-rw-r--r-- 1 root root 260 Jan  6  2021 nvidia-kernel-common.conf
 

Ukjent1

Cadet
Joined
Feb 26, 2023
Messages
5
HoneyBadger said:
[quoting the VFIO reset instructions above]
Thanks a lot @HoneyBadger
I have "Kernel driver in use: vfio-pci" and no Isolated GPU Devices, and have done several reboots after I have removed the Isolated GPU.

If things go sideways after removing the files and the final reboot, how do I revert the process? Is it as simple as logging back in to TrueNAS, copying the files back to their locations, running update-initramfs -k all -u, and doing another reboot to get back to where I started?
I am fairly new to TrueNAS and Linux, so I just want to be sure that I have a way back to my data if everything goes bad :)

Again, thanks a lot for your effort @Sparx and @HoneyBadger :)

I will test this later, when I get it confirmed that I have a safe way back :)
 

Sparx

Contributor
Joined
Apr 18, 2017
Messages
107
@mgoulet65, was your issue related to double GPUs? This wasn't a fix for that, I think. It's for some sort of upgrade issue from Angelfish.
 

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
@mgoulet65, was your issue related to double GPUs? This wasn't a fix for that, I think. It's for some sort of upgrade issue from Angelfish.
No. I just have never (Angelfish through current) been able to load a driver...thus nvidia-smi returns nothing.

Code:
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
 

Sparx

Contributor
Joined
Apr 18, 2017
Messages
107
@mgoulet65 ... so that's probably different, since my setup was working in Angelfish. What's your GPU?
Is it the K20c stated in your specs? That doesn't work with drivers newer than 460.106 on Linux(?)
 

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
@mgoulet65 ... so that's probably different, since my setup was working in Angelfish. What's your GPU?
Is it the K20c stated in your specs? That doesn't work with drivers newer than 460.106 on Linux(?)

Yes that's the one. Frankly I am not equipped to diagnose driver issues :-( I was hoping for guidance from this group.
 

Sparx

Contributor
Joined
Apr 18, 2017
Messages
107
Yeah, maybe it's easier to just put something newer in.
 

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
What would you recommend? Something sure-fire?
 
HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
If things go sideways after removing the files and the final reboot, how do I revert the process? Is it as simple as logging back in to TrueNAS, copying the files back to their locations, running update-initramfs -k all -u, and doing another reboot to get back to where I started?
I am fairly new to TrueNAS and Linux, so I just want to be sure that I have a way back to my data if everything goes bad :)
The files being changed are only used in relation to GPU passthrough, and only change the modules loaded in the kernel at boot-time. Even if the boot device was completely unusable, your TrueNAS data disks could be imported to a fresh install.

I will suggest that you take a configuration backup through the UI, especially if you are using pool or dataset encryption.
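If you do need to roll back, a minimal sketch (assuming you saved copies into a directory like the example /mnt/tank/vfio-backup above, with their paths preserved) would be to copy the files back, rebuild the initramfs, and reboot:

Code:
# restore the backed-up passthrough config files to their original locations
cp -v /mnt/tank/vfio-backup/boot/initramfs_config.json /boot/
cp -v /mnt/tank/vfio-backup/etc/initramfs-tools/modules /etc/initramfs-tools/
cp -v /mnt/tank/vfio-backup/etc/modules /etc/
cp -v /mnt/tank/vfio-backup/etc/modprobe.d/*.conf /etc/modprobe.d/
# rebuild the initramfs and reboot for the changes to take effect
update-initramfs -k all -u
reboot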

What would you recommend? Something sure-fire?
What's the desired use case for the GPU at home? The K20c was a pretty beefy card back in its day (around a GTX780 if I recall correctly) but if you just need to transcode a bit of video, then you won't need that sort of horsepower. Depending on the number of concurrent streams, something as small as a Quadro P400 could be enough.
 
Last edited:

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
The files being changed are only used in relation to GPU passthrough, and only change the modules loaded in the kernel at boot-time. Even if the boot device was completely unusable, your TrueNAS data disks could be imported to a fresh install.

I will suggest that you take a configuration backup through the UI, especially if you are using pool or dataset encryption.


What's the desired use case for the GPU at home? The K20c was a pretty beefy card back in its day (around a GTX780 if I recall correctly) but if you just need to transcode a bit of video, then you won't need that sort of horsepower.

Yes, it is to add HW transcoding to Plex and/or Jellyfin (once I decide which way I want to go). Is there a diagnostic path I should be on for the K20c, or should I change out to some known working GPU? If the latter, which?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Yes, it is to add HW transcoding to Plex and/or Jellyfin (once I decide which way I want to go). Is there a diagnostic path I should be on for the K20c, or should I change out to some known working GPU? If the latter, which?
The NVIDIA driver was changed between Angelfish (470.103.01) and Bluefin (515.65.01), which removed support for Kepler-based GPUs, including your K20c.
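If you want to double-check which driver branch a given install ships, one way (assuming the DKMS module path shown earlier in the thread) would be to query the kernel module directly, or ask nvidia-smi once the driver is loaded:

Code:
# print the version of the packaged NVIDIA kernel module
modinfo /lib/modules/$(uname -r)/updates/dkms/nvidia-current.ko | grep -i ^version
# or, with the driver loaded and working:
nvidia-smi --query-gpu=driver_version --format=csv,noheader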

I've seen the following page referenced - it allows you to select source and destination resolutions for transcoding, as well as filtering out GPUs by generation. A Quadro P400 will work fine for a home server that's transcoding one or two streams, but if you have a heavier workload and/or are using 4K source material you may need something stronger. Note that NVIDIA has introduced a limit of 3 concurrent NVENC streams on consumer and entry-level prosumer cards as well - if you're going beyond that, look for a card with unlimited stream support.

 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694

gsrcrxsi

Explorer
Joined
Apr 15, 2018
Messages
86
Can you, or are there plans to, allow for some kind of driver version selection by the user? Sure, it would require a reboot to change the driver, but that's better than forcing a user to choose between major TN release versions just for a different NVIDIA driver.

I would recommend three versions to choose from: ~340 or 390 for ancient legacy cards, ~470 for middle-aged cards, and the latest driver version (currently the 525 branch) to support the newest cards and CUDA features. That should cover pretty much everyone, and major TN software releases would only need to focus on updating whatever the latest stable version is.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Can you, or are there plans to, allow for some kind of driver version selection by the user? Sure, it would require a reboot to change the driver, but that's better than forcing a user to choose between major TN release versions just for a different NVIDIA driver.

I would recommend three versions to choose from: ~340 or 390 for ancient legacy cards, ~470 for middle-aged cards, and the latest driver version (currently the 525 branch) to support the newest cards and CUDA features. That should cover pretty much everyone, and major TN software releases would only need to focus on updating whatever the latest stable version is.

No plans, unfortunately... AFAIK, we would have to build three separate images. There would need to be separate software trains to allow easy updates. That's a lot of extra cost and complexity.

We are following your advice with the latest TrueNAS release (22.12), only focusing on the latest stable version. NVIDIA has the power to support the older GPUs if they wish (but they are probably happy to force an update).

You can still pass through old GPUs to VMs.
Otherwise, it's time to update the GPU (even if it's an eBay unit that is 5 years old).
 

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
The files being changed are only used in relation to GPU passthrough, and only change the modules loaded in the kernel at boot-time. Even if the boot device was completely unusable, your TrueNAS data disks could be imported to a fresh install.

I will suggest that you take a configuration backup through the UI, especially if you are using pool or dataset encryption.


What's the desired use case for the GPU at home? The K20c was a pretty beefy card back in its day (around a GTX780 if I recall correctly) but if you just need to transcode a bit of video, then you won't need that sort of horsepower. Depending on the number of concurrent streams, something as small as a Quadro P400 could be enough.
I ended up adding a newer NVIDIA GPU (P400) and everything "just worked." Thanks.
 