Nvidia GPU not appearing for use with SCALE

airberg

Cadet
Joined
Aug 17, 2023
Messages
9
Do me a favor: open a shell and run this

cli
system device gpu_pci_ids_choices

What is your output? Mine looks like this:
Code:
root@truenas:~# cli
[truenas]> system device gpu_pci_ids_choices
+-------------------------------------------+--------------+
| NVIDIA Corporation GP106GL [Quadro P2000] | 0000:0b:00.0 |
+-------------------------------------------+--------------+
[truenas]>
Hi there. Sorry for the late response; it's been insanity with the excessive heat at work (HVAC supervisor for a major hospital)....
I am on the latest version of TrueNAS SCALE, 22.12.3.3.
Anyway, I ended up getting this...
[screenshot attached]
 

monsterman

Cadet
Joined
Dec 28, 2022
Messages
4
I am essentially running into something similar (22.12.3.3) - but my GPU doesn't show up at all with lspci and:

[screenshot attached]

[screenshot attached]

nvidia-smi is returning and showing my 3070; I have additionally tried my 3080 with the same results. It was working before - then I had to RMA my mobo. I got the same mobo as a replacement, and here we are with it not working. No version change, so I tried rolling back to 22.12.3 and it's still the same. What happened?
 

airberg

Cadet
Joined
Aug 17, 2023
Messages
9
@airberg I have the exact same issue. It happened overnight without any updates.
I essentially had to do a re-install of TrueNAS SCALE... The HBA card was not in IT mode (still not sure how to do this) and was causing my pool to get tons of errors (comm based). I ended up only passing through the hard drives in the end and let Proxmox manage the HBA.

Anyway, after doing the reinstall it worked. I was able to pass through and use my GPU again. I think it has something to do with upgrading from CORE to SCALE that it doesn't properly pass your PCIe devices through. I also was not passing my HBA through on CORE, so I didn't have the errors then. I know it would be best to put the HBA in IT mode, but that's a different issue. Being still very green to anything IT related, I figure I only have my media files to lose if something ever happens. I have a Pi SMB share that is backing up all my important files (as a secondary backup).
 

LimboMenga

Cadet
Joined
Jan 23, 2024
Messages
3
For @Sparx it seems like the middleware and the kernel weren't on the same page as far as vfio-pci usage goes.

Big thanks to @Sparx for swinging the mallet as we played a game of Whack-A-Mole to sort this out!

First off, check to see if your device is claimed by the vfio-pci driver by investigating the "Kernel driver in use" part of the lspci -v output:

Code:
0b:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
        DeviceName: pciPassthru0
        Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
        Physical Slot: 192
        Flags: bus master, fast devsel, latency 248, IRQ 19
        Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e4000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Kernel driver in use: vfio-pci ### this line indicates the passthrough driver has claimed your GPU
        Kernel modules: nouveau, nvidia_current_drm, nvidia_current

Contrast this with a GPU that is available to the host, where the "Kernel driver in use" line shows nvidia instead:

Code:
0b:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
        DeviceName: pciPassthru0
        Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
        Physical Slot: 192
        Flags: bus master, fast devsel, latency 248, IRQ 19
        Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e4000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_current_drm, nvidia_current
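
If you just want the one relevant line, a quick convenience check like this should work (using the 0b:00.0 slot from the example above; substitute your own GPU's PCI address from lspci):

Code:
# show only the driver currently bound to the GPU at 0b:00.0
lspci -k -s 0b:00.0 | grep "Kernel driver in use"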

If you have the vfio-pci driver in use, but you have no isolated GPU (and have rebooted) then you probably have a stuck file somewhere.

Warning - this doesn't preserve any existing passthroughs, and resets all GPUs to host-owned. No VMs were running while I did this. Fixing this requires you to go mucking with kernel options, manually deleting files, and multiple reboots. Have backups. Obligatory warning: here be dragons.

Ensure you've removed all isolated GPUs from the webUI; reboot the system if any changes were necessary.
Open a root shell (SSH, use sudo -s or prepend commands with sudo to elevate if needed) on your TrueNAS SCALE machine.

Back up the contents of the following files (save them to your pool, copy them as a local text file, wherever you'd like):

Code:
/boot/initramfs_config.json
/etc/initramfs-tools/modules
/etc/modules
/etc/modprobe.d/kvm.conf
/etc/modprobe.d/nvidia.conf
/etc/modprobe.d/vfio.conf


These files will all likely contain the PCI vendor and device IDs of a passed-through GPU (e.g. 10DE:1BB3) as well as references to vfio. Once you've backed up the content, remove these files.

Run update-initramfs -k all -u - a number of grep errors will likely be logged. Reboot once more, and you should have your nvidia-smi functionality back.
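
As a rough sketch of the backup-and-remove steps above (run as root; the backup location is just an example - save the copies to your pool or wherever suits you):

Code:
# back up, then remove, the files that pin the GPU to vfio-pci
mkdir -p /root/vfio-backup
for f in /boot/initramfs_config.json \
         /etc/initramfs-tools/modules \
         /etc/modules \
         /etc/modprobe.d/kvm.conf \
         /etc/modprobe.d/nvidia.conf \
         /etc/modprobe.d/vfio.conf; do
    [ -e "$f" ] && cp -v --parents "$f" /root/vfio-backup/ && rm -v "$f"
done

# rebuild the initramfs for all installed kernels, then reboot
update-initramfs -k all -u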

Note that this isn't a fix for the "two identical GPUs in a system" issue - that one's a bit more complex - nor will it do anything to enable a pre-Maxwell NVIDIA GPU in Bluefin.
I've had this long-standing problem since October 2023, when I upgraded from Bluefin to Cobia. Since I found this thread, I've been experimenting 2-4 hours every week testing what's what. All I can confirm is this:

- I can't find anything in the vfio.conf or vfio-pci files. Since I can't find the file(s), I can't find what the holdup is. After all the reading and experimentation, VFIO is the problem; but how to fix it?
- If I ever reboot the TrueNAS server, I lose nvidia-smi functionality, which means the driver is lost.
- I can only run [update-initramfs] so many times, but I suspect either the OS or Kubernetes is not seeing my NVIDIA GPU. If I start from the beginning, BEFORE adding the NVIDIA GPU to the Jellyfin app, it's all fine.
- I'm thinking I may have to either (a) yank the NVIDIA card out and try a different PCIe slot, or (b) wipe the TrueNAS OS and reimport...
- Since the Bluefin-to-Cobia upgrade, the only saving grace is that my Intel iGPU is selectable, but not so with my NVIDIA.
Any help?

*My history with the problem can be found on Reddit here: https://www.reddit.com/r/truenas/comments/185wiq0/rant_just_a_case_of_bad_choices_and_bad_timing/
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
(quoting LimboMenga's post above)

I'd suggest starting a new thread with your specific GPU and problem. Please use SCALE 23.10.1.3 to start with.

Did you check your GPU against the NVIDIA driver compatibility list?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
@LimboMenga

Try running midclt call system.advanced.config | jq and see if anything is visible under the isolated_gpu_pci_ids line - if there is, blank it out with sudo midclt call system.advanced.update '{ "isolated_gpu_pci_ids": [] }' and then reboot. See if you're able to then see your GPU with nvidia-smi and assign it to Apps.
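
For reference, a rough shell version of the above (the jq filter just narrows the output to the relevant key):

Code:
# show the currently isolated GPU PCI IDs (an empty list means nothing is isolated)
midclt call system.advanced.config | jq '.isolated_gpu_pci_ids'

# clear the isolation list, then reboot
sudo midclt call system.advanced.update '{"isolated_gpu_pci_ids": []}'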

If you need further assistance please spin up a new thread with your system and GPU info and throw an @ tag my way.
 

LimboMenga

Cadet
Joined
Jan 23, 2024
Messages
3
(quoting HoneyBadger's post above)
@HoneyBadger
Sorry about the late response. Anyway, I spent the last two days double-checking something "funny"/weird/interesting.

So basically I was experimenting again a few days ago, BEFORE I saw your response. Then I ran the [Cobia 23.10.2] update. It VISUALLY looked like a failure, the reason being that the DROP DOWN menu selection for the GPU was missing; but the current ADD GPU setting was still there after the Cobia update. So I was running back and forth between the 23.10.1.3 and 23.10.2 updates trying to regain the functionality. However, the setting is visually lost (I don't know if it's leftover junk from the updates or something permanent/intentional). Then I tried "playing" with all the commands and didn't notice anything change; then I decided to play with the current ADD GPU settings and lo and behold, it works *(even though NVIDIA functions are marked as experimental).

However, whenever I do a reboot, the NVIDIA drivers are missing. So I have to re-run the [update-initramfs -k all -u] command every time TrueNAS reboots, just to enable the drivers (refer to the "here be dragons" section https://www.truenas.com/community/t...-for-use-with-scale.101872/page-3#post-747374)

So I can't say if it was the new 23.10.2 update or the "here be dragons" section that solved it, but I do know I screwed around so much somewhere that I may have to do a full nuke on the OS at the new DragonFish update.


@morganL

The card was working, and was/is part of the currently supported CONSUMER driver list, and is recognised by TrueNAS. It's just a weird issue that the NVIDIA driver is non-functional after a reboot and must be manually initialised with the command [update-initramfs -k all -u].
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Updating the initramfs shouldn't be needed each time; perhaps there's something still stuck in the kernel options. The "here be dragons" section and that post were written during the Bluefin release cycle, and things have changed a bit in VFIO usage since then, so check the queries for the advanced config that I posted above to see if it's still bolted to the kernel somehow.
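
One rough way to check whether vfio is still being pulled in at boot (the paths are the same ones listed in the earlier "here be dragons" post, so adjust to whatever actually exists on your system):

Code:
# anything vfio-related on the kernel command line or in the module configs?
cat /proc/cmdline
grep -ri vfio /etc/modprobe.d/ /etc/modules /etc/initramfs-tools/modules 2>/dev/null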
 

latez

Dabbler
Joined
Sep 29, 2014
Messages
12
(quoting HoneyBadger's post above)
Hi there, I seem to be having a similar issue. The system seems to see/load my GPU just fine -

[screenshot attached]

However I have no dropdown...

[screenshot attached]


Nor is the GPU isolated -

[screenshot attached]


Code:
# midclt call system.advanced.config | jq
  "isolated_gpu_pci_ids": [],
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
(quoting latez's post above)
Is this the TrueCharts app setup dialog? The regular TrueNAS one should have a drop-down field rather than a free-form integer.

What happens if you add the single GPU and start the container - does nvidia-smi show any process consuming the GPU?
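
If it helps, a quick way to list processes holding the GPU (standard nvidia-smi query options; the plain nvidia-smi table shows the same thing in its "Processes" section):

Code:
# the standard table includes a Processes section at the bottom
nvidia-smi

# or query just the compute processes, with memory usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv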
 

LimboMenga

Cadet
Joined
Jan 23, 2024
Messages
3
Is this the TrueCharts app setup dialog? The regular TrueNAS one should have a drop-down field rather than a free-form integer.
@HoneyBadger

This is what happened to me after the [23.10.1.3] update to [23.10.2] as well. There's no dropdown to [add] the GPU. It's why I suspected that in my case it could have been leftover issues from before/after the update. It's also why I decided an OS-level nuke at the next DragonFish update is the best choice for me. Too many CLI tests and such in between...
 

latez

Dabbler
Joined
Sep 29, 2014
Messages
12
Just to update this thread - this is in fact the TrueCharts app setup dialogue. As mentioned by @LimboMenga, there is no longer a dropdown. HOWEVER, adding "1" to the NVIDIA GPU section enabled it, and it's now working just fine in Plex. Go figure? I guess an undocumented change?
 

Haldi

Cadet
Joined
Feb 28, 2024
Messages
8
Hello,

I can't find my GPU via nvidia-smi.
Any idea why or what to do?
I've activated Above 4G Decoding and Resizable BAR support in the BIOS. And it's not isolated.

[screenshot attached]


Code:
root@NAS[~]# nvidia-smi
No devices were found


Code:
root@NAS[~]# midclt call system.advanced.config | jq
{
  "isolated_gpu_pci_ids": [],
}
root@NAS[~]#


Code:
root@NAS[~]# lspci | grep VGA
05:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
09:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
root@NAS[~]#  systemctl status systemd-modules-load.service
● systemd-modules-load.service - Load Kernel Modules
     Loaded: loaded (/lib/systemd/system/systemd-modules-load.service; static)
    Drop-In: /etc/systemd/system/systemd-modules-load.service.d
             └─override.conf
     Active: active (exited) since Mon 2024-03-11 18:01:19 CET; 52min left
       Docs: man:systemd-modules-load.service(8)
             man:modules-load.d(5)
   Main PID: 2444 (code=exited, status=0/SUCCESS)
        CPU: 7ms

Mar 11 18:01:19 NAS systemd[1]: Starting systemd-modules-load.service - Load Kernel Modules...
Mar 11 18:01:19 NAS systemd-modules-load[2444]: Inserted module 'ioatdma'
Mar 11 18:01:19 NAS systemd-modules-load[2444]: Inserted module 'ntb_split'
Mar 11 18:01:19 NAS systemd-modules-load[2444]: Inserted module 'ntb_netdev'
Mar 11 18:01:19 NAS systemd[1]: Finished systemd-modules-load.service - Load Kernel Modules.
root@NAS[~]#



Code:
root@NAS[~]# lsmod | grep nvidia
nvidia_uvm           1523712  0
nvidia_drm             77824  0
nvidia_modeset       1310720  1 nvidia_drm
nvidia              56500224  2 nvidia_uvm,nvidia_modeset
video                  65536  1 nvidia_modeset
drm_kms_helper        204800  5 drm_vram_helper,ast,nvidia_drm
drm                   614400  8 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,nvidia_drm,ttm
root@NAS[~]#


Code:
root@NAS[~]# lspci -v
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
        Subsystem: Gigabyte Technology Co., Ltd Starship/Matisse Root Complex
        Flags: fast devsel

09:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Elitegroup Computer Systems GP104 [GeForce GTX 1070]
        Flags: bus master, fast devsel, latency 0, IRQ 106, IOMMU group 16
        Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
        Memory at ffe0000000 (64-bit, prefetchable) [size=256M]
        Memory at fff0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at f000
        Expansion ROM at fc000000 [virtual] [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_current_drm, nvidia_current

09:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)
        Subsystem: Elitegroup Computer Systems GP104 High Definition Audio Controller
        Flags: bus master, fast devsel, latency 0, IRQ 103, IOMMU group 16
        Memory at fc080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel


TrueNAS-SCALE-23.10.2
 

Amped141

Cadet
Joined
Mar 22, 2024
Messages
1
(quoting Haldi's post above)
I'm running into this same exact issue. I have a similar card (an NVIDIA 1060), same symptoms: nvidia-smi outputs nothing.
 