Hi there. Sorry for the late response; things have been insane with the excessive heat at work (HVAC supervisor for a major hospital)....
I am on the latest version of TrueNAS Scale 22.12.3.3
Anyway, I ended up getting this...
I am essentially running into something similar (22.12.3.3), but my GPU doesn't show up at all with lspci, and
nvidia-smi isn't returning or showing my 3070 either. I've additionally tried my 3080 with the same results. It was working before; then I had to RMA my motherboard, got the same model as a replacement, and here we are with it not working. There was no version change, so I tried rolling back to 22.12.3 and it's still the same. What happened?
I essentially had to do a reinstall of TrueNAS SCALE... The HBA card was not in IT mode (still not sure how to do this) and was causing my pool to get tons of communication-based errors. In the end I only passed through the hard drives and let Proxmox manage the HBA.
Anyway, after doing the reinstall it worked; I was able to pass through and use my GPU again. I think upgrading from CORE to SCALE has something to do with PCIe devices not passing through properly. I also was not passing my HBA through on CORE, so I didn't have the errors then. I know it would be best to put the HBA in IT mode, but that's a different issue. Being still very green to anything IT-related, I figure I only have my media files to lose if something ever happens, and I have a Pi SMB share backing up all my important files (as a secondary backup).
For @Sparx, it seems like the middleware and the kernel weren't on the same page as far as vfio-pci usage goes.
Big thanks to @Sparx for swinging the mallet as we played a game of Whack-A-Mole to sort this out!
First off, check whether your device is claimed by the vfio-pci driver by looking at the "Kernel driver in use" line of the lspci -v output:
Code:
0b:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
DeviceName: pciPassthru0
Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
Physical Slot: 192
Flags: bus master, fast devsel, latency 248, IRQ 19
Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at e4000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Kernel driver in use: vfio-pci ### this line indicates the passthrough driver has claimed your GPU
Kernel modules: nouveau, nvidia_current_drm, nvidia_current
Conversely, if the card has been returned to the host, the same output will show the NVIDIA driver in use instead:
Code:
0b:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
DeviceName: pciPassthru0
Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
Physical Slot: 192
Flags: bus master, fast devsel, latency 248, IRQ 19
Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at e4000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Kernel driver in use: nvidia ### this line indicates the host NVIDIA driver owns the GPU (not isolated for passthrough)
Kernel modules: nouveau, nvidia_current_drm, nvidia_current
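(If you'd rather not scan the full lspci output, you can point lspci at the device's bus address - 0b:00.0 in the examples above; substitute your own.)
Code:
lspci -v -s 0b:00.0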
If you have the vfio-pci driver in use, but you have no isolated GPU (and have rebooted) then you probably have a stuck file somewhere.
Warning: this doesn't preserve any existing passthroughs and resets all GPUs to host-owned (no VMs were running while I did this). Fixing it requires mucking with kernel options, manually deleting files, and multiple reboots, so have backups. With that obligatory warning out of the way:
Ensure you've removed all isolated GPUs from the webUI; reboot the system if any changes were necessary.
Open a root shell (SSH, use sudo -s or prepend commands with sudo to elevate if needed) on your TrueNAS SCALE machine.
Back up the contents of the following files (save them to your pool, copy them as a local text file, wherever you'd like):
These files will all likely contain the PCI vendor and device IDs of the passed-through GPU (e.g. 10DE:1BB3) as well as references to vfio. Once you've backed up the content, remove these files.
Run update-initramfs -k all -u (a number of grep errors will likely be logged), reboot once more, and you should have your nvidia-smi functionality back.
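For illustration only, here's roughly what the sequence looks like - the vfio.conf path and backup destination below are placeholders, not a definitive list of the files on your system:
Code:
# back up a vfio-related file to your pool before removing it (placeholder paths - adjust to your system)
cp /etc/modprobe.d/vfio.conf /mnt/tank/backups/vfio.conf.bak
# remove the stuck file once the backup is safe
rm /etc/modprobe.d/vfio.conf
# rebuild the initramfs for all installed kernels (expect some grep errors in the output)
update-initramfs -k all -u
# reboot, then confirm the card is visible again with nvidia-smi
reboot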
Note that this isn't a fix for the "two identical GPUs in a system" issue - that one's a bit more complex - nor will it do anything to enable a pre-Maxwell NVIDIA GPU in Bluefin.
I've had this long-standing problem since October 2023, when I upgraded from Bluefin to Cobia. Since I found this thread, I've been experimenting 2-4 hours nearly every week testing what's what. All I can confirm is this:
- I can't find anything in the vfio.conf or vfio-pci files. Since I can't find the file(s), I can't find what the holdup is. After all the reading and experimentation, VFIO is the problem, but how do I fix it?
- Whenever I reboot the TrueNAS server, I lose nvidia-smi functionality, which means the driver's lost.
- I can only run [update-initramfs] so many times, but I suspect either the OS or Kubernetes is not seeing my NVIDIA GPU. If I start from the beginning, BEFORE adding the NVIDIA GPU to the Jellyfin app, it's all fine.
- I'm thinking I may have to either (a) yank the NVIDIA card out and try a different PCIe slot or (b) wipe the TrueNAS OS and reimport...
- Since the Bluefin to Cobia upgrade, the only saving grace is that my Intel iGPU is selectable, but not so with my NVIDIA card.
Any help?
Hi everyone, TrueNAS Cobia uses an upgraded NVIDIA driver, and the graphics card I use for my transcoding in Jellyfin is no longer supported. The new version of the driver really cut a lot of the frequently-used cheap transcoding cards out - see list here...
Try running midclt call system.advanced.config | jq and see if anything is listed under the isolated_gpu_pci_ids line - if there is, blank it out with sudo midclt call system.advanced.update '{ "isolated_gpu_pci_ids": [] }' and then reboot. See if you're then able to see your GPU with nvidia-smi and assign it to Apps.
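For reference, an isolated GPU will show up in that jq output as something along these lines (the PCI address is just an example):
Code:
"isolated_gpu_pci_ids": [
  "0000:0b:00.0"
]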
If you need further assistance please spin up a new thread with your system and GPU info and throw an @ tag my way.
@HoneyBadger
Sorry about the late response. Anyway, I spent the last 2 days double-checking something funny/weird/interesting.
So basically I was experimenting again a few days ago BEFORE I saw your response. Then I ran the [Cobia 23.10.2] update. It VISUALLY looked like a failure, because the drop-down menu selection for the GPU was missing, but the current ADD GPU option was still there after the Cobia update. So I was running back and forth between the 23.10.1.3 and 23.10.2 updates trying to regain the functionality. However, the setting is visually lost (I don't know if it's leftover junk from the updates or something permanent/intentional). Then I tried "playing" with all the commands and didn't notice anything change; then I decided to play with the current ADD GPU settings and lo and behold, it works (even though the NVIDIA functions are marked as experimental).
So I can't say whether it was the new 23.10.2 update or the "here be dragons" section that solved it, but I do know I screwed around so much somewhere that I may have to do a full nuke of the OS at the new Dragonfish update.
The card was working, was/is part of the currently supported consumer driver list, and is recognised by TrueNAS. It's just a weird issue that the NVIDIA driver is non-functional after a reboot and must be manually reinitialised with the command [update-initramfs -k all -u].
Updating the initramfs shouldn't be needed each time; perhaps there's something still stuck in the kernel options. The "here be dragons" section and that post were written during the Bluefin release cycle, and things have changed a bit in VFIO usage since then, so check the advanced-config queries I posted above to see if it's still bolted to the kernel somehow.
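As an extra sanity check (generic Linux, nothing SCALE-specific), you can also look at the running kernel's boot parameters for any lingering vfio entries:
Code:
# show the options the running kernel was booted with; look for anything mentioning vfio-pci
cat /proc/cmdline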
This is what happened to me after the [23.10.1.3] to [23.10.2] update as well. There's no dropdown to [add] the GPU. It's why I suspected that in my case it could have been leftover issues from before/after the update, and why I decided an OS-level nuke at the next Dragonfish update is the best choice for me. Too many CLI tests and such in between...
Just to update this thread - this is in fact the TrueCharts app setup dialogue - as mentioned by @LimboMenga, there is no longer a dropdown. HOWEVER, adding "1" to the NVIDIA GPU section enabled it, and it's now working just fine in Plex. Go figure? I guess an undocumented change?