GPU not available to Plex app

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
Regardless of the failed nvidia-drm module, the Tesla P4 card is detected properly in the @truecharts Plex app.

[screenshot attachment]


Can someone provide some guidance on how to test whether the card is used in Plex for transcoding? I believe the easiest way is to check the dashboard.

Dashboard with transcoding disabled/forced; I can see the (hw) indicator:

[screenshot attachments]


Transcoder settings:

[screenshot attachment]
 
Last edited:

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
Is the card detected in the OS?
Code:
# lspci | grep -i nvidia
03:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
Code:
#  lspci | grep -i nvidia
04:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20c] (rev a1)
 

eroji

Contributor
Joined
Feb 2, 2015
Messages
140
I just upgraded to Bluefin last night. There were some anomalies I noticed where the driver failed to load properly, but a reboot fixed the issue. A good indicator is whether you can execute nvidia-smi; it will show any transcode processes running within the Plex container. If the command comes back with some sort of error, then your container will not be able to use hardware transcode.

Code:
Unable to determine the device handle for GPU 0000:81:00.0: Unknown Error


However, if the driver is properly loaded and nvidia-smi shows the GPU, then transcode should work properly.
Code:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:81:00.0 Off |                  Off |
| N/A   56C    P0    26W /  75W |    239MiB /  8192MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    172057      C   ...diaserver/Plex Transcoder      237MiB |
+-----------------------------------------------------------------------------+
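For a quick check while a stream is playing, something like this should work from the SCALE shell (a sketch; the namespace and deployment names are assumptions for a TrueCharts install named "plex", so adjust them to your release):
Code:
# Refresh nvidia-smi every 2 seconds and watch for a "Plex Transcoder" process to appear
watch -n 2 nvidia-smi

# Optionally, run the same check from inside the Plex pod itself
# (ix-plex / plex are assumed names for a TrueCharts deployment)
k3s kubectl exec -n ix-plex deploy/plex -- nvidia-smi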

[screenshot attachment]
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
However, if the driver is properly loaded and nvidia-smi shows the GPU, then transcode should work properly.
Thank you for that command, I did not know about it. Here's mine, using 239MiB / 8121MiB and running slightly hotter at 62C; yours shows 8192MiB:
Code:
# nvidia-smi
Wed Nov 30 18:47:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:03:00.0 Off |                  Off |
| N/A   62C    P0    26W /  75W |    239MiB /  8121MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    211532      C   ...diaserver/Plex Transcoder      237MiB |
+-----------------------------------------------------------------------------+

Do you have the card installed in a rack server? Do these Tesla P4 cards have upgradable firmware?

Edit: While transcoding, I just saw a crash on screen (Angelfish 22.02.04); I'm rebooting to see if I can catch it again:
Code:
# grep -i '18:57:07' -A4 /var/log/messages
Nov 30 18:57:07 uranus kernel: pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: 0000:00:02.0
Nov 30 18:57:07 uranus kernel: nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
Nov 30 18:57:07 uranus kernel: NVRM: GPU at PCI:0000:03:00: GPU-ff5239ea-ba5d-4952-a55f-6ad0fc8f56bb
Nov 30 18:57:07 uranus kernel: NVRM: Xid (PCI:0000:03:00): 79, pid=0, GPU has fallen off the bus.
Nov 30 18:57:07 uranus kernel: NVRM: GPU 0000:03:00.0: GPU has fallen off the bus.
Nov 30 18:57:07 uranus kernel: NVRM: GPU 0000:03:00.0: GPU serial number is [removed].
Nov 30 18:57:07 uranus kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
Nov 30 18:57:08 uranus kernel: pcieport 0000:00:02.0: AER: Root Port link has been reset (0)
Nov 30 18:57:08 uranus kernel: pcieport 0000:00:02.0: AER: device recovery failed

# nvidia-smi
Unable to determine the device handle for GPU 0000:03:00.0: Unknown Error
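A couple of quick checks right after a crash like this (a sketch, based on the messages above):
Code:
# Look for Xid / AER events around the time of the crash
dmesg -T | grep -iE 'xid|nvrm|aer' | tail -n 20
grep -iE 'xid|fallen off the bus' /var/log/messages | tail -n 20

# nvidia-smi typically keeps erroring out until the module is reloaded or the box is rebooted
nvidia-smi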
 
Last edited:

mgoulet65

Explorer
Joined
Jun 15, 2021
Messages
95
Thank you for that command, I did not know about it. Here's mine, using 239MiB / 8121MiB and running slightly hotter at 62C; yours shows 8192MiB: [quoted nvidia-smi output and crash log trimmed]
Mine shows nothing!

Code:
# nvidia-smi
No devices were found
 
Last edited:

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
@eroji I got the crash again. Streaming stopped on the @truecharts Plex app; this is an issue that needs to be reported. How do I pull the crash dump, so I can report it to iX in a Jira ticket?

Edit: I created NAS-119223. Can you confirm whether you experience the same issue in Bluefin and submit a debug file to the ticket?

Code:
Nov 30 19:24:21 uranus kernel: pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: 0000:00:02.0
Nov 30 19:24:21 uranus kernel: nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
Nov 30 19:24:21 uranus kernel: NVRM: GPU at PCI:0000:03:00: GPU-ff5239ea-ba5d-4952-a55f-6ad0fc8f56bb
Nov 30 19:24:21 uranus kernel: NVRM: Xid (PCI:0000:03:00): 79, pid=0, GPU has fallen off the bus.
Nov 30 19:24:21 uranus kernel: NVRM: GPU 0000:03:00.0: GPU has fallen off the bus.
Nov 30 19:24:21 uranus kernel: NVRM: GPU 0000:03:00.0: GPU serial number is [removed].
Nov 30 19:24:21 uranus kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
Nov 30 19:24:22 uranus kernel: pcieport 0000:00:02.0: AER: Root Port link has been reset (0)
Nov 30 19:24:22 uranus kernel: pcieport 0000:00:02.0: AER: device recovery failed
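For attaching data to the ticket, the kernel message itself points at nvidia-bug-report.sh; something along these lines should collect it (a sketch):
Code:
# Run as root before the NVIDIA kernel module is unloaded or the box is rebooted
nvidia-bug-report.sh          # writes nvidia-bug-report.log.gz in the current directory

# Also grab the relevant kernel messages for the ticket
grep -iE 'NVRM|AER' /var/log/messages > nvidia-crash-messages.txt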

[screenshot attachment]
 
Last edited:

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
I am not experiencing the same issue so far.
Good to know. Can you please force transcoding for a while to see if the issue surfaces? In my case, after one reboot with forced transcoding, I experienced the issue again 15 minutes later while playing from an Apple TV. I'll force transcoding again to see if it happens. In the Apple TV Plex app settings, I set Home Streaming to 10Mbps, 1080p.

Edit: The crash occurred again, 30min later.

Your versions:
Code:
NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7

Mine:
Code:
NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4
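For a quick comparison without the full nvidia-smi banner, the loaded driver version can also be read directly (a sketch):
Code:
# Either of these shows the driver version the kernel actually loaded
cat /proc/driver/nvidia/version
nvidia-smi --query-gpu=driver_version --format=csv,noheader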

Bluefin 22.12.0 is very close to being released.
 
Last edited:

eroji

Contributor
Joined
Feb 2, 2015
Messages
140
I'm not entirely sure what you mean by "force transcoding". Transcoding is currently enabled and it will use hardware transcode when needed; I've confirmed this. However, right now, through my testing with 2 Tesla P4s, it seems like there is an issue with passing 1 GPU to a container and 1 to a VM. I created a post here: GPU passthrough to VM and app.
 

eroji

Contributor
Joined
Feb 2, 2015
Messages
140
So after playing around with GPU passthrough for both the VM and the container, the OS is no longer detecting both cards as VGA controllers. Instead, they show up as 3D controllers, which seems to be causing the OS not to load the NVIDIA drivers on boot. I have no idea why this is happening.

Code:
Dec  3 13:03:36 nas-02 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
Dec  3 13:03:36 nas-02 kernel: NVRM: The NVIDIA probe routine was not called for 2 device(s).
Dec  3 13:03:36 nas-02 kernel: NVRM: This can occur when a driver such as:
NVRM: nouveau, rivafb, nvidiafb or rivatv
NVRM: was loaded and obtained ownership of the NVIDIA device(s).
Dec  3 13:03:36 nas-02 kernel: NVRM: Try unloading the conflicting kernel module (and/or
NVRM: reconfigure your kernel without the conflicting

Code:
root@nas-02:~# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Code:
root@nas-02:~# lshw -C display
  *-display                
       description: 3D controller
       product: GP104GL [Tesla P4]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       logical name: /dev/fb0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress cap_list fb
       configuration: depth=32 driver=vfio-pci latency=0 mode=1280x1024 visual=truecolor xres=1280 yres=1024
       resources: iomemory:3800-37ff iomemory:3800-37ff irq:11 memory:f6000000-f6ffffff memory:38060000000-3806fffffff memory:38070000000-38071ffffff
  *-display
       description: 3D controller
       product: GP104GL [Tesla P4]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:81:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress cap_list
       configuration: driver=vfio-pci latency=0
       resources: iomemory:2000-1fff iomemory:2000-1fff irq:11 memory:f0000000-f0ffffff memory:20000000000-2000fffffff memory:20010000000-20011ffffff
  *-display
       description: VGA compatible controller
       product: ASPEED Graphics Family
       vendor: ASPEED Technology, Inc.
       physical id: 0
       bus info: pci@0000:c4:00.0
       version: 41
       width: 32 bits
       clock: 33MHz
       capabilities: pm msi vga_controller cap_list
       configuration: driver=ast latency=0
       resources: irq:327 memory:b6000000-b6ffffff memory:b7000000-b701ffff ioport:e000(size=128)
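Note that both Tesla P4s above report driver=vfio-pci, which would normally mean they are still reserved for VM passthrough, so the host nvidia driver cannot claim them. A quick way to confirm which driver is bound to each card (a sketch):
Code:
# List NVIDIA devices and the kernel driver currently bound to each
lspci -nnk -d 10de: | grep -E 'NVIDIA|Kernel driver in use'
# "Kernel driver in use: vfio-pci" = isolated for passthrough
# "Kernel driver in use: nvidia"   = available to the host and apps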
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
After the Bluefin 22.12.0 upgrade, all transcoding issues are fixed and there are no more crashes. I've been force transcoding for over 5 hours without issues.
Code:
# nvidia-smi
Tue Dec 13 17:23:58 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:03:00.0 Off |                  Off |
| N/A   93C    P0    27W /  75W |    275MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    155778      C   ...diaserver/Plex Transcoder      273MiB |
+-----------------------------------------------------------------------------+

Is it expected to run this hot, at 93C, while transcoding for a while? When not in use, the temperature drops to 41C:
Code:
# nvidia-smi
Tue Dec 13 23:01:33 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:03:00.0 Off |                  Off |
| N/A   41C    P8     6W /  75W |      2MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
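To keep an eye on the temperature during a long transcode without re-running the full command, nvidia-smi can log just the interesting fields (a sketch):
Code:
# Print temperature, power draw and utilization every 5 seconds
nvidia-smi --query-gpu=temperature.gpu,power.draw,utilization.gpu --format=csv -l 5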
 
Last edited:

Sasquatch

Explorer
Joined
Nov 11, 2017
Messages
87
Is it expected to run this hot, at 93C, while transcoding for a while? When not in use, the temperature drops to 41C. [quoted nvidia-smi output trimmed]
My Quadro P400 runs at 52C while transcoding 4K 10-bit to 1080p and 34C idle, so your temps are sky high.
Isn't the Tesla P4 relying on case airflow for cooling, since it doesn't have its own fan?

BTW, Bluefin is totally broken for me when it comes to GPU support.
I tried a P400 and an RTX 3070, and with both I get:
Code:
systemd-modules-load[2523]: Failed to find module 'nvidia-drm'
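If it helps with debugging, the first things worth checking after boot are whether any nvidia modules loaded at all and what happens when the missing one is loaded by hand (a sketch):
Code:
# See which nvidia modules actually loaded at boot
lsmod | grep nvidia

# Try loading the missing module manually; the resulting error usually hints at the root cause
modprobe nvidia-drm
dmesg | tail -n 20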
 