mgoulet65
"See above. :) Report that you have the same issue in the ticket and attach the debug file."
Done.
"Mine is not recognized by Plex"
Regardless of the failed nvidia-drm module, the Tesla P4 card is detected properly in the @truecharts Plex app.
View attachment 60487
Can someone provide some guidance on how to test that the card is used in Plex for transcoding? I believe the easiest way is to check the dashboard.
On the dashboard, with transcoding disabled/forced, I can see the (hw) indicator:
View attachment 60489 View attachment 60490
Transcoder settings:
View attachment 60488
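Besides the dashboard's (hw) indicator, the same check can be done from a TrueNAS shell with nvidia-smi. A minimal sketch, assuming the NVIDIA driver is loaded and a stream is being transcoded (the watch utility from procps is assumed to be present):
Code:
# Refresh nvidia-smi every 2 seconds; a hardware transcode shows up as a
# "Plex Transcoder" process holding a few hundred MiB of GPU memory.
watch -n 2 nvidia-smi

# Or list only the processes currently using the GPU:
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv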
"Mine is not recognized by Plex"
Make sure you have nothing set in Isolated GPU Devices.
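For what it's worth, the isolation setting can also be checked from the shell. A rough sketch; the midclt method is the standard SCALE middleware client, but the isolated_gpu_pci_ids field name is an assumption, so verify it against your release:
Code:
# Dump the advanced system settings and pull out the isolated GPU PCI IDs
# (field name assumed; an empty list means no GPU is isolated).
midclt call system.advanced.config | jq '.isolated_gpu_pci_ids'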
Still having the same issue.
"Still having the same issue."
Is the card detected in the OS?
"Is the card detected in the OS?"
Code:
# lspci | grep -i nvidia
03:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
Code:
# lspci | grep -i nvidia
04:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20c] (rev a1)
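One more detail worth checking when lspci does see the card: which kernel driver actually owns it. A minimal sketch:
Code:
# -k adds the "Kernel driver in use:" line; for Plex the NVIDIA card should
# show "nvidia", not "nouveau" or "vfio-pci".
lspci -nnk | grep -iA3 nvidia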
Unable to determine the device handle for GPU 0000:81:00.0: Unknown Error
Code:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:81:00.0 Off |                  Off |
| N/A   56C    P0    26W /  75W |    239MiB /  8192MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    172057      C   ...diaserver/Plex Transcoder      237MiB |
+-----------------------------------------------------------------------------+
"However, if the driver is properly loaded and nvidia-smi shows the GPU is seen then transcode should work properly."
Thank you for that command, I did not know about it. Here's mine, using 239MiB / 8121MiB and slightly hotter at 62C; yours shows 8192MiB:
Code:
# nvidia-smi
Wed Nov 30 18:47:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:03:00.0 Off |                  Off |
| N/A   62C    P0    26W /  75W |    239MiB /  8121MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    211532      C   ...diaserver/Plex Transcoder      237MiB |
+-----------------------------------------------------------------------------+

"Thank you for that command, I did not know about it. Here's mine, using 239MiB / 8121MiB and slightly hotter at 62C; yours shows 8192MiB:"
Mine shows nothing!
Do you have the card installed in a rack server? Do these Tesla P4 cards have upgradable firmware?
Edit: While transcoding, I just saw a crash on screen (Angelfish 22.02.04). I'm rebooting to see if I can catch it again:
Code:
# grep -i '18:57:07' -A4 /var/log/messages
Nov 30 18:57:07 uranus kernel: pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: 0000:00:02.0
Nov 30 18:57:07 uranus kernel: nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
Nov 30 18:57:07 uranus kernel: NVRM: GPU at PCI:0000:03:00: GPU-ff5239ea-ba5d-4952-a55f-6ad0fc8f56bb
Nov 30 18:57:07 uranus kernel: NVRM: Xid (PCI:0000:03:00): 79, pid=0, GPU has fallen off the bus.
Nov 30 18:57:07 uranus kernel: NVRM: GPU 0000:03:00.0: GPU has fallen off the bus.
Nov 30 18:57:07 uranus kernel: NVRM: GPU 0000:03:00.0: GPU serial number is {REMOVED}.
Nov 30 18:57:07 uranus kernel: NVRM: A GPU crash dump has been created. If possible, please run
                               NVRM: nvidia-bug-report.sh as root to collect this data before
                               NVRM: the NVIDIA kernel module is unloaded.
Nov 30 18:57:08 uranus kernel: pcieport 0000:00:02.0: AER: Root Port link has been reset (0)
Nov 30 18:57:08 uranus kernel: pcieport 0000:00:02.0: AER: device recovery failed
# nvidia-smi
Unable to determine the device handle for GPU 0000:03:00.0: Unknown Error
Code:
# nvidia-smi
No devices were found
Code:
Nov 30 19:24:21 uranus kernel: pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: 0000:00:02.0
Nov 30 19:24:21 uranus kernel: nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
Nov 30 19:24:21 uranus kernel: NVRM: GPU at PCI:0000:03:00: GPU-ff5239ea-ba5d-4952-a55f-6ad0fc8f56bb
Nov 30 19:24:21 uranus kernel: NVRM: Xid (PCI:0000:03:00): 79, pid=0, GPU has fallen off the bus.
Nov 30 19:24:21 uranus kernel: NVRM: GPU 0000:03:00.0: GPU has fallen off the bus.
Nov 30 19:24:21 uranus kernel: NVRM: GPU 0000:03:00.0: GPU serial number is [removed].
Nov 30 19:24:21 uranus kernel: NVRM: A GPU crash dump has been created. If possible, please run
                               NVRM: nvidia-bug-report.sh as root to collect this data before
                               NVRM: the NVIDIA kernel module is unloaded.
Nov 30 19:24:22 uranus kernel: pcieport 0000:00:02.0: AER: Root Port link has been reset (0)
Nov 30 19:24:22 uranus kernel: pcieport 0000:00:02.0: AER: device recovery failed
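Since the same Xid 79 ("GPU has fallen off the bus") keeps coming back, it may help to pull every Xid event out of the log and to grab the crash dump NVIDIA asks for. A sketch, assuming nvidia-bug-report.sh was installed alongside the driver on your system:
Code:
# List every NVRM Xid event recorded so far (Xid 79 = GPU has fallen off the bus)
grep -i 'NVRM: Xid' /var/log/messages

# Collect the crash dump the kernel message asks for, before the module is unloaded
nvidia-bug-report.sh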
"I am not experiencing the same issue so far."
Good to know. Can you please force transcoding for a while to see if the issue surfaces? In my case, after one reboot with forced transcoding, I experienced the issue again 15 minutes later while playing from an Apple TV. I'll force transcoding again to see if it happens. In the Apple TV Plex app settings, I set Home Streaming to 10 Mbps, 1080p.
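While a forced transcode is running, per-second utilization can be watched to confirm the card is actually doing the work. A minimal sketch:
Code:
# One line per second: streaming-multiprocessor, memory, encoder and decoder
# utilization (sm/mem/enc/dec columns); enc/dec should be non-zero during a transcode.
nvidia-smi dmon -s u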
NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4
Code:
Dec 3 13:03:36 nas-02 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
Dec 3 13:03:36 nas-02 kernel: NVRM: The NVIDIA probe routine was not called for 2 device(s).
Dec 3 13:03:36 nas-02 kernel: NVRM: This can occur when a driver such as:
                              NVRM: nouveau, rivafb, nvidiafb or rivatv
                              NVRM: was loaded and obtained ownership of the NVIDIA device(s).
Dec 3 13:03:36 nas-02 kernel: NVRM: Try unloading the conflicting kernel module (and/or
                              NVRM: reconfigure your kernel without the conflicting
Code:
root@nas-02:~# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Code:
root@nas-02:~# lshw -C display
  *-display
       description: 3D controller
       product: GP104GL [Tesla P4]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       logical name: /dev/fb0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress cap_list fb
       configuration: depth=32 driver=vfio-pci latency=0 mode=1280x1024 visual=truecolor xres=1280 yres=1024
       resources: iomemory:3800-37ff iomemory:3800-37ff irq:11 memory:f6000000-f6ffffff memory:38060000000-3806fffffff memory:38070000000-38071ffffff
  *-display
       description: 3D controller
       product: GP104GL [Tesla P4]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:81:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress cap_list
       configuration: driver=vfio-pci latency=0
       resources: iomemory:2000-1fff iomemory:2000-1fff irq:11 memory:f0000000-f0ffffff memory:20000000000-2000fffffff memory:20010000000-20011ffffff
  *-display
       description: VGA compatible controller
       product: ASPEED Graphics Family
       vendor: ASPEED Technology, Inc.
       physical id: 0
       bus info: pci@0000:c4:00.0
       version: 41
       width: 32 bits
       clock: 33MHz
       capabilities: pm msi vga_controller cap_list
       configuration: driver=ast latency=0
       resources: irq:327 memory:b6000000-b6ffffff memory:b7000000-b701ffff ioport:e000(size=128)
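In that lshw output both Tesla P4s show driver=vfio-pci, which is typically what you see when a card is listed under Isolated GPU Devices: SCALE reserves it for VM passthrough, the nvidia driver never binds, and nvidia-smi can't see it. A quick per-device check, using the bus IDs from the output above:
Code:
# Show which driver owns each Tesla P4; "vfio-pci" means the card is held for
# passthrough and unavailable to apps, "nvidia" means the host driver has it.
readlink /sys/bus/pci/devices/0000:01:00.0/driver
readlink /sys/bus/pci/devices/0000:81:00.0/driver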
After the Bluefin 22.12.0 upgrade, all transcoding issues are fixed, no more crashes. I've been force transcoding for over 5 hours without issues.
Code:
# nvidia-smi
Tue Dec 13 17:23:58 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:03:00.0 Off |                  Off |
| N/A   93C    P0    27W /  75W |    275MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    155778      C   ...diaserver/Plex Transcoder      273MiB |
+-----------------------------------------------------------------------------+

Is it expected to run this hot, at 93 degrees, while transcoding for a while? When not in use, the temperature drops to 41 degrees:
Code:
# nvidia-smi
Tue Dec 13 23:01:33 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:03:00.0 Off |                  Off |
| N/A   41C    P8     6W /  75W |      2MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

"After the Bluefin 22.12.0 upgrade, all transcoding issues are fixed, no more crashes. I've been force transcoding for over 5 hours without issues."
My Quadro P400 runs at 52C while transcoding 4K 10-bit to 1080p and 34C at idle, so your temps are sky high.
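To see whether 93C is a sustained plateau or just a brief spike, temperature and power draw can be logged while a transcode runs. A minimal sketch:
Code:
# Print temperature, power draw and utilization every 5 seconds (Ctrl-C to stop)
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu --format=csv -l 5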
systemd-modules-load[2523]: Failed to find module 'nvidia-drm'
"Isn't the Tesla P4 relying on case airflow for cooling, without a fan of its own?"
I had to add a fan that pulls air across the card; now I run at 38 degrees at idle. I don't think these cards have a fan; they rely purely on server ventilation. That nvidia-drm message is unrelated; I get it too and everything works properly.