Bluefin Applications GPU allocation missing

hmak604

Cadet
Joined
Nov 26, 2022
Messages
5
Hey team,

I updated to the new Bluefin RC1.
Previously, on 22.02.4, my iGPU was used by Jellyfin, Photoprism, and Immich; I allocated one GPU to each via the dropdown in the TrueCharts app settings.
After updating to Bluefin, this GPU allocation can no longer be done: the dropdown shows 0 GPUs available to allocate. Any ideas why?

Things I've done:
Ran the following:
Code:
lspci -k | grep -EA3 'VGA|3D|Display'
00:02.0 VGA compatible controller: Intel Corporation Device 4c8a (rev 01)
        DeviceName: Onboard - Video
        Subsystem: ASUSTeK Computer Inc. Device 8694
        Kernel driver in use: i915
midclt call system.advanced.update '{"kernel_extra_options": "i915.force_probe=4a8c" }'
reboot


Hardware is
Processor: 11900T ES (Rocket Lake)
 

hmak604

Cadet
Joined
Nov 26, 2022
Messages
5
Sorry, I meant

midclt call system.advanced.update '{"kernel_extra_options": "i915.force_probe=4c8a" }'

of course
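
The 4a8c/4c8a mix-up above is easy to make when typing the ID by hand. Here is a minimal sketch that reads the device ID out of the lspci output instead. The function names are made up for illustration, and it assumes lspci prints the raw "Device xxxx" form shown in the first post (if your lspci shows a marketing name instead, `lspci -nn` prints the numeric IDs in brackets and the pattern would need adjusting).

```shell
#!/bin/sh
# Hypothetical helpers, not part of TrueNAS or pciutils.

# Read `lspci -k` style text on stdin and print the 4-digit device ID
# of the first "VGA compatible controller" line.
extract_vga_device_id() {
    sed -n 's/.*VGA compatible controller:.*Device \([0-9a-fA-F]\{4\}\).*/\1/p' | head -n 1
}

# Build the JSON payload used with `midclt call system.advanced.update`
# in the posts above. $1 = PCI device ID (e.g. 4c8a).
build_force_probe_payload() {
    printf '{"kernel_extra_options": "i915.force_probe=%s"}\n' "$1"
}

# Usage on the actual box (left commented out so nothing runs by accident):
#   id=$(lspci -k | extract_vga_device_id)
#   midclt call system.advanced.update "$(build_force_probe_payload "$id")"
```

Print the payload and check it by eye before passing it to midclt, since this overwrites whatever kernel_extra_options was set to before.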
 

samyapsul

Cadet
Joined
May 7, 2022
Messages
6
Upgraded from Angelfish 22.02.2.1 to Bluefin 22.12-RC.1 yesterday, also experiencing this same issue.

AMD Ryzen 5 5600G
Asus ROG Strix B550-i

Previously, I used TrueCharts' amd-gpu app to enable GPU support. Since TrueNAS SCALE 'natively' supports this now, TrueCharts has removed that app from their catalog.

TrueNAS SCALE uses a Docker image from ROCm that replaces what the TrueCharts app did: https://hub.docker.com/r/rocm/k8s-device-plugin
I can verify that it is running. However, the container log shows that it's not able to register the GPU.

Code:
I1208 22:08:43.115131       1 main.go:305] AMD GPU device plugin for Kubernetes
I1208 22:08:43.115418       1 main.go:305] ./k8s-device-plugin version v1.18.1-12-g939a8a0
I1208 22:08:43.115425       1 main.go:305] hwloc: _VERSION: 2.8.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I1208 22:08:43.115470       1 manager.go:42] Starting device plugin manager
I1208 22:08:43.115476       1 manager.go:46] Registering for system signal notifications
I1208 22:08:43.115818       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I1208 22:08:43.115873       1 manager.go:60] Starting Discovery on new plugins
I1208 22:08:43.115878       1 manager.go:66] Handling incoming signals
I1208 22:08:43.115890       1 manager.go:71] Received new list of plugins: [gpu]
I1208 22:08:43.120537       1 manager.go:110] Adding a new plugin "gpu"
I1208 22:08:43.120593       1 plugin.go:64] gpu: Starting plugin server
I1208 22:08:43.120610       1 plugin.go:127] gpu: Registering the DPI with Kubelet
I1208 22:08:43.121549       1 plugin.go:139] gpu: Registration for endpoint amd.com_gpu
E1208 22:08:53.125142       1 plugin.go:156] gpu: Registration failed: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
E1208 22:08:53.125278       1 plugin.go:157] gpu: Make sure that the DevicePlugins feature gate is enabled and kubelet running
E1208 22:08:53.125332       1 plugin.go:78] error registering with device plugin manager: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
E1208 22:08:53.125466       1 manager.go:214] Failed to start plugin's "gpu" server, atempt 1 ouf of 3 waiting 3000000000 before next try: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
I1208 22:08:56.126314       1 plugin.go:64] gpu: Starting plugin server
I1208 22:08:56.126349       1 plugin.go:127] gpu: Registering the DPI with Kubelet
I1208 22:08:56.126538       1 plugin.go:139] gpu: Registration for endpoint amd.com_gpu
E1208 22:09:06.127989       1 plugin.go:156] gpu: Registration failed: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
E1208 22:09:06.128015       1 plugin.go:157] gpu: Make sure that the DevicePlugins feature gate is enabled and kubelet running
E1208 22:09:06.128084       1 plugin.go:78] error registering with device plugin manager: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
E1208 22:09:06.128100       1 manager.go:214] Failed to start plugin's "gpu" server, atempt 2 ouf of 3 waiting 3000000000 before next try: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
I1208 22:09:09.130328       1 plugin.go:64] gpu: Starting plugin server
I1208 22:09:09.130348       1 plugin.go:127] gpu: Registering the DPI with Kubelet
I1208 22:09:09.130490       1 plugin.go:139] gpu: Registration for endpoint amd.com_gpu
E1208 22:09:19.131670       1 plugin.go:156] gpu: Registration failed: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
E1208 22:09:19.131695       1 plugin.go:157] gpu: Make sure that the DevicePlugins feature gate is enabled and kubelet running
E1208 22:09:19.131749       1 plugin.go:78] error registering with device plugin manager: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
I1208 22:09:19.131763       1 manager.go:211] Failed to start plugin's "gpu" server, within given 3 tries: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded



I noticed the k3s config did not contain the DevicePlugins feature gate, so I went ahead and set that under kubelet-arg. Yet, I'm still seeing this same error in the gpu plugin logs.

Hope this provides useful info for anyone else that is also troubleshooting this matter.
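
For anyone comparing their own plugin logs against the one above, here is a rough sketch that boils a log down to a one-line verdict. It is only a heuristic based on the messages seen in this thread: it counts "Registration failed" lines and treats a "Watching GPU with bus ID" line as a sign that registration went through. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: summarize a k8s-device-plugin log read from stdin.
summarize_plugin_log() {
    plugin_log=$(cat)
    # grep -c exits non-zero when the count is 0, hence the `|| true`.
    failures=$(printf '%s\n' "$plugin_log" | grep -c 'Registration failed' || true)
    if printf '%s\n' "$plugin_log" | grep -q 'Watching GPU with bus ID'; then
        echo "registered (failed attempts: $failures)"
    elif [ "$failures" -gt 0 ]; then
        echo "not registered (failed attempts: $failures)"
    else
        echo "inconclusive"
    fi
}

# Usage (namespace and pod name are placeholders for your own values):
#   k3s kubectl logs -n <namespace> <plugin-pod> | summarize_plugin_log
```

Run against the log posted above, this would report three failed attempts and no registration.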
 

thadrumr

Cadet
Joined
Apr 28, 2021
Messages
7
I had the same issue. I recently upgraded from 22.02.4, where I had a working AMD RX 550 allocated to a Jellyfin app. After upgrading to 22.12, the GPU is no longer available to allocate.
 

samyapsul

Cadet
Joined
May 7, 2022
Messages
6
Folks, I was able to get GPU passthrough working again by using an older version of the ROCm device plugin container (tagged 1.18.0). I'm assuming later versions will not work, though I don't know why. I tested 1.19 and it produced the same output as shown in my earlier post above.

Code:
I1221 09:18:08.315406       1 main.go:296] AMD GPU device plugin for Kubernetes
I1221 09:18:08.315456       1 main.go:296] ./k8s-device-plugin version v1.18-0-g3775549
I1221 09:18:08.315460       1 main.go:296] hwloc: _VERSION: 2.2.0, _API_VERSION: 0x00020100, _COMPONENT_ABI: 6, Runtime: 0x00020100
I1221 09:18:08.315469       1 manager.go:42] Starting device plugin manager
I1221 09:18:08.315474       1 manager.go:46] Registering for system signal notifications
I1221 09:18:08.315723       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I1221 09:18:08.315845       1 manager.go:60] Starting Discovery on new plugins
I1221 09:18:08.315856       1 manager.go:66] Handling incoming signals
I1221 09:18:08.315877       1 manager.go:71] Received new list of plugins: [gpu]
I1221 09:18:08.315970       1 manager.go:110] Adding a new plugin "gpu"
I1221 09:18:08.315988       1 plugin.go:64] gpu: Starting plugin server
I1221 09:18:08.315993       1 plugin.go:95] gpu: Starting the DPI gRPC server
I1221 09:18:08.316241       1 plugin.go:113] gpu: Serving requests...
I1221 09:18:18.317854       1 plugin.go:129] gpu: Registering the DPI with Kubelet
I1221 09:18:18.318096       1 plugin.go:141] gpu: Registration for endpoint amd.com_gpu
I1221 09:18:18.321137       1 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:08:00.0
I1221 09:18:18.345696       1 main.go:149] Watching GPU with bus ID: 0000:08:00.0 NUMA Node: []
E1221 09:18:18.345711       1 main.go:151] No NUMA node found with bus ID: 0000:08:00.0


Follow FrostyCat's guide to create your own k3s pod with the 1.18.0 version: https://www.truenas.com/community/t...-for-application-container.97863/#post-675951

In the provided yaml file, add a colon followed by 1.18.0 after the image name:
Code:
      - image: rocm/k8s-device-plugin:1.18.0
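
If you'd rather not edit the file by hand, the same change can be sketched with sed. This assumes the manifest from the guide contains an untagged "image: rocm/k8s-device-plugin" line. The function name and the file names in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the manifest with the plugin image pinned to a tag.
# $1 = path to the manifest, $2 = tag (e.g. 1.18.0)
# Only an untagged image line is changed; an already tagged line is left alone.
pin_plugin_image() {
    sed "s|\(image: rocm/k8s-device-plugin\)[[:space:]]*\$|\1:$2|" "$1"
}

# Usage (writes to a new file so the original stays untouched):
#   pin_plugin_image amd-gpu-plugin.yaml 1.18.0 > amd-gpu-plugin-pinned.yaml
```

Diff the two files afterwards to confirm that only the image line changed.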
 