I upgraded from Angelfish 22.02.2.1 to Bluefin 22.12-RC.1 yesterday and am experiencing the same issue.
AMD Ryzen 5 5600G
Asus ROG Strix B550-i
Previously, I used the TrueCharts amd-gpu app to enable GPU support. Since TrueNAS SCALE 'natively' supports this now, TrueCharts has removed that app from their catalog.
TrueNAS SCALE uses a Docker image from ROCm that replaces what the TrueCharts app did:
https://hub.docker.com/r/rocm/k8s-device-plugin
I can verify that it is running. However, the container log shows that it's not able to register the GPU.
Code:
I1208 22:08:43.115131 1 main.go:305] AMD GPU device plugin for Kubernetes
I1208 22:08:43.115418 1 main.go:305] ./k8s-device-plugin version v1.18.1-12-g939a8a0
I1208 22:08:43.115425 1 main.go:305] hwloc: _VERSION: 2.8.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I1208 22:08:43.115470 1 manager.go:42] Starting device plugin manager
I1208 22:08:43.115476 1 manager.go:46] Registering for system signal notifications
I1208 22:08:43.115818 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I1208 22:08:43.115873 1 manager.go:60] Starting Discovery on new plugins
I1208 22:08:43.115878 1 manager.go:66] Handling incoming signals
I1208 22:08:43.115890 1 manager.go:71] Received new list of plugins: [gpu]
I1208 22:08:43.120537 1 manager.go:110] Adding a new plugin "gpu"
I1208 22:08:43.120593 1 plugin.go:64] gpu: Starting plugin server
I1208 22:08:43.120610 1 plugin.go:127] gpu: Registering the DPI with Kubelet
I1208 22:08:43.121549 1 plugin.go:139] gpu: Registration for endpoint amd.com_gpu
E1208 22:08:53.125142 1 plugin.go:156] gpu: Registration failed: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
E1208 22:08:53.125278 1 plugin.go:157] gpu: Make sure that the DevicePlugins feature gate is enabled and kubelet running
E1208 22:08:53.125332 1 plugin.go:78] error registering with device plugin manager: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
E1208 22:08:53.125466 1 manager.go:214] Failed to start plugin's "gpu" server, atempt 1 ouf of 3 waiting 3000000000 before next try: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
I1208 22:08:56.126314 1 plugin.go:64] gpu: Starting plugin server
I1208 22:08:56.126349 1 plugin.go:127] gpu: Registering the DPI with Kubelet
I1208 22:08:56.126538 1 plugin.go:139] gpu: Registration for endpoint amd.com_gpu
E1208 22:09:06.127989 1 plugin.go:156] gpu: Registration failed: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
E1208 22:09:06.128015 1 plugin.go:157] gpu: Make sure that the DevicePlugins feature gate is enabled and kubelet running
E1208 22:09:06.128084 1 plugin.go:78] error registering with device plugin manager: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
E1208 22:09:06.128100 1 manager.go:214] Failed to start plugin's "gpu" server, atempt 2 ouf of 3 waiting 3000000000 before next try: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
I1208 22:09:09.130328 1 plugin.go:64] gpu: Starting plugin server
I1208 22:09:09.130348 1 plugin.go:127] gpu: Registering the DPI with Kubelet
I1208 22:09:09.130490 1 plugin.go:139] gpu: Registration for endpoint amd.com_gpu
E1208 22:09:19.131670 1 plugin.go:156] gpu: Registration failed: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
E1208 22:09:19.131695 1 plugin.go:157] gpu: Make sure that the DevicePlugins feature gate is enabled and kubelet running
E1208 22:09:19.131749 1 plugin.go:78] error registering with device plugin manager: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
I1208 22:09:19.131763 1 manager.go:211] Failed to start plugin's "gpu" server, within given 3 tries: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
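For anyone comparing their own plugin log against this one, here is a minimal Python sketch that summarizes klog-style lines like the ones above. It counts lines per severity and measures how long each registration attempt waited before failing. The function names (`parse_klog`, `summarize`, `registration_waits`) are made up for illustration and are not part of the plugin or of Kubernetes. The regex assumes the header layout seen in this log.

```python
import re
from collections import Counter
from datetime import datetime

# Header layout as it appears in the log above:
#   <level><mmdd> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>] <message>
KLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<pid>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def parse_klog(text):
    """Return one dict per line that matches the header layout; skip the rest."""
    records = []
    for raw in text.splitlines():
        match = KLOG_LINE.match(raw.strip())
        if match:
            records.append(match.groupdict())
    return records


def summarize(records):
    """Count lines per severity letter and count registration failures."""
    levels = Counter(r["level"] for r in records)
    failures = [r for r in records if "Registration failed" in r["msg"]]
    return {"levels": dict(levels), "registration_failures": len(failures)}


def registration_waits(records):
    """Seconds from each 'Registration for endpoint' line to the next failure.

    Only the time of day is compared, so a log that crosses midnight
    would give a wrong (negative) value.
    """
    waits, started = [], None
    for r in records:
        stamp = datetime.strptime(r["time"], "%H:%M:%S.%f")
        if "Registration for endpoint" in r["msg"]:
            started = stamp
        elif "Registration failed" in r["msg"] and started is not None:
            waits.append((stamp - started).total_seconds())
            started = None
    return waits


# Three lines copied from the first attempt in the log above.
SAMPLE = """\
I1208 22:08:43.121549 1 plugin.go:139] gpu: Registration for endpoint amd.com_gpu
E1208 22:08:53.125142 1 plugin.go:156] gpu: Registration failed: rpc error: code = Unknown desc = failed to dial device plugin: context deadline exceeded
E1208 22:08:53.125278 1 plugin.go:157] gpu: Make sure that the DevicePlugins feature gate is enabled and kubelet running
"""

if __name__ == "__main__":
    records = parse_klog(SAMPLE)
    print(summarize(records))
    print(registration_waits(records))
```

Run against the full log, this shows each of the three attempts giving up after roughly ten seconds, which looks like a timeout rather than an immediate rejection.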
I noticed the k3s config did not contain the DevicePlugins feature gate, so I went ahead and set that under kubelet-arg. Yet, I'm still seeing this same error in the gpu plugin logs.
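One further check that may be worth trying, since the feature gate alone did not help: look at which Unix sockets exist in the kubelet device plugin directory. The sketch below is only a diagnostic under stated assumptions. `/var/lib/kubelet/device-plugins` and `kubelet.sock` are the upstream kubelet defaults, and I have not confirmed that TrueNAS SCALE's k3s uses the same location. The helper names `list_sockets` and `report` are hypothetical.

```python
import os
import stat

# Assumption: upstream kubelet default. The directory on a given
# TrueNAS SCALE / k3s install may differ.
DEVICE_PLUGIN_DIR = "/var/lib/kubelet/device-plugins"


def list_sockets(directory):
    """Return the sorted names of Unix socket files directly inside `directory`."""
    names = []
    for entry in os.scandir(directory):
        mode = entry.stat(follow_symlinks=False).st_mode
        if stat.S_ISSOCK(mode):
            names.append(entry.name)
    return sorted(names)


def report(directory=DEVICE_PLUGIN_DIR):
    """Describe whether kubelet.sock and any plugin sockets are present."""
    if not os.path.isdir(directory):
        return f"{directory}: directory not found"
    sockets = list_sockets(directory)
    kubelet = "present" if "kubelet.sock" in sockets else "missing"
    others = [name for name in sockets if name != "kubelet.sock"]
    return f"{directory}: kubelet.sock {kubelet}; plugin sockets: {others}"


if __name__ == "__main__":
    # Typically needs root on the host to read the directory.
    print(report())
```

If `kubelet.sock` is missing, or the plugin's own socket never appears next to it, that would suggest the container and the kubelet are not sharing the same directory. That is a guess at a possible cause, not a confirmed diagnosis.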
Hope this provides useful info for anyone else who is troubleshooting this.