Hello,
Having just learned about the amd-gpu-plugin in comments on the referenced feature request, I attempted to follow along on my system and haven't had any success.
Before starting my own thread or submitting an issue, I wanted to rule out my gut instinct that my GPU is just too new. I've tried both "nfd" and "labeler" within the truecharts chart, as well as the manual daemon-set approach suggested by @FrostyCat, yet the GPU allocator remains grayed out.
- labeler=enabled fails to start anything and remains in the "stopped" state
- nfd=enabled starts and runs successfully with no errors in the master and worker pod logs
- selecting neither option fails to start anything and again remains stopped
- (after removing the truechart charts) manually installing amd.yaml from above successfully installs and starts with no logs of note
When I initially installed the plugin from truecharts I selected nfd, which resulted in multiple Kubernetes failures, with 'systemctl status k3s' spamming "waiting for control-plane node agent startup truenas." Resetting the cluster and retrying the amd plugin first seemed to work around that issue, but I've still gotten nowhere.
CPU: Ryzen 9 5900 - no iGPU
System board: ASRock Rack X570 AM4 board with built-in ASPEED "gpu" set as the primary display in the BIOS. No PCIe IDs are isolated.
feature.node.kubernetes.io/pci-0300_1002.present=true
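For what it's worth, this is how I read that label. The snippet below is only a minimal sketch: it assumes the label follows nfd's default `pci-<class>_<vendor>.present` naming, and `parse_pci_label` and `LABEL_RE` are names I made up for illustration.

```python
import re

# Assumption: nfd's default PCI label layout, "pci-<class>_<vendor>.present=<value>".
LABEL_RE = re.compile(
    r"^feature\.node\.kubernetes\.io/pci-"
    r"(?P<cls>[0-9a-f]{4})_(?P<vendor>[0-9a-f]{4})\.present=(?P<value>\w+)$"
)

def parse_pci_label(label: str) -> dict:
    """Split a PCI presence label into its class, vendor, and boolean value."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"not a recognized PCI presence label: {label!r}")
    return {
        "class": match["cls"],        # 0300 = VGA compatible controller
        "vendor": match["vendor"],    # 1002 = AMD/ATI
        "present": match["value"] == "true",
    }

info = parse_pci_label("feature.node.kubernetes.io/pci-0300_1002.present=true")
print(info)
```

If that reading is right, the label only says that a display-class device from vendor 1002 is on the PCI bus. It carries no device ID (73e3 in my case) and says nothing about whether the kernel driver actually bound to the card.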
---
My GPU is the latest-gen RDNA2 Radeon Pro W6600. I know other parts of TrueNAS recognize it as a GPU, since it shows up in the PCI isolation menu (and it works fine when passed to a VM). TrueNAS listing the card as "Advanced Micro Devices, Inc. [AMD/ATI] Device 73e3" leads me to believe I just missed out on having TrueNAS's kernel (5.10) support my card (kernel 5.11 introduced support for "Dimgrey Cavefish"), and that this is the root cause of all my issues getting the device plugin working.
Is that the case? I'm not sure where to start otherwise. Thank you for your time.