Upgraded to Bluefin, no GPU passthrough

antsv8

Cadet
Joined
Jun 1, 2022
Messages
5
Hey Peeps,

Upgraded to Bluefin today, relatively painless, but I have a small issue: no allocatable GPUs...

Running

k3s kubectl describe nodes

Snip --->


Capacity:
  amd.com/gpu:        0
  cpu:                48
  ephemeral-storage:  537192704Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263996684Ki
  pods:               250
Allocatable:
  amd.com/gpu:        0
  cpu:                48
  ephemeral-storage:  522581062042
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263996684Ki
  pods:               250


---< Snip

yet when I run

lspci | grep VGA


01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH (rev 01)
84:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
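
Side note, in case it helps narrow this down: lspci -k shows which kernel driver is bound to each card, and as far as I understand the ROCm device plugin it only registers GPUs that the amdgpu driver has claimed and exposed via /dev/kfd and /dev/dri (treat that as my assumption). Assuming the card is still at 84:00.0:

Code:
# lspci -k -s 84:00.0
# ls -l /dev/kfd /dev/dri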

This was working with Angelfish. Any pointers / known bugs?


Cheers

Anthony
 

antsv8

Cadet
Joined
Jun 1, 2022
Messages
5
Additionally, I can see the GPU if I spin up a VM...

[Screenshot: the GPU shows up in the VM device list]
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
No issues with my Nvidia Tesla P4:
Code:
# k3s kubectl get nodes
NAME         STATUS   ROLES                  AGE    VERSION
ix-truenas   Ready    control-plane,master   167d   v1.25.3+k3s-9afcd6b9-dirty
# k3s kubectl describe node ix-truenas
Name:               ix-truenas
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    egress.k3s.io/cluster=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ix-truenas
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node-role.kubernetes.io/master=true
                    openebs.io/nodeid=ix-truenas
                    openebs.io/nodename=ix-truenas
Annotations:        csi.volume.kubernetes.io/nodeid: {"zfs.csi.openebs.io":"ix-truenas"}
                    k3s.io/node-args:
                      ["server","--cluster-cidr","172.16.0.0/16","--cluster-dns","172.17.0.10","--data-dir","/mnt/software/ix-applications/k3s","--kube-apiserve...
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 06 Jul 2022 03:13:23 -0400
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ix-truenas
  AcquireTime:     <unset>
  RenewTime:       Tue, 20 Dec 2022 17:21:41 -0500
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 20 Dec 2022 17:19:24 -0500   Fri, 04 Nov 2022 01:30:41 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 20 Dec 2022 17:19:24 -0500   Fri, 04 Nov 2022 01:30:41 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 20 Dec 2022 17:19:24 -0500   Fri, 04 Nov 2022 01:30:41 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 20 Dec 2022 17:19:24 -0500   Tue, 20 Dec 2022 11:01:41 -0500   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.1.8
  Hostname:    ix-truenas
Capacity:
  cpu:                12
  ephemeral-storage:  447680896Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             264065208Ki
  nvidia.com/gpu:     1
  pods:               250
Allocatable:
  cpu:                12
  ephemeral-storage:  435503975288
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             264065208Ki
  nvidia.com/gpu:     1
  pods:               250
System Info:
  Machine ID:                 f0bd1ecedb434c9aae8cdf44057a894b
  System UUID:                4c4c4544-0056-4c10-8033-c2c04f463032
  Boot ID:                    08f28163-5638-4ca2-8189-8bd296173c32
  Kernel Version:             5.15.79+truenas
  OS Image:                   Debian GNU/Linux 11 (bullseye)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://Unknown
  Kubelet Version:            v1.25.3+k3s-9afcd6b9-dirty
  Kube-Proxy Version:         v1.25.3+k3s-9afcd6b9-dirty
PodCIDR:                      172.16.0.0/16
PodCIDRs:                     172.16.0.0/16
Non-terminated Pods:          (25 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  kube-system                 openebs-zfs-controller-0                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         6h32m
  kube-system                 openebs-zfs-node-d87jm                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         6h32m
  metallb-system              speaker-8t669                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         21h
  prometheus-operator         prometheus-operator-5c7445d877-tg5sj        100m (0%)     200m (1%)   100Mi (0%)       200Mi (0%)     21h
  kube-system                 coredns-d76bd69b-vg9m5                      100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     21h
  kube-system                 nvidia-device-plugin-daemonset-hvmm6        0 (0%)        0 (0%)      0 (0%)           0 (0%)         21h
  metallb-system              controller-7597dd4f7b-7dr7t                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         21h
  cnpg-system                 cnpg-controller-manager-854876b995-tn82x    100m (0%)     100m (0%)   100Mi (0%)       200Mi (0%)     21h
  ix-photoprism               photoprism-mariadb-0                        10m (0%)      4 (33%)     50Mi (0%)        8Gi (3%)       18h
  ix-plex                     plex-7b6b74f8d9-m4f5d                       10m (0%)      4 (33%)     50Mi (0%)        32Gi (12%)     6h32m
  kube-system                 svclb-plex-438a5308-l77qk                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         6h17m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                390m (3%)   36300m (302%)
  memory             720Mi (0%)  107066Mi (41%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     1           1
Events:              <none>
 

Triumph

Dabbler
Joined
May 14, 2014
Messages
12
Apologies as I'm just getting started on SCALE, but did you already go through System Settings > Advanced > Isolated GPU Device(s) and set your GPU in the TrueNAS configuration?
Then, once that is set, go into your App/VM and pass the GPU through?
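
If it helps, you can also check from the shell whether any GPU is currently isolated. Assuming the middleware field is still called isolated_gpu_pci_ids (going from memory here), something like:

Code:
# midclt call system.advanced.config | jq '.isolated_gpu_pci_ids'

An empty list means nothing is isolated; for Apps you generally want the GPU not isolated, since isolation reserves it for VM passthrough.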
 

thadrumr

Cadet
Joined
Apr 28, 2021
Messages
7
I had the same problem with my prod system. When running SCALE 22.02.4 I was able to allocate my RX 550 with no issue and had working transcoding with Jellyfin. Upon upgrading to 22.12.0 the GPU is no longer allocatable. This seems to be an issue with the AMD GPU plugin Kubernetes pod that is built into the system. There is another thread on this same issue and a Jira ticket that is blocked for some reason; the ticket is NAS-119396. Also, to get the GPU working I did not have to isolate the GPU or anything: it was available to allocate as soon as I added it to the machine and booted up. I should also mention my system has a built-in ASPEED GPU.
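
For anyone who wants to dig further, the plugin pod's own logs should say whether it actually detected the card. Assuming the built-in DaemonSet in kube-system is named amdgpu-device-plugin-daemonset, something like this should show them:

Code:
# k3s kubectl -n kube-system logs daemonset/amdgpu-device-plugin-daemonset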
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
Nice find on the version, @samyapsul! NAS-119396 was updated two hours ago with the same fix. @antsv8 and @thadrumr, I'm curious whether the amdgpu-device-plugin-daemonset pod is running in kube-system? I own an NVIDIA GPU, so I cannot test. Can anyone post the output of the following commands:
Code:
# midclt call device.get_gpus | jq
# k3s kubectl get pods -n kube-system | grep amd

If the amdgpu-device-plugin-daemonset pod is not running, then iX Systems should look into it. If the pod is running, we can temporarily apply the version fix with the simplified procedure listed below, using the official Radeon DaemonSet. You can use NAS-119396 to report additional findings, I presume.

Run the commands as root:
Code:
# curl -sO https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
# sed -i 's|k8s-device-plugin|k8s-device-plugin:1.18.0|' k8s-ds-amdgpu-dp.yaml
# k3s kubectl apply -f k8s-ds-amdgpu-dp.yaml -n kube-system
# k3s kubectl get pods -n kube-system
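
Once the new pod is Running, the GPU should show up as allocatable again; a quick way to confirm on an AMD card:
Code:
# k3s kubectl describe node | grep amd.com/gpu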

To delete the DaemonSet, run the command as root:
Code:
# k3s kubectl delete -f k8s-ds-amdgpu-dp.yaml -n kube-system
 

thadrumr

Cadet
Joined
Apr 28, 2021
Messages
7
This worked for me. I had to reboot, and I also had to append --namespace=kube-system to the command that applies the yaml. It also obviously doesn't persist across reboots. I was able to get my Jellyfin working and verified with radeontop that the GPU is being used.
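
Since the workaround doesn't persist across reboots, one option is a small post-init script (System Settings > Advanced > Init/Shutdown Scripts, Type = Script, When = Post Init) that re-applies the DaemonSet once k3s is up. Rough, untested sketch; the yaml path below is just an example, keep it somewhere persistent on your pool:

Code:
#!/bin/bash
# Re-apply the pinned AMD device plugin DaemonSet after boot.
# Example path; adjust to wherever you keep the yaml.
YAML=/mnt/tank/scripts/k8s-ds-amdgpu-dp.yaml

# Wait until the k3s API answers before applying.
until k3s kubectl get nodes >/dev/null 2>&1; do
    sleep 10
done

k3s kubectl apply -f "$YAML" -n kube-system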
 

antsv8

Cadet
Joined
Jun 1, 2022
Messages
5
Hey Daisuke,

First snip --->

root@scale:/tmp/k3# midclt call device.get_gpus | jq
[
  {
    "addr": {
      "pci_slot": "0000:01:00.1",
      "domain": "0000",
      "bus": "01",
      "slot": "00"
    },
    "description": "Matrox Electronics Systems Ltd. MGA G200EH",
    "devices": [
      {
        "pci_id": "103C:3306",
        "pci_slot": "0000:01:00.0",
        "vm_pci_slot": "pci_0000_01_00_0"
      },
      {
        "pci_id": "102B:0533",
        "pci_slot": "0000:01:00.1",
        "vm_pci_slot": "pci_0000_01_00_1"
      },
      {
        "pci_id": "103C:3307",
        "pci_slot": "0000:01:00.2",
        "vm_pci_slot": "pci_0000_01_00_2"
      },
      {
        "pci_id": "103C:3300",
        "pci_slot": "0000:01:00.4",
        "vm_pci_slot": "pci_0000_01_00_4"
      }
    ],
    "vendor": null,
    "uses_system_critical_devices": false,
    "available_to_host": true
  },
  {
    "addr": {
      "pci_slot": "0000:84:00.0",
      "domain": "0000",
      "bus": "84",
      "slot": "00"
    },
    "description": "Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]",
    "devices": [
      {
        "pci_id": "1002:67DF",
        "pci_slot": "0000:84:00.0",
        "vm_pci_slot": "pci_0000_84_00_0"
      },
      {
        "pci_id": "1002:AAF0",
        "pci_slot": "0000:84:00.1",
        "vm_pci_slot": "pci_0000_84_00_1"
      }
    ],
    "vendor": "AMD",
    "uses_system_critical_devices": false,
    "available_to_host": false
  }
]


---< Snip

Snip 2 ---->

root@scale:/tmp/k3# k3s kubectl get pods -n kube-system | grep amd
amdgpu-device-plugin-daemonset-2tjbq 1/1 Running 0 3m37s

---< Snip 2
 

antsv8

Cadet
Joined
Jun 1, 2022
Messages
5
Rebooted...

Now it works.

Snip --->


root@scale:/tmp/k8# k3s kubectl get pods -n kube-system | grep amd
amdgpu-device-plugin-daemonset-drzcx 1/1 Running 0 3m18s
root@scale:/tmp/k8# midclt call device.get_gpus | jq
[
  {
    "addr": {
      "pci_slot": "0000:01:00.1",
      "domain": "0000",
      "bus": "01",
      "slot": "00"
    },
    "description": "Matrox Electronics Systems Ltd. MGA G200EH",
    "devices": [
      {
        "pci_id": "103C:3306",
        "pci_slot": "0000:01:00.0",
        "vm_pci_slot": "pci_0000_01_00_0"
      },
      {
        "pci_id": "102B:0533",
        "pci_slot": "0000:01:00.1",
        "vm_pci_slot": "pci_0000_01_00_1"
      },
      {
        "pci_id": "103C:3307",
        "pci_slot": "0000:01:00.2",
        "vm_pci_slot": "pci_0000_01_00_2"
      },
      {
        "pci_id": "103C:3300",
        "pci_slot": "0000:01:00.4",
        "vm_pci_slot": "pci_0000_01_00_4"
      }
    ],
    "vendor": null,
    "uses_system_critical_devices": false,
    "available_to_host": true
  },
  {
    "addr": {
      "pci_slot": "0000:84:00.0",
      "domain": "0000",
      "bus": "84",
      "slot": "00"
    },
    "description": "Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]",
    "devices": [
      {
        "pci_id": "1002:67DF",
        "pci_slot": "0000:84:00.0",
        "vm_pci_slot": "pci_0000_84_00_0"
      },
      {
        "pci_id": "1002:AAF0",
        "pci_slot": "0000:84:00.1",
        "vm_pci_slot": "pci_0000_84_00_1"
      }
    ],
    "vendor": "AMD",
    "uses_system_critical_devices": false,
    "available_to_host": true
  }
]


---< Snip end

Now it works... job well done!

Cheers
 