Upgraded to Bluefin, no GPU passthrough

antsv8

Cadet
Joined
Jun 1, 2022
Messages
5
Hey Peeps,

Upgraded to Bluefin today, relatively painless, but I have a small issue: no allocatable GPUs...

Running

k3s kubectl describe nodes

Snip --->


Capacity:
  amd.com/gpu:        0
  cpu:                48
  ephemeral-storage:  537192704Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263996684Ki
  pods:               250
Allocatable:
  amd.com/gpu:        0
  cpu:                48
  ephemeral-storage:  522581062042
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263996684Ki
  pods:               250


---< Snip

yet when I run

lspci | grep VGA


01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH (rev 01)
84:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
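
Side note, in case it helps narrow this down: lspci -k shows which kernel driver is bound to each card, and as far as I understand the ROCm device plugin it only registers GPUs that the amdgpu driver has claimed and exposed via /dev/kfd and /dev/dri (treat that as my assumption). Assuming the card is still at 84:00.0:

Code:
# lspci -k -s 84:00.0
# ls -l /dev/kfd /dev/dri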

This was working with Angelfish. Any pointers / known bugs?


Cheers

Anthony
 

antsv8

Cadet
Joined
Jun 1, 2022
Messages
5
Additionally, I can see the GPU if I spin up a VM...

[Screenshot: the GPU shows up in the VM device list]
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
No issues with my Nvidia Tesla P4:
Code:
# k3s kubectl get nodes
NAME         STATUS   ROLES                  AGE    VERSION
ix-truenas   Ready    control-plane,master   167d   v1.25.3+k3s-9afcd6b9-dirty
# k3s kubectl describe node ix-truenas
Name:               ix-truenas
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    egress.k3s.io/cluster=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ix-truenas
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node-role.kubernetes.io/master=true
                    openebs.io/nodeid=ix-truenas
                    openebs.io/nodename=ix-truenas
Annotations:        csi.volume.kubernetes.io/nodeid: {"zfs.csi.openebs.io":"ix-truenas"}
                    k3s.io/node-args:
                      ["server","--cluster-cidr","172.16.0.0/16","--cluster-dns","172.17.0.10","--data-dir","/mnt/software/ix-applications/k3s","--kube-apiserve...
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 06 Jul 2022 03:13:23 -0400
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ix-truenas
  AcquireTime:     <unset>
  RenewTime:       Tue, 20 Dec 2022 17:21:41 -0500
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 20 Dec 2022 17:19:24 -0500   Fri, 04 Nov 2022 01:30:41 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 20 Dec 2022 17:19:24 -0500   Fri, 04 Nov 2022 01:30:41 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 20 Dec 2022 17:19:24 -0500   Fri, 04 Nov 2022 01:30:41 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 20 Dec 2022 17:19:24 -0500   Tue, 20 Dec 2022 11:01:41 -0500   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.1.8
  Hostname:    ix-truenas
Capacity:
  cpu:                12
  ephemeral-storage:  447680896Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             264065208Ki
  nvidia.com/gpu:     1
  pods:               250
Allocatable:
  cpu:                12
  ephemeral-storage:  435503975288
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             264065208Ki
  nvidia.com/gpu:     1
  pods:               250
System Info:
  Machine ID:                 f0bd1ecedb434c9aae8cdf44057a894b
  System UUID:                4c4c4544-0056-4c10-8033-c2c04f463032
  Boot ID:                    08f28163-5638-4ca2-8189-8bd296173c32
  Kernel Version:             5.15.79+truenas
  OS Image:                   Debian GNU/Linux 11 (bullseye)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://Unknown
  Kubelet Version:            v1.25.3+k3s-9afcd6b9-dirty
  Kube-Proxy Version:         v1.25.3+k3s-9afcd6b9-dirty
PodCIDR:                      172.16.0.0/16
PodCIDRs:                     172.16.0.0/16
Non-terminated Pods:          (25 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  kube-system                 openebs-zfs-controller-0                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         6h32m
  kube-system                 openebs-zfs-node-d87jm                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         6h32m
  metallb-system              speaker-8t669                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         21h
  prometheus-operator         prometheus-operator-5c7445d877-tg5sj        100m (0%)     200m (1%)   100Mi (0%)       200Mi (0%)     21h
  kube-system                 coredns-d76bd69b-vg9m5                      100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     21h
  kube-system                 nvidia-device-plugin-daemonset-hvmm6        0 (0%)        0 (0%)      0 (0%)           0 (0%)         21h
  metallb-system              controller-7597dd4f7b-7dr7t                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         21h
  cnpg-system                 cnpg-controller-manager-854876b995-tn82x    100m (0%)     100m (0%)   100Mi (0%)       200Mi (0%)     21h
  ix-photoprism               photoprism-mariadb-0                        10m (0%)      4 (33%)     50Mi (0%)        8Gi (3%)       18h
  ix-plex                     plex-7b6b74f8d9-m4f5d                       10m (0%)      4 (33%)     50Mi (0%)        32Gi (12%)     6h32m
  kube-system                 svclb-plex-438a5308-l77qk                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         6h17m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                390m (3%)   36300m (302%)
  memory             720Mi (0%)  107066Mi (41%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     1           1
Events:              <none>
 

Triumph

Dabbler
Joined
May 14, 2014
Messages
12
Apologies as I'm just getting started on SCALE, but did you already go through System Settings > Advanced > Isolated GPU Device(s) and set your GPU in the TrueNAS configuration?
Then, once that is set, go into your App/VM and pass the GPU through?
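
If it helps, you can also check from the shell whether any GPU is currently isolated. Assuming the middleware field is still called isolated_gpu_pci_ids (going from memory here), something like:

Code:
# midclt call system.advanced.config | jq '.isolated_gpu_pci_ids'

An empty list means nothing is isolated; for Apps you generally want the GPU not isolated, since isolation reserves it for VM passthrough.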
 

thadrumr

Cadet
Joined
Apr 28, 2021
Messages
7
I had the same problem with my prod system. When running SCALE 22.02.4 I was able to allocate my RX 550 with no issue and had working transcoding with Jellyfin. Upon upgrading to 22.12.0 the GPU is no longer allocatable. This seems to be an issue with the AMD GPU plugin Kubernetes pod that is built into the system. There is another thread on this same issue and a Jira ticket that is blocked for some reason; the ticket is NAS-119396. Also, to get the GPU working I did not have to isolate the GPU or anything: it was available to allocate as soon as I added it to the machine and booted up. I should also mention my system has a built-in ASPEED GPU.
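
For anyone who wants to dig further, the plugin pod's own logs should say whether it actually detected the card. Assuming the built-in DaemonSet in kube-system is named amdgpu-device-plugin-daemonset, something like this should show them:

Code:
# k3s kubectl -n kube-system logs daemonset/amdgpu-device-plugin-daemonset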
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
Nice find on the version, @samyapsul! NAS-119396 was updated two hours ago with the same fix. @antsv8 and @thadrumr, I'm curious whether the amdgpu-device-plugin-daemonset pod is running in kube-system? I own an NVIDIA GPU, so I cannot test. Can anyone post the output of the following commands:
Code:
# midclt call device.get_gpus | jq
# k3s kubectl get pods -n kube-system | grep amd

If the amdgpu-device-plugin-daemonset pod is not running, then iX Systems should look into it. If the pod is running, we can temporarily apply the version fix with the simplified procedure listed below, using the official Radeon DaemonSet. You can use NAS-119396 to report additional findings, I presume.

Run the commands as root:
Code:
# curl -sO https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
# sed -i 's|k8s-device-plugin|k8s-device-plugin:1.18.0|' k8s-ds-amdgpu-dp.yaml
# k3s kubectl apply -f k8s-ds-amdgpu-dp.yaml -n kube-system
# k3s kubectl get pods -n kube-system
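
Once the new pod is Running, the GPU should show up as allocatable again; a quick way to confirm on an AMD card:
Code:
# k3s kubectl describe node | grep amd.com/gpu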

To delete the DaemonSet, run the command as root:
Code:
# k3s kubectl delete -f k8s-ds-amdgpu-dp.yaml -n kube-system
 

thadrumr

Cadet
Joined
Apr 28, 2021
Messages
7
This worked for me. I had to reboot, and I also had to append --namespace=kube-system to the command that applies the yaml. It also obviously doesn't persist across reboots. I was able to get my Jellyfin working and verified with radeontop that the GPU is being used.
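
Since the workaround doesn't persist across reboots, one option is a small post-init script (System Settings > Advanced > Init/Shutdown Scripts, Type = Script, When = Post Init) that re-applies the DaemonSet once k3s is up. Rough, untested sketch; the yaml path below is just an example, keep it somewhere persistent on your pool:

Code:
#!/bin/bash
# Re-apply the pinned AMD device plugin DaemonSet after boot.
# Example path; adjust to wherever you keep the yaml.
YAML=/mnt/tank/scripts/k8s-ds-amdgpu-dp.yaml

# Wait until the k3s API answers before applying.
until k3s kubectl get nodes >/dev/null 2>&1; do
    sleep 10
done

k3s kubectl apply -f "$YAML" -n kube-system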
 

antsv8

Cadet
Joined
Jun 1, 2022
Messages
5
Hey Daisuke,

First snip --->

root@scale:/tmp/k3# midclt call device.get_gpus | jq
[
  {
    "addr": {
      "pci_slot": "0000:01:00.1",
      "domain": "0000",
      "bus": "01",
      "slot": "00"
    },
    "description": "Matrox Electronics Systems Ltd. MGA G200EH",
    "devices": [
      {
        "pci_id": "103C:3306",
        "pci_slot": "0000:01:00.0",
        "vm_pci_slot": "pci_0000_01_00_0"
      },
      {
        "pci_id": "102B:0533",
        "pci_slot": "0000:01:00.1",
        "vm_pci_slot": "pci_0000_01_00_1"
      },
      {
        "pci_id": "103C:3307",
        "pci_slot": "0000:01:00.2",
        "vm_pci_slot": "pci_0000_01_00_2"
      },
      {
        "pci_id": "103C:3300",
        "pci_slot": "0000:01:00.4",
        "vm_pci_slot": "pci_0000_01_00_4"
      }
    ],
    "vendor": null,
    "uses_system_critical_devices": false,
    "available_to_host": true
  },
  {
    "addr": {
      "pci_slot": "0000:84:00.0",
      "domain": "0000",
      "bus": "84",
      "slot": "00"
    },
    "description": "Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]",
    "devices": [
      {
        "pci_id": "1002:67DF",
        "pci_slot": "0000:84:00.0",
        "vm_pci_slot": "pci_0000_84_00_0"
      },
      {
        "pci_id": "1002:AAF0",
        "pci_slot": "0000:84:00.1",
        "vm_pci_slot": "pci_0000_84_00_1"
      }
    ],
    "vendor": "AMD",
    "uses_system_critical_devices": false,
    "available_to_host": false
  }
]


---< Snip

Snip 2 ---->

root@scale:/tmp/k3# k3s kubectl get pods -n kube-system | grep amd
amdgpu-device-plugin-daemonset-2tjbq 1/1 Running 0 3m37s

---< Snip 2
 

antsv8

Cadet
Joined
Jun 1, 2022
Messages
5
Rebooted...

Now it works.

Snip --->


root@scale:/tmp/k8# k3s kubectl get pods -n kube-system | grep amd
amdgpu-device-plugin-daemonset-drzcx 1/1 Running 0 3m18s
root@scale:/tmp/k8# midclt call device.get_gpus | jq
[
  {
    "addr": {
      "pci_slot": "0000:01:00.1",
      "domain": "0000",
      "bus": "01",
      "slot": "00"
    },
    "description": "Matrox Electronics Systems Ltd. MGA G200EH",
    "devices": [
      {
        "pci_id": "103C:3306",
        "pci_slot": "0000:01:00.0",
        "vm_pci_slot": "pci_0000_01_00_0"
      },
      {
        "pci_id": "102B:0533",
        "pci_slot": "0000:01:00.1",
        "vm_pci_slot": "pci_0000_01_00_1"
      },
      {
        "pci_id": "103C:3307",
        "pci_slot": "0000:01:00.2",
        "vm_pci_slot": "pci_0000_01_00_2"
      },
      {
        "pci_id": "103C:3300",
        "pci_slot": "0000:01:00.4",
        "vm_pci_slot": "pci_0000_01_00_4"
      }
    ],
    "vendor": null,
    "uses_system_critical_devices": false,
    "available_to_host": true
  },
  {
    "addr": {
      "pci_slot": "0000:84:00.0",
      "domain": "0000",
      "bus": "84",
      "slot": "00"
    },
    "description": "Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]",
    "devices": [
      {
        "pci_id": "1002:67DF",
        "pci_slot": "0000:84:00.0",
        "vm_pci_slot": "pci_0000_84_00_0"
      },
      {
        "pci_id": "1002:AAF0",
        "pci_slot": "0000:84:00.1",
        "vm_pci_slot": "pci_0000_84_00_1"
      }
    ],
    "vendor": "AMD",
    "uses_system_critical_devices": false,
    "available_to_host": true
  }
]


---< Snip end

Now it works... job well done!

Cheers
 