Problem running custom docker apps: cannot allocate unhealthy devices amd.com/gpu

allan.tatter · Nov 12, 2023

I have a fresh install of TrueNAS SCALE on my old PC. I have the following hardware:

CPU: Intel Core i5-8400 (with integrated graphics)
Motherboard: GIGABYTE H370M DS3H
Graphics card: AMD Radeon HD 7970/8970 OEM / R9 280X

I have enabled the Apps service (Docker), and deploying applications from the Discover section works as expected: I can deploy the applications, and they go to a running state and work.

However, when I try to deploy any custom app (any public image from Docker Hub), the state stays at "Deploying" and never gets to "Running". Under the custom app's Details section, the following Related Kubernetes Event keeps recurring:

Code:

Allocate failed due to no healthy devices present; cannot allocate unhealthy devices amd.com/gpu, which is unexpected

With kubectl I see the following:

Code:

$ kubectl -n ix-whoami4 get pods
NAME                                READY   STATUS                     RESTARTS   AGE
whoami4-ix-chart-749c8ff779-mbrqp   0/1     UnexpectedAdmissionError   0          91s
whoami4-ix-chart-749c8ff779-hf824   0/1     UnexpectedAdmissionError   0          90s
whoami4-ix-chart-749c8ff779-htw88   0/1     UnexpectedAdmissionError   0          89s
whoami4-ix-chart-749c8ff779-lghjc   0/1     UnexpectedAdmissionError   0          89s
whoami4-ix-chart-749c8ff779-9zmz7   0/1     UnexpectedAdmissionError   0          87s
whoami4-ix-chart-749c8ff779-p6g2v   0/1     UnexpectedAdmissionError   0          85s
whoami4-ix-chart-749c8ff779-jnh9m   0/1     Pending                    0          84s

Code:

$ kubectl -n ix-whoami4 describe pods
Name:           whoami4-ix-chart-749c8ff779-mbrqp
Namespace:      ix-whoami4
Priority:       0
Node:           ix-truenas/
Start Time:     Sun, 12 Nov 2023 19:41:04 +0200
Labels:         app.kubernetes.io/instance=whoami4
                app.kubernetes.io/name=ix-chart
                pod-template-hash=749c8ff779
Annotations:    rollme: dcArL
Status:         Failed
Reason:         UnexpectedAdmissionError
Message:        Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices amd.com/gpu, which is unexpected
IP:
IPs:            <none>
Controlled By:  ReplicaSet/whoami4-ix-chart-749c8ff779
Containers:
  ix-chart:
    Image:      traefik/whoami:latest
    Port:       80/TCP
    Host Port:  0/TCP
    Limits:
      amd.com/gpu:         0
      gpu.intel.com/i915:  0
      nvidia.com/gpu:      0
    Requests:
      amd.com/gpu:         0
      gpu.intel.com/i915:  0
      nvidia.com/gpu:      0
    Environment:           <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4xrh6 (ro)
Volumes:
  kube-api-access-4xrh6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                    Age   From               Message
  ----     ------                    ----  ----               -------
  Normal   Scheduled                 103s  default-scheduler  Successfully assigned ix-whoami4/whoami4-ix-chart-749c8ff779-mbrqp to ix-truenas
  Warning  UnexpectedAdmissionError  103s  kubelet            Allocate failed due to no healthy devices present; cannot allocate unhealthy devices amd.com/gpu, which is unexpected
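
The kubelet message says that `amd.com/gpu` has no healthy devices. One way to look at this from the node side is to compare what the node advertises under `status.capacity` and `status.allocatable`. The following is a minimal sketch, assuming the node JSON comes from `kubectl get node ix-truenas -o json`; the helper name `gpu_capacity_report` and the sample values are hypothetical and were not captured from the system in this thread.

```python
import json

# Extended resource names that appear in the pod description above.
GPU_RESOURCES = ("amd.com/gpu", "gpu.intel.com/i915", "nvidia.com/gpu")


def gpu_capacity_report(node: dict) -> dict:
    """Return {resource: (capacity, allocatable)} for GPU resources the node advertises."""
    status = node.get("status", {})
    capacity = status.get("capacity", {})
    allocatable = status.get("allocatable", {})
    report = {}
    for name in GPU_RESOURCES:
        if name in capacity or name in allocatable:
            report[name] = (
                int(capacity.get(name, "0")),
                int(allocatable.get(name, "0")),
            )
    return report


# Hypothetical sample data, NOT captured from the system in this thread.
SAMPLE_NODE = json.loads("""
{
  "metadata": {"name": "ix-truenas"},
  "status": {
    "capacity":    {"cpu": "6", "memory": "16Gi", "amd.com/gpu": "0"},
    "allocatable": {"cpu": "6", "memory": "16Gi", "amd.com/gpu": "0"}
  }
}
""")

for resource, (cap, alloc) in gpu_capacity_report(SAMPLE_NODE).items():
    note = "registered, but nothing allocatable" if alloc == 0 else "ok"
    print(f"{resource}: capacity={cap} allocatable={alloc} ({note})")
```

To try it against a real system, replace `SAMPLE_NODE` with `json.load(sys.stdin)` and pipe the `kubectl` output into the script. A resource that is listed with zero allocatable devices would be consistent with the error above, although this thread does not confirm that this is what the node reported.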

For comparison, I successfully deployed the nginx-proxy-manager app from the Discover apps section. It probably works because its Limits and Requests sections do not contain `amd.com/gpu`.

Code:

$ kubectl -n ix-nginx-proxy-manager get pods
NAME                                   READY   STATUS    RESTARTS   AGE
nginx-proxy-manager-747c57ddf4-qnvfk   1/1     Running   0          119s

Code:

$ kubectl -n ix-nginx-proxy-manager describe pods nginx-proxy-manager-747c57ddf4-qnvfk
Name:         nginx-proxy-manager-747c57ddf4-qnvfk
Namespace:    ix-nginx-proxy-manager
Priority:     0
Node:         ix-truenas/192.168.1.10
Start Time:   Sun, 12 Nov 2023 20:32:25 +0200
Labels:       app=nginx-proxy-manager-1.0.18
              app.kubernetes.io/instance=nginx-proxy-manager
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=nginx-proxy-manager
              app.kubernetes.io/version=2.10.4
              helm-revision=1
              helm.sh/chart=nginx-proxy-manager-1.0.18
              pod-template-hash=747c57ddf4
              pod.name=npm
              release=nginx-proxy-manager
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "ix-net",
                    "interface": "eth0",
                    "ips": [
                        "172.16.0.161"
                    ],
                    "mac": "3e:a4:04:04:81:b6",
                    "default": true,
                    "dns": {},
                    "gateway": [
                        "172.16.0.1"
                    ]
                }]
              rollme: uVBno
Status:       Running
IP:           172.16.0.161
IPs:
  IP:           172.16.0.161
Controlled By:  ReplicaSet/nginx-proxy-manager-747c57ddf4
Containers:
  nginx-proxy-manager:
    Container ID:   containerd://0528182aff475d4963bb237f5e6fb2708a9850bb5fb62a268533bb21f249d5e2
    Image:          jc21/nginx-proxy-manager:2.10.4
    Image ID:       docker.io/jc21/nginx-proxy-manager@sha256:e1000dd653d193ac70cb3635c27333b0183a11f987e2b1c6043589d9d948bc0f
    Ports:          80/TCP, 443/TCP, 81/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Sun, 12 Nov 2023 20:32:26 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     4
      memory:  8Gi
    Requests:
      cpu:      10m
      memory:   50Mi
    Liveness:   exec [/bin/check-health] delay=10s timeout=5s period=10s #success=1 #failure=5
    Readiness:  exec [/bin/check-health] delay=10s timeout=5s period=10s #success=2 #failure=5
    Startup:    exec [/bin/check-health] delay=30s timeout=2s period=5s #success=1 #failure=120
    Environment:
      TZ:                      Europe/Tallinn
      UMASK:                   002
      UMASK_SET:               002
      NVIDIA_VISIBLE_DEVICES:  void
      PUID:                    1000
      USER_ID:                 1000
      UID:                     1000
      PGID:                    1000
      GROUP_ID:                1000
      GID:                     1000
      DB_SQLITE_FILE:          /data/database.sqlite
      DISABLE_IPV6:            true
    Mounts:
      /data from data (rw)
      /etc/letsencrypt from certs (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  certs:
    Type:          HostPath (bare host directory volume)
    Path:          /mnt/pool-2/ix-applications/releases/nginx-proxy-manager/volumes/ix_volumes/certs
    HostPathType:
  data:
    Type:          HostPath (bare host directory volume)
    Path:          /mnt/pool-2/ix-applications/releases/nginx-proxy-manager/volumes/ix_volumes/data
    HostPathType:
QoS Class:         Burstable
Node-Selectors:    <none>
Tolerations:       node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       86s                default-scheduler  Successfully assigned ix-nginx-proxy-manager/nginx-proxy-manager-747c57ddf4-qnvfk to ix-truenas
  Normal   AddedInterface  86s                multus             Add eth0 [172.16.0.161/16] from ix-net
  Normal   Pulled          86s                kubelet            Container image "jc21/nginx-proxy-manager:2.10.4" already present on machine
  Normal   Created         85s                kubelet            Created container nginx-proxy-manager
  Normal   Started         85s                kubelet            Started container nginx-proxy-manager
  Warning  Unhealthy       46s (x2 over 51s)  kubelet            Startup probe failed: NOT OK
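
The difference between the two pods can also be checked mechanically. The following is a minimal sketch that lists every non-core resource name a pod's containers mention in their limits or requests, even when the quantity is 0. The helper name `extended_resources` is hypothetical, and the two sample dicts are reconstructed by hand from the `describe` output above, in the shape that `kubectl get pod <name> -o json` is assumed to return.

```python
# Resource names that every pod may carry; anything else is treated as
# an extended (device plugin) resource in this sketch.
CORE_RESOURCES = {"cpu", "memory", "ephemeral-storage"}


def extended_resources(pod: dict) -> dict:
    """Map container name -> sorted extended resource names found in limits/requests."""
    found = {}
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources", {})
        names = set(resources.get("limits", {})) | set(resources.get("requests", {}))
        extra = sorted(
            n for n in names
            if n not in CORE_RESOURCES and not n.startswith("hugepages-")
        )
        if extra:
            found[container["name"]] = extra
    return found


# Reconstructed from the describe output above (assumed JSON shape).
gpu_zero = {"amd.com/gpu": "0", "gpu.intel.com/i915": "0", "nvidia.com/gpu": "0"}
WHOAMI_POD = {
    "spec": {"containers": [
        {"name": "ix-chart", "resources": {"limits": gpu_zero, "requests": gpu_zero}}
    ]}
}
NPM_POD = {
    "spec": {"containers": [
        {"name": "nginx-proxy-manager", "resources": {
            "limits": {"cpu": "4", "memory": "8Gi"},
            "requests": {"cpu": "10m", "memory": "50Mi"},
        }}
    ]}
}

print("whoami4:", extended_resources(WHOAMI_POD))
print("nginx-proxy-manager:", extended_resources(NPM_POD))
```

In this reconstruction only the custom app pod names `amd.com/gpu` at all, which matches the observation that the failing pods are the ones with GPU entries in their Limits and Requests.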

Additional debugging info:

Code:

$ lspci
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X]

Code:

$ lsmod
gpu_sched              53248  1 amdgpu
drm_buddy              20480  2 amdgpu,i915
drm_display_helper    184320  3 amdgpu,radeon,i915
drm_ttm_helper         16384  2 amdgpu,radeon
i2c_algo_bit           16384  3 amdgpu,radeon,i915
video                  65536  3 amdgpu,radeon,i915

Do I need to do anything special about the discrete GPU when doing a fresh install of TrueNAS SCALE, such as installing drivers? I have another machine with different hardware and only integrated graphics, and there are no issues there.

Etorix · Nov 12, 2023

You cannot and should not "install drivers" on TrueNAS: It is an appliance OS.
It might be that your consumer H370 motherboard does not handle PCIe passthrough properly.

allan.tatter · Nov 12, 2023

Etorix said:
You cannot and should not "install drivers" on TrueNAS: It is an appliance OS.

Makes sense.

Etorix said:
It might be that your consumer H370 motherboard does not handle PCIe passthrough properly.

It does have an Intel VT-d option in the BIOS settings, which is a positive indication but not a guarantee of full passthrough functionality. How can I determine whether the motherboard is supported?
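
One generic Linux check, not specific to TrueNAS, is whether the kernel exposes IOMMU groups under `/sys/kernel/iommu_groups`. The following is a minimal sketch that lists each group and the PCI addresses in it; the function name `iommu_groups` is hypothetical. It only reports what the kernel exposes and does not prove that passthrough works on a given board, nor does this thread establish that passthrough is related to the Kubernetes error.

```python
from pathlib import Path


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> dict:
    """Return {group_number: [pci_addresses]}; an empty dict if no groups are exposed."""
    base = Path(root)
    groups = {}
    if not base.is_dir():
        return groups
    for group in base.iterdir():
        devices = group / "devices"
        # Each numbered group directory holds a "devices" directory whose
        # entries are named after PCI addresses such as 0000:01:00.0.
        if group.name.isdigit() and devices.is_dir():
            groups[int(group.name)] = sorted(d.name for d in devices.iterdir())
    return groups


if __name__ == "__main__":
    found = iommu_groups()
    if not found:
        print("No IOMMU groups exposed by the kernel")
    for number in sorted(found):
        print(f"group {number}: {', '.join(found[number])}")
```

An empty result usually means the IOMMU is disabled in firmware or not enabled in the kernel. If the GPU at `01:00.0` shares a group with unrelated devices, that is a common obstacle to passing it through on its own.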

ccfoo242 · Dec 10, 2023

Did you ever resolve this? I just got the same error. These are my only choices. AFAIK I don't need a GPU for what I'm installing.

allan.tatter · Dec 10, 2023

I don't need a GPU for my use case either, at least for now. I disabled the GPU somehow, probably by isolating it under System Settings > Advanced (/ui/system/advanced) > Isolated GPU Device(s).

ccfoo242 · Dec 11, 2023

allan.tatter said:
I didn't need a GPU as well for now for my use case. I disabled the GPU somehow, probably isolated the GPU under System Settings > Advanced (/ui/system/advanced) > Isolated GPU Device(s).

Thanks. I'm able to create a custom app using the TrueCharts custom app, but I'll check that setting and see if I can change it.

TSM 0h and six · Jan 4, 2024

allan.tatter said:
I didn't need a GPU as well for now for my use case. I disabled the GPU somehow, probably isolated the GPU under System Settings > Advanced (/ui/system/advanced) > Isolated GPU Device(s).

When I tried to isolate my GPU, I got the following message:
At least 1 GPU is required by the host for its functions.
With your selection, no GPU is available for the host to consume.

Anything else I can try?

sydonayrex · Jan 7, 2024

Try doing the following:

Access the advanced settings for apps:

Next, within the settings, uncheck "Enable GPU support" and click Save.

If you are on the most recent update for SCALE, you may need to check the Force option before saving. I had to do that because the most recent update seems to have issues with bridge interfaces.
