Nvidia driver issues

spykezap

Cadet
Joined
Dec 1, 2021
Messages
4
Hello TrueNAS community,

Specs first:
- 2 x GTX 1080 Ti
- Ryzen 9 5900X
- 64 GB RAM
- ASUS ROG Crosshair VIII Hero
- TrueNAS-SCALE-22.02-MASTER-20211201-012921

I set up TrueNAS SCALE a week ago and have been tinkering with it since.
When I initially set it up, I deployed the official Plex chart and assigned it one of my 1080 Tis.

This was working; transcoding jobs were handled by the GPU just fine.

Sometime after, I also spun up a Windows VM and tried to assign it the other 1080 Ti.

This obviously did not work, as TrueNAS requires one of the GPUs and Plex had the other.

After this, neither GPU is available to be assigned to any of the k3s charts. They still show up under System Settings -> Advanced -> Isolated GPU Devices, and they also show up if I try creating a new VM.

However, 0 GPUs are listed as available in Kubernetes:
Code:
Capacity:
  cpu:                24
  ephemeral-storage:  200329472Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65761324Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                24
  ephemeral-storage:  194880510209
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65761324Ki
  nvidia.com/gpu:     0
  pods:               110
System Info:
  Machine ID:                 7a0b146b48a04e97b85ba66c86c573eb
  System UUID:                877bb289-7c3c-d63f-bea8-3c7c3fd6bea7
  Boot ID:                    d76aad10-ad90-4982-995f-b4fdfa6ad6c6
  Kernel Version:             5.10.81+truenas
  OS Image:                   Debian GNU/Linux 11 (bullseye)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.9
  Kubelet Version:            v1.21.0-k3s1
  Kube-Proxy Version:         v1.21.0-k3s1
PodCIDR:                      172.16.0.0/16
PodCIDRs:                     172.16.0.0/16
Non-terminated Pods:          (32 in total)
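
For anyone wanting to run the same check: the listing above is ordinary kubectl node output, so something like the following should reproduce it (single-node k3s, so no node name is needed).
Code:
# Full node description, including the Capacity/Allocatable sections above:
k3s kubectl describe node

# Or just the GPU resource lines:
k3s kubectl describe node | grep 'nvidia.com/gpu'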


Checking the nvidia-device-plugin-daemonset pod:
Code:
root@odin[~]# k3s kubectl -n kube-system describe pod nvidia-device-plugin-daemonset-p2m6t
Name:                 nvidia-device-plugin-daemonset-p2m6t
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ix-truenas/10.8.30.20
Start Time:           Wed, 01 Dec 2021 10:13:28 +0100
Labels:               controller-revision-hash=586f5fbcf9
                      name=nvidia-device-plugin-ds
                      pod-template-generation=2
Annotations:          k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "ix-net",
                            "interface": "eth0",
                            "ips": [
                                "172.16.0.28"
                            ],
                            "mac": "86:ff:c1:23:71:95",
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "ix-net",
                            "interface": "eth0",
                            "ips": [
                                "172.16.0.28"
                            ],
                            "mac": "86:ff:c1:23:71:95",
                            "default": true,
                            "dns": {}
                        }]
                      scheduler.alpha.kubernetes.io/critical-pod:
Status:               Running
IP:                   172.16.0.28
IPs:
  IP:           172.16.0.28
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  docker://96dd301781c60d929c0faf0093a004dc77b727d801a68e53e303d9d6c475b862
    Image:         nvidia/k8s-device-plugin:v0.9.0
    Image ID:      docker-pullable://nvidia/k8s-device-plugin@sha256:964847cc3fd85ead286be1d74d961f53d638cd4875af51166178b17bba90192f
    Port:          <none>
    Host Port:     <none>
    Args:
      --fail-on-init-error=false
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
      Exit Code:    128
      Started:      Wed, 01 Dec 2021 10:48:15 +0100
      Finished:     Wed, 01 Dec 2021 10:48:15 +0100
    Ready:          False
    Restart Count:  13
    Environment:
      DP_DISABLE_HEALTHCHECKS:  xids
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5qhn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType: 
  kube-api-access-r5qhn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       39m                    default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-p2m6t to ix-truenas
  Normal   AddedInterface  38m                    multus             Add eth0 [172.16.0.28/16] from ix-net
  Normal   Created         34m (x6 over 38m)      kubelet            Created container nvidia-device-plugin-ctr
  Warning  Failed          34m (x6 over 37m)      kubelet            Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
  Normal   Pulled          33m (x7 over 38m)      kubelet            Container image "nvidia/k8s-device-plugin:v0.9.0" already present on machine
  Warning  BackOff         3m28s (x141 over 36m)  kubelet            Back-off restarting failed container
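
For completeness, the plugin's own logs can be pulled as well; the pod name below is the one from the describe output above and will differ on other installs.
Code:
# Find the device-plugin pod:
k3s kubectl -n kube-system get pods | grep nvidia-device-plugin

# Current and previous container logs (may be empty here, since the container never actually starts):
k3s kubectl -n kube-system logs nvidia-device-plugin-daemonset-p2m6t
k3s kubectl -n kube-system logs --previous nvidia-device-plugin-daemonset-p2m6t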


Trying to run nvidia-smi:
Code:
root@odin[~]# nvidia-smi                                                                 
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


lspci:
Code:
root@odin[~]# lspci | grep -i vga
0a:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
0b:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
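
A more verbose lspci call also shows which kernel driver, if any, is bound to each card; with hindsight this is a useful check, since a vfio-pci binding means the card is reserved for passthrough rather than claimed by the NVIDIA driver. Bus addresses are taken from the output above.
Code:
lspci -nnk -s 0a:00.0
lspci -nnk -s 0b:00.0
# Check the "Kernel driver in use:" line: vfio-pci = reserved for passthrough,
# nvidia = claimed by the host driver, absent = nothing bound at all.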


As the above suggests, the system is not able to load the NVIDIA driver. I came across another thread and tried manually installing the driver with apt install nvidia-cuda-dev nvidia-cuda-toolkit, which almost bricked the system.
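As a gentler first step than installing packages, it's worth checking whether the nvidia kernel module even exists and loads on the stock system; none of this is TrueNAS-specific.
Code:
# Is the module currently loaded?
lsmod | grep nvidia

# Try loading it by hand and check the kernel log for the reason if it fails:
modprobe nvidia
dmesg | tail -n 20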

Any suggestions or help would be highly appreciated!
 

spykezap

Cadet
Joined
Dec 1, 2021
Messages
4
Some additional info I've found:

Code:
systemctl status systemd-modules-load.service
● systemd-modules-load.service - Load Kernel Modules
     Loaded: loaded (/lib/systemd/system/systemd-modules-load.service; static)
     Active: active (exited) since Wed 2021-12-01 13:32:25 CET; 33min ago
       Docs: man:systemd-modules-load.service(8)
             man:modules-load.d(5)
   Main PID: 1392 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 76677)
     Memory: 0B
     CGroup: /system.slice/systemd-modules-load.service

Dec 01 13:32:25 odin.lizcloud.net systemd-modules-load[1392]: Failed to find module 'vfio_pci ids=10DE:1B06,10DE:10EF'
Dec 01 13:32:25 odin.lizcloud.net systemd-modules-load[1392]: Failed to find module 'nvidia-drm'
Dec 01 13:32:25 odin.lizcloud.net systemd-modules-load[1392]: Inserted module 'ioatdma'
Dec 01 13:32:25 odin.lizcloud.net systemd-modules-load[1392]: Inserted module 'ntb_netdev'
Dec 01 13:32:25 odin.lizcloud.net systemd[1]: Finished Load Kernel Modules.
Warning: journal has been rotated since unit was started, output may be incomplete.
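
Those two "Failed to find module" lines come from the module list systemd is told to load at boot (see man modules-load.d, referenced in the unit above). The list itself can be inspected directly, and it's also worth checking whether the nvidia modules exist for the running kernel at all. The "vfio_pci ids=..." entry looks like module options ended up in a modules-load.d file, which only accepts bare module names.
Code:
# Everything systemd-modules-load is configured to load:
grep -r . /etc/modules-load.d/ /run/modules-load.d/ /usr/lib/modules-load.d/ 2>/dev/null

# Are the nvidia modules present for this kernel?
find /lib/modules/$(uname -r) -name 'nvidia*.ko*'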
 

spykezap

Cadet
Joined
Dec 1, 2021
Messages
4
I finally found an answer to this on the TrueNAS Discord.

I think that when I created a virtual machine with one of the 1080 Tis, TrueNAS set that GPU as an isolated GPU under System Settings -> Advanced -> Isolated GPU Device(s).

I unticked the GPUs in this setting and restarted the server. Everything is working now: the GPUs are available for the k3s apps, and nvidia-smi reports both devices.
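
After the reboot, the same checks from earlier in the thread can be used to confirm it:
Code:
nvidia-smi
k3s kubectl describe node | grep 'nvidia.com/gpu'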
 

angst911

Dabbler
Joined
Sep 11, 2015
Messages
12
Note: you may actually be experiencing what I'm dealing with on two Tesla M40s. Did you by chance set your BIOS to prefer the built-in video (assuming your motherboard has that)? That would make both cards available for passthrough.

My thread, which I just started:
 

max333

Cadet
Joined
May 14, 2023
Messages
1
I think the issue is around Helm, the k3s server, and Docker. I recently messed with those and am now facing the same issue. I'm trying to reset them to defaults.
 