Potentially broke Kubernetes/storage backend isn't working

mew · Jun 21, 2022

I am currently preparing to upgrade my TrueNAS Scale server from a two 14tb pool (mirrored) to a ten 14tb pool. My friend loaned me two of his 14tb drives to use as a temporary buffer so I can swap out my drives and resilver them for my new pool. I swapped out one and started the resilvering process with no issues thinking that I'd speed up the process to add a pool using my own drives. Upon starting that I tried to see if I could start up my plex server while this is going on only to find out that it doesn't seem to be working at all. I am using TrueCharts apps and asked them in there if this is an issue that happens when drives resilver hoping that this is a known issue. On further inspection it seems to be that the storage backend is (allegedly) not working at all. I restarted my system a little bit into the resilvering process to see if that'd fix the kubernetes issue but my issues still persisted.
This is the output from kube-system

Code:

root@server[~]# k3s kubectl describe pods -n kube-system
Name:                 openebs-zfs-node-g5mw6
Namespace:            kube-system
Priority:             900001000
Priority Class Name:  openebs-zfs-csi-node-critical
Node:                 ix-truenas/192.168.1.49
Start Time:           Tue, 21 Jun 2022 17:51:09 -0700
Labels:               app=openebs-zfs-node
                      controller-revision-hash=57f5455f6b
                      openebs.io/component-name=openebs-zfs-node
                      openebs.io/version=ci
                      pod-template-generation=1
                      role=openebs-zfs
Annotations:          <none>
Status:               Pending
IP:                   192.168.1.49
IPs:
  IP:           192.168.1.49
Controlled By:  DaemonSet/openebs-zfs-node
Containers:
  csi-node-driver-registrar:
    Container ID:
    Image:         k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.3.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --csi-address=$(ADDRESS)
      --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      ADDRESS:               /plugin/csi.sock
      DRIVER_REG_SOCK_PATH:  /var/lib/kubelet/plugins/zfs-localpv/csi.sock
      KUBE_NODE_NAME:         (v1:spec.nodeName)
      NODE_DRIVER:           openebs-zfs
    Mounts:
      /plugin from plugin-dir (rw)
      /registration from registration-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qvnfn (ro)
  openebs-zfs-plugin:
    Container ID:
    Image:         openebs/zfs-driver:2.0.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --nodename=$(OPENEBS_NODE_NAME)
      --endpoint=$(OPENEBS_CSI_ENDPOINT)
      --plugin=$(OPENEBS_NODE_DRIVER)
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      OPENEBS_NODE_NAME:      (v1:spec.nodeName)
      OPENEBS_CSI_ENDPOINT:  unix:///plugin/csi.sock
      OPENEBS_NODE_DRIVER:   agent
      OPENEBS_NAMESPACE:     openebs
      ALLOWED_TOPOLOGIES:    All
    Mounts:
      /dev from device-dir (rw)
      /home/keys from encr-keys (rw)
      /host from host-root (ro)
      /plugin from plugin-dir (rw)
      /sbin/zfs from chroot-zfs (rw,path="zfs")
      /var/lib/kubelet/ from pods-mount-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qvnfn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  Directory
  encr-keys:
    Type:          HostPath (bare host directory volume)
    Path:          /home/keys
    HostPathType:  DirectoryOrCreate
  chroot-zfs:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      openebs-zfspv-bin
    Optional:  false
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  Directory
  registration-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins_registry/
    HostPathType:  DirectoryOrCreate
  plugin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins/zfs-localpv/
    HostPathType:  DirectoryOrCreate
  pods-mount-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/
    HostPathType:  Directory
  kube-api-access-qvnfn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason                  Age                 From               Message
  ----     ------                  ----                ----               -------
  Normal   Scheduled               52m                 default-scheduler  Successfully assigned kube-system/openebs-zfs-node-g5mw6 to ix-truenas
  Normal   SandboxChanged          10m (x12 over 32m)  kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  1s (x25 over 50m)   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "openebs-zfs-node-g5mw6": operation timeout: context deadline exceeded

Name:                 coredns-d76bd69b-6h7nj
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 ix-truenas/192.168.1.49
Start Time:           Tue, 21 Jun 2022 17:51:09 -0700
Labels:               k8s-app=kube-dns
                      pod-template-hash=d76bd69b
Annotations:          <none>
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        ReplicaSet/coredns-d76bd69b
Containers:
  coredns:
    Container ID:
    Image:         rancher/mirrored-coredns-coredns:1.9.1
    Image ID:
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=2s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /etc/coredns/custom from custom-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lxmn7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  custom-config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns-custom
    Optional:  true
  kube-api-access-lxmn7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              beta.kubernetes.io/os=linux
Tolerations:                 CriticalAddonsOnly op=Exists
                             node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Warning  FailedScheduling        57m                   default-scheduler  0/1 nodes are available: 1 node(s) had taint {ix-svc-start: }, that the pod didn't tolerate.
  Warning  FailedScheduling        54m (x1 over 55m)     default-scheduler  0/1 nodes are available: 1 node(s) had taint {ix-svc-start: }, that the pod didn't tolerate.
  Warning  FailedScheduling        53m                   default-scheduler  0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
  Normal   Scheduled               52m                   default-scheduler  Successfully assigned kube-system/coredns-d76bd69b-6h7nj to ix-truenas
  Warning  FailedSync              29m (x5 over 30m)     kubelet            error determining status: rpc error: code = Unknown desc = Error: No such container: 438d145717f95533dc20661f4ca3259e5af73e94521a4ec2a05fbd0ec0c7781a
  Normal   SandboxChanged          17m (x9 over 34m)     kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  5m19s (x22 over 50m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "coredns-d76bd69b-6h7nj": operation timeout: context deadline exceeded


Name:                 nvidia-device-plugin-daemonset-n7fwf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ix-truenas/192.168.1.49
Start Time:           Tue, 21 Jun 2022 17:51:37 -0700
Labels:               controller-revision-hash=77f95bfc79
                      name=nvidia-device-plugin-ds
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod:
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:
    Image:          nvidia/k8s-device-plugin:v0.10.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r2vhn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  kube-api-access-r2vhn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               51m                   default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-n7fwf to ix-truenas
  Normal   SandboxChanged          9m51s (x13 over 34m)  kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  3m40s (x23 over 49m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "nvidia-device-plugin-daemonset-n7fwf": operation timeout: context deadline exceeded


Name:                 openebs-zfs-controller-0
Namespace:            kube-system
Priority:             900000000
Priority Class Name:  openebs-zfs-csi-controller-critical
Node:                 ix-truenas/192.168.1.49
Start Time:           Tue, 21 Jun 2022 17:51:36 -0700
Labels:               app=openebs-zfs-controller
                      controller-revision-hash=openebs-zfs-controller-698698d48b
                      openebs.io/component-name=openebs-zfs-controller
                      openebs.io/version=ci
                      role=openebs-zfs
                      statefulset.kubernetes.io/pod-name=openebs-zfs-controller-0
Annotations:          <none>
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        StatefulSet/openebs-zfs-controller
Containers:
  csi-resizer:
    Container ID:
    Image:         k8s.gcr.io/sig-storage/csi-resizer:v1.2.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --csi-address=$(ADDRESS)
      --leader-election
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      ADDRESS:  /var/lib/csi/sockets/pluginproxy/csi.sock
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jwxj2 (ro)
  csi-snapshotter:
    Container ID:
    Image:         k8s.gcr.io/sig-storage/csi-snapshotter:v4.0.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --csi-address=$(ADDRESS)
      --leader-election
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      ADDRESS:  /var/lib/csi/sockets/pluginproxy/csi.sock
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jwxj2 (ro)
  snapshot-controller:
    Container ID:
    Image:         k8s.gcr.io/sig-storage/snapshot-controller:v4.0.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --leader-election=true
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jwxj2 (ro)
  csi-provisioner:
    Container ID:
    Image:         k8s.gcr.io/sig-storage/csi-provisioner:v3.0.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --csi-address=$(ADDRESS)
      --v=5
      --feature-gates=Topology=true
      --strict-topology
      --leader-election
      --extra-create-metadata=true
      --enable-capacity=true
      --default-fstype=ext4
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      ADDRESS:    /var/lib/csi/sockets/pluginproxy/csi.sock
      NAMESPACE:  kube-system (v1:metadata.namespace)
      POD_NAME:   openebs-zfs-controller-0 (v1:metadata.name)
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jwxj2 (ro)
  openebs-zfs-plugin:
    Container ID:
    Image:         openebs/zfs-driver:2.0.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --endpoint=$(OPENEBS_CSI_ENDPOINT)
      --plugin=$(OPENEBS_CONTROLLER_DRIVER)
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      OPENEBS_CONTROLLER_DRIVER:    controller
      OPENEBS_CSI_ENDPOINT:         unix:///var/lib/csi/sockets/pluginproxy/csi.sock
      OPENEBS_NAMESPACE:            openebs
      OPENEBS_IO_INSTALLER_TYPE:    zfs-operator
      OPENEBS_IO_ENABLE_ANALYTICS:  true
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jwxj2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  socket-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-jwxj2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               51m                   default-scheduler  Successfully assigned kube-system/openebs-zfs-controller-0 to ix-truenas
  Warning  FailedSync              38m (x3 over 38m)     kubelet            error determining status: rpc error: code = Unknown desc = Error: No such container: 2c97475ed30a8d3dc3c987f44517dad4720751c3a6366dc3869cbc4216141ef5
  Warning  FailedSync              23m (x3 over 23m)     kubelet            error determining status: rpc error: code = Unknown desc = Error: No such container: a81e009b4f084f12806bec335c332c6c2b168fd461909906875183083448cb12
  Normal   SandboxChanged          19m (x10 over 37m)    kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  4m47s (x22 over 49m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "openebs-zfs-controller-0": operation timeout: context deadline exceeded
root@server[~]#

The resilvering process is also taking forever and I'm assuming it's due to the fact that I've restarted the machine several times and caused more issues than it has helped.
I also seemingly managed to make over 9000 snapshots in the past day and a half alone because kubernetes keeps trying to create pods?

I've made a ticket on Jira but I'm waiting for a response and I'm really not sure what I should do. Should I leave my server running while it is spamming non stop snapshots and bogging down the system? Is there any way to delete these en masse?

My current specs are

Code:

Intel i7 5820k
Gigabyte X99 UD4 (Revision 1)
102gb of of DDR4 RAM
Corsair ax850 Gold
Nvidia EVGA 1070
One (formerly two) shucked WD easystore (wd whites) - 14tb
One WD Purple 14tb drive

Sorry for posting this on the forum as well as on the jira, it's stressing me out as I've not seen anyone else run into similar issues?
Please let me know if you have any questions about my issue!

Thanks for the help :)

mew · Jun 21, 2022

I also just received this warning while connected to the server

Code:

2022 Jun 21 17:43:38 server [1\2] 2022-06-21 17:43:38 Warning: No server certificate defined; will use a selfsigned one.
2022 Jun 21 17:43:38 server [2/2]  Suggested action: either install a certificate or change tls_advertise_hosts option
2022 Jun 21 17:43:38 server 2022-06-21 17:43:38 Cannot open main log file "/var/log/exim4/mainlog": Permission denied: euid=0 egid=120
2022 Jun 21 17:43:38 server exim: could not open panic log - aborting: see message(s) above

And this is the zpool status -v

Code:

root@server[~]# zpool status -v
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:07 with 0 errors on Sun Jun 19 03:45:09 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sda3      ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jun 19 14:45:30 2022
        4.87T scanned at 933M/s, 183G issued at 34.3M/s, 7.22T total
        183G resilvered, 2.48% done, 2 days 11:51:04 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        tank                                      ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            5bbcb42a-0cd2-4803-80c1-42e71105acf1  ONLINE       0     0     0  (resilvering)
            a25780ed-8858-4d75-8868-6e3babbd7b74  ONLINE       0     0     0

errors: No known data errors

mew · Jun 21, 2022

Oh and to be clear I'm running TrueNAS-SCALE-22.02.1, I haven't updated to TrueNAS-SCALE-22.02.2 as I'm not sure if I should because I'm having all these issues

morganL · Jun 21, 2022

Its a bit late now, but the best way to expand a pool is to add VDEVs.

Right now, I think your best bet is patience and let resilver complete. If resilver is stopped, then diagnose that.

mew · Jun 21, 2022

morganL said:
Its a bit late now, but the best way to expand a pool is to add VDEVs.

Right now, I think your best bet is patience and let resilver complete. If resilver is stopped, then diagnose that.

What about the infinite number of snapshots being created?

mew · Jun 21, 2022

morganL said:
Its a bit late now, but the best way to expand a pool is to add VDEVs.

Right now, I think your best bet is patience and let resilver complete. If resilver is stopped, then diagnose that.

It is getting slower and slower as time goes on due to the database being too big/too many snapshots, sorry for the double post I'm used to being able to edit posts. Sorry!

mew · Jun 22, 2022

Unfortunately it still occurs on the latest version of Scale (TrueNAS-SCALE-22.02.2)
Any advice on the mass deletion of these empty snapshots?
Thanks!

Important Announcement for the TrueNAS Community.

Potentially broke Kubernetes/storage backend isn't working

mew

Cadet

mew

Cadet

mew

Cadet

morganL

Captain Morgan

mew

Cadet

mew

Cadet

mew

Cadet

Similar threads

Important Announcement for the TrueNAS Community.

Potentially broke Kubernetes/storage backend isn't working

mew

Cadet

mew

Cadet

mew

Cadet

morganL

Captain Morgan

mew

Cadet

mew

Cadet

mew

Cadet

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Potentially broke Kubernetes/storage backend isn't working"

Similar threads