snapshot failing repeatedly - dataset does not exist, how to clean this up?

Chris666

Dabbler
Joined
Aug 1, 2022
Messages
10
I was looking at the logs today and I'm seeing a repeating error that I can't find any way to resolve. How do I clean this up? TrueNAS-SCALE-22.02.3

k3s_daemon.log:

Sep 13 13:29:52 truenas k3s[262724]: E0913 13:29:52.609681 262724 remote_runtime.go:394] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown d
esc = Error response from daemon: exit status 2: \"/usr/sbin/zfs fs snapshot ssd/ix-applications/docker/d6e44efa8a3af577a1f164fa4adffcf96ff914f7125bc378acc1c4ac67ee7618@606843339\
" => cannot open 'ssd/ix-applications/docker/d6e44efa8a3af577a1f164fa4adffcf96ff914f7125bc378acc1c4ac67ee7618': dataset does not exist\nusage:\n\tsnapshot [-r] [-o property=value]
... @ ...\n\nFor the property list, run: zfs set|get\n\nFor the delegated permission list, run: zfs allow|unallow" podSandboxID="ad13414c91af5093fcf166ca
89eebc2f92ea554359156978b1bf94fc92d112e1"
Sep 13 13:29:52 truenas k3s[262724]: E0913 13:29:52.609749 262724 kuberuntime_manager.go:919] container &Container{Name:intel-gpu-plugin,Image:intel/intel-gpu-plugin:0.19.0,Comma
nd:[],Args:[-shared-dev-num 5],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:NODE_NAME,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,Fi
eldPath:spec.nodeName,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts
:[]VolumeMount{VolumeMount{Name:devfs,ReadOnly:true,MountPath:/dev/dri,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:sysfs,ReadOnly:true,MountPath:/sys/class/drm,S
ubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:kubeletsockets,ReadOnly:false,MountPath:/var/lib/kubelet/device-plugins,SubPath:,MountPropagation:nil,SubPathExpr:,},V
olumeMount{Name:kube-api-access-pmv8t,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,Readin
essProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:nil,SELinuxOptions
:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:*true,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,Stdi
nOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod intel-gpu-plugin-mbkpv_kube-sys
tem(80b45bdc-b505-474f-9e95-d2fd02da734e): CreateContainerError: Error response from daemon: exit status 2: "/usr/sbin/zfs fs snapshot ssd/ix-applications/docker/d6e44efa8a3af577a
1f164fa4adffcf96ff914f7125bc378acc1c4ac67ee7618@606843339" => cannot open 'ssd/ix-applications/docker/d6e44efa8a3af577a1f164fa4adffcf96ff914f7125bc378acc1c4ac67ee7618': dataset do
es not exist
Sep 13 13:29:52 truenas k3s[262724]: usage:
Sep 13 13:29:52 truenas k3s[262724]: snapshot [-r] [-o property=value] ... @ ...
Sep 13 13:29:52 truenas k3s[262724]: For the property list, run: zfs set|get
Sep 13 13:29:52 truenas k3s[262724]: For the delegated permission list, run: zfs allow|unallow
Sep 13 13:29:52 truenas k3s[262724]: E0913 13:29:52.609785 262724 pod_workers.go:949] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"intel-gpu-plugin\" wit
h CreateContainerError: \"Error response from daemon: exit status 2: \\\"/usr/sbin/zfs fs snapshot ssd/ix-applications/docker/d6e44efa8a3af577a1f164fa4adffcf96ff914f7125bc378acc1c
4ac67ee7618@606843339\\\" => cannot open 'ssd/ix-applications/docker/d6e44efa8a3af577a1f164fa4adffcf96ff914f7125bc378acc1c4ac67ee7618': dataset does not exist\\nusage:\\n\\tsnapsh
ot [-r] [-o property=value] ... @ ...\\n\\nFor the property list, run: zfs set|get\\n\\nFor the delegated permission list, run: zfs allow|unallow\"" pod="
kube-system/intel-gpu-plugin-mbkpv" podUID=80b45bdc-b505-474f-9e95-d2fd02da734e
Sep 13 13:29:52 truenas k3s[262724]: E0913 13:29:52.716992 262724 token_manager.go:121] "Couldn't update token" err="pods \"intel-gpu-plugin-mbkpv\" not found" cacheKey="\"defaul
t\"/\"kube-system\"/[]string(nil)/3607/v1.BoundObjectReference{Kind:\"Pod\", APIVersion:\"v1\", Name:\"intel-gpu-plugin-mbkpv\", UID:\"80b45bdc-b505-474f-9e95-d2fd02da734e\"}"
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
Is there contextual information about anything set up or tried before this happened? Or, did it just start happening out of the blue?
 

Chris666

Dabbler
Joined
Aug 1, 2022
Messages
10
Is there contextual information about anything set up or tried before this happened? Or, did it just start happening out of the blue?
I was looking at the logs and happened to notice this being spewed. Apparently it's been going on for a while.
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
Hard to know what's going on without more info. Could be out of space. Corrupt files. Bad network.

You could try resetting the application dataset or moving the existing one to a new location.
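
Before going that far, it might be worth confirming from the shell what the error is actually complaining about. Something along these lines should do it; the dataset name is just copied from your log, so adjust to suit:

# Does the dataset docker is trying to snapshot actually exist?
zfs list -r -t filesystem ssd/ix-applications/docker | grep d6e44efa
# And is the pool holding ix-applications out of space?
zfs list -o name,used,avail ssd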
 

Chris666

Dabbler
Joined
Aug 1, 2022
Messages
10
Hard to know what's going on without more info. Could be out of space. Corrupt files. Bad network.

You could try resetting the application dataset or moving the existing one to a new location.
Nothing like that: not out of space, nothing to do with networks. It's obviously a corrupt file somewhere; there is an entry for something that does not exist. It's a little worrisome that the TrueNAS software is so fragile. From the logs it is apparent that something installed intel-gpu-plugin and it got borked somehow. The software should be able to detect this issue and handle it, not just step on its dick.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Nothing like that: not out of space, nothing to do with networks. It's obviously a corrupt file somewhere; there is an entry for something that does not exist. It's a little worrisome that the TrueNAS software is so fragile. From the logs it is apparent that something installed intel-gpu-plugin and it got borked somehow. The software should be able to detect this issue and handle it, not just step on its dick.

Please show your system hardware... is there an Intel GPU involved?
Is the system operational otherwise?

We should add a feature to TrueNAS where it automatically handles its own bugs and the bugs of all the other software used in the system. My guess is that it would be quite popular.

In the meantime, if there is no answer and/or it's causing a problem, I'd recommend the old manual process of reporting a bug.
 

Chris666

Dabbler
Joined
Aug 1, 2022
Messages
10
Please show your system hardware... is there an Intel GPU involved?
Is the system operational otherwise?

We should add a feature to TrueNAS where it automatically handles its own bugs and the bugs of all the other software used in the system. My guess is that it would be quite popular.

In the meantime, if there is no answer and/or it's causing a problem, I'd recommend the old manual process of reporting a bug.
Yeah, self-healing software would indeed be a very popular feature :), but this issue here has to do with data integrity. I can see that there is redundant data: it looks like there are the actual containers and a dataset that points to them, probably containing additional metadata about the container and possibly a cache structure for performance reasons. There should be a procedure for verifying the integrity of this data, which appears to be out of sync with reality. I was hoping there was something that would reset this; if there is not, then that might be the actual bug to report (technically a feature request). Since I'm still learning TrueNAS there are volumes of information I don't know, so I'm running on experience from other systems.

Anyway... the hardware configuration is pretty straightforward: it's an Intel i3-10105, which does have a GPU. I'm thinking this is related to Jellyfin, which I have installed; it looks like it might be the one that tried to get the GPU working for transcoding (a wild guess on my part). I'm not sure if that's the case or if the install tries to install the GPU module via docker - I could not find any breadcrumbs pointing to Jellyfin, so I'm just guessing because that's the only thing I can think of that would care about a GPU.

I guess the real question, now that I think about it, is: can I rebuild the data structures used by docker and Kubernetes? There is the nuclear option of a reinstall, which is not the end of the world; I've done it once, and getting back to running again probably takes under an hour.

I would like to be able to fix this kind of issue. I'm always nervous about systems that do all the magic behind a curtain; it's nice that when you do it yourself you know how everything works, but TrueNAS SCALE has some really nice features for making things happen.

If I do find the crux of this issue I will be sure to file a bug report. I've been walking through the architecture to divine how it works (treading lightly); this is actually the first real bit of reverse engineering I've had to do with TrueNAS. I think a system intended for enterprise use needs some integrity checks, though they can be quite tricky to implement in my experience. I think what happened is that something was being installed, it went south, and it left things in an indeterminate state. Unfortunately (or fortunately) the only indication that things were amiss was these messages in k3s_daemon.log.
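
In case it helps anyone else poking at the same thing, these are roughly the commands I've been using to compare what the different layers think exists (names and dataset paths are from my system):

# What k3s thinks should be running (k3s bundles kubectl on SCALE)
k3s kubectl get pods -A | grep intel-gpu
# What the docker daemon still knows about
docker ps -a | grep intel
docker images | grep intel
# What actually exists on the ZFS side under the apps dataset
zfs list -r ssd/ix-applications/docker | grep d6e44efa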


Thanks!
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
I was hoping there was something that would reset this; if there is not, then that might be the actual bug to report (technically a feature request).

There is a process to reset your apps: https://www.truenas.com/community/threads/possible-to-reset-kubernetes.101051/

Of course, you would lose the state of any currently installed apps doing this. But, that should give you a working blank slate to start from again.
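
From memory, that process boils down to roughly the following; treat it as an outline and follow the thread for the exact steps, because destroying ix-applications wipes every installed app:

# 1. Apps -> Settings -> Unset Pool (stops k3s and releases ix-applications)
# 2. Destroy the old apps dataset - destructive, removes all installed apps:
zfs destroy -r ssd/ix-applications
# 3. Apps -> Settings -> Choose Pool again to recreate ix-applications from scratch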

Anyway... the hardware configuration is pretty straightforward: it's an Intel i3-10105, which does have a GPU.

What about the rest? ECC RAM? Motherboard? Drives? Where is the ix-applications dataset? Please tell me it's not on some crusty old USB thumb drive.
 

Chris666

Dabbler
Joined
Aug 1, 2022
Messages
10
There is a process to reset your apps: https://www.truenas.com/community/threads/possible-to-reset-kubernetes.101051/

Of course, you would lose the state of any currently installed apps doing this. But, that should give you a working blank slate to start from again.



What about the rest? ECC RAM? Motherboard? Drives? Where is the ix-applications dataset? Please tell me it's not on some crusty old USB thumb drive.
That actually makes sense, removing and re-adding the pool. It did change the messages a bit, which is almost always a good sign. I may just recreate the whole pool, since I'm not running that many apps. A lot of this is me learning TrueNAS; there are clients that could use this and it's useful for me to get to know.

The i3 does not support ECC, something one definitely wants for production ZFS. I have 5 WD Red+ drives and an SSD where the ix stuff lives. The root is on an NVMe. It's actually not a bad little system; it could be "pro" if it supported ECC. Since it's not very busy (just a few shares, mostly NFS) it gets by quite fine. This is my first real hiccup.

If I get it happy again I'll let you know all know how I managed it.
 

Chris666

Dabbler
Joined
Aug 1, 2022
Messages
10
It looks like there is no recovery other than a reinstall. Resetting changes the messages a bit but does not recover anything:

Sep 13 22:55:34 truenas k3s[9270]: E0913 22:55:34.355806 9270 kubelet_volumes.go:245] "There were many si
milar errors. Turn up verbosity to see them." err="orphaned pod \"71e19841-e00e-4f51-9722-4b82bc6793da\" fou
nd, but error not a directory occurred when trying to remove the volumes dir" numErrs=2
Sep 13 22:55:35 truenas k3s[9270]: E0913 22:55:35.337423 9270 remote_image.go:160] "Get ImageStatus from
image service failed" err="rpc error: code = Unknown desc = Error response from daemon: layer does not exist
" image="intel/intel-gpu-plugin:0.19.0"
Sep 13 22:55:35 truenas k3s[9270]: E0913 22:55:35.337439 9270 kuberuntime_image.go:86] "Failed to get ima
ge status" err="rpc error: code = Unknown desc = Error response from daemon: layer does not exist" image="in
tel/intel-gpu-plugin:0.19.0"
Sep 13 22:55:35 truenas k3s[9270]: E0913 22:55:35.337486 9270 kuberuntime_manager.go:919] container &Cont
ainer{Name:intel-gpu-plugin,Image:intel/intel-gpu-plugin:0.19.0,Command:[],Args:[-shared-dev-num 5],WorkingD
ir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:NODE_NAME,Value:,ValueFrom:&EnvVarSource{FieldRef:&Obje
ctFieldSelector{APIVersion:v1,FieldPath:spec.nodeName,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRe
f:nil,},},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]Vo
lumeMount{VolumeMount{Name:devfs,ReadOnly:true,MountPath:/dev/dri,SubPath:,MountPropagation:nil,SubPathExpr:
,},VolumeMount{Name:sysfs,ReadOnly:true,MountPath:/sys/class/drm,SubPath:,MountPropagation:nil,SubPathExpr:,
},VolumeMount{Name:kubeletsockets,ReadOnly:false,MountPath:/var/lib/kubelet/device-plugins,SubPath:,MountPro
pagation:nil,SubPathExpr:,},VolumeMount{Name:kube-api-access-kwc7b,ReadOnly:true,MountPath:/var/run/secrets/
kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe
:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:
&SecurityContext{Capabilities:nil,Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyR
ootFilesystem:*true,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProf
ile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,Volu
meDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod intel-gpu-plugin-k668c_kube-system(fe0ccea
a-f269-46ed-9909-48c4eb5d014b): ImageInspectError: Failed to inspect image "intel/intel-gpu-plugin:0.19.0":
rpc error: code = Unknown desc = Error response from daemon: layer does not exist
Sep 13 22:55:35 truenas k3s[9270]: E0913 22:55:35.337505 9270 pod_workers.go:949] "Error syncing pod, ski
pping" err="failed to \"StartContainer\" for \"intel-gpu-plugin\" with ImageInspectError: \"Failed to inspec
t image \\\"intel/intel-gpu-plugin:0.19.0\\\": rpc error: code = Unknown desc = Error response from daemon:
layer does not exist\"" pod="kube-system/intel-gpu-plugin-k668c" podUID=fe0cceaa-f269-46ed-9909-48c4eb5d014b
Sep 13 22:55:36 truenas k3s[9270]: E0913 22:55:36.350963 9270 kubelet_volumes.go:245] "There were many si
milar errors. Turn up verbosity to see them." err="orphaned pod \"71e19841-e00e-4f51-9722-4b82bc6793da\" fou
nd, but error not a directory occurred when trying to remove the volumes dir" numErrs=2
 

Chris666

Dabbler
Joined
Aug 1, 2022
Messages
10
Well, I seem to have figured it out. Applications->Settings->Advanced Settings->Enable GPU was set, and that appears to have broken things. Turning it off ended all the intel-gpu-plugin messages, and now I just have one message that repeats every couple of seconds. So basically that feature is very broken in at least this configuration; I guess there is a bug report there. I can post the full hardware, but the key element is the i3-10105, which has Intel UHD Graphics 630. I would imagine that is the crux of the issue.

Sep 14 12:27:25 truenas k3s[8991]: E0914 12:27:25.052888 8991 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"2
3de9516-bd2c-476f-93c5-74bbbd2caa16\" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/23de9516-bd2c-476f-93c5-74bbbd2caa16/volumes/kubernetes.io~csi/pvc-ecfe
52bd-3b91-408f-8b17-ed29a150647f/mount: directory not empty" numErrs=3
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
Interesting. Maybe that GPU is not compatible?

Are you using the GPU for anything? If not, you could probably disable it in the BIOS just to be sure that TrueNAS doesn't detect and try to utilize it.
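
If you want to check what the device plugin is actually advertising before digging into the BIOS, something like this should show it; gpu.intel.com is the resource prefix the Intel plugin normally registers, so adjust if yours differs:

# Is the iGPU visible to the OS at all?
lspci | grep -iE 'vga|display'
# Does the k3s node advertise an Intel GPU resource from the device plugin?
k3s kubectl describe node | grep -i gpu.intel.com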

Not sure about the remaining issue. Did you reset the apps after disabling GPU?
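
On the remaining orphaned-pod message: if it's the kubelet cleanup issue I've seen reported upstream, the usual workaround is to confirm nothing is still mounted under that pod's volume directory and then remove it by hand. Entirely at your own risk, something like (pod UID and PVC taken from your log):

# Check whether anything is still mounted under the orphaned pod's volumes
mount | grep 23de9516-bd2c-476f-93c5-74bbbd2caa16
ls -la /var/lib/kubelet/pods/23de9516-bd2c-476f-93c5-74bbbd2caa16/volumes/kubernetes.io~csi/
# If nothing is mounted and only the empty mount point is left, removing it
# should stop the repeating error:
# rm -r /var/lib/kubelet/pods/23de9516-bd2c-476f-93c5-74bbbd2caa16/volumes/kubernetes.io~csi/pvc-ecfe52bd-3b91-408f-8b17-ed29a150647f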
 

Chris666

Dabbler
Joined
Aug 1, 2022
Messages
10
Interesting. Maybe that GPU is not compatible?

Are you using the GPU for anything? If not, you could probably disable it in the BIOS just to be sure that TrueNAS doesn't detect and try to utilize it.

Not sure about the remaining issue. Did you reset the apps after disabling GPU?
It eventually got its act together. It downloaded the support for the GPU but failed miserably; it would be nice if it worked, because Jellyfin can use it for transcoding. I am only getting one message now, and that looks to be related to an upstream bug in k3s.

Unrelated to this, but the kernel has issues with the Rocket Lake GPU: you lose console video on boot. This is an issue that has popped up for ages in the Linux world, and the old GRUB nomodeset fix worked. That's not the GPU so much as the video out; it's been a "feature" for decades that pops up now and then. It's separate from the issue here, which appears to be bugs in the Enable GPU code - if the GPU simply did not work, that would be "no support", but no matter what, it should not freak out like this.
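
For anyone who hits the same console blackout, the fix was the usual nomodeset edit; note this is just the stock Debian/GRUB approach, and a SCALE update may well put it back the way it was:

# /etc/default/grub - append nomodeset to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet nomodeset"
# regenerate the grub config and reboot
update-grub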
Thanks!
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
It eventually got its act together. It downloaded the support for the GPU but failed miserably; it would be nice if it worked, because Jellyfin can use it for transcoding. I am only getting one message now, and that looks to be related to an upstream bug in k3s.

I picked up a P400 for transcoding from the server. It definitely is better than without. But, I don't feel it's quite the 100% solution many expect. At least for me, it's still fairly easy to bog the system down transcoding 4K if/when the source file isn't the ideal format, etc. In hindsight, I think a better solution is to use some of the auto-encoding apps available to just have everything in the library converted to ideal formats automatically as it's added.
 

Chris666

Dabbler
Joined
Aug 1, 2022
Messages
10
I picked up a P400 for transcoding from the server. It definitely is better than without. But, I don't feel it's quite the 100% solution many expect. At least for me, it's still fairly easy to bog the system down transcoding 4K if/when the source file isn't the ideal format, etc. In hindsight, I think a better solution is to use some of the auto-encoding apps available to just have everything in the library converted to ideal formats automatically as it's added.
Pretty much the case.
 