Directory on SMB share causing kernel panic (how to delete?)

bluesky

Cadet
Joined
Sep 10, 2021
Messages
3
Hello,

My TrueNAS server (version TrueNAS-12.0-U5.1) recently started rebooting itself, and I tracked it down to at least one directory that seems to have some corrupt content in it. Writing to, reading from, or attempting to delete the directory causes a kernel panic, and the server reboots. This is on an SMB share that is accessed by several Windows 10 PCs. The directory in question is used for daily backups of one of the workstations. I can keep TrueNAS from rebooting by disconnecting the PC that was writing its backups to the directory in question.

Running a scrub finds checksum errors across a random selection of drives, and the affected drives change each time. After a reboot, the checksum counts are back to 0 when I run zpool status or look at the pool in the UI. I do see a persistent "3 errors" message and a permanent error in "Storage/Share:<0x0>". Storage is the name of the pool and Share is one of the SMB shares.

I'm looking for any suggestions on how I might be able to remove this directory with the bad files without destroying and recreating the pool. So far, everything I've tried results in a kernel panic and server reboot with the directory remaining. Things I've tried from the TrueNAS shell using "directory" as an example:

Code:
rm -rf ./directory
find . -name "directory" -exec rm -r "{}" \;


Both cause a kernel panic. Attempting to list the directory contents also panics, as does attempting to delete the directory from a Windows machine through the share.

With respect to root cause, the only hardware issue I've found was a BIOS message about a multi-bit ECC error. I've tested the memory with memtest86 pro, with ECC error injection at least a half-dozen times with no errors. I replaced the stick implicated in the BIOS message anyway, just in case. The 6 drives passed Seatools long tests and SMART long tests with no errors. PSU seems stable with consistent voltages and the system runs on a UPS that halts the server in the case of power loss.

I also tried running in single-user mode and running zdb:

Code:
set vfs.zfs.recover=1
set vfs.zfs.debug=1 
boot -s 
zpool import -R /mnt -f Storage
zdb -e -bcsvL Storage


Unfortunately, after about 11 hours zdb ran into an error and exited (about halfway through the scan).

I have backups of all of the content, and I can access everything except the directory in question, but the pool continues to report as unhealthy due to the permanent errors. Any other ideas on things to try, or is it time to destroy and recreate?

Thank you for any insights or suggestions.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
To separate the various aspects (ZFS and Samba) I would create a new directory and point the Samba share to this one. Then further testing should be able to narrow down the root-cause.
 

bluesky

Cadet
Joined
Sep 10, 2021
Messages
3
Thank you for the reply. I'm not sure I completely understand what you are suggesting, do you mind clarifying for me?

Let's say that the mount point for the share is /mnt/Storage/Share and the share name is "Share." Are you recommending (Option #1) that I create a new share called "Share2" for example and point it to the same mount point: /mnt/Storage/Share or (Option #2) that I create a new mount point /mnt/Storage/Share2 and point "Share" to it or (Option #3) create a new share and mount point, or (Option #4) something different?

Once I complete that, what additional testing do you recommend?

I figured that, since I was able to trigger the kernel panic by accessing the directory from the TrueNAS CLI, it was a filesystem issue rather than a Samba one. Is that assumption incorrect? I read about some issues with ZFS where the file mode could be set to something invalid, which would cause panics, but since I can't see the contents of the directory in question I can't verify whether that happened in this case. I was hoping I could just delete the files to remove that as a potential issue.

I do have the pool upgraded to the latest version available in TrueNAS.

Thank you again.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I meant to create a new directory (/mnt/Storage/Share2) and re-configure the Samba share to point there.

In terms of testing: yes, after what you wrote, it should be a file system issue. But as the saying goes, "assuming means not knowing". It is all about decomposing a symptom and narrowing things down. Also, you may just want to find a way around this and, once you have achieved that, not care at all about finding the root cause. But only you can decide that ...
 

bluesky

Cadet
Joined
Sep 10, 2021
Messages
3
Just in case someone stumbles across this thread I will describe how I "resolved" my situation.

My goal was to avoid destroying and recreating my pool and to recover as much of the data as possible from the current pool rather than going to backup. I also determined that all of the file errors were present in just one of my pool's datasets.

What I did:

1. Created a new dataset in the current pool
2. Copied all of the data that I could from the problem dataset to the new dataset using the TrueNAS command line and "cp -rp" to maintain file permissions.
3. I had to ignore the known problem directory during the copy because it would cause a kernel panic and reboot the server.
4. Once the files were copied, I deleted the dataset that contained the problem files.
5. Ran a "zpool clear <poolname>" to clear the old errors. Ran a scrub and saw no errors.
6. Pointed the old SMB share name to the new dataset so client PC network drive mappings continued to work.
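
For anyone who wants to script it, steps 1 to 5 could look roughly like the sketch below. This is not the exact command history from my server. It assumes the pool is called Storage, the problem dataset is Storage/Share mounted at /mnt/Storage/Share, and the problem directory sits at the top level of that dataset. "Share2" and "directory" are placeholder names, and the run and migrate helpers are made up for this example. By default it only prints the commands (dry run) so they can be reviewed first.

```shell
#!/bin/sh
# Sketch of steps 1-5 above. Every name here is a placeholder or an assumption;
# adjust before use. DRY_RUN=1 (the default) only prints the commands.
POOL="${POOL:-Storage}"
OLD_DS="${OLD_DS:-Storage/Share}"      # dataset holding the problem directory
NEW_DS="${NEW_DS:-Storage/Share2}"     # hypothetical name for the new dataset
SRC="${SRC:-/mnt/Storage/Share}"       # assumed mount point of OLD_DS
DST="${DST:-/mnt/Storage/Share2}"      # assumed mount point of NEW_DS
BAD_DIR="${BAD_DIR:-directory}"        # placeholder name, as in the first post
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

migrate() {
    # 1. New dataset in the same pool.
    run zfs create "$NEW_DS"

    # 2-3. Copy each top-level entry, skipping the problem directory by name
    # so that it is never opened. The extra patterns pick up dotfiles.
    for entry in "$SRC"/* "$SRC"/.[!.]* "$SRC"/..?*; do
        name=${entry##*/}
        [ "$name" = "$BAD_DIR" ] && continue
        # Skip glob patterns that matched nothing.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        run cp -Rp "$entry" "$DST/"
    done

    # 4. Destroy the old dataset (irreversible; -r also removes child
    # datasets and snapshots).
    run zfs destroy -r "$OLD_DS"

    # 5. Clear the old error counters, start a scrub, and check the result.
    # The scrub runs in the background, so re-run the status command later.
    run zpool clear "$POOL"
    run zpool scrub "$POOL"
    run zpool status -v "$POOL"
}

migrate
```

Run it as-is to see the command list, and set DRY_RUN=0 only once the names match your system and the copy has been verified. If the problem directory is nested deeper, the loop would need to skip it at that level instead. Step 6 (re-pointing the share) is done in the TrueNAS web UI and is not covered here.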

So far no further issues...
 