Hello,
My TrueNAS (version TrueNAS-12.0-U5.1) server recently started rebooting itself, and I tracked it down to at least one directory that seems to contain corrupt content. Writing to, reading from, or attempting to delete the directory causes a kernel panic, and the server reboots. The directory lives on an SMB share accessed by several Windows 10 PCs and is used for daily backups of one of the workstations. I can keep TrueNAS from rebooting by disconnecting the PC that writes its backups to that directory.
Running a scrub finds checksum errors across a random selection of drives (the affected drives change each run), but after a reboot the checksum counters are back to 0 when I run zpool status or look at the UI. I do, however, see a persistent "3 errors" message and a permanent error reported as "Storage/Share:<0x0>", where Storage is the name of the pool and Share is one of the SMB shares.
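In case the exact commands matter, this is roughly what I'm running to check (pool name Storage as above; as I understand it, the per-device checksum counters reset on reboot, which would explain the zeroes):

```shell
# Show pool health plus the list of files/objects with permanent errors
zpool status -v Storage

# Quick check limited to pools that are not healthy
zpool status -x
```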
I'm looking for any suggestions on how I might be able to remove this directory with the bad files without destroying and recreating the pool. So far, everything I've tried results in a kernel panic and server reboot with the directory remaining. Things I've tried from the TrueNAS shell using "directory" as an example:
Code:
rm -rf ./directory
find . -name "directory" -exec rm -r "{}" \;
Both cause a kernel panic. Attempting to list the directory's contents also panics, as does attempting to delete it from a Windows box over the share.
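One more variant I was considering (untested here, and it may well panic the same way): deleting by inode number instead of by name, in case the name lookup itself is what trips the panic. Sketched below against a throwaway directory just to show the technique; the paths are placeholders, not my real share:

```shell
# Set up a throwaway directory standing in for the problem one
mkdir -p /tmp/demo/directory
cd /tmp/demo

# Grab the directory's inode number without descending into it
inum=$(ls -di ./directory | awk '{print $1}')

# Delete by inode rather than by name; -maxdepth 1 keeps find
# from recursing anywhere else in the tree
find . -maxdepth 1 -inum "$inum" -exec rm -rf {} \;

ls ./directory 2>/dev/null || echo "gone"   # prints "gone"
```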
As for root cause, the only hardware issue I've found was a BIOS message about a multi-bit ECC error. I've tested the memory with MemTest86 Pro, including ECC error injection, at least a half-dozen times with no errors, and I replaced the stick implicated in the BIOS message anyway, just in case. All six drives passed SeaTools long tests and SMART long tests with no errors. The PSU seems stable with consistent voltages, and the system runs on a UPS that halts the server on power loss.
I also tried running in single-user mode and running zdb:
Code:
set vfs.zfs.recover=1
set vfs.zfs.debug=1
boot -s
zpool import -R /mnt -f Storage
zdb -e -bcsvL Storage
Unfortunately, after about 11 hours zdb hit an error and exited, roughly halfway through the scan.
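I was also thinking of pointing zdb at the errored object directly rather than walking the whole pool. My reading (treat it as an assumption) is that the <0x0> in the error message is an object number that can no longer be resolved to a path, here object 0 of the Share dataset:

```shell
# Dump metadata for object 0 in the Storage/Share dataset;
# each additional -d increases the verbosity of the dump
zdb -dddd Storage/Share 0
```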
I have backups of all of the content, and I can access everything except the directory in question, but the pool continues to report as unhealthy because of the permanent errors. Any other ideas on things to try, or is it time to destroy and recreate?
Thank you for any insights or suggestions.