Troubleshooting Unhealthy ZPool Status

FreeNASUser5129 · Apr 13, 2022

Hi Friends,

I am running TrueNAS Core 12.0-U8.

I recently needed to replace one of my hard disks within my zpool. After replacing and resilvering, my zpool status says Unhealthy.

I ran

Code:

zpool status -x

I saw that it reports ZFS-8000-8A as the error, and I looked up the documentation page for that message.

Based on that page I ran

Code:

zpool status -xv

with this result:

Code:

root@freenas:~ # zpool status -xv
pool: Volume1
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 06:57:34 with 15 errors on Sat Apr  9 14:07:52 2022
config:

        NAME                                                STATE     READ WRITE CKSUM
        Volume1                                             ONLINE       0     0     0
          raidz1-0                                          ONLINE       0     0     0
            gptid/7a341d28-0344-11e5-8c94-002590dbbf1d.eli  ONLINE       0     0    47
            gptid/84901eb6-067c-11e3-8226-d43d7e93e546.eli  ONLINE       0     0    47
            gptid/855049fe-067c-11e3-8226-d43d7e93e546.eli  ONLINE       0     0    47
            gptid/6d5b0978-aefb-11ec-a75a-002590dbbf1d.eli  ONLINE       0     0    47
            gptid/b52424a6-a670-11e5-a798-002590dbbf1d.eli  ONLINE       0     0    47
            gptid/e5c9026a-56f7-11e5-8bb4-002590dbbf1d.eli  ONLINE       0     0    47

errors: Permanent errors have been detected in the following files:

        /var/db/system/syslog-92e6ac8b6342418e99afb49a8441c54c/log/debug.log
        /var/db/system/syslog-92e6ac8b6342418e99afb49a8441c54c/log/console.log
        /var/db/system/syslog-92e6ac8b6342418e99afb49a8441c54c/log/messages
        /var/db/system/syslog-92e6ac8b6342418e99afb49a8441c54c/log/daemon.log
        /var/db/system/syslog-92e6ac8b6342418e99afb49a8441c54c/log/middlewared.log
        /var/db/system/rrd-92e6ac8b6342418e99afb49a8441c54c/localhost/disktemp-ada1/temperature.rrd
        /var/db/system/rrd-92e6ac8b6342418e99afb49a8441c54c/localhost/disktemp-ada3/temperature.rrd

I have done several pool scrubs, and I have run each of the 6 disks through Manual LONG S.M.A.R.T. tests. Everything completes successfully.

The action of 'destroying the pool and re-creating from a backup' seems risky, and I'm not confident I could carry it out successfully. I'm hoping someone can provide some guidance on what I should do to resolve the unhealthy status.

Thanks very much.

Alecmascot · Apr 13, 2022

You should put your command output in code tags to make it easier to read.
Putting that aside, all your disks are reporting checksum errors, which are usually caused by a cabling issue.

FreeNASUser5129 · Apr 13, 2022

Thanks for the tip. I've edited the post to put the command output into code tags.

Does it make sense to shut down the server and physically re-seat the drives themselves? I have a hot swap cage for the HDDs, so I didn't open anything up to replace the drive.

FreeNASUser5129 · Apr 13, 2022

I have re-seated all of the drives and restarted - but sadly it has not changed the status.

sretalla · Apr 13, 2022

FreeNASUser5129 said:
Does it make sense to shut down the server and physically re-seat the drives themselves?

If the checksum error counts are still going up, yes. Otherwise (and as you may have found while I was typing), no.

zpool clear Volume1 will reset the counters

The simplest solution may be to just move the system dataset off that pool, make sure all those files are gone, then put it back.
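
One rough way to tell whether the checksum counts are still going up after a zpool clear is to total the CKSUM column from zpool status, then do it again after a scrub or a few hours of use and compare. The sketch below is only that: sum_cksum is a made-up helper name, it assumes the column layout shown in the first post (NAME STATE READ WRITE CKSUM), and it ignores rows with extra trailing text or abbreviated counts such as 1.2K.

```shell
#!/bin/sh
# sum_cksum: hypothetical helper (not part of ZFS or TrueNAS).
# Reads `zpool status` text on stdin and prints the total of the CKSUM
# column across the rows of the config table.
sum_cksum() {
    awk '
        # Header row of the config table.
        $1 == "NAME" && $NF == "CKSUM" { in_cfg = 1; next }
        # The "errors:" line marks the end of the table.
        in_cfg && $1 == "errors:" { in_cfg = 0 }
        # Pool, vdev and disk rows with a plain numeric CKSUM value.
        in_cfg && NF == 5 && $5 ~ /^[0-9]+$/ { total += $5 }
        END { print total + 0 }
    '
}

# Usage on a live system (pool name taken from this thread):
#   zpool status Volume1 | sum_cksum
```

If the number stays at 0 after a scrub, the earlier errors were probably a one-off; if it climbs again, cabling or the backplane is still worth a look.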

FreeNASUser5129 · Apr 13, 2022

I have reset the counters.

Any chance you could help me with moving the system dataset and getting rid of the files? This begins to push the boundaries of my knowledge on how to work on this (I'm a home user).

sretalla · Apr 13, 2022

In the GUI, go to System | System Dataset and select a pool other than Volume1 (the boot pool is fine temporarily).

Then, at the Shell, cd to the location(s):

cd /var/db/system/syslog-92e6ac8b6342418e99afb49a8441c54c/log/

Run ls -l to see what's there, then remove the file(s):

rm debug.log

or skip the cd if you don't care for it:
rm /var/db/system/syslog-92e6ac8b6342418e99afb49a8441c54c/log/debug.log
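
With several damaged files it can be easier to pull the list out of zpool status -v than to retype each path. A minimal sketch, assuming the output layout shown in the first post: list_damaged is a hypothetical helper name, it only picks up entries that are absolute paths, and the example loop just prints what it would remove so the list can be checked before anything is deleted.

```shell
#!/bin/sh
# list_damaged: hypothetical helper (not part of ZFS or TrueNAS).
# Reads `zpool status -v` text on stdin and prints each absolute path
# listed under "errors: Permanent errors ... following files:".
list_damaged() {
    awk '
        # A new pool section ends any file list in progress.
        /^[ \t]*pool:/ { in_list = 0 }
        # Start of the damaged-file list.
        /^[ \t]*errors: Permanent errors/ { in_list = 1; next }
        # Indented absolute paths inside the list; strip the indentation.
        in_list && /^[ \t]+\// { sub(/^[ \t]+/, ""); print }
    '
}

# Dry run on a live system (pool name taken from this thread):
#   zpool status -v Volume1 | list_damaged | while IFS= read -r f; do
#       printf 'would remove: %s\n' "$f"
#   done
```

Once the printed list looks right, the printf line can be swapped for an rm on "$f".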

FreeNASUser5129 · Aug 27, 2022

Hey guys - I'm finally back to troubleshooting this - thanks again for all of the replies so far. I've done as suggested above - moved the System Dataset to my Boot pool and removed the debug logs. There were 2 files there, debug.0 and debug.1 - so I removed both.

Now my pool state says Degraded, and TrueNAS tells me that I can't move the System Dataset back to Volume1 because it's encrypted.

What's my best next course of action, please, @sretalla?
