[SOLVED] Healthy Disks, Unhealthy Pool, No CKSUM Errors Following 6-Drive Upgrade

SlackerDude

Explorer
Joined
Feb 1, 2014
Messages
76
Hi, I need some fresh eyes on the data I have collected, and hopefully some advice on what I need to do to clear the status of my pool. I recently completed the upgrade of my 6-drive 4TB raidz2 pool to six 16TB drives, replacing one drive at a time and resilvering after each swap. After the final resilver completed, every drive showed as healthy, as did the pool. However, after approximately two weeks the pool now shows as unhealthy.
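For context, each swap followed the usual offline-replace-resilver cycle, roughly like this (the gptid names below are placeholders, not my actual ones):
Code:
# One pass per drive; gptid names are placeholders.
zpool offline media gptid/OLD-PARTITION-GPTID
# ...swap the 4TB drive for a 16TB drive, partition it, then:
zpool replace media gptid/OLD-PARTITION-GPTID gptid/NEW-PARTITION-GPTID
# Wait for "resilvered ... with 0 errors" before starting the next drive.
zpool status media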

Let's get the preliminary stuff out of the way. This is my system:
  • Case: LIAN LI PC-Q25B
  • Power Supply: CORSAIR CX430M 430W 80 PLUS BRONZE
  • Motherboard: ASRock FM2A88X-ITX+
  • CPU: AMD A6-5400K
  • CPU Cooler: ZALMAN CNPS8900
  • RAM: 16GB DDR3
  • Boot Drive: 2 x SanDisk 16GB SDCZ36-016G-B35
  • Storage Drive: 6 x Seagate Exos X16 16 TB ST16000NM001G
  • NIC: Intel EXPI9301CTBLK 10/100/1000Mbps
  • OS: TrueNAS-13.0-U2
I ran zpool status -v and got the following result:
Code:
# zpool status -v media
  pool: media
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 06:22:50 with 0 errors on Fri Sep  2 06:22:53 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        media                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/72f3d3ca-e412-11ec-baa7-6805ca1c24be  ONLINE       0     0     0
            gptid/6a12613a-e4e4-11ec-baa7-6805ca1c24be  ONLINE       0     0     0
            gptid/14444d72-e5a0-11ec-8722-6805ca1c24be  ONLINE       0     0     0
            gptid/156ef9cc-e669-11ec-90c2-6805ca1c24be  ONLINE       0     0     0
            gptid/9118b1f0-24b6-11ed-8a64-6805ca1c24be  ONLINE       0     0     0
            gptid/11f76a0d-254a-11ed-8b18-6805ca1c24be  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xffffffffffffffff>:<0x0>


I do not know where this file might be found, or what, exactly, to do with it. I have run smartctl -a on each drive and attached the collected output as a text file. I have also scrubbed the pool multiple times, yet the condition persists. As I understand it, the error list should clear after a sequence along these lines, but the entry keeps coming back:
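Code:
# Reset the error counters, then force a full re-read of every block.
zpool clear media
zpool scrub media
# After the scrub completes, check whether the error list is empty.
zpool status -v media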


I am hoping someone will be able to recommend the next steps to take to resolve this.

Thanks,
 

Attachments

  • SlackerDude_Output.txt
    44.3 KB

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
That's not a file; it's pool metadata, and judging by the address, it could be a critical TXG. The only way I know to recover from this is to recreate the pool from backup.
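For reference, zpool status -v reports each damaged object as a <dataset>:<object> pair. When ZFS can map that pair back to a filename, it prints the path; an all-ones dataset ID like yours cannot be mapped to any dataset, which is what points at pool-level metadata. Roughly (the path below is made up for illustration):
Code:
# Resolvable case: the damaged object is a file in a dataset.
errors: Permanent errors have been detected in the following files:
        /mnt/media/movies/example.mkv

# Unresolvable case: the dataset ID is -1 (all ones), i.e. not a file in
# any dataset but an object in the pool's own metadata.
        <0xffffffffffffffff>:<0x0>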
 

SlackerDude

Explorer
Joined
Feb 1, 2014
Messages
76
That's not a file; it's pool metadata, and judging by the address, it could be a critical TXG. The only way I know to recover from this is to recreate the pool from backup.
Thank you, Samuel. I was afraid that might be the answer. Unfortunately, I have no pool backup, only configs. I will begin the process of copying data to a series of external drives and then rebuild. Maybe I will go with SCALE this time, as better plugin support seems to lie in that direction these days.
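My rough plan for the copy (dataset and mount-point names below are placeholders): plain file-level rsync rather than zfs send, since I would rather not have a send stream abort partway if it trips over the damaged metadata.
Code:
# Copy one dataset at a time, so a read error on damaged data
# doesn't abort the whole run. All paths are placeholders.
rsync -avh --progress /mnt/media/movies/ /mnt/backup1/movies/
rsync -avh --progress /mnt/media/music/  /mnt/backup2/music/
# Optional paranoia pass: re-compare with checksums (-c) before
# destroying the pool; --dry-run only reports differences.
rsync -avhc --dry-run /mnt/media/movies/ /mnt/backup1/movies/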
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
@Samuel Tai, how can such an error come into existence?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
@ChrisRJ, OP's system doesn't have ECC memory. The only scenario I can think of is a random bit flip in RAM during the resilvers.
 

SlackerDude

Explorer
Joined
Feb 1, 2014
Messages
76
@ChrisRJ, OP's system doesn't have ECC memory. The only scenario I can think of is a random bit flip in RAM during the resilvers.
I knew I was accepting the risk when I built the system without ECC RAM or adequate backup storage. Such is the case when working with limited funds yet forging ahead in spite of it. The responsibility is mine, and one I hope to rectify before too much longer. Thanks again for the speedy response, Samuel. Perhaps others will read this thread and learn from it (LOL).
 