SOLVED Healthy Disks, Unhealthy Pool, No CKSUM Errors Following 6-Drive Upgrade

SlackerDude

Explorer
Joined
Feb 1, 2014
Messages
76
Hi, I need some fresh eyes on the data I have collected, and hopefully some advice on how to clear the status of my pool. I recently finished upgrading my 6-drive raidz2 pool from 4 TB to 16 TB drives, one drive at a time. After the final resilver completed, every drive showed as healthy, and so did the pool. However, after approximately two weeks the pool now reports as unhealthy.
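For context, each of the six swaps was the usual replace-and-resilver cycle, which from the shell amounts to roughly the following (the gptid arguments are placeholders, not my exact command history):
Code:
# repeated once per drive: attach the new 16TB disk, then replace
zpool replace media <old-gptid> <new-gptid>
zpool status media            # wait for the resilver to finish before the next swap
# after the sixth swap, expand into the new capacity
zpool online -e media <new-gptid>   # per device, unless autoexpand=on was set beforehand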

Let's get the preliminary stuff out of the way. This is my system:
  • Case: LIAN LI PC-Q25B
  • Power Supply: CORSAIR CX430M 430W 80 PLUS BRONZE
  • Motherboard: ASRock FM2A88X-ITX+
  • CPU: AMD A6-5400K
  • CPU Cooler: ZALMAN CNPS8900
  • RAM: 16GB DDR3
  • Boot Drive: 2 x SanDisk 16GB SDCZ36-016G-B35
  • Storage Drive: 6 x Seagate Exos X16 16 TB ST16000NM001G
  • NIC: Intel EXPI9301CTBLK 10/100/1000Mbps
  • OS: TrueNAS-13.0-U2
I ran zpool status -v and got the following result:
Code:
# zpool status -v media
  pool: media
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 06:22:50 with 0 errors on Fri Sep  2 06:22:53 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        media                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/72f3d3ca-e412-11ec-baa7-6805ca1c24be  ONLINE       0     0     0
            gptid/6a12613a-e4e4-11ec-baa7-6805ca1c24be  ONLINE       0     0     0
            gptid/14444d72-e5a0-11ec-8722-6805ca1c24be  ONLINE       0     0     0
            gptid/156ef9cc-e669-11ec-90c2-6805ca1c24be  ONLINE       0     0     0
            gptid/9118b1f0-24b6-11ed-8a64-6805ca1c24be  ONLINE       0     0     0
            gptid/11f76a0d-254a-11ed-8b18-6805ca1c24be  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xffffffffffffffff>:<0x0>


I do not know where this file might be found, or what exactly to do with it. I have scrubbed the pool multiple times, yet the condition persists. I have also run smartctl -a on each drive and attached the collected output as a text file:
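For completeness, the cycle I have been repeating looks roughly like this (da0 through da5 are placeholders for my six disks; device names will differ per system):
Code:
# scrub, then re-check status once it finishes; repeated several times with no change
zpool scrub media
zpool status -v media

# SMART data gathered from each drive for the attached file
for d in da0 da1 da2 da3 da4 da5; do
    smartctl -a /dev/$d
done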


I am hoping someone can recommend the next steps to take to resolve this.

Thanks,
 

Attachments

  • SlackerDude_Output.txt
    44.3 KB · Views: 105

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
That's not a file, that's pool metadata, and judging by the address, that could be a critical TXG. The only way I know to recover from this is to recreate the pool from backup.
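If you want independent confirmation that the damage is in pool metadata before tearing everything down, a read-only zdb traversal should surface it; a rough sketch, with the caveats that it can run for hours on a pool this size and that zdb output on an imported pool is not always consistent:
Code:
# traverse all blocks and verify metadata checksums (read-only)
# on TrueNAS CORE, zdb may need to be pointed at the system cachefile
zdb -U /data/zfs/zpool.cache -bc media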
 

SlackerDude

Explorer
Joined
Feb 1, 2014
Messages
76
That's not a file, that's pool metadata, and judging by the address, that could be a critical TXG. The only way I know to recover from this is to recreate the pool from backup.
Thank you, Samuel. I was afraid that might be the answer. Unfortunately, I have no pool backup, only configs. I will begin copying the data to a series of external drives and then rebuild. Maybe I will go with SCALE this time, as the better plugin support seems to lie in that direction these days.
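For anyone in the same spot, the copy-out can be done with replication, assuming the damaged metadata doesn't get in the way; a minimal sketch, assuming the external drive holds a pool named backup:
Code:
# snapshot everything, then replicate the whole pool to the external drive
zfs snapshot -r media@migrate
zfs send -R media@migrate | zfs recv -F backup/media

If zfs send trips over the corrupted blocks, a plain file-level copy (e.g. rsync per dataset) tends to be more forgiving, since it only touches the files that are still readable.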
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
@Samuel Tai, how can such an error come into existence?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
@ChrisRJ, OP's system doesn't have ECC memory. The only scenario I can think of is a random bit flip in RAM during the resilvers.
 

SlackerDude

Explorer
Joined
Feb 1, 2014
Messages
76
@ChrisRJ, OP's system doesn't have ECC memory. The only scenario I can think of is a random bit flip in RAM during the resilvers.
I knew I was accepting the risk when I built the system with non-ECC RAM and without adequate backup storage. Such is the case when working with limited funds yet forging ahead in spite of it. The responsibility is mine, and one I hope to rectify before too much longer. Thanks again for the speedy response, Samuel. Perhaps others will read this thread and learn from it (LOL).
 