Pool State is Degraded. No errors.

warriorcookie

Explorer
Joined
Apr 17, 2017
Messages
67
Hello, I got a degraded pool alert today. I'm looking for some assistance in figuring out what's wrong before I start throwing parts at the problem.
The affected disk is part of a special vdev mirror for metadata.
I've started a scrub; I'll post the results once it finishes.
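
For reference, I'm just watching scrub progress with plain zpool status (Data1 is the pool named in the alert below):
Code:
zpool status -v Data1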

CRITICAL

Pool Data1 state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
The following devices are not healthy:

  • Disk ATA Samsung SSD 870 ***SERIAL**** is DEGRADED
  • Disk 9160991739036178900 is DEGRADED



iostat:
Code:
# zpool iostat -v
                                          capacity     operations     bandwidth
pool                                    alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
ChaCha                                  14.1T   481G      1      0  75.5K      6
  gptid/f0795930-f96a-11eb-a8d2-005056a3ab54  7.03T   239G      0      0  37.9K      2
  gptid/f25d8c41-f96a-11eb-a8d2-005056a3ab54  7.03T   241G      0      0  37.6K      3
--------------------------------------  -----  -----  -----  -----  -----  -----
Data1                                   11.1T  4.35T     25    110  2.11M  5.36M
  mirror                                2.51T  1.12T      4      4   443K   678K
    gptid/6bc04c74-3407-11eb-8139-005056a3ab54      -      -      2      2   222K   339K
    gptid/de86f85c-3450-11eb-8139-005056a3ab54      -      -      2      2   221K   339K
  mirror                                2.69T   955G      4      4   452K   610K
    gptid/c45a1d29-31cc-11eb-8139-005056a3ab54      -      -      2      2   226K   305K
    gptid/857ea31d-6b80-11ea-b546-000c29306b8a      -      -      2      2   226K   305K
  mirror                                2.79T   868G      3      4   450K   649K
    gptid/de0e1bec-86b1-11ea-8be8-005056a3ab54      -      -      1      2   226K   325K
    gptid/ded789dc-31a0-11eb-8139-005056a3ab54      -      -      1      2   224K   325K
  mirror                                2.42T  1.21T      3      4   457K   724K
    gptid/541d4cd3-3686-11eb-81ca-005056a3ab54      -      -      1      2   229K   362K
    gptid/445e4bfa-3718-11eb-81ca-005056a3ab54      -      -      1      2   227K   362K
special                                     -      -      -      -      -      -
  mirror                                 678G   250G      9     91   354K  2.76M
    gptid/710ffff0-500a-11ec-9d21-005056a3ab54      -      -      4     45   178K  1.38M
    da9                                     -      -      4     45   176K  1.38M
--------------------------------------  -----  -----  -----  -----  -----  -----
boot-pool                               1.42G  30.1G      0      0  1.87K    243
  da0p2                                 1.42G  30.1G      0      0  1.87K    243
--------------------------------------  -----  -----  -----  -----  -----  -----
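
Note that the special mirror lists one member by raw device name (da9) rather than by gptid. On FreeBSD/TrueNAS CORE the gptid labels can be mapped back to device names with glabel, for example using the gptid of the other special vdev member from the output above:
Code:
glabel status | grep 710ffff0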


SMART output (~11 TBW):
Code:
smartctl -a /dev/da11
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 870 EVO 1TB
Serial Number:    ***SERIAL***
LU WWN Device Id: 5 002538 f316058b2
Firmware Version: SVT01B6Q
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Feb 10 14:35:39 2022 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x53) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  85) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1761
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       32
177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       38
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   073   061   000    Old_age   Always       -       27
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       7
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       21932635884

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1761         -
# 2  Short offline       Completed without error       00%      1747         -
# 3  Extended offline    Completed without error       00%      1702         -
# 4  Short offline       Completed without error       00%      1699         -
# 5  Short offline       Completed without error       00%      1651         -
# 6  Short offline       Completed without error       00%      1603         -
# 7  Short offline       Completed without error       00%      1555         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  256        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

warriorcookie

Explorer
Joined
Apr 17, 2017
Messages
67
After the scrub finished, here's my zpool status -v output:

Code:
  pool: Data1
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 60K in 14:02:30 with 0 errors on Fri Feb 11 04:21:17 2022
config:

    NAME                                            STATE     READ WRITE CKSUM
    Data1                                           DEGRADED     0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/6bc04c74-3407-11eb-8139-005056a3ab54  ONLINE       0     0     0
        gptid/de86f85c-3450-11eb-8139-005056a3ab54  ONLINE       0     0     0
      mirror-1                                      ONLINE       0     0     0
        gptid/c45a1d29-31cc-11eb-8139-005056a3ab54  ONLINE       0     0     0
        gptid/857ea31d-6b80-11ea-b546-000c29306b8a  ONLINE       0     0     1
      mirror-2                                      ONLINE       0     0     0
        gptid/de0e1bec-86b1-11ea-8be8-005056a3ab54  ONLINE       0     0     0
        gptid/ded789dc-31a0-11eb-8139-005056a3ab54  ONLINE       0     0     0
      mirror-3                                      ONLINE       0     0     0
        gptid/541d4cd3-3686-11eb-81ca-005056a3ab54  ONLINE       0     0     0
        gptid/445e4bfa-3718-11eb-81ca-005056a3ab54  ONLINE       0     0     0
    special   
      mirror-4                                      DEGRADED     0     0     0
        gptid/710ffff0-500a-11ec-9d21-005056a3ab54  DEGRADED     0     0     1  too many errors
        da8                                         DEGRADED     0     0     2  too many errors

errors: Permanent errors have been detected in the following files:

        /mnt/Data1/iocage/jails/Downloader/root/usr/local/include/libxml2/libxml/catalog.h
        /mnt/Data1/iocage/jails/Downloader/root/usr/local/lib/libxcb-xinerama.a
        /mnt/Data1/iocage/jails/Downloader/root/usr/local/share/doc/libiconv/iconv.3.html
        /mnt/Data1/iocage/jails/Downloader/root/usr/local/lib/liblcms2.a
        /mnt/Data1/iocage/jails/Downloader/root/usr/local/share/gtk-doc/html/harfbuzz/shaping-plans-and-caching.html
        /mnt/Data1/iocage/jails/Downloader/root/usr/local/share/fonts/dejavu/DejaVuSerif-BoldItalic.ttf
        Data1/.system/samba4@auto-2022-01-22_18-00:<0x0>
        Data1/.system/samba4@auto-2022-01-15_16-00:<0x0>
        Data1/.system/samba4@wbc-1641321255:<0x0>
        Data1/.system/syslog-7c35bc62b22f460fb3766e1c156d5c44@auto-2022-01-20_15-00:<0x0>
        Data1/.system/samba4@auto-2022-01-13_22-00:<0x0>
        Data1/.system/samba4@auto-2022-01-21_12-00:<0x0>
        Data1/.system/samba4@auto-2022-02-07_15-00:<0x0>
        Data1/.system/samba4@auto-2022-01-25_21-00:<0x0>
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/src/net/ipsock_posix.go
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/src/syscall/types_freebsd.go
        /mnt/Data1/iocage/jails/chia/root/usr/local/lib/gcc10/include/c++/codecvt
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/src/runtime/os_openbsd.go
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/test/ken/chan.go
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/src/path/match.go
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/src/runtime/print.go
        /mnt/Data1/iocage/jails/chia/root/usr/local/man/man1/git-sparse-checkout.1.gz
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/test/makeslice.go
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/src/internal/bytealg/indexbyte_ppc64x.s
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/src/image/testdata/video-001.q50.422.jpeg
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/src/runtime/sys_openbsd2.go
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/src/runtime/defs_linux_386.go
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/src/os/user/user_test.go
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/src/runtime/signal_aix_ppc64.go
        /mnt/Data1/iocage/jails/chia/root/usr/local/go/test/literal.go
        Data1/.system/samba4@auto-2022-02-01_13-00:<0x0>
        Data1/.system/syslog-7c35bc62b22f460fb3766e1c156d5c44@auto-2022-02-04_19-00:<0x0>
        Data1/.system/samba4@auto-2022-01-20_11-00:<0x0>
        Data1/.system/samba4@auto-2022-01-13_15-00:<0x0>
        Data1/.system/samba4@auto-2022-01-25_16-00:<0x0>


Any recommendations on next steps?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Both of your special VDEV drives have checksum errors, as does one of your data mirror drives. Several files have already been damaged. Make sure your pool has backups, as I think you'll have to rebuild it from backup with better SSDs for your special VDEV.
 

warriorcookie

Explorer
Joined
Apr 17, 2017
Messages
67
Both of your special VDEV drives have checksum errors, as does one of your data mirror drives. Several files have already been damaged. Make sure your pool has backups, as I think you'll have to rebuild it from backup with better SSDs for your special VDEV.
Thank you.

I have backups, yes.

How can I be certain it's the SSDs and not another piece of hardware, i.e. cables, HBA, etc.?
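
One thing I can at least keep an eye on is the UDMA_CRC_Error_Count attribute (it's 0 in the SMART output above); as I understand it, a cabling or link problem would usually bump that counter, so I'll re-check it with something like:
Code:
smartctl -A /dev/da11 | grep -i crc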
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399

warriorcookie

Explorer
Joined
Apr 17, 2017
Messages
67
The damaged files are the clincher.
Understood. I can restore the files; none of them are really important anyway.

But before I go and buy new SSDs, I'd like to verify it's actually the SSDs at fault. I don't see any issues in the SMART data. How can I drill down to verify?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
It's not just damaged files. You have damaged metadata as well, which is exactly what a special VDEV stores. Those are the entries that look like:

Data1/.system/samba4@auto-2022-01-22_18-00:<0x0>

As for SMART, the only thing that stands out is the Total_LBAs_Written field, with almost 22 billion blocks written. I don't think a consumer drive outside of Optane is built for that kind of write volume. An enterprise or data center SSD could handle it.
 

warriorcookie

Explorer
Joined
Apr 17, 2017
Messages
67
So I shut down, swapped cables between two drives, started up again, and ran another scrub. The checksum errors persisted, but the file errors are now all gone....

I did a zpool clear and will monitor for now.

Is there a way to restore just the metadata from a snapshot, or do I have to roll back the entire snapshot?

As for SMART, the only thing that stands out is the Total_LBAs_Written field, with almost 22 billion blocks written. I don't think a consumer drive outside of Optane is built for that kind of write volume. An enterprise or data center SSD could handle it.
The Total LBAs written equates to a little over 11 TB. The drive is warrantied for 600 TBW. That's not even 2%.... Am I missing something?
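
Rough math, assuming the 512-byte logical sectors reported in the smartctl output above:
Code:
# 21,932,635,884 LBAs x 512 bytes ~= 11.2 TB written
echo "21932635884 * 512 / 1000^4" | bc -l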
 

warriorcookie

Explorer
Joined
Apr 17, 2017
Messages
67
I put the lid back on the server and went to bed. An hour later there's a bunch of checksum errors and a couple of corrupted files. None of them are the files from last time; they're new. I've taken the lid off and am running a scrub again to see if the corrupted files get fixed again.

Possibly heat related. HBAs don't report die temperature, do they? My PCH heatsink is right beside the Broadcom heatsink on the motherboard and it's reporting 60C. I'll try repasting the heatsinks next.
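
In the meantime I'm just watching the board and CPU sensors; something like this works here, assuming ipmitool and the coretemp driver are available:
Code:
# BMC sensor readings (needs a working BMC and ipmitool installed)
ipmitool sdr type temperature
# CPU die temperatures via the coretemp driver
sysctl dev.cpu | grep temperature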
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
What is the history of the special vdev?
Did you manually replace/attach one of the drives at some point (since it is not referenced by gptid)?
 

warriorcookie

Explorer
Joined
Apr 17, 2017
Messages
67
What is the history of the special vdev?
Did you manually replace/attach one of the drives at some point (since it is not referenced by gptid)?
Yes, I manually added 2x 1 TB EVO 860s as a special vdev using the CLI once TrueNAS added support. Sometime later I bought 2x 1 TB EVO 870s for a Proxmox project, but first I swapped one of the 860s out of the special vdev so I didn't have two mirrored devices with the exact same number of hours.
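
I most likely pointed zpool replace straight at the raw device, which would explain why it shows up as da8/da9 instead of by gptid. Roughly something like the first variant below, where the second variant would have kept a gptid label (the gptids shown are placeholders, not the real ones):
Code:
# Replace by raw device (what the status output suggests happened)
zpool replace Data1 gptid/<old-partition-gptid> da9

# Replace by a GPT partition instead, which keeps a gptid label in the pool
gpart create -s gpt da9
gpart add -t freebsd-zfs da9
gpart list da9 | grep rawuuid
zpool replace Data1 gptid/<old-partition-gptid> gptid/<new-partition-gptid>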
 

warriorcookie

Explorer
Joined
Apr 17, 2017
Messages
67
This sure seems heat related. My FLIR was saying the heatsink for the Broadcom controller was 85C, but the PCH heatsink beside it was 55C. I put a 40mm fan on it and the scrub has completed with no errors. Any files reported damaged previously are no longer listed under zpool status.

I'm going to re-paste the heatsink and make a more permanent solution for the fan. We'll see if the issue is still resolved after a couple of days.
 

warriorcookie

Explorer
Joined
Apr 17, 2017
Messages
67
So I re-pasted the Broadcom HBA heatsink along with the PCH heatsink and mounted a little 30mm blower-style fan. Everything seems to be running fine; multiple scrubs with no errors.

Is there a way to scan for errors in the metadata?
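
For what it's worth, the closest thing I've found so far is just another scrub plus the status output, since my understanding is that a scrub reads every allocated block, metadata included:
Code:
zpool scrub Data1
zpool status -v Data1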
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702

warriorcookie

Explorer
Joined
Apr 17, 2017
Messages
67
You should probably look into the articles about SSD drive burnout and chia mining...

One example...
Appreciate the info; I'm well aware. All plotting happens on the array of spinning disks, in a dataset that has the metadata (special) small block size set to 0. The node DB is downloaded to the same dataset to prevent constant re-writes to the SSDs every time the block height increases.

The only part of the chia mining that goes anywhere near the SSDs is the metadata for the jail.
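
For reference, that setting can be confirmed per dataset like this (the dataset name here is just an example, not my actual one):
Code:
# 0 means no small file blocks are redirected to the special vdev;
# only pool metadata lands there
zfs get special_small_blocks Data1/plots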
 