Degraded pool

MMrrTT

Cadet
Joined
Oct 11, 2023
Messages
5
Hello.

I have a two-disk mirror pool. One disk went offline after a reboot because of a power cable problem. I brought it back online, but the pool is still not healthy because that disk is marked degraded.

I believe the disk itself is fine, but I ran a full scrub anyway, and a long S.M.A.R.T. test is now in progress. How do I get rid of the degraded state? With the "zpool replace" command, or is there an easier way?

Code:
root@TrueNAS[~]# smartctl -a /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.55-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Ultrastar DC HC550
Device Model:     WUH721818ALE6L4
Serial Number:    5DG7W4VJ
LU WWN Device Id: 5 000cca 2b9c3948a
Add. Product Id:  202211
Firmware Version: PCGAW660
User Capacity:    18,000,207,937,536 bytes [18.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-4 published, ANSI INCITS 529-2018
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov  8 05:27:20 2023 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1911) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.


SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   001    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   148   148   054    Pre-fail  Offline      -       48
  3 Spin_Up_Time            0x0007   083   083   001    Pre-fail  Always       -       347 (Average 343)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   001    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       15
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       9260
 10 Spin_Retry_Count        0x0013   100   100   001    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       26
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
 82 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       255
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       414
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       414
194 Temperature_Celsius     0x0002   057   057   000    Old_age   Always       -       37 (Min/Max 16/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       29095913930
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       220335601709


SMART Error Log Version: 1
No Errors Logged


SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Please post the output of the following command, in CODE tags please:
zpool status -v

It is probable that your pool now consists of a single disk, because the other was "lost".

If the disk is shown OFFLINE, you can online it and it will re-sync.

But if the disk is missing, then you need to replace it with itself. That will cause a full re-sync, which takes longer than the above, but is probably necessary.
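In command form, the two cases look like this (a sketch; "mypool" and "da1" are placeholder names — substitute your own pool name and the disk name or GPTID shown in your "zpool status" output):

```shell
# Case 1: the disk is still a pool member but shown OFFLINE.
# Bringing it online resilvers only the data written while it was away.
zpool online mypool da1

# Case 2: the disk is missing/UNAVAIL. Replacing it with itself
# forces a full resilver of that disk.
zpool replace mypool da1

# Either way, watch resilver progress here.
zpool status mypool
```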
 

MMrrTT

Cadet
Joined
Oct 11, 2023
Messages
5
Both are online, but there's a checksum error, even though I didn't write anything to that pool while the disk was offline.

Code:
root@TrueNAS[~]# zpool status -v
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:07 with 0 errors on Wed Nov  8 03:45:08 2023
config:


        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvme0n1p3  ONLINE       0     0     0


errors: No known data errors


  pool: mirr.pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 08:24:41 with 0 errors on Wed Nov  8 05:08:04 2023
config:


        NAME                                      STATE     READ WRITE CKSUM
        mirr.pool                                 ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            d3433001-1d18-4dda-b4f1-5b4cdf017bdf  ONLINE       0     0     1
            f73e3831-127f-4d09-a5f7-21e7a1397328  ONLINE       0     0     0


errors: No known data errors


  pool: stripe
 state: ONLINE
  scan: scrub repaired 0B in 15:59:17 with 0 errors on Sun Nov  5 14:59:18 2023
config:


        NAME                                    STATE     READ WRITE CKSUM
        stripe                                  ONLINE       0     0     0
          26a56564-6b1a-43be-b265-0c4278b36c45  ONLINE       0     0     0
          72717500-6907-4ba3-b2c1-830c61940728  ONLINE       0     0     0


errors: No known data errors
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
That single checksum error is reasonably harmless. It is just ZFS telling you it found a problem and automatically corrected it.

The proof is this line:
errors: No known data errors

If this is the first error on that disk, make a note of it, clear the error, and make sure you run ZFS scrubs of your pools at least monthly, plus SMART tests. Here is the command to clear the error:
zpool clear mirr.pool
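Outside of the TrueNAS GUI scheduler (which is the usual place to set these up), the equivalent manual commands are sketched below (the pool and device names match the output posted earlier in this thread; adjust for your own disks):

```shell
# Start a scrub of the data pool manually.
zpool scrub mirr.pool

# Kick off a long SMART self-test on a member disk.
smartctl -t long /dev/sdb

# Later, review the self-test results.
smartctl -l selftest /dev/sdb
```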
 

MMrrTT

Cadet
Joined
Oct 11, 2023
Messages
5
Thank you. I will wait until the long S.M.A.R.T. test finishes and then clear the error.
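For reference, the progress of the running long test can be checked like this (the device name matches the smartctl output posted above):

```shell
# Show the self-test execution status, including percentage remaining.
smartctl -a /dev/sdb | grep -A 1 "Self-test execution"

# Once it completes, the result appears in the self-test log.
smartctl -l selftest /dev/sdb
```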
 