Is it time to replace the disks? Help needed

KpuCko

Dabbler
Joined
Jun 20, 2019
Messages
48
Hello guys,
I've been using TrueNAS for the last two years, so I'm still fairly new to this system. A few weeks ago I noticed that two of the disks started showing read errors. Because of this the pool goes into an unhealthy state, but after a reboot it shows as healthy again.

[Attached screenshot: 1652099460602.png]


I know that ZFS will try to self-heal the filesystem, but I'm not sure how to handle this situation.
For instance, in the past I used QNAP, and when there was an issue with a disk it simply marked the disk as offline, so you were required to replace it.

Here I only have these errors reported by S.M.A.R.T. and nothing else. I also checked the disks with CLI commands, following this article: https://www.thomas-krenn.com/en/wiki/Analyzing_a_Faulty_Hard_Disk_using_Smartctl but I cannot see anything that I should worry about.

I have none of these:

Code:
1 Raw_Read_Error_Rate  
5 Reallocated_Sector_Ct 
7 Seek_Error_Rate  
196 Reallocated_Event_Count
197 Current_Pending_Sector     
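
The check against this list can also be scripted. Below is a minimal Python sketch, assuming the text of `smartctl -a` has already been captured as a string. The function name `nonzero_watched` and the regular expression are made up for this illustration and are not part of smartmontools. The sample rows are copied from the /dev/ada3 output posted in this thread. A non-zero raw value is only a hint to look closer, since the meaning of raw values is vendor specific.

```python
import re

# Attribute IDs from the list above (the ones the linked article says to watch).
WATCHED_IDS = {1, 5, 7, 196, 197}

# One row of the "Vendor Specific SMART Attributes with Thresholds" table:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def nonzero_watched(smartctl_text, watched=WATCHED_IDS):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    hits = {}
    for line in smartctl_text.splitlines():
        m = ROW.match(line)
        if m and int(m.group(1)) in watched and int(m.group(3)) != 0:
            hits[m.group(2)] = int(m.group(3))
    return hits


# Rows copied from the /dev/ada3 output in this thread.
sample = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       139
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched(sample))
```

On a live system the input string could come from running `smartctl -a /dev/adaX` through `subprocess.run(..., capture_output=True, text=True)`; that part is left out here so the sketch runs without a disk attached.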


Please check the output below:

Code:
sofx-nas01# smartctl -a /dev/ada0
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     LITEONIT LCS-128M6S
Serial Number:    S0C41178Z1ZSVB068496
Firmware Version: DC72205
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic, zeroed
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS, ATA/ATAPI-7 T13/1532D revision 4a
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May  9 16:51:15 2022 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   10) seconds.
Offline data collection
capabilities:                    (0x15) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17452
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       414
170 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
171 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       0
172 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       0
173 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       3318106
174 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       301
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   000    Old_age   Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   010    Pre-fail  Always       -       896
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0003   100   100   000    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0003   100   100   000    Pre-fail  Always       -       204433
241 Total_LBAs_Written      0x0003   100   100   000    Pre-fail  Always       -       3270934
242 Total_LBAs_Read         0x0003   100   100   000    Pre-fail  Always       -       1729200

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     11332         -
# 2  Short offline       Completed without error       00%     11332         -
# 3  Extended offline    Completed without error       00%     11332         -
# 4  Short offline       Completed without error       00%     11076         -
# 5  Short offline       Completed without error       00%     11076         -
# 6  Short offline       Completed without error       00%     11076         -
# 7  Short offline       Completed without error       00%     11076         -
# 8  Short offline       Completed without error       00%     10820         -
# 9  Short offline       Completed without error       00%     10820         -
#10  Short offline       Completed without error       00%     10820         -
#11  Short offline       Completed without error       00%     10820         -
#12  Short offline       Completed without error       00%     10820         -
#13  Short offline       Completed without error       00%     38676         -
#14  Short offline       Completed without error       00%     10564         -
#15  Short offline       Completed without error       00%     10564         -
#16  Short offline       Completed without error       00%     10564         -
#17  Short offline       Completed without error       00%     10564         -
#18  Short offline       Completed without error       00%     10564         -
#19  Short offline       Completed without error       00%     10308         -
#20  Short offline       Completed without error       00%     10308         -
#21  Short offline       Completed without error       00%     10308         -

Selective Self-tests/Logging not supported

sofx-nas01#
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
"For instance, in the past I used QNAP, and when there was an issue with a disk it simply marked the disk as offline, so you were required to replace it."
ZFS will do the same if you leave it until there are enough errors.

"Please see the pastebin link:"
There's really no reason to use external links... the forum allows file attachments, screenshots and code tags.

If you want forum members to read them, post them here.
 

KpuCko

Dabbler
Joined
Jun 20, 2019
Messages
48
Code:
sofx-nas01# smartctl -a /dev/ada1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     LITEONIT LCS-128M6S
Serial Number:    S0C41178Z1ZSDX039700
Firmware Version: DC72205
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic, zeroed
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS, ATA/ATAPI-7 T13/1532D revision 4a
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May  9 16:52:35 2022 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   10) seconds.
Offline data collection
capabilities:                    (0x15) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       22246
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       280
170 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
171 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       0
172 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       0
173 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       3645735
174 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       138
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   000    Old_age   Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   010    Pre-fail  Always       -       896
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       279642112
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0003   100   100   000    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0003   100   100   000    Pre-fail  Always       -       246600
241 Total_LBAs_Written      0x0003   100   100   000    Pre-fail  Always       -       3945614
242 Total_LBAs_Read         0x0003   100   100   000    Pre-fail  Always       -       2894544

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     58966         -
# 2  Short offline       Completed without error       00%     58966         -
# 3  Extended offline    Completed without error       00%     58966         -
# 4  Short offline       Completed without error       00%     58710         -
# 5  Short offline       Completed without error       00%     58710         -
# 6  Short offline       Completed without error       00%     58710         -
# 7  Short offline       Completed without error       00%     58710         -
# 8  Short offline       Completed without error       00%     58710         -
# 9  Short offline       Completed without error       00%     58454         -
#10  Short offline       Completed without error       00%     58454         -
#11  Short offline       Completed without error       00%     58454         -
#12  Short offline       Completed without error       00%     58454         -
#13  Short offline       Completed without error       00%     45094         -
#14  Short offline       Completed without error       00%     58454         -
#15  Short offline       Completed without error       00%     58198         -
#16  Short offline       Completed without error       00%     58198         -
#17  Short offline       Completed without error       00%     58198         -
#18  Short offline       Completed without error       00%     58198         -
#19  Short offline       Completed without error       00%     58198         -
#20  Short offline       Completed without error       00%     58198         -
#21  Short offline       Completed without error       00%     57942         -

Selective Self-tests/Logging not supported

sofx-nas01#
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
"I know that ZFS will try to self-heal the filesystem, but I'm not sure how to handle this situation."
What part of it are you unsure about? You have a disk that's consistently failing its internal self-tests. Replace it. The manual tells you how.
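
To illustrate what failing self-tests look like in the data, here is a minimal Python sketch that scans the self-test log section of `smartctl -a` output and lists every entry whose status is not `Completed without error`. The function name `failed_self_tests` and the regular expression are assumptions made for this example, not part of smartmontools. The sample rows are copied from the /dev/ada3 output posted in this thread.

```python
import re

# One row of the "SMART Self-test log" table, for example:
# "# 3  Extended offline    Completed: read failure       10%     36770         3896097744"
# Columns are separated by runs of two or more spaces; the last three are
# remaining percent, lifetime hours and LBA of first error ("-" if none).
LOG_ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def failed_self_tests(smartctl_text):
    """Return (num, test_type, status, lifetime_hours, lba) for every non-clean log entry."""
    failures = []
    for line in smartctl_text.splitlines():
        m = LOG_ROW.match(line)
        if not m:
            continue
        num, kind, status, _remaining, hours, lba = m.groups()
        if status != "Completed without error":
            failures.append((int(num), kind, status, int(hours), lba))
    return failures


# Rows copied from the /dev/ada3 self-test log in this thread.
sample = """\
# 1  Short offline       Completed without error       00%     36968         -
# 2  Short offline       Completed without error       00%     36800         -
# 3  Extended offline    Completed: read failure       10%     36770         3896097744
# 4  Short offline       Completed without error       00%     36632         -
# 8  Extended offline    Completed: read failure       10%     36033         3896097744
#10  Short offline       Completed without error       00%     35794         -
"""

for entry in failed_self_tests(sample):
    print(entry)
```

In this sample both extended tests stop at the same LBA, while the short tests in between pass, which is why looking only at the most recent short test can hide the problem.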
 

KpuCko

Dabbler
Joined
Jun 20, 2019
Messages
48
Code:
sofx-nas01# smartctl -a /dev/ada2
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68EUZN0
Serial Number:    WD-WCC4M1SARTHY
LU WWN Device Id: 5 0014ee 26438b400
Firmware Version: 82.00A82
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May  9 16:52:58 2022 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (26580) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 268) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   176   175   021    Pre-fail  Always       -       4175
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       111
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   050   050   000    Old_age   Always       -       36976
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       99
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       58
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2303
194 Temperature_Celsius     0x0022   111   099   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     36972         -
# 2  Short offline       Completed without error       00%     36804         -
# 3  Extended offline    Completed without error       00%     36773         -
# 4  Short offline       Completed without error       00%     36636         -
# 5  Short offline       Completed without error       00%     36468         -
# 6  Short offline       Completed without error       00%     36301         -
# 7  Short offline       Completed without error       00%     36133         -
# 8  Extended offline    Completed without error       00%     36023         -
# 9  Short offline       Completed without error       00%     35965         -
#10  Short offline       Completed without error       00%     35798         -
#11  Short offline       Completed without error       00%     35630         -
#12  Short offline       Completed without error       00%     35463         -
#13  Short offline       Completed without error       00%     35295         -
#14  Short offline       Completed without error       00%     35127         -
#15  Short offline       Completed without error       00%     34959         -
#16  Short offline       Completed without error       00%     34791         -
#17  Short offline       Completed without error       00%     34624         -
#18  Short offline       Completed without error       00%     34456         -
#19  Short offline       Completed without error       00%     34288         -
#20  Short offline       Completed without error       00%     34120         -
#21  Short offline       Completed without error       00%     33959         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

sofx-nas01#
 

KpuCko

Dabbler
Joined
Jun 20, 2019
Messages
48
Code:
sofx-nas01# smartctl -a /dev/ada3
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68EUZN0
Serial Number:    WD-WCC4M2KLNTJL
LU WWN Device Id: 5 0014ee 2b98db6ee
Firmware Version: 82.00A82
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May  9 16:53:19 2022 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (28020) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 283) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       139
  3 Spin_Up_Time            0x0027   177   176   021    Pre-fail  Always       -       4116
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       109
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   050   050   000    Old_age   Always       -       36972
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       97
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       58
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2279
194 Temperature_Celsius     0x0022   109   098   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     36968         -
# 2  Short offline       Completed without error       00%     36800         -
# 3  Extended offline    Completed: read failure       10%     36770         3896097744
# 4  Short offline       Completed without error       00%     36632         -
# 5  Short offline       Completed without error       00%     36464         -
# 6  Short offline       Completed without error       00%     36297         -
# 7  Short offline       Completed without error       00%     36129         -
# 8  Extended offline    Completed: read failure       10%     36033         3896097744
# 9  Short offline       Completed without error       00%     35961         -
#10  Short offline       Completed without error       00%     35794         -
#11  Short offline       Completed without error       00%     35626         -
#12  Short offline       Completed without error       00%     35458         -
#13  Short offline       Completed without error       00%     35291         -
#14  Short offline       Completed without error       00%     35123         -
#15  Short offline       Completed without error       00%     34955         -
#16  Short offline       Completed without error       00%     34787         -
#17  Short offline       Completed without error       00%     34620         -
#18  Short offline       Completed without error       00%     34452         -
#19  Short offline       Completed without error       00%     34284         -
#20  Short offline       Completed without error       00%     34116         -
#21  Short offline       Completed without error       00%     33955         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

sofx-nas01#
 

KpuCko

Dabbler
Joined
Jun 20, 2019
Messages
48
Code:
sofx-nas01# smartctl -a /dev/ada4

smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)

Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===

Model Family:     Western Digital Red

Device Model:     WDC WD20EFRX-68EUZN0

Serial Number:    WD-WCC4M6ZX4Z5S

LU WWN Device Id: 5 0014ee 2b98e5970

Firmware Version: 82.00A82

User Capacity:    2,000,398,934,016 bytes [2.00 TB]

Sector Sizes:     512 bytes logical, 4096 bytes physical

Rotation Rate:    5400 rpm

Device is:        In smartctl database [for details use: -P show]

ATA Version is:   ACS-2 (minor revision not indicated)

SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is:    Mon May  9 16:53:40 2022 EEST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (27540) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 278) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   177   175   021    Pre-fail  Always       -       4141
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       108
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   050   050   000    Old_age   Always       -       36970
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       96
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       58
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2167
194 Temperature_Celsius     0x0022   110   100   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     36966         -
# 2  Short offline       Completed without error       00%     36798         -
# 3  Extended offline    Completed without error       00%     36768         -
# 4  Short offline       Completed without error       00%     36630         -
# 5  Short offline       Completed without error       00%     36462         -
# 6  Short offline       Completed without error       00%     36295         -
# 7  Short offline       Completed without error       00%     36127         -
# 8  Extended offline    Completed without error       00%     36038         -
# 9  Short offline       Completed without error       00%     35959         -
#10  Short offline       Completed without error       00%     35792         -
#11  Short offline       Completed without error       00%     35624         -
#12  Short offline       Completed without error       00%     35456         -
#13  Short offline       Completed without error       00%     35289         -
#14  Short offline       Completed without error       00%     35121         -
#15  Short offline       Completed without error       00%     34953         -
#16  Short offline       Completed without error       00%     34785         -
#17  Short offline       Completed without error       00%     34617         -
#18  Short offline       Completed without error       00%     34450         -
#19  Short offline       Completed without error       00%     34282         -
#20  Short offline       Completed without error       00%     34114         -
#21  Short offline       Completed without error       00%     33953         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

sofx-nas01#
 

KpuCko

Dabbler
Joined
Jun 20, 2019
Messages
48
Code:
sofx-nas01# smartctl -a /dev/ada5
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68EUZN0
Serial Number:    WD-WCC4M5ZEDLVF
LU WWN Device Id: 5 0014ee 20ee37329
Firmware Version: 82.00A82
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May  9 16:54:07 2022 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (25980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 263) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       13
  3 Spin_Up_Time            0x0027   177   176   021    Pre-fail  Always       -       4150
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       107
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   050   050   000    Old_age   Always       -       36968
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       95
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       58
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2204
194 Temperature_Celsius     0x0022   112   102   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       3

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     36963         -
# 2  Short offline       Completed without error       00%     36795         -
# 3  Extended offline    Completed: read failure       90%     36759         22355920
# 4  Short offline       Completed without error       00%     36628         -
# 5  Short offline       Completed without error       00%     36460         -
# 6  Short offline       Completed without error       00%     36292         -
# 7  Short offline       Completed without error       00%     36124         -
# 8  Extended offline    Completed: read failure       90%     36048         22355848
# 9  Short offline       Completed without error       00%     35956         -
#10  Short offline       Completed without error       00%     35789         -
#11  Short offline       Completed without error       00%     35622         -
#12  Short offline       Completed without error       00%     35454         -
#13  Short offline       Completed without error       00%     35286         -
#14  Short offline       Completed without error       00%     35118         -
#15  Short offline       Completed without error       00%     34950         -
#16  Short offline       Completed without error       00%     34782         -
#17  Short offline       Completed without error       00%     34615         -
#18  Short offline       Completed without error       00%     34447         -
#19  Short offline       Completed without error       00%     34279         -
#20  Short offline       Completed without error       00%     34111         -
#21  Short offline       Completed without error       00%     33950         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

sofx-nas01#
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
So both ada3 and ada5 are showing read errors and multi-zone errors, and both are consistently failing long SMART self-tests. What part of that is confusing?
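
For anyone who finds the self-test log hard to scan by eye, here is a minimal sketch (not an official tool) that picks out the rows that did not complete cleanly. The sample rows are copied from the ada5 log posted above; the function name `failed_self_tests` and the column-splitting regex are my own assumptions about the layout, so treat it as illustrative only.

```python
import re

# Rows copied from the ada5 self-test log posted above (a subset, for brevity).
SELFTEST_LOG = """\
# 1  Short offline       Completed without error       00%     36963         -
# 2  Short offline       Completed without error       00%     36795         -
# 3  Extended offline    Completed: read failure       90%     36759         22355920
# 8  Extended offline    Completed: read failure       90%     36048         22355848
#10  Short offline       Completed without error       00%     35789         -
"""

# Assumption: columns are separated by runs of two or more spaces.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def failed_self_tests(log_text):
    """Return one dict per self-test row whose status is not a clean pass."""
    failures = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or unrelated line
        if match.group("status") == "Completed without error":
            continue
        lba = match.group("lba")
        failures.append({
            "num": int(match.group("num")),
            "test": match.group("test"),
            "status": match.group("status"),
            "remaining_pct": int(match.group("remaining")),
            "hours": int(match.group("hours")),
            # "-" in the last column means no failing LBA was recorded.
            "first_error_lba": int(lba) if lba.isdigit() else None,
        })
    return failures


if __name__ == "__main__":
    for failure in failed_self_tests(SELFTEST_LOG):
        print(failure)
```

On the ada5 rows this flags both extended tests. Each stopped with 90% of the test remaining, at LBAs 22355920 and 22355848, which sit close together on the disk.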
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703

In any case, as already pointed out, you indeed have the errors on ada3 and ada5 that you were looking for...
I have none of these:

Code:
1 Raw_Read_Error_Rate
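
A minimal sketch of checking the attribute table for non-zero raw values, assuming the ten-column layout shown in the smartctl output above. The rows are copied from the ada5 table; the `WATCHED` set is the original poster's five attributes plus Multi_Zone_Error_Rate (ID 200), and the function name is hypothetical. How a raw value should be interpreted is vendor-specific, so a non-zero number here is a prompt to look closer rather than a verdict.

```python
# Rows copied from the ada5 attribute table posted above (watched IDs only).
ATTRIBUTE_TABLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       13
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       3
"""

# Assumption: the attribute IDs worth watching for this thread.
WATCHED = {1, 5, 7, 196, 197, 200}


def nonzero_watched(table_text, watched=WATCHED):
    """Map attribute name -> raw value for watched attributes with raw != 0."""
    found = {}
    for line in table_text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id not in watched:
            continue
        try:
            raw_value = int(raw)
        except ValueError:
            continue  # raw column is not a plain integer; skip it
        if raw_value != 0:
            found[name] = raw_value
    return found


if __name__ == "__main__":
    print(nonzero_watched(ATTRIBUTE_TABLE))
```

For ada5 this reports Raw_Read_Error_Rate and Multi_Zone_Error_Rate, the two attributes mentioned in the replies.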
 

KpuCko

Dabbler
Joined
Jun 20, 2019
Messages
48
Great, understood. So at this point, can I still wait for them to fail, or is that too risky, given that both of them show the same errors?
My ZFS pool consists of two disks as a mirror and two as spares; if both of them fail, I will probably end up with a broken pool.

So the conclusion is to replace ada3 and ada5. Did I get that right?
 
Joined
Oct 22, 2019
Messages
3,641
So at this point, can I still wait for them to fail, or is that too risky, given that both of them show the same errors?
You really like to live on the edge? :wink:
 
Joined
Oct 22, 2019
Messages
3,641
At this point, I'd be careful and make sure you don't try to replace both drives at the same time, but rather one-at-a-time.

What is your pool's layout?

zpool status -v
 

KpuCko

Dabbler
Joined
Jun 20, 2019
Messages
48
Code:
sofx-nas01# zpool status -v
  pool: Data01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 1M in 04:05:29 with 0 errors on Sun Apr 24 04:05:31 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data01                                          ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/e2d996f4-15e8-11eb-b46a-3cecef205174  ONLINE       0     0     0
            gptid/829ad914-1614-11eb-8fd5-3cecef205174  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/42e7b63d-1628-11eb-a131-3cecef205174  ONLINE       0     0     0
            gptid/52b86158-163b-11eb-a101-3cecef205174  ONLINE       2     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0B in 00:00:55 with 0 errors on Tue May 10 03:45:55 2022
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada0p2    ONLINE       0     0     0
            ada1p2    ONLINE       0     0     0

errors: No known data errors
sofx-nas01#
 
Joined
Oct 22, 2019
Messages
3,641
Now you can use glabel status to match the GPTID to the drive (partition, really) to know which drive to replace.

So run,
glabel status
to find out which drive's partition is 52b86158-163b-11eb-a101-3cecef205174 (since this is the one with read errors, as reported by ZFS.)

For example, it might be ada5p2 (ada5).

Then you would safely and carefully replace only the drive ada5 with a good drive (as per the replacement guide).

How to figure out which drive this is, physically? Either you labeled them with sticky notes, or you can match ada5 to the drive's serial number, if you are able to see the manufacturer's printed info on the drive's sticker label.

I'm using ada5 for the sake of example. The above GPTID might in fact be for ada3.
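
To make the GPTID lookup concrete, here is a minimal sketch that turns `glabel status` output into a GPTID-to-partition map. The GPTIDs are the ones from the zpool status output earlier in the thread, but the device assignments in the sample text are invented for illustration (the real `glabel status` output was never posted), and the function names are hypothetical.

```python
import re

# HYPOTHETICAL sample: real GPTIDs from the thread, made-up device assignments.
GLABEL_STATUS = """\
                                      Name  Status  Components
gptid/e2d996f4-15e8-11eb-b46a-3cecef205174     N/A  ada2p2
gptid/829ad914-1614-11eb-8fd5-3cecef205174     N/A  ada3p2
gptid/42e7b63d-1628-11eb-a131-3cecef205174     N/A  ada4p2
gptid/52b86158-163b-11eb-a101-3cecef205174     N/A  ada5p2
"""


def gptid_to_partition(glabel_text):
    """Map each gptid/... label to the partition listed in the last column."""
    mapping = {}
    for line in glabel_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].startswith("gptid/"):
            mapping[fields[0]] = fields[-1]
    return mapping


def disk_of(partition):
    """Strip a trailing pN partition suffix, e.g. ada5p2 -> ada5."""
    return re.sub(r"p\d+$", "", partition)


if __name__ == "__main__":
    wanted = "gptid/52b86158-163b-11eb-a101-3cecef205174"
    partition = gptid_to_partition(GLABEL_STATUS)[wanted]
    print(wanted, "->", partition, "->", disk_of(partition))
```

With the made-up sample, the GPTID that ZFS reported read errors on resolves to ada5p2 and so to disk ada5. On the real system the answer comes from the actual `glabel status` output.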

---

From what you've shared, it appears that ada3 and ada5 both yield read errors with extended SMART selftests.

You'll need to use glabel status to find their respective GPTID, which will let you see where they reside in your Data01 pool layout. Hopefully they are on different mirror vdevs (not on the same mirror vdev together). Either way, you want to replace them one at a time; a successful resilver followed by another successful resilver. Not at the same time.
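
The "are they on the same mirror?" check can be sketched as follows. The config text is copied from the zpool status output posted earlier; the function names are hypothetical, and the parser assumes the error counters are plain integers (zpool can abbreviate large counts, which this sketch does not handle). Which GPTIDs actually belong to ada3 and ada5 is not shown in the thread, so the suspect set in the usage part is only an example.

```python
# Config section copied from the zpool status output posted in this thread.
ZPOOL_CONFIG = """\
        NAME                                            STATE     READ WRITE CKSUM
        Data01                                          ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/e2d996f4-15e8-11eb-b46a-3cecef205174  ONLINE       0     0     0
            gptid/829ad914-1614-11eb-8fd5-3cecef205174  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/42e7b63d-1628-11eb-a131-3cecef205174  ONLINE       0     0     0
            gptid/52b86158-163b-11eb-a101-3cecef205174  ONLINE       2     0     0
"""


def parse_members(config_text):
    """Map each gptid member to (vdev, read, write, cksum)."""
    members = {}
    current_vdev = None
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        name = fields[0]
        if name.startswith(("mirror-", "raidz")):
            current_vdev = name  # following gptid rows belong to this vdev
        elif name.startswith("gptid/") and current_vdev is not None:
            read, write, cksum = (int(n) for n in fields[2:5])
            members[name] = (current_vdev, read, write, cksum)
    return members


def members_with_errors(members):
    """Return members whose READ, WRITE, or CKSUM counter is non-zero."""
    return {
        name: counters[1:]
        for name, counters in members.items()
        if any(counters[1:])
    }


def shared_vdevs(members, suspects):
    """Return vdevs that contain more than one suspect member."""
    by_vdev = {}
    for name in suspects:
        by_vdev.setdefault(members[name][0], []).append(name)
    return {vdev: sorted(names) for vdev, names in by_vdev.items() if len(names) > 1}


if __name__ == "__main__":
    members = parse_members(ZPOOL_CONFIG)
    print(members_with_errors(members))
    # EXAMPLE suspect set only; substitute the GPTIDs that glabel status reports.
    suspects = {
        "gptid/829ad914-1614-11eb-8fd5-3cecef205174",
        "gptid/52b86158-163b-11eb-a101-3cecef205174",
    }
    print(shared_vdevs(members, suspects))
```

An empty result from `shared_vdevs` means the suspect disks sit in different mirrors, which is the better case. A non-empty result means one mirror holds both suspects, so replacing one at a time matters even more.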

---

The above presumes that the extended SMART self-tests are alerting you to the fact that ada3 and ada5 are showing early signs of failure, but ZFS's redundancy and ability to self-heal are keeping your data safe (for now).
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
For example, it might be ada5p2 (ada5).
...or you could bypass this and check the pool status in the GUI, where it would tell you that it was ada5 (or whatever) with the read errors.
or you can match ada5 to the drive's serial number
...which you can also do in the GUI, at the Storage -> Disks page.
You'll need to use glabel status to find their respective GPTID
There's no need to do anything at the CLI for this.
Either way, you want to replace them one at a time; a successful resilver followed by another successful resilver. Not at the same time.
Agreed, with the only exception being if you have a way to have both replacement drives online at the same time as the drives to be replaced (i.e., you have at least two spare SATA ports, a way to power the replacements, and a--perhaps temporary--place to put them). In that case, you can safely replace both simultaneously, as there's no loss of redundancy while resilvering.
 
Joined
Oct 22, 2019
Messages
3,641
...or you could bypass this and check the pool status in the GUI, where it would tell you that it was ada5 (or whatever) with the read errors.
Good catch! I thought that the GUI would simply output the same as is seen by invoking zpool status.

...which you can also do in the GUI, at the Storage -> Disks page.
I was referring more to the alternative method: if you didn't create your own sticky labels, you can still peek at the drive manufacturer's info sticker, which has the serial number printed on it. (Otherwise, those who set everything up themselves can streamline the process by slapping their own labels on the side of each drive, in plain view, with the name, the serial number, and the vdev/pool it belongs to.)
 

KpuCko

Dabbler
Joined
Jun 20, 2019
Messages
48
Yeah, guys, thanks for the detailed explanation.
I think I have already identified the bad disks, so I'm going to replace them one by one.

Fingers crossed that the shit won't hit the fan during the resilvering process ;-)
 