Time to replace HDD?

bbolgar

Cadet
Joined
Dec 3, 2022
Messages
2
Hi all,

TrueNAS reports from time to time that /dev/ada1 is unreachable, due to Current_Pending_Sector and Offline_Uncorrectable errors on it. All checks come back successful and the pool status is ONLINE. The question is: do I need to replace this disk ASAP, or should I wait until it really fails?



Checks:

Code:
smartctl -a /dev/ada1
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda Green (AF)
Device Model:     ST2000DL003-9VT166
Serial Number:    5YD6FA7H
LU WWN Device Id: 5 000c50 04628a645
Firmware Version: CC3C
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Dec  3 17:45:24 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  623) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 338) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30b7) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.


SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   099   006    Pre-fail  Always       -       18977536
  3 Spin_Up_Time            0x0003   093   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       674
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   030    Pre-fail  Always       -       688111789
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       17444
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       299
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   080   080   000    Old_age   Always       -       20
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       6
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   046   045    Old_age   Always       -       38 (Min/Max 36/40)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       46
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       979
194 Temperature_Celsius     0x0022   038   054   000    Old_age   Always       -       38 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   031   003   000    Old_age   Always       -       18977536
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       40
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       40
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       11686 (136 224 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3373103005
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2053614134


SMART Error Log Version: 1
No Errors Logged


SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17444         -
# 2  Short offline       Completed without error       00%     17421         -
# 3  Extended offline    Completed without error       00%     17236         -


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



Code:
zpool status -v
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:09 with 0 errors on Tue Nov 29 03:45:09 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors

  pool: default
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        default                                         ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/6fcaa62c-5f86-11ed-9a9a-bcaec58adf3b  ONLINE       0     0     0
            gptid/6fdbd184-5f86-11ed-9a9a-bcaec58adf3b  ONLINE       0     0     0



Thanks in advance for the advice!
 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
Hello,

Well, I would definitely test this drive thoroughly:
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       40
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       40
because this is concerning...

The long SMART test doesn't seem to have reported any error, but I would run badblocks (destructive) on this drive to make sure of it.
If it turns out good, I would still look into replacing the drive (and maybe use it for backups) or keep a very close eye on it. Of course, this also depends on the data on it and how much you value that data... :smile:

187 Reported_Uncorrect 0x0032 080 080 000 Old_age Always - 20
I don't know exactly what this one means, but it doesn't sound good. It doesn't seem to be a critical attribute, but it could indicate problems. In combination with #197 and #198, I would say it doesn't look good.

I would make sure all your backups are good... and then test the drive thoroughly.
Also: for the future, it would be good to schedule regular SMART tests (long and short), because this drive only saw its first ones recently.
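
For keeping an eye on those counters between tests, here is a minimal sketch, not something from the posted output. It assumes the attribute table layout shown above, where RAW_VALUE is the 10th whitespace-separated column. `smart_raw` is a hypothetical helper name, and the sample rows are copied from the smartctl output in the first post.

```shell
#!/bin/sh
# smart_raw ID: read SMART attribute rows on stdin and print the
# RAW_VALUE column (10th field) of the row whose ID# equals ID.
# Assumption: rows look like the table in the first post.
smart_raw() {
    awk -v id="$1" 'NF >= 10 && $1 == id { print $10 }'
}

# Demo with three rows copied from the posted output.
sample='187 Reported_Uncorrect      0x0032   080   080   000    Old_age   Always       -       20
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       40
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       40'

for id in 187 197 198; do
    printf '%s=%s\n' "$id" "$(printf '%s\n' "$sample" | smart_raw "$id")"
done
```

On a live system the same helper could be fed from `smartctl -A /dev/ada1 | smart_raw 197`. If the number keeps growing from one run to the next, that would support replacing the drive sooner rather than later.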
 

bbolgar

Cadet
Joined
Dec 3, 2022
Messages
2
Thanks for the quick response on this. I enabled long and short tests today when I saw this error. Also, the NAS was set up only a few weeks ago, which might explain why there are only recent tests ;)
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
Also, the NAS was set up only a few weeks ago, which might explain why there are only recent tests ;)
This means the drives are used ones, right?
The details of the one you posted show 17k+ power-on hours, so that's about 2 years.
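
The "about 2 years" figure comes from the Power_On_Hours raw value (17444) divided by the 8,760 hours in a 365-day year. A quick sketch of that arithmetic follows; `poh_to_years` is a hypothetical helper name, not a smartctl feature, and it assumes the raw value is a plain hour count as in the posted output.

```shell
#!/bin/sh
# poh_to_years HOURS: convert a Power_On_Hours raw value to years of
# powered-on time, assuming 8760 hours (365 days) per year.
poh_to_years() {
    awk -v h="$1" 'BEGIN { printf "%.1f\n", h / 8760 }'
}

poh_to_years 17444   # raw value from the posted output; prints 2.0
```

This counts only time the drive was powered, so its calendar age may well be higher.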

Anyway, a good practice is to burn in drives before using them (it doesn't matter whether they are new or used).
 

Matt_G

Explorer
Joined
Jan 24, 2016
Messages
65
Personally, I would replace that drive right now.
40 current pending sectors is totally unacceptable to me.
I value my data too much to trust a drive in that condition any longer.
 