CRITICAL Device: /dev/ada6, Self-Test Log error count increased from 0 to 1.

Nakedape

Dabbler
Joined
Mar 2, 2020
Messages
16
I received the alert in the post title and, after skimming through the Hard Drive Troubleshooting Guide (Basic Common Failures) [1], I ran

Code:
smartctl -t long /dev/ada6

and when that completed I did

Code:
root@freenas[~]# smartctl -a /dev/ada6

smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68AX9N0
Serial Number:    WD-WMC1T229407
LU WWN Device Id: 5 0014ee 0ae22a134
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Aug 10 22:35:45 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 113) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (40080) seconds.
Offline data collection
capabilities:                    (0x7b)
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 402) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16

       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       77
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0032   195   195   000    Old_age   Always       -       15694
194 Temperature_Celsius     0x0022   114   097   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%     54310         1565557296
# 2  Short offline       Completed: read failure       60%     54281         1565557296
# 3  Short offline       Completed: read failure       60%     54257         1565557296
# 4  Short offline       Completed: read failure       60%     54229         1565557296
# 5  Short offline       Completed without error       00%     54205         -
# 6  Short offline       Completed without error       00%     54181         -
# 7  Short offline       Completed without error       00%     54157         -
# 8  Short offline       Completed without error       00%     54133         -
# 9  Short offline       Completed without error       00%     54109         -
#10  Short offline       Completed without error       00%     54085         -
#11  Short offline       Completed without error       00%     54061         -
#12  Short offline       Completed without error       00%     54037         -
#13  Short offline       Completed without error       00%     54013         -
#14  Short offline       Completed without error       00%     53989         -
#15  Short offline       Completed without error       00%     53969         -
#16  Short offline       Completed without error       00%     53945         -
#17  Short offline       Completed without error       00%     53921         -
#18  Short offline       Completed without error       00%     53889         -
#19  Short offline       Completed without error       00%     53865         -
#20  Extended offline    Completed without error       00%     53845         -
#21  Short offline       Completed without error       00%     53837         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing

Selective self-test flags (0x0):
 After scanning selected spans, do NOT re-scan remainder of disk.
If selective self-test is pending on power-up, resume after 0 minute delay.


If I'm reading [1] correctly, non-zero values for SMART attributes 5, 197, and 198 indicate a physical hard drive failure, while a non-zero value for attribute 199 points towards a communication error. Attribute 5 (Reallocated_Sector_Ct) is nowhere to be seen in the output, but at least the others are zero. That leaves me wondering whether I really have a problem and, if so, what the best course of action is. Thanks.

PS I got the disk from a-guy-on-the-internet, so an RMA is not an option.

[1] https://www.ixsystems.com/community...leshooting-guide-basic-common-failures.41026/
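
The attribute check described above (non-zero raw values for 5, 197 and 198 suggesting media trouble, 199 suggesting a link or cabling problem) could be sketched roughly as follows. This is only an illustration: `flag_attributes` and the `WATCHED` mapping are made-up names, not part of smartmontools, and the sample rows are copied from the ada6 output above. It looks at the attribute table only, so a failure recorded in the self-test log, as on this disk, would not show up here.

```python
# Hypothetical helper, not part of smartmontools: scan the attribute table
# from `smartctl -a` output and report watched attributes with non-zero raw values.
WATCHED = {
    5: "media (reallocated sectors)",
    197: "media (pending sectors)",
    198: "media (offline uncorrectable)",
    199: "link/cabling (CRC errors)",
}

def flag_attributes(smartctl_text):
    """Return (seen, flagged): watched IDs present, and those with raw value != 0."""
    seen, flagged = set(), {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED:
            continue
        seen.add(attr_id)
        raw = fields[9]
        if raw.isdigit() and int(raw) != 0:
            flagged[attr_id] = int(raw)
    return seen, flagged

# Rows copied from the ada6 output in this post.
sample = """\
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

seen, flagged = flag_attributes(sample)
missing = sorted(set(WATCHED) - seen)
print("flagged:", flagged)
print("not present in output:", missing)
```

Reporting the watched attributes that are absent from the text is deliberate: a missing attribute 5 row is not the same thing as a zero.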
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The disk clearly has an error in LBA 1565557296 and that's stopping the long test from checking further.

I think the disk is showing signs of impending death... you still have time to react with no need to panic.

Replace the disk when you can.

If you want to spend/waste your time, there is an article around somewhere describing how to "fix" a bad block by zeroing it out, but I'm unconvinced by the idea, and in my experience, where there's one bad block, others are coming soon.
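
To put the failing LBA into more familiar units, here is a small sketch that uses only figures from the smartctl output earlier in the thread (512-byte logical sectors, 4096-byte physical sectors, 3,000,592,982,016 bytes of capacity). The variable names are made up for illustration.

```python
LOGICAL_SECTOR = 512          # bytes, from "Sector Sizes"
PHYSICAL_SECTOR = 4096        # bytes, from "Sector Sizes"
CAPACITY = 3_000_592_982_016  # bytes, from "User Capacity"
FAILED_LBA = 1_565_557_296    # from the self-test log (ada6)

# Byte offset of the failing logical sector from the start of the disk.
byte_offset = FAILED_LBA * LOGICAL_SECTOR

# Which 4K physical sector holds it, and where inside that sector it sits.
lbas_per_physical = PHYSICAL_SECTOR // LOGICAL_SECTOR
physical_index, remainder = divmod(FAILED_LBA, lbas_per_physical)

# Position within the LBA range as a fraction of total capacity.
position = byte_offset / CAPACITY

print(f"byte offset: {byte_offset}")
print(f"physical sector: {physical_index} (offset {remainder} within it)")
print(f"position in LBA range: {position:.1%}")
```

The LBA sits on a 4K physical-sector boundary, a little over a quarter of the way through the LBA range.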
 

Nakedape

Dabbler
Joined
Mar 2, 2020
Messages
16
Thanks. I'll order a new disk. That will incidentally also completely gobble up the few dollars I saved by buying second-hand disks from some random internet person. That brilliant idea did not age well.

In fact, all I've done with this disk (which is part of a six-disk RAIDZ2) since I got it is run a burn-in process lasting several days, the whole point of which, or so I thought, was to ensure that everything was ready to go. That was several weeks ago, and the only things of note I've done since, beyond setting up a test pool and some shares, are rebuilding the server and physically moving the box. Maybe I should go back and check the burn-in logs for this disk to see if I missed anything.

In any case I'll keep using this disk for now and hope for something more exciting than the underwhelming read-error at LBA 1565557296. Hopefully I can learn something before tossing it out.
 

Nakedape

Dabbler
Joined
Mar 2, 2020
Messages
16
I got the same error message again, but this time on ada5, which is a WD30EFRX disk of the same type and heritage as the ada6 disk from the OP. Both were bought second-hand and subsequently put through a successful burn-in regime as described elsewhere on this forum (i.e. a combination of SMART and badblocks tests). They have both been powered on for a couple of months since the burn-in, but there has been no read or write activity at all.

Does @sretalla's advice to Keep Calm and Replace Disk hold for this case as well? If so, why are my disks failing like this? I realize they are not exactly in pristine condition after 50k+ hours, but was I foolish to believe that I could get some mileage out of these disks? Is it just a coincidence that the failing LBAs are almost in the same spot in both cases (ada6: 1565557296 vs ada5: 1565554128)?
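
For a sense of how close "almost in the same spot" is, a quick sketch with the two LBAs and the 512-byte logical sector size reported by smartctl. It only quantifies the gap; it says nothing about the cause.

```python
LOGICAL_SECTOR = 512        # bytes, from "Sector Sizes"
ADA6_LBA = 1_565_557_296    # LBA_of_first_error on ada6
ADA5_LBA = 1_565_554_128    # LBA_of_first_error on ada5

gap_sectors = abs(ADA6_LBA - ADA5_LBA)
gap_bytes = gap_sectors * LOGICAL_SECTOR
gap_mib = gap_bytes / 2**20

print(f"gap: {gap_sectors} logical sectors = {gap_bytes} bytes = {gap_mib:.2f} MiB")
```

That is a gap of roughly 1.5 MiB on a 3 TB disk.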

Code:
root@freenas[~/scripts/]# smartctl -t long /dev/ada5
[...]

root@freenas[~/scripts/]# smartctl -a /dev/ada5
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68AX9N0
Serial Number:   
LU WWN Device Id: 5 0014ee 00375c086
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 16 07:03:59 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 241) Self-test routine in progress...
                                        10% of test remaining.
Total time to complete Offline
data collection:                (40080) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.

recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   173   173   021    Pre-fail  Always       -       6325
  4 Start_Stop_Count        0x0032   085   085   000    Old_age   Always       -       15719
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   023   023   000    Old_age   Always       -       56388
       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       77
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0032   195   195   000    Old_age   Always       -       15694
194 Temperature_Celsius     0x0022   113   097   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       60%     56381         1565554128
# 2  Short offline       Completed: read failure       50%     56358         1565554128
# 3  Short offline       Completed: read failure       60%     56334         1565554128
# 4  Short offline       Completed: read failure       50%     56310         1565554128
# 5  Short offline       Completed: read failure       60%     56286         1565554128
# 6  Short offline       Completed: read failure       60%     56262         1565554128
# 7  Short offline       Completed: read failure       60%     56238         1565554128
# 8  Short offline       Completed: read failure       60%     56214         1565554128
# 9  Short offline       Completed: read failure       60%     56190         1565554128
#10  Short offline       Completed: read failure       60%     56166         1565554128
#11  Short offline       Completed: read failure       60%     56142         1565554128
#12  Short offline       Completed: read failure       60%     56118         1565554128
#13  Short offline       Completed: read failure       60%     56094         1565554128
#14  Short offline       Completed without error       00%     56070         -
#15  Short offline       Completed without error       00%     56046         -
#16  Short offline       Completed without error       00%     56022         -
#17  Short offline       Completed without error       00%     55998         -
#18  Short offline       Completed without error       00%     55974         -
#19  Short offline       Completed without error       00%     55950         -
#20  Short offline       Completed without error       00%     55926         -
#21  Short offline       Completed without error       00%     55902         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
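
The self-test log above can also be read programmatically to see when the failures began. The following is a minimal sketch, assuming the row layout shown in this output; `parse_selftest_log` and the regular expression are made up for this example and are not a smartmontools API. The sample rows and the Power_On_Hours value (56388) are copied from the ada5 output.

```python
import re

# Matches rows such as
# "# 1  Short offline       Completed: read failure       60%     56381         1565554128"
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")

def parse_selftest_log(text):
    """Return one dict per self-test log row found in the text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        num, test, status, _remaining, hours, lba = m.groups()
        rows.append({
            "num": int(num),
            "test": test,
            "status": status,
            "hours": int(hours),
            "lba": None if lba == "-" else int(lba),
        })
    return rows

sample = """\
#12  Short offline       Completed: read failure       60%     56118         1565554128
#13  Short offline       Completed: read failure       60%     56094         1565554128
#14  Short offline       Completed without error       00%     56070         -
#15  Short offline       Completed without error       00%     56046         -
"""

rows = parse_selftest_log(sample)
failures = [r for r in rows if r["lba"] is not None]
passes = [r for r in rows if r["lba"] is None]

first_failure = min(r["hours"] for r in failures)
last_pass = max(r["hours"] for r in passes)
power_on_hours = 56388  # attribute 9 in the ada5 output

print(f"last clean test at {last_pass} h, first failure at {first_failure} h")
print(f"failing for {power_on_hours - first_failure} h so far")
```

On these figures, the short tests had been failing at the same LBA for 294 power-on hours, a little over 12 days, by the time the output was captured.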
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
This happens sometimes, especially with used drives, as you have no idea what they have been put through. Considering this is more than 5 years of actual power-on hours, it's been around a while. I just populated my r520 with 6TB SAS drives and had two of them fail soon after due to age. I've replaced them and have two more on standby (burning in). Once those two new drives are done, I'll put them in and keep my used drives as temporary spares until replacements are ready.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
50K hours is getting close to 6 years... I would expect most drives to be well into their failure window at that point.

It's clear that based on workload over the life of the drive, some drives will live on happily and maybe make it as far as 10 years, but I wouldn't stake my important data on any disk older than 5 years without plentiful spares, a lightning-fast operational procedure for failed drive replacement (watching SMART to get ahead of that) and at least RAIDZ2 to give yourself a decent chance of finishing a resilver before more disks in the VDEV meet their demise.
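
As a quick check on the ages being discussed, here is a small sketch converting the power-on hours quoted in this thread into years. It assumes 8,760 hours per year (no leap days), so it gives years of powered-on time, not calendar age; `hours_to_years` is just an illustrative helper.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap days

def hours_to_years(hours):
    """Convert power-on hours to years of continuous operation."""
    return hours / HOURS_PER_YEAR

figures = [
    ("rule of thumb", 50_000),
    ("ada6 (latest self-test)", 54_310),
    ("ada5 (Power_On_Hours)", 56_388),
]

for label, hours in figures:
    print(f"{label}: {hours} h = {hours_to_years(hours):.1f} years")
```

This gives about 5.7 years for 50K hours, and about 6.2 and 6.4 years of spin time for the two disks in this thread.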
 