Is this drive dead? Or another issue...

Fish

Contributor
Joined
Jun 4, 2015
Messages
108
Hey all,

I've had a pair of 12TB WD Red drives in a mirrored pool running for quite some time. Last week TrueNAS started spitting out errors from one and took it offline. I replaced that drive yesterday and now today the other drive did the same thing. The timing seems weird, but I did buy and install them at the same time.

Here's the SMART data for the newly-erroring drive:
Code:
root@freenas[~]# smartctl -i /dev/ada2 -a
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Travelstar Z7K500
Device Model:     HGST HTS725050A7E630
Serial Number:    TF0500WH2AA90L
LU WWN Device Id: 5 000cca 77ee0e159
Firmware Version: GH2OA470
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Mar 18 10:03:37 2023 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         ( 5155) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  88) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   077   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0025   198   100   040    Pre-fail  Offline      -       99
  3 Spin_Up_Time            0x0023   219   100   033    Pre-fail  Always       -       1
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       2966
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   120   100   040    Pre-fail  Offline      -       32
  9 Power_On_Hours          0x0032   057   057   000    Old_age   Always       -       18924
 10 Spin_Retry_Count        0x0033   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       109
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       393216
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       4295294996
190 Airflow_Temperature_Cel 0x0022   073   054   045    Old_age   Always       -       27 (Min/Max 25/30)
191 G-Sense_Error_Rate      0x0032   052   052   000    Old_age   Always       -       12344
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2949165
193 Load_Cycle_Count        0x0032   097   097   000    Old_age   Always       -       34492
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       3
223 Load_Retry_Count        0x002a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 3
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 76 hours (3 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 33 4d 12 62 04  Error: ICRC, ABRT 51 sectors at LBA = 0x0462124d = 73536077

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 40 40 12 62 e0 00      00:17:42.570  WRITE DMA EXT
  35 00 40 00 12 62 e0 00      00:17:42.570  WRITE DMA EXT
  35 00 40 c0 11 62 e0 00      00:17:42.569  WRITE DMA EXT
  35 00 40 80 11 62 e0 00      00:17:42.569  WRITE DMA EXT
  35 00 40 40 11 62 e0 00      00:17:42.568  WRITE DMA EXT

Error 2 occurred at disk power-on lifetime: 75 hours (3 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 c3 3d 39 09 00  Error: ICRC, ABRT 195 sectors at LBA = 0x0009393d = 604477

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 00 00 36 09 e0 00      00:02:10.522  WRITE DMA EXT
  35 00 00 00 32 09 e0 00      00:02:10.520  WRITE DMA EXT
  35 00 00 00 2e 09 e0 00      00:02:10.511  WRITE DMA EXT
  35 00 00 00 2a 09 e0 00      00:02:10.509  WRITE DMA EXT
  35 00 00 00 26 09 e0 00      00:02:10.501  WRITE DMA EXT

Error 1 occurred at disk power-on lifetime: 14 hours (0 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 10 f0 0d 97 0d  Error: ICRC, ABRT 16 sectors at LBA = 0x0d970df0 = 228003312

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 80 80 0d 97 e0 00      01:38:20.192  WRITE DMA EXT
  35 00 80 00 0d 97 e0 00      01:38:20.191  WRITE DMA EXT
  35 00 80 80 0c 97 e0 00      01:38:20.189  WRITE DMA EXT
  35 00 80 00 0c 97 e0 00      01:38:20.188  WRITE DMA EXT
  35 00 80 80 0b 97 e0 00      01:38:20.186  WRITE DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     18915         -
# 2  Short offline       Completed without error       00%     18725         -
# 3  Extended offline    Completed without error       00%     18679         -
# 4  Short offline       Completed without error       00%     18677         -
# 5  Short offline       Completed without error       00%     18629         -
# 6  Short offline       Completed without error       00%     18628         -
# 7  Short offline       Completed without error       00%     18581         -
# 8  Short offline       Completed without error       00%     18533         -
# 9  Short offline       Completed without error       00%     18485         -
#10  Short offline       Completed without error       00%     18437         -
#11  Short offline       Completed without error       00%     18389         -
#12  Extended offline    Completed without error       00%     18343         -
#13  Short offline       Completed without error       00%     18341         -
#14  Short offline       Completed without error       00%     18293         -
#15  Short offline       Completed without error       00%     18245         -
#16  Short offline       Completed without error       00%     18197         -
#17  Short offline       Completed without error       00%     18149         -
#18  Short offline       Completed without error       00%     18101         -
#19  Short offline       Completed without error       00%     18053         -
#20  Extended offline    Completed without error       00%     18007         -
#21  Short offline       Completed without error       00%     18005         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


I'm seeing a lot of "error" terms, but I don't have much experience reading this output. Is the drive toast, or is my HBA maybe acting flaky?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Looks like platter damage at various LBA addresses. Could be from head crashes. I'd return this drive for a warranty replacement.
 

Fish

Unfortunately I think I am past the warranty date on this drive. You would not use a drive that was reporting this data?
 
Joined
Jun 2, 2019
Messages
591
You provided the smartctl output for a 500GB HGST Travelstar. Was that intentional?
 

Fish

D'oh! My mistake. It looks like the removed drive is not even listed anymore under /dev/ so I'll have to reboot the server and see if it comes back up.
 
Joined
Jun 2, 2019
Messages
591
Are you using the Travelstar as a boot drive? Might be time to replace.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Fish said: "You would not use a drive that was reporting this data?"
HDDs have an internal remapping capability. So when the first sector goes bad, it will be replaced by one that is "hidden" from the outside. This was introduced decades ago. I don't know how big those remapping areas are today. But my assumption is that something must have gone badly wrong, if errors start to show up to the "outside".

So to answer your question: For me this drive is effectively dead. I would immediately ensure that I have a fresh and tested backup, and then get a replacement drive, whatever the RMA status is.

In addition, it of course cannot hurt to check the cabling, especially the SATA connections.
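
A minimal sketch of how the counters relevant to the points above could be pulled out of `smartctl -A` output, assuming the usual ten-column attribute table shown in the first post. The function name `smart_triage` is hypothetical. Attributes 5, 196, 197 and 198 track remapped or pending sectors, while 199 (UDMA_CRC_Error_Count) counts transfer errors on the SATA link, which tends to point at cabling or the controller rather than the platters.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# the raw values of the sector-remapping and link-CRC attributes.
smart_triage() {
  awk '
    # Attribute rows have at least 10 columns; column 10 is RAW_VALUE.
    NF >= 10 && ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) {
      printf "media %s=%s\n", $2, $10
    }
    NF >= 10 && $1 == 199 {
      printf "link %s=%s\n", $2, $10
    }
  '
}

# Usage (the device name is an assumption, adjust to your system):
#   smartctl -A /dev/ada2 | smart_triage
```

Non-zero "media" values suggest the disk itself is degrading. A "link" value that keeps climbing while the "media" values stay at zero is more consistent with a cable, backplane or controller problem.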
 

Fish

Okay, sorry everybody, here's the SMART data for the 12TB drive. I accidentally posted the data for my boot drive in the OP.


[Attached screenshot: Capture.PNG]
 

Fish

And here is the specific alert email I got:

Code:
New alerts:
* Pool blue state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
  • Disk WDC WD120EMFZ-11A6JA0 9JHG3G8T is REMOVED


From the error, it seems like the drive was just disconnected? Maybe the SATA controller on my motherboard is finally dying?
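
One way to see whether a disk merely dropped off the bus is to look at the device states in the pool. This is a minimal sketch, assuming the standard `zpool status` config table (NAME, STATE, READ, WRITE, CKSUM). The helper name `not_online` is hypothetical, and the pool name `blue` comes from the alert above.

```shell
# Hypothetical helper: reads `zpool status` output on stdin and prints
# every pool/vdev/disk row whose STATE column is not ONLINE.
not_online() {
  awk '
    # Config rows look like: NAME STATE READ WRITE CKSUM [notes...],
    # with the three error counters starting with a digit.
    NF >= 5 && $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ && $2 != "ONLINE" {
      print $1, $2
    }
  '
}

# Usage on the NAS:
#   zpool status blue | not_online
```

A disk shown as REMOVED with zero read, write and checksum errors would fit a connection that went away (cable, power or controller) rather than a disk returning bad data.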
 

ChrisRJ

I would avoid constantly spinning up and down the drives.

This drive seems pretty new, with around 100 hours of operation. So it is basically dead on arrival. Next time I recommend a proper burn-in.

But the really disconcerting value for me is the temperature: 65 C is way too high. Ideal would be 30-35 C, with 45 C acceptable only for short periods. 45 C for longer periods will increase the likelihood of the HDD dying at about 5 years of age, if I remember the Backblaze report correctly.
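
The temperature figures above can be checked from the command line. This is a rough sketch, assuming the drive reports temperature under attribute 190 or 194 with the reading as the first number of the raw value, as in the `smartctl` output in the first post. The function name `temp_check` is hypothetical, and the 45 C default limit is simply taken from the figure above.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and flags
# temperature attributes (IDs 190 and 194) above a limit in Celsius.
temp_check() {
  limit="${1:-45}"   # default limit, from the 45 C figure above
  awk -v limit="$limit" '
    NF >= 10 && ($1 == 190 || $1 == 194) {
      # Column 10 is the first token of RAW_VALUE, the current reading.
      status = ($10 + 0 > limit + 0) ? "HOT" : "ok"
      printf "%s %dC %s\n", $2, $10, status
    }
  '
}

# Usage (the device name is an assumption, adjust to your system):
#   smartctl -A /dev/ada2 | temp_check 45
```

Readings taken while the drive sits in an enclosure or adapter with different airflow may not reflect its temperature inside the NAS case.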
 

Fish

I'll have to check the settings - I was pretty sure I had it set to not spin down. The temperature listed there isn't accurate, since the drive was plugged into a SATA-USB adapter on my desktop just for diagnostics, but that is a good reminder to check the temps when I get everything put back.
 