SmartCtl Interpretation - How to know when to replace a drive?

Ellimist · Jun 12, 2020

Hi Folks,

I usually just replace drives when smartctl starts reporting issues with them. I have a bunch of older drives floating around, though, and with the 3TB SMR issue I'm looking to replace most of my array with something much larger later in the year.

To test drives I haven't used for a while, I followed this guide: https://www.ixsystems.com/community/resources/hard-drive-burn-in-testing.92/

I do like it, but it's missing info on how to interpret the smartctl data and understand when to replace older drives.

Does anyone have a good thread or information on when to replace aging drives?

These are the results from one of my drives just now, doing badblocks and a long test.

Code:

root@NASTST01[~]# smartctl -A /dev/ada0

smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       205
  3 Spin_Up_Time            0x0027   204   180   021    Pre-fail  Always       -       4791
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       62
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   074   074   000    Old_age   Always       -       19115
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       62
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       5739
194 Temperature_Celsius     0x0022   118   105   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   199   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       5

subhuman · Jun 13, 2020

5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 2
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1

Run badblocks and the SMART short and long tests as per the instructions in that thread. If you still have a nonzero pending sector count after that, or if either of the two above values start increasing rapidly, trash the drive. Note that if the pending sector is successfully remapped, you SHOULD see the reallocated sector count increment by one.

Both #1 and #200 are nonzero, but I don't know what's normal here. You'll have to look into that yourself.
#3 "normal" values will vary based on specific drive model, and sometimes firmware version.

Other than that, I don't see anything that I would be more concerned with than normal. And by that, I mean a several year old HD is always worth some concern and a certain degree of mistrust.
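
As a rough illustration of keeping an eye on the two attributes quoted above, here is a minimal shell sketch that pulls the RAW_VALUE column for a given attribute ID out of `smartctl -A` output. The helper name `smart_raw` is made up for this example, and it assumes the ten-column attribute table shown in the first post; other drives or smartctl versions may lay the table out differently.

```shell
# Print the RAW_VALUE column for one SMART attribute ID.
# Reads `smartctl -A` output on stdin; prints nothing if the ID is absent.
# Assumption: the attribute table has the ten columns shown in this thread,
# so the raw value is the 10th whitespace-separated field.
smart_raw() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10; exit }'
}

# Example (needs root and a real device, so it is left commented out):
# smartctl -A /dev/ada0 | smart_raw 5      # reallocated sectors
# smartctl -A /dev/ada0 | smart_raw 197    # current pending sectors
```

Saving those two numbers before and after a burn-in run makes it easier to tell whether they are stable or climbing.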

danb35 · Jun 13, 2020

subhuman said:
If you still have a nonzero pending sector count after that, or if either of the two above values start increasing rapidly, trash the drive.

I wouldn't necessarily agree with the first phrase--I'm comfortable leaving a disk in the pool with current pending, offline uncorrectable, and/or reallocated sectors in the low single digits and stable, as long as there are no other signs of trouble. In this case, there are other signs of trouble, in the raw read error rate and the multi-zone error rate. For whatever reason, OP has chosen not to provide the SMART test history, which would also be relevant.

Ellimist said:
Does anyone have a good thread or information on when to replace aging drives?

The discussion comes up a bit, but there really isn't a definitive answer; it's largely a judgment call on your part--though this "aging" drive only has just over two years' time in service, which I wouldn't really consider an aging disk (unless it was sitting around for a long time before being put in service). But here's my general thought process:

Time in service is not a reason to replace a disk, or even much of a factor in whether to replace the disk.
Consistent SMART self-test failures are a reason to replace a disk. A single test failure (especially a short test failure), followed by a passing long test, isn't a reason.
Unreadable/pending or offline/uncorrectable sectors in small numbers aren't generally a reason to replace a disk (as long as they're stable), though they are something to pay attention to. What constitutes "small numbers" is a bit of a judgment call--I'm fine with low single digits, probably OK with 6-8, but if the numbers reach two digits I'm probably replacing the disk.
Other SMART attributes (1, 5, 7, 196, 200) can be of concern, but I don't generally see them at problematic levels without impacting 197/198, so the latter are what I generally pay attention to.
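
The rule of thumb above for attributes 197/198 can be sketched as a small shell function. This is only one reading of the post: the function name and the labels are hypothetical, and the cut-offs at 5 and 9 are assumptions standing in for "low single digits", "6-8", and "two digits" (the post does not say where 9 falls). It also only looks at a single count, so it cannot tell whether the number is stable.

```shell
# Classify a raw pending/uncorrectable sector count (attribute 197 or 198).
# Labels and exact boundaries are assumptions, not part of the original advice.
triage_count() {
    n="$1"
    case "$n" in
        ''|*[!0-9]*) echo "invalid"; return 2 ;;   # not a non-negative integer
    esac
    if [ "$n" -eq 0 ]; then
        echo "clean"            # nothing to report
    elif [ "$n" -le 5 ]; then
        echo "watch"            # low single digits: fine if stable
    elif [ "$n" -le 9 ]; then
        echo "watch-closely"    # probably OK, but keep checking
    else
        echo "replace"          # two digits: probably replace the disk
    fi
}

# Example: the drive in the first post reports 1 pending sector.
# triage_count 1
```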

Hard Drive Troubleshooting Guide (All Versions of FreeNAS)


subhuman · Jun 13, 2020

danb35 said:
I wouldn't necessarily agree with the first phrase

I think we're effectively saying the same thing. You qualify that those counts should be stable.
If he runs badblocks, short then long SMART tests as I recommended, the drive should have remapped that pending sector. So if the count is still nonzero afterwards, that means it found more marginal sectors and thus the count is not stable.
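
The before/after reasoning here can be written down as a short check. This is a sketch under assumptions: the function name and result labels are invented for illustration, and the inputs are raw counts noted by hand (attribute 5 before the burn-in, attributes 197 and 5 after it).

```shell
# Compare SMART counts taken before and after badblocks + short + long tests.
# Args: realloc_before pending_after realloc_after (raw values, integers).
after_burnin() {
    rb="$1"; pa="$2"; ra="$3"
    if [ "$pa" -gt 0 ]; then
        # Pending sectors still present after a full pass: count is not stable.
        echo "pending-remains"
    elif [ "$ra" -gt "$rb" ]; then
        # Pending count cleared and reallocations went up: sectors were remapped.
        echo "remapped:$((ra - rb))"
    else
        # Pending count cleared with no new reallocations recorded.
        echo "cleared"
    fi
}

# Example: 2 reallocated before, then 0 pending and 3 reallocated after.
# after_burnin 2 0 3
```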

Ellimist · Jun 13, 2020

Thanks folks. Normally I just outright replace drives when I start seeing errors in the console, as I have short and long tests regularly scheduled on my main NAS. But I wanted to test these drives for use on my VM server, so I chucked them in a test bench with FreeNAS to run them through.

I didn't provide history as I didn't know there was a command to produce that. I just ran smartctl -A to get the data. I started the thread as I wanted to understand the results a bit better than "oh, I got an error, replace". The information provided has been quite useful in doing that. Cheers.

danb35 · Jun 13, 2020

Ellimist said:
I just ran smartctl -A to get the data.

-a instead of -A will give more information including the SMART test history.
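
If only the test history is wanted, `smartctl -l selftest` prints just the self-test log rather than everything `-a` shows. A minimal sketch follows; the wrapper name `smart_selftest_log` is hypothetical, and the fallback that prints the command instead of running it is only there so the snippet is harmless on a machine without smartctl or the device.

```shell
# Print the SMART self-test log for a device. If smartctl or the device node
# is missing on this machine, print the command instead of running it.
smart_selftest_log() {
    dev="$1"
    if command -v smartctl >/dev/null 2>&1 && [ -e "$dev" ]; then
        smartctl -l selftest "$dev"
    else
        echo "would run: smartctl -l selftest $dev"
    fi
}

# Example (usually needs root):
# smart_selftest_log /dev/ada0
```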
