Yet Another SMART error thread

eldo · Aug 9, 2017

This morning I noticed I have a critical alarm:
Aug. 9, 2017, 2:54 a.m. - Device: /dev/ada5, Self-Test Log error count increased from 0 to 1

Looking at the SMART results I'm not sure how to interpret this, and since I'm heading out for a week-long trip Saturday morning I want to make sure things are stable.
The drive is in warranty, but I won't be able to get the RMA shipment (if needed), run the SMART/badblocks tests, and resilver before I leave.

So at the moment I'm looking for help interpreting the SMART results, and advice on whether I need to drop into a local big box store and pick up a suitable stop-gap (I know, I know... I should have a spare available but I don't).

I noticed a similar issue here: https://forums.freenas.org/index.ph...r-ran-long-test-what-to-do.56858/#post-399079 and the advice was to monitor the drive since there's been a reallocation.
However, in that thread, the drive shows:

Code:

ID# ATTRIBUTE_NAME         FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  200   200   051    Pre-fail  Always   -           46
  5 Reallocated_Sector_Ct  0x0033  200   200   140    Pre-fail  Always   -           1
200 Multi_Zone_Error_Rate  0x0008  200   200   000    Old_age   Offline  -           23

Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%        35181            -
#13  Extended offline  Completed: read failure  90%        35161            309151832

I also notice that in that thread the previous long test was bad, but the most recent test completed without error.

In my case (below), I do not have any raw read errors or any reallocated sectors.
I do have a multi zone error rate of '1'.
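
For anyone who wants to do the same comparison without eyeballing the table, here is a minimal Python sketch that pulls the raw values out of attribute rows like the ones in the smartctl output in this post and lists the watched attributes that are non-zero. The helper names and the watch list (IDs 5, 196, 197, 198, 200) are assumptions made for illustration, not anything defined by smartmontools or FreeNAS.

```python
# Hypothetical watch list: attribute IDs discussed in this thread plus the
# usual sector-health counters. This is an assumption, not an official list.
WATCHED = {5, 196, 197, 198, 200}

def parse_attributes(text):
    """Return {id: (name, raw_value)} for SMART attribute table rows."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows have 10 columns: a numeric ID, a name, a 0x flag, ...
        if len(parts) >= 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            raw = parts[9]
            if raw.isdigit():  # skip raw values that are not plain integers
                rows[int(parts[0])] = (parts[1], int(raw))
    return rows

def nonzero_watched(rows, watched=WATCHED):
    """Return the watched attributes whose raw value is above zero."""
    return {i: rows[i] for i in sorted(watched) if i in rows and rows[i][1] > 0}

# Rows copied from the smartctl output in this post.
SAMPLE = """\
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  0
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  1
"""

if __name__ == "__main__":
    print(nonzero_watched(parse_attributes(SAMPLE)))
```

On these rows only Multi_Zone_Error_Rate comes back, which matches the reading above: no reallocated or pending sectors, one multi-zone error.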

Test #1 (short) is one I ran once I got to work and saw the critical alarm. I'm running a long test now, due to complete in about 4 hours.

Code:

root@freenas:~ # smartctl -a /dev/ada5
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:  Western Digital Red
Device Model:  WDC WD20EFRX-68EUZN0
LU WWN Device Id: 5 0014ee 2b55fb0de
Firmware Version: 82.00A82
User Capacity:  2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  5400 rpm
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Wed Aug  9 11:17:17 2017 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
  was never started.
  Auto Offline Data Collection: Disabled.
Self-test execution status:  ( 248) Self-test routine in progress...
  80% of test remaining.
Total time to complete Offline
data collection:  (27240) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  2) minutes.
Extended self-test routine
recommended polling time:  ( 275) minutes.
Conveyance self-test routine
recommended polling time:  (  5) minutes.
SCT capabilities:  (0x703d) SCT Status supported.
  SCT Error Recovery Control supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  0
  3 Spin_Up_Time  0x0027  196  175  021  Pre-fail  Always  -  3200
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  159
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  069  069  000  Old_age  Always  -  23029
 10 Spin_Retry_Count  0x0032  100  100  000  Old_age  Always  -  0
 11 Calibration_Retry_Count 0x0032  100  100  000  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  159
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  142
193 Load_Cycle_Count  0x0032  200  200  000  Old_age  Always  -  1098
194 Temperature_Celsius  0x0022  111  107  000  Old_age  Always  -  36
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline  Completed without error  00%  23028  -
# 2  Extended offline  Completed: read failure  10%  23020  3836215800
# 3  Short offline  Completed without error  00%  22884  -
# 4  Short offline  Completed without error  00%  22812  -
# 5  Short offline  Completed without error  00%  22793  -
# 6  Short offline  Completed without error  00%  22721  -
# 7  Short offline  Completed without error  00%  22690  -
# 8  Short offline  Completed without error  00%  22612  -
# 9  Short offline  Completed without error  00%  22492  -
#10  Short offline  Completed without error  00%  22435  -
#11  Short offline  Completed without error  00%  22374  -
#12  Short offline  Completed without error  00%  22325  -
#13  Extended offline  Interrupted (host reset)  10%  22290  -
#14  Short offline  Completed without error  00%  22239  -
#15  Short offline  Completed without error  00%  22184  -
#16  Short offline  Completed without error  00%  22100  -
#17  Short offline  Completed without error  00%  22079  -
#18  Short offline  Completed without error  00%  22005  -
#19  Short offline  Completed without error  00%  21982  -
#20  Extended offline  Completed without error  00%  21923  -
#21  Short offline  Completed without error  00%  21860  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

zpool status looks to be good as far as I can tell as well.

Code:

root@freenas:~ # zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h20m with 0 errors on Sun Jul 16 04:05:59 2017
config:

  NAME          STATE   READ WRITE CKSUM
  freenas-boot  ONLINE     0     0     0
    da0p2       ONLINE     0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
  still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
  the pool may no longer be accessible by software that does not support
  the features. See zpool-features(7) for details.
  scan: scrub repaired 0 in 188h7m with 0 errors on Tue Aug  8 22:07:31 2017
config:

  NAME                                            STATE   READ WRITE CKSUM
  tank                                            ONLINE     0     0     0
    raidz2-0                                      ONLINE     0     0     0
      gptid/11d0bc66-8bba-11e4-8a98-d0509950442d  ONLINE     0     0     0
      gptid/123e4c35-8bba-11e4-8a98-d0509950442d  ONLINE     0     0     0
      gptid/12ab260f-8bba-11e4-8a98-d0509950442d  ONLINE     0     0     0
      gptid/131657ce-8bba-11e4-8a98-d0509950442d  ONLINE     0     0     0
      gptid/856e1bab-74fb-11e5-b6c6-d0509950442d  ONLINE     0     0     0
      gptid/13efb8f6-8bba-11e4-8a98-d0509950442d  ONLINE     0     0     0

errors: No known data errors

Any help and/or advice would be most welcome.

danb35 · Aug 9, 2017

Well, the disk failed its last long self-test, which is a bad sign. I'd try doing another long test and see if it gives the same result. If so, plan to replace the disk ASAP, though I wouldn't think it's necessary to down the server in the interim.
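
To put the failed test in context, here is a small Python sketch of where the reported LBA sits on the disk. The sector size, capacity, and LBA are copied from the smartctl output earlier in the thread; the function name is made up for illustration, and the arithmetic assumes the LBA is counted in 512-byte logical sectors.

```python
# Values copied from the smartctl output in the first post.
LOGICAL_SECTOR_BYTES = 512           # "Sector Sizes: 512 bytes logical"
CAPACITY_BYTES = 2_000_398_934_016   # "User Capacity"
FAILED_LBA = 3_836_215_800           # LBA_of_first_error for self-test #2

def lba_position(lba, sector_bytes, capacity_bytes):
    """Return (byte offset, fraction of the disk) for a logical block address."""
    offset = lba * sector_bytes
    return offset, offset / capacity_bytes

offset, fraction = lba_position(FAILED_LBA, LOGICAL_SECTOR_BYTES, CAPACITY_BYTES)
print(f"byte offset: {offset:,}")
print(f"position:    {fraction:.1%} of the way through the disk")
```

The failing block is close to the end of the disk, which is roughly in line with the log showing the extended test stopping with 10% remaining (that field is reported in 10% steps). If a repeat test fails at the same LBA, that points to one persistent bad spot rather than errors scattered across the surface.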

eldo · Aug 9, 2017

danb35 said:
Well, the disk failed its last long self-test, which is a bad sign. I'd try doing another long test and see if it gives the same result. If so, plan to replace the disk ASAP, though I wouldn't think it's necessary to down the server in the interim.

That's my current plan.

So assuming the current long test fails, and I'm going out of town in a couple of days... does it make sense to get a desktop drive as a stop-gap, resilver it, then do a warranty replacement, then resilver that the following week?

Assuming the current long test passes, does it make sense to adjust my existing SMART test schedule, or to set a more frequent schedule for that particular drive?
Currently:
Long: 2AM 8th, 22nd every month
Short: 1AM every 2 days

danb35 · Aug 9, 2017

eldo said:
does it make sense to get a desktop drive as a stop-gap, resilver it, then do a warranty replacement, then resilver that the following week?

No, I wouldn't think so, especially if you can get an advance RMA--I know I've been able to have WD do that. That way, I don't have to take the (on its way to failing, but still working) disk out of the pool until I'm ready to install its replacement, and possibly not even then. In my case, I have a few (well, a lot, actually, but most of them aren't easily accessible) open bays in my server, so I install the replacement drive, burn it in, and start the replacement without removing the failing disk. Once the replacement process completes (but not before), the old disk is offline'd, so I never lose redundancy in the pool.

eldo · Aug 9, 2017

danb35 said:
No, I wouldn't think so, especially if you can get an advance RMA--I know I've been able to have WD do that. That way, I don't have to take the (on its way to failing, but still working) disk out of the pool until I'm ready to install its replacement, and possibly not even then. In my case, I have a few (well, a lot, actually, but most of them aren't easily accessible) open bays in my server, so I install the replacement drive, burn it in, and start the replacement without removing the failing disk. Once the replacement process completes (but not before), the old disk is offline'd, so I never lose redundancy in the pool.

Thanks for the recommendation.
This would be the 3rd of 6 WD Reds that have failed in this system.

I'll get the advance RMA started then, and address it when I get back in town.

Guess I'll try to use an idle workstation to do the initial SMART/badblocks testing on the replacement, then swap it in when I'm ready to do the resilver.
I don't have any spare bays or ports in my FN case.

eldo · Aug 9, 2017

One last question:

Assuming the drive tests good and I leave it in place, will the FreeNAS SMART tasks alert me again on the same drive if it has further read/write errors?

danb35 · Aug 9, 2017

eldo said:
will the FreeNAS SMART tasks alert me again on the same drive if it has further read/write errors?

They should.
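
The wording of the original alert ("Self-Test Log error count increased from 0 to 1") suggests the notification is triggered by the number of failed self-tests going up. Below is a rough Python sketch of that idea, not smartd's actual implementation. The function names are hypothetical, and treating any status containing "failure" as a failed test is an assumption.

```python
import re

# Matches self-test log rows such as "# 2  Extended offline  Completed: ...".
ENTRY = re.compile(r"^#\s*(\d+)\s+(.*)$")

def count_failed_tests(log_text):
    """Count self-test log rows whose status mentions a failure (assumed rule)."""
    count = 0
    for line in log_text.splitlines():
        match = ENTRY.match(line.strip())
        if match and "failure" in match.group(2).lower():
            count += 1
    return count

def should_alert(previous_count, log_text):
    """Return (alert?, current_count); alert only when the count has increased."""
    current = count_failed_tests(log_text)
    return current > previous_count, current

# Rows copied from the self-test log in the first post.
LOG = """\
# 1  Short offline  Completed without error  00%  23028  -
# 2  Extended offline  Completed: read failure  10%  23020  3836215800
# 3  Short offline  Completed without error  00%  22884  -
#13  Extended offline  Interrupted (host reset)  10%  22290  -
"""

print(should_alert(0, LOG))  # first time the failure is seen
print(should_alert(1, LOG))  # same log checked again later
```

Under that reading, a second failed test would push the count to 2 and trigger a new alert, while rechecking an unchanged log would not. The interrupted test is not counted as a failure here.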

eldo · Aug 9, 2017

Excellent. Thanks again.

nojohnny101 · Aug 9, 2017

All good advice so far. The only thing I would add is that this is the very reason this forum recommends raidz2 in most situations. I've had situations where I was on the road for multiple weeks and a drive died (like completely gone, no warning). Anyway, I didn't want to shut the server down because I was using it remotely during the trip for file access and media consumption. So when I returned, I RMA'd the disk and replaced it. The server never went down.

So for me one of the biggest advantages of running raidz2 is I don't have to drop everything in my life and treat the server like a child when things go wrong. Granted I don't purposefully neglect it or ignore failed drives, but the server remains operational and I get to it when I can.

It seems your case is similar to mine. Just my 2 cents.
