My first SMART error: What to do

Status
Not open for further replies.

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Hi guys,

When I logged in to my server today, I saw that /dev/da5 has a problem with SMART.
Here is the output:
Code:
root@freenas:~ # smartctl -a /dev/da5
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:  Western Digital Red
Device Model:  WDC WD40EFRX-68WT0N0
Serial Number:  WD-WCXXXXXXXXXXX
LU WWN Device Id: 5 0014ee 2b8716486
Firmware Version: 82.00A82
User Capacity:  4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  5400 rpm
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Thu Nov  2 21:20:08 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
  was never started.
  Auto Offline Data Collection: Disabled.
Self-test execution status:  ( 113) The previous self-test completed having
  the read element of the test failed.
Total time to complete Offline
data collection:  (51720) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  2) minutes.
Extended self-test routine
recommended polling time:  ( 517) minutes.
Conveyance self-test routine
recommended polling time:  (  5) minutes.
SCT capabilities:  (0x703d) SCT Status supported.
  SCT Error Recovery Control supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  35
  3 Spin_Up_Time  0x0027  184  171  021  Pre-fail  Always  -  7775
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  99
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  100  253  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  095  095  000  Old_age  Always  -  3769
 10 Spin_Retry_Count  0x0032  100  253  000  Old_age  Always  -  0
 11 Calibration_Retry_Count 0x0032  100  253  000  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  99
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  12
193 Load_Cycle_Count  0x0032  199  199  000  Old_age  Always  -  4756
194 Temperature_Celsius  0x0022  125  112  000  Old_age  Always  -  27
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline  Completed: read failure  10%  3729  51135160
# 2  Short offline  Completed without error  00%  3705  -
# 3  Extended offline  Completed without error  00%  3592  -
# 4  Conveyance offline  Completed without error  00%  3582  -
# 5  Short offline  Completed without error  00%  3582  -
# 6  Short offline  Completed without error  00%  3576  -
# 7  Extended offline  Completed without error  00%  3461  -
# 8  Short offline  Completed without error  00%  3337  -
# 9  Extended offline  Completed without error  00%  3245  -
#10  Short offline  Completed without error  00%  3165  -
#11  Short offline  Completed without error  00%  3093  -
#12  Short offline  Completed without error  00%  3021  -
#13  Short offline  Completed without error  00%  2949  -
#14  Short offline  Completed without error  00%  2877  -
#15  Short offline  Completed without error  00%  2805  -
#16  Short offline  Completed without error  00%  2733  -
#17  Short offline  Completed without error  00%  2661  -
#18  Short offline  Completed without error  00%  2589  -
#19  Short offline  Completed without error  00%  2529  -
#20  Short offline  Completed without error  00%  2468  -
#21  Short offline  Completed without error  00%  2467  -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



Now my question: what should I do next? This drive is about a year old, and after my big move, everything was fine in a long test. Nothing happened to the server; it has just been running.
Any help -> Appreciated :)

Do you need my system specs? Please look at my signature ;)

I ran another test, and now it shows no errors? :O
Code:
SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline  Completed without error  00%  3769  -
# 2  Short offline  Completed: read failure  10%  3729  51135160
# 3  Short offline  Completed without error  00%  3705  -
# 4  Extended offline  Completed without error  00%  3592  -
# 5  Conveyance offline  Completed without error  00%  3582  -
# 6  Short offline  Completed without error  00%  3582  -
# 7  Short offline  Completed without error  00%  3576  -
# 8  Extended offline  Completed without error  00%  3461  -
# 9  Short offline  Completed without error  00%  3337  -
#10  Extended offline  Completed without error  00%  3245  -
#11  Short offline  Completed without error  00%  3165  -
#12  Short offline  Completed without error  00%  3093  -
#13  Short offline  Completed without error  00%  3021  -
#14  Short offline  Completed without error  00%  2949  -
#15  Short offline  Completed without error  00%  2877  -
#16  Short offline  Completed without error  00%  2805  -
#17  Short offline  Completed without error  00%  2733  -
#18  Short offline  Completed without error  00%  2661  -
#19  Short offline  Completed without error  00%  2589  -
#20  Short offline  Completed without error  00%  2529  -
#21  Short offline  Completed without error  00%  2468  -



Looking at the timing, the only cause I can imagine is that a microwave triggered the problem. Because of limited space, the server sits in my kitchen, in a big metal Nanoxia Deep Silence 6 Rev.B.
The microwave is about 50 cm away from it. Could this be the reason? If yes, I need to move that damn thing :/
Funny, but according to this it should not matter: https://www.reddit.com/r/pcmasterra..._there_risks_in_keeping_your_computer_near_a/.

Okay, well, another one:
Code:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  35
  3 Spin_Up_Time  0x0027  184  171  021  Pre-fail  Always  -  7775
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  99
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  100  253  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  095  095  000  Old_age  Always  -  3769
 10 Spin_Retry_Count  0x0032  100  253  000  Old_age  Always  -  0
 11 Calibration_Retry_Count 0x0032  100  253  000  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  99
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  12
193 Load_Cycle_Count  0x0032  199  199  000  Old_age  Always  -  4756
194 Temperature_Celsius  0x0022  124  112  000  Old_age  Always  -  28
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline  Completed: read failure  10%  3769  51142152
# 2  Short offline  Completed without error  00%  3769  -
# 3  Short offline  Completed: read failure  10%  3729  51135160
# 4  Short offline  Completed without error  00%  3705  -
# 5  Extended offline  Completed without error  00%  3592  -
# 6  Conveyance offline  Completed without error  00%  3582  -
# 7  Short offline  Completed without error  00%  3582  -
# 8  Short offline  Completed without error  00%  3576  -
# 9  Extended offline  Completed without error  00%  3461  -
#10  Short offline  Completed without error  00%  3337  -
#11  Extended offline  Completed without error  00%  3245  -
#12  Short offline  Completed without error  00%  3165  -
#13  Short offline  Completed without error  00%  3093  -
#14  Short offline  Completed without error  00%  3021  -
#15  Short offline  Completed without error  00%  2949  -
#16  Short offline  Completed without error  00%  2877  -
#17  Short offline  Completed without error  00%  2805  -
#18  Short offline  Completed without error  00%  2733  -
#19  Short offline  Completed without error  00%  2661  -
#20  Short offline  Completed without error  00%  2589  -
#21  Short offline  Completed without error  00%  2529  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



I will run a long test next, but maybe I need to replace this drive.
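
For reference, a minimal sketch of how the long test could be started and how the failing LBAs could be pulled out of the self-test log afterwards. `smartctl -t long` and `smartctl -l selftest` are standard smartmontools options; the helper name `failed_lbas` is made up for illustration, and the parsing assumes the log layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the LBA_of_first_error of every self-test log
# entry that ended in a read failure. Reads `smartctl -l selftest` output
# on stdin and assumes the column layout shown in this thread.
failed_lbas() {
    awk '/^# *[0-9]+ / && /read failure/ { print $NF }'
}

# Usage on the real system (not run here):
#   smartctl -t long /dev/da5          # start the extended self-test
#   smartctl -l selftest /dev/da5 | failed_lbas
```

According to the smartctl output above, the extended test is expected to take about 517 minutes on this drive, so the log only makes sense to check after that.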
 

styno

Patron
Joined
Apr 11, 2016
Messages
466

rs225

Guru
Joined
Jun 28, 2014
Messages
878
If it was out of warranty, I would probably wait and watch it. But since you have two read failures fairly close to each other, there is a good chance it will worsen over time. Microwave probably not related. Scrubs are clean?
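
To put a number on "fairly close", the gap between the two reported LBAs can be worked out with plain shell arithmetic. This is only an illustrative calculation, assuming the LBAs in the self-test log count 512-byte logical sectors (the logical sector size smartctl reports for this drive); the variable names are made up.

```shell
#!/bin/sh
# LBAs of the two failed short tests, taken from the self-test log above.
lba_a=51135160
lba_b=51142152
logical=512      # bytes per logical sector, per the smartctl output
physical=4096    # bytes per physical sector, per the smartctl output

gap_sectors=$((lba_b - lba_a))
gap_bytes=$((gap_sectors * logical))
gap_physical=$((gap_bytes / physical))

echo "gap: $gap_sectors logical sectors, $gap_bytes bytes, $gap_physical physical sectors"
```

That works out to 6992 logical sectors, or roughly 3.4 MiB, so both failures sit in the same small region of a 4 TB disk.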
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
RMA the drive.
 

Waco

Explorer
Joined
Dec 29, 2014
Messages
53
If it's in warranty, RMA it. If not, do a full read of the entire drive (dd if=/dev/da5 of=/dev/null bs=1m) and see if any of the error counters increase. If not, keep a close eye on it. If yes, get a replacement on order. :)
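
A rough sketch of the "see if any of the error counters increase" step: save the attribute table before and after the full read and print only the raw values that changed. The helper names `raw_values` and `changed_raw` and the `/tmp` file names are hypothetical, and the parsing assumes the `smartctl -A` table layout posted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reduce a `smartctl -A` attribute table (stdin) to
# "ID NAME RAW_VALUE" lines. Assumes the 10-column layout shown in this thread.
raw_values() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 { print $1, $2, $10 }'
}

# Hypothetical helper: compare two files produced by raw_values and print
# "ID NAME before -> after" for every attribute whose raw value changed.
changed_raw() {
    awk 'NR == FNR { before[$1] = $3; next }
         ($1 in before) && before[$1] != $3 { print $1, $2, before[$1], "->", $3 }' "$1" "$2"
}

# Usage on the real system (not run here):
#   smartctl -A /dev/da5 | raw_values > /tmp/smart_before
#   dd if=/dev/da5 of=/dev/null bs=1m
#   smartctl -A /dev/da5 | raw_values > /tmp/smart_after
#   changed_raw /tmp/smart_before /tmp/smart_after
```

Counters such as Power_On_Hours will change on their own during a long read, so the ones worth watching are the error-related attributes (5, 197, 198, 200).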
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
RMA
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Okay, thank you guys.
I have just run some more tests:
Code:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  35
  3 Spin_Up_Time  0x0027  184  171  021  Pre-fail  Always  -  7775
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  99
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  095  095  000  Old_age  Always  -  3795
 10 Spin_Retry_Count  0x0032  100  253  000  Old_age  Always  -  0
 11 Calibration_Retry_Count 0x0032  100  253  000  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  99
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  12
193 Load_Cycle_Count  0x0032  199  199  000  Old_age  Always  -  4773
194 Temperature_Celsius  0x0022  124  112  000  Old_age  Always  -  28
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  22


SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline  Completed: read failure  10%  3795  51135160
# 2  Extended offline  Completed without error  00%  3779  -
# 3  Short offline  Completed: read failure  10%  3769  51142152
# 4  Short offline  Completed without error  00%  3769  -
# 5  Short offline  Completed: read failure  10%  3729  51135160
# 6  Short offline  Completed without error  00%  3705  -
# 7  Extended offline  Completed without error  00%  3592  -
# 8  Conveyance offline  Completed without error  00%  3582  -
# 9  Short offline  Completed without error  00%  3582  -
#10  Short offline  Completed without error  00%  3576  -
#11  Extended offline  Completed without error  00%  3461  -
#12  Short offline  Completed without error  00%  3337  -
#13  Extended offline  Completed without error  00%  3245  -
#14  Short offline  Completed without error  00%  3165  -
#15  Short offline  Completed without error  00%  3093  -
#16  Short offline  Completed without error  00%  3021  -
#17  Short offline  Completed without error  00%  2949  -
#18  Short offline  Completed without error  00%  2877  -
#19  Short offline  Completed without error  00%  2805  -
#20  Short offline  Completed without error  00%  2733  -
#21  Short offline  Completed without error  00%  2661  -



And I have requested an RMA at WD. Do you know how long this needs? Because I have no idea whether I should run my server with one disk less or not....
So this could take up to two weeks.
So the first thing I will do is a full backup. But I do not know if I should risk a scrub now.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
And I have requested an RMA at WD. Do you know how long this needs?
If you request an advance RMA (which you could do with WD last time I needed to return a drive), you don't have to be without a drive at all. Get the new drive, burn it in, replace the old one, then send back the old one.
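
What "burn it in" involves varies from person to person. As a sketch only, one possible sequence is a round of SMART self-tests around a destructive write/read pass with `badblocks`. The exact steps, the helper name `burnin_plan`, and the device name `da6` are assumptions for illustration, not something prescribed in this thread. The helper only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) one possible burn-in sequence for a
# NEW, EMPTY drive. The badblocks pass is destructive; the exact steps are
# an assumption, not a procedure taken from this thread.
burnin_plan() {
    dev=$1
    echo "smartctl -t short $dev"
    echo "smartctl -t conveyance $dev"
    echo "smartctl -t long $dev"
    echo "badblocks -b 4096 -ws $dev"   # destructive write/read pass
    echo "smartctl -t long $dev"
    echo "smartctl -a $dev"             # review attributes and self-test log
}

# Usage: burnin_plan /dev/da6   (da6 is a placeholder device name)
```

Each self-test has to finish before the next step is started, and the destructive pass must never be pointed at a disk that is still part of the pool.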
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
If you request an advance RMA (which you could do with WD last time I needed to return a drive), you don't have to be without a drive at all. Get the new drive, burn it in, replace the old one, then send back the old one.
Hi danb35,
thanks for your answer.
Unfortunately, I requested the RMA yesterday as a "Standard RMA". I saw that I could request an advance one, but after reading through everything about the credit card and so on, I chose the Standard one. I do not know whether I can still change it from Standard to Advance.
If this is not possible, I have two options:
1. Let the pool run in degraded mode for 2 weeks (RAIDZ2)
2. Temporarily replace the WD Red with a desktop WD Green

What would you guys do?
Thank you!
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Your pool shouldn't be degraded right now, so there is no problem with waiting two weeks. Even if it was degraded, there is no problem waiting two weeks as long as you scrub more often than never.
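
A small sketch of how the pool state could be checked before and during the wait. `zpool status` and `zpool scrub` are standard ZFS commands; the helper name `pool_states` and the pool name `tank` are placeholders, and the parsing assumes the usual `pool:` / `state:` lines of `zpool status` output.

```shell
#!/bin/sh
# Hypothetical helper: print "POOLNAME STATE" for every pool found in
# `zpool status` output read from stdin.
pool_states() {
    awk '$1 == "pool:"  { name = $2 }
         $1 == "state:" { print name, $2 }'
}

# Usage on the real system (not run here; "tank" is a placeholder pool name):
#   zpool status | pool_states     # e.g. shows ONLINE or DEGRADED per pool
#   zpool scrub tank               # start a scrub
#   zpool status -v tank           # scrub progress and any errors found
```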
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Hi there,

So last week I sent the drive to WD for RMA. As of today, thanks to DHL, the package has still not arrived at WD.
I have powered down my NAS and am still waiting for the drive...
Since this process can take forever, I am thinking about buying a new one instead. It is also possible that I get a refurbished drive back from WD, and maybe it's not the best idea to run a refurbished one in FreeNAS 24/7.
But maybe I will get a new one back, who knows?
Does anyone have experience with that?
Should I buy a new one and keep the replacement drive as a cold spare?

Thanks,
Ice
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Burn the replacement in before you put it in service to make sure it's OK to add to your pool.
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Burn the replacement in before you put it in service to make sure it's OK to add to your pool.
For sure, once I finally get it.
 