Long SMART test fail...

BlueMagician · Mar 16, 2016

Hi all,

I had an email notification from FreeNAS on Tuesday morning, to inform me that:

Code:

Device: /dev/da5 [SAT], Self-Test Log error count increased from 0 to 1

I ran a second manual long test, and sure enough, a second email came telling me the error count had increased from 1 to 2.

Running smartctl -a /dev/da5 reveals:

Code:

=== START OF INFORMATION SECTION ===
Model Family:  Western Digital Red
Device Model:  WDC WD60EFRX-68MYMN1
Serial Number:  WD-WXxxxxxxxxxx
LU WWN Device Id: x xxxxxx xxxxxxxxx
Firmware Version: 82.00A82
User Capacity:  6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  5700 rpm
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Wed Mar 16 07:28:30 2016 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
  was never started.
  Auto Offline Data Collection: Disabled.
Self-test execution status:  ( 244) Self-test routine in progress...
  40% of test remaining.
Total time to complete Offline
data collection:  ( 5384) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  2) minutes.
Extended self-test routine
recommended polling time:  ( 707) minutes.
Conveyance self-test routine
recommended polling time:  (  5) minutes.
SCT capabilities:  (0x303d) SCT Status supported.
  SCT Error Recovery Control supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  0
  3 Spin_Up_Time  0x0027  203  201  021  Pre-fail  Always  -  8825
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  865
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  095  095  000  Old_age  Always  -  4174
10 Spin_Retry_Count  0x0032  100  100  000  Old_age  Always  -  0
11 Calibration_Retry_Count 0x0032  100  100  000  Old_age  Always  -  0
12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  428
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  191
193 Load_Cycle_Count  0x0032  200  200  000  Old_age  Always  -  1582
194 Temperature_Celsius  0x0022  122  111  000  Old_age  Always  -  30
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed: read failure  30%  4167  3527164016
# 2  Extended offline  Completed: read failure  30%  4144  3527164016
# 3  Short offline  Completed without error  00%  3972  -
# 4  Short offline  Completed without error  00%  3970  -
# 5  Extended offline  Completed without error  00%  3839  -
# 6  Extended offline  Completed without error  00%  3791  -
# 7  Short offline  Completed without error  00%  3684  -
# 8  Short offline  Completed without error  00%  3682  -
# 9  Short offline  Completed without error  00%  3612  -
#10  Short offline  Completed without error  00%  3610  -
#11  Extended offline  Completed without error  00%  3455  -
#12  Short offline  Completed without error  00%  3276  -
#13  Short offline  Completed without error  00%  3274  -
#14  Extended offline  Completed without error  00%  3144  -
#15  Extended offline  Completed without error  00%  3048  -
#16  Short offline  Completed without error  00%  2941  -
#17  Short offline  Completed without error  00%  2939  -
#18  Short offline  Completed without error  00%  2869  -
#19  Short offline  Completed without error  00%  2867  -
#20  Extended offline  Completed without error  00%  2713  -
#21  Short offline  Completed without error  00%  2533  -

I suspect the key things to worry about here are the Extended offline read failures, and possibly the Multi_Zone_Error_Rate?
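As a rough aid for reading the self-test log above, the sketch below pulls out the failed rows and converts the reported LBA into a byte offset. It is a minimal Python sketch, not a smartctl feature: the name `failed_tests`, the regular expression, and the assumption that columns are separated by two or more spaces (as in the pasted log) are illustrative only. The 512-byte default is the logical sector size from the information section.

```python
import re

# Rows copied from the self-test log above (two failures, one pass).
LOG = """\
# 1  Extended offline  Completed: read failure  30%  4167  3527164016
# 2  Extended offline  Completed: read failure  30%  4144  3527164016
# 3  Short offline  Completed without error  00%  3972  -
"""

# Assumes columns are separated by two or more spaces, as in the pasted log.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<desc>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\d+|-)\s*$"
)


def failed_tests(log_text, logical_sector_bytes=512):
    """Return one dict per self-test row whose status mentions a failure."""
    failures = []
    for line in log_text.splitlines():
        m = ROW.match(line.strip())
        if m is None or "failure" not in m["status"]:
            continue
        lba = None if m["lba"] == "-" else int(m["lba"])
        failures.append({
            "num": int(m["num"]),
            "test": m["desc"],
            "hours": int(m["hours"]),
            "lba": lba,
            # Byte offset of the first unreadable logical sector, if reported.
            "byte_offset": None if lba is None else lba * logical_sector_bytes,
        })
    return failures


for f in failed_tests(LOG):
    print(f["num"], f["test"], f["hours"], f["lba"], f["byte_offset"])
```

Both failed runs stop at the same LBA, which suggests one repeatable bad spot rather than scattered errors. That LBA is also a multiple of 8, so it falls on the start of a 4096-byte physical sector.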

The drive is under warranty, but I've heard that WD will furnish you with a refurb unit rather than a new one - and with seemingly only one error in one section so far, I wonder if it's too early to panic?

For the record, this WD Red drive was the first one I purchased as a test unit and was initially a standalone drive - later added along with more drives into an array when I moved to a FreeNAS server - hence the initially high cycle count accumulated...

So is it time to worry, or shall I just keep an eye peeled?

Many thanks,

DrKK · Mar 16, 2016

I think 99% of us would replace the drive, for a couple of reasons:

1) The error you're receiving is fairly serious, bad enough to stop the SMART testing and throw a SMART test fail. That means you can't actually complete SMART tests, which, by definition, should be one of your requirements for keeping a drive in service on a NAS, don't you think?
2) We don't run drives in our NASses that we don't have confidence in.
3) WD has a good (usually) reputation for RMA'ing reds and replacing them with something reasonable, even if refurbished.
4) A 6TB drive throwing errors at 4000 hours into its life will almost certainly get worse, quickly.

I would replace it, using the RMA process where you can receive the advance replacement and THEN send yours back.

Your NAS deserves it.

joeschmuck · Mar 16, 2016

Don't just take one person's opinion, I'll add my two cents as well...

RMA the drive. You have a drive which is showing infant mortality, where the drive dies early in its life. This does happen. As for the drive you get from WD in trade, it will likely be a refurbed unit; however, it will still at least be covered by the warranty period, so if it were to fail as well, you just RMA it again. Having to think about that isn't ideal, but drives do fail. Additionally, it appears you might be sleeping the drive, based on IDs 4 and 12 (Start_Stop_Count and Power_Cycle_Count). Those could be from early testing you did, but if you are sleeping the drives, I would recommend against that.

On a different topic, you have some weird SMART testing schedule set up; maybe you should check into that. What I'm seeing is short tests running 2 hours apart on some days, and back-to-back long tests. It doesn't make sense to me at all. I'd recommend a short test once a day and a long test once a week, but you may do it however you like; at least you are running the tests.
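The schedule oddity can be seen by looking at the gaps between the LifeTime(hours) values in the self-test log. The sketch below is only an illustration: the `TESTS` list is typed in by hand from the log in the first post, `gaps` is a hypothetical helper, and it treats power-on hours as a stand-in for wall-clock time, which only holds while the drive stays powered on.

```python
# (test type, lifetime hours) copied from the self-test log in the first post.
TESTS = [
    ("Extended", 4167), ("Extended", 4144), ("Short", 3972), ("Short", 3970),
    ("Extended", 3839), ("Extended", 3791), ("Short", 3684), ("Short", 3682),
    ("Short", 3612), ("Short", 3610), ("Extended", 3455), ("Short", 3276),
    ("Short", 3274), ("Extended", 3144), ("Extended", 3048), ("Short", 2941),
    ("Short", 2939), ("Short", 2869), ("Short", 2867), ("Extended", 2713),
    ("Short", 2533),
]


def gaps(tests, kind):
    """Hours between consecutive tests of one kind, oldest first."""
    hours = sorted(h for k, h in tests if k == kind)
    return [later - earlier for earlier, later in zip(hours, hours[1:])]


print("Short gaps:   ", gaps(TESTS, "Short"))
print("Extended gaps:", gaps(TESTS, "Extended"))
```

On this log, six of the twelve short-test gaps are exactly 2 hours, and the closest pair of long tests is 23 hours apart (the two failing runs).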

BlueMagician · Mar 16, 2016

Thank you @joeschmuck and @DrKK.

@joeschmuck - I have all my drives set to never sleep in FreeNAS. It's likely that what you're seeing are counts from the first 6 months of the drive's life, where it did duty as a standalone drive in a desktop tower - before I built the NAS.

Also, your other catch regarding the test schedules:

The Short test timings I spotted myself last night - an extra box was mistakenly selected in the 'hours' section on the GUI scheduler.

The Long test looking as if it's back to back, is probably because I did run two long tests back to back: one generated the initial error, and one was me confirming it.

I think I've got it right now, but will triple check later today anyway. Thank you for the catch!

The advice on the main issue is much appreciated. I wanted to check if it wasn't too serious - and I'm glad now that I didn't ignore it.

I'll start the RMA process, although I don't really like the idea of ending up with a refurb'd drive that could have had a harder life than this one! I guess I have to trust WD to not be complete asshats.

Kind regards,

Edit: typos

joeschmuck · Mar 16, 2016

WD is pretty good, which is why we recommend them. No one likes to RMA a fairly new drive, well, any drive. In the old days you sometimes got a drive that would not last long and you had to RMA it again. Things are actually better these days. The last time I RMA'd a WD drive I did the Advance Replacement (I think that's what it's called), where I had them ship me a drive immediately, used the box the new drive came in to send the old one back, and WD paid for return shipping. Not sure if they still do that, and since all my drives are out of warranty now, I don't see myself doing that anytime soon.

BlueMagician · Mar 16, 2016

Phoned the WD Red Hotline, and after a few minutes of swapping details I got bounced to an alternative number for EU.

Anywho, I logged the RMA on the grounds that the drive is repeatedly failing an extended SMART test - no questions asked - advance replacement on its way. No guarantee that it will be a new unit though, which does still irk me - because I'd like to know the history on the drive that I'm about to put in my system... hmm.

I suppose that is the way of things. Will keep you posted.. Thanks to all again,

BlueMagician · Mar 18, 2016

I've just had confirmation from a WD rep, that the replacement drive I receive will definitely be a refurb.

As I said in my previous post, I'll just have to suck it up...

...but it still grates on me that I could potentially be putting a drive into my array that's had triple the 'power-on' time of the drive I'm removing, and had all its stats simply reset at the factory to hide any previous failings.

Maybe it's just my paranoia, but I do feel like I'm being fobbed off with a car that's had its mileage clocked and a quick £10 supermarket valet job, to make it look nice.

I'm grateful that the process has been so painless thus far, but I still don't see how it's acceptable to replace a product of this class and intended use, with a second hand part.

Sorry. I'll go and stick my head in a bucket of water now.

joeschmuck · Mar 18, 2016

It's my hope that WD isn't sending out refurbed drives with high hour counts on the drive mechanical components. I would think that they replace any parts which are prone to wear like this. So when they receive your failed drive, it has only a small amount of hours on it. I could see them replacing the platters and heads, formatting it and tossing it to the next person. I'd also hope that they do some testing to verify that the motor is within specs for bearing wear. Also keep in mind that many refurb units could be just ones where someone "thinks" they had a problem and swapped it out or someone purchased a boxed unit and didn't like it and returned it to the vendor.

You can additionally run some hard drive tests on it, like badblocks, to verify it is sound. Also check the SMART data; odds are it's been cleared. Either way, run the extended test on it periodically.

BlueMagician · Mar 18, 2016

Agreed @joeschmuck.

I'd like to think that everything you've said is true, and I have no reason to doubt their quality control.

It's just the principle of it that doesn't sit right - especially in this bracket of product. Maybe I'm too fussy.

I'll test the hell out of it when it gets here, and for what it's worth I'll check its stat history - but I'd be amazed if they've not been reset.

Anywho, thanks again to all. I shall sit here patiently for my replacement to arrive. Eventually...

Regards and happy weekend!

joeschmuck · Mar 18, 2016

I felt the same way the first time I had to RMA a new hard drive and found out it was not brand spanking new. I think that drive was only a few months of age and when the "new drive" arrived it had a refurb label on it. I called the hard drive company and voiced my concerns and how irritated I was to get a refurb drive. They didn't care and only stated that they will replace the drive again if needed during the original warranty period. Yes, I was unhappy too and you are not alone. That was decades ago.

Hope that "new" drive works much longer than the original.

DrKK · Mar 18, 2016

You know, typically speaking, Western Digital has really brought an A game to the table recently, especially with respect to NAS drive offerings. Reds don't frequently have to be RMA'd, and when they do, I am quite sure we'd be hearing Holy Hell here in the FreeNAS community if the procedures and/or quality of the drives sent as replacements were not up to snuff. I think this is a case where there is no need to get upset, and you can reasonably hope for, and expect, adequate product. I'd take a refurbed RMA replacement WD red over a brand spanking new Seagate, any time of the day.

I'm just saying that getting upset here is not only pointless, it's probably not even warranted.

joeschmuck · Mar 18, 2016

DrKK said:
I'd take a refurbed RMA replacement WD red over a brand spanking new Seagate, any time of the day.

Amen

BlueMagician · Mar 18, 2016

DrKK said:
I'm just saying that getting upset here is not only pointless, it's probably not even warranted.

Not getting upset really, just keeping the thread updated whilst pointing out how I think the process could be better for a product in this class-bracket.

For example, there's no way I'd accept a refurb part to put into a SAN shelf at work - so I guess it's just a case of adapting to how a company treats its SOHO customers versus its Corporates...

Thanks again to all,

BlueMagician · Mar 26, 2016

So the advance replacement drive finally arrived. Got it installed yesterday, and it finished resilvering last night. So I ran a scrub, and left it running overnight.
It's still running, so the drive has been in a constant state of access.

Potential problem...

During the first few minutes of the resilver, I checked the SMART logs with smartctl.

All counters were 0, except the ones you'd expect after one power-up event, which were at 1.

Fine. But...

I checked smartctl stats again this morning, and after 20 hours of power on, the LoadCycleCount is already at 3.

This drive, seemingly running the latest firmware, has been accessing continuously - but has parked 3 times already?!

Any thoughts appreciated... thanks!

joeschmuck · Mar 26, 2016

Let the long test finish up and then capture all the smart data (entire output of smartctl -a), post it here and if it looks like something's up, we will let you know. Next periodically check the values and if the load cycle count goes up by maybe 5 or more, post another output of the smart data.
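One way to do the periodic check is to pull the raw Load_Cycle_Count out of the attribute table and compare it with the previous reading. The sketch below is a hedged illustration, not FreeNAS functionality: `raw_value` and `lcc_needs_attention` are made-up helper names, the sample rows are copied from the first post, and the threshold of 5 is simply the figure suggested above. It assumes the attribute text has already been captured, for example from `smartctl -A`.

```python
# Two attribute rows copied from the smartctl output in the first post.
SAMPLE = """\
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  428
193 Load_Cycle_Count  0x0032  200  200  000  Old_age  Always  -  1582
"""


def raw_value(smart_text, attr_id):
    """Return the RAW_VALUE of one attribute row, or None if it is absent."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the tenth is RAW_VALUE.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return int(fields[9])
    return None


def lcc_needs_attention(previous, current, threshold=5):
    """True when the count rose by at least `threshold` between two checks."""
    return current - previous >= threshold


print(raw_value(SAMPLE, 193))
print(lcc_needs_attention(3, 4))
```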

If your head parking timer is set to 8 seconds (could be) then you may have a large increase in this counter and therefore you should run the WDIDLE3 tool and set the timer to 300 seconds. Here is the link to the page you would need:
https://forums.freenas.org/index.php?threads/hacking-wd-greens-and-reds-with-wdidle3-exe.18171/

Bidule0hm · Mar 26, 2016

BlueMagician said:
I checked smartctl stats again this morning, and after 20 hours of power on, the LoadCycleCount is already at 3.

3 in 20 hours isn't a problem at all, the drive will die because of something else well before you will have a too high LCC ;)
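To put rough numbers on that, the sketch below projects the observed rate over a year and compares it with a drive constantly parking on an 8-second timer. Everything here is back-of-envelope: the 600,000 load/unload rating is an assumption to be checked against the datasheet for the exact model, the function names are illustrative, and the worst case assumes the heads park after every single 8-second window, which a real workload would not do.

```python
HOURS_PER_YEAR = 24 * 365

# Assumed rating; check the datasheet for your exact model.
ASSUMED_RATED_CYCLES = 600_000


def cycles_per_year(cycles, hours):
    """Project a load-cycle count seen over `hours` to a full year."""
    return cycles * HOURS_PER_YEAR / hours


def years_to_rating(cycles, hours, rated=ASSUMED_RATED_CYCLES):
    """Years until the assumed rating is reached at the observed rate."""
    return rated / cycles_per_year(cycles, hours)


# Observed in the thread: 3 load cycles in 20 power-on hours.
observed = cycles_per_year(3, 20)

# Hypothetical worst case: one park every 8 seconds, around the clock.
worst_case = cycles_per_year(3600 // 8, 1)

print(f"observed:   {observed:,.0f} cycles/year, "
      f"{years_to_rating(3, 20):,.0f} years to the assumed rating")
print(f"worst case: {worst_case:,.0f} cycles/year, "
      f"{years_to_rating(3600 // 8, 1) * 365:,.0f} days to the assumed rating")
```

The gap between the two cases is several orders of magnitude, which is why a handful of cycles per day is not a concern while an aggressive park timer can be.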

BlueMagician · Mar 26, 2016

Bidule0hm said:
3 in 20 hours isn't a problem at all, the drive will die because of something else well before you will have a too high LCC ;)

Indeed. I realise it's not 100 in an hour.

It's more the fact that the figure went up at all. After all, the system has been busy 99.9% of the time since the drive went in.

It was resilvering, then scrubbing. When would it have time to park?

If it were set to 8 seconds, it should be racking-up real quick.

If it were already set to 300, then it should never have had the chance to increase at all.

It's probably just my paranoia, but figured to check with the experts before a 'new' drive got un-necessarily abused due to mis-config or firmware setting.

For what it's worth - the drive is a new revision, but the same firmware as my other drives.

Many thanks,

joeschmuck · Mar 26, 2016

BlueMagician said:
If it were already set to 300, then it should never have had the chance to increase at all.

Well it depends on what was happening. There could have been a few instances where the drive was idle longer than you thought.

BlueMagician · Mar 26, 2016

So I realised that at least one of my other drives' LCC increased by 1 in the same timeframe.

As unfeasible as it seems, there must have been at least one or two 5 minute periods of inactivity during last night's resilver and scrub!

In any case, I've now changed the SMART temperature check interval to 4 minutes.

I assume that this check/request resets the 'idle' timer - thus meaning at 240 second intervals, the 300 second park threshold will never be reached.

But that is quite an assumption, and may make no difference at all ;)

Whatever the real reason, the LCC has not increased again in the last 5 hours - so I can probably chill.

And for the record, the replacement drive I received from WD (Poland to UK!), came sealed in a fresh anti-stat rip-bag, and all its SMART values set to zero.. so it may well be a new drive or it may well be a professional refurb. Guess we'll never know!

I'd like to thank @Bidule0hm, @DrKK and @joeschmuck again, for their patience.

joeschmuck · Mar 26, 2016

BlueMagician said:
In any case, I've now changed the SMART temperature check interval to 4 minutes.

I assume that this check/request resets the 'idle' timer - thus meaning at 240 second intervals, the 300 second park threshold will never be reached.

Not sure how you came to that conclusion. I would expect the drive electronics to just check the temperature; it doesn't mean the drive must access the platters to do this. If you do not want the drive heads to park, run WDIDLE3 and disable the timer. I disabled my timers shortly after I got the drives and it's been 3+ years without issue (knock on wood).

I'd actually want the temperature checks to be over a longer period of time so that it can check to see if the drive temperature is increasing a lot over a period of time. But to be honest, I've never seen an email stating I've exceeded my temperature threshold.
