SOLVED Dying drive, I think

Status
Not open for further replies.

aussiejuggalo

Explorer
Joined
Apr 26, 2016
Messages
50
Started having some problems about 5 months ago with a drive but was too busy / lazy to deal with it until I noticed 2 LEDs on my HDD bays were off and only flashed red instead of solid blue. One drive I know has problems, but I can't seem to hunt down the other drive in FreeNAS atm (unless the bay is just being stupid).

Drive da2 started having S.M.A.R.T. errors; I didn't think too much of it at the time, but after finally getting around to running a long self-test (I still haven't set up auto tests, I know, stupid) it's coming back with "2 Currently unreadable (pending) sectors" & "Self-Test Log error count increased from 4 to 5". I've got 2 spare drives ready to go in if I have to replace it, so that's not a problem.

So what do you think, does it need to be replaced?

Code:
=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD40EFRX-68WT0N0
Serial Number:	nope :)
LU WWN Device Id: 5 0014ee 2b72346c6
Firmware Version: 82.00A82
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sun Jul 16 09:11:26 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  ( 121) The previous self-test completed having
										the read element of the test failed.
Total time to complete Offline
data collection:				(51840) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 518) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x703d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   248
  3 Spin_Up_Time			0x0027   182   181   021	Pre-fail  Always	   -	   7900
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   28
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   083   083   000	Old_age   Always	   -	   12498
10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   28
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   26
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   70
194 Temperature_Celsius	 0x0022   116   106   000	Old_age   Always	   -	   36
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   100   253   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	 12475		 10837785
# 2  Short offline	   Completed: read failure	   30%	 11185		 10834384
# 3  Short offline	   Completed: read failure	   20%	 10898		 10837712
# 4  Short offline	   Completed: read failure	   30%	 10777		 10834648
# 5  Short offline	   Completed: read failure	   30%	 10732		 10834648
# 6  Short offline	   Completed without error	   00%	  2434		 -
# 7  Short offline	   Completed without error	   00%	  2411		 -
# 8  Short offline	   Completed without error	   00%	  2387		 -
# 9  Short offline	   Completed without error	   00%	  2363		 -
#10  Short offline	   Completed without error	   00%	  2339		 -
#11  Short offline	   Completed without error	   00%	  2315		 -
#12  Short offline	   Completed without error	   00%	  2291		 -
#13  Short offline	   Completed without error	   00%	  2267		 -
#14  Short offline	   Completed without error	   00%	  2243		 -
#15  Short offline	   Completed without error	   00%	  2219		 -
#16  Short offline	   Completed without error	   00%	  2195		 -
#17  Short offline	   Completed without error	   00%	  2171		 -
#18  Short offline	   Completed without error	   00%	  2147		 -
#19  Short offline	   Completed without error	   00%	  2123		 -
#20  Short offline	   Completed without error	   00%	  2099		 -
#21  Short offline	   Completed without error	   00%	  2075		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Quick question on S.M.A.R.T. tests: can I run long tests on all the drives at once? Same question about short tests.

Thanks :).
 
Last edited:

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
So what do you think, does it need to be replaced?
Yes, that drive has been having read errors for the last 453 hours of its life :-( (actually it's more)
Code:
SMART Self-test log structure revision number 1
Num Test_Description	Status				 Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline	Completed: read failure	 90%	 12475		 10837785
# 2 Short offline	 Completed: read failure	 30%	 11185		 10834384
# 3 Short offline	 Completed: read failure	 20%	 10898		 10837712
# 4 Short offline	 Completed: read failure	 30%	 10777		 10834648
# 5 Short offline	 Completed: read failure	 30%	 10732		 10834648
# 6 Short offline	 Completed without error	 00%	 2434		 -

Quick question on S.M.A.R.T. tests: can I run long tests on all the drives at once? Same question about short tests.
Yes, log in via SSH and start the tests, one command per drive, as fast as you can type them.
I would not bother with the short tests though...
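A sketch of that (the device names da0–da7 are assumptions based on the poster's 8-drive setup; adjust to whatever your drives are actually called):

```shell
#!/bin/sh
# Sketch: kick off a long SMART test on every drive at once.
# Device names da0..da7 are assumed; adjust for your system.
# DRYRUN=echo prints the commands instead of running them; set
# DRYRUN= (empty) on a real system to actually start the tests.
DRYRUN=echo
for d in da0 da1 da2 da3 da4 da5 da6 da7; do
    $DRYRUN smartctl -t long /dev/$d
done
```

Each self-test runs inside the drive's own firmware, so starting them all at once is fine and they won't interfere with one another.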
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The disk should still be in warranty, so RMA it. Failing SMART tests are plenty of justification for that.
 

aussiejuggalo

Explorer
Joined
Apr 26, 2016
Messages
50
Yes, that drive has been having read errors for the last 453 hours of its life :-( (actually it's more)
Code:
SMART Self-test log structure revision number 1
Num Test_Description	Status				 Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline	Completed: read failure	 90%	 12475		 10837785
# 2 Short offline	 Completed: read failure	 30%	 11185		 10834384
# 3 Short offline	 Completed: read failure	 20%	 10898		 10837712
# 4 Short offline	 Completed: read failure	 30%	 10777		 10834648
# 5 Short offline	 Completed: read failure	 30%	 10732		 10834648
# 6 Short offline	 Completed without error	 00%	 2434		 -


Yes, log in to your Secure Shell and start those tests as fast as you can type the commands for each drive,
I would not bother with the short tests though...

:oops: Didn't realise it'd been having errors for that long, oh well lol.

Why shouldn't I bother with short tests? I thought you use short tests every few days to a week then long every month.

The disk should still be in warranty, so RMA it. Failing SMART tests are plenty of justification for that.

Yeah it's still under warranty, only bought it at the start of last year so I'll RMA it.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Why shouldn't I bother with short tests? I thought you use short tests every few days to a week then long every month.

The long test's read/verify segment tests the entire disk no matter how long it takes; the short test is
performed over a much smaller segment and is limited by time.
 

aussiejuggalo

Explorer
Joined
Apr 26, 2016
Messages
50
The long test's read/verify segment tests the entire disk no matter how long it takes; the short test is
performed over a much smaller segment and is limited by time.

Ah ok, so how should I set the long tests up, once a month or every 2 weeks, etc.?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
how should I set the long tests up, once a month or every 2 weeks etc?
For regular SMART tests, I run a short test daily and a long test weekly.
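In FreeNAS that schedule is set up through the GUI (Tasks → S.M.A.R.T. Tests), which drives smartd underneath. For reference, a sketch of the equivalent raw smartd.conf directive (the device name and times here are illustrative assumptions, not the poster's actual config):

```
# Sketch of an smartd.conf schedule line (device name and times assumed).
# -s takes a regex over T/MM/DD/d/HH: test type, month, day of month,
# day of week (1=Monday..7=Sunday), hour.
# Short (S) test daily at 02:00, long (L) test every Saturday at 03:00:
/dev/da0 -a -s (S/../.././02|L/../../6/03)
```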
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Ah ok, so how should I set the long tests up, once a month or every 2 weeks etc?
I was suggesting only doing the long tests for a more immediate evaluation of your drive's current condition.
Not as a matter of standard maintenance. Sorry for the confusion...
 

aussiejuggalo

Explorer
Joined
Apr 26, 2016
Messages
50
For regular SMART tests, I run a short test daily and a long test weekly.
I was suggesting only doing the long tests for a more immediate evaluation of your drive's current condition.
Not as a matter of standard maintenance. Sorry for the confusion...

Thanks, gonna try the short tests for this week and if they bug me I'll stop them.

I'll run long tests tonight on all the drives and see what they say before I replace any of them, hopefully no more than that 1 drive needs replacing.
 

aussiejuggalo

Explorer
Joined
Apr 26, 2016
Messages
50
So I've been waiting for the last couple of days for the long tests to run, but for some reason they're not running even though I scheduled them. I set them to run at 11 pm for the last couple of nights and they haven't run; the short tests run fine every day though. Posted a pic of my S.M.A.R.T. tasks tab to see if I screwed up anything.

Also 2 quick questions

1. I was told recently that I need to keep half my pool free for cron jobs etc. Is that true? My setup is 8x 4TB drives in RAIDZ2 in one pool, so... heaps of storage, and heaps unusable if that is true.

2. Is there a way to see current running S.M.A.R.T. jobs?

Thanks
 

Attachments

  • Untitled.png

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I was told recently that I need to keep half my pool free for cron jobs
Not even close to true. Either someone's very badly misinformed, or you're very badly misunderstanding what they're saying.

Edit: I don't see anything wrong, as such, with your SMART test setup, but it's much more complicated than it needs to be. A single entry for the short tests on all drives, and a second entry for the long tests on all the drives, is all you need. No need to have a separate entry for each drive.

Edit 2: Just kick off the SMART tests at the CLI for now: smartctl -t long /dev/da0, repeated for each drive. Once the estimated completion time has passed, check the results with smartctl -a /dev/da0 (again, repeated for each drive).
 
Last edited:

aussiejuggalo

Explorer
Joined
Apr 26, 2016
Messages
50
Not even close to true. Either someone's very badly misinformed, or you're very badly misunderstanding what they're saying.

Probably misunderstanding them, it was on another forum where I did a build log for my NAS, this is what they said:

Also to add you want approx 1/2 free drive space ( in your case 9 tb). For everything else you dont see (cron jobs ect) on freenas

What i will be doing is making multiple zpools. That way muy personal media can be on smaller safer hard drives ( by safer I mean quicker to copy vs large
Large hdds ) and my media on large ones

freenas likes 50% freespace !


Edit: I don't see anything wrong, as such, with your SMART test setup, but it's much more complicated than it needs to be. A single entry for the short tests on all drives, and a second entry for the long tests on all the drives, is all you need. No need to have a separate entry for each drive.

Edit 2: Just kick off the SMART tests at the CLI for now. smartctl -t long /dev/da0, repeat for each drive. After whenever it tells you to expect the test to be done, check the results with smartctl -a /dev/da0 (and again, repeat for each drive).

Oh, didn't realise you could select all drives at once, my bad :oops:.

Ah ok, I'll do that in a bit. I've got emails set up now too, so I should hopefully get them.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
this is what they said:
No, I'm going to say they're just wrong. Sure, any CoW filesystem (like ZFS) likes its free space, but we normally recommend at least 20% free (i.e., your pool not more than 80% full). There are use cases where we suggest more free space, mainly involving lots of VM storage and/or iSCSI use, but the general purpose recommendation is at least 20% free. And cron has nothing to do with anything.
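As a rough sketch of keeping an eye on that threshold from the shell (the pool name "tank" and the helper function are illustrative assumptions, not anything FreeNAS ships):

```shell
#!/bin/sh
# Sketch: warn when a pool passes the ~80%-full guideline.
# The threshold check itself is plain sh; on a live system the
# capacity comes from zpool list (see the commented lines below).
warn_if_full() {
    # $1 = capacity percentage as a bare integer (no % sign)
    if [ "$1" -gt 80 ]; then
        echo "pool is over 80% full - time to free space or add vdevs"
    else
        echo "pool usage OK"
    fi
}

# On a live system ("tank" is an assumed pool name):
#   cap=$(zpool list -H -o capacity tank | tr -d '%')
#   warn_if_full "$cap"
warn_if_full 72   # -> pool usage OK
```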
 

aussiejuggalo

Explorer
Joined
Apr 26, 2016
Messages
50
Ah ok, well I always have a rule with HDDs / SSDs anyway: no more than 90% used, even if it means buying a new drive.

Got around to replacing this drive today, resilvering atm. It threw me for a second though because it said "replace drive da7", then I realised either FreeNAS or the RAID card swapped the order of the drives; a quick check of the serials and I realised it was the right one. Now to play the waiting game.

Got one more question: is it better to have 1 big pool like I have (8x 4TB in RAIDZ2) or 2 smaller pools? I mean better as in reliability, resilvering, etc.

Thanks.

Edit, submitting the faulty drive for RMA, hope none of the others die in the mean time :eek:.
 
Last edited: