4 years in, first failing drive? Weee!

Status
Not open for further replies.

sremick

Patron
Joined
Sep 24, 2014
Messages
323
So I've been running my FreeNAS box for a tad over 4 years. Looks like I might have my first failing drive. Unfortunately a changed email password resulted in me not getting automatic notifications, so I caught it by chance while in the GUI for another reason.

Code:
CRITICAL: Nov. 28, 2018, 4:02 p.m. - Device: /dev/ada2, 1 Currently unreadable (pending) sectors

Just 1 sector? Hmm.

So, please guide me so I don't fsck this up. Options:
A) "It's just 1 sector. Do _____________________ and you should be fine (unless you start seeing more bad sectors)"
B) "You probably should replace the drive, but you're probably fine to use the NAS in the meantime."
C) "SHUT THAT BABY OFF ASAP UNTIL YOUR NEW DRIVE ARRIVES!!!"

Fear not, I have tried to RTFA and I have this link on speeddial for when the time comes:
https://forums.freenas.org/index.php?resources/replacing-a-failed-failing-disk.75/

(this is not mission-critical, so that's why no cold spare)
 

sremick

Patron
Joined
Sep 24, 2014
Messages
323
Code:
# smartctl -a /dev/ada2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD30EFRX-68EUZN0
Serial Number:	xxxxxxxxxxxxxxxxxxxx
LU WWN Device Id: 5 0014ee 604fc6e6e
Firmware Version: 82.00A82
User Capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Wed Nov 28 23:45:27 2018 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)   Offline data collection activity
				   was never started.
				   Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0)   The previous self-test routine completed
				   without error or no self-test has ever
				   been run.
Total time to complete Offline
data collection:		(40680) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
				   Auto Offline data collection on/off support.
				   Suspend Offline collection upon new
				   command.
				   Offline surface scan supported.
				   Self-test supported.
				   Conveyance Self-test supported.
				   Selective Self-test supported.
SMART capabilities:			(0x0003)   Saves SMART data before entering
				   power-saving mode.
				   Supports SMART auto save timer.
Error logging capability:		(0x01)   Error logging supported.
				   General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 ( 408) minutes.
Conveyance self-test routine
recommended polling time:	 (   5) minutes.
SCT capabilities:		   (0x703d)   SCT Status supported.
				   SCT Error Recovery Control supported.
				   SCT Feature Control supported.
				   SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   10
  3 Spin_Up_Time			0x0027   178   176   021	Pre-fail  Always	   -	   6075
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   57
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   052   052   000	Old_age   Always	   -	   35499
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   57
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   41
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   372
194 Temperature_Celsius	 0x0022   117   102   000	Old_age   Always	   -	   33
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   1
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 14332		 -
# 2  Extended offline	Completed without error	   00%	 14244		 -
# 3  Short offline	   Completed without error	   00%	 14164		 -
# 4  Short offline	   Completed without error	   00%	 13996		 -
# 5  Extended offline	Completed without error	   00%	 13909		 -
# 6  Short offline	   Completed without error	   00%	 13828		 -
# 7  Short offline	   Completed without error	   00%	 13588		 -
# 8  Extended offline	Completed without error	   00%	 13501		 -
# 9  Short offline	   Completed without error	   00%	 13420		 -
#10  Short offline	   Completed without error	   00%	 13252		 -
#11  Extended offline	Completed without error	   00%	 13178		 -
#12  Short offline	   Completed without error	   00%	 13097		 -
#13  Short offline	   Completed without error	   00%	 12882		 -
#14  Extended offline	Completed without error	   00%	 12794		 -
#15  Short offline	   Completed without error	   00%	 12714		 -
#16  Short offline	   Completed without error	   00%	 12546		 -
#17  Extended offline	Completed without error	   00%	 12459		 -
#18  Short offline	   Completed without error	   00%	 12378		 -
#19  Short offline	   Completed without error	   00%	 12138		 -
#20  Extended offline	Completed without error	   00%	 12051		 -
#21  Short offline	   Completed without error	   00%	 11971		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
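
For anyone who wants to pull the worrying numbers out of a dump like the one above without eyeballing it, here is a minimal Python sketch. The set of "watched" attributes and the function name are my own assumptions for illustration, not anything FreeNAS or smartmontools defines; it only reads text that `smartctl -a` has already printed.

```python
import re

# Attributes often treated as early-warning signs. This selection is an
# assumption of the sketch, not an official list.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# One row of the attribute table:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def nonzero_watched(smartctl_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in smartctl_text.splitlines():
        match = ROW.match(line)
        if match and match.group(2) in WATCHED and int(match.group(3)) > 0:
            found[match.group(2)] = int(match.group(3))
    return found


# Three rows copied from the ada2 output above.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

print(nonzero_watched(sample))
```

On the ada2 rows this reports only Current_Pending_Sector with a raw value of 1, which matches the alert in the GUI.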



 

sremick

Patron
Joined
Sep 24, 2014
Messages
323
I just noticed that the SMART tests weren't running. :( This was mysterious and frustrating, since I know I read about doing this and clearly remember creating the tasks.
So I go and look... sure enough, the tasks are there... but I now see a UI thing I didn't notice before: although it lists all the disks, you actually have to ctrl-click on them to select the disks you want that scheduled task to run on. Doh! That wasn't really obvious (I saw them displayed and just thought it ran on the displayed disks). Suffice it to say I've remedied this.
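
A gap like this is visible in the SMART output itself: Power_On_Hours keeps counting, while the newest entry in the self-test log stays put. Below is a rough Python sketch of that comparison. The function name and the row regex are assumptions written against the log layout shown in this thread, and the sketch ignores the fact that some drives wrap the self-test lifetime counter.

```python
import re

# One self-test log row, e.g.
# "# 1  Short offline   Completed without error   00%   14332   -"
LOG_ROW = re.compile(r"^#\s*\d+\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def hours_since_last_test(power_on_hours, selftest_log_text):
    """Hours between the newest logged self-test and the current power-on hours.

    Returns None when the log contains no recognizable rows.
    """
    lifetimes = []
    for line in selftest_log_text.splitlines():
        match = LOG_ROW.match(line)
        if match:
            lifetimes.append(int(match.group(3)))
    if not lifetimes:
        return None
    return power_on_hours - max(lifetimes)


# The two newest rows from the ada2 self-test log, plus its Power_On_Hours.
ada2_log = """\
# 1  Short offline       Completed without error       00%     14332         -
# 2  Extended offline    Completed without error       00%     14244         -
"""

gap = hours_since_last_test(35499, ada2_log)
print(gap, "hours without a self-test, about", round(gap / 24), "days")
```

For ada2 that is 35499 - 14332 = 21167 hours with no self-test, which is how long the scheduled tasks had been quietly doing nothing.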
 

sremick

Patron
Joined
Sep 24, 2014
Messages
323
So I took the opportunity to manually run short and long tests on all 6 drives. I happened to notice the following on the console:
Code:
Device: /dev/ada3, Self-Test Log error count increased from 15 to 16

Another drive? :(

Code:
# smartctl -a /dev/ada3
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD30EFRX-68EUZN0
Serial Number:	xxxxxxxxxxxxxxxxxxxxxxxx
LU WWN Device Id: 5 0014ee 604f7808b
Firmware Version: 82.00A82
User Capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Thu Nov 29 00:30:47 2018 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)   Offline data collection activity
				   was never started.
				   Auto Offline Data Collection: Disabled.
Self-test execution status:	  ( 249)   Self-test routine in progress...
				   90% of test remaining.
Total time to complete Offline
data collection:		(40560) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
				   Auto Offline data collection on/off support.
				   Suspend Offline collection upon new
				   command.
				   Offline surface scan supported.
				   Self-test supported.
				   Conveyance Self-test supported.
				   Selective Self-test supported.
SMART capabilities:			(0x0003)   Saves SMART data before entering
				   power-saving mode.
				   Supports SMART auto save timer.
Error logging capability:		(0x01)   Error logging supported.
				   General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 ( 407) minutes.
Conveyance self-test routine
recommended polling time:	 (   5) minutes.
SCT capabilities:		   (0x703d)   SCT Status supported.
				   SCT Error Recovery Control supported.
				   SCT Feature Control supported.
				   SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   181   179   021	Pre-fail  Always	   -	   5933
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   59
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   052   052   000	Old_age   Always	   -	   35500
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   59
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   42
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   352
194 Temperature_Celsius	 0x0022   113   101   000	Old_age   Always	   -	   37
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed: read failure	   50%	 35500		 1565546408
# 2  Short offline	   Completed: read failure	   60%	 14332		 1565546408
# 3  Extended offline	Completed: read failure	   10%	 14244		 1565546408
# 4  Short offline	   Completed: read failure	   60%	 14164		 1565546408
# 5  Short offline	   Completed: read failure	   60%	 13996		 1565546408
# 6  Extended offline	Completed: read failure	   10%	 13909		 1565546408
# 7  Short offline	   Completed: read failure	   60%	 13828		 1565546408
# 8  Short offline	   Completed: read failure	   60%	 13588		 1565546408
# 9  Extended offline	Completed: read failure	   10%	 13501		 1565546408
#10  Short offline	   Completed: read failure	   60%	 13420		 1565546408
#11  Short offline	   Completed: read failure	   60%	 13252		 1565546408
#12  Extended offline	Completed: read failure	   10%	 13178		 1565546408
#13  Short offline	   Completed: read failure	   60%	 13097		 1565546408
#14  Short offline	   Completed: read failure	   60%	 12882		 1565546408
#15  Extended offline	Completed: read failure	   10%	 12794		 1565546408
#16  Short offline	   Completed: read failure	   10%	 12714		 1565546408
#17  Short offline	   Completed without error	   00%	 12546		 -
#18  Extended offline	Completed without error	   00%	 12459		 -
#19  Short offline	   Completed without error	   00%	 12378		 -
#20  Short offline	   Completed without error	   00%	 12138		 -
#21  Extended offline	Completed without error	   00%	 12051		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
It seems your ada3 has been failing smart tests since its 12714 hour anniversary.... and it's now over 35k hours! :tongue:
I'd be more concerned about this.
I'd advise replacing this drive. You could also try wiping the drive and running a long test again to see whether it was an unreadable sector and whether it has been remapped... (that's what I did on my backup system but... it's backup, so less critical).
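
To show how the "failing since 12714 hours" reading falls out of the log, here is a small Python sketch. The function name and regex are assumptions made for illustration, and the byte-offset step assumes the reported LBA is in 512-byte logical sectors, as the drive's "Sector Sizes" line suggests.

```python
import re

# One self-test log row: description + status, remaining %, lifetime hours, LBA.
LOG_ROW = re.compile(r"^#\s*\d+\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def summarize_failures(selftest_log_text):
    """Return (lifetime hours of earliest failed test, set of failing LBAs).

    Returns (None, set()) when no row reports a read failure.
    """
    first_hours = None
    lbas = set()
    for line in selftest_log_text.splitlines():
        match = LOG_ROW.match(line)
        if not match or "read failure" not in match.group(1):
            continue
        hours = int(match.group(3))
        if first_hours is None or hours < first_hours:
            first_hours = hours
        if match.group(4).isdigit():
            lbas.add(int(match.group(4)))
    return first_hours, lbas


# Rows 1, 16 and 17 from the ada3 self-test log.
ada3_log = """\
# 1  Short offline       Completed: read failure       50%     35500         1565546408
#16  Short offline       Completed: read failure       10%     12714         1565546408
#17  Short offline       Completed without error       00%     12546         -
"""

first_hours, lbas = summarize_failures(ada3_log)
LOGICAL_SECTOR_BYTES = 512  # from "Sector Sizes: 512 bytes logical"
for lba in sorted(lbas):
    print("first failure at", first_hours, "hours; LBA", lba,
          "is byte offset", lba * LOGICAL_SECTOR_BYTES)
```

Every failed test stops at the same LBA, so this looks like one stubborn bad spot roughly 800 GB into the disk rather than damage spreading across the surface.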

For ada2, depending on your risk tolerance and on the importance of the data, you could wait and see whether the count of currently unreadable (pending) sectors rises. But I would monitor that closely.
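
"Monitor closely" can be as simple as writing down the Current_Pending_Sector raw value now and then and comparing. A minimal Python sketch of that comparison follows. The function name, the labels it returns and the sample readings are all hypothetical; only the starting value of 1 comes from this thread.

```python
def pending_trend(readings):
    """Classify a chronological list of Current_Pending_Sector raw values."""
    if len(readings) < 2:
        return "insufficient data"
    first, last = readings[0], readings[-1]
    if last > first:
        return "rising"
    if last < first:
        # A drop to zero usually means the sector was rewritten or remapped.
        return "cleared" if last == 0 else "falling"
    return "stable"


# Hypothetical histories, each starting from the single pending sector on ada2.
print(pending_trend([1, 1, 1]))
print(pending_trend([1, 3, 8]))
print(pending_trend([1, 1, 0]))
```

A "rising" result is the case where waiting stops being a reasonable option.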

Let's see also what other forum members advise...

But keep in mind, I suppose you have a RAIDz2 pool, and now 2 disks have issues (they are not failing, but those are indicators) while you have two disks of redundancy... so, still ok but not the most comfortable! :smile:
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Fear not, I have tried to RTFA and I have this link on speeddial for when the time comes:
https://forums.freenas.org/index.php?resources/replacing-a-failed-failing-disk.75/
So, that is how you replace a disk, here is how you troubleshoot a disk:

Hard Drive Troubleshooting Guide (All Versions of FreeNAS)
https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/

You might also want to use some of these scripts to improve the monitoring of your system:

Github repository for FreeNAS scripts, including disk burnin
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/

Typically, drives fail most frequently at between 4 and 5 years of wear. Around the beginning of this year I had to replace all the drives in both of my FreeNAS systems because they were all over 5 years of age and I was losing a drive or two every month. So I bought a case of drives:
20180813_190918.jpg
I have 12 drives in each of my two home FreeNAS systems... Drives don't last forever and it is time for you to start preparing.
If you have any questions, please ask, someone here will be happy to help.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
PS. Your ada3 drive needs to be changed ASAP and I would change ada2 also. These drives are old enough that they are ticking time bombs.
 

Skro

Contributor
Joined
Jun 26, 2018
Messages
100
So I bought a case of drives:

I don't want to take this off topic, but I have to ask because I've been thinking about adding/changing some of my drives: what drives did you buy by the case? I've seen you suggest some used drives over new, but I've been unable to find those discussions in the forum, so I gotta know what drives you suggest personally/professionally for FN?
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
so I gotta know what drives you suggest personally/professionally for FN?
Click on Chris's "Show:Emily-NAS" button...

And, yes, you are hijacking this thread and should best have created your own in Storage (this forum).
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I've seen you suggest some used drives over new
I do suggest used SSDs for boot drives but for storage drives I recommend new. I have done the used drive thing for my personal storage, but those drives have a higher failure rate and I don't suggest that unless you really can't afford new. If I were buying drives and I couldn't afford new, I would probably go with something like this:
https://www.ebay.com/itm/HGST-Ultra...TA-6Gb-s-3-5-64MB-HDD-hard-drive/223016135412
 

Skro

Contributor
Joined
Jun 26, 2018
Messages
100
Click on Chris's "Show:Emily-NAS" button...

And, yes, you are hijacking this thread and should best have created your own in Storage (this forum).

My apologies, I really did believe the information might be relevant to the OP, but I will start my own thread.
 

Skro

Contributor
Joined
Jun 26, 2018
Messages
100
I do suggest used SSDs for boot drives but for storage drives I recommend new. I have done the used drive thing for my personal storage, but those drives have a higher failure rate and I don't suggest that unless you really can't afford new. If I were buying drives and I couldn't afford new, I would probably go with something like this:
https://www.ebay.com/itm/HGST-Ultra...TA-6Gb-s-3-5-64MB-HDD-hard-drive/223016135412

Thank you, Chris! I appreciate your response.
 

sremick

Patron
Joined
Sep 24, 2014
Messages
323
Well I've got one new drive in to replace ada3. Currently doing a burn-in and test on a separate computer. My intention was always to replace the 3TB drives with larger as they failed, so that eventually when I replaced the 6th drive poof more storage, so this burn-in might take a while (8TB).
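
The reason the extra space only shows up after the last swap is that a RAIDZ vdev is sized by its smallest member. Here is a minimal Python sketch of that arithmetic. It assumes a single 6-disk RAIDZ2 vdev (suggested earlier in the thread but not confirmed), the function name is made up, and it ignores ZFS metadata and padding overhead, so treat the numbers as rough raw figures.

```python
def raidz_raw_capacity_tb(disk_sizes_tb, parity):
    """Rough data capacity of one RAIDZ vdev: (disks - parity) * smallest disk."""
    return (len(disk_sizes_tb) - parity) * min(disk_sizes_tb)


PARITY = 2  # RAIDZ2, an assumption about this pool

before = raidz_raw_capacity_tb([3, 3, 3, 3, 3, 3], PARITY)
one_swapped = raidz_raw_capacity_tb([8, 3, 3, 3, 3, 3], PARITY)
all_swapped = raidz_raw_capacity_tb([8, 8, 8, 8, 8, 8], PARITY)

print(before, one_swapped, all_swapped)
```

The middle figure stays the same as the first: one 8TB drive among 3TB drives contributes only 3TB until the remaining five are replaced.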
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Well I've got one new drive in to replace ada3. Currently doing a burn-in and test on a separate computer. My intention was always to replace the 3TB drives with larger as they failed, so that eventually when I replaced the 6th drive poof more storage, so this burn-in might take a while (8TB).
I recently did a burn-in on 6 TB drives and I think it was three days.
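
For a rough sense of why these burn-ins run for days, here is a back-of-the-envelope Python sketch. It assumes the burn-in is a badblocks write-mode test with the default four patterns (each written and then read back, so eight full passes over the disk) and a hypothetical average throughput of 150 MB/s; real drives slow down toward the inner tracks, and SMART tests add time on top.

```python
SECONDS_PER_DAY = 86400


def badblocks_write_test_days(capacity_bytes, avg_bytes_per_sec, patterns=4):
    """Estimate days for a destructive badblocks run: one write pass plus one
    read pass per pattern, each covering the whole disk."""
    passes = patterns * 2
    seconds = capacity_bytes * passes / avg_bytes_per_sec
    return seconds / SECONDS_PER_DAY


ASSUMED_THROUGHPUT = 150e6  # bytes per second, a guess for illustration only

for size_tb in (6, 8):
    days = badblocks_write_test_days(size_tb * 1e12, ASSUMED_THROUGHPUT)
    print(f"{size_tb} TB: about {days:.1f} days")
```

Under those assumptions a 6TB drive comes out a little under four days and an 8TB drive close to five, so multi-day burn-ins are what you should expect.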
 

CraigD

Patron
Joined
Mar 8, 2016
Messages
343
Well I've got one new drive in to replace ada3. Currently doing a burn-in and test on a separate computer. My intention was always to replace the 3TB drives with larger as they failed, so that eventually when I replaced the 6th drive poof more storage, so this burn-in might take a while (8TB).

I just burnt in an 8TB Ironwolf with badblocks; it took 5 days

Code:
SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	   214		 -
# 2  Extended offline	Completed without error	   00%	   118		 -
# 3  Conveyance offline  Completed without error	   00%		 0		 -
# 4  Short offline	   Completed without error	   00%		 0		 -
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
I just burnt in an 8TB Ironwolf with badblocks; it took 5 days

Code:
SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	   214		 -
# 2  Extended offline	Completed without error	   00%	   118		 -
# 3  Conveyance offline  Completed without error	   00%		 0		 -
# 4  Short offline	   Completed without error	   00%		 0		 -
Code:
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	   123		 -
# 2  Short offline	   Completed without error	   00%	   110		 -
# 3  Extended offline	Completed without error	   00%		29		 -
# 4  Short offline	   Completed without error	   00%		16		 -
# 5  Short offline	   Completed without error	   00%		12		 -
# 6  Short offline	   Completed without error	   00%		 8		 -
# 7  Short offline	   Completed without error	   00%		 0		 -


~5 days to test 5 x 6TB drives. Good times were had, given there were a couple of power failures that lasted longer than my UPS did right at the start - hence the 4 consecutive SHORT tests.

These big drives are really fun...
 

sremick

Patron
Joined
Sep 24, 2014
Messages
323
Just an update that the burn-in and replacement process of ada3 went fine.
Main issue now is the alternate screw spacing on the high-capacity drives that does not line up with the 4 holes my Node 304 uses. :mad:
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
[QUOTE="sremick, post: 498171, member: 43590"]Main issue now is the alternate screw spacing on the high-capacity drives that does not line up with the 4 holes my Node 304 uses. :mad:[/QUOTE]
That is pretty terrible, but you might be able to get replacement drive brackets from the manufacturer. I remember some other people running into that problem.
 

sremick

Patron
Joined
Sep 24, 2014
Messages
323
That is pretty terrible, but you might be able to get replacement drive brackets from the manufacturer. I remember some other people running into that problem.
Haven't heard back from them yet, but from what I've read they don't sell just the brackets. There also used to be an adapter/shim they sold on the website but it's out of stock with no ETA. :mad:

The new Node 304s come with modified drive brackets with a 5th hole. But they won't sell just the brackets to all the previous owners? :mad: I'm seriously considering buying a new case, taking the brackets, and then reselling it on eBay w/ my original brackets (being up-front about it all of course). Whatever loss I take is effectively my cost for the brackets.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
The new Node 304s come with modified drive brackets with a 5th hole. But they won't sell just the brackets to all the previous owners? :mad: I'm seriously considering buying a new case, taking the brackets, and then reselling it on eBay w/ my original brackets (being up-front about it all of course). Whatever loss I take is effectively my cost for the brackets.
I looked into it a little and I don't think the brackets are interchangeable. You would probably need to move your build into the new case and sell the entire old case.
 