Bad Drive? Not sure how to read this info

Status
Not open for further replies.

Shay.ca

Dabbler
Joined
Jan 8, 2015
Messages
44
Hello,

I logged into my FreeNAS box the other day and noticed the flashing red alert light in the top right corner.

The warning said:

CRITICAL: Nov. 9, 2016, 2:30 p.m. - Device: /dev/ada3, 1 Currently unreadable (pending) sectors
CRITICAL: Nov. 9, 2016, 3:30 p.m. - Device: /dev/ada3, Self-Test Log error count increased from 0 to 1


I did some online searching and came across this website, which describes the same issue I have:

http://bytesandbolts.com/fixing-freenas-error-currently-unreadable-pending-sectors/

I ran the SMART long self-test using the command:

Code:
smartctl -t long /dev/adaX


It took a bunch of hours to complete. Once it completed, I ran this command to view the results:

Code:
smartctl -a /dev/adaX 


The website says to get the sector size and location from LBA_of_first_error.

These are the results:

Code:
recommended polling time:		(   2) minutes.																					
Extended self-test routine																										
recommended polling time:		( 404) minutes.																					
Conveyance self-test routine																										
recommended polling time:		(   5) minutes.																					
SCT capabilities:			  (0x703d) SCT Status supported.																	  
										SCT Error Recovery Control supported.													  
										SCT Feature Control supported.															
										SCT Data Table supported.																  
																																	
SMART Attributes Data Structure revision number: 16																				
Vendor Specific SMART Attributes with Thresholds:																				  
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE									
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   14										  
  3 Spin_Up_Time			0x0027   175   175   021	Pre-fail  Always	   -	   6241										
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   535										
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0											
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0											
  9 Power_On_Hours		  0x0032   082   082   000	Old_age   Always	   -	   13670										
10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0											
11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0											
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   48										  
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   14										  
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   1061										
194 Temperature_Celsius	 0x0022   115   100   000	Old_age   Always	   -	   35										  
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0											
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   1											
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0											
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0											
200 Multi_Zone_Error_Rate   0x0008   100   253   000	Old_age   Offline	  -	   0											
																																	
SMART Error Log Version: 1																										
No Errors Logged																													
																																	
SMART Self-test log structure revision number 1																					
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error									
# 1  Extended offline	Completed: read failure	   90%	 13506		 1570078680											
																																	
SMART Selective self-test log data structure revision number 1																	
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS																						
	1		0		0  Not_testing																								
	2		0		0  Not_testing																								
	3		0		0  Not_testing																								
	4		0		0  Not_testing																								
	5		0		0  Not_testing																								
Selective self-test flags (0x0):																									
  After scanning selected spans, do NOT read-scan remainder of disk.																
If Selective self-test is pending on power-up, resume after 0 minute delay.														 

I am not sure what the sector size is, but I believe the error was at LBA 1570078680.

Is this correct?
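
For anyone reading along, here is a minimal Python sketch of how the LBA can be pulled out of the self-test log. It is only an illustration, not an official tool: the function name first_error_lba is made up for this example, the sample row is the one from the output above with its spacing tidied, and the 512-byte logical sector size is an assumption (the real value is reported on the "Sector Size" line of smartctl -i).

```python
# Self-test log row from the smartctl output above (spacing tidied).
LOG_LINE = "# 1  Extended offline  Completed: read failure  90%  13506  1570078680"

def first_error_lba(line):
    """Return LBA_of_first_error from a self-test log row, or None.

    The last column holds a decimal LBA, or '-' when the test found no error.
    """
    stripped = line.strip()
    if not stripped.startswith("#"):
        return None  # not a self-test log row
    last = stripped.split()[-1]
    return int(last) if last.isdigit() else None

lba = first_error_lba(LOG_LINE)

# ASSUMPTION: 512-byte logical sectors; confirm with `smartctl -i`.
ASSUMED_SECTOR_BYTES = 512
byte_offset = lba * ASSUMED_SECTOR_BYTES

print("first failing LBA:", lba)
print("byte offset (if 512-byte sectors):", byte_offset)
```

So yes, 1570078680 is the LBA of the first read failure. The byte offset is only meaningful if the sector-size assumption holds for this drive.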


Can someone confirm or help me get this going?

Is the drive on its way out, and do I need to get it replaced ASAP? I am set up in RAIDZ2, so I am OK right now, but I would rather not lose another drive.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
It failed a SMART test; I'd RMA it if it's still under warranty. And make sure you get regular SMART tests set up on the new and existing drives before you end up with a dead server.
 

Shay.ca

Dabbler
Joined
Jan 8, 2015
Messages
44
Wow, my copy and paste failed; I'm not sure what happened to the code.
Failed a SMART test, I'd RMA it if it's under warranty still. And make sure you get regular smart tests setup on the new/existing drives before you end up with a dead server.


Is there an easy way to figure out which drive it is in the bay? I don't have the blinky lights to identify it. I can trace the SATA cables back, but I'm not sure how they are labeled, so I can't narrow it down to the bad drive.

I believe the drives are still under warranty.

Is there a link or tutorial for setting up the SMART tests? This is new to me; an old coworker recommended it for backing up family photos and stuff.

Thanks!
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
I did some online searching and came across this website
That site shows how to overwrite a block with some random data to force it to be mapped out to a spare. That might change the pending sector count, but it might also lose more data. Replace the disk and let ZFS resilver the correct data onto the new one.

Is there an easy way to figure out which drive it is in the bay?

The part of the SMART output that was not included shows the serial number of the drive. It can also be shown at the command line with diskinfo -v adaX.
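
As a rough illustration of the first option, the identity section printed by smartctl -i (and at the top of smartctl -a) contains a "Serial Number:" line. The Python sketch below picks that line out of captured text. The function name parse_serial is hypothetical, and the sample text uses placeholder values, not the real model or serial of the drive in this thread.

```python
import re

# ILLUSTRATIVE sample only: placeholder values, not from the actual drive.
SAMPLE_INFO = """\
Device Model:     EXAMPLE-MODEL
Serial Number:    EXAMPLE-SERIAL-0001
"""

def parse_serial(info_text):
    """Return the value on the 'Serial Number:' line, or None if absent."""
    match = re.search(r"^Serial Number:\s*(\S+)", info_text, re.MULTILINE)
    return match.group(1) if match else None

print(parse_serial(SAMPLE_INFO))
```

The serial number found this way can then be compared against the label printed on each physical drive.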
 

Shay.ca

Dabbler
Joined
Jan 8, 2015
Messages
44
Thanks,

I got the serial number using your command and have set up an advanced RMA with WD. Hopefully it gets here semi soon and I can get it out.

Is there a quick way to figure out which drive it is when I open the case, without removing every single drive? Or is the best way to just shut the server down, open the case, remove one drive at a time while checking the serial numbers, and then put it back together?


Once I get the new disk, I plug it in, and then what? Does the system automatically find it and rebuild, or are there commands I need to run?

Thanks everyone for their input and help!
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Identifying the drive by serial number is the best way.
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,478
Check out the manual; it has steps specifically for this task.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Yes, that's right. If you have a spare SATA port, you can plug in the replacement without removing the old one, do the replacement (no need to offline the old disk first), and then remove the old disk when it's finished. That way, you don't lose any redundancy.
 

Shay.ca

Dabbler
Joined
Jan 8, 2015
Messages
44
Yes, that's right. If you have a spare SATA port, you can plug in the replacement without removing the old one, do the replacement (no need to offline the old disk first), and then remove the old disk when it's finished. That way, you don't lose any redundancy.

Thanks for the info. No spare SATA ports. My board only has 6, and they are all used by the hard drives. I'll follow the steps when the new drive arrives and go from there. Thanks everyone for your help.
 

Shay.ca

Dabbler
Joined
Jan 8, 2015
Messages
44
Got the drive replaced; it's resilvering now. I will let that run overnight and follow the rest of the steps tomorrow.

Just to confirm, once the resilver is complete and it shows no errors, everything is good to go? No other steps to complete?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Yep you are good to go.

I would have fully burned in the new drive to make sure it was OK for use, but it's too late for that now.

 

brando56894

Wizard
Joined
Feb 15, 2014
Messages
1,537
Do you have adequate cooling for the drives? It looked like the temperature test failed 35 times, unless that's saying that it's 35c. In that case it's odd that they put the temperature in the "failed" column.
 

okgunguy

Explorer
Joined
Aug 4, 2015
Messages
72
Do you have adequate cooling for the drives? It looked like the temperature test failed 35 times, unless that's saying that it's 35c. In that case it's odd that they put the temperature in the "failed" column.

194 Temperature_Celsius 0x0022 115 100 000 Old_age Always - 35

That means the drive was at 35 degrees Celsius when the "smartctl -a" command was run.
And it's not in the WHEN_FAILED column. It's in the RAW_VALUE column. The spacing on his read-out is off a little.
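
To make the column mapping concrete, here is a small Python sketch that splits the row on whitespace and labels each field with the header names from the table. The function name parse_attribute_row is made up for this illustration; it assumes the row has the ten columns shown in the header.

```python
# Column names taken from the header row of the attribute table.
COLUMNS = ["ID#", "ATTRIBUTE_NAME", "FLAG", "VALUE", "WORST", "THRESH",
           "TYPE", "UPDATED", "WHEN_FAILED", "RAW_VALUE"]

ROW = "194 Temperature_Celsius 0x0022 115 100 000 Old_age Always - 35"

def parse_attribute_row(row):
    """Split one attribute row on whitespace and label each field."""
    # Limiting the number of splits keeps any spaces inside RAW_VALUE
    # together in the final field.
    fields = row.strip().split(None, len(COLUMNS) - 1)
    return dict(zip(COLUMNS, fields))

attr = parse_attribute_row(ROW)
print("WHEN_FAILED:", attr["WHEN_FAILED"])
print("RAW_VALUE:", attr["RAW_VALUE"])
```

Splitting on whitespace rather than on fixed character positions sidesteps the mangled spacing, since the fields themselves contain no spaces apart from possibly the last one.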
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The forum loves to destroy smartctl formatting.
 

Shay.ca

Dabbler
Joined
Jan 8, 2015
Messages
44
Thanks everyone for your help.

The computer has two fans in the front and one in the rear. I think the cooling is fine. I did notice after swapping the drive that the rear fan was not working. I shut the server down for the meantime, until I can figure out whether I just bumped a connection or why it's not powering on.


Would anyone care to point me in the right direction for setting up those SMART tests, so I can get email alerts when an issue happens? Luckily I caught this one pretty quickly (I logged in a bit after it happened), but I don't usually log in like I did this time. I just access the file shares.

Thanks a ton, you guys are awesome!
 

Shay.ca

Dabbler
Joined
Jan 8, 2015
Messages
44
Guessing it's this?

http://doc.freenas.org/9.10/services.html#s-m-a-r-t
 
