Help with SMART results

Status
Not open for further replies.

samuyeah

Dabbler
Joined
May 26, 2016
Messages
12
Hello everyone,

This week I received a critical alert because of an "unrecoverable error in one or more devices": http://illumos.org/msg/ZFS-8000-9P

After a scrub, ada1 showed 11 checksum errors so I ran a long SMART test. These are the results:
Code:
# smartctl -a /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)															
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org														
																																	
=== START OF INFORMATION SECTION ===																								
Model Family:	 Western Digital Caviar Green (AF)																				
Device Model:	 WDC WD20EARS-00J99B0																							
Serial Number:	WD-WCAWZ1677837																								  
LU WWN Device Id: 5 0014ee 25bc3d884																								
Firmware Version: 80.00A80																										
User Capacity:	2,000,398,934,016 bytes [2.00 TB]																				
Sector Sizes:	 512 bytes logical, 4096 bytes physical																			
Device is:		In smartctl database [for details use: -P show]																  
ATA Version is:   ATA8-ACS (minor revision not indicated)																		  
SATA Version is:  SATA 2.6, 3.0 Gb/s																								
Local Time is:	Tue Jun  6 07:27:20 2017 CEST																					
SMART support is: Available - device has SMART capability.																		
SMART support is: Enabled																										  
																																	
=== START OF READ SMART DATA SECTION ===																							
SMART overall-health self-assessment test result: PASSED																			
																																	
General SMART Values:																											  
Offline data collection status:  (0x85) Offline data collection activity															
										was aborted by an interrupting command from host.										  
										Auto Offline Data Collection: Enabled.													
Self-test execution status:	  (   0) The previous self-test routine completed													
										without error or no self-test has ever													
										been run.																				  
Total time to complete Offline																									
data collection:				(33900) seconds.																					
Offline data collection																											
capabilities:					(0x7b) SMART execute Offline immediate.															
										Auto Offline data collection on/off support.												
										Suspend Offline collection upon new														
										command.																					
										Offline surface scan supported.															
										Self-test supported.																		
										Conveyance Self-test supported.															
										Selective Self-test supported.															
SMART capabilities:			(0x0003) Saves SMART data before entering															
										power-saving mode.																		
										Supports SMART auto save timer.															
Error logging capability:		(0x01) Error logging supported.																	
										General Purpose Logging supported.														
Short self-test routine																											
recommended polling time:		(   2) minutes.																					
Extended self-test routine																										
recommended polling time:		( 327) minutes.																					
Conveyance self-test routine
recommended polling time:		(   5) minutes.																					
SCT capabilities:			  (0x3035) SCT Status supported.																	  
										SCT Feature Control supported.															
										SCT Data Table supported.																  
																																	
SMART Attributes Data Structure revision number: 16																				
Vendor Specific SMART Attributes with Thresholds:																				  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always   -           98
  3 Spin_Up_Time            0x0027   144   139   021    Pre-fail  Always   -           9800
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           830
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always   -           0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always   -           5285
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           429
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always   -           172
193 Load_Cycle_Count        0x0032   108   108   000    Old_age   Always   -           276579
194 Temperature_Celsius     0x0022   114   095   000    Old_age   Always   -           38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline  -           3
																																	
SMART Error Log Version: 1																										
No Errors Logged																													
																																	
SMART Self-test log structure revision number 1																					
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error									
# 1  Extended offline	Completed: read failure	   70%	  5280		 1275044616											
# 2  Short offline	   Completed without error	   00%	  5278		 -													
# 3  Short offline	   Completed without error	   00%	  5268		 -													
# 4  Short offline	   Completed: read failure	   90%	  5172		 12852030											  
# 5  Short offline	   Completed: read failure	   20%	  4963		 12852024											  
# 6  Short offline	   Completed without error	   00%	  4934		 -													
# 7  Extended offline	Completed without error	   00%	  4932		 -													
# 8  Short offline	   Completed without error	   00%	  4838		 -													
# 9  Short offline	   Completed without error	   00%	  4742		 -													
#10  Short offline	   Completed without error	   00%	  4646		 -													
#11  Extended offline	Completed without error	   00%	  4597		 -													
#12  Short offline	   Completed without error	   00%	  4550		 -													
#13  Short offline	   Completed without error	   00%	  4454		 -													
#14  Short offline	   Completed without error	   00%	  4382		 -													
#15  Short offline	   Completed without error	   00%	  4287		 -													
#16  Short offline	   Completed without error	   00%	  4191		 -													
#17  Extended offline	Completed without error	   00%	  4189		 -													
#18  Short offline	   Completed without error	   00%	  4165		 -
#19  Short offline	   Completed without error	   00%	  4069		 -													
#20  Short offline	   Completed without error	   00%	  4001		 -													
#21  Extended offline	Completed without error	   00%	  3952		 -													
																																	
SMART Selective self-test log data structure revision number 1																	
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS																						
	1		0		0  Not_testing																								
	2		0		0  Not_testing																								
	3		0		0  Not_testing																								
	4		0		0  Not_testing																								
	5		0		0  Not_testing																								
Selective self-test flags (0x0):																									
  After scanning selected spans, do NOT read-scan remainder of disk.																
If Selective self-test is pending on power-up, resume after 0 minute delay.

Is it that serious, doctor? Should I replace the disk ASAP?

Thanks in advance for all your kind support.
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,478

samuyeah

Dabbler
Joined
May 26, 2016
Messages
12
You should check this resource that is designed to help you answer the very question you asked:
https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/
Thanks johnny!

Not sure if I'm reading the log correctly, but the only things I can see there are:
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 3
...
# 4 Short offline Completed: read failure 90% 5172 12852030
# 5 Short offline Completed: read failure 20% 4963 12852024
From the troubleshooting guide:
ID 200 MultiZone Error Rate can be the cause of a drive failure although a value in this location doesn't always mean it's the fault. It is notable if there are no other failing indications.
I'm trying to get a second opinion from anyone here with more experience and knowledge about this matter: what would you do in my situation?
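
For anyone who would rather not eyeball a long self-test log, here is a minimal Python sketch that pulls out the entries that did not complete cleanly. It assumes the text comes from `smartctl -a` in the column layout shown in this thread. The function name `failed_self_tests` and the regular expression are purely illustrative and are not part of smartmontools.

```python
import re

# One row of the "SMART Self-test log" table, e.g.
# "# 4  Short offline  Completed: read failure  90%  5172  12852030".
# The column layout is an assumption based on the output posted in this thread.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>Short offline|Extended offline|Conveyance offline)\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\d+|-)\s*$"
)


def failed_self_tests(log_text):
    """Return (num, test, status, power_on_hours, lba) for every row
    whose status is not 'Completed without error'."""
    failures = []
    for line in log_text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # not a self-test row
        if "without error" in m["status"]:
            continue  # clean run, nothing to report
        lba = None if m["lba"] == "-" else int(m["lba"])
        failures.append(
            (int(m["num"]), m["test"], m["status"], int(m["hours"]), lba)
        )
    return failures


# A few rows copied from the log above.
SAMPLE = """\
# 1  Extended offline    Completed: read failure       70%      5280         1275044616
# 2  Short offline       Completed without error       00%      5278         -
# 4  Short offline       Completed: read failure       90%      5172         12852030
# 5  Short offline       Completed: read failure       20%      4963         12852024
"""

if __name__ == "__main__":
    for row in failed_self_tests(SAMPLE):
        print(row)
```

Run against the rows above, it reports tests #1, #4 and #5. That matches what the log shows: three read failures, at two nearby LBAs and one distant LBA.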
 

CraigD

Patron
Joined
Mar 8, 2016
Messages
343
Do your drives get hot? Sometimes drives fail at extreme temperatures, then work fine once temperatures normalize.

Is the drive under warranty?
It has low hours. I didn't think WD still made 3.5" Green drives; I guess I'm wrong.

Look at "Lifetime Min/Max Temperature". Drives can play up in extreme temperatures, then come right at normal temperatures.

Code:
[root@freenas] ~# smartctl -l scttemp /dev/ada1
=== START OF READ SMART DATA SECTION ===
SCT Status Version:  3
SCT Version (vendor specific):  258 (0x0102)
SCT Support Level:  1
Device State:  Active (0)
Current Temperature:  25 Celsius
Power Cycle Min/Max Temperature:  22/29 Celsius
Lifetime  Min/Max Temperature:  21/32 Celsius
Under/Over Temperature Limit Count:  0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:  2
Temperature Sampling Period:  1 minute
Temperature Logging Interval:  1 minute
Min/Max recommended Temperature:  0/60 Celsius
Min/Max Temperature Limit:  -41/85 Celsius
Temperature History Size (Index):  478 (366)
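
If you want to check those numbers in a script rather than by eye, a rough Python sketch follows. It assumes the `smartctl -l scttemp` text has the same labels as the output above. The helper names `temperature_ranges` and `lifetime_within_recommended` are made up for illustration.

```python
import re


def temperature_ranges(scttemp_text):
    """Collect every 'label:  min/max Celsius' line into {label: (min, max)}."""
    ranges = {}
    for line in scttemp_text.splitlines():
        m = re.match(r"^\s*(.+?):\s+(-?\d+)/(-?\d+) Celsius\s*$", line)
        if m:
            # Collapse repeated spaces so "Lifetime  Min/Max" becomes "Lifetime Min/Max".
            label = " ".join(m.group(1).split())
            ranges[label] = (int(m.group(2)), int(m.group(3)))
    return ranges


def lifetime_within_recommended(ranges):
    """True if the lifetime min/max stayed inside the drive's recommended range."""
    low, high = ranges["Lifetime Min/Max Temperature"]
    rec_low, rec_high = ranges["Min/Max recommended Temperature"]
    return rec_low <= low and high <= rec_high


# Lines copied from the example output above.
SAMPLE = """\
Current Temperature:  25 Celsius
Power Cycle Min/Max Temperature:  22/29 Celsius
Lifetime  Min/Max Temperature:  21/32 Celsius
Under/Over Temperature Limit Count:  0/0
Min/Max recommended Temperature:  0/60 Celsius
Min/Max Temperature Limit:  -41/85 Celsius
"""

if __name__ == "__main__":
    r = temperature_ranges(SAMPLE)
    print(r["Lifetime Min/Max Temperature"], lifetime_within_recommended(r))
```

This only compares the lifetime extremes with the range the drive itself reports as recommended. It says nothing about how long the drive spent near either end.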



Have Fun
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,358
If the drive is in warranty, then RMA it.
 

samuyeah

Dabbler
Joined
May 26, 2016
Messages
12
Do your drives get hot? Sometimes drives fail at extreme temperatures, then work fine once temperatures normalize.

Thanks Craig! Yes: my disks sometimes get too hot (above 40ºC) inside my Dell T20. I tried different things, but airflow inside this case is very poor. I'll try to test the disk again under different conditions... Anyway, the rest of the disks (4 x 2TB WD Green in total) get as hot as that one but haven't produced this kind of error (yet).

If the drive is in warranty, then RMA it.

My 2TB WD Greens are already a few years old despite their low hours :(
 

samuyeah

Dabbler
Joined
May 26, 2016
Messages
12
Just in case a disk replacement is needed: is WD NAS Red still the cheapest-reliable-favourite option in these forums?
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
I don't worry about a drive in the bottom bay hovering a few degrees above 40C under load, e.g. during a scrub. Is it ideal? No, but it's well within acceptable limits.
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
I'll confess... my drives hover in the high 30s and hit low 40s under significant load (which is fairly often). I've not experienced issues as a result. Pushing 50 I'd worry about... but 40 is in the "meh" range in my book. I know the ZFS Police will get on me about this... probably take away my pony.

WD Red or HGST Deskstar NAS seem to be the favorites. I've been very happy with HGST drives, so I've stuck with them. They are 7200RPM drives, so they naturally run a bit hotter.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,358
Just in case a disk replacement is needed: is WD NAS Red still the cheapest-reliable-favourite option in these forums?

Seagate NAS drives tend to be cheaper. The jury is still out on whether they are reliable.

I have 16 and they've been good so far ;)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175