CRITICAL: Device: /dev/ada3, ATA error count increased from 0 to 1

Status
Not open for further replies.

Avi Poss

Dabbler
Joined
Dec 11, 2016
Messages
13
Hi,
I got this error on my FreeNAS box (CRITICAL: Device: /dev/ada3, ATA error count increased from 0 to 1). ada3 is my L2ARC drive, so I'm not super worried; I just wanted to know if anyone has some insight into why this is happening. If you have any advice or words of caution I'd appreciate it. Thanks!
I ran smartctl -a /dev/ada3 and this is the output:
Code:
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)														
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org														
																																  
=== START OF INFORMATION SECTION ===																								
Model Family:	 Marvell based SanDisk SSDs																						
Device Model:	 SanDisk SDSSDHII120G																							 
Serial Number:	151419400808																									 
LU WWN Device Id: 5 001b44 e4ce1ae68																								
Firmware Version: X31200RL																										 
User Capacity:	120,034,123,776 bytes [120 GB]																					
Sector Sizes:	 512 bytes logical, 4096 bytes physical																			
Rotation Rate:	Solid State Device																								
Form Factor:	  2.5 inches																										
Device is:		In smartctl database [for details use: -P show]																 
ATA Version is:   ACS-2 T13/2015-D revision 3																					 
SATA Version is:  SATA >3.1, 6.0 Gb/s (current: 6.0 Gb/s)																		 
Local Time is:	Mon Dec 12 09:47:05 2016 IST																					 
SMART support is: Available - device has SMART capability.																		 
SMART support is: Enabled																										 
																																  
=== START OF READ SMART DATA SECTION ===																							
SMART overall-health self-assessment test result: PASSED																			
																																  
General SMART Values:																											 
Offline data collection status:  (0x00) Offline data collection activity															
										was never started.																		 
										Auto Offline Data Collection: Disabled.													
Self-test execution status:	  (   0) The previous self-test routine completed													
										without error or no self-test has ever													 
										been run.																				 
Total time to complete Offline																									 
data collection:				(	0) seconds.																					
Offline data collection																											
capabilities:					(0x11) SMART execute Offline immediate.															
										No Auto Offline data collection support.													
										Suspend Offline collection upon new														
										command.																					
										No Offline surface scan supported.														 
										Self-test supported.																		
										No Conveyance Self-test supported.														 
										No Selective Self-test supported.														 
SMART capabilities:			(0x0003) Saves SMART data before entering															
										power-saving mode.																		 
										Supports SMART auto save timer.															
Error logging capability:		(0x01) Error logging supported.																	
										General Purpose Logging supported.														 
Short self-test routine																											
recommended polling time:		(   2) minutes.																					
Extended self-test routine
recommended polling time:		(  10) minutes.																					
																																  
SMART Attributes Data Structure revision number: 4																				 
Vendor Specific SMART Attributes with Thresholds:																				 
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   253   100   ---    Old_age   Always       -       2892
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       9
165 Total_Write/Erase_Count 0x0032   100   100   ---    Old_age   Always       -       21692419167
166 Min_W/E_Cycle           0x0032   100   100   ---    Old_age   Always       -       104
167 Min_Bad_Block/Die       0x0032   100   100   ---    Old_age   Always       -       13
168 Maximum_Erase_Cycle     0x0032   100   100   ---    Old_age   Always       -       134
169 Total_Bad_Block         0x0032   100   100   ---    Old_age   Always       -       0
171 Program_Fail_Count      0x0032   100   100   ---    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   ---    Old_age   Always       -       0
173 Avg_Write/Erase_Count   0x0032   100   100   ---    Old_age   Always       -       132
174 Unexpect_Power_Loss_Ct  0x0032   100   100   ---    Old_age   Always       -       4
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       6
194 Temperature_Celsius     0x0022   066   039   ---    Old_age   Always       -       34 (Min/Max 24/39)
199 SATA_CRC_Error          0x0032   100   100   ---    Old_age   Always       -       0
230 Perc_Write/Erase_Count  0x0032   100   100   ---    Old_age   Always       -       3487952280104
232 Perc_Avail_Resrvd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 Total_NAND_Writes_GiB   0x0032   100   100   ---    Old_age   Always       -       15481
234 Perc_Write/Erase_Ct_BC  0x0032   100   100   ---    Old_age   Always       -       16688
241 Total_Writes_GiB        0x0030   253   253   ---    Old_age   Offline      -       6425
242 Total_Reads_GiB         0x0030   253   253   ---    Old_age   Offline      -       5563
244 Thermal_Throttle        0x0032   000   100   ---    Old_age   Always       -       0
																																  
SMART Error Log Version: 1																										 
ATA Error Count: 1																												 
		CR = Command Register [HEX]																								
		FR = Features Register [HEX]																								
		SC = Sector Count Register [HEX]																							
		SN = Sector Number Register [HEX]																						 
		CL = Cylinder Low Register [HEX]																							
		CH = Cylinder High Register [HEX]																						 
		DH = Device/Head Register [HEX]																							
		DC = Device Command Register [HEX]																						 
		ER = Error register [HEX]																								 
		ST = Status register [HEX]																								 
Powered_Up_Time is measured from power on, and printed as																		 
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,																			 
SS=sec, and sss=millisec. It "wraps" after 49.710 days.																			
																																  
Error 1 occurred at disk power-on lifetime: 2645 hours (110 days + 5 hours)														
  When the command that caused the error occurred, the device was active or idle.												 
																																  
  After command completion occurred, registers were:																				
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --																											 
  41 40 00 00 00 00 00																											 
																																  
  Commands leading to the command that caused the error were:																	 
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name																 
  -- -- -- -- -- -- -- --  ----------------  --------------------																 
  60 08 60 e8 55 af 40 08	  00:00:00.000  READ FPDMA QUEUED																	 
																																  
SMART Self-test log structure revision number 1																					
No self-tests have been logged.  [To run self-tests, use: smartctl -t]															 
																																  
Selective Self-tests/Logging not supported
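
For anyone who wants to pull the interesting counters out of output like the above without eyeballing it, here is a minimal Python sketch. It assumes the ten-column attribute layout shown in this post. The names `parse_smart_attributes`, `nonzero_watched` and `WATCH` are made up for illustration and are not part of smartmontools, and the choice of which attributes to watch is only a guess at what is worth a look.

```python
import re

# Attribute rows look like:
#   ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Hypothetical watch list: counters that may deserve a look when non-zero.
WATCH = {"Reallocated_Sector_Ct", "Reported_Uncorrect", "SATA_CRC_Error"}


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for every attribute row in `text`."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            # Only the leading integer of RAW_VALUE is kept,
            # e.g. 34 from "34 (Min/Max 24/39)".
            attrs[m.group(2)] = int(m.group(3))
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is not zero."""
    return {name: raw for name, raw in attrs.items() if name in WATCH and raw != 0}


# A few rows copied from the output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       6
194 Temperature_Celsius     0x0022   066   039   ---    Old_age   Always       -       34 (Min/Max 24/39)
199 SATA_CRC_Error          0x0032   100   100   ---    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(nonzero_watched(parse_smart_attributes(SAMPLE)))
```

On the rows above, the only watched counter that comes back non-zero is Reported_Uncorrect with a raw value of 6. SATA_CRC_Error and Reallocated_Sector_Ct are both 0.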
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,974
First, the error was likely a SATA cable transmission error. If you keep getting these errors on this drive, replace the SATA cable and/or move the drive to a different SATA port. If that doesn't work, you may need to replace the SSD. I didn't see any other obvious issues in the data, but SSDs are a new breed and we are still trying to figure out what each manufacturer's data means.
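
To follow the "if you keep getting these errors" advice, something like the following could track the counter between checks. It is only a sketch: it assumes the `ATA Error Count: N` line format from the smartctl output in the first post, and the function names `ata_error_count` and `error_count_message` are hypothetical, not part of FreeNAS or smartmontools.

```python
import re


def ata_error_count(smartctl_output):
    """Extract N from an 'ATA Error Count: N' line.

    Returns 0 when the line is absent, which this sketch assumes
    means the drive has logged no errors.
    """
    m = re.search(r"^ATA Error Count:\s*(\d+)", smartctl_output, re.MULTILINE)
    return int(m.group(1)) if m else 0


def error_count_message(device, previous, current):
    """Return an alert string if the count grew, otherwise None."""
    if current > previous:
        return "Device: {}, ATA error count increased from {} to {}".format(
            device, previous, current
        )
    return None


if __name__ == "__main__":
    output = "SMART Error Log Version: 1\nATA Error Count: 1\n"
    print(error_count_message("/dev/ada3", 0, ata_error_count(output)))
```

Saving the last count somewhere and comparing it on each run is enough to see whether the error was a one-off or keeps recurring.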

Second, it would be nice to know why you have an L2ARC and whether your system is properly configured for one, so maybe you could also provide some system specs. You don't need to if you don't want to; it would only be for some friendly advice.
 

Avi Poss

Dabbler
Joined
Dec 11, 2016
Messages
13
Thanks!
FreeNAS Mini (FreeNAS-9.3-STABLE-201506292130) with 8 Intel Atom cores and 32GB RAM, 4x6TB WD Re drives (WD6001FSYZ) in RAIDZ2 with L2ARC on a 120GB SanDisk SSD. Used for file sharing on a company network, with CIFS and AFP. Snapshots are taken anywhere from 5 to 16 times a day (on work days only) and kept for anywhere from 3 days to 2 months depending on the volume. Connected to Active Directory via LDAP.
Did I miss anything?
Thanks again!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,974
Hum... An L2ARC could help you out. Many times we see people adding an L2ARC for the wrong reasons. Keep an eye on this, and you really need to upgrade to the current version of FreeNAS. It includes a workaround for a problem with a specific motherboard, which I believe you have, in which the motherboard will die, no exaggeration. It's a flaw in the motherboard firmware and I don't believe there is a firmware fix for it yet, so FreeNAS made a change which avoids the danger. Make a backup of your configuration file first and then update to the latest version of FreeNAS Stable.
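
If you'd rather script the configuration backup than click through the web UI, a rough sketch in Python could look like this. It only copies a file under a timestamped name. The path `/data/freenas-v1.db` in the usage lines is an assumption about where the configuration database lives, the destination directory is a placeholder, and `backup_config` is a made-up helper rather than a FreeNAS tool, so check all of them against your own system first.

```python
import shutil
import time
from pathlib import Path


def backup_config(config_path, backup_dir):
    """Copy the config file into backup_dir under a timestamped name."""
    src = Path(config_path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # e.g. freenas-v1.db -> freenas-v1-20161212-094705.db
    dest = dest_dir / "{}-{}{}".format(src.stem, stamp, src.suffix)
    shutil.copy2(str(src), str(dest))  # copy2 also preserves timestamps
    return dest


# Usage (both paths are assumptions, adjust before running):
#   backup_config("/data/freenas-v1.db", "/mnt/tank/config-backups")
```

Keep the copy somewhere other than the boot device, so it survives if the update goes wrong.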
 

Avi Poss

Dabbler
Joined
Dec 11, 2016
Messages
13
This bug?
https://bugs.pcbsd.org/issues/16028
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,974

Avi Poss

Dabbler
Joined
Dec 11, 2016
Messages
13
This bug.
Thanks a ton!
I checked via SSH to IPMI and I have that bug (writes every 10 seconds as opposed to every second as reported in the bug, but still bad...).
I scheduled downtime for next week. Hopefully it won't die before then...

Just a couple more questions.
I presume that watchdogd is disabled in the update. Does that mean I need to disable the watchdog in the BIOS?

When I choose 9.10-STABLE in the update screen it gives me a warning "Are you sure you want to change trains?" Is that a big deal? Why the warning? Do I need to worry about things not loading properly after the update?

Thanks again!
 

Avi Poss

Dabbler
Joined
Dec 11, 2016
Messages
13
Everything went smoothly with the upgrade.
Thanks again!
Do I need to worry about bug 16028? My MB temps in IPMI are between 47°C and 51°C.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176
What's your motherboard?
 