Resilvered and degraded state messages. What is happening to my drives? Do I need to replace any?

Status
Not open for further replies.

pumadace

Cadet
Joined
Jan 29, 2016
Messages
2
Hello,
I am relatively new and mostly a noob. I would appreciate any help and explanations.

Recently, my FreeNAS sent me mail stating:
(11.10 at 9:50) "The volume pumanas (ZFS) state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state."
(11.10 at 9:55) "The volume pumanas (ZFS) state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state."
(12.11 at 3:11) "freenas.local kernel log messages:
ada1 at ahcich5 bus 0 scbus6 target 0 lun 0
ada1: <ST8000AS0002-1NA17Z AR15> s/n Z840J0MS detached
GEOM_ELI: Device ada1p1.eli destroyed.
GEOM_ELI: Detached ada1p1.eli on last close.
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
(ada1:ahcich5:0:0:0): Periph destroyed
ada1 at ahcich5 bus 0 scbus6 target 0 lun 0
ada1: <ST8000AS0002-1NA17Z AR15> ACS-2 ATA SATA 3.x device
ada1: Serial Number Z840J0MS
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 7630885MB (15628053168 512 byte sectors)
ada1: Previously was known as ad14
-- End of security output --"

After that, I have been getting the same mail as the first one (DEGRADED) roughly every 2 hours, at varying intervals.
My configuration is 10x 8TB Seagate Archive drives with two SanDisk USB sticks.
Code:
[root@freenas ~]# zpool status -v																								   
  pool: freenas-boot																												
 state: ONLINE																													 
  scan: scrub repaired 0 in 0h4m with 0 errors on Sat Sep 22 03:49:44 2018														 
config:																															 
																																   
	   NAME		STATE	 READ WRITE CKSUM																					 
	   freenas-boot  ONLINE	   0	 0	 0																					
		 mirror-0  ONLINE	   0	 0	 0																					 
		   da8p2   ONLINE	   0	 0	 0																					 
		   da9p2   ONLINE	   0	 0	 0																					 
																																   
errors: No known data errors																										
																																   
  pool: pumanas																													 
 state: ONLINE																													 
  scan: resilvered 16K in 0h0m with 0 errors on Fri Oct 12 22:37:46 2018															
config:																															 
																																   
	   NAME											STATE	 READ WRITE CKSUM												 
	   pumanas										 ONLINE	   0	 0	 0												 
		 raidz2-0									  ONLINE	   0	 0	 0												 
		   gptid/456c47b9-fa8d-11e6-aaad-0cc47a6c1358  ONLINE	   0	 0	 0												 
		   gptid/463ceeda-fa8d-11e6-aaad-0cc47a6c1358  ONLINE	   0	 0	 0												 
		   gptid/47132221-fa8d-11e6-aaad-0cc47a6c1358  ONLINE	   0	 0	 0												 
		   gptid/47ebee2f-fa8d-11e6-aaad-0cc47a6c1358  ONLINE	   0	 0	 0												 
		   gptid/48bf0f82-fa8d-11e6-aaad-0cc47a6c1358  ONLINE	   0	 0	 0												 
		   gptid/498938d1-fa8d-11e6-aaad-0cc47a6c1358  ONLINE	   0	 0	 0												 
		   gptid/4a605ceb-fa8d-11e6-aaad-0cc47a6c1358  ONLINE	   0	 0	 0												 
		   gptid/4b549442-fa8d-11e6-aaad-0cc47a6c1358  ONLINE	   0	 0	 0												 
		   gptid/4c2ebdf6-fa8d-11e6-aaad-0cc47a6c1358  ONLINE	   0	 0	 0												 
		   gptid/4d05dd7a-fa8d-11e6-aaad-0cc47a6c1358  ONLINE	   0	 0	 0												 
																																   
errors: No known data errors 

Code:
[root@freenas ~]# camcontrol devlist																								
<ATA ST8000AS0002-1NA AR15>		at scbus0 target 0 lun 0 (pass0,da0)															 
<ATA ST8000AS0002-1NA AR15>		at scbus0 target 1 lun 0 (pass1,da1)															 
<ATA ST8000AS0002-1NA AR15>		at scbus0 target 2 lun 0 (pass2,da2)															 
<ATA ST8000AS0002-1NA AR15>		at scbus0 target 3 lun 0 (pass3,da3)															 
<ATA ST8000AS0002-1NA AR15>		at scbus0 target 4 lun 0 (pass4,da4)															 
<ATA ST8000AS0002-1NA AR15>		at scbus0 target 5 lun 0 (pass5,da5)															 
<ATA ST8000AS0002-1NA AR15>		at scbus0 target 6 lun 0 (pass6,da6)															 
<ATA ST8000AS0002-1NA AR15>		at scbus0 target 7 lun 0 (pass7,da7)															 
<ST8000AS0002-1NA17Z AR17>		 at scbus5 target 0 lun 0 (pass8,ada0)															
<ST8000AS0002-1NA17Z AR15>		 at scbus6 target 0 lun 0 (ada1,pass9)															
<SanDisk Ultra Fit 1.00>		   at scbus8 target 0 lun 0 (pass10,da8)															
<SanDisk Ultra Fit 1.00>		   at scbus9 target 0 lun 0 (pass11,da9)   


I don't know how to find any report of this problem, and overall I don't understand the current situation.
I am wondering whether the drives are OK or whether there is a problem with the data/power cables.

Thank you in advance for any suggestion or help.
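
One way to get an overview of which device keeps dropping out is to pull the detach events from the kernel log messages quoted above. The following is a minimal Python sketch, assuming the mailed log text is available as a string. The function name `detached_devices` and the regular expression are illustrative and modeled only on the detach line in this thread, so other log formats may need a different pattern.

```python
import re

# Pattern modeled on the detach line quoted in the thread, e.g.
# "ada1: <ST8000AS0002-1NA17Z AR15> s/n Z840J0MS detached"
DETACH_LINE = re.compile(r"^(\w+): <(.+?)> s/n (\S+) detached$")


def detached_devices(log_text):
    """Return (device, model, serial) for every detach event in the log."""
    events = []
    for line in log_text.splitlines():
        match = DETACH_LINE.match(line.strip())
        if match:
            events.append(match.groups())
    return events


# Excerpt of the kernel log messages from the mail above.
KERNEL_LOG = """\
ada1 at ahcich5 bus 0 scbus6 target 0 lun 0
ada1: <ST8000AS0002-1NA17Z AR15> s/n Z840J0MS detached
GEOM_ELI: Device ada1p1.eli destroyed.
(ada1:ahcich5:0:0:0): Periph destroyed
ada1: <ST8000AS0002-1NA17Z AR15> ACS-2 ATA SATA 3.x device
"""

for device, model, serial in detached_devices(KERNEL_LOG):
    print(device, model, serial)
```

The serial number in each event can be compared with the Serial Number field of `smartctl` output to confirm which physical drive dropped out.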
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You may have a failing data cable for drive ada1. Let's rule out the hard drive first. I'd recommend that you post the output of smartctl -a /dev/ada1 right now, and if you have not run a SMART Long test in the past several days, manually run one and post the output again after that test is complete. Refer to the link in my signature on how to troubleshoot hard drives; it will give you the proper commands as well.

Your pool looks fine right now.
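
The "looks fine" reading of the `zpool status -v` output above (every device ONLINE, all READ/WRITE/CKSUM counters at 0) can be sketched as a small check. This is a rough Python sketch under stated assumptions: the function name `unhealthy_vdevs` is hypothetical, and the parsing relies only on the column layout shown in this thread, not on any official output format guarantee.

```python
def unhealthy_vdevs(zpool_status_text):
    """Return config rows that are not ONLINE or have nonzero error counters."""
    problems = []
    in_config = False
    for line in zpool_status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":
            # Header row of the device table: data rows follow.
            in_config = True
            continue
        if fields[0] in ("errors:", "pool:"):
            # End of this pool's device table.
            in_config = False
            continue
        if in_config and len(fields) >= 5:
            name, state, read, write, cksum = fields[:5]
            if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
                problems.append((name, state, read, write, cksum))
    return problems


# Shortened excerpt of the pumanas pool output posted above.
ZPOOL_STATUS = """\
  pool: pumanas
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        pumanas                                         ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/456c47b9-fa8d-11e6-aaad-0cc47a6c1358  ONLINE       0     0     0
            gptid/463ceeda-fa8d-11e6-aaad-0cc47a6c1358  ONLINE       0     0     0

errors: No known data errors
"""

print(unhealthy_vdevs(ZPOOL_STATUS))
```

An empty result only describes the pool at the moment the command was run; it says nothing about a device that dropped out earlier and has since resilvered.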
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I didn't even look up the model number, but you are right, SMR drives are terrible! Will they work? Sure, just not very well.
 

pumadace

Cadet
Joined
Jan 29, 2016
Messages
2
Hello, I would like to conclude the topic.
Here is the SMART output of the drive:
Code:
=== START OF INFORMATION SECTION ===																							   
Model Family:	 Seagate Archive HDD																							   
Device Model:	 ST8000AS0002-1NA17Z																							   
Serial Number:	Z840J0MS																										 
LU WWN Device Id: 5 000c50 0912b7a14																							   
Firmware Version: AR15																											 
User Capacity:	8,001,563,222,016 bytes [8.00 TB]																				 
Sector Sizes:	 512 bytes logical, 4096 bytes physical																		   
Rotation Rate:	5980 rpm																										 
Device is:		In smartctl database [for details use: -P show]																   
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b																			   
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)																		   
Local Time is:	Sat Oct 13 12:20:09 2018 CEST																					 
SMART support is: Available - device has SMART capability.																		 
SMART support is: Enabled																										   
																																   
=== START OF READ SMART DATA SECTION ===																						   
SMART overall-health self-assessment test result: PASSED																		   
																																   
General SMART Values:																											   
Offline data collection status:  (0x82) Offline data collection activity														   
										was completed without error.															   
										Auto Offline Data Collection: Enabled.													 
Self-test execution status:	  (   0) The previous self-test routine completed												   
										without error or no self-test has ever													 
										been run.																				   
Total time to complete Offline																									 
data collection:				(	0) seconds.																				   
Offline data collection																											 
capabilities:					(0x7b) SMART execute Offline immediate.														   
										Auto Offline data collection on/off support.											   
										Suspend Offline collection upon new														 
										command.																				   
										Offline surface scan supported.															 
										Self-test supported.																	   
										Conveyance Self-test supported.															 
										Selective Self-test supported.															 
SMART capabilities:			(0x0003) Saves SMART data before entering														   
										power-saving mode.																		 
										Supports SMART auto save timer.															 
Error logging capability:		(0x01) Error logging supported.																   
										General Purpose Logging supported.														 
Short self-test routine																											 
recommended polling time:		(   1) minutes.																				   
Extended self-test routine																										 
recommended polling time:		( 941) minutes.	 
Conveyance self-test routine																									   
recommended polling time:		(   2) minutes.																				   
SCT capabilities:			  (0x30a5) SCT Status supported.																	   
										SCT Data Table supported.																   
																																   
SMART Attributes Data Structure revision number: 10																				 
Vendor Specific SMART Attributes with Thresholds:																				   
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE								   
  1 Raw_Read_Error_Rate	 0x000f   112   099   006	Pre-fail  Always	   -	   48964256									 
  3 Spin_Up_Time			0x0003   097   090   000	Pre-fail  Always	   -	   0										   
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   40										   
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0										   
  7 Seek_Error_Rate		 0x000f   090   060   030	Pre-fail  Always	   -	   988611542								   
  9 Power_On_Hours		  0x0032   080   080   000	Old_age   Always	   -	   17970									   
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0										   
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   40										   
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0										   
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0										   
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0										   
188 Command_Timeout		 0x0032   100   100   000	Old_age   Always	   -	   0										   
189 High_Fly_Writes		 0x003a   100   100   000	Old_age   Always	   -	   0										   
190 Airflow_Temperature_Cel 0x0022   063   057   045	Old_age   Always	   -	   37 (Min/Max 37/37)						   
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0										   
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   664										 
193 Load_Cycle_Count		0x0032   098   098   000	Old_age   Always	   -	   4406										 
194 Temperature_Celsius	 0x0022   037   043   000	Old_age   Always	   -	   37 (0 21 0 0 0)							 
195 Hardware_ECC_Recovered  0x001a   112   099   000	Old_age   Always	   -	   48964256									 
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0										   
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0										   
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0										   
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   17539 (37 96 0)							 
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   26940471036								 
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   307968630254								 
																																   
SMART Error Log Version: 1																										 
No Errors Logged   

SMART Self-test log structure revision number 1																					 
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error									 
# 1  Short offline	   Completed without error	   00%	 17793		 -													 
# 2  Short offline	   Completed without error	   00%	 17625		 -													 
# 3  Short offline	   Completed without error	   00%	 17457		 -													 
# 4  Short offline	   Completed without error	   00%	 17289		 -													 
# 5  Short offline	   Completed without error	   00%	 17121		 -													 
# 6  Short offline	   Completed without error	   00%	 16953		 -													 
# 7  Short offline	   Completed without error	   00%	 16785		 -													 
# 8  Short offline	   Completed without error	   00%	 16617		 -													 
# 9  Short offline	   Completed without error	   00%	 16449		 -													 
#10  Short offline	   Completed without error	   00%	 16281		 -													 
#11  Short offline	   Completed without error	   00%	 16113		 -													 
#12  Short offline	   Completed without error	   00%	 15946		 -													 
#13  Short offline	   Completed without error	   00%	 15778		 -													 
#14  Short offline	   Completed without error	   00%	 15610		 -													 
#15  Short offline	   Completed without error	   00%	 15442		 -													 
#16  Short offline	   Completed without error	   00%	 15274		 -													 
#17  Short offline	   Completed without error	   00%	 15106		 -													 
#18  Short offline	   Completed without error	   00%	 14938		 -													 
#19  Short offline	   Completed without error	   00%	 14770		 -													 
#20  Short offline	   Completed without error	   00%	 14602		 -													 
#21  Short offline	   Completed without error	   00%	 14434		 -													 
																																   
SMART Selective self-test log data structure revision number 1																	 
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS																					   
	1		0		0  Not_testing																							   
	2		0		0  Not_testing																							   
	3		0		0  Not_testing																							   
	4		0		0  Not_testing																							   
	5		0		0  Not_testing																							   
Selective self-test flags (0x0):																								   
  After scanning selected spans, do NOT read-scan remainder of disk.															   
If Selective self-test is pending on power-up, resume after 0 minute delay.

There is no apparent issue. Over the weekend I went to the NAS and changed the data cable. (I must also say that I found a power cable there which might be a problem too, but since it also feeds another drive and I got errors from only one, I kept it in place.)
Since then I have not received any error mails, so for now I believe the cable was the issue.
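
The "no apparent issue" reading of the SMART attribute table above can be sketched as a check on a few raw values. This is a minimal Python sketch with assumptions: the function name `nonzero_watched` is hypothetical, and the choice of attributes (5, 187, 197, 198 for media problems, 199 for link errors) is a common rule of thumb, not something stated in this thread.

```python
import re

# Attribute IDs assumed worth watching (an illustrative selection).
WATCHED_IDS = {5, 187, 197, 198, 199}

# One row of the "Vendor Specific SMART Attributes" table:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    found = {}
    for line in smart_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        attr_id, name, raw = int(match.group(1)), match.group(2), int(match.group(3))
        if attr_id in WATCHED_IDS and raw != 0:
            found[name] = raw
    return found


# Rows copied from the smartctl output posted above (whitespace normalized).
SMART_ATTRIBUTES = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       664
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched(SMART_ATTRIBUTES))
```

Note that UDMA_CRC_Error_Count was 0 on this drive even though the detach mails stopped after the cable swap, so a clean result from a check like this does not by itself rule out a cable or connector problem.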

Regarding SMR drives and replacing them: I read a lot of comments and opinions before I decided to go this way. At that time most of them, including the negative ones, came from people without any experience (just their opinions). On the other hand, the very few people who had bought them claimed no issues.
I have had those drives for over 2 years, and for now I am very happy with them. However, I use the NAS mostly as cold storage. It runs 24/7, but writes/reads are very occasional. Thus, I believe SMR can be a good choice for this kind of usage.

Thank you for your quick response. I think the cable was the issue (it just surprises me that a cable can go bad).
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Glad you replaced the data cable and the problem seems to have gone away. SMR drives are significantly slower at write operations, and this doesn't sound like it impacts the way you use your FreeNAS system.
 