SOLVED Drive Unavailable

Status
Not open for further replies.

Grantp

Contributor
Joined
Feb 26, 2013
Messages
111
Hi, I came home this evening to find a drive unavailable, and I'm not really sure what has happened. The drive is new (about a month old). This is the output from the console:
Screenshot-6.png


The zpool status command shows:
Code:
 pool: tank
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
	the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 579M in 0h2m with 0 errors on Fri Oct 27 08:25:21 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	tank											DEGRADED	 0	 0	 0
	  raidz2-0									  DEGRADED	 0	 0	 0
		gptid/3ce45aa2-63de-11e7-a017-00074305cc80  ONLINE	   0	 0	 0
		gptid/04b22e4e-a7c5-11e7-825a-00074305cc80  ONLINE	   0	 0	 0
		gptid/474cc46d-e3a6-11e5-8e14-00074305cc80  ONLINE	   0	 0	 0
		gptid/7d69125f-a84e-11e7-825a-00074305cc80  ONLINE	   0	 0	 0
		gptid/71739a87-a6ee-11e7-825a-00074305cc80  ONLINE	   0	 0	 0
		gptid/499fb308-e3a6-11e5-8e14-00074305cc80  ONLINE	   0	 0	 0
		gptid/92373f1c-9d3c-11e7-b8a3-00074305cc80  ONLINE	   0	 0	 0
		gptid/4b125270-e3a6-11e5-8e14-00074305cc80  ONLINE	   0	 0	 0
		18063216091891569336						UNAVAIL	  0	 0	 0  was /dev/gptid/88836e1a-a056-11e7-84cb-00074305cc80
		gptid/8a7fa1dc-85ef-11e7-a903-00074305cc80  ONLINE	   0	 0	 0
	  raidz2-1									  ONLINE	   0	 0	 0
		gptid/2b260964-343b-11e7-ba8e-00074305cc80  ONLINE	   0	 0	 0
		gptid/2dbacd3f-343b-11e7-ba8e-00074305cc80  ONLINE	   0	 0	 0
		gptid/30303919-343b-11e7-ba8e-00074305cc80  ONLINE	   0	 0	 0
		gptid/32b6bc12-343b-11e7-ba8e-00074305cc80  ONLINE	   0	 0	 0
		gptid/63848798-a114-11e7-9189-00074305cc80  ONLINE	   0	 0	 0
		gptid/378986ec-343b-11e7-ba8e-00074305cc80  ONLINE	   0	 0	 0
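As a rough sketch (an illustration, not something from the original post), the missing member can be picked out of `zpool status` text with a few lines of Python. The sample text is a shortened copy of the output above; the function name `missing_devices` and the set of states treated as "missing" are assumptions made for this example.

```python
import re

# Shortened copy of the config section shown above (spaces instead of tabs).
SAMPLE = """\
    NAME                                            STATE     READ WRITE CKSUM
    tank                                            DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/3ce45aa2-63de-11e7-a017-00074305cc80  ONLINE       0     0     0
        18063216091891569336                        UNAVAIL      0     0     0  was /dev/gptid/88836e1a-a056-11e7-84cb-00074305cc80
        gptid/8a7fa1dc-85ef-11e7-a903-00074305cc80  ONLINE       0     0     0
"""

# Assumption: these are the states worth flagging for a single device.
MISSING_STATES = {"UNAVAIL", "FAULTED", "REMOVED", "OFFLINE"}


def missing_devices(status_text):
    """Return (name, state, previous_path) for every device in a missing state."""
    found = []
    for line in status_text.splitlines():
        fields = line.split()
        # A device row has a name, a state, and three error counters.
        if len(fields) >= 5 and fields[1] in MISSING_STATES:
            was = re.search(r"was (\S+)", line)
            found.append((fields[0], fields[1], was.group(1) if was else None))
    return found


for name, state, previous in missing_devices(SAMPLE):
    print(name, state, previous)
```

The "was /dev/gptid/..." path is the useful part: it is the label the pool last saw the disk under, which helps when matching the missing member to a physical drive.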


This is the output from a SMART run that was done 2 days ago:

Code:
########## SMART status report for da13 drive (HGST Deskstar NAS: PK1334PEKD6UBS) ##########
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000b   100   100   016	Pre-fail  Always	   -	   0
  2 Throughput_Performance  0x0005   137   137   054	Pre-fail  Offline	  -	   77
  3 Spin_Up_Time			0x0007   124   124   024	Pre-fail  Always	   -	   624 (Average 620)
  4 Start_Stop_Count		0x0012   100   100   000	Old_age   Always	   -	   31
  5 Reallocated_Sector_Ct   0x0033   100   100   005	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000b   100   100   067	Pre-fail  Always	   -	   0
  8 Seek_Time_Performance   0x0005   124   124   020	Pre-fail  Offline	  -	   33
  9 Power_On_Hours		  0x0012   100   100   000	Old_age   Always	   -	   761
10 Spin_Retry_Count		0x0013   100   100   060	Pre-fail  Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   31
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   31
193 Load_Cycle_Count		0x0012   100   100   000	Old_age   Always	   -	   31
194 Temperature_Celsius	 0x0002   171   171   000	Old_age   Always	   -	   35 (Min/Max 22/49)
196 Reallocated_Event_Count 0x0032   100   100   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0022   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0008   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x000a   200   200   000	Old_age   Always	   -	   1

ATA Error Count: 1
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 756 hours (31 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 08 e0 34 ab 00  Error: ICRC, ABRT at LBA = 0x00ab34e0 = 11220192

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 18 08 df 34 ab 40 ff   2d+16:57:02.560  WRITE FPDMA QUEUED
  61 18 00 d0 34 ab 40 00   2d+16:57:02.558  WRITE FPDMA QUEUED
  61 08 00 88 ee 60 40 00   2d+16:57:02.558  WRITE FPDMA QUEUED
  61 08 00 60 82 56 40 00   2d+16:57:02.557  WRITE FPDMA QUEUED
  61 10 00 d8 95 d9 40 00   2d+16:57:02.557  WRITE FPDMA QUEUED

Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline	Completed without error	   00%	   747		 -


Is there anything I can do to find out what the problem is so I can get the drive RMA'ed?

Many thanks, Grant
 

Grantp

Contributor
Joined
Feb 26, 2013
Messages
111
Check the SATA cable, its connections on both ends, and that the power connector is seated as well.

Are you suggesting that because of this error?

199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 1
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Are you certain that is the affected drive?

Sent from my SAMSUNG-SGH-I537 using Tapatalk
 

Grantp

Contributor
Joined
Feb 26, 2013
Messages
111
Yeah, could be. Also, you have only done one SMART test on the drive? I hope you burned in the drives before putting them into service, as recommended: https://forums.freenas.org/index.php?threads/how-to-hard-drive-burn-in-testing.21451/

You should also have short and long SMART tests running on a regular basis.

Yes, I did burn in as suggested. I do have regular SMART tests running: a long test every 5 days and a short test every 5 days, so in a month I run 6 long and 6 short tests on every drive.
 

Grantp

Contributor
Joined
Feb 26, 2013
Messages
111
Are you certain that is the affected drive?


Yes, I am sure. I have a list of all my drive serial numbers and I can see them all apart from PK1334PEKD6UBS.

What makes you think it may not be that drive?
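A minimal sketch of that serial-number check, assuming you keep your own list of expected serials and can collect the `smartctl -i` text for each disk that is still visible. The helper names `serial_from_smartctl` and `missing_serials` are made up for this example; the two serials are the ones that appear in this thread.

```python
import re


def serial_from_smartctl(info_text):
    """Pull the serial out of the 'Serial Number:' line of smartctl -i output."""
    match = re.search(r"^Serial Number:\s*(\S+)", info_text, re.MULTILINE)
    return match.group(1) if match else None


def missing_serials(expected, visible):
    """Serials that are on the inventory list but not reported by any disk."""
    return sorted(set(expected) - set(visible))


# Inventory list kept by hand (only two entries here, for illustration).
expected = ["PK1334PEKD6UBS", "PK1334PEKD9N6S"]

# Text as printed by smartctl -i for a drive that still responds.
info = "Device Model:     HGST HDN724040ALE640\nSerial Number:    PK1334PEKD9N6S\n"
visible = [serial_from_smartctl(info)]

print(missing_serials(expected, visible))
```

Whatever is left over after the comparison is the drive the system can no longer see, which is the serial to quote in an RMA request.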
 


Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Best to make sure.
CRC errors often are due to connection problems.
If all the connections check out and the drive does not come back after a reboot, it is probably a hard fail. It happens sometimes.
Even with new drives.

 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Nightshade, what makes you say I have only done 1 test?
Because the SMART data you posted only shows one test. The output of smartctl -a would show the last 21 tests.
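To see how many tests a drive has actually logged, the self-test section of `smartctl -a` can be counted with a short script. This is only a sketch: the regular expression is an assumption about the column layout of the self-test log (it only recognises "Short offline" and "Extended offline" rows), and the function name `self_tests` is made up for the example.

```python
import re

# One row of the self-test log: "# N  <type>  <status>  NN%  <hours>  <LBA>".
LOG_LINE = re.compile(
    r"^#\s*(\d+)\s+(Short offline|Extended offline)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)

SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      30%       583         -
# 2  Short offline       Completed without error       00%       528         -
# 3  Extended offline    Completed without error       00%       526         -
"""


def self_tests(text):
    """Return one dict per self-test row found in smartctl -a output."""
    tests = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            num, kind, status, _remaining, hours, _lba = match.groups()
            tests.append(
                {"num": int(num), "type": kind, "status": status, "hours": int(hours)}
            )
    return tests


tests = self_tests(SAMPLE_LOG)
print(len(tests), "tests logged")
for t in tests:
    print(t["num"], t["type"], t["status"], t["hours"])
```

If the count comes back as one for a drive that has been tested regularly, the likelier explanation is that the pasted output was cut short rather than that the tests did not run.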
 

Grantp

Contributor
Joined
Feb 26, 2013
Messages
111
No matter which drive I run 'smartctl -a' on, I seem to get the same amount of output. Here is another drive:

Code:
root@freenas:~ # smartctl -a /dev/da7
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 HGST Deskstar NAS
Device Model:	 HGST HDN724040ALE640
Serial Number:	PK1334PEKD9N6S
LU WWN Device Id: 5 000cca 250efdf32
Firmware Version: MJAOA5E0
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	7200 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sat Oct 28 00:34:43 2017 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  (  35)	The self-test routine was interrupted
					by the host with a hard or soft reset.
Total time to complete Offline
data collection:		 (   24) seconds.
Offline data collection
capabilities:			  (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time:	  (   1) minutes.
Extended self-test routine
recommended polling time:	  ( 579) minutes.
SCT capabilities:			(0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000b   100   100   016	Pre-fail  Always	   -	   0
  2 Throughput_Performance  0x0005   136   136   054	Pre-fail  Offline	  -	   80
  3 Spin_Up_Time			0x0007   100   100   024	Pre-fail  Always	   -	   474
  4 Start_Stop_Count		0x0012   100   100   000	Old_age   Always	   -	   5
  5 Reallocated_Sector_Ct   0x0033   100   100   005	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000b   100   100   067	Pre-fail  Always	   -	   0
  8 Seek_Time_Performance   0x0005   119   119   020	Pre-fail  Offline	  -	   35
  9 Power_On_Hours		  0x0012   100   100   000	Old_age   Always	   -	   600
10 Spin_Retry_Count		0x0013   100   100   060	Pre-fail  Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   5
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   5
193 Load_Cycle_Count		0x0012   100   100   000	Old_age   Always	   -	   5
194 Temperature_Celsius	 0x0002   200   200   000	Old_age   Always	   -	   30 (Min/Max 24/40)
196 Reallocated_Event_Count 0x0032   100   100   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0022   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0008   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x000a   200   200   000	Old_age   Always	   -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Interrupted (host reset)	  30%	   583		 -
# 2  Short offline	   Completed without error	   00%	   528		 -
# 3  Extended offline	Completed without error	   00%	   526		 -
# 4  Short offline	   Completed without error	   00%	   408		 -
# 5  Extended offline	Completed without error	   00%	   356		 -
# 6  Short offline	   Completed without error	   00%	   288		 -
# 7  Extended offline	Completed without error	   00%	   228		 -
# 8  Extended offline	Interrupted (host reset)	  40%		31		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@freenas:~ #


Here is my SMART test setup:

Screenshot-7.png

Am I doing something wrong?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Am I doing something wrong?
No, you just didn't give all the information at first so we were asking questions to try and clarify our understanding of your situation.
To me it looks like you had what I call a catastrophic drive failure. I have a server at work that we just turned on in January of this year and it has already had two drives fail. It happens.
I would just tell the vendor that it is completely failed and the drive is no longer reported as present in the host. It should be pretty easy.
I have had plenty of drives do that over the years.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Code:

Error 1 occurred at disk power-on lifetime: 756 hours (31 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 08 e0 34 ab 00  Error: ICRC, ABRT at LBA = 0x00ab34e0 = 11220192

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 18 08 df 34 ab 40 ff   2d+16:57:02.560  WRITE FPDMA QUEUED
  61 18 00 d0 34 ab 40 00   2d+16:57:02.558  WRITE FPDMA QUEUED
  61 08 00 88 ee 60 40 00   2d+16:57:02.558  WRITE FPDMA QUEUED
  61 08 00 60 82 56 40 00   2d+16:57:02.557  WRITE FPDMA QUEUED
  61 10 00 d8 95 d9 40 00   2d+16:57:02.557  WRITE FPDMA QUEUED

That appears to be the most significant part of the data you presented.
If you wanted to give the drive vendor any additional information, you could offer them this.
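For anyone who wants to check the numbers in that error record, here is a minimal sketch (an illustration, not smartctl's own code) that decodes the ER value and rebuilds the LBA from the SN, CL and CH registers. The helper name `decode_ata_error` is made up, only four of the error-register bits are covered, and the 28-bit LBA layout (low nibble of DH as the top bits) is an assumption that fits this record.

```python
# Subset of ATA error-register bits, enough for the record above.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_ata_error(er, sn, cl, ch, dh=0):
    """Return (flag names, LBA) for one entry of the SMART error log."""
    flags = [name for bit, name in sorted(ERROR_BITS.items(), reverse=True) if er & bit]
    # Assumed 28-bit layout: DH[3:0] | CH | CL | SN, most significant first.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return flags, lba


# Register values copied from the "84 51 08 e0 34 ab 00" line.
flags, lba = decode_ata_error(er=0x84, sn=0xE0, cl=0x34, ch=0xAB, dh=0x00)
print(flags, hex(lba), lba)
```

With those registers the result is ICRC plus ABRT at LBA 0x00ab34e0 (11220192), matching the "Error:" text in the log. ICRC is an interface CRC error, which is why the earlier advice was to check cabling first.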
 

Grantp

Contributor
Joined
Feb 26, 2013
Messages
111
OK, thanks Chris. I will contact the vendor to try to get the drive RMA'ed. I have just reseated all the cables and rebooted, and the drive is still not showing up. I have also swapped the SFF-8087 cable for a new one, and the drive still doesn't show up.

I am still a bit confused, though, over this comment:

Because the SMART data you posted only shows one test. The output of smartctl -a would show the last 21 tests.

Why is my smartctl -a only showing one test and not all the tests, as danb35 says it should?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Why is my smartctl -a only showing one test and not all the tests, as danb35 says it should?
Perhaps some drives don't store as many entries in their firmware as others. The drives I use store 21 entries in the list of tests.
For example:
Code:
root@Emily-NAS:~ # smartctl -a /dev/da9
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Barracuda 7200.14 (AF)
Device Model:	 ST2000DM001-1ER164
Serial Number:	Z4Z25RWE
LU WWN Device Id: 5 000c50 07a8ab4dc
Firmware Version: CC25
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	7200 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Fri Oct 27 19:21:18 2017 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(   89) seconds.
Offline data collection
capabilities:					(0x73) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										No Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   1) minutes.
Extended self-test routine
recommended polling time:		( 226) minutes.
Conveyance self-test routine
recommended polling time:		(   2) minutes.
SCT capabilities:			  (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   111   099   006	Pre-fail  Always	   -	   36103488
  3 Spin_Up_Time			0x0003   096   096   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   099   099   020	Old_age   Always	   -	   1299
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   078   060   030	Pre-fail  Always	   -	   76134550
  9 Power_On_Hours		  0x0032   092   092   000	Old_age   Always	   -	   7636
10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   99
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   100   100   000	Old_age   Always	   -	   0 0 0
189 High_Fly_Writes		 0x003a   098   098   000	Old_age   Always	   -	   2
190 Airflow_Temperature_Cel 0x0022   070   058   045	Old_age   Always	   -	   30 (Min/Max 26/33)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   64
193 Load_Cycle_Count		0x0032   098   098   000	Old_age   Always	   -	   4425
194 Temperature_Celsius	 0x0022   030   042   000	Old_age   Always	   -	   30 (0 13 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   7363h+03m+51.884s
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   30213194817
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   629713751083

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	  7631		 -
# 2  Short offline	   Completed without error	   00%	  7625		 -
# 3  Extended offline	Completed without error	   00%	  7619		 -
# 4  Short offline	   Completed without error	   00%	  7613		 -
# 5  Extended offline	Completed without error	   00%	  7607		 -
# 6  Short offline	   Completed without error	   00%	  7601		 -
# 7  Extended offline	Completed without error	   00%	  7595		 -
# 8  Short offline	   Completed without error	   00%	  7589		 -
# 9  Extended offline	Completed without error	   00%	  7583		 -
#10  Short offline	   Completed without error	   00%	  7577		 -
#11  Extended offline	Completed without error	   00%	  7572		 -
#12  Short offline	   Completed without error	   00%	  7565		 -
#13  Extended offline	Completed without error	   00%	  7559		 -
#14  Short offline	   Completed without error	   00%	  7553		 -
#15  Extended offline	Completed without error	   00%	  7548		 -
#16  Short offline	   Completed without error	   00%	  7541		 -
#17  Extended offline	Completed without error	   00%	  7535		 -
#18  Short offline	   Completed without error	   00%	  7529		 -
#19  Extended offline	Completed without error	   00%	  7525		 -
#20  Short offline	   Completed without error	   00%	  7517		 -
#21  Extended offline	Completed without error	   00%	  7511		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

What they are talking about is your first post: where you list the SMART test results, it only shows one test. This could be due to the rest of your post being snipped by the forum software for length, or a simple copy-and-paste mistake.
It only caused a little confusion. Drive failures can be stressful, and that is part of the reason for the guidance to use RAIDz2: a little extra safety.
 

Grantp

Contributor
Joined
Feb 26, 2013
Messages
111
Thanks for taking the time to explain, Chris, I do appreciate it. I still feel very much like a noob most of the time. I've looked at the other drives now and yes, they have the other tests listed. I just cut the original one short, not realising what it was showing.

Thanks to everyone for their help.
 