Pretty Sure Drive Is On Its Way Out

Status
Not open for further replies.

MrPaul

Cadet
Joined
Apr 5, 2017
Messages
3
Just wanted a second opinion that this is a bad drive.

The trigger was checksum errors, as seen here:

Code:
zpool status
  pool: RAIDZ-4x2t
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
		attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
		using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 12h40m with 0 errors on Sun Apr  2 13:40:24 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		RAIDZ-4x2t									  ONLINE	   0	 0	 0
		  raidz1-0									  ONLINE	   0	 0	 0
			gptid/7668ecff-b77d-11e2-8fa4-000854aa3c36  ONLINE	   0	 0	 0
			gptid/0c9649c6-3013-11e5-8a7d-1c6f6595c8c8  ONLINE	   0	 0	 0
			gptid/72e19b91-c283-11e2-be4f-000854aa3c36  ONLINE	   0	 0	 2
			gptid/78ec7bb7-b77d-11e2-8fa4-000854aa3c36  ONLINE	   0	 0	 0

errors: No known data errors
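
For anyone who wants to pick out the affected device without reading the whole table, here is a minimal sketch that filters `zpool status` text for rows with a non-zero READ, WRITE or CKSUM counter. The function name `vdevs_with_errors` is made up for illustration, and it assumes the counters are plain integers as in the output above (it ignores abbreviated counts such as `1.2K`).

```shell
#!/bin/sh
# vdevs_with_errors: hypothetical helper, not a zpool subcommand.
# Reads `zpool status` text on stdin and prints every device row whose
# READ, WRITE or CKSUM counter is non-zero.
vdevs_with_errors() {
    awk 'NF == 5 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ && ($3 + $4 + $5) > 0 {
             print $1, "READ=" $3, "WRITE=" $4, "CKSUM=" $5
         }'
}

# Demo on two rows copied from the status output above.
vdevs_with_errors <<'EOF'
gptid/0c9649c6-3013-11e5-8a7d-1c6f6595c8c8  ONLINE  0  0  0
gptid/72e19b91-c283-11e2-be4f-000854aa3c36  ONLINE  0  0  2
EOF
```

On a live system the same function could be fed with `zpool status | vdevs_with_errors`.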


The dmesg output has lots of these:

Code:
(ada2:ahcich14:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 e0 8b dc 40 07 00 00 00 00 00
(ada2:ahcich14:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich14:0:0:0): Retrying command
(ada2:ahcich14:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 f8 56 dd 40 07 00 00 00 00 00
(ada2:ahcich14:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich14:0:0:0): Retrying command
(ada2:ahcich14:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 d0 17 e2 40 07 00 00 00 00 00
(ada2:ahcich14:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich14:0:0:0): Retrying command
(ada2:ahcich14:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 28 18 e2 40 07 00 00 00 00 00
(ada2:ahcich14:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich14:0:0:0): Retrying command
(ada2:ahcich14:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 10 a0 ea 40 07 00 00 00 00 00
(ada2:ahcich14:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich14:0:0:0): Retrying command
(ada2:ahcich14:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 18 a1 ea 40 07 00 00 00 00 00
(ada2:ahcich14:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich14:0:0:0): Retrying command
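
To see how many of these errors each device has logged, a rough sketch like the following tallies the "Uncorrectable parity/CRC error" lines per device. The helper name `count_crc_errors` is an assumption for illustration, and it only matches the message format shown above.

```shell
#!/bin/sh
# count_crc_errors: hypothetical helper; reads dmesg-style text on stdin and
# prints "<device> <count>" for each device that logged a CAM CRC error.
count_crc_errors() {
    awk '/CAM status: Uncorrectable parity\/CRC error/ {
             dev = $1
             sub(/^\(/, "", dev)   # drop the leading "("
             sub(/:.*/, "", dev)   # keep only the device name, e.g. ada2
             n[dev]++
         }
         END { for (d in n) print d, n[d] }' | sort
}

# Demo on one error/retry pair in the format shown above.
count_crc_errors <<'EOF'
(ada2:ahcich14:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:ahcich14:0:0:0): Retrying command
EOF
```

On a live system this could be run as `dmesg | count_crc_errors`.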

 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
SMART test results? This looks like it could be a bad connection or cable.

Sent from my Nexus 5X using Tapatalk
 

MrPaul

Cadet
Joined
Apr 5, 2017
Messages
3
I don't have permission to run smart tests. I'll check out groups when I get back to the house...perhaps there is a wheel group or something similar.

Code:
[x@freenas ~]$ smartctl -t long /dev/ada2
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.3-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/xpt0 control device couldn't opened: Permission denied
Unable to get CAM device list
/dev/ada2: Unable to detect device type
Please specify device type with the -d option.

Use smartctl -h to get a usage summary

[x@freenas ~]$ sudo smartctl -t long /dev/ada2
Password:
Sorry, user xis not allowed to execute '/usr/local/sbin/smartctl -t long /dev/ada2' as root on freenas.local.


I ordered a new drive anyway for now.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Log in as root?

 

MrPaul

Cadet
Joined
Apr 5, 2017
Messages
3
Please refrain from the green drive lecture. I'm about to buy 4x 3TB WD Red drives and need to decide whether to build a second zpool (raidz2, mirror, something else) and migrate, or replace one drive at a time.

Specs:
FreeNAS-9.10-STABLE-201603252134 (412fb1c)
ASRock Rack Mini ITX DDR3 1333 Motherboards (C2550D4I)
Intel(R) Atom(TM) CPU C2550 @ 2.40GHz
16329MB ECC RAM

Using the GUI, I enabled sudo on my account, then verified that ada2 is the right drive:

Code:
[x@freenas ~]$ glabel status | grep 72e19b91-c283-11e2-be4f-000854aa3c36
gptid/72e19b91-c283-11e2-be4f-000854aa3c36	 N/A  ada2p2


Initiating a long test

Code:
[x@freenas ~]$ sudo smartctl -t long /dev/ada2	
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.3-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 319 minutes for test to complete.
Test will complete after Thu Apr  6 15:29:40 2017

Use smartctl -X to abort test.


Unfortunately, the long test will not complete. It just says it was "Interrupted (host reset)".

Code:
[x@freenas ~]$ sudo smartctl -a /dev/ada2
Password:
Sorry, try again.
Password:
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.3-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Green
Device Model:	 WDC WD20EARX-00MMMB0
Serial Number:	WD-WCAWZ0808072
LU WWN Device Id: 5 0014ee 205f6e643
Firmware Version: 80.00A80
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Thu Apr  6 10:13:40 2017 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
										was completed without error.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (  41) The self-test routine was interrupted
										by the host with a hard or soft reset.
Total time to complete Offline
data collection:				(33000) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 319) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x3035) SCT Status supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   1
  3 Spin_Up_Time			0x0027   191   143   021	Pre-fail  Always	   -	   7425
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   73
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   055   055   000	Old_age   Always	   -	   33275
10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   73
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   64
193 Load_Cycle_Count		0x0032   001   001   000	Old_age   Always	   -	   2735428
194 Temperature_Celsius	 0x0022   086   079   000	Old_age   Always	   -	   66
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   200   200   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Interrupted (host reset)	  90%	 33275		 -
# 2  Extended offline	Interrupted (host reset)	  70%	 33275		 -
# 3  Short offline	   Completed without error	   00%	 33270		 -
# 4  Short offline	   Completed without error	   00%	 33246		 -
# 5  Short offline	   Completed without error	   00%	 33222		 -
# 6  Short offline	   Completed without error	   00%	 33198		 -
# 7  Short offline	   Completed without error	   00%	 33178		 -
# 8  Short offline	   Completed without error	   00%	 33152		 -
# 9  Short offline	   Completed without error	   00%	 33128		 -
#10  Short offline	   Completed without error	   00%	 33104		 -
#11  Short offline	   Completed without error	   00%	 33080		 -
#12  Short offline	   Completed without error	   00%	 33056		 -
#13  Short offline	   Completed without error	   00%	 33032		 -
#14  Short offline	   Interrupted (host reset)	  80%	 33009		 -
#15  Short offline	   Completed without error	   00%	 32984		 -
#16  Short offline	   Completed without error	   00%	 32960		 -
#17  Short offline	   Completed without error	   00%	 32936		 -
#18  Short offline	   Completed without error	   00%	 32912		 -
#19  Short offline	   Completed without error	   00%	 32888		 -
#20  Short offline	   Completed without error	   00%	 32864		 -
#21  Short offline	   Interrupted (host reset)	  10%	 32842		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

droeders

Contributor
Joined
Mar 21, 2016
Messages
179
Code:
194 Temperature_Celsius	 0x0022   086   079   000	Old_age   Always	   -	   66

That drive is WAY too hot and looks like it's been even hotter in the past. You should definitely address your lack of sufficient drive cooling.

The extremely high Load_Cycle_Count also suggests you never adjusted the default idle time on this drive.
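
To put that Load_Cycle_Count in perspective, here is a back-of-the-envelope calculation using the raw values from the smartctl output above (attribute 193 = 2735428 load cycles, attribute 9 = 33275 power-on hours). The function name `cycles_per_hour` is just for illustration.

```shell
#!/bin/sh
# cycles_per_hour: hypothetical helper, not part of smartmontools.
# Prints the average head load/unload cycles per power-on hour,
# given the raw Load_Cycle_Count and Power_On_Hours values.
cycles_per_hour() {
    awk -v cycles="$1" -v hours="$2" 'BEGIN { printf "%.1f\n", cycles / hours }'
}

# Raw values for attributes 193 and 9 from the smartctl -a output above.
cycles_per_hour 2735428 33275   # prints 82.2
```

That is about 82 load cycles per hour, or the heads parking roughly once every 44 seconds for the entire time the drive has been powered on.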

If this were my drive, I would replace it immediately.
 