Single CKSUM error

Status
Not open for further replies.

IanWorthington

Contributor
Joined
Sep 13, 2013
Messages
144
A disk dropped out of my pool whilst I was doing some work on it (the disk was cold to the touch when I investigated). I reconnected it and then scrubbed the pool, but that disk is still showing a single CKSUM error. Is this expected, i.e. is the CKSUM count persistent? Or should I just clear the counters with zpool clear and rescrub?

Code:
 % zpool status VOLUME1
  pool: VOLUME1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
		attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
		using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 34h10m with 0 errors on Wed Nov  9 12:34:39 2016
config:

		NAME											STATE	 READ WRITE CKSUM
		VOLUME1										 ONLINE	   0	 0	 0
		  raidz3-0									  ONLINE	   0	 0	 0
			gptid/444302b9-da2a-11e3-90c3-002590878c66  ONLINE	   0	 0	 0
			gptid/44acaf47-da2a-11e3-90c3-002590878c66  ONLINE	   0	 0	 0
			gptid/458b61fe-da2a-11e3-90c3-002590878c66  ONLINE	   0	 0	 0
			gptid/45f04d30-da2a-11e3-90c3-002590878c66  ONLINE	   0	 0	 0
			gptid/46dd2963-da2a-11e3-90c3-002590878c66  ONLINE	   0	 0	 0
			gptid/47cdf0aa-da2a-11e3-90c3-002590878c66  ONLINE	   0	 0	 0
			gptid/48565317-da2a-11e3-90c3-002590878c66  ONLINE	   0	 0	 0
			gptid/48cd4928-da2a-11e3-90c3-002590878c66  ONLINE	   0	 0	 0
			gptid/4a58c8ac-da2a-11e3-90c3-002590878c66  ONLINE	   0	 0	 0
			gptid/4c508137-da2a-11e3-90c3-002590878c66  ONLINE	   0	 0	 0
			gptid/4566e651-a3d2-11e6-af18-002590878c66  ONLINE	   0	 0	 1

errors: No known data errors
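
For anyone who wants to check this kind of output from a script, here is a minimal sketch that pulls out the rows of a `zpool status` config table whose READ/WRITE/CKSUM counters are not all zero. The function name `devices_with_errors` and the shortened sample text are illustrative assumptions, not part of any ZFS tooling, and the parser only understands plain integer counters (abbreviated values such as `1.2K` are skipped).

```python
def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for config rows with a non-zero counter.

    Hypothetical helper: parses the human-readable text of `zpool status`.
    Rows whose counters are not plain integers (e.g. "1.2K") are skipped.
    """
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        # The config table starts after the "NAME STATE READ WRITE CKSUM" header.
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True
            continue
        if not in_config or len(fields) < 5:
            continue
        name, _state, read, write, cksum = fields[:5]
        if not (read.isdigit() and write.isdigit() and cksum.isdigit()):
            continue
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


# Shortened sample based on the output above (only two of the eleven disks kept).
sample = """
  pool: VOLUME1
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        VOLUME1                                         ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/444302b9-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/4566e651-a3d2-11e6-af18-002590878c66  ONLINE       0     0     1

errors: No known data errors
"""

print(devices_with_errors(sample))
```

As far as I know, a scrub on its own does not reset these counters; they stay until they are cleared with `zpool clear` (the command the status text itself suggests) or the pool is re-imported, so a script like this would keep flagging the disk until then.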

 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Have you configured SMART tests? It might be a good idea to figure out why the disk dropped out of your pool. Perhaps post smartctl -x output for the disk in question.
 

IanWorthington

Contributor
Joined
Sep 13, 2013
Messages
144
Yes, long SMART tests are performed weekly:

Code:
% sudo smartctl -a /dev/da4
Password:
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p13 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68L0BN1
Serial Number:	WD-WX31DA5D6NPK
LU WWN Device Id: 5 0014ee 2b7b9f6bb
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5700 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Thu Nov 10 14:21:07 2016 COT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				( 4244) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 696) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   199   199   021	Pre-fail  Always	   -	   9025
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   11
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   100   100   000	Old_age   Always	   -	   334
10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   11
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   7
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   21
194 Temperature_Celsius	 0x0022   115   105   000	Old_age   Always	   -	   37
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%		34		 -
# 2  Extended offline	Completed without error	   00%		12		 -
# 3  Conveyance offline  Completed without error	   00%		 0		 -
# 4  Short offline	   Completed without error	   00%		 0		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



(Is CODE formatting broken?)
 

IanWorthington

Contributor
Joined
Sep 13, 2013
Messages
144
Sorry, you asked for -x not -a:

Code:
% sudo smartctl -x /dev/da4
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p13 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68L0BN1
Serial Number:	WD-WX31DA5D6NPK
LU WWN Device Id: 5 0014ee 2b7b9f6bb
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5700 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Thu Nov 10 14:26:00 2016 COT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				( 4244) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 696) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 POSR-K   200   200   051	-	0
  3 Spin_Up_Time			POS--K   199   199   021	-	9025
  4 Start_Stop_Count		-O--CK   100   100   000	-	11
  5 Reallocated_Sector_Ct   PO--CK   200   200   140	-	0
  7 Seek_Error_Rate		 -OSR-K   100   253   000	-	0
  9 Power_On_Hours		  -O--CK   100   100   000	-	334
 10 Spin_Retry_Count		-O--CK   100   253   000	-	0
 11 Calibration_Retry_Count -O--CK   100   253   000	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   000	-	11
192 Power-Off_Retract_Count -O--CK   200   200   000	-	7
193 Load_Cycle_Count		-O--CK   200   200   000	-	21
194 Temperature_Celsius	 -O---K   115   105   000	-	37
196 Reallocated_Event_Count -O--CK   200   200   000	-	0
197 Current_Pending_Sector  -O--CK   200   200   000	-	0
198 Offline_Uncorrectable   ----CK   100   253   000	-	0
199 UDMA_CRC_Error_Count	-O--CK   200   200   000	-	0
200 Multi_Zone_Error_Rate   ---R--   200   200   000	-	0
							||||||_ K auto-keep
							|||||__ C event count
							||||___ R error rate
							|||____ S speed/performance
							||_____ O updated online
							|______ P prefailure warning

General Purpose Log Directory Version 1
SMART		   Log Directory Version 1 [multi-sector log support]
Address	Access  R/W   Size  Description
0x00	   GPL,SL  R/O	  1  Log Directory
0x01		   SL  R/O	  1  Summary SMART error log
0x02		   SL  R/O	  5  Comprehensive SMART error log
0x03	   GPL	 R/O	  6  Ext. Comprehensive SMART error log
0x06		   SL  R/O	  1  SMART self-test log
0x07	   GPL	 R/O	  1  Extended self-test log
0x09		   SL  R/W	  1  Selective self-test log
0x10	   GPL	 R/O	  1  NCQ Command Error log
0x11	   GPL	 R/O	  1  SATA Phy Event Counters
0x21	   GPL	 R/O	  1  Write stream error log
0x22	   GPL	 R/O	  1  Read stream error log
0x30	   GPL,SL  R/O	  9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS	  16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS	   1  Device vendor specific log
0xb7	   GPL,SL  VS	  54  Device vendor specific log
0xbd	   GPL,SL  VS	   1  Device vendor specific log
0xc0	   GPL,SL  VS	   1  Device vendor specific log
0xc1	   GPL	 VS	  93  Device vendor specific log
0xe0	   GPL,SL  R/W	  1  SCT Command/Status
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%		34		 -
# 2  Extended offline	Completed without error	   00%		12		 -
# 3  Conveyance offline  Completed without error	   00%		 0		 -
# 4  Short offline	   Completed without error	   00%		 0		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:				  3
SCT Version (vendor specific):	   258 (0x0102)
SCT Support Level:				   1
Device State:						Active (0)
Current Temperature:					37 Celsius
Power Cycle Min/Max Temperature:	 24/39 Celsius
Lifetime	Min/Max Temperature:	 20/47 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:	 2
Temperature Sampling Period:		 1 minute
Temperature Logging Interval:		1 minute
Min/Max recommended Temperature:	  0/60 Celsius
Min/Max Temperature Limit:		   -41/85 Celsius
Temperature History Size (Index):	478 (471)

Index	Estimated Time   Temperature Celsius
 472	2016-11-10 06:29	36  *****************
 ...	..(196 skipped).	..  *****************
 191	2016-11-10 09:46	36  *****************
 192	2016-11-10 09:47	37  ******************
 ...	..(278 skipped).	..  ******************
 471	2016-11-10 14:26	37  ******************

SCT Error Recovery Control:
		   Read:	 70 (7.0 seconds)
		  Write:	 70 (7.0 seconds)

Device Statistics (GP Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID	  Size	 Value  Description
0x0001  2			0  Command failed due to ICRC error
0x0002  2			0  R_ERR response for data FIS
0x0003  2			0  R_ERR response for device-to-host data FIS
0x0004  2			0  R_ERR response for host-to-device data FIS
0x0005  2			0  R_ERR response for non-data FIS
0x0006  2			0  R_ERR response for device-to-host non-data FIS
0x0007  2			0  R_ERR response for host-to-device non-data FIS
0x0008  2			0  Device-to-host non-data FIS retries
0x0009  2			1  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2		   11  Device-to-host register FISes sent due to a COMRESET
0x000b  2			0  CRC errors within host-to-device FIS
0x000d  2			0  Non-CRC errors within host-to-device FIS
0x000f  2			0  R_ERR response for host-to-device data FIS, CRC
0x0012  2			0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4	   216217  Vendor specific
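
The only non-zero entries in the Phy event table above are 0x0009, 0x000a and the vendor-specific 0x8000. A small sketch for filtering that table out of saved `smartctl -x` text might look like the following; the function name `nonzero_phy_counters` and the trimmed sample are assumptions made for illustration, and the parser relies on the four-column layout shown above.

```python
def nonzero_phy_counters(text):
    """Return [(id, value, description)] for Phy event counters with a non-zero value.

    Hypothetical helper: expects rows shaped like "0x0009  2  1  Description".
    """
    rows = []
    for line in text.splitlines():
        # Split into at most four columns so the description keeps its spaces.
        fields = line.split(None, 3)
        if len(fields) != 4:
            continue
        counter_id, _size, value, description = fields
        if counter_id.startswith("0x") and value.isdigit() and int(value) != 0:
            rows.append((counter_id, int(value), description))
    return rows


# Trimmed sample taken from the table above.
phy_sample = """
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0009  2            1  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           11  Device-to-host register FISes sent due to a COMRESET
0x8000  4       216217  Vendor specific
"""

for counter_id, value, description in nonzero_phy_counters(phy_sample):
    print(counter_id, value, description)
```

The header row and the log directory rows earlier in the output are ignored because their third column is not a plain number.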

 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Sorry, you asked for -x not -a:

Code:
Device Model:	 WDC WD60EFRX-68L0BN1

I think we have the same problem. After replacing almost everything in my machine, it turns out that the drive with the worst behaviour is a WD60EFRX-68L0BN1, just like yours. The other disks in my zpool, which are of type WD60EFRX-68MYMN1, do not exhibit the same behaviour (throwing SCSI errors).
 