zpool scrub issues


petr

Contributor
Joined
Jun 13, 2013
Messages
142
Both short and long self-tests run in the background on the drive. If real data transfer is going on at the same time, they can become very slow due to head contention. The SMART test might not time out, which would be consistent with what you are seeing. But it is also consistent with a drive that has a lot of errors and retries.
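
The drive reports later in this thread show a self-test execution status of 241 with 10% of the test remaining. As a minimal sketch, assuming the usual ATA SMART layout in which the upper four bits of that byte hold the state and the lower four bits hold the remaining work in steps of 10%, the value can be decoded like this (the function name is made up for illustration):

```python
def decode_self_test_status(value):
    """Split the self-test execution status byte into (state code, percent remaining)."""
    state = (value >> 4) & 0x0F        # upper nibble; 15 is assumed to mean "test in progress"
    remaining = (value & 0x0F) * 10    # lower nibble counts remaining work in 10% steps
    return state, remaining


# 241 is the value printed next to "Self-test execution status" in the reports.
state, remaining = decode_self_test_status(241)
print(f"state code {state}, {remaining}% remaining")
```

Polling this value over time is one way to tell a test that is merely slow from one that is not advancing at all.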



I tried to read that, and even after allowing a lot of JavaScript (which is normally forbidden as a security measure), I still couldn't read it. Posting it in code tags here should work.

Anyway, I was going to look to see whether the problem drives had already accumulated some bad blocks, which would be an indicator they were on the way out.
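
One way to make that check repeatable is to pull the bad-block counters out of the attribute table. The sketch below is only an illustration: the function name is hypothetical, the choice of attributes 5, 197 and 198 as the indicators is an assumption, and the column positions assume the `smartctl -x` layout (the older `-A` layout has more columns). The sample rows are copied from the da1 report posted further down.

```python
# Attributes commonly read as bad-block indicators (an assumption, not a vendor rule).
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def bad_block_counts(report):
    """Return {attribute name: raw value} for the watched attributes in a -x report."""
    counts = {}
    for line in report.splitlines():
        fields = line.split()
        # -x layout: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE...
        if len(fields) >= 8 and fields[0].isdigit() and fields[1] in WATCHED:
            counts[fields[1]] = int(fields[7])
    return counts


SAMPLE = """\
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
"""

counts = bad_block_counts(SAMPLE)
print(counts)
print("bad blocks recorded" if any(counts.values()) else "no bad blocks recorded")
```

For the three drives in this thread, all three counters have a raw value of 0.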

See here:

Code:
# smartctl -x /dev/da1
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD80EFZX-68UW8N0
Serial Number:	R6GVGPWY
LU WWN Device Id: 5 000cca 263cc08da
Firmware Version: 83.H0A83
User Capacity:	8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Tue Jul 24 21:12:00 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  ( 241)	Self-test routine in progress...
					10% of test remaining.
Total time to complete Offline
data collection:		 (  101) seconds.
Offline data collection
capabilities:			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 (1376) minutes.
SCT capabilities:			(0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 PO-R--   100   100   016	-	0
  2 Throughput_Performance  P-S---   132   132   054	-	112
  3 Spin_Up_Time			POS---   152   152   024	-	429 (Average 438)
  4 Start_Stop_Count		-O--C-   100   100   000	-	34
  5 Reallocated_Sector_Ct   PO--CK   100   100   005	-	0
  7 Seek_Error_Rate		 PO-R--   100   100   067	-	0
  8 Seek_Time_Performance   P-S---   140   140   020	-	15
  9 Power_On_Hours		  -O--C-   100   100   000	-	3856
 10 Spin_Retry_Count		PO--C-   100   100   060	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   000	-	27
 22 Helium_Level			PO---K   100   100   025	-	100
192 Power-Off_Retract_Count -O--CK   100   100   000	-	128
193 Load_Cycle_Count		-O--C-   100   100   000	-	128
194 Temperature_Celsius	 -O----   151   151   000	-	43 (Min/Max 18/44)
196 Reallocated_Event_Count -O--CK   100   100   000	-	0
197 Current_Pending_Sector  -O---K   100   100   000	-	0
198 Offline_Uncorrectable   ---R--   100   100   000	-	0
199 UDMA_CRC_Error_Count	-O-R--   200   200   000	-	0
							||||||_ K auto-keep
							|||||__ C event count
							||||___ R error rate
							|||____ S speed/performance
							||_____ O updated online
							|______ P prefailure warning

General Purpose Log Directory Version 1
SMART		   Log Directory Version 1 [multi-sector log support]
Address	Access  R/W   Size  Description
0x00	   GPL,SL  R/O	  1  Log Directory
0x01		   SL  R/O	  1  Summary SMART error log
0x02		   SL  R/O	  1  Comprehensive SMART error log
0x03	   GPL	 R/O	  1  Ext. Comprehensive SMART error log
0x04	   GPL,SL  R/O	  8  Device Statistics log
0x06		   SL  R/O	  1  SMART self-test log
0x07	   GPL	 R/O	  1  Extended self-test log
0x08	   GPL	 R/O	  2  Power Conditions log
0x09		   SL  R/W	  1  Selective self-test log
0x10	   GPL	 R/O	  1  NCQ Command Error log
0x11	   GPL	 R/O	  1  SATA Phy Event Counters log
0x12	   GPL	 R/O	  1  SATA NCQ Non-Data log
0x15	   GPL	 R/W	  1  Rebuild Assist log
0x21	   GPL	 R/O	  1  Write stream error log
0x22	   GPL	 R/O	  1  Read stream error log
0x24	   GPL	 R/O	256  Current Device Internal Status Data log
0x25	   GPL	 R/O	256  Saved Device Internal Status Data log
0x30	   GPL,SL  R/O	  9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log
0xe0	   GPL,SL  R/W	  1  SCT Command/Status
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%		 1		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:				  3
SCT Version (vendor specific):	   256 (0x0100)
SCT Support Level:				   1
Device State:						DST executing in background (3)
Current Temperature:					43 Celsius
Power Cycle Min/Max Temperature:	 28/44 Celsius
Lifetime	Min/Max Temperature:	 18/44 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:	 2
Temperature Sampling Period:		 1 minute
Temperature Logging Interval:		1 minute
Min/Max recommended Temperature:	  0/65 Celsius
Min/Max Temperature Limit:		   -40/70 Celsius
Temperature History Size (Index):	128 (15)

Index	Estimated Time   Temperature Celsius
  16	2018-07-24 19:05	42  ***********************
 ...	..( 66 skipped).	..  ***********************
  83	2018-07-24 20:12	42  ***********************
  84	2018-07-24 20:13	43  ************************
  85	2018-07-24 20:14	42  ***********************
 ...	..( 44 skipped).	..  ***********************
   2	2018-07-24 20:59	42  ***********************
   3	2018-07-24 21:00	43  ************************
   4	2018-07-24 21:01	42  ***********************
 ...	..(  4 skipped).	..  ***********************
   9	2018-07-24 21:06	42  ***********************
  10	2018-07-24 21:07	43  ************************
  11	2018-07-24 21:08	42  ***********************
  12	2018-07-24 21:09	42  ***********************
  13	2018-07-24 21:10	43  ************************
  14	2018-07-24 21:11	43  ************************
  15	2018-07-24 21:12	42  ***********************

SCT Error Recovery Control:
		   Read:	 70 (7.0 seconds)
		  Write:	 70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size		Value Flags Description
0x01  =====  =			   =  ===  == General Statistics (rev 2) ==
0x01  0x008  4			  27  ---  Lifetime Power-On Resets
0x01  0x018  6	129195070663  ---  Logical Sectors Written
0x01  0x020  6	  1090273029  ---  Number of Write Commands
0x01  0x028  6	191350182873  ---  Logical Sectors Read
0x01  0x030  6	  1175510861  ---  Number of Read Commands
0x01  0x038  6	 13883627700  ---  Date and Time TimeStamp
0x03  =====  =			   =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4			2367  ---  Spindle Motor Power-on Hours
0x03  0x010  4			2367  ---  Head Flying Hours
0x03  0x018  4			 128  ---  Head Load Events
0x03  0x020  4			   0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4			  27  ---  Read Recovery Attempts
0x03  0x030  4			   0  ---  Number of Mechanical Start Failures
0x04  =====  =			   =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4			   0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4			   2  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =			   =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1			  43  ---  Current Temperature
0x05  0x010  1			  42  N--  Average Short Term Temperature
0x05  0x018  1			  41  N--  Average Long Term Temperature
0x05  0x020  1			  44  ---  Highest Temperature
0x05  0x028  1			  18  ---  Lowest Temperature
0x05  0x030  1			  43  N--  Highest Average Short Term Temperature
0x05  0x038  1			  20  N--  Lowest Average Short Term Temperature
0x05  0x040  1			  41  N--  Highest Average Long Term Temperature
0x05  0x048  1			  25  N--  Lowest Average Long Term Temperature
0x05  0x050  4			   0  ---  Time in Over-Temperature
0x05  0x058  1			  65  ---  Specified Maximum Operating Temperature
0x05  0x060  4			   0  ---  Time in Under-Temperature
0x05  0x068  1			   0  ---  Specified Minimum Operating Temperature
0x06  =====  =			   =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4			   0  ---  Number of Hardware Resets
0x06  0x010  4			   0  ---  Number of ASR Events
0x06  0x018  4			   0  ---  Number of Interface CRC Errors
								|||_ C monitored condition met
								||__ D supports DSN
								|___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID	  Size	 Value  Description
0x0001  2			0  Command failed due to ICRC error
0x0002  2			0  R_ERR response for data FIS
0x0003  2			0  R_ERR response for device-to-host data FIS
0x0004  2			0  R_ERR response for host-to-device data FIS
0x0005  2			0  R_ERR response for non-data FIS
0x0006  2			0  R_ERR response for device-to-host non-data FIS
0x0007  2			0  R_ERR response for host-to-device non-data FIS
0x0008  2			0  Device-to-host non-data FIS retries
0x0009  2			0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2			1  Device-to-host register FISes sent due to a COMRESET
0x000b  2			0  CRC errors within host-to-device FIS
0x000d  2			0  Non-CRC errors within host-to-device FIS


Another drive:
Code:
# smartctl -x /dev/da2
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD80EFZX-68UW8N0
Serial Number:	R6GP0K9Y
LU WWN Device Id: 5 000cca 263c98de3
Firmware Version: 83.H0A83
User Capacity:	8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Tue Jul 24 21:13:08 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  ( 241)	Self-test routine in progress...
					10% of test remaining.
Total time to complete Offline 
data collection:		 (  101) seconds.
Offline data collection
capabilities:			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 (1342) minutes.
SCT capabilities:			(0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 PO-R--   100   100   016	-	0
  2 Throughput_Performance  P-S---   131   131   054	-	116
  3 Spin_Up_Time			POS---   154   154   024	-	419 (Average 436)
  4 Start_Stop_Count		-O--C-   100   100   000	-	486
  5 Reallocated_Sector_Ct   PO--CK   100   100   005	-	0
  7 Seek_Error_Rate		 PO-R--   100   100   067	-	0
  8 Seek_Time_Performance   P-S---   140   140   020	-	15
  9 Power_On_Hours		  -O--C-   100   100   000	-	4524
 10 Spin_Retry_Count		PO--C-   100   100   060	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   000	-	34
 22 Helium_Level			PO---K   100   100   025	-	100
192 Power-Off_Retract_Count -O--CK   100   100   000	-	702
193 Load_Cycle_Count		-O--C-   100   100   000	-	702
194 Temperature_Celsius	 -O----   147   147   000	-	44 (Min/Max 19/46)
196 Reallocated_Event_Count -O--CK   100   100   000	-	0
197 Current_Pending_Sector  -O---K   100   100   000	-	0
198 Offline_Uncorrectable   ---R--   100   100   000	-	0
199 UDMA_CRC_Error_Count	-O-R--   200   200   000	-	0
							||||||_ K auto-keep
							|||||__ C event count
							||||___ R error rate
							|||____ S speed/performance
							||_____ O updated online
							|______ P prefailure warning

General Purpose Log Directory Version 1
SMART		   Log Directory Version 1 [multi-sector log support]
Address	Access  R/W   Size  Description
0x00	   GPL,SL  R/O	  1  Log Directory
0x01		   SL  R/O	  1  Summary SMART error log
0x02		   SL  R/O	  1  Comprehensive SMART error log
0x03	   GPL	 R/O	  1  Ext. Comprehensive SMART error log
0x04	   GPL,SL  R/O	  8  Device Statistics log
0x06		   SL  R/O	  1  SMART self-test log
0x07	   GPL	 R/O	  1  Extended self-test log
0x08	   GPL	 R/O	  2  Power Conditions log
0x09		   SL  R/W	  1  Selective self-test log
0x10	   GPL	 R/O	  1  NCQ Command Error log
0x11	   GPL	 R/O	  1  SATA Phy Event Counters log
0x12	   GPL	 R/O	  1  SATA NCQ Non-Data log
0x15	   GPL	 R/W	  1  Rebuild Assist log
0x21	   GPL	 R/O	  1  Write stream error log
0x22	   GPL	 R/O	  1  Read stream error log
0x24	   GPL	 R/O	256  Current Device Internal Status Data log
0x25	   GPL	 R/O	256  Saved Device Internal Status Data log
0x30	   GPL,SL  R/O	  9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log
0xe0	   GPL,SL  R/W	  1  SCT Command/Status
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	   670		 -
# 2  Short offline	   Completed without error	   00%		30		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:				  3
SCT Version (vendor specific):	   256 (0x0100)
SCT Support Level:				   1
Device State:						DST executing in background (3)
Current Temperature:					44 Celsius
Power Cycle Min/Max Temperature:	 27/46 Celsius
Lifetime	Min/Max Temperature:	 19/46 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:	 2
Temperature Sampling Period:		 1 minute
Temperature Logging Interval:		1 minute
Min/Max recommended Temperature:	  0/65 Celsius
Min/Max Temperature Limit:		   -40/70 Celsius
Temperature History Size (Index):	128 (96)

Index	Estimated Time   Temperature Celsius
  97	2018-07-24 19:06	44  *************************
 ...	..(126 skipped).	..  *************************
  96	2018-07-24 21:13	44  *************************

SCT Error Recovery Control:
		   Read:	 70 (7.0 seconds)
		  Write:	 70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size		Value Flags Description
0x01  =====  =			   =  ===  == General Statistics (rev 2) ==
0x01  0x008  4			  34  ---  Lifetime Power-On Resets
0x01  0x018  6	172081242871  ---  Logical Sectors Written
0x01  0x020  6	  1223326718  ---  Number of Write Commands
0x01  0x028  6	215958952504  ---  Logical Sectors Read
0x01  0x030  6	  1270614150  ---  Number of Read Commands
0x01  0x038  6	 16288697200  ---  Date and Time TimeStamp
0x03  =====  =			   =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4			2694  ---  Spindle Motor Power-on Hours
0x03  0x010  4			2694  ---  Head Flying Hours
0x03  0x018  4			 702  ---  Head Load Events
0x03  0x020  4			   0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4			   3  ---  Read Recovery Attempts
0x03  0x030  4			   0  ---  Number of Mechanical Start Failures
0x04  =====  =			   =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4			   0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4			   9  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =			   =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1			  44  ---  Current Temperature
0x05  0x010  1			  44  N--  Average Short Term Temperature
0x05  0x018  1			  43  N--  Average Long Term Temperature
0x05  0x020  1			  46  ---  Highest Temperature
0x05  0x028  1			  19  ---  Lowest Temperature
0x05  0x030  1			  45  N--  Highest Average Short Term Temperature
0x05  0x038  1			  21  N--  Lowest Average Short Term Temperature
0x05  0x040  1			  43  N--  Highest Average Long Term Temperature
0x05  0x048  1			  25  N--  Lowest Average Long Term Temperature
0x05  0x050  4			   0  ---  Time in Over-Temperature
0x05  0x058  1			  65  ---  Specified Maximum Operating Temperature
0x05  0x060  4			   0  ---  Time in Under-Temperature
0x05  0x068  1			   0  ---  Specified Minimum Operating Temperature
0x06  =====  =			   =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4			   0  ---  Number of Hardware Resets
0x06  0x010  4			  19  ---  Number of ASR Events
0x06  0x018  4			   0  ---  Number of Interface CRC Errors
								|||_ C monitored condition met
								||__ D supports DSN
								|___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID	  Size	 Value  Description
0x0001  2			0  Command failed due to ICRC error
0x0002  2			0  R_ERR response for data FIS
0x0003  2			0  R_ERR response for device-to-host data FIS
0x0004  2			0  R_ERR response for host-to-device data FIS
0x0005  2			0  R_ERR response for non-data FIS
0x0006  2			0  R_ERR response for device-to-host non-data FIS
0x0007  2			0  R_ERR response for host-to-device non-data FIS
0x0008  2			0  Device-to-host non-data FIS retries
0x0009  2			0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2			1  Device-to-host register FISes sent due to a COMRESET
0x000b  2			0  CRC errors within host-to-device FIS
0x000d  2			0  Non-CRC errors within host-to-device FIS




And a third:
Code:
# smartctl -x /dev/da3
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD80EFZX-68UW8N0
Serial Number:	R6GMY8UY
LU WWN Device Id: 5 000cca 263c910fb
Firmware Version: 83.H0A83
User Capacity:	8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Tue Jul 24 21:13:58 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  ( 241)	Self-test routine in progress...
					10% of test remaining.
Total time to complete Offline 
data collection:		 (  101) seconds.
Offline data collection
capabilities:			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 (1317) minutes.
SCT capabilities:			(0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 PO-R--   100   100   016	-	0
  2 Throughput_Performance  P-S---   131   131   054	-	116
  3 Spin_Up_Time			POS---   152   152   024	-	427 (Average 436)
  4 Start_Stop_Count		-O--C-   100   100   000	-	329
  5 Reallocated_Sector_Ct   PO--CK   100   100   005	-	0
  7 Seek_Error_Rate		 PO-R--   100   100   067	-	0
  8 Seek_Time_Performance   P-S---   140   140   020	-	15
  9 Power_On_Hours		  -O--C-   100   100   000	-	4268
 10 Spin_Retry_Count		PO--C-   100   100   060	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   000	-	29
 22 Helium_Level			PO---K   100   100   025	-	100
192 Power-Off_Retract_Count -O--CK   100   100   000	-	709
193 Load_Cycle_Count		-O--C-   100   100   000	-	709
194 Temperature_Celsius	 -O----   144   144   000	-	45 (Min/Max 19/47)
196 Reallocated_Event_Count -O--CK   100   100   000	-	0
197 Current_Pending_Sector  -O---K   100   100   000	-	0
198 Offline_Uncorrectable   ---R--   100   100   000	-	0
199 UDMA_CRC_Error_Count	-O-R--   200   200   000	-	0
							||||||_ K auto-keep
							|||||__ C event count
							||||___ R error rate
							|||____ S speed/performance
							||_____ O updated online
							|______ P prefailure warning

General Purpose Log Directory Version 1
SMART		   Log Directory Version 1 [multi-sector log support]
Address	Access  R/W   Size  Description
0x00	   GPL,SL  R/O	  1  Log Directory
0x01		   SL  R/O	  1  Summary SMART error log
0x02		   SL  R/O	  1  Comprehensive SMART error log
0x03	   GPL	 R/O	  1  Ext. Comprehensive SMART error log
0x04	   GPL,SL  R/O	  8  Device Statistics log
0x06		   SL  R/O	  1  SMART self-test log
0x07	   GPL	 R/O	  1  Extended self-test log
0x08	   GPL	 R/O	  2  Power Conditions log
0x09		   SL  R/W	  1  Selective self-test log
0x10	   GPL	 R/O	  1  NCQ Command Error log
0x11	   GPL	 R/O	  1  SATA Phy Event Counters log
0x12	   GPL	 R/O	  1  SATA NCQ Non-Data log
0x15	   GPL	 R/W	  1  Rebuild Assist log
0x21	   GPL	 R/O	  1  Write stream error log
0x22	   GPL	 R/O	  1  Read stream error log
0x24	   GPL	 R/O	256  Current Device Internal Status Data log
0x25	   GPL	 R/O	256  Saved Device Internal Status Data log
0x30	   GPL,SL  R/O	  9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log
0xe0	   GPL,SL  R/W	  1  SCT Command/Status
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	   413		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:				  3
SCT Version (vendor specific):	   256 (0x0100)
SCT Support Level:				   1
Device State:						DST executing in background (3)
Current Temperature:					45 Celsius
Power Cycle Min/Max Temperature:	 27/47 Celsius
Lifetime	Min/Max Temperature:	 19/47 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:	 2
Temperature Sampling Period:		 1 minute
Temperature Logging Interval:		1 minute
Min/Max recommended Temperature:	  0/65 Celsius
Min/Max Temperature Limit:		   -40/70 Celsius
Temperature History Size (Index):	128 (125)

Index	Estimated Time   Temperature Celsius
 126	2018-07-24 19:06	45  **************************
 ...	..(126 skipped).	..  **************************
 125	2018-07-24 21:13	45  **************************

SCT Error Recovery Control:
		   Read:	 70 (7.0 seconds)
		  Write:	 70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size		Value Flags Description
0x01  =====  =			   =  ===  == General Statistics (rev 2) ==
0x01  0x008  4			  29  ---  Lifetime Power-On Resets
0x01  0x018  6	166439738855  ---  Logical Sectors Written
0x01  0x020  6	  1374718695  ---  Number of Write Commands
0x01  0x028  6	192073413993  ---  Logical Sectors Read
0x01  0x030  6	  1232777306  ---  Number of Read Commands
0x01  0x038  6	 15367023050  ---  Date and Time TimeStamp
0x03  =====  =			   =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4			2532  ---  Spindle Motor Power-on Hours
0x03  0x010  4			2532  ---  Head Flying Hours
0x03  0x018  4			 709  ---  Head Load Events
0x03  0x020  4			   0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4			  60  ---  Read Recovery Attempts
0x03  0x030  4			   0  ---  Number of Mechanical Start Failures
0x04  =====  =			   =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4			   0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4			   4  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =			   =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1			  45  ---  Current Temperature
0x05  0x010  1			  45  N--  Average Short Term Temperature
0x05  0x018  1			  44  N--  Average Long Term Temperature
0x05  0x020  1			  47  ---  Highest Temperature
0x05  0x028  1			  19  ---  Lowest Temperature
0x05  0x030  1			  46  N--  Highest Average Short Term Temperature
0x05  0x038  1			  23  N--  Lowest Average Short Term Temperature
0x05  0x040  1			  44  N--  Highest Average Long Term Temperature
0x05  0x048  1			  25  N--  Lowest Average Long Term Temperature
0x05  0x050  4			   0  ---  Time in Over-Temperature
0x05  0x058  1			  65  ---  Specified Maximum Operating Temperature
0x05  0x060  4			   0  ---  Time in Under-Temperature
0x05  0x068  1			   0  ---  Specified Minimum Operating Temperature
0x06  =====  =			   =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4			   0  ---  Number of Hardware Resets
0x06  0x010  4			  11  ---  Number of ASR Events
0x06  0x018  4			   0  ---  Number of Interface CRC Errors
								|||_ C monitored condition met
								||__ D supports DSN
								|___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID	  Size	 Value  Description
0x0001  2			0  Command failed due to ICRC error
0x0002  2			0  R_ERR response for data FIS
0x0003  2			0  R_ERR response for device-to-host data FIS
0x0004  2			0  R_ERR response for host-to-device data FIS
0x0005  2			0  R_ERR response for non-data FIS
0x0006  2			0  R_ERR response for device-to-host non-data FIS
0x0007  2			0  R_ERR response for host-to-device non-data FIS
0x0008  2			0  Device-to-host non-data FIS retries
0x0009  2			0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2			1  Device-to-host register FISes sent due to a COMRESET
0x000b  2			0  CRC errors within host-to-device FIS
0x000d  2			0  Non-CRC errors within host-to-device FIS


 

petr

Also, I would like to add that apart from the lights on the array being lit up, it seems to be working fine.
 
Joined
Dec 29, 2014
Messages
1,135
Re. controller - would other volumes on the same controller also be experiencing problems? Everything else seems to be working just fine.

If it were the controller, it would affect all volumes. Given that other volumes are fine, I would suspect drives, SATA/SAS cables, and then power in that order.
 

petr

If it were the controller, it would affect all volumes. Given that other volumes are fine, I would suspect drives, SATA/SAS cables, and then power in that order.

Why do you think the issue is with hardware at all? I have identified a stray jail that was using a lot of I/O on the array, which may explain why the self-test is stuck and not completing.

After I killed the jail, the lights on the front of the enclosure stopped being constantly lit. Apart from the scrub not working, I am seeing full performance from the array: everything is working as expected, speeds are as expected, etc.
 
Joined
Dec 29, 2014
Messages
1,135
Why do you think the issue is with hardware at all? I have identified a stray jail that was using a lot of I/O on the array, which may explain why the self-test is stuck and not completing.

I didn't see a mention of the possessed jail process until now. If all the SMART tests and scrubs run now, then that is great. I do find it a bit odd that those things would time out and look like failed hardware when they were just way too busy.
 

petr

I didn't see a mention of the possessed jail process until now. If all the SMART tests and scrubs run now, then that is great. I do find it a bit odd that those things would time out and look like failed hardware when they were just way too busy.

Sorry, should have been clearer - SMART tests are still the same (but I am expecting they may complete).

Still no luck with a scrub - whenever I start it, it's stuck at 0%.
 
Joined
Dec 29, 2014
Messages
1,135
Sorry, should have been clearer

No sweat.

Still no luck with a scrub - whenever I start it, it's stuck at 0%.

Bummer. That makes my gut still lean towards hardware. I would try (if you can) running the SMART tests and scrubs while the system has no load and all jails/iocages are stopped. If not, do you have a spare drive so you can try to replace one and see if the problem follows the drive or the connection point?
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
Had time to review SMART logs now, and as expected after killing the stray jail, the tests running on the 3 drives all completed without issues:

Code:
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%             4313  -
# 2  Short offline       Completed without error        00%              413  -



(Pretty much the same output for all 3 drives. I will now run the tests on the rest of the pool, but my gut tells me that there is no issue with the drives, as the array's performance is absolutely normal.)
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
I thought I would post an update in case anyone has any clue - I have updated to FreeNAS-11.1-U6 and the issue still persists.

The scrub just goes through the first 1.56TB very fast, then it grinds to a stop:

Code:
  pool: mainsafe
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
		still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
		the pool may no longer be accessible by software that does not support
		the features. See zpool-features(7) for details.
  scan: scrub in progress since Tue Oct 16 12:35:44 2018
		1.56T scanned at 486M/s, 1008M issued at 2.81M/s, 31.1T total
		0 repaired, 0.00% done, no estimated completion time
config:

		NAME											STATE	 READ WRITE CKSUM
		mainsafe										ONLINE	   0	 0	 0
		  raidz3-0									  ONLINE	   0	 0	 0
			gptid/a4b8bb14-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
			gptid/a5859ce3-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
			gptid/a647bcbb-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
			gptid/a70459ab-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
			gptid/a7c75791-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
			gptid/a88e1cd5-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
			gptid/a94e05d1-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0


As before, everything else works absolutely fine, no errors in the log, all other volumes that are in the same enclosure are just fine.

I am getting a bit worried that my volume has not gone through a scrub for a while - any ideas what I can test?

Any help would be greatly appreciated!

Is there any way of knowing which area or drives the scrub is hitting before it freezes?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It's not stopped, according to the output, just very slow. It's scanning decently quickly and slowly issuing. For a meaningful analysis, you have to look at it over time.
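To look at it over time, one option is to sample the `scan:` block of `zpool status` at intervals and compare the issued figure between samples. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the size suffixes (K, M, G, T) are binary units and that the line layout matches the output posted in this thread.

```python
import re
import subprocess

# Assumption: zpool prints sizes with binary suffixes.
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}

# Matches e.g. "1.56T scanned at 486M/s, 1008M issued at 2.81M/s, 31.1T total"
SCAN_RE = re.compile(
    r"([\d.]+)([BKMGTP]) scanned at ([\d.]+)([BKMGTP])/s, "
    r"([\d.]+)([BKMGTP]) issued at ([\d.]+)([BKMGTP])/s, "
    r"([\d.]+)([BKMGTP]) total"
)


def to_bytes(value, unit):
    """Convert a number and a one-letter suffix into bytes."""
    return float(value) * UNITS[unit]


def parse_scan_line(text):
    """Return scanned/issued/total in bytes, or None if no scan line is found."""
    m = SCAN_RE.search(text)
    if m is None:
        return None
    g = m.groups()
    return {
        "scanned": to_bytes(g[0], g[1]),
        "issued": to_bytes(g[4], g[5]),
        "total": to_bytes(g[8], g[9]),
    }


def issued_delta(earlier, later):
    """Bytes issued between two samples; a value near zero suggests a stall."""
    return later["issued"] - earlier["issued"]


def read_status(pool):
    """Hypothetical helper: capture the text of `zpool status <pool>`."""
    result = subprocess.run(
        ["zpool", "status", pool], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `parse_scan_line(read_status("mainsafe"))` every few minutes and logging `issued_delta` between consecutive samples should show whether the scrub is still making progress or has actually stopped.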
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
It's not stopped, according to the output, just very slow. It's scanning decently quickly and slowly issuing. For a meaningful analysis, you have to look at it over time.

Sorry, I should have waited longer - at some point it just stops, and the issued rate keeps getting smaller until it reaches zero (it had been hanging at that value for the last ~2 weeks before I restarted).

I am running another test now - I've disabled all jails even if they did not really touch the pool and turned off all network shares.

Fingers crossed - we seem to be progressing.

I suspect something odd in one of the jails is blocking the scrub. I have a suspicion about one jail in particular, which took an inordinate amount of time to stop.

Generally this seems more normal:

Code:
# zpool status mainsafe
  pool: mainsafe
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
		still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
		the pool may no longer be accessible by software that does not support
		the features. See zpool-features(7) for details.
  scan: scrub in progress since Tue Oct 16 12:52:13 2018
		7.75T scanned at 1017M/s, 7.33T issued at 963M/s, 31.1T total
		0 repaired, 23.60% done, 0 days 07:10:55 to go
config:

		NAME											STATE	 READ WRITE CKSUM
		mainsafe										ONLINE	   0	 0	 0
		  raidz3-0									  ONLINE	   0	 0	 0
			gptid/a4b8bb14-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
			gptid/a5859ce3-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
			gptid/a647bcbb-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
			gptid/a70459ab-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
			gptid/a7c75791-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
			gptid/a88e1cd5-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
			gptid/a94e05d1-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0

errors: No known data errors


Will re-run the scrub once I re-enable network shares and the offending jail and see.

In general - is there any reason why network shares or a jail would be blocking a scrub?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
They shouldn't, so something is definitely wrong.
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
They shouldn't, so something is definitely wrong.

I know, very weird - I have a somewhat positive update: after disabling all services apart from SSH and stopping all of the jails, the scrub just sailed through without any issue.

Next test now - re-enabling all of the services (no jails yet) and trying the scrub again.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yeah, definitely narrow this one down so that it can be investigated.
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
Yeah, definitely narrow this one down so that it can be investigated.

Update - all good after re-enabling services (a bit slower but that's to be expected when the pool is in use).

Will start re-enabling jails now.
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
Yeah, definitely narrow this one down so that it can be investigated.

OK, an update - after enabling the first batch of jails, all seemed OK. After the second batch, we are back to a stuck scrub:

Code:
# zpool status mainsafe
  pool: mainsafe
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(7) for details.
  scan: scrub in progress since Mon Oct 22 11:03:10 2018
	20.6T scanned at 215M/s, 20.4T issued at 213M/s, 30.2T total
	0 repaired, 67.56% done, 0 days 13:24:22 to go
config:

	NAME											STATE	 READ WRITE CKSUM
	mainsafe										ONLINE	   0	 0	 0
	  raidz3-0									  ONLINE	   0	 0	 0
		gptid/a4b8bb14-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
		gptid/a5859ce3-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
		gptid/a647bcbb-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
		gptid/a70459ab-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
		gptid/a7c75791-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
		gptid/a88e1cd5-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
		gptid/a94e05d1-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0

errors: No known data errors


I will double-check in a few hours, but the speed has been going down (and the ETA up), and the percentage done has been stuck for the last 2-3 hours.

Any idea what to do next? I can try narrowing it down to a single jail, but it would be good to know what could possibly be blocking the scrub.
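As a sanity check on the status output above, the percentage and ETA appear to follow directly from the issued amount, the total, and the issue rate. The sketch below redoes that arithmetic, assuming T and M are binary units. If the displayed rate is an average since the scrub started (an assumption, not something confirmed in this thread), a stall would look exactly like this: the rate drifts down, the ETA drifts up, and the percentage stays flat.

```python
TIB_IN_MIB = 1024 * 1024  # assumption: zpool's T and M suffixes are binary units


def percent_done(issued_t, total_t):
    """Share of the pool already issued, in percent."""
    return 100.0 * issued_t / total_t


def eta_seconds(issued_t, total_t, issue_rate_m_per_s):
    """Remaining data divided by the displayed issue rate."""
    remaining_mib = (total_t - issued_t) * TIB_IN_MIB
    return remaining_mib / issue_rate_m_per_s


# Figures from the status output: 20.4T issued at 213M/s, 30.2T total
pct = percent_done(20.4, 30.2)
eta = eta_seconds(20.4, 30.2, 213)
hours, rem = divmod(int(eta), 3600)
print(f"{pct:.2f}% done, about {hours}h {rem // 60}m to go")
```

The result lands within rounding distance of the reported 67.56% and 13:24:22, so the displayed ETA by itself says nothing about whether data is still being issued right now. Comparing the issued amount between two samples is the more reliable signal.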
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Start by figuring out which jail is causing it and we can reason about it from there.
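With many jails, halving the suspect list on each attempt is usually quicker than enabling them one by one. This is only a sketch of the bookkeeping: `scrub_stalls_with` is a hypothetical callback (in practice a person starting those jails, running a scrub, and watching it), and it assumes a single jail is responsible and that the stall is reproducible.

```python
def find_offending_jail(jails, scrub_stalls_with):
    """Bisect `jails` down to one suspect.

    `scrub_stalls_with` is a hypothetical callback: given the list of jails
    to start (all others stopped), it returns True if the scrub stalls.
    Assumes exactly one jail is responsible and the result is repeatable.
    """
    candidates = list(jails)
    if not candidates:
        raise ValueError("need at least one jail to test")
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if scrub_stalls_with(half):
            candidates = half  # the culprit is in the half just tested
        else:
            candidates = candidates[len(half):]  # it must be in the other half
    return candidates[0]


# Simulated run with made-up jail names; "jail6" plays the culprit.
jails = [f"jail{i}" for i in range(1, 9)]
checks = []


def simulated_check(group):
    checks.append(list(group))
    return "jail6" in group


suspect = find_offending_jail(jails, simulated_check)
```

For eight jails this needs three scrub attempts instead of up to eight. If more than one jail contributes, the single-culprit assumption breaks and the result should be double-checked by running that jail alone.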
 