SOLVED Unable to reset Current_Pending_Sector (SMART)


theirman

Dabbler
Joined
Jan 30, 2014
Messages
31
Update: I ended up RMAing the drive, as it was *just* within the 3-year WD Red warranty (last month). I included the SMART error log. No questions asked.
Also got a spare one to prevent a similar lengthy downtime in the future. And finally I made sure all drives were selected in the (short/long) SMART tasks.

If anyone has the same problem, can't RMA, and is also unable to fix it by forcing a write to the bad sector(s) with badblocks or dd: try this FreeNAS post (set offline, wipe, resilver).
I can't vouch that it will work, but it would have been my last resort. I'll be sure to try it once one of my other (old) drives starts developing bad sectors.

===========

Hello,

Last week I received an email that a sector can't be read anymore from my ada4 device.

I've been trying to get the sector remapped in order to "fix" the issue, without any success: dd keeps failing with "Input/output error".

I read and used the following sources:
https://dekoder.wordpress.com/2014/10/08/fixing-freenas-currently-unreadable-pending-sectors-error/
https://forums.freenas.org/index.php?threads/currently-unreadable-pending-sectors.46395/
https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/
http://linux.101hacks.com/unix/badblocks/
https://forums.freenas.org/index.ph...-1-currently-unreadable-pending-sectors.9824/
http://daemon-notes.com/articles/system/smartmontools/current-pending
http://www.freebsddiary.org/smart-fixing-bad-sector.php

First I tried a long SMART test to find the failed sector:
Code:
[root@fnas] ~# smartctl -t long /dev/ada4
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 542 minutes for test to complete.
Test will complete after Thu Jun  1 21:35:48 2017

Use smartctl -X to abort test.


Which was successful in locating the problem:
Code:
[root@fnas] ~# smartctl -a /dev/ada4
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
...
Sector Sizes:	 512 bytes logical, 4096 bytes physical
...
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   1
...
SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	 24578		 1678762305		<< seek = 1678762305
# 2  Short offline	   Completed without error	   00%	  5518		 -
...


Next up, I wanted to write to the sector to make it relocate:
Code:
[root@fnas] ~# diskinfo -v /dev/ada4
/dev/ada4
		512			 # sectorsize
		...
		4096			# stripesize		<< blocksize (bs) = 4096
...
[root@fnas] ~# sysctl kern.geom.debugflags=16	#same as: sysctl kern.geom.debugflags=0x10
kern.geom.debugflags: 0 -> 16
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=1678762305 conv=noerror,sync
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000061 secs (0 bytes/sec)
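
A possible explanation for the I/O error, offered as a sketch rather than a diagnosis: dd's seek= counts blocks of bs bytes, while the LBA reported by smartctl counts 512-byte logical sectors. The helper below only does arithmetic and touches no disk. seek_in_range is a made-up name, and the capacity figure is the "User Capacity" value from the smartctl output for this drive.

```shell
#!/bin/sh
# Hypothetical helper: does the byte offset bs * seek fall inside a
# device of size_bytes? Prints "in range" or "past end".
seek_in_range() {
    bs=$1; seek=$2; size_bytes=$3
    offset=$((bs * seek))
    if [ "$offset" -lt "$size_bytes" ]; then
        echo "in range"
    else
        echo "past end"
    fi
}

# "User Capacity" as reported by smartctl for this drive (assumption:
# the whole-disk device /dev/ada4 exposes exactly this many bytes).
CAPACITY=4000787030016

seek_in_range 4096 1678762305 "$CAPACITY"   # the bs/seek pair used above
seek_in_range 512  1678762305 "$CAPACITY"   # same number read as a 512-byte LBA
```

With these numbers the first call reports "past end" and the second "in range", which would be consistent with dd refusing the write because it points beyond the end of a 4 TB drive.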


Initially I hadn't noticed the I/O error, and immediately started another long SMART test for the drive:
Code:
[root@fnas] ~# smartctl -t long /dev/ada4
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
...
Please wait 542 minutes for test to complete.
Test will complete after Sat Jun  3 06:53:16 2017


The following evening I checked the log again:
Code:
[root@fnas] ~# smartctl -a /dev/ada4
...
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   1
...
SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	 24611		 1678762304		<< seek = 1678762304
# 2  Extended offline	Completed: read failure	   90%	 24578		 1678762305
# 3  Short offline	   Completed without error	   00%	  5518		 -
...


This is a new LBA; it seems the test failed one LBA earlier this time. I tried the "fix" again:
Code:
[root@fnas] ~# sysctl kern.geom.debugflags=16
kern.geom.debugflags: 16 -> 16
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000066 secs (0 bytes/sec)
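
On the "new LBA" noted above: with 512-byte logical and 4096-byte physical sectors, eight consecutive LBAs share one physical sector, so two adjacent LBAs may well describe the same spot on the platter. A small sketch that prints the first and last LBA of the physical sector containing a given LBA; phys_span is an illustrative name and the 8:1 ratio is taken from the "Sector Sizes" line.

```shell
#!/bin/sh
# Illustrative helper: print "first last" LBA of the physical sector
# that contains the given logical LBA. ratio = physical / logical size.
phys_span() {
    lba=$1; ratio=${2:-8}
    first=$((lba - lba % ratio))
    last=$((first + ratio - 1))
    echo "$first $last"
}

phys_span 1678762305
phys_span 1678762304
```

Both reported LBAs fall in the span 1678762304 to 1678762311. Since the drive lists "Selective Self-test supported", a range like that could in principle be retested with smartctl -t select,1678762304-1678762311 /dev/ada4 instead of waiting 542 minutes for a full extended test. That was not tried in this thread.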


Tried with badblocks next:
Code:
[root@fnas] ~# badblocks -b 4096 -wsv -c 64 -p 10 /dev/ada4 1678762305 1678762304
Checking for bad blocks in read-write mode
From block 1678762304 to 1678762305
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
1678762304
1678762305
done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 2 bad blocks found. (0/2/0 errors)
...
Checking for bad blocks in read-write mode
From block 1678762304 to 1678762305
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 0 bad blocks found. (0/2/0 errors)

But every single time I retry this command, it'll say "2 bad blocks found" once, followed nine times by "0 bad blocks found".
So it doesn't seem to do anything.
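
One thing that may explain the odd result: badblocks takes its last-block and first-block arguments in units of the -b block size, not in 512-byte LBAs. Assuming that is what went wrong here, the sketch below converts an LBA range to a block range for a given block size. lba_range_to_blocks is a hypothetical helper, not part of badblocks.

```shell
#!/bin/sh
# Hypothetical helper: convert a range of 512-byte LBAs to block numbers
# for a larger block size. Prints "first_block last_block".
lba_range_to_blocks() {
    first_lba=$1; last_lba=$2; block_size=$3
    per_block=$((block_size / 512))      # LBAs per block, 8 for 4096
    echo "$((first_lba / per_block)) $((last_lba / per_block))"
}

lba_range_to_blocks 1678762304 1678762305 4096
```

For -b 4096 both LBAs map to block 209845288. The numbers passed to badblocks above are larger than 976754645, which is the last 4096-byte block on a drive of 4,000,787,030,016 bytes, so that run may never have touched the suspect sector.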

Since badblocks also has a non-destructive mode, I decided to run the following for the entire disk in screen overnight (and scheduled another long test):
Code:
[root@fnas] ~# badblocks -v -b 4096 -s /dev/ada4
Checking blocks 0 to 976754645
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)

So nothing found in particular here.

Then I read something about taking the drive offline from the ZFS pool while doing these operations, so I attempted that next:
Code:
[root@fnas] ~# glabel status
									  Name  Status  Components
gptid/221fef8d-6648-11e6-93ab-bc5ff4fb5e9c	 N/A  ada0p1
gptid/c3605fab-09e1-11e4-9783-bc5ff4fb5e9c	 N/A  ada1p2
gptid/30d71158-0de2-11e4-b754-bc5ff4fb5e9c	 N/A  ada2p2
gptid/31bca9bc-0de2-11e4-b754-bc5ff4fb5e9c	 N/A  ada3p2
gptid/32290850-0de2-11e4-b754-bc5ff4fb5e9c	 N/A  ada4p2	<< !
gptid/32922a60-0de2-11e4-b754-bc5ff4fb5e9c	 N/A  ada5p2
gptid/33025476-0de2-11e4-b754-bc5ff4fb5e9c	 N/A  ada6p2
gptid/314f5e04-0de2-11e4-b754-bc5ff4fb5e9c	 N/A  ada7p2
[root@fnas] ~# zpool offline HDD gptid/32290850-0de2-11e4-b754-bc5ff4fb5e9c
[root@fnas] ~# ls /dev/ada4*
/dev/ada4		/dev/ada4p1	  /dev/ada4p1.eli  /dev/ada4p2
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000059 secs (0 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4p1 bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4p1: Operation not permitted
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4p1.eli bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4p1.eli: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000120 secs (0 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4p2 bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4p2: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000064 secs (0 bytes/sec)
[root@fnas] ~# zpool status
...
		NAME											STATE	 READ WRITE CKSUM
		HDD											 DEGRADED	 0	 0	 0
		  raidz2-0									  DEGRADED	 0	 0	 0
			...
			15598725717587329647						OFFLINE	  0	 0	 0  was /dev/gptid/32290850-0de2-11e4-b754-bc5ff4fb5e9c
[root@fnas] ~# zpool online HDD 15598725717587329647
[root@fnas] ~# zpool status
  pool: HDD
state: ONLINE
  scan: resilvered 596K in 0h0m with 0 errors on Sun Jun  4 16:27:35 2017
...

So except for me learning some new commands, it didn't do anything.

Finally I tried some smaller block sizes:
Code:
[root@fnas] ~# dd if=/dev/ada4 of=/dev/ada4 bs=512 count=1 iseek=1678762304 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
512 bytes transferred in 0.020933 secs (24459 bytes/sec)
[root@fnas] ~# dd if=/dev/ada4 of=/dev/ada4 bs=1024 count=1 iseek=1678762304 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
1024 bytes transferred in 0.014450 secs (70866 bytes/sec)
[root@fnas] ~# dd if=/dev/ada4 of=/dev/ada4 bs=2048 count=1 iseek=1678762304 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
2048 bytes transferred in 0.016503 secs (124100 bytes/sec)
[root@fnas] ~# dd if=/dev/ada4 of=/dev/ada4 bs=4096 count=1 iseek=1678762304 oseek=1678762304 conv=noerror,sync
dd: /dev/ada4: Input/output error
0+0 records in
0+0 records out
0 bytes transferred in 0.000042 secs (0 bytes/sec)
dd: /dev/ada4: Input/output error
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000068 secs (0 bytes/sec)


Same with /dev/zero as input:
Code:
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=512  count=1 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
512 bytes transferred in 0.000160 secs (3200423 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=1024 count=1 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
1024 bytes transferred in 0.000201 secs (5094860 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=2048 count=1 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
2048 bytes transferred in 0.000163 secs (12558384 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 oseek=1678762304 conv=noerror,sync
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000056 secs (0 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=512 count=8 oseek=1678762304 conv=noerror,sync
8+0 records in
8+0 records out
4096 bytes transferred in 0.000802 secs (5106977 bytes/sec)

bs=4096 always fails.
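
A hedged reading of why only bs=4096 fails: oseek= is multiplied by bs, so each command above targets a different byte offset even though the seek number is the same. The loop below only does arithmetic (no disk access) and prints which 512-byte LBA each block size would have hit. landed_lba is an illustrative name, and the total LBA count is the capacity reported by smartctl divided by 512.

```shell
#!/bin/sh
# Illustrative: for a given bs and seek value, which 512-byte LBA is hit?
landed_lba() {
    bs=$1; seek=$2
    echo $((bs * seek / 512))
}

TOTAL_LBAS=$((4000787030016 / 512))   # 7814037168 logical sectors

for size in 512 1024 2048 4096; do
    lba=$(landed_lba "$size" 1678762304)
    if [ "$lba" -lt "$TOTAL_LBAS" ]; then
        where="on disk"
    else
        where="past end"
    fi
    echo "bs=$size -> LBA $lba ($where)"
done
```

If this reading is right, only the bs=512 runs touched the reported sector. The bs=1024 and bs=2048 writes from /dev/zero would have landed on unrelated sectors (LBA 3357524608 and 6715049216), which a scrub of the raidz2 pool would likely detect and repair. The final bs=512 count=8 run does cover the whole 4096-byte physical sector.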

The Current_Pending_Sector is never reset to 0:
Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD40EFRX-68WT0N0
Firmware Version: 80.00A80
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sun Jun  4 18:18:26 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(54240) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 542) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x703d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   8
  3 Spin_Up_Time			0x0027   194   178   021	Pre-fail  Always	   -	   7300
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   67
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   067   067   000	Old_age   Always	   -	   24655
10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   67
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   43
193 Load_Cycle_Count		0x0032   192   192   000	Old_age   Always	   -	   26449
194 Temperature_Celsius	 0x0022   112   099   000	Old_age   Always	   -	   40
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   1
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 24637		 -
# 2  Extended offline	Completed: read failure	   90%	 24634		 1678762304
# 3  Extended offline	Completed: read failure	   90%	 24611		 1678762304
# 4  Extended offline	Completed: read failure	   90%	 24578		 1678762305
# 5  Short offline	   Completed without error	   00%	  5518		 -
# 6  Short offline	   Completed without error	   00%	  5470		 -
# 7  Short offline	   Completed without error	   00%	  5422		 -
# 8  Short offline	   Completed without error	   00%	  5374		 -
# 9  Short offline	   Completed without error	   00%	  5326		 -
#10  Short offline	   Completed without error	   00%	  5278		 -
#11  Extended offline	Completed without error	   00%	  5265		 -
#12  Short offline	   Completed without error	   00%	  5230		 -
#13  Short offline	   Completed without error	   00%	  5182		 -
#14  Short offline	   Completed without error	   00%	  5134		 -
#15  Short offline	   Completed without error	   00%	  5086		 -
#16  Short offline	   Completed without error	   00%	  5038		 -
#17  Short offline	   Completed without error	   00%	  4990		 -
#18  Short offline	   Completed without error	   00%	  4942		 -
#19  Extended offline	Interrupted (host reset)	  10%	  4928		 -
#20  Short offline	   Completed without error	   00%	  4894		 -
#21  Short offline	   Completed without error	   00%	  4846		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
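
To follow just the counter between attempts instead of reading the whole report, the raw value can be pulled out of the attribute table. A minimal sketch, assuming the ten-column attribute layout shown above; pending_count is a made-up helper that reads smartctl -A style output on stdin.

```shell
#!/bin/sh
# Made-up helper: print RAW_VALUE (column 10) of attribute 197
# (Current_Pending_Sector) from `smartctl -A` style output on stdin.
pending_count() {
    awk '$1 == 197 { print $10 }'
}

# Example using the attribute line from this post. On the real system
# one would pipe `smartctl -A /dev/ada4` into pending_count instead.
printf '%s\n' \
  '197 Current_Pending_Sector  0x0032   200   200   000  Old_age   Always  -  1' \
  | pending_count
```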


Does anyone know what I might be doing wrong? How can I "fix" this error?
Please don't just say "RMA the drive". It isn't worth the hassle so far.
Thank you

EDIT #1: I decided to read some more, and apparently I had to divide the LBA by 8 when writing with bs=4096, so I did that.
Code:
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=209845288 conv=noerror,sync # 1 678 762 304 / 8 = 209 845 288
1+0 records in
1+0 records out
4096 bytes transferred in 0.000204 secs (20093414 bytes/sec)
[root@fnas] ~# smartctl -a /dev/ada4 | grep -e Current_Pending_Sector -e failure
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   1
# 2  Extended offline	Completed: read failure	   90%	 24634		 1678762304
# 3  Extended offline	Completed: read failure	   90%	 24611		 1678762304
# 4  Extended offline	Completed: read failure	   90%	 24578		 1678762305
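
The division by 8 in EDIT #1 generalises: the seek value is the LBA times 512 divided by bs, rounded down to the start of the block. The sketch below prints (and never runs) a dd command for a given LBA. dd_for_lba is a hypothetical name and the device path is simply the one from this thread.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print a dd command that would overwrite the
# bs-sized block containing a 512-byte LBA. It never executes dd.
dd_for_lba() {
    dev=$1; lba=$2; bs=$3
    seek=$((lba * 512 / bs))    # integer division rounds down to the block start
    echo "dd if=/dev/zero of=$dev bs=$bs count=1 seek=$seek"
}

dd_for_lba /dev/ada4 1678762304 4096
dd_for_lba /dev/ada4 1678762305 4096
```

Both reported LBAs give seek=209845288, matching the command in EDIT #1. conv=noerror,sync is left out here because it only changes how read errors and short reads on the input are handled, which does not apply when the input is /dev/zero.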
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Does anyone know what I might be doing wrong? How can I "fix" this error?
AFAIK, the 'relocate fixes' you refer to are not guaranteed to succeed. You might not be able to fix the error. Perhaps all you accomplish is to move the error around a bit.
On the other hand, the forum appreciates your sharing of experience and process.
Although I've nothing to add, I'm curious to see how the story unfolds. Keep posting!

Please don't just say "RMA the drive".
:rolleyes:
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Hello,

Last week I received an email that a sector can't be read anymore from my ada4 device.

I've been trying to get the sector remapped in order to "fix" the issue. Without any success: dd keeps failing with "Input/output error",

Following sources were read and used:
https://dekoder.wordpress.com/2014/10/08/fixing-freenas-currently-unreadable-pending-sectors-error/
https://forums.freenas.org/index.php?threads/currently-unreadable-pending-sectors.46395/
https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/
http://linux.101hacks.com/unix/badblocks/
https://forums.freenas.org/index.ph...-1-currently-unreadable-pending-sectors.9824/
http://daemon-notes.com/articles/system/smartmontools/current-pending
http://www.freebsddiary.org/smart-fixing-bad-sector.php

First I tried a long SMART test to find the failed sector:
Code:
[root@fnas] ~# smartctl -t long /dev/ada4
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 542 minutes for test to complete.
Test will complete after Thu Jun  1 21:35:48 2017

Use smartctl -X to abort test.


Which was successful in locating the problem:
Code:
[root@fnas] ~# smartctl -a /dev/ada4
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
...
Sector Sizes:	 512 bytes logical, 4096 bytes physical
...
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   1
...
SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	 24578		 1678762305		<< seek = 1678762305
# 2  Short offline	   Completed without error	   00%	  5518		 -
...


Next up, I wanted to write to the sector to make it relocate:
Code:
[root@fnas] ~# diskinfo -v /dev/ada4
/dev/ada4
		512			 # sectorsize
		...
		4096			# stripesize		<< blocksize (bs) = 4096
...
[root@fnas] ~# sysctl kern.geom.debugflags=16	#same as: sysctl kern.geom.debugflags=0x10
kern.geom.debugflags: 0 -> 16
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=1678762305 conv=noerror,sync
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000061 secs (0 bytes/sec)


Initially I hadn't noticed the I/O error, and immediately started another long SMART test for the drive:
Code:
[root@fnas] ~# smartctl -t long /dev/ada4
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
...
Please wait 542 minutes for test to complete.
Test will complete after Sat Jun  3 06:53:16 2017


The following evening I checked the log again:
Code:
[root@fnas] ~# smartctl -a /dev/ada4
...
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   1
...
SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	 24611		 1678762304		<< seek = 1678762304
# 2  Extended offline	Completed: read failure	   90%	 24578		 1678762305
# 3  Short offline	   Completed without error	   00%	  5518		 -
...


Which is a new LBA, seems like it failed earlier this time. I tried the "fix" again:
Code:
[root@fnas] ~# sysctl kern.geom.debugflags=16
kern.geom.debugflags: 16 -> 16
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000066 secs (0 bytes/sec)


Tried with badblocks next:
Code:
[root@fnas] ~# badblocks -b 4096 -wsv -c 64 -p 10 /dev/ada4 1678762305 1678762304
Checking for bad blocks in read-write mode
From block 1678762304 to 1678762305
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
1678762304
1678762305
done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 2 bad blocks found. (0/2/0 errors)
...
Checking for bad blocks in read-write mode
From block 1678762304 to 1678762305
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 0 bad blocks found. (0/2/0 errors)

But every single time I retry this command, it'll say "2 bad blocks found" once, followed nine times by "0 bad blocks found".
So it doesn't seem to do anything.

Since badblocks also has a non-destructive mode, I decided to run the following for the entire disk in screen overnight (and scheduled another long test):
Code:
[root@fnas] ~# badblocks -v -b 4096 -s /dev/ada4
Checking blocks 0 to 976754645
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)

So nothing found in particular here.

Then I read something about taking the drive offline from the zfs pool while doing these operations, so I attempted that next:
Code:
[root@fnas] ~# glabel status
									  Name  Status  Components
gptid/221fef8d-6648-11e6-93ab-bc5ff4fb5e9c	 N/A  ada0p1
gptid/c3605fab-09e1-11e4-9783-bc5ff4fb5e9c	 N/A  ada1p2
gptid/30d71158-0de2-11e4-b754-bc5ff4fb5e9c	 N/A  ada2p2
gptid/31bca9bc-0de2-11e4-b754-bc5ff4fb5e9c	 N/A  ada3p2
gptid/32290850-0de2-11e4-b754-bc5ff4fb5e9c	 N/A  ada4p2	<< !
gptid/32922a60-0de2-11e4-b754-bc5ff4fb5e9c	 N/A  ada5p2
gptid/33025476-0de2-11e4-b754-bc5ff4fb5e9c	 N/A  ada6p2
gptid/314f5e04-0de2-11e4-b754-bc5ff4fb5e9c	 N/A  ada7p2
[root@fnas] ~# zpool offline HDD gptid/32290850-0de2-11e4-b754-bc5ff4fb5e9c
[root@fnas] ~# ls /dev/ada4*
/dev/ada4		/dev/ada4p1	  /dev/ada4p1.eli  /dev/ada4p2
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000059 secs (0 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4p1 bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4p1: Operation not permitted
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4p1.eli bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4p1.eli: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000120 secs (0 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4p2 bs=4096 count=1 seek=1678762304 conv=noerror,sync
dd: /dev/ada4p2: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000064 secs (0 bytes/sec)
[root@fnas] ~# zpool status
...
		NAME											STATE	 READ WRITE CKSUM
		HDD											 DEGRADED	 0	 0	 0
		  raidz2-0									  DEGRADED	 0	 0	 0
			...
			15598725717587329647						OFFLINE	  0	 0	 0  was /dev/gptid/32290850-0de2-11e4-b754-bc5ff4fb5e9c
[root@fnas] ~# zpool online HDD 15598725717587329647
[root@fnas] ~# zpool status
  pool: HDD
state: ONLINE
  scan: resilvered 596K in 0h0m with 0 errors on Sun Jun  4 16:27:35 2017
...

So except for me learning some new commands, it didn't do anything.

Finally I tried some smaller blocksizes:
Code:
[root@fnas] ~# dd if=/dev/ada4 of=/dev/ada4 bs=512 count=1 iseek=1678762304 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
512 bytes transferred in 0.020933 secs (24459 bytes/sec)
[root@fnas] ~# dd if=/dev/ada4 of=/dev/ada4 bs=1024 count=1 iseek=1678762304 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
1024 bytes transferred in 0.014450 secs (70866 bytes/sec)
[root@fnas] ~# dd if=/dev/ada4 of=/dev/ada4 bs=2048 count=1 iseek=1678762304 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
2048 bytes transferred in 0.016503 secs (124100 bytes/sec)
[root@fnas] ~# dd if=/dev/ada4 of=/dev/ada4 bs=4096 count=1 iseek=1678762304 oseek=1678762304 conv=noerror,sync
dd: /dev/ada4: Input/output error
0+0 records in
0+0 records out
0 bytes transferred in 0.000042 secs (0 bytes/sec)
dd: /dev/ada4: Input/output error
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000068 secs (0 bytes/sec)


Same with /dev/zero as input:
Code:
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=512  count=1 oseek=1678762304 conv=noerror,sync
dd if=/dev/zero of=/dev/ada4 bs=1024 count=1 oseek=1678762304 conv=noerror,sync
dd if=/dev/zero of=/dev/ada4 bs=2048 count=1 oseek=1678762304 conv=noerror,sync
dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 oseek=1678762304 conv=noerror,sync1+0 records in
1+0 records out
512 bytes transferred in 0.000160 secs (3200423 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=1024 count=1 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
1024 bytes transferred in 0.000201 secs (5094860 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=2048 count=1 oseek=1678762304 conv=noerror,sync
1+0 records in
1+0 records out
2048 bytes transferred in 0.000163 secs (12558384 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 oseek=1678762304 conv=noerror,sync
dd: /dev/ada4: Input/output error
1+0 records in
0+0 records out
0 bytes transferred in 0.000056 secs (0 bytes/sec)
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=512 count=8 oseek=1678762304 conv=noerror,sync
8+0 records in
8+0 records out
4096 bytes transferred in 0.000802 secs (5106977 bytes/sec)

bs=4096 always fails.

The Current_Pending_Sector is never reset to 0:
Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD40EFRX-68WT0N0
Firmware Version: 80.00A80
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sun Jun  4 18:18:26 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(54240) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 542) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x703d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   8
  3 Spin_Up_Time			0x0027   194   178   021	Pre-fail  Always	   -	   7300
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   67
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   067   067   000	Old_age   Always	   -	   24655
10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   67
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   43
193 Load_Cycle_Count		0x0032   192   192   000	Old_age   Always	   -	   26449
194 Temperature_Celsius	 0x0022   112   099   000	Old_age   Always	   -	   40
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   1
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 24637		 -
# 2  Extended offline	Completed: read failure	   90%	 24634		 1678762304
# 3  Extended offline	Completed: read failure	   90%	 24611		 1678762304
# 4  Extended offline	Completed: read failure	   90%	 24578		 1678762305
# 5  Short offline	   Completed without error	   00%	  5518		 -
# 6  Short offline	   Completed without error	   00%	  5470		 -
# 7  Short offline	   Completed without error	   00%	  5422		 -
# 8  Short offline	   Completed without error	   00%	  5374		 -
# 9  Short offline	   Completed without error	   00%	  5326		 -
#10  Short offline	   Completed without error	   00%	  5278		 -
#11  Extended offline	Completed without error	   00%	  5265		 -
#12  Short offline	   Completed without error	   00%	  5230		 -
#13  Short offline	   Completed without error	   00%	  5182		 -
#14  Short offline	   Completed without error	   00%	  5134		 -
#15  Short offline	   Completed without error	   00%	  5086		 -
#16  Short offline	   Completed without error	   00%	  5038		 -
#17  Short offline	   Completed without error	   00%	  4990		 -
#18  Short offline	   Completed without error	   00%	  4942		 -
#19  Extended offline	Interrupted (host reset)	  10%	  4928		 -
#20  Short offline	   Completed without error	   00%	  4894		 -
#21  Short offline	   Completed without error	   00%	  4846		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Does anyone know what I might be doing wrong? How can I "fix" this error?
Please don't just say "RMA the drive". It isn't worth the hassle so far.
Thank you

EDIT #1: Decided to read some more, and apparently I had to divide the LBA by 8 when writing 4096-byte blocks, so I did that.
Code:
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=209845288 conv=noerror,sync # 1 678 762 304 / 8 = 209 845 288
1+0 records in
1+0 records out
4096 bytes transferred in 0.000204 secs (20093414 bytes/sec)
[root@fnas] ~# smartctl -a /dev/ada4 | grep -e Current_Pending_Sector -e failure
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   1
# 2  Extended offline	Completed: read failure	   90%	 24634		 1678762304
# 3  Extended offline	Completed: read failure	   90%	 24611		 1678762304
# 4  Extended offline	Completed: read failure	   90%	 24578		 1678762305
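
The divide-by-8 step can be sketched in plain sh arithmetic. The helper name lba_to_4k is made up for illustration, and the factor of 8 assumes 512-byte logical sectors on 4096-byte physical sectors:

```shell
# Hypothetical helper: map a 512-byte LBA to its 4096-byte block index
# and to the position of that LBA inside the block (0-7).
lba_to_4k() {
    lba=$1
    echo "$(( lba / 8 )) $(( lba % 8 ))"
}

lba_to_4k 1678762304   # first LBA reported by the long test
lba_to_4k 1678762305   # second LBA reported by the long test
```

Both LBAs reported so far land in the same 4096-byte block, 209845288, at positions 0 and 1.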
You might try doing a secure erase operation on the drive. I inadvertently 'cleared' a bad sector error on one of my 4TB WDC Re drives when I performed a secure erase on it in preparation for sending it in for RMA replacement. No more error, so no reason to exchange it... but I keep a close watch on that particular drive these days.
 

theirman

Dabbler
Joined
Jan 30, 2014
Messages
31
You might try doing a secure erase operation on the drive. I inadvertently 'cleared' a bad sector error on one of my 4TB WDC Re drives when I performed a secure erase on it in preparation for sending it in for RMA replacement. No more error, so no reason to exchange it.
Interesting. How would I go about that? Did you take it offline out of the zfs pool, and then erase it?
I'm thinking of resorting to this fix: https://forums.freenas.org/index.php?threads/how-to-fix-1-unreadable-sector.21941/page-2#post-129670
selected the HD, clicked set offline
selected the HD, clicked wipe
selected the HD, clicked replace with itself
checked resilvering status with zpool status
But I might give it a day or two because that option seems rather excessive.
With replacing: https://doc.freenas.org/9.10/storage.html#replacing-a-failed-drive

Why assume the problem is fixable? It seems to me this is a case of SMART doing its job.
SMART is doing its job, and it did find some sectors that are a problem. Although... is there any way to actually confirm this? When I ran badblocks on the entire disk not a single "bad block" was found.
Even if these blocks are really problematic, I should be able to make the drive reallocate them. That should reset ID197 Current_Pending_Sector back to 0 and increase ID5 Reallocated_Sector_Ct. But I simply can't get the pending sector count to 0 or the number of reallocated sectors above 0.
So it won't fix the bad sector, but at least it'll get rid of the FreeNAS critical alert... and all smartctl long tests would succeed again.
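
To watch just those two counters between attempts, something like the following could be used. It is only a sketch: the helper name realloc_status is made up, and it assumes the attribute table layout that smartctl -a prints above (ID in the first column, name in the second, raw value last).

```shell
# Hypothetical helper: read "smartctl -a" output on stdin and print the raw
# values of attribute 5 (reallocated sectors) and 197 (pending sectors).
realloc_status() {
    awk '
        $1 == 5   && $2 == "Reallocated_Sector_Ct"  { realloc = $NF }
        $1 == 197 && $2 == "Current_Pending_Sector" { pending = $NF }
        END { print "reallocated=" realloc " pending=" pending }
    '
}

# Usage on the NAS (as root):
#   smartctl -a /dev/ada4 | realloc_status
```

Matching on both the ID and the attribute name keeps the rows of the selective self-test table, which also start with small numbers, from being picked up by mistake.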

In the meantime I've tried some more things.

1. Ran another long test since last time:
Code:
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	 24660		 1678762307
# 2  Short offline	   Completed without error	   00%	 24637		 -
# 3  Extended offline	Completed: read failure	   90%	 24634		 1678762304
# 4  Extended offline	Completed: read failure	   90%	 24611		 1678762304
# 5  Extended offline	Completed: read failure	   90%	 24578		 1678762305

1678762307 is new, used to be 1678762304/5 before.

2. Rewrote all 3 blocks again
Code:
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=512 count=1 seek=1678762304 conv=noerror,sync ; dd if=/dev/zero of=/dev/ada4 bs=512 count=1 seek=1678762305 conv=noerror,sync ; dd if=/dev/zero of=/dev/ada4 bs=512 count=1 seek=1678762307 conv=noerror,sync
1+0 records in
1+0 records out
512 bytes transferred in 0.000208 secs (2462711 bytes/sec)
1+0 records in
1+0 records out
512 bytes transferred in 0.000182 secs (2814526 bytes/sec)
1+0 records in
1+0 records out
512 bytes transferred in 0.000174 secs (2941758 bytes/sec)

Which is nearly the same as:
Code:
[root@fnas] ~# dd if=/dev/zero of=/dev/ada4 bs=512 count=4 seek=1678762304 conv=noerror,sync
4+0 records in
4+0 records out
2048 bytes transferred in 0.000516 secs (3969471 bytes/sec)

Which should write 0...00 to 1678762304, 1678762305, 1678762306, and 1678762307 (4*512B blocks).

3. Unfortunately this didn't help at all. The pending sector count didn't reset to 0, and it started flagging the *4 and *5 blocks again.
Code:
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	 24660		 1678762305
# 2  Extended offline	Completed: read failure	   90%	 24660		 1678762304
# 3  Extended offline	Completed: read failure	   90%	 24660		 1678762304
# 4  Extended offline	Completed: read failure	   90%	 24660		 1678762304
# 5  Extended offline	Completed: read failure	   90%	 24660		 1678762305
# 6  Extended offline	Completed: read failure	   90%	 24660		 1678762304
# 7  Extended offline	Completed: read failure	   90%	 24660		 1678762307


4. I tried the badblocks command again with 512B blocks:
Code:
[root@fnas] ~# badblocks -b 512 -wsv -c 9 -p 10 /dev/ada4 1678762312 1678762304
Checking for bad blocks in read-write mode
From block 1678762304 to 1678762312
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 0 bad blocks found. (0/0/0 errors)
...
Checking for bad blocks in read-write mode
From block 1678762304 to 1678762312
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 0 bad blocks found. (0/0/0 errors)

0 bad blocks.

5. When trying with one 4096B block (again dividing 1678762304 by 8 = 209845288):
Code:
[root@fnas] ~# badblocks -b 4096 -wsv -c 1 -p 10 /dev/ada4 209845288 209845288
Checking for bad blocks in read-write mode
From block 209845288 to 209845288
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 0 bad blocks found. (0/0/0 errors)
...
Checking for bad blocks in read-write mode
From block 209845288 to 209845288
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 0 bad blocks found. (0/0/0 errors)

Apart from the "Inappropriate ioctl", still 0 bad blocks found.
Which seems to confirm my overnight badblocks test that the disk seems fine.
Code:
[root@fnas] ~# badblocks -v -b 4096 -s /dev/ada4
Checking blocks 0 to 976754645
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)


So still no dice. I simply can't get the sector to reallocate. And badblocks keeps telling me everything's fine there.
My smartctl right now:
Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD40EFRX-68WT0N0
Firmware Version: 80.00A80
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Mon Jun  5 01:25:43 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  ( 121) The previous self-test completed having
										the read element of the test failed.
Total time to complete Offline
data collection:				(54240) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 542) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x703d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   8
  3 Spin_Up_Time			0x0027   194   178   021	Pre-fail  Always	   -	   7300
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   67
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   067   067   000	Old_age   Always	   -	   24662
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   67
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   43
193 Load_Cycle_Count		0x0032   192   192   000	Old_age   Always	   -	   26458
194 Temperature_Celsius	 0x0022   112   099   000	Old_age   Always	   -	   40
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   1
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	 24661		 1678762304
# 2  Extended offline	Completed: read failure	   90%	 24661		 1678762304
# 3  Extended offline	Completed: read failure	   90%	 24661		 1678762304
# 4  Extended offline	Completed: read failure	   90%	 24660		 1678762305
# 5  Extended offline	Completed: read failure	   90%	 24660		 1678762304
# 6  Extended offline	Completed: read failure	   90%	 24660		 1678762304
# 7  Extended offline	Completed: read failure	   90%	 24660		 1678762304
# 8  Extended offline	Completed: read failure	   90%	 24660		 1678762305
# 9  Extended offline	Completed: read failure	   90%	 24660		 1678762304
#10  Extended offline	Completed: read failure	   90%	 24660		 1678762307
#11  Short offline	   Completed without error	   00%	 24637		 -
#12  Extended offline	Completed: read failure	   90%	 24634		 1678762304
#13  Extended offline	Completed: read failure	   90%	 24611		 1678762304
#14  Extended offline	Completed: read failure	   90%	 24578		 1678762305
#15  Short offline	   Completed without error	   00%	  5518		 -
#16  Short offline	   Completed without error	   00%	  5470		 -
#17  Short offline	   Completed without error	   00%	  5422		 -
#18  Short offline	   Completed without error	   00%	  5374		 -
#19  Short offline	   Completed without error	   00%	  5326		 -
#20  Short offline	   Completed without error	   00%	  5278		 -
#21  Extended offline	Completed without error	   00%	  5265		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

As you can see, I've rerun the long test quite a few times since, and it just seems to fluctuate between 1678762304 and 1678762305 (with 1678762307 on one occasion).
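
That fluctuation can be tallied straight from the self-test log. A minimal sketch, assuming the log layout shown above (entries start with "#", and the LBA is the last column); the helper name tally_error_lbas is made up for illustration:

```shell
# Hypothetical helper: read "smartctl -a" output on stdin and count how often
# each LBA appears as LBA_of_first_error on a "read failure" entry.
tally_error_lbas() {
    awk '/^#/ && /read failure/ { count[$NF]++ }
         END { for (lba in count) print lba, count[lba] }' | sort -n
}

# Usage on the NAS (as root):
#   smartctl -a /dev/ada4 | tally_error_lbas
```

On the log above that gives 9 hits for 1678762304, 3 for 1678762305 and 1 for 1678762307.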

I did just notice the alarming difference in "LifeTime (hours)" between these failed tests and the last one that succeeded: 24578 vs. 5518 hours, so no test had run in 19060 hours/794 days/2.17 years (!!). Very alarming, since I set up automated SMART tasks when I initially set up the system (check smart_schedule.png for my settings) and never considered they wouldn't be running. (Which also raises the question of why the test ran for ada4. I just figured it got discovered as part of the schedule.)
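
The same gap can be computed for any disk by subtracting the LifeTime of the newest self-test entry from Power_On_Hours. This is a rough sketch, not a tested tool: the helper name selftest_gap is hypothetical, it assumes the smartctl -a layout shown above, and it only handles a plain numeric Power_On_Hours raw value.

```shell
# Hypothetical helper: read "smartctl -a" output on stdin and print how many
# power-on hours have passed since the most recent self-test log entry.
selftest_gap() {
    awk '
        $2 == "Power_On_Hours" { poh = $NF }
        # The first log entry is the newest; LifeTime follows the "NN%" column.
        /^#/ && last == "" {
            for (i = 1; i < NF; i++)
                if ($i ~ /^[0-9]+%$/) { last = $(i + 1); break }
        }
        END { if (poh != "" && last != "") print poh - last }
    '
}

# Usage on the NAS (as root):
#   smartctl -a /dev/ada4 | selftest_gap
```

A result in the thousands on a system with a weekly or monthly schedule would point to the disk having dropped out of the SMART tasks.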

Just checked the other disks, and noticed the same thing...
Code:
[root@fnas] ~# printf 'foreach dev ( 0 1 2 3 4 5 6 7 ) \n echo ">> ada$dev" \n smartctl -a /dev/ada$dev | grep Power_On_Hours \n smartctl -a /dev/ada$dev | grep pass \n end' | csh -f
>> ada0
  9 Power_On_Hours_and_Msec 0x0032   092   092   000	Old_age   Always	   -	   7394h+17m+11.740s
>> ada1
  9 Power_On_Hours		  0x0032   095   095   000	Old_age   Always	   -	   24791
>> ada2
  9 Power_On_Hours		  0x0032   064   064   000	Old_age   Always	   -	   26783
>> ada3
  9 Power_On_Hours		  0x0032   064   064   000	Old_age   Always	   -	   26784
>> ada4
  9 Power_On_Hours		  0x0032   067   067   000	Old_age   Always	   -	   24663
>> ada5
  9 Power_On_Hours		  0x0032   064   064   000	Old_age   Always	   -	   26782
>> ada6
  9 Power_On_Hours		  0x0032   067   067   000	Old_age   Always	   -	   24663
>> ada7
  9 Power_On_Hours		  0x0032   064   064   000	Old_age   Always	   -	   26783


ada0: never had any tests run at all (mSATA SSD boot drive, <1 y/o, new addition, and I forgot to set up a test for it)
ada1: SSD, last time test ran was @5593 hours
ada2: HDD, last time test ran was @7638 hours
ada3: HDD, last time test ran was @7638 hours
ada4: HDD, used to be @5518 hours (until a test somehow got run and flagged the disk)
ada5: HDD, last time test ran was @7638 hours
ada6: HDD, last time test ran was @5518 hours
ada7: HDD, last time test ran was @7638 hours

So I'll be running long SMART tests on every device overnight and check for errors in the morning...
In the meantime: can anyone tell me what else I can try to force a reallocation?
(Or why a SMART test ran for ada4 as I didn't order one.)
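
For the overnight run, the per-disk commands can be generated in one go. A small sketch in plain sh, where the helper name long_test_cmds is made up and the device numbers 0 to 7 are the ones from this box; it only prints the commands unless the output is piped to sh:

```shell
# Hypothetical dry-run helper: print the command that would start a long
# self-test on each given adaN device instead of running it.
long_test_cmds() {
    for n in "$@"; do
        echo "smartctl -t long /dev/ada$n"
    done
}

long_test_cmds 0 1 2 3 4 5 6 7
# To actually start the tests (as root), pipe the output to sh:
#   long_test_cmds 0 1 2 3 4 5 6 7 | sh
```

Printing first makes it easy to double-check the device list before anything is sent to the drives.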
 

Attachments

  • smart_schedule.png

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Interesting. How would I go about that? Did you take it offline out of the zfs pool, and then erase it?
Oh, yes - I removed the disk from my FreeNAS system. I did the secure erase on a Linux system I use for workbench tasks, using these instructions from Thomas-Krenn (don't let the article title fool you, you can perform a secure erase on HDDs as well as SSDs):

https://www.thomas-krenn.com/en/wiki/SSD_Secure_Erase

But you need to consider the very real possibility that this drive is simply failing and needs to be replaced. :oops:
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
I did just notice the alarming difference in "LifeTime (hours)" between these failed tests and the last one that succeeded: 24578 vs. 5518 hours, so no test had run in 19060 hours/794 days/2.17 years (!!). Very alarming, since I set up automated SMART tasks when I initially set up the system (check smart_schedule.png for my settings) and never considered they wouldn't be running. (Which also raises the question of why the test ran for ada4. I just figured it got discovered as part of the schedule.)

Ughh. That is not beautiful. I've encountered something similar, where a drive or two had become "unchecked" when I returned and checked the settings. My memory fails me on what I concluded had caused the unwanted unchecking. It could have been cable fixing (i.e., unplugging and plugging a drive back in, or moving it to a different SATA port) or some pool rearrangement. Anyhow, what I found would explain your situation if you just set it and forgot it.
upload_2017-6-5_8-57-33.png
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,367
It's a bit crappy, actually. The SMART stuff seems to be done by device number, and when you change the device order/naming, say by moving a drive from a motherboard SATA port to an HBA SAS/SATA port, the SMART task no longer picks up the drive.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
@Stux exactly what I've seen too.
I think this should be part of the maintenance schedule of the box. *spawns ideas on a resource on the topic*
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,367
@Stux exactly what I've seen too.
I think this should be part of the maintenance schedule of the box. *spawns ideas on a resource on the topic*

I think FreeNAS should be cleverer and sort it out for you. Perhaps a tickbox for "just check all the damn drives"
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
I think FreeNAS should be cleverer and sort it out for you. Perhaps a tickbox for "just check all the damn drives"
That would be an excellent feature request to put in that SMART scheduling box.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,464
The SMART stuff seems to be done by device number,
Not only by device number, though. If I replace da8 (for example) with a new disk, so the new disk is now da8, new da8 is no longer selected in the SMART tests. Whether it ties to the disk serial number or some other identifier I don't know, but it isn't simply the device number.
 

theirman

Dabbler
Joined
Jan 30, 2014
Messages
31
Oh, yes - I removed the disk from my FreeNAS system. I did the secure erase on a Linux system I use for workbench tasks...
But you need to consider the very real possibility that this drive is simply failing and needs to be replaced. :oops:
Good to know. But I don't have a spare desktop at home to do the wiping, so I think I'll try the wipe & replace first, if I can't get the sectors relocated any other way.

But you need to consider the very real possibility that this drive is simply failing and needs to be replaced. :oops:
Yes, well. I've searched my invoices, and apparently this one was from a batch I bought on 2014-07-05, with a 36-month warranty. I'm currently at *exactly* 35 months. So I'll contact them and see if I have grounds for warranty. Any helpful arguments for accomplishing that are welcome. ;)

I've encountered something similar, where a drive or two has become "unchecked" whenever I returned and checked on settings. ... It could've been cable fixing, (ie, unpluggin and pluging back a drive, or moving to a different SATA port) or some pool arrangements.
That could be right. My first motherboard, a C2750D4I (since replaced by a Supermicro X10SDV-6C+-TLN4F because the first one died of flash erosion), had a few defective SATA ports which caused random disconnects for some disks. After a few months I moved all disks to other SATA ports. But that wasn't 5500 hours after installing them. I did have RAM issues early in 2015 which caused me to completely disassemble everything to reach the modules, and that fits! 5500 hours since mid-July 2014 = 229 days or 7.5 months, which makes it Feb 2015. That must have been when the automated SMART tests stopped working. Scary that I hadn't had a SMART scan run in all that time. :(
Haven't had one run in 19060 hours (24578 - 5518) / 794 days / about 26 months / 2.2 years. :eek:
So much for setting up that check and peace of mind.
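
For anyone who wants to redo the arithmetic above, here is a minimal Python sketch that turns a gap in power-on hours into days and years. The function name `span_from_hours` is made up for illustration, and it assumes the drive was powered on continuously (reasonable for an always-on NAS, not for a desktop), so that power-on hours map directly onto calendar time.

```python
def span_from_hours(hours):
    """Convert a power-on-hours gap into (days, years).

    Assumption: the drive was powered on the whole time, so
    power-on hours equal elapsed wall-clock hours.
    """
    days = hours / 24
    years = days / 365.25
    return days, years


# Gap between the first failing test (24578 h) and the last good one (5518 h).
gap = 24578 - 5518
days, years = span_from_hours(gap)
print(f"{gap} h = {days:.0f} days = {years:.2f} years")
```

The same function applied to 5500 hours gives roughly 229 days, which is where the Feb 2015 estimate comes from.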

I think FreeNAS should be cleverer and sort it out for you. Perhaps a tickbox for "just check all the damn drives"
That would be an excellent feature request to put in that SMART scheduling box.
Does anyone want to log this? Pretty sure there are a ton of people whose drives got unchecked. I'll definitely keep it in mind now...

On a more positive note, all long tests finished successfully! :)
Code:
>> ada0
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours_and_Msec 0x0032   092   092   000	Old_age   Always	   -	   7405h+09m+08.790s
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	  7394		 -
>> ada1
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours		  0x0032   095   095   000	Old_age   Always	   -	   24802
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 24792		 -
>> ada2
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours		  0x0032   064   064   000	Old_age   Always	   -	   26793
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 26792		 -
>> ada3
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours		  0x0032   064   064   000	Old_age   Always	   -	   26795
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 26793		 -
>> ada4
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours		  0x0032   067   067   000	Old_age   Always	   -	   24673
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	 24663		 1678762306
# 2  Extended offline	Completed: read failure	   90%	 24661		 1678762304
#11  Extended offline	Completed: read failure	   90%	 24660		 1678762307
#15  Extended offline	Completed: read failure	   90%	 24578		 1678762305
>> ada5
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours		  0x0032   064   064   000	Old_age   Always	   -	   26793
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 26792		 -
>> ada6
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours		  0x0032   067   067   000	Old_age   Always	   -	   24674
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 24673		 -
>> ada7
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours		  0x0032   064   064   000	Old_age   Always	   -	   26794
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 26793		 -

Remarkable how the test for ada4 now flagged 4 consecutive blocks in total (1678762304..1678762307).
And how I can't trigger the relocate.
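
To double-check that the flagged blocks really are adjacent, a small Python sketch like the following could pull the LBAs out of the self-test log and group them into runs. The helper names are hypothetical, and the parsing assumes the column layout shown above (LBA of first error as the last column). The `SECTORS_PER_PHYSICAL = 8` constant assumes 512-byte logical sectors on 4096-byte physical sectors, which I have not verified for this drive; `smartctl -i` reports the actual sector sizes.

```python
# Sample lines copied from the ada4 self-test log above.
LOG = """\
# 1  Extended offline  Completed: read failure  90%  24663  1678762306
# 2  Extended offline  Completed: read failure  90%  24661  1678762304
#11  Extended offline  Completed: read failure  90%  24660  1678762307
#15  Extended offline  Completed: read failure  90%  24578  1678762305
"""

# Assumption: 4096-byte physical sectors exposed as 512-byte logical sectors.
SECTORS_PER_PHYSICAL = 8


def failing_lbas(log_text):
    """Return the sorted, de-duplicated LBAs of all read-failure entries."""
    lbas = set()
    for line in log_text.splitlines():
        if not line.lstrip().startswith("#") or "read failure" not in line:
            continue
        last = line.split()[-1]  # last column is LBA_of_first_error, or "-"
        if last.isdigit():
            lbas.add(int(last))
    return sorted(lbas)


def contiguous_runs(lbas):
    """Group sorted LBAs into (first, last) runs of consecutive values."""
    runs = []
    for lba in lbas:
        if runs and lba == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], lba)
        else:
            runs.append((lba, lba))
    return runs


def physical_sector_start(lba, ratio=SECTORS_PER_PHYSICAL):
    """First logical LBA of the physical sector containing `lba`."""
    return lba - (lba % ratio)


lbas = failing_lbas(LOG)
print(contiguous_runs(lbas))
print({physical_sector_start(lba) for lba in lbas})
```

If the sector-size assumption holds, all four LBAs sit inside the same 4096-byte physical sector, which would be consistent with one damaged physical sector rather than four separate defects.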

For now I've selected all disks in both of my SMART tasks to be checked. I find the UI to be quite confusing as I figured it was running on all of them when I checked yesterday. I'd forgotten that you actually need to select/highlight the disks in that list in order for it to work. It'd be useful if the SMART tasks overview had an extra column in which it showed the applicable drives.
I also noticed I can't select my ada0 drive (mSATA boot drive) in the UI. But the boot drive isn't that important.
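
Since the UI gives no overview of which drives are actually covered, one way to catch a silently skipped drive would be to compare each drive's Power_On_Hours with the LifeTime(hours) of its newest self-test. Below is a rough Python sketch of that idea. The function names and the two-week threshold are my own assumptions, and it only parses text in the format shown above; fetching that text (for example with `smartctl -A` and `smartctl -l selftest`) is left out.

```python
import re

# Sample lines copied from the ada0 output above.
ADA0_ATTR = "  9 Power_On_Hours_and_Msec 0x0032   092   092   000  Old_age   Always  -  7405h+09m+08.790s"
ADA0_LOG = ["# 1  Extended offline  Completed without error  00%  7394  -"]


def power_on_hours(attr_line):
    """Extract whole hours from a Power_On_Hours attribute line.

    Handles both plain raw values ("24673") and the
    "7405h+09m+08.790s" style by taking the leading digits.
    """
    raw = attr_line.split()[-1]
    match = re.match(r"\d+", raw)
    if match is None:
        raise ValueError(f"unrecognised raw value: {raw!r}")
    return int(match.group())


def newest_test_hours(selftest_lines):
    """Return the highest LifeTime(hours) in the self-test log, or None."""
    hours = []
    for line in selftest_lines:
        tokens = line.split()
        # LifeTime(hours) is the second-to-last column in this layout.
        if line.lstrip().startswith("#") and len(tokens) >= 3 and tokens[-2].isdigit():
            hours.append(int(tokens[-2]))
    return max(hours) if hours else None


def is_stale(attr_line, selftest_lines, max_gap_hours=24 * 14):
    """True if no self-test was logged within the last max_gap_hours."""
    newest = newest_test_hours(selftest_lines)
    if newest is None:
        return True  # no test on record at all
    return power_on_hours(attr_line) - newest > max_gap_hours


print(is_stale(ADA0_ATTR, ADA0_LOG))
```

Run from cron against every disk, a check like this would have flagged ada4 long before the 19060-hour gap built up.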

I'll also contact the webshop I've bought the disks from to check if my ada4 disk can be traded in under warranty.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,464
see if I have grounds for warranty. Any helpful arguments for accomplishing that are welcome.
Easy--the drive is failing SMART self-tests. That's definitely a warranty failure.
 

theirman

Dabbler
Joined
Jan 30, 2014
Messages
31
Easy--the drive is failing SMART self-tests. That's definitely a warranty failure.
Ok good. I've checked all six serial numbers on the official WD Warranty status page, and two (one of which fails SMART) of these are still "In Limited Warranty" until 2017-07-08.
I've attached screenshots of this fact and a dump of the smartctl command, so here's hoping I can get it replaced for free.

Still scary how I nearly missed this warranty window because SMART hadn't run in so long.

Which raises the question: does anyone have a theory on why FreeNAS would have sent me the following message on 2017-06-01?
Device: /dev/ada4, 1 Currently unreadable (pending) sectors
Device: /dev/ada4, Self-Test Log error count increased from 0 to 1
The only thing I can think triggered it was the scrub that's executed on the first (and 15th) of every month:
2017-06-05_14-06-28.png

But if I recall correctly it didn't report anything bad ZFS-wise, and no resilvering happened (I checked the zfs status after getting that email).
Back then I figured it was coincidence, and that the SMART test ran together with the scrub.
But it can't be, because smartctl didn't show any test results at that point, as no test had been run in 2.2 years.

Can the scrub encounter a disk error which makes FreeNAS send the quoted message?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,464
here's hoping I can get it replaced for free.
Should not be a problem at all. Last time I checked, for about $10, WD would do an advance exchange, and that included return shipping of your old disk. Seems like a pretty good deal to me. Make sure to burn in the replacement disk before putting it in your pool, and make sure you update your SMART test schedules once you've installed the replacement.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,464

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
No dice for solid ideas :/
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176
I suspect many of these decisions may now be subjected to a second look, so please file some tickets as the questions pop up again (new tickets, the old ones might not attract attention).
 