Disk has errors but pool is not degraded.

Status
Not open for further replies.
Joined
Sep 3, 2017
Messages
6
I have a six-disk system with two each of 2TB, 3TB and 4TB drives. FreeNAS is complaining of bad sectors. Should I replace the drive? I have another 4TB drive to put in it, but then the mirror "balance" is off. Also, all the drive replacement docs I found seem to be for a degraded pool, which mine says it isn't.
Ideas?

By the way, running FreeNAS 11 Stable - fully updated.
  • CRITICAL: Sept. 6, 2017, 11:20 p.m. - Device: /dev/ada5, 54 Currently unreadable (pending) sectors
  • CRITICAL: Sept. 8, 2017, 10:20 a.m. - Device: /dev/ada5, Self-Test Log error count increased from 0 to 1
  • CRITICAL: Sept. 8, 2017, 10:50 a.m. - Device: /dev/ada5, 15 Offline uncorrectable sectors
  • CRITICAL: Sept. 10, 2017, 8:21 a.m. - Device: /dev/ada5, 50 Offline uncorrectable sectors
Smartctl Output
Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA1201929
LU WWN Device Id: 5 0014ee 2afb3c808
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue Sep 12 20:08:22 2017 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 113) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (38160) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 368) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3
  3 Spin_Up_Time            0x0027   184   170   021    Pre-fail  Always       -       5766
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1453
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   025   025   000    Old_age   Always       -       54906
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       104
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       65
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       2088884
194 Temperature_Celsius     0x0022   113   106   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       121
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       3
200 Multi_Zone_Error_Rate   0x0008   200   199   000    Old_age   Offline      -       3

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        10%      54801        3839505240
# 2  Short offline       Completed without error        00%      54765        -
# 3  Short offline       Completed without error        00%      54746        -
# 4  Short offline       Completed without error        00%      54483        -
# 5  Extended offline    Completed without error        00%      54394        -
# 6  Short offline       Completed without error        00%      54243        -
# 7  Extended offline    Completed without error        00%      54058        -
# 8  Short offline       Completed without error        00%      54003        -
# 9  Short offline       Completed without error        00%      53740        -
#10  Extended offline    Completed without error        00%      53651        -
#11  Short offline       Completed without error        00%      53538        -
#12  Short offline       Completed without error        00%      53537        -
#13  Short offline       Completed without error        00%      53536        -
#14  Short offline       Completed without error        00%      53535        -
#15  Short offline       Completed without error        00%      53534        -
#16  Short offline       Completed without error        00%      53533        -
#17  Short offline       Completed without error        00%      53532        -
#18  Short offline       Completed without error        00%      53531        -
#19  Short offline       Completed without error        00%      53530        -
#20  Short offline       Completed without error        00%      53529        -
#21  Short offline       Completed without error        00%      53528        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



ZPool status

Code:
  pool: VolumeZ
 state: ONLINE
  scan: scrub repaired 0 in 11h17m with 0 errors on Tue Sep  5 14:17:49 2017
config:

    NAME                                            STATE     READ WRITE CKSUM
    VolumeZ                                         ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/b3014a26-d506-11e4-a3f7-10c37b8fc624  ONLINE       0     0     0
        gptid/0dfa0c01-3623-11e4-98b6-10c37b8fc624  ONLINE       0     0     0
      mirror-1                                      ONLINE       0     0     0
        gptid/d9ed7075-4157-11e4-a44a-10c37b8fc624  ONLINE       0     0     0
        gptid/7217ec61-4128-11e4-a44a-10c37b8fc624  ONLINE       0     0     0
      mirror-2                                      ONLINE       0     0     0
        gptid/c63f52d5-53e9-11e7-9c4e-10c37b8fc624  ONLINE       0     0     0
        gptid/c75453b4-53e9-11e7-9c4e-10c37b8fc624  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h17m with 0 errors on Thu Sep  7 04:02:10 2017
config:

    NAME          STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      da0p2       ONLINE       0     0     0

errors: No known data errors

 

Inxsible

Guru
Joined
Aug 14, 2017
Messages
1,123
You do have 121 Current_Pending_Sectors. So it would be a good idea to look into replacing the drive.

But since the read test failed, you might want to run a SMART test one more time and check the values. Your pool is not degraded probably because the RAW_VALUE of 121 is below the threshold value of 200. So as far as SMART is concerned, the tests have passed for this run.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
It may not be degraded yet, but if it went from no pending sectors to 121 in a short period of time, then you should plan on replacing it soon. If the other mirror drive is similar to this one, you could be in trouble.

Your load cycles are over 2 million. You need to run WDIDLE3 on these drives to raise the park interval to 120 seconds.
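
For a sense of scale, here is a rough Python sketch that turns the two relevant raw values from the smartctl report above (Load_Cycle_Count and Power_On_Hours) into an average parking rate. The function name is made up for illustration, and the result is only a lifetime average, not a measurement of the drive's idle timer.

```python
# Raw values copied from the smartctl report posted earlier in this thread.
POWER_ON_HOURS = 54906
LOAD_CYCLE_COUNT = 2088884

def park_stats(load_cycles, power_on_hours):
    """Return (cycles per hour, average seconds between head parks)."""
    per_hour = load_cycles / power_on_hours
    return per_hour, 3600 / per_hour

per_hour, gap_seconds = park_stats(LOAD_CYCLE_COUNT, POWER_ON_HOURS)
print(f"{per_hour:.1f} load cycles per power-on hour")
print(f"roughly one head park every {gap_seconds:.0f} seconds")
```

On these figures the heads have parked about 38 times per power-on hour, averaged over the drive's whole life.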
 
Joined
Sep 3, 2017
Messages
6
"You do have 121 Current_Pending_Sectors. So it would be a good idea to look into replacing the drive.

"But since the read test failed, you might want to run a SMART test one more time and check the values. Your pool is not degraded probably because the RAW_VALUE of 121 is below the threshold value of 200. So as far as SMART is concerned, the tests have passed for this run."

I will run the test right now. I think I would like to replace the 2TB drive ASAP. I can pop in the 4TB drive, then, when I can, buy another 4TB to replace the other 2TB drive. Without the drive being in a degraded state, what is the best way to replace the bad 2TB drive? I have no spare SATA ports to attach it without disconnecting the bad one first. Just want the best practice for this situation.
 

Inxsible

Guru
Joined
Aug 14, 2017
Messages
1,123
"I will run the test right now. I think I would like to replace the 2TB drive ASAP. I can pop in the 4TB drive, then, when I can, buy another 4TB to replace the other 2TB drive. Without the drive being in a degraded state, what is the best way to replace the bad 2TB drive? I have no spare SATA ports to attach it without disconnecting the bad one first. Just want the best practice for this situation."
You can manually offline the bad disk, which will put the vdev in DEGRADED state. Then replace it with a new drive and then go back in the UI and online the new drive that you added. Resilvering will start and once that is done, your pool should come out of the DEGRADED state.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
It's better to have a drive with pending sectors, than no redundancy.

The drive is not degraded yet because the faulty sectors have not had any effect yet.

SMART operates at a different level from ZFS.

If the drive is in warranty you should RMA it.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
"Your pool is not degraded probably because the RAW_VALUE of 121 is below the threshold value of 200."
No, this has nothing to do with why the pool is not degraded. The pool is not degraded because the disk has not faulted as far as ZFS is concerned--that is, it hasn't served enough bad data within a short enough time for ZFS to kick it out of the pool. The disk can have tons of bad sectors, but if there's no data on them, ZFS won't know or care.
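
On the SMART side of that confusion: the THRESH column is compared against the normalized VALUE column, not against RAW_VALUE. The Python sketch below parses three rows copied from the attribute table posted above and applies that comparison. The helper names are invented for this example, and it assumes the ten-column row layout shown in this thread.

```python
# Three rows copied from the smartctl attribute table earlier in the thread.
ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       121
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

def parse_row(line):
    """Split one attribute row into the fields used below."""
    f = line.split()
    return {"name": f[1], "value": int(f[3]), "worst": int(f[4]),
            "thresh": int(f[5]), "raw": int(f[9])}

def smart_failing(attr):
    # An attribute counts as failing when the normalized VALUE has dropped
    # to THRESH or below. RAW_VALUE plays no part in this comparison.
    return attr["value"] <= attr["thresh"]

attrs = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
for a in attrs:
    print(a["name"], "raw:", a["raw"], "failing:", smart_failing(a))
```

With a THRESH of 000, Current_Pending_Sector can never trip the SMART threshold check however large its raw count gets, which is why the overall result still reads PASSED with 121 pending sectors.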

The disk also has a load cycle count over 2 million--check into WDIDLE3.EXE for the remaining (and replacement) disks.
 
Joined
Sep 3, 2017
Messages
6
After the most recent long SMART test.
Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA1201929
LU WWN Device Id: 5 0014ee 2afb3c808
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Wed Sep 13 06:29:05 2017 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 117) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (38160) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 368) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3
  3 Spin_Up_Time            0x0027   184   170   021    Pre-fail  Always       -       5766
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1453
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   025   025   000    Old_age   Always       -       54917
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       104
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       65
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       2089742
194 Temperature_Celsius     0x0022   114   106   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       121
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       3
200 Multi_Zone_Error_Rate   0x0008   200   199   000    Old_age   Offline      -       43

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        50%      54911        1986736672
# 2  Extended offline    Completed: read failure        10%      54801        3839505240
# 3  Short offline       Completed without error        00%      54765        -
# 4  Short offline       Completed without error        00%      54746        -
# 5  Short offline       Completed without error        00%      54483        -
# 6  Extended offline    Completed without error        00%      54394        -
# 7  Short offline       Completed without error        00%      54243        -
# 8  Extended offline    Completed without error        00%      54058        -
# 9  Short offline       Completed without error        00%      54003        -
#10  Short offline       Completed without error        00%      53740        -
#11  Extended offline    Completed without error        00%      53651        -
#12  Short offline       Completed without error        00%      53538        -
#13  Short offline       Completed without error        00%      53537        -
#14  Short offline       Completed without error        00%      53536        -
#15  Short offline       Completed without error        00%      53535        -
#16  Short offline       Completed without error        00%      53534        -
#17  Short offline       Completed without error        00%      53533        -
#18  Short offline       Completed without error        00%      53532        -
#19  Short offline       Completed without error        00%      53531        -
#20  Short offline       Completed without error        00%      53530        -
#21  Short offline       Completed without error        00%      53529        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
You can use the 4TB disk, and then replace it later with a different 2TB disk. Until both disks in a mirror set are 4TB, it will not expand. That means you are not required to use 4TB forever.

The SMART data show your multi-zone error rate went from 3 to 43 after the long SMART test. This is another sign of a drive that needs replacement.
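
A quick way to spot that kind of change is to diff the raw values of two reports. The sketch below does this in Python for a handful of attributes, with the numbers typed in by hand from the two smartctl outputs in this thread; the function name is made up for the example.

```python
# Raw values copied from the two smartctl reports in this thread.
before = {
    "Power_On_Hours": 54906,
    "Load_Cycle_Count": 2088884,
    "Current_Pending_Sector": 121,
    "Offline_Uncorrectable": 0,
    "Multi_Zone_Error_Rate": 3,
}
after = {
    "Power_On_Hours": 54917,
    "Load_Cycle_Count": 2089742,
    "Current_Pending_Sector": 121,
    "Offline_Uncorrectable": 0,
    "Multi_Zone_Error_Rate": 43,
}

def raw_deltas(old, new):
    """Return {attribute: change in raw value} for attributes that changed."""
    return {name: new[name] - old[name] for name in old if new[name] != old[name]}

for name, delta in raw_deltas(before, after).items():
    print(f"{name}: {delta:+d}")
```

Over those 11 power-on hours the pending sector count stayed at 121 while Multi_Zone_Error_Rate rose by 40.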
 
Joined
Sep 3, 2017
Messages
6
I think I made a mistake. I removed my old drive. Added the new drive and when I went to Volume Manager, the option to Mirror wasn't there. I went to Manual setup and added it to the Volume there. My Volume is still in a DEGRADED state. I looked at my ZPOOL Status and it looked like this:
Code:
  pool: VolumeZ
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 11h17m with 0 errors on Tue Sep  5 14:17:49 2017
config:

    NAME                                            STATE     READ WRITE CKSUM
    VolumeZ                                         DEGRADED     0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/b3014a26-d506-11e4-a3f7-10c37b8fc624  ONLINE       0     0     0
        gptid/0dfa0c01-3623-11e4-98b6-10c37b8fc624  ONLINE       0     0     0
      mirror-1                                      DEGRADED     0     0     0
        8766186258271018871                         UNAVAIL      0     0     0  was /dev/gptid/d9ed7075-4157-11e4-a44a-10c37b8fc624
        gptid/7217ec61-4128-11e4-a44a-10c37b8fc624  ONLINE       0     0     0
      mirror-2                                      ONLINE       0     0     0
        gptid/c63f52d5-53e9-11e7-9c4e-10c37b8fc624  ONLINE       0     0     0
        gptid/c75453b4-53e9-11e7-9c4e-10c37b8fc624  ONLINE       0     0     0
      gptid/3e7d3fa4-98f3-11e7-9c33-10c37b8fc624    ONLINE       0     0     0

errors: No known data errors



Will my non-redundant data become redundant without the mirror? If not, how can I get a copy on the new drive? Should I have added the drive differently? Can I redo adding my drive if I messed it up?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Yes. You made a big mistake :(

You cannot remove a vdev from a pool, and you've added a new single-drive vdev.

You can 'fix' this by adding two new drives: one to replace the failed drive and one to mirror the new stripe.

OR

you can back up your entire pool, re-create it, and restore.
 
Joined
Sep 3, 2017
Messages
6
I don't have any more ports to add another drive, let alone two. What happens if I disconnect the new drive? Can I detach it from the Volume? I thought I needed help before, now I need it even more.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
"I don't have any more ports to add another drive, let alone two."

You may want to consider adding extra ports, i.e. by adding an HBA.

"What happens if I disconnect the new drive?"

You will lose your entire pool (until you reconnect the drive).

"Can I detach it from the Volume?"

No.

"I thought I needed help before, now I need it even more."

Yes.

For future reference, http://doc.freenas.org/11/storage.html#replacing-a-failed-drive

See Fig. 8.2.20, Replacing a Failed Disk
[attached image: replace.png]


You should've selected the offlined disk, and clicked replace.

Instead you used the Volume Manager to override the safeties and added a new disk to your pool as a single-disk vdev, i.e. a stripe.

That has resulted in a new single point of failure. If that new drive fails, you lose your pool (this includes removing it). The existing mirror with only one member is also a single point of failure.

You can remove the offlined drive from the degraded mirror (zpool detach), but that will just result in two single-disk vdevs (i.e. stripes) being in your pool.

But you cannot remove a vdev from a pool.

The only way to restore redundancy to your pool is to replace the offlined disk in your degraded mirror, which restores that mirror, and then to add another disk to your stripe vdev (zpool attach) to convert the single-disk vdev to a mirror.
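
The two weak spots described above can be read straight off the pool layout. As a rough illustration, the Python sketch below walks the config section of the zpool status output posted earlier (retyped here with space indentation and only the NAME and STATE columns) and counts the healthy disks behind each top-level vdev. The function name is invented for this example, and it only understands mirrors and single disks, not raidz.

```python
# NAME and STATE columns from the degraded 'zpool status' output in this thread.
CONFIG = """\
VolumeZ                                         DEGRADED
  mirror-0                                      ONLINE
    gptid/b3014a26-d506-11e4-a3f7-10c37b8fc624  ONLINE
    gptid/0dfa0c01-3623-11e4-98b6-10c37b8fc624  ONLINE
  mirror-1                                      DEGRADED
    8766186258271018871                         UNAVAIL
    gptid/7217ec61-4128-11e4-a44a-10c37b8fc624  ONLINE
  mirror-2                                      ONLINE
    gptid/c63f52d5-53e9-11e7-9c4e-10c37b8fc624  ONLINE
    gptid/c75453b4-53e9-11e7-9c4e-10c37b8fc624  ONLINE
  gptid/3e7d3fa4-98f3-11e7-9c33-10c37b8fc624    ONLINE
"""

def healthy_members(config):
    """Map each top-level vdev to the number of ONLINE disks backing it."""
    vdevs = {}
    current = None
    for line in config.splitlines()[1:]:          # skip the pool's own row
        indent = len(line) - len(line.lstrip())
        name, state = line.split()[:2]
        if indent == 2:                           # top-level vdev
            current = name
            # A top-level entry that is not a mirror is itself the only disk.
            vdevs[current] = 0 if name.startswith("mirror") else int(state == "ONLINE")
        elif indent == 4 and state == "ONLINE":   # disk inside a mirror
            vdevs[current] += 1
    return vdevs

counts = healthy_members(CONFIG)
at_risk = [vdev for vdev, n in counts.items() if n < 2]
print(at_risk)
```

Any vdev listed in at_risk has a single healthy disk left, so losing that disk takes the whole pool with it.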

Since you have no additional ports... and you probably don't want to acquire another 2 drives... your only other option, assuming you want to restore redundancy and keep your data, is to recreate the pool and restore from backup. And if you don't have a backup, then you will need to make one.

Unfortunately, the mistake you made is one of the worst :(

There is work being done to make it so you can remove a vdev, specifically to help in this type of situation, but that is for a future version of ZFS, not today's.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
How much data do you have to back up? How big are your disks?
 
Joined
Sep 3, 2017
Messages
6
"How much data do you have to back up? How big are your disks?"
I have 4.7 TiB to back up. Does ZFS have a backup utility, or is it a matter of simply copying everything to other (non-ZFS) volumes?

I have two 4 TB drives Mirrored, two 3TB drives mirrored, a 2TB drive which was mirrored, and a 4TB drive which I just added.

I think I will buy another 4TB drive this weekend, then copy most everything to it, and the rest on another drive I have lying about. Once that is done, destroy my pool and recreate it properly - two 4 TB drives mirrored, two 3 TB drives mirrored, then copy everything back, then add the two 4 TB drives, getting rid of those old 2 TB drives completely.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
"I have 4.7 TiB to back up. Does ZFS have a backup utility, or is it a matter of simply copying everything to other (non-ZFS) volumes?"
There are lots of ways to back up to ZFS or another filesystem. Since you can't add more disks to your system, you are stuck with a network backup or possibly a USB drive plugged into FreeNAS. Make sure it's USB 3 and use zfs send/recv.
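
As a rough idea of what a zfs send/recv backup looks like, the Python sketch below only builds and prints the command lines; it does not run anything. The snapshot name and the destination dataset are hypothetical placeholders, and it assumes a second ZFS pool called backup already exists on the backup disk.

```python
import shlex

SOURCE_POOL = "VolumeZ"            # pool name from this thread
SNAPSHOT = "migrate"               # hypothetical snapshot name
BACKUP_DATASET = "backup/VolumeZ"  # hypothetical destination dataset

def backup_commands(pool, snap, dest):
    """Return shell commands for a recursive snapshot and a full replication."""
    target = f"{pool}@{snap}"
    snapshot = shlex.join(["zfs", "snapshot", "-r", target])  # snapshot every dataset
    send = shlex.join(["zfs", "send", "-R", target])          # replication stream
    recv = shlex.join(["zfs", "recv", "-F", dest])            # receive on the backup pool
    return [snapshot, f"{send} | {recv}"]

for cmd in backup_commands(SOURCE_POOL, SNAPSHOT, BACKUP_DATASET):
    print(cmd)
```

Printing the commands first makes it easy to review them before pasting anything into a shell on the NAS.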
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
To quote myself, what part of this screen:
[attached screenshot: upload_2017-9-14_9-39-15.png]

made you think, "sure, I should switch to Manual Setup so I can bypass this"?

I'm serious here. OK, I'm partially being snarky, since any effort to do your homework would have told you you were doing it wrong, but I'm honestly mystified how people continue to blow through this warning screen, which is intended to prevent exactly what you did, to end up with the result you now have.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
And how is "Add Extra Device" ever interpreted as the "ooh, this will replace my existing drive" button?
 