DEGRADED: One or more devices could not be opened

Status
Not open for further replies.

Wolfeman0101

Patron
Joined
Jun 14, 2012
Messages
428
I'm getting this error all of a sudden.
Code:
Aug. 8, 2017, 5:51 p.m. - The volume vol1 state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.


I run a report every other day, and the affected disk has never shown an error.
Code:
########## SMART status report for ada1 drive (Western Digital Red: WD-WCC4E1EFAJLL) ##########
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   189   188   021    Pre-fail  Always       -       7550
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       16114
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       17
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       442
194 Temperature_Celsius     0x0022   114   092   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     16108         -


Is the drive bad or what is going on?
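
For anyone who wants to check a report like the one above programmatically, here is a minimal Python sketch. It assumes the usual ten-column `smartctl -A` attribute layout; the `WATCHED` selection and the function names are hypothetical choices for illustration, not something smartctl defines.

```python
# Attribute IDs often watched for media or link trouble.
# This selection is an assumption for illustration, not an official list.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}


def parse_attributes(report):
    """Map attribute ID -> raw value for each attribute row in the report."""
    raw = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have ten columns: ID, name, a 0x... flag, ..., raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            raw[int(fields[0])] = int(fields[9])
    return raw


def nonzero_watched(report):
    """Return the watched attributes whose raw value is not zero."""
    raw = parse_attributes(report)
    return {name: raw[i] for i, name in WATCHED.items() if raw.get(i, 0) != 0}


# Four rows copied from the report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched(SAMPLE))
```

An empty result only says these counters are zero; it does not rule out a cabling or power problem.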
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Could also be a cabling issue.
 

zeezoo

Dabbler
Joined
Sep 4, 2017
Messages
12
Hello,

Did you manage to sort it out?

Last night (maybe because of fireworks) I got an e-mail saying:

Code:
		
The volume raid-z2 state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
			  


I shut down the server immediately.
Now, after reboot I get this:

Code:

NAME                                                STATE     READ WRITE CKSUM
raid-z2                                             DEGRADED     0     0     0
  raidz2-0                                          DEGRADED     0     0     0
    13191957016337703241                            UNAVAIL      0     0     0  was /dev/gptid/dfe7bb5b-4c25-11e7-b342-3cd92b507db2.eli
    gptid/dda36b46-1383-11e7-b739-3cd92b507db2.eli  ONLINE       0     0     0
    gptid/21cc7236-5378-11e6-be46-3cd92b507db2.eli  ONLINE       0     0     0
    gptid/da1620b7-6aee-11e7-be91-3cd92b507db2.eli  ONLINE       0     0     0
    gptid/2445918c-5378-11e6-be46-3cd92b507db2.eli  ONLINE       0     0     0
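
A rough Python sketch of how the non-ONLINE rows in a listing like this could be picked out, including the path the missing device used to have. The function name and the column assumptions (name, state, three counters, optional `was <path>`) are mine, based only on the output shown here.

```python
# Rows copied from the status output above (shortened to two members).
STATUS = """\
NAME                                                STATE     READ WRITE CKSUM
raid-z2                                             DEGRADED     0     0     0
  raidz2-0                                          DEGRADED     0     0     0
    13191957016337703241                            UNAVAIL      0     0     0  was /dev/gptid/dfe7bb5b-4c25-11e7-b342-3cd92b507db2.eli
    gptid/dda36b46-1383-11e7-b739-3cd92b507db2.eli  ONLINE       0     0     0
"""


def unhealthy(status):
    """Return (name, state, former_path) for every row that is not ONLINE."""
    rows = []
    for line in status.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a config row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, state = fields[0], fields[1]
        if state != "ONLINE":
            # A trailing "was <path>" records where the device used to live.
            former = fields[6] if len(fields) >= 7 and fields[5] == "was" else None
            rows.append((name, state, former))
    return rows


for row in unhealthy(STATUS):
    print(row)
```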


Drive seems fine:

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   177   021    Pre-fail  Always       -       7908
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       41
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2788
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       41
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   118   118   000    Old_age   Always       -       247368
194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0


I can't bring the drive online.

Code:
zpool online raid-z2 13191957016337703241
warning: device '13191957016337703241' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present


I did not do anything to the server. Is it possible that the drive changed its ID by itself?
I'm a bit worried about using "replace drive" and selecting the same disk. If it is faulty, it may fail during resilvering :/

What can I do?

Regards,
zeezoo
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977

zeezoo

Dabbler
Joined
Sep 4, 2017
Messages
12
Jailer said: "No. You can start by not truncating the SMART output. Post the entire output. Also, are you sure you identified the correct drive?"

Thanks for the answer. Yes. When I try to replace, this is the only drive that shows up as available for replacement. ada0p2 is also missing in Volume Status.

smartctl -a /dev/ada0

Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Blue
Device Model:	 WDC WD40EZRZ-00WN9B0
Serial Number:	WD-WCC4E4JNCUU3
LU WWN Device Id: 5 0014ee 2630c0485
Firmware Version: 80.00A80
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Mon Jan  1 15:35:39 2018 +04
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)   Offline data collection activity
				   was suspended by an interrupting command from host.
				   Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)   The previous self-test routine completed
				   without error or no self-test has ever
				   been run.
Total time to complete Offline
data collection:		(52380) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
				   Auto Offline data collection on/off support.
				   Suspend Offline collection upon new
				   command.
				   Offline surface scan supported.
				   Self-test supported.
				   Conveyance Self-test supported.
				   Selective Self-test supported.
SMART capabilities:			(0x0003)   Saves SMART data before entering
				   power-saving mode.
				   Supports SMART auto save timer.
Error logging capability:		(0x01)   Error logging supported.
				   General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 ( 524) minutes.
Conveyance self-test routine
recommended polling time:	 (   5) minutes.
SCT capabilities:		   (0x7035)   SCT Status supported.
				   SCT Feature Control supported.
				   SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   177   021    Pre-fail  Always       -       7908
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       41
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2789
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       41
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   118   118   000    Old_age   Always       -       247374
194 Temperature_Celsius     0x0022   114   109   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2787         -
# 2  Short offline       Completed without error       00%      2763         -
# 3  Short offline       Completed without error       00%      2739         -
# 4  Short offline       Completed without error       00%      2715         -
# 5  Short offline       Completed without error       00%      2692         -
# 6  Short offline       Completed without error       00%      2668         -
# 7  Short offline       Completed without error       00%      2644         -
# 8  Short offline       Completed without error       00%      2620         -
# 9  Short offline       Completed without error       00%      2596         -
#10  Short offline       Completed without error       00%      2572         -
#11  Short offline       Completed without error       00%      2548         -
#12  Short offline       Completed without error       00%      2524         -
#13  Short offline       Completed without error       00%      2500         -
#14  Short offline       Completed without error       00%      2476         -
#15  Short offline       Completed without error       00%      2452         -
#16  Short offline       Completed without error       00%      2428         -
#17  Short offline       Completed without error       00%      2404         -
#18  Short offline       Completed without error       00%      2380         -
#19  Short offline       Completed without error       00%      2356         -
#20  Short offline       Completed without error       00%      2332         -
#21  Short offline       Completed without error       00%      2308         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



Also:

Code:
root@freenas:/dev/gptid # ls
21cc7236-5378-11e6-be46-3cd92b507db2			da1620b7-6aee-11e7-be91-3cd92b507db2
21cc7236-5378-11e6-be46-3cd92b507db2.eli		da1620b7-6aee-11e7-be91-3cd92b507db2.eli
2445918c-5378-11e6-be46-3cd92b507db2			dda36b46-1383-11e7-b739-3cd92b507db2
2445918c-5378-11e6-be46-3cd92b507db2.eli		dda36b46-1383-11e7-b739-3cd92b507db2.eli
9e1a3075-4a21-11e6-9d45-3cd92b507db2			dfd17659-4c25-11e7-b342-3cd92b507db2
9e2880c2-4a21-11e6-9d45-3cd92b507db2			dfe7bb5b-4c25-11e7-b342-3cd92b507db2


I'm using the latest version, 11.1.
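
One way to read the listing above, sketched in Python: compare the pool's member names from the earlier `zpool status` output against what actually exists under /dev/gptid. The function name is hypothetical, and the interpretation (a partition that is visible without its `.eli` counterpart suggests the encrypted provider was not attached, rather than the ID changing) is an assumption, not a diagnosis.

```python
# Entries copied from the `ls` output above.
LISTING = """
21cc7236-5378-11e6-be46-3cd92b507db2      da1620b7-6aee-11e7-be91-3cd92b507db2
21cc7236-5378-11e6-be46-3cd92b507db2.eli  da1620b7-6aee-11e7-be91-3cd92b507db2.eli
2445918c-5378-11e6-be46-3cd92b507db2      dda36b46-1383-11e7-b739-3cd92b507db2
2445918c-5378-11e6-be46-3cd92b507db2.eli  dda36b46-1383-11e7-b739-3cd92b507db2.eli
9e1a3075-4a21-11e6-9d45-3cd92b507db2      dfd17659-4c25-11e7-b342-3cd92b507db2
9e2880c2-4a21-11e6-9d45-3cd92b507db2      dfe7bb5b-4c25-11e7-b342-3cd92b507db2
""".split()

# Pool members as named in the zpool status output earlier in the thread.
POOL_MEMBERS = [
    "dfe7bb5b-4c25-11e7-b342-3cd92b507db2.eli",
    "dda36b46-1383-11e7-b739-3cd92b507db2.eli",
    "21cc7236-5378-11e6-be46-3cd92b507db2.eli",
    "da1620b7-6aee-11e7-be91-3cd92b507db2.eli",
    "2445918c-5378-11e6-be46-3cd92b507db2.eli",
]


def missing_members(listing, members):
    """For each pool member absent from the listing, report whether the
    underlying partition (the same name without `.eli`) is still visible."""
    present = set(listing)
    result = {}
    for member in members:
        if member not in present:
            base = member[:-4] if member.endswith(".eli") else member
            result[member] = base in present
    return result


print(missing_members(LISTING, POOL_MEMBERS))
```

Here the only missing member is the `dfe7bb5b-...` device, and its plain partition is still listed.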
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
You haven't run any long tests on that drive. Run a long test on it and report back with the results. Also post your complete hardware specs.
 

zeezoo

Dabbler
Joined
Sep 4, 2017
Messages
12
A long test will take over 500 minutes.

I've been using an HP Compaq 8200 Elite Microtower with an i7-2600S since July 2017. Standard PSU.

16GB RAM (4x4GB)

Initially it had 5x WD 3TB Blue in RAID-Z2, but a few drives started developing Current_Pending_Sector counts and I got them replaced under RMA. Now:

ada0: WD4TB Blue
ada1: WD4TB Red
ada2: WD3TB Blue
ada3: WD4TB Blue
ada4: WD3TB Blue

I built this system for home use, mostly to keep the data safe and secure with RAID-Z2, plus occasional backups from macOS. A few plugins are active: transmission, sickrage, emby, owncloud. There are 2 VMs, but they're off most of the time. Actually, that's another issue: Win10 in a VM keeps asking for activation, and TeamViewer has to be set up after each restart. It wasn't like that with VirtualBox on FreeNAS 9, hence it is off most of the time.

Regards,
zeezoo
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Load cycle count is rather high on the drive you posted. You need to run wdidle3 on all those blue drives and change the head park timer to something more appropriate. Search function here should point you in the right direction.
 

zeezoo

Dabbler
Joined
Sep 4, 2017
Messages
12
Well, it is actually high on all my drives :/
Will my pool be destroyed after using wdidle3?

Regards,
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
No, it's a utility you run on the drives to change the head park timer. It has nothing to do with your pool.
 

zeezoo

Dabbler
Joined
Sep 4, 2017
Messages
12
Alright. So I should run it only for BLUE drives? I've read there's some trouble with RED EZRX... Mine is EFRX though.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
If the red drive has a high load cycle count then check that one too.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
First of all, I thought the OP was adding more info about his problem, but it's a new person with their own issue. Please don't do this without clearly stating that you are posting on an old thread.

My advice is to set your head loading timer to 300 seconds; this will reduce the count significantly.
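
Some rough arithmetic on why a 300 second timer helps, as a Python sketch. It assumes the heads only unload after the timer's worth of idle time, so the timer caps the number of cycles per hour; that model and the function name are my simplification. The observed figures are the Load_Cycle_Count and Power_On_Hours from the smartctl output posted above.

```python
def max_cycles_per_hour(timer_seconds):
    """Upper bound on load cycles per hour, assuming one unload needs at
    least `timer_seconds` of idle time first (a simplifying assumption)."""
    return 3600 / timer_seconds


# Load_Cycle_Count / Power_On_Hours from the smartctl output above.
observed = 247374 / 2789
bound_300 = max_cycles_per_hour(300)

print(f"observed so far:   {observed:.1f} cycles/hour")
print(f"300 s timer bound: {bound_300:.0f} cycles/hour")
print(f"bound per year:    {bound_300 * 24 * 365:.0f} cycles")
```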

To address your main problem, first of all I would make sure you have all your data backed up.
Next, your drive (S/N WCC4E4JNCUU3) appears to be fine; however, as previously asked, run a SMART long/extended test on it and then post the output. There is a link in my signature to help troubleshoot hard drive failures; take a look at it. BTW, you should be running routine SMART long tests on your hard drives. I run a long test once a week and short tests daily.

Next, let's assume the hard drive comes back looking okay. I'd run "badblocks" on the suspect drive for at least one pass. This will do two things: 1) verify the drive can record and read data, and 2) wipe the drive.
Once that first pass is done, you can abort badblocks when it reaches the second test pattern (0x55). Now you can resilver the drive into the pool.
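
To illustrate what one destructive write/verify pass does, here is a toy Python sketch that runs against a scratch file, not a disk. It is not badblocks and does not bypass the OS cache, so it only shows the idea: fill with a pattern (0xaa is the first pattern badblocks uses in write mode), read back, and compare. All names are hypothetical.

```python
import os
import tempfile


def write_verify_pass(path, size, pattern=0xAA, block=4096):
    """Overwrite the first `size` bytes of `path` with `pattern`, read them
    back, and return the indexes of blocks that do not match."""
    buf = bytes([pattern]) * block
    bad = []
    with open(path, "r+b") as f:
        # Write phase: this destroys whatever the file held before.
        for offset in range(0, size, block):
            f.write(buf[: min(block, size - offset)])
        f.flush()
        os.fsync(f.fileno())
        # Verify phase: read every block back and compare with the pattern.
        f.seek(0)
        for index, offset in enumerate(range(0, size, block)):
            expected = buf[: min(block, size - offset)]
            if f.read(len(expected)) != expected:
                bad.append(index)
    return bad


def demo():
    """Run one pass over a 10,000-byte scratch file and clean up afterwards."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\x00" * 10000)
        scratch = tmp.name
    try:
        return write_verify_pass(scratch, 10000)
    finally:
        os.remove(scratch)


print(demo())
```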
 

zeezoo

Dabbler
Joined
Sep 4, 2017
Messages
12
Hello again,

Thanks for all the replies. Really helpful. I ran wdidle3 on all drives and disabled the timer completely.
I also set APM to 254 as advised in another thread. I didn't know that was the issue killing my drives :/
The one that died last night was practically 3-4 months old...

Smartctl found some issues with the drive in question, so I replaced it.
Resilvering in progress...

My most critical data is being backed up weekly to my old WD NAS :)

Now one final question: should I wait until resilvering is done, or can I generate a new key and re-encrypt the pool right away? It is not mentioned anywhere... I usually waited until it was done, but what happens in the event of a power outage?

[edit]
The new drive, while it is being resilvered, already has a Load_Cycle_Count of 457... Isn't that too quick?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
zeezoo said: "I also set APM to 254 as advised in another thread."
A value of 128 or higher should be fine, but 254 is fine too.

zeezoo said: "I ran wdidle3 on all drives and disabled the timer completely."
That will likely be just fine. I ran my WD Reds for 5 years with the timer disabled.

zeezoo said: "Now one final question: should I wait until resilvering is done, or can I generate a new key and re-encrypt the pool right away? It is not mentioned anywhere... I usually waited until it was done, but what happens in the event of a power outage?"
Too bad you are using an encrypted pool, but I'd do it now. As for a power failure, I hope you have a UPS on your system; if not, I highly recommend you get one.
 

zeezoo

Dabbler
Joined
Sep 4, 2017
Messages
12
Alright. All done.

The problem still exists though...

Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Blue
Device Model:	 WDC WD40EZRZ-00WN9B0
Serial Number:	WD-WCC4E6UV271U
LU WWN Device Id: 5 0014ee 20db12a80
Firmware Version: 80.00A80
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Tue Jan  2 15:28:02 2018 +04
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)   Offline data collection activity
				   was suspended by an interrupting command from host.
				   Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)   The previous self-test routine completed
				   without error or no self-test has ever
				   been run.
Total time to complete Offline
data collection:		(52320) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
				   Auto Offline data collection on/off support.
				   Suspend Offline collection upon new
				   command.
				   Offline surface scan supported.
				   Self-test supported.
				   Conveyance Self-test supported.
				   Selective Self-test supported.
SMART capabilities:			(0x0003)   Saves SMART data before entering
				   power-saving mode.
				   Supports SMART auto save timer.
Error logging capability:		(0x01)   Error logging supported.
				   General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 ( 523) minutes.
Conveyance self-test routine
recommended polling time:	 (   5) minutes.
SCT capabilities:		   (0x7035)   SCT Status supported.
				   SCT Feature Control supported.
				   SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   253   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       20
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   197   197   000    Old_age   Always       -       10066
194 Temperature_Celsius     0x0022   110   109   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        19         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


10066 after just one day... and it grew by over 1,000 in the last 2 hours...
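
Putting those numbers side by side, as a quick Python sketch. The lifetime figures are Load_Cycle_Count and Power_On_Hours from the report above; the recent figure takes "over 1,000 in 2 hours" as a lower bound. Power_On_Hours is reported in whole hours, so treat the results as approximate. The function name is made up.

```python
def cycles_per_hour(delta_cycles, delta_hours):
    """Average load cycles per hour over the given span."""
    return delta_cycles / delta_hours


lifetime = cycles_per_hour(10066, 20)  # whole life of the new drive so far
recent = cycles_per_hour(1000, 2)      # the last 2 hours, as a lower bound

print(f"lifetime average: {lifetime:.1f} cycles/hour")
print(f"last 2 hours:     at least {recent:.0f} cycles/hour")
```

The recent rate is about the same as the lifetime average, which is consistent with the timer change not having taken effect on this drive yet.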
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
After reprogramming the timer, did you power off the hard drives and then power them back on? I don't recall if this is required, but maybe on your drives it is. Also, if this does not work, try setting the timer to 300 seconds.
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
I was under the impression that deactivating the head loading timer was not a good idea, and that setting the time to 300 seconds was the proper thing to do. Had a Green drive that lasted for 6.5 years.

Maybe deactivating the timer does not work on all drives. Set it for 300 seconds (no longer) and see if that results in more reasonable cycle counts.
 
Joined
May 10, 2017
Messages
838
I usually disable it and have never had a drive fail to do so; wdidle3 outputs the command result. That said, changing it to 300 seconds is usually good enough to lower the LCC.
 