RAIDZ Volume Degraded

Argi · Oct 14, 2017

Hi,

I've done some searching on degraded volumes but it's not clear to me if my drive actually needs to be replaced or if there's something wrong with the RAIDz itself.

How do I recover from this? The drive, ada4, seems to be reporting healthy from a SMART perspective but from the RAIDZ side it's degraded. Details follow.

Thanks.

Code:

[root@freenas ~]# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h7m with 0 errors on Tue Oct 10 03:52:50 2017
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2       ONLINE       0     0     0

errors: No known data errors

  pool: storage
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 9h19m with 0 errors on Sun Sep 10 09:19:23 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/dda4d6d5-d527-11e6-a23b-000c29fb52cd  ONLINE       0     0     0
            gptid/df8e9648-d527-11e6-a23b-000c29fb52cd  ONLINE       0     0     0
            gptid/e17c1e7d-d527-11e6-a23b-000c29fb52cd  ONLINE       0     0     0
            gptid/e3997407-d527-11e6-a23b-000c29fb52cd  ONLINE       0     0     0
            gptid/e50e0577-d527-11e6-a23b-000c29fb52cd  DEGRADED     0     0   169  too many errors
        logs
          gptid/e5fcf548-d527-11e6-a23b-000c29fb52cd    ONLINE       0     0     0

errors: No known data errors

[root@freenas ~]# smartctl -a /dev/ada4

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   208   181   021    Pre-fail  Always       -       6575
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       74
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   194   192   000    Old_age   Always       -       649
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6868
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       58
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       45
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       200
194 Temperature_Celsius     0x0022   122   112   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

Chris Moore · Oct 14, 2017

From the abbreviated SMART results, it is hard to tell what might be wrong with the drive, but it has definitely been ejected from the pool.
If it were me, I would replace it for now and run diagnostics on the drive later. But if it is under warranty, which it may well be, you might want to send it in for a replacement drive.
Since you have a RAID-z1 pool, you don't have any room for error though, so go ahead and get a replacement drive in there as soon as you can.
It would be really great if you could share more information about your hardware though, like it says here:
https://forums.freenas.org/index.php?threads/updated-forum-rules-4-11-17.45124/

joeschmuck · Oct 14, 2017

When was the last time you ran an Extended test on ada4? Had you posted the full output of your SMART data, I would have been able to see it. Also, post your system specs per the forum rules so we can give you the best help we can in the fewest messages.

You can run zpool clear to clear those errors and then run another scrub to see what happens.
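As a rough sketch of that clear-then-scrub sequence, the Python snippet below builds the three commands and prints them. It only runs them if `dry_run` is turned off. The helper names `clear_and_scrub_commands` and `run` are made up for illustration; the pool name `storage` and the gptid are taken from the OP's `zpool status` output. Note that `zpool scrub` returns immediately and the scrub continues in the background, so the final `zpool status` shows progress rather than a finished result.

```python
import subprocess


def clear_and_scrub_commands(pool, device=None):
    """Build the command sequence: clear error counters, scrub, show status.

    `device` is optional; without it, `zpool clear` resets the counters
    for every device in the pool.
    """
    clear = ["zpool", "clear", pool]
    if device:
        clear.append(device)
    return [clear, ["zpool", "scrub", pool], ["zpool", "status", "-v", pool]]


def run(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Pool and device names come from the zpool status output in this thread.
    run(clear_and_scrub_commands(
        "storage", "gptid/e50e0577-d527-11e6-a23b-000c29fb52cd"))
```

If the CKSUM counter climbs again after the scrub completes, that points back at the drive or something in its path (cable, port, controller).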

I see you have a log drive, out of curiosity what are you using it for?

Chris Moore · Oct 14, 2017

Argi said:
something wrong with the RAIDz

No, you either have a bad drive or maybe some other hardware in the chain, but it is difficult to tell since you didn't give hardware details.
The thing you should share with us is the full output of smartctl -x /dev/ada4 so we can see all the data the drive returns. It might give some answers.
Looks like this:

Code:

root@Emily-NAS:~ # smartctl -x /dev/da9
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Barracuda 7200.14 (AF)
Device Model:	 ST2000DM001-1ER164
Firmware Version: CC25
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	7200 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sat Oct 14 18:14:32 2017 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:	 128 (minimum power consumption without standby)
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(   89) seconds.
Offline data collection
capabilities:					(0x73) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										No Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   1) minutes.
Extended self-test routine
recommended polling time:		( 226) minutes.
Conveyance self-test routine
recommended polling time:		(   2) minutes.
SCT capabilities:			  (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 POSR--   114   099   006	-	78156640
  3 Spin_Up_Time			PO----   096   096   000	-	0
  4 Start_Stop_Count		-O--CK   099   099   020	-	1299
  5 Reallocated_Sector_Ct   PO--CK   100   100   010	-	0
  7 Seek_Error_Rate		 POSR--   078   060   030	-	74102168
  9 Power_On_Hours		  -O--CK   092   092   000	-	7323
 10 Spin_Retry_Count		PO--C-   100   100   097	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   020	-	99
183 Runtime_Bad_Block	   -O--CK   100   100   000	-	0
184 End-to-End_Error		-O--CK   100   100   099	-	0
187 Reported_Uncorrect	  -O--CK   100   100   000	-	0
188 Command_Timeout		 -O--CK   100   100   000	-	0 0 0
189 High_Fly_Writes		 -O-RCK   098   098   000	-	2
190 Airflow_Temperature_Cel -O---K   070   058   045	-	30 (Min/Max 26/33)
191 G-Sense_Error_Rate	  -O--CK   100   100   000	-	0
192 Power-Off_Retract_Count -O--CK   100   100   000	-	64
193 Load_Cycle_Count		-O--CK   099   099   000	-	3815
194 Temperature_Celsius	 -O---K   030   042   000	-	30 (0 13 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000	-	0
198 Offline_Uncorrectable   ----C-   100   100   000	-	0
199 UDMA_CRC_Error_Count	-OSRCK   200   200   000	-	0
240 Head_Flying_Hours	   ------   100   253   000	-	7076h+31m+42.358s
241 Total_LBAs_Written	  ------   100   253   000	-	29995977657
242 Total_LBAs_Read		 ------   100   253   000	-	625253974282
							||||||_ K auto-keep
							|||||__ C event count
							||||___ R error rate
							|||____ S speed/performance
							||_____ O updated online
							|______ P prefailure warning

General Purpose Log Directory Version 1
SMART		   Log Directory Version 1 [multi-sector log support]
Address	Access  R/W   Size  Description
0x00	   GPL,SL  R/O	  1  Log Directory
0x01		   SL  R/O	  1  Summary SMART error log
0x02		   SL  R/O	  5  Comprehensive SMART error log
0x03	   GPL	 R/O	  5  Ext. Comprehensive SMART error log
0x06		   SL  R/O	  1  SMART self-test log
0x07	   GPL	 R/O	  1  Extended self-test log
0x09		   SL  R/W	  1  Selective self-test log
0x10	   GPL	 R/O	  1  SATA NCQ Queued Error log
0x11	   GPL	 R/O	  1  SATA Phy Event Counters log
0x21	   GPL	 R/O	  1  Write stream error log
0x22	   GPL	 R/O	  1  Read stream error log
0x30	   GPL,SL  R/O	  9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log
0xa1	   GPL,SL  VS	  20  Device vendor specific log
0xa2	   GPL	 VS	4496  Device vendor specific log
0xa8	   GPL,SL  VS	 129  Device vendor specific log
0xa9	   GPL,SL  VS	   1  Device vendor specific log
0xab	   GPL	 VS	   1  Device vendor specific log
0xb0	   GPL	 VS	5176  Device vendor specific log
0xbe-0xbf  GPL	 VS   65535  Device vendor specific log
0xc0	   GPL,SL  VS	   1  Device vendor specific log
0xc1	   GPL,SL  VS	  10  Device vendor specific log
0xe0	   GPL,SL  R/W	  1  SCT Command/Status
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	  7319		 -
# 2  Short offline	   Completed without error	   00%	  7313		 -
# 3  Extended offline	Completed without error	   00%	  7309		 -
# 4  Short offline	   Completed without error	   00%	  7301		 -
# 5  Extended offline	Completed without error	   00%	  7295		 -
# 6  Short offline	   Completed without error	   00%	  7289		 -
# 7  Extended offline	Completed without error	   00%	  7284		 -
# 8  Short offline	   Completed without error	   00%	  7277		 -
# 9  Extended offline	Completed without error	   00%	  7271		 -
#10  Short offline	   Completed without error	   00%	  7265		 -
#11  Extended offline	Completed without error	   00%	  7260		 -
#12  Short offline	   Completed without error	   00%	  7253		 -
#13  Extended offline	Completed without error	   00%	  7247		 -
#14  Short offline	   Completed without error	   00%	  7241		 -
#15  Extended offline	Completed without error	   00%	  7236		 -
#16  Short offline	   Completed without error	   00%	  7229		 -
#17  Extended offline	Completed without error	   00%	  7223		 -
#18  Short offline	   Completed without error	   00%	  7217		 -
#19  Extended offline	Completed without error	   00%	  7212		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:				  3
SCT Version (vendor specific):	   522 (0x020a)
SCT Support Level:				   1
Device State:						Active (0)
Current Temperature:					29 Celsius
Power Cycle Min/Max Temperature:	 26/33 Celsius
Lifetime	Min/Max Temperature:	 13/42 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Data Table command not supported

SCT Error Recovery Control command not supported

Device Statistics (GP/SMART Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID	  Size	 Value  Description
0x000a  2			1  Device-to-host register FISes sent due to a COMRESET
0x0001  2			0  Command failed due to ICRC error
0x0003  2			0  R_ERR response for device-to-host data FIS
0x0004  2			0  R_ERR response for host-to-device data FIS
0x0006  2			0  R_ERR response for device-to-host non-data FIS
0x0007  2			0  R_ERR response for host-to-device non-data FIS

root@Emily-NAS:~ #

joeschmuck · Oct 14, 2017

Chris Moore said:
but it has definitely been ejected from the pool

Ejected? I don't see FAULTED, OFFLINE, or other key words to indicate that the drive is not attached to the pool. If I have this wrong, please correct me. Right now I only see degraded.

Chris Moore · Oct 14, 2017

joeschmuck said:
Ejected?

Ejected is not a technical description of the status, it is merely my way of indicating that FreeNAS doesn't like that drive.

joeschmuck · Oct 14, 2017

Chris Moore said:
Ejected is not a technical description of the status, it is merely my way of indicating that FreeNAS doesn't like that drive.

As an engineer I do take things literally often. I really want a shot of tequila.

MrToddsFriends · Oct 14, 2017

Not sure if OpenZFS has some more status keywords than those enumerated here:

ONLINE, DEGRADED, FAULTED, OFFLINE, UNAVAIL, REMOVED

Otherwise it's probably a good idea not to use "private" extensions.
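For anyone who wants to check those states from a script, here is a minimal Python sketch that scans the config table of `zpool status` text and reports every vdev that is not ONLINE or has a non-zero error counter. The sample text is trimmed from the OP's output. The function name `find_problem_vdevs` is hypothetical, and the parsing assumes the column layout shown in this thread with plain integer counters (abbreviated counts such as `1.2K` would be skipped).

```python
import re

# Device states as listed in this thread.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

# Trimmed copy of the config table from the OP's `zpool status` output.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        storage                                         DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/dda4d6d5-d527-11e6-a23b-000c29fb52cd  ONLINE       0     0     0
            gptid/e50e0577-d527-11e6-a23b-000c29fb52cd  DEGRADED     0     0   169  too many errors
        logs
          gptid/e5fcf548-d527-11e6-a23b-000c29fb52cd    ONLINE       0     0     0
"""

# name, state, READ, WRITE, CKSUM, then an optional trailing note.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+(.*))?$")


def find_problem_vdevs(status_text):
    """Return (name, state, (read, write, cksum), note) for suspicious rows."""
    problems = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match or match.group(2) not in STATES:
            continue  # header, "logs" separator, or unrelated line
        name, state, read, write, cksum, note = match.groups()
        counts = (int(read), int(write), int(cksum))
        if state != "ONLINE" or any(counts):
            problems.append((name, state, counts, (note or "").strip()))
    return problems


if __name__ == "__main__":
    for name, state, counts, note in find_problem_vdevs(SAMPLE):
        print(name, state, counts, note)
```

On the sample, this reports the pool, the raidz1-0 vdev, and the one disk carrying the 169 checksum errors. The disk is marked DEGRADED, not FAULTED or REMOVED, so it is still attached to the pool.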

Chris Moore · Oct 14, 2017

MrToddsFriends said:
Not sure if OpenZFS has some more status keywords than those enumerated here:

ONLINE, DEGRADED, FAULTED, OFFLINE, UNAVAIL, REMOVED

Otherwise it's probably a good idea not to use "private" extensions.

It's not a private extension. I didn't put code tags on it or something.
I wasn't trying to quote chapter and verse out of a book. I was trying to get a point across.
Don't be so literal.

MrToddsFriends · Oct 14, 2017

Chris Moore said:
Don't be so literal.

Most likely I was too literal (sorry for that, non-native speaker). It was meant as just a little push to avoid confusing terms, combined with a link to the docs that list the valid pool state values and explain them, hopefully also useful for the OP and others reading this thread in the future.

Now back to the original problem, for which I can't give a definite answer right now. Let's see how the OP answers the questions already asked (full smartctl output, why/what for the log device, ...). Hint: Non-zero raw values in 01 Read Error Rate and 07 Seek Error Rate are bad signs, at least for drives from some manufacturers (WD, for example).
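To make that hint concrete, here is a small Python sketch that pulls the raw values out of a `smartctl -a` attribute table and reports the watched attributes that are non-zero. The sample rows are copied from the OP's ada4 output. The function name `nonzero_watched` and the default watch list (IDs 1 and 7, the two named above) are assumptions for illustration, not a general health rule. The Seagate sample earlier in the thread shows very large raw values for both attributes alongside a clean self-test log, so the check only makes sense per manufacturer.

```python
# Rows copied from the OP's `smartctl -a /dev/ada4` attribute table.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   194   192   000    Old_age   Always       -       649
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""

# Assumption: only the two attributes named in the post above are watched.
WATCHED_IDS = (1, 7)


def nonzero_watched(table_text, watched=WATCHED_IDS):
    """Map attribute name -> raw value for watched attributes that are non-zero."""
    hits = {}
    for line in table_text.splitlines():
        fields = line.split()
        # A data row has 10 columns and starts with the numeric attribute ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in watched and raw.isdigit() and int(raw) != 0:
            hits[name] = int(raw)
    return hits


if __name__ == "__main__":
    print(nonzero_watched(SAMPLE))
```

Passing a different `watched` tuple, for example `(5, 197)`, checks the reallocated and pending sector counts instead, both of which are zero in the OP's output.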

joeschmuck · Oct 15, 2017

MrToddsFriends said:
Not sure if OpenZFS has some more status keywords than those enumerated here:

ONLINE, DEGRADED, FAULTED, OFFLINE, UNAVAIL, REMOVED

Otherwise it's probably a good idea not to use "private" extensions.

Thanks for the link, it's good to refresh my poor memory.

@Chris Moore The only reason this got any attention is that you never know who is on the other end of the problem and what they are thinking, so we do our best to be exact. I think we have definitely cleared up any potential for confusion. The one thing I got out of this thread was that good link by @MrToddsFriends explaining the definitions of the messages, a very useful tool.

Now we wait for the OP to respond, and it's time to feed the animal.

Jailer · Oct 15, 2017

Chris Moore said:
Don't be so literal.

When you are trying to help troubleshoot a user's problem, words matter. Using the term "ejected" implies that the drive is no longer part of the pool. In this case it does matter what you say, because drive replacement is done differently if the drive is detached from the pool.

Johnnie Black · Oct 15, 2017

It doesn't make much sense to me that the disk is DEGRADED. The pool being DEGRADED makes sense, but shouldn't the disk be OFFLINE or UNAVAILABLE?

Jailer · Oct 15, 2017

No, the pool can be degraded with the disk still online because of errors such as the OP's.

Johnnie Black · Oct 15, 2017

Jailer said:
No, the pool can be degraded with the disk still online because of errors such as the OP's.

Sorry, still doesn't make sense to me, and I can't find anything in the ZFS documentation about a degraded device, only degraded pools.

joeschmuck · Oct 15, 2017

Johnnie Black said:
Sorry, still doesn't make sense to me,

Hey, it's ZFS. You have tried to figure out the capacity of a drive using the different methods that seem to conflict, right? Some things are just easier to accept and then you move on.

rs225 · Oct 15, 2017

Since this is WD, I would also agree the read error rate and seek error rate indicate this is a failing drive. It is pretty rare to see genuine checksum errors coming from the disk itself, but that is what it looks like, and that is why ZFS has flagged it DEGRADED.

Just to speculate, this could be a case of a drive that returns the wrong sectors or has been writing to the wrong sectors, so the drive says "ECC okay" but ZFS knows better. The drive could also be vibrating too much.

joeschmuck · Oct 16, 2017

rs225 said:
Since this is WD

What makes you say that? The OP only posted once here and I didn't see any reference to WD.
