SOLVED Update from 11.0-U4 to 11.1 now reports DEGRADED status on pool

André Fettouhi · Dec 15, 2017

I have been running FreeNAS on my file server for almost a year now. I originally installed 9.10 and updated to 11.0 after going to Corral and reverting back. No issues so far. Today I noticed that 11.1 is out and updated my 11.0-U4 install. After booting I logged in and saw a critical notice that my freenaspool is flagged as HEALTHY, but that some critical errors have occurred and no applications are affected. About 10 minutes later I got another critical report stating that the pool is now flagged as DEGRADED.

Since I haven't done much with FreeNAS since installing it, I am unsure how to proceed here. Everything has just been running without any issues for me, besides the Corral debacle. After getting the DEGRADED notice I decided to reboot FreeNAS again, and now it reports the pool as healthy??? Should I be worried that one of my 4 Western Digital Red drives might be dying? How do I check a drive's health with FreeNAS? Is this possible directly from the GUI, or do I need to run commands in the terminal?

dlavigne · Dec 15, 2017

Please post the output of zpool status within code tags.

André Fettouhi · Dec 15, 2017

dlavigne said:
Please post the output of zpool status within code tags.

Code:

action: Enable all features using 'zpool upgrade'. Once this is done,
		the pool may no longer be accessible by software that does not support
		the features. See zpool-features(7) for details.
  scan: scrub repaired 0 in 0 days 02:50:41 with 0 errors on Sun Dec  3 02:50:41 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		freenaspool									 ONLINE	   0	 0	 0
		  raidz2-0									  ONLINE	   0	 0	 0
			gptid/21b36631-c4a3-11e6-afcd-0cc47ae0446b  ONLINE	   0	 0	 0
			gptid/227a936a-c4a3-11e6-afcd-0cc47ae0446b  ONLINE	   0	 0	 0
			gptid/233a9cf2-c4a3-11e6-afcd-0cc47ae0446b  ONLINE	   0	 0	 0
			gptid/2414cf96-c4a3-11e6-afcd-0cc47ae0446b  ONLINE	   0	 0	 0

errors: No known data errors

André Fettouhi · Dec 15, 2017

After the reboot the server has been running for less than an hour, and it now reports:

Code:

CRITICAL: 15. December 2017 16:25 - The volume freenaspool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

I ran zpool status again as above and it reports the same.

I can also see this

Code:

Dec 15 16:33:52 freenas daemon[2982]:	 2017/12/15 16:33:52 [WARN] agent: Check 'service:nas-health' is now warning

every 2 minutes.

danb35 · Dec 15, 2017

André Fettouhi said:
I ran zpool status again as above and it reports the same.

What is the complete output of zpool status?

André Fettouhi · Dec 15, 2017

danb35 said:
What is the complete output of zpool status?

That is the complete output I get when I type in zpool status into the shell in the FreeNAS web client.

danb35 · Dec 15, 2017

Then enable SSH and use a terminal, because that is not the complete output.

Edit: Or, if the Shell that comes up in the web GUI still lets you do it, use a larger window size so it isn't cutting off part of the output.

André Fettouhi · Dec 15, 2017

André Fettouhi said:
That is the complete output I get when I type in zpool status into the shell in the FreeNAS web client.

Code:

FreeBSD 11.1-STABLE (FreeNAS.amd64) #0 r321665+d4625dcee3e(freenas/11.1-stable): Wed Dec 13 16:33:42 UTC 2017

		FreeNAS (c) 2009-2017, The FreeNAS Development Team
		All rights reserved.
		FreeNAS is released under the modified BSD license.

		For more information, documentation, help or support, go here:
		http://freenas.org
Welcome to FreeNAS
[freenasuser@freenas ~]$ zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:24:41 with 0 errors on Tue Dec 12 04:09:41 2017
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  da0p2	 ONLINE	   0	 0	 0

errors: No known data errors

  pool: freenaspool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
		attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
		using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0 days 02:50:41 with 0 errors on Sun Dec  3 02:50:41 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		freenaspool									 ONLINE	   0	 0	 0
		  raidz2-0									  ONLINE	   0	 0	 0
			gptid/21b36631-c4a3-11e6-afcd-0cc47ae0446b  ONLINE	   0	 0	 0
			gptid/227a936a-c4a3-11e6-afcd-0cc47ae0446b  ONLINE	   0	 0	 1
			gptid/233a9cf2-c4a3-11e6-afcd-0cc47ae0446b  ONLINE	   0	 0	 0		
			gptid/2414cf96-c4a3-11e6-afcd-0cc47ae0446b  ONLINE	   0	 0	 0		
																						  
errors: No known data errors

danb35 · Dec 15, 2017

Much better (though putting it in code tags rather than cmd tags would be better yet--code deals better with multi-line text). And it isn't the same as you previously posted; the second disk is showing a checksum error where it wasn't before (the action: message is also different, but not especially relevant). That means the system read data from that disk that didn't match its checksum. Before messing around with the disk, check your cable connections, making sure they're securely connected both on the motherboard and on the disk. If you're still seeing errors on that disk, check out https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/ for some troubleshooting guidance.
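To make the difference between the two outputs easier to spot, here is a minimal Python sketch that scans `zpool status` text for rows with a non-zero READ, WRITE, or CKSUM counter. The sample text is excerpted from the output posted above; the function name `devices_with_errors` is hypothetical and not part of any ZFS or FreeNAS tooling.

```python
# Sketch: list devices with non-zero error counters in `zpool status` text.
# SAMPLE is an excerpt of the output posted in this thread.
# devices_with_errors is a hypothetical helper, not a ZFS/FreeNAS API.

SAMPLE = """\
  pool: freenaspool
 state: ONLINE
config:

    NAME                                            STATE     READ WRITE CKSUM
    freenaspool                                     ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/21b36631-c4a3-11e6-afcd-0cc47ae0446b  ONLINE       0     0     0
        gptid/227a936a-c4a3-11e6-afcd-0cc47ae0446b  ONLINE       0     0     1
        gptid/233a9cf2-c4a3-11e6-afcd-0cc47ae0446b  ONLINE       0     0     0
        gptid/2414cf96-c4a3-11e6-afcd-0cc47ae0446b  ONLINE       0     0     0

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True          # counters start after the header row
            continue
        if line.startswith("errors:"):
            in_config = False         # end of the device table
        if not in_config or len(fields) < 5:
            continue
        name, _state, read, write, cksum = fields[:5]
        if not (read.isdigit() and write.isdigit() and cksum.isdigit()):
            continue                  # skip rows whose counters are not plain integers
        counters = (int(read), int(write), int(cksum))
        if any(counters):
            flagged[name] = counters
    return flagged


print(devices_with_errors(SAMPLE))
```

On the sample above this flags only the second disk, the one with a single checksum error.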

Ericloewe · Dec 15, 2017

danb35 said:
(though putting it in code tags rather than cmd tags would be better yet--code deals better with multi-line text

I'm just glad a forum update made it start preserving white space. Holy crap, it makes our lives so much easier, I can even just add the tags to a post that's missing them and it comes out neatly formatted! It's amazing, ditching stupid ancient forum behavior is like getting clean water after years of sticking with contaminated water on purpose because "that's the way we've been doing it for a while".
Probably a stupid vBulletin legacy thing, like most of XenForo's stupidities.

Hell, I just swapped those CMD tags for CODE tags and everything looks right. This shouldn't be making me nearly as happy as it is, but look where we are now...

André Fettouhi · Dec 15, 2017

danb35 said:
Much better (though putting it in code tags rather than cmd tags would be better yet--code deals better with multi-line text). And it isn't the same as you previously posted; the second disk is showing a checksum error where it wasn't before (the action: message is also different, but not especially relevant). That means the system read data from that disk that didn't match its checksum. Before messing around with the disk, check your cable connections, making sure they're securely connected both on the motherboard and on the disk. If you're still seeing errors on that disk, check out https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/ for some troubleshooting guidance.

OK I can see that this is the drive with the checksum error

gptid/227a936a-c4a3-11e6-afcd-0cc47ae0446b ONLINE 0 0 1

How do I relate it to the /dev/adaX so that I can run the smartctl command?

Ericloewe · Dec 15, 2017

You quoted a link that will tell you how to proceed. To immediately narrow it down to the offending drive, try this:
https://forums.freenas.org/index.php?resources/identify-your-drives-by-serial-number.64/

Or this, if you feel like adding a small script to your collection of useful scripts:
https://forums.freenas.org/index.ph...cation-with-converting-gptid-to-geom-name.60/
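For a rough idea of what such a lookup involves: on FreeBSD, `glabel status` lists each gptid label next to the partition it belongs to. The Python sketch below parses that kind of output and strips the partition suffix to get a device path for smartctl. The sample text is hypothetical (only the 227a936a label mapping to ada2 is confirmed in this thread, and the `p2` partition numbers are assumed), and `gptid_to_disk` is an illustrative name, not the script from the links above.

```python
import re

# Hypothetical `glabel status` output; partition numbers are assumed.
SAMPLE_GLABEL = """\
                                      Name  Status  Components
gptid/21b36631-c4a3-11e6-afcd-0cc47ae0446b     N/A  ada1p2
gptid/227a936a-c4a3-11e6-afcd-0cc47ae0446b     N/A  ada2p2
"""


def gptid_to_disk(glabel_text):
    """Map each gptid label to its whole-disk device path, e.g. /dev/ada2."""
    mapping = {}
    for line in glabel_text.splitlines():
        fields = line.split()
        # Expect exactly: label, status, component; skip the header and other labels.
        if len(fields) != 3 or not fields[0].startswith("gptid/"):
            continue
        label, _status, component = fields
        # Drop a trailing partition suffix such as "p2" to get the disk itself.
        mapping[label] = "/dev/" + re.sub(r"p\d+$", "", component)
    return mapping


disks = gptid_to_disk(SAMPLE_GLABEL)
print(disks["gptid/227a936a-c4a3-11e6-afcd-0cc47ae0446b"])
```

The resulting path is what you would pass to `smartctl -a`.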

danb35 · Dec 15, 2017

...or just go to the Volume Status page in the web GUI.

André Fettouhi · Dec 15, 2017

danb35 said:
...or just go to the Volume Status page in the web GUI.

Thanks, it seems to be ada2 that is reporting the checksum error. I am running a SMART test on that drive now. BTW, I have set up SSH access to my FreeNAS server, but my user can't execute smartctl. How do I change that?

BigDave · Dec 15, 2017

Ericloewe said:
This shouldn't be making me nearly as happy as it is, but look where we are now...

I feel you have worked hard (coming up on four years) to make this a more worthwhile place, and because I have not seen anyone take the time...
THANK YOU ERIC FOR YOUR TIME AND EFFORTS, I APPRECIATE ALL YOU DO FOR THE MEMBERS HERE.

André Fettouhi · Dec 15, 2017

Here is the output of smartctl -a /dev/ada2:

Code:

root@freenas:~ # smartctl -a /dev/ada2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD20EFRX-68EUZN0
Serial Number:	WD-WCC4M1VKAEV6
LU WWN Device Id: 5 0014ee 2b81408ef
Firmware Version: 82.00A82
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Fri Dec 15 18:13:09 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (  41) The self-test routine was interrupted
										by the host with a hard or soft reset.
Total time to complete Offline
data collection:				(26940) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 272) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x703d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   188   175   021	Pre-fail  Always	   -	   3558
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   22
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   087   087   000	Old_age   Always	   -	   9544
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   22
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   13
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   125
194 Temperature_Celsius	 0x0022   113   106   000	Old_age   Always	   -	   34
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   100   253   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive	Interrupted (host reset)	  90%	  9544		 -
# 2  Extended offline	Aborted by host			   90%	  9544		 -
# 3  Extended offline	Aborted by host			   90%	  9544		 -
# 4  Extended offline	Aborted by host			   90%	  9543		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

danb35 · Dec 15, 2017

There's nothing obviously wrong with the SMART data, other than that the disk has never completed a SMART test in its life.
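As an illustration of what "nothing obviously wrong" can mean in practice, the Python sketch below checks the raw values of a few attributes that are commonly watched: 5, 197 and 198 for bad sectors, and 199 for CRC errors, which usually point at cabling. The choice of attributes is an assumption for this example, not an official rule, and `nonzero_watched` is a hypothetical helper. The sample rows are taken from the smartctl output posted above.

```python
# Attribute IDs watched in this sketch (an assumption, not an official list).
WATCHED = {5, 197, 198, 199}

# Rows copied from the smartctl -a output posted in this thread.
SAMPLE_ATTRS = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9544
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""


def nonzero_watched(attr_text):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    hits = {}
    for line in attr_text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) != 0:
            hits[name] = int(raw)
    return hits


print(nonzero_watched(SAMPLE_ATTRS))
```

For this disk the result is empty, which matches the assessment above. It says nothing about the self-test log, which still shows no completed test.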

André Fettouhi · Dec 15, 2017

danb35 said:
There's nothing obviously wrong with the SMART data, other than that the disk has never completed a SMART test in its life.

Hence running

smartctl -t long /dev/ada2

now. Seems to take 272 minutes to complete.
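The 272 minutes is the drive's recommended polling time for the extended self-test from the smartctl output above, so it works out to about 4 hours 32 minutes. A small Python sketch of the arithmetic follows; the start time is a hypothetical value chosen for the example, and `estimated_finish` is an illustrative name.

```python
from datetime import datetime, timedelta


def estimated_finish(start, polling_minutes):
    """Earliest time the self-test can be expected to finish."""
    return start + timedelta(minutes=polling_minutes)


# Hypothetical start time; 272 is the extended self-test polling time reported by the drive.
start = datetime(2017, 12, 15, 18, 13)
print(estimated_finish(start, 272))  # 2017-12-15 22:45:00
```

The earlier tests in the self-test log stopped at 90% remaining, so the server needs to stay up for the whole period for this one to complete.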

Jailer · Dec 15, 2017

André Fettouhi said:
but my user can't execute smartctl. How do I change that?

Switch to root. su followed by the root password.

Your smart test was interrupted. What hardware are you running this on?

André Fettouhi · Dec 15, 2017

Jailer said:
Switch to root. su followed by the root password.

Your smart test was interrupted. What hardware are you running this on?

Intel(R) Core(TM) i3-6100 CPU @ 3.70GHz

with 32 GB RAM and four 2 TB WD Red disks.
