Degraded ...all disks!

Status
Not open for further replies.

skyline65

Explorer
Joined
Jul 18, 2014
Messages
95
I just received an email warning from FreeNAS during a routine scrub.

The volume Server (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

So I ran zpool status... and all 4 disks seem to have problems!

  scan: scrub in progress since Fri Dec 2 04:00:27 2016
        4.85T scanned out of 8.78T at 96.8M/s, 11h48m to go
        192M repaired, 55.29% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        Server                                          DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     9
            gptid/eeb55e39-ad79-11e5-98b5-001b78574ec8  DEGRADED     0     0   737  too many errors (repairing)
            gptid/efaf45f6-ad79-11e5-98b5-001b78574ec8  DEGRADED     0     0   749  too many errors (repairing)
            gptid/f0972a6e-ad79-11e5-98b5-001b78574ec8  DEGRADED     0     0   735  too many errors (repairing)
            gptid/f17e3830-ad79-11e5-98b5-001b78574ec8  DEGRADED     0     0   851  too many errors (repairing)

errors: No known data errors
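
One way to read the status output above: all four disks show checksum errors, and the counts are fairly close to each other. The sketch below extracts the CKSUM column and applies a rough heuristic for "errors spread evenly across every disk", which would be consistent with a shared component (controller, cable, PSU) rather than four independent disk failures. The function names and the 25% spread threshold are assumptions made for illustration; they are not part of ZFS or FreeNAS.

```python
import re

# Device lines copied from the `zpool status` output in this post.
STATUS = """\
Server DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 9
gptid/eeb55e39-ad79-11e5-98b5-001b78574ec8 DEGRADED 0 0 737 too many errors (repairing)
gptid/efaf45f6-ad79-11e5-98b5-001b78574ec8 DEGRADED 0 0 749 too many errors (repairing)
gptid/f0972a6e-ad79-11e5-98b5-001b78574ec8 DEGRADED 0 0 735 too many errors (repairing)
gptid/f17e3830-ad79-11e5-98b5-001b78574ec8 DEGRADED 0 0 851 too many errors (repairing)
"""

# NAME STATE READ WRITE CKSUM; only plain integer counters are handled here
# (abbreviated counters such as "1.2K" would need extra parsing).
LINE = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def cksum_by_disk(text):
    """Return {device: CKSUM count} for the leaf devices (gptid/... entries)."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match and match.group(1).startswith("gptid/"):
            counts[match.group(1)] = int(match.group(5))
    return counts


def looks_like_shared_fault(counts, spread=0.25):
    """Heuristic only (an assumption, not a ZFS rule): every disk has errors
    and each count lies within `spread` of the mean count."""
    if not counts or min(counts.values()) == 0:
        return False
    mean = sum(counts.values()) / len(counts)
    return all(abs(c - mean) <= spread * mean for c in counts.values())


counts = cksum_by_disk(STATUS)
shared = looks_like_shared_fault(counts)
print(counts)
print("errors spread evenly across all disks:", shared)
```

With the figures from this pool the heuristic returns True, which fits the suspicion about the card or cable but does not prove it.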

I'm running a 6-disk RAIDZ2 and a 4-disk RAIDZ2, all on WD Green disks that have had WDIDLE used on them.
Supermicro X9 board, Xeon E3 1220, 16 GB RAM, Dell PERC H300 (flashed).

I'm pretty sure it isn't the card, unless one of the channels has gone bad, as 2 of the disks from the 6-disk RAIDZ2 are running off it. I'm wondering whether the cable has gone bad... or all 4 disks are dying simultaneously.

As the scrub is halfway through, what should I do? Shut down, replace the cable and start another scrub?

Thanks


I just ran
smartctl -qnoserial -x /dev/
and all the drives passed.
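
As a sanity check on the scrub line quoted earlier in this post, the remaining time can be recomputed from the scanned size, total size and rate. This is only arithmetic on the posted figures; it assumes the T and M suffixes printed by zpool are binary units (TiB and MiB), and the variable names are made up for the example.

```python
# Figures taken from the scrub line of the `zpool status` output.
scanned_tib = 4.85
total_tib = 8.78
rate_mib_s = 96.8

MIB_PER_TIB = 1024 * 1024  # assumption: zpool's T and M are powers of 1024

# Data still to be read by the scrub, converted to the unit of the rate.
remaining_mib = (total_tib - scanned_tib) * MIB_PER_TIB
eta_seconds = remaining_mib / rate_mib_s

hours, rem = divmod(int(eta_seconds), 3600)
minutes = rem // 60
percent_done = 100 * scanned_tib / total_tib

print(f"{hours}h{minutes:02d}m to go, {percent_done:.1f}% done")
```

The result lands within a minute or so of the reported 11h48m and close to the reported 55.29%; the small differences come from the rounded inputs. So the scrub had roughly half a day left to run at that rate.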
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Can be the controller, can be the cable(s), can be the PSU, ...

Or it can be the drives (highly unlikely though...).

Can you provide the output (in between code tags) of smartctl -a /dev/... for your drives please?

And can you post the list of your hardware too?
 

skyline65

Explorer
Joined
Jul 18, 2014
Messages
95
I just had a thought... last week I put in a new SSD as my main Final Cut Pro drive and named it Final_Cut_HD. I renamed the old Final_Cut_HD to Final_Cut_Archive_HD.
All these drives get synced automatically to the Final Cut Backup dataset on the server. However, on the FreeNAS server I decided to rename the old Final_Cut_HD to Final_Cut_Archive_HD and create a new Final_Cut_HD folder within the Final Cut Backup dataset. Do you think this messing about has confused FreeNAS and its checksums?

I'm wondering whether it is worth rolling back to a snapshot from before I started messing about and then syncing with the fresh data?
 

skyline65

Explorer
Joined
Jul 18, 2014
Messages
95
Supermicro X9SCi-LN4F, Xeon E3 1220, 16 GB ECC Crucial RAM, Seasonic G550 PSU, Dell PERC H300 flashed to IT 20.
 

skyline65

Explorer
Joined
Jul 18, 2014
Messages
95
Code:

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Green
Device Model:     WDC WD30EZRX-00D8PB0
Serial Number:    WD-WCC4N5UPT88D
LU WWN Device Id: 5 0014ee 20c01de0b
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Dec  2 18:58:22 2016 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 241) Self-test routine in progress...
                                        10% of test remaining.
Total time to complete Offline
data collection:                (41160) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 413) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   185   181   021    Pre-fail  Always       -       5733
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       998
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6963
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       55
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       12
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1007
194 Temperature_Celsius     0x0022   121   111   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6806         -
# 2  Short offline       Completed without error       00%      6651         -
# 3  Extended offline    Completed without error       00%      6620         -
# 4  Short offline       Completed without error       00%      6476         -
# 5  Short offline       Completed without error       00%      6312         -
# 6  Extended offline    Completed without error       00%      6252         -
# 7  Short offline       Completed without error       00%      6072         -
# 8  Short offline       Completed without error       00%      5904         -
# 9  Extended offline    Completed without error       00%      5867         -
#10  Short offline       Completed without error       00%      5742         -
#11  Short offline       Completed without error       00%      5582         -
#12  Extended offline    Completed without error       00%      5499         -
#13  Short offline       Completed without error       00%      5208         -
#14  Short offline       Completed without error       00%      5161         -
#15  Short offline       Completed without error       00%      5077         -
#16  Extended offline    Completed without error       00%      4973         -
#17  Short offline       Completed without error       00%      4893         -
#18  Short offline       Completed without error       00%      4725         -
#19  Extended offline    Completed without error       00%      4638         -
#20  Extended offline    Completed without error       00%      4387         -
#21  Short offline       Completed without error       00%      4190         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


All the drives have passed.

I used your excellent scripts to provide me with 6-hourly reports...
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
This smells like your controller or cables have gone bad.

 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
No, and even if it did, it would not create checksum errors (which are detected and corrected by ZFS, so what you do with the shares has no effect on them).

Your hardware sounds perfectly fine, no problem there. Just to confirm: you have 6 HDDs in this server?

Even if a drive passes a SMART test it is not necessarily good, because the tests are usually very permissive. However, I don't see anything wrong with this drive.

Which tells me that "last week I put in a new SSD" is the cause. You've likely moved the SATA/SAS cables when you did that and since then you have checksum errors. Try to reseat both ends of the cables and see if it fixes the problem ;)
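
To illustrate the "nothing wrong with this drive" reading of the SMART output posted earlier, here is a minimal sketch that looks at the raw values of a few attributes often watched on spinning disks. The choice of attributes (5, 197, 198, 199) and the function name are assumptions for the example, not a rule from smartmontools; a clean result says nothing about cables, the controller or the PSU.

```python
# Selected rows copied from the SMART attribute table posted in this thread.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# Attribute IDs to watch; this selection is an assumption of the sketch.
WATCHED = {5, 197, 198, 199}


def nonzero_watched(table):
    """Return {attribute name: raw value} for watched attributes whose raw
    value is non-zero. Expects the 10-column layout printed by smartctl -a."""
    flagged = {}
    for line in table.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip headers and anything that is not an attribute row
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) != 0:
            flagged[name] = int(raw)
    return flagged


flagged = nonzero_watched(SMART_ROWS)
print(flagged or "no watched attribute has a non-zero raw value")
```

For the posted drive nothing is flagged, which matches the reading above. UDMA_CRC_Error_Count is also zero here, so SMART on its own neither confirms nor rules out a cabling problem.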
 

skyline65

Explorer
Joined
Jul 18, 2014
Messages
95
No, the SSD was put in my desktop machine, not the server.

6 x 3 TB RAIDZ2 for media: all fine, with 2 of its drives running off the Dell card.
4 x 3 TB RAIDZ2 for backup: all 4 drives running from the Dell card.

It could also be that it has been getting cold in England, so it's about 17-19 °C in my study. When I built the server it was summer, so maybe they just need reseating.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Ah...

10 drives on 550 W is pushing it a bit but should pass given you have a good quality PSU.

The drive is at 29 °C so it's perfect, not a temp issue here. Humidity can play a part, however.

I hope it's not the controller...

Just in case: do you have a backup of the data?
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Errors due to port shrinkage, now that's funny, I don't care who you are...
 

skyline65

Explorer
Joined
Jul 18, 2014
Messages
95
Well, it is in bits on the table. I have a virtually identical server sitting under it, so I'm taking the cable out of that one and will try that... If not, I can always put the drives in one of my Microservers and reimport them if the controller has gone wrong. Luckily I have 2 N54s and a G8 sitting unused.

I used the cable that came with the card and it doesn't have locking SATA connectors... my other ones do, so maybe that could be the issue.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Yeah I don't know why they don't put the locking tab on every cable... well, I know: money. So, my rule is: if you put a SATA cable in a server then make sure it has locking tabs ;)

One of the big advantages of software RAID is that the controller can die but you can still access your data with another controller or server without any problem.
 

skyline65

Explorer
Joined
Jul 18, 2014
Messages
95
Well, I removed the drives, put them in my G8 server and ran another scrub. Yes, some files were knackered. So I deleted the offending files and ran another scrub. Now everything shows no errors. I do suspect the cable or port on the card had gone bad. I did realise that some of the data on the server isn't backed up... doh! So that is getting burned to M-Disc soon.

Is it worth erasing the setup and starting from scratch? I have noticed that some files and folders are locked and cannot be deleted from the Finder on my Mac. If I log in with FTP software I can delete the offending files/folders. I'm wondering if a fresh start with correct permissions would solve the problem. Any thoughts?
 

skyline65

Explorer
Joined
Jul 18, 2014
Messages
95
Not really. If FreeNAS says 0 errors I'm quite happy. Luckily I had copies of the corrupted files. This was set up a while ago, and I think I may have set it up initially with Unix permissions and then changed to Windows, hence a few weird locked files and folders when accessing from my Mac. It really isn't an issue to start afresh, and it is only about 1.5 TB of data to copy over.


 