Drive removed from pool

Status
Not open for further replies.

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,478
I'm at a bit of a roadblock after having done all the testing I can think to do on this problem.

About a day ago I got an email that one of my scheduled SMART tests failed to initiate because a disk was "offline". The warning email arrived during the night, so nothing had been changed and I wasn't fiddling with the hardware in any way.

Here is what I have done:
1) I identified the drive that was missing from the pool by its serial number.
2) I shut down the box and made sure the cables were all plugged in firmly.
3) Powered back on and confirmed by feel that the drive was spinning, which means it is getting power. Still not showing up.
4) Shut down again, went to the store, bought a brand new SATA cable, and installed it.
5) Powered back up, still nothing.
6) Powered down and plugged the offline drive into a different motherboard SATA port.
7) Powered on and still nothing.

So at this point, have I missed any troubleshooting steps? Does this leave it down to the motherboard? I can't think of what would be causing this. Could the drive just be that dead? I have regular scrubs plus long and short SMART tests scheduled (mirroring cyberjock's published schedule very closely), so I would have thought there would have been warning signs.

Puzzled. Would appreciate any help.

EDIT: here is the output of zpool status:
Code:
[root@tank] ~# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 2.79M in 0h0m with 0 errors on Thu Nov  3 15:56:28 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/549fa8c7-f836-11e5-83b3-d05099c06c0a  ONLINE       0     0     0
            17982504159631691677                        UNAVAIL      0     0     0  was /dev/gptid/55637770-f836-11e5-83b3-d05099c06c0a
            gptid/56276481-f836-11e5-83b3-d05099c06c0a  ONLINE       0     0     0
            gptid/56e58c6a-f836-11e5-83b3-d05099c06c0a  ONLINE       0     0     0
            gptid/57a1f4ab-f836-11e5-83b3-d05099c06c0a  ONLINE       0     0     0
            gptid/584fac5b-f836-11e5-83b3-d05099c06c0a  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h5m with 0 errors on Thu Oct  6 03:50:56 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        freenas-boot                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            da0p2                                       ONLINE       0     0     0
            gptid/bc97d97b-374a-11e6-bd6e-d050996ca84a  ONLINE       0     0     0

errors: No known data errors
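
For anyone who wants to script this check, here is a minimal Python sketch that picks the missing member out of `zpool status` text. It works on an abridged copy of the output above. The function name `missing_devices` and the set of states treated as "missing" are assumptions made for illustration, not part of FreeNAS or ZFS.

```python
import re

# Abridged copy of the `zpool status` output posted above.
STATUS_TEXT = """\
  pool: tank
 state: DEGRADED
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/549fa8c7-f836-11e5-83b3-d05099c06c0a  ONLINE       0     0     0
            17982504159631691677                        UNAVAIL      0     0     0  was /dev/gptid/55637770-f836-11e5-83b3-d05099c06c0a
            gptid/56276481-f836-11e5-83b3-d05099c06c0a  ONLINE       0     0     0

errors: No known data errors
"""

# Assumption: these states mean the device itself is not usable.
# DEGRADED is left out because the pool and vdev rows also carry it.
MISSING_STATES = {"UNAVAIL", "FAULTED", "OFFLINE", "REMOVED"}

# One config row: name, state, three error counters, optional "was <old path>".
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+\S+\s+\S+\s+\S+(?:\s+was\s+(\S+))?\s*$")


def missing_devices(status_text):
    """Return (name, state, previous_path) for every row in a missing state."""
    found = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match and match.group(2) in MISSING_STATES:
            found.append(match.groups())
    return found


if __name__ == "__main__":
    for name, state, previous in missing_devices(STATUS_TEXT):
        print(name, state, previous)
```

The "was" path is the useful part: it gives the gptid the pool expected, which is the same label passed to `zpool online` further down.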


Also, here is the output of running "zpool online":
Code:
[root@tank] ~# zpool online tank gptid/55637770-f836-11e5-83b3-d05099c06c0a
warning: device 'gptid/55637770-f836-11e5-83b3-d05099c06c0a' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
When you say "not showing up", does that mean you looked at all the drives and counted them and you are missing say ada1?

Could you post the output of dmesg | grep ada?
 

nojohnny101

When you say "not showing up", does that mean you looked at all the drives and counted them and you are missing say ada1?
Thanks for your help. What I mean specifically is that under "View Disks" in the GUI the drive is not there: only 5 disks are listed (it is a 6-drive RAIDZ2). I checked this against the list of serial numbers I keep for the drives in the box, and one is not listed.

Here is the output of "dmesg | grep ada"
Code:
[root@tank] ~# dmesg | grep ada
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <ST3000VN000-1HJ166 SC60> ACS-2 ATA SATA 3.x device
ada0: Serial Number W7307M5T
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors)
ada0: Previously was known as ad4
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <ST3000VN000-1HJ166 SC60> ACS-2 ATA SATA 3.x device
ada1: Serial Number W7307P53
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 2861588MB (5860533168 512 byte sectors)
ada1: Previously was known as ad6
ada2 at ahcich3 bus 0 scbus3 target 0 lun 0
ada2: <ST3000VN000-1HJ166 SC60> ACS-2 ATA SATA 3.x device
ada2: Serial Number W7307SAT
ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 2861588MB (5860533168 512 byte sectors)
ada2: Previously was known as ad10
ada3 at ahcich4 bus 0 scbus4 target 0 lun 0
ada3: <ST3000VN000-1HJ166 SC60> ACS-2 ATA SATA 3.x device
ada3: Serial Number W7307MV8
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 2861588MB (5860533168 512 byte sectors)
ada3: Previously was known as ad12
ada4 at ahcich5 bus 0 scbus5 target 0 lun 0
ada4: <ST3000VN000-1HJ166 SC60> ACS-2 ATA SATA 3.x device
ada4: Serial Number W7307NT2
ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada4: Command Queueing enabled
ada4: 2861588MB (5860533168 512 byte sectors)
ada4: Previously was known as ad14
GEOM_ELI: Device ada0p1.eli created.
GEOM_ELI: Device ada4p1.eli created.
GEOM_ELI: Device ada1p1.eli created.
GEOM_ELI: Device ada2p1.eli created.
GEOM_ELI: Device ada3p1.eli created.
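
The serial-number cross-check described above can be sketched in a few lines of Python. This is only an illustration: the dmesg text is trimmed to the relevant lines from the output above, and the sixth entry in `EXPECTED` is a placeholder, since the failed drive's serial number is not given in this thread. The function names are hypothetical.

```python
import re

# Relevant lines trimmed from the `dmesg | grep ada` output above.
DMESG_TEXT = """\
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: Serial Number W7307M5T
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: Serial Number W7307P53
ada2 at ahcich3 bus 0 scbus3 target 0 lun 0
ada2: Serial Number W7307SAT
ada3 at ahcich4 bus 0 scbus4 target 0 lun 0
ada3: Serial Number W7307MV8
ada4 at ahcich5 bus 0 scbus5 target 0 lun 0
ada4: Serial Number W7307NT2
"""

SERIAL = re.compile(r"^(ada\d+): Serial Number (\S+)$", re.M)
CHANNEL = re.compile(r"^(ada\d+) at (ahcich\d+) ", re.M)


def detected_serials(text):
    """Map device name -> serial number for every disk the kernel reported."""
    return dict(SERIAL.findall(text))


def detected_channels(text):
    """Map device name -> AHCI channel it was found on."""
    return dict(CHANNEL.findall(text))


def missing_serials(expected, text):
    """Return the expected serial numbers that do not appear in the dmesg text."""
    found = set(detected_serials(text).values())
    return [serial for serial in expected if serial not in found]


# Hypothetical inventory: five real serials from above plus a placeholder
# standing in for the failed drive, whose serial is not posted here.
EXPECTED = ["W7307M5T", "W7307P53", "W7307SAT", "W7307MV8", "W7307NT2",
            "PLACEHOLDER-SIXTH-DRIVE"]

if __name__ == "__main__":
    print(missing_serials(EXPECTED, DMESG_TEXT))
    print(detected_channels(DMESG_TEXT))
```

The channel map also shows that nothing was detected on ahcich2 at this boot. Since the drive was moved between ports during testing, that gap only says where the drive was sitting when it failed to answer, not that the port itself is bad.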
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Sounds to me like the drive is dead. Happens.

You can try plugging it into your Windows box and see if it sees the device.
 

joeschmuck

Sounds to me like the drive is dead. Happens.
I agree. You only have the five drives showing up and no error messages for the failed drive, so it's not communicating at all. I would suspect the drive's interface board is dead. As said above, try it in another computer just to see if it magically comes to life, but I'd say you should RMA the drive.
 

nojohnny101

Thanks guys. I will plug it into a computer tomorrow and see if it pops up.

I was just scratching my head though, because I've never had a drive die without warning like that. Just poof. Damn.
 

Stux

This of course is why you are using ZFS and RAIDZ2 :)
 

nojohnny101

Just thought I would provide an update. I removed the ghost drive and put it into a USB dock, powered it on, and heard it spin up. Then I heard two violent noises that sounded like the heads seeking, after which it powered itself down and stopped spinning.

I am going to RMA it tomorrow. Still strange how I got no warning signs, thank the FreeNAS overlords for raidz2!
 

joeschmuck

Yup, definitely not good anymore.

It sounds like the electronics died. You will typically get some early warnings of mechanical failures like sector error counts but when it comes to the circuit board failing, well this is what happens.

I'm curious about something... Could you post the output for one of your other hard drives that you installed about the same time as the one which failed? smartctl -a /dev/ada1

I just want to see how that drive is looking, in case something is going on that might affect the other drives. Hopefully it was just the single drive, but it doesn't hurt to take a look.
 

nojohnny101

Sure, thanks for the inquiry @joeschmuck. All of the drives in this vdev were purchased at the same time (March 2016) from the same vendor.

Code:
[root@tank] ~# smartctl -a /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate NAS HDD
Device Model:     ST3000VN000-1HJ166
Serial Number:    W7307P53
LU WWN Device Id: 5 000c50 09aa2bfb4
Firmware Version: SC60
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Nov  5 21:24:29 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection:		 (   97) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	 (   1) minutes.
Extended self-test routine
recommended polling time:	 ( 373) minutes.
Conveyance self-test routine
recommended polling time:	 (   2) minutes.
SCT capabilities:		   (0x10bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail  Always       -       88722208
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       42
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       23712926
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       5225
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       47
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   092   092   000    Old_age   Always       -       8
190 Airflow_Temperature_Cel 0x0022   073   060   045    Old_age   Always       -       27 (Min/Max 23/27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       16
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       58
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always       -       27 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   198   000    Old_age   Always       -       5

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      5206         -
# 2  Short offline       Completed without error       00%      5135         -
# 3  Short offline       Completed without error       00%      5039         -
# 4  Short offline       Completed without error       00%      4967         -
# 5  Short offline       Completed without error       00%      4895         -
# 6  Extended offline    Completed without error       00%      4877         -
# 7  Short offline       Completed without error       00%      4823         -
# 8  Short offline       Completed without error       00%      4751         -
# 9  Short offline       Completed without error       00%      4679         -
#10  Short offline       Completed without error       00%      4607         -
#11  Extended offline    Completed without error       00%      4565         -
#12  Short offline       Completed without error       00%      4535         -
#13  Short offline       Completed without error       00%      4463         -
#14  Short offline       Completed without error       00%      4391         -
#15  Short offline       Completed without error       00%      4319         -
#16  Short offline       Completed without error       00%      4247         -
#17  Short offline       Completed without error       00%      4175         -
#18  Extended offline    Completed without error       00%      4157         -
#19  Short offline       Completed without error       00%      4105         -
#20  Short offline       Completed without error       00%      4033         -
#21  Short offline       Completed without error       00%      3961         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
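
As a quick sanity check on the numbers above, the Power_On_Hours raw value can be turned into a date. This is a rough Python sketch that assumes the raw value counts whole hours and that the drive was never powered off, which is not quite true here (Power_Cycle_Count is 47), so the result is only the latest date the drive could first have been powered on. The function names are made up for illustration.

```python
from datetime import datetime, timedelta

# Values copied from the smartctl output above.
LOCAL_TIME = datetime(2016, 11, 5, 21, 24, 29)  # "Local Time is" line
POWER_ON_HOURS = 5225                           # attribute 9, RAW_VALUE
LAST_SHORT_TEST_HOURS = 5206                    # self-test log entry # 1


def latest_first_power_on(now, power_on_hours):
    """Latest possible first power-on time, assuming no powered-off periods."""
    return now - timedelta(hours=power_on_hours)


def hours_since(current_hours, event_hours):
    """Powered-on hours elapsed since an event recorded in the self-test log."""
    return current_hours - event_hours


if __name__ == "__main__":
    print(latest_first_power_on(LOCAL_TIME, POWER_ON_HOURS))
    print(POWER_ON_HOURS / 24, "days powered on")
    print(hours_since(POWER_ON_HOURS, LAST_SHORT_TEST_HOURS), "hours since last short test")
```

That works out to roughly 218 days of power-on time, with a first power-on no later than early April 2016, which is in line with a March 2016 purchase.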
 

joeschmuck

Thankfully I don't see anything to worry about, meaning no indicators of pending doom, but then again you didn't have any indicators for the failed drive either. Let's hope it was just a single drive failure.
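
For reference, the kind of check described here can be sketched in Python. This is a simplified illustration, not a health verdict: the list of watched attribute IDs (5, 187, 197, 198) is an assumption about which raw counters are worth flagging when nonzero, and the sample rows are a subset of the table posted above. The function names are hypothetical.

```python
# A subset of the attribute rows from the smartctl output above.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   073   060   045    Old_age   Always       -       27 (Min/Max 23/27)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   198   000    Old_age   Always       -       5
"""

# Assumption: a nonzero raw value on any of these is worth a closer look.
WATCHED_RAW = {5, 187, 197, 198}


def parse_attributes(text):
    """Parse smartctl attribute rows into {id: {...}} using whitespace splitting."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = {
                "name": fields[1],
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": int(fields[9]),  # first token only, e.g. "27" of "27 (Min/Max ...)"
            }
    return attrs


def warnings(attrs):
    """Return attribute names that look suspicious under the assumptions above."""
    flagged = []
    for attr_id, attr in sorted(attrs.items()):
        below_threshold = attr["thresh"] > 0 and attr["value"] <= attr["thresh"]
        nonzero_watched = attr_id in WATCHED_RAW and attr["raw"] > 0
        if below_threshold or nonzero_watched:
            flagged.append(attr["name"])
    return flagged


if __name__ == "__main__":
    print(warnings(parse_attributes(SMART_ROWS)))
```

On these rows nothing is flagged, which matches the read above. The UDMA_CRC_Error_Count of 5 is deliberately not in the watched set for this sketch; whether to alert on it is a judgment call.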
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
I was just scratching my head though because I've never had a drive die without warning like that. just poof. damn.

Actually, that was also how one of my drives died a year ago. Ironically, I received my usual emails with the SMART and pool info in the morning of the same day the drive died, and all was perfectly good. In the afternoon the drive died: no warning, nothing. It even kept the server from POSTing (and my desktop too when I tried plugging the drive into it).

It's not often but sometimes it happens.
 

Bidule0hm

Yep ;)

SMART data just before it died (mid-october 2015):

Code:
########## SMART status report for da3 drive (Seagate NAS HDD: W300HNNQ) ##########

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   120   099   006    Pre-fail  Always       -       241386352
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       74
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       32173363
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       10450
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       74
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   090   090   000    Old_age   Always       -       10
190 Airflow_Temperature_Cel 0x0022   072   054   045    Old_age   Always       -       28 (Min/Max 26/31)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       38
193 Load_Cycle_Count        0x0032   092   092   000    Old_age   Always       -       16683
194 Temperature_Celsius     0x0022   028   046   000    Old_age   Always       -       28 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     10426         -
 