SOLVED Disk UNAVAIL

Status: Not open for further replies.

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
I got a mail from my FreeNAS box:
Code:
Checking status of zfs pools:
NAME           SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bhoot           29T  21.6T  7.35T         -    39%    74%  1.00x  DEGRADED  /mnt
freenas-boot  14.2G  1.05G  13.2G         -      -     7%  1.00x    ONLINE  -

  pool: bhoot
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
		the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 132h59m with 0 errors on Fri Jan  6 16:59:14 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        bhoot                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/5663b940-bdde-11e5-9e00-f07959376c84  ONLINE       0     0     0
            10479856730608632472                        UNAVAIL      0     0     0  was /dev/gptid/cd427285-e4d8-11e4-b39d-f07959376c84
            gptid/ec0f7827-2d2c-11e6-b1de-f07959376c84  ONLINE       0     0     0
            gptid/ce06b19f-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/ce69a75d-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/b1f3389f-5382-11e6-885d-f07959376c84  ONLINE       0     0     0
            gptid/cf2dd08e-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/cf91d6e8-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0

errors: No known data errors

-- End of daily output --


Woke up the next morning and checked the basics of the FreeNAS box. One of the disks showed a really long (about 20-digit) number in place of its gptid, with the status UNAVAIL. I also tried reseating the connections, to no avail. Not sure why, but the disk just dropped out of the array. I am ready with a cold spare, so that's not an issue. I am just wondering if I was supposed to get a few alerts before this happened.
I did get a few outputs over SSH:
Code:
[root@freenas] ~# zpool status -v
  pool: bhoot
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
  the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
  see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 132h59m with 0 errors on Fri Jan  6 16:59:14 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        bhoot                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/5663b940-bdde-11e5-9e00-f07959376c84  ONLINE       0     0     0
            10479856730608632472                        UNAVAIL      0     0     0  was /dev/gptid/cd427285-e4d8-11e4-b39d-f07959376c84
            gptid/ec0f7827-2d2c-11e6-b1de-f07959376c84  ONLINE       0     0     0
            gptid/ce06b19f-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/ce69a75d-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/b1f3389f-5382-11e6-885d-f07959376c84  ONLINE       0     0     0
            gptid/cf2dd08e-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/cf91d6e8-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Wed Jan  4 03:46:18 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        freenas-boot                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/40460acb-cf27-11e5-b12b-f07959376c84  ONLINE       0     0     0
            da1p2                                       ONLINE       0     0     0

errors: No known data errors

Code:
[root@freenas] ~# camcontrol devlist
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus1 target 0 lun 0 (ada0,pass0)
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus4 target 0 lun 0 (ada1,pass1)
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus5 target 0 lun 0 (ada2,pass2)
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus6 target 0 lun 0 (ada3,pass3)
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus7 target 0 lun 0 (ada4,pass4)
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus8 target 0 lun 0 (ada5,pass5)
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus9 target 0 lun 0 (ada6,pass6)
<SanDisk Ultra Fit 1.00>  at scbus11 target 0 lun 0 (pass7,da0)
<SanDisk Ultra Fit 1.00>  at scbus12 target 0 lun 0 (pass8,da1)

Code:
[root@freenas] ~# gpart show
=>  34  30031183  da0  GPT  (14G)
  34  1024  1  bios-boot  (512k)
  1058  6  - free -  (3.0k)
  1064  30030152  2  freebsd-zfs  (14G)
  30031216  1  - free -  (512B)

=>  34  30031183  da1  GPT  (14G)
  34  1024  1  bios-boot  (512k)
  1058  6  - free -  (3.0k)
  1064  30030152  2  freebsd-zfs  (14G)
  30031216  1  - free -  (512B)

=>  34  7814037101  ada0  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

=>  34  7814037101  ada1  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

=>  34  7814037101  ada2  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

=>  34  7814037101  ada3  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

=>  34  7814037101  ada4  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

=>  34  7814037101  ada5  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

=>  34  7814037101  ada6  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

Code:
[root@freenas] ~# glabel status
  Name  Status  Components
gptid/403b7529-cf27-11e5-b12b-f07959376c84  N/A  da0p1
gptid/40460acb-cf27-11e5-b12b-f07959376c84  N/A  da0p2
gptid/9623e4df-cf29-11e5-a539-f07959376c84  N/A  da1p1
gptid/cf91d6e8-e4d8-11e4-b39d-f07959376c84  N/A  ada0p2
gptid/b1f3389f-5382-11e6-885d-f07959376c84  N/A  ada1p2
gptid/5663b940-bdde-11e5-9e00-f07959376c84  N/A  ada2p2
gptid/ec0f7827-2d2c-11e6-b1de-f07959376c84  N/A  ada3p2
gptid/ce06b19f-e4d8-11e4-b39d-f07959376c84  N/A  ada4p2
gptid/ce69a75d-e4d8-11e4-b39d-f07959376c84  N/A  ada5p2
gptid/cf2dd08e-e4d8-11e4-b39d-f07959376c84  N/A  ada6p2


//edit: the GUI only shows a Replace option when the said disk is selected.
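
For anyone hitting the same symptom, here is roughly how the gptid that zpool reports can be traced back to a physical disk. This is a sketch using the device names from the outputs above, so substitute your own:
Code:
# Map gptid labels to partitions; zpool's gptid/... entries
# correspond to adaXp2 partitions here
glabel status

# If the UNAVAIL member's old gptid is missing from glabel output,
# check whether the disk is even visible on the bus
camcontrol devlist

# Once the adaX device is known, read its serial number so the
# right physical drive can be pulled from the case
smartctl -i /dev/ada1 | grep -i 'serial number'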
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Sometimes a drive just dies, no warnings, nothing.
 

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
Sometimes a drive just dies, no warnings, nothing.
The surprising thing is that the mail alert came in the same minute as the security report. I just went and checked the SMART reports that come to me by mail. The disk had ~14,000 hours on it. So yes, I do understand, but that's nowhere close to the MTBF WD claims for Reds.
Code:
########## SMART status report summary for all drives ##########

+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+-------+----+
|Device|Serial         |Temp|Power|Start|Spin |ReAlloc|Current|Offline |UDMA  |Seek  |High  |Command|Last|
|      |               |    |On   |Stop |Retry|Sectors|Pending|Uncorrec|CRC   |Errors|Fly   |Timeout|Test|
|      |               |    |Hours|Count|Count|       |Sectors|Sectors |Errors|      |Writes|Count  |Age |
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+-------+----+
|ada0  |WD-WCC4EXXXXXX |  36|14033|  170|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada1  |WD-WCC4EXXXXXX |  36| 4203|    9|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada2  |WD-WCC4EXXXXXX |  37| 7882|   33|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada3  |WD-WCC4EXXXXXX |  38| 5371|   15|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada4  |WD-WCC4EXXXXXX |  38|14033|  176|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada5  |WD-WCC4EXXXXXX |  36|14033|  177|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada6  |WD-WCC4EXXXXXX |  37|14033|  177|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada7  |               |    |     |     |     |       |       |        |      |   N/A|   N/A|    N/A|   0|
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+-------+----+
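
That summary comes from one of the community report scripts; as a rough sketch (not the actual script), the same attributes can be pulled per drive with smartctl, something like:
Code:
#!/bin/sh
# Sketch: print serial, power-on hours (attribute 9), and the usual
# failure-predicting attributes: reallocated (5), pending (197),
# offline-uncorrectable (198), UDMA CRC errors (199)
for d in ada0 ada1 ada2 ada3 ada4 ada5 ada6; do
  echo "=== ${d} ==="
  smartctl -i /dev/${d} | grep 'Serial Number'
  smartctl -A /dev/${d} | awk '$1==5 || $1==9 || $1==197 || $1==198 || $1==199'
done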
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
It happens; I had one drive fail on me last year, but none before and none since.

I guess yours is still under warranty, so you can RMA it; it shouldn't be a big deal.
 

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
I have RMA'd about 5-6 hard disks already. Not sure why, but I am getting a lot of pending (unreadable) sectors. I am just scared of losing 2 of them together. :/
 

snaptec

Guru
Joined
Nov 30, 2015
Messages
502
Could you please post the complete HW/SW setup and how old it is? How old are your drives? How many do you have?


Sent from my iPhone using Tapatalk
 

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
I just realized I get emails every night from my server, and I really thought that one of the disks was kaput. For most of the mail scripts I have to thank @Bidule0hm and @cyberjock. However, I just noticed a mail I had skipped a few days ago.
Mail at 0301 local on 24/01/17:
Code:
Checking status of zfs pools:
NAME           SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bhoot           29T  21.3T  7.65T         -    38%    73%  1.00x  DEGRADED  /mnt
freenas-boot  14.2G  1.05G  13.2G         -      -     7%  1.00x    ONLINE  -

  pool: bhoot
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
		the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 132h59m with 0 errors on Fri Jan  6 16:59:14 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        bhoot                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/5663b940-bdde-11e5-9e00-f07959376c84  ONLINE       0     0     0
            10479856730608632472                        UNAVAIL      0     0     0  was /dev/gptid/cd427285-e4d8-11e4-b39d-f07959376c84
            gptid/ec0f7827-2d2c-11e6-b1de-f07959376c84  ONLINE       0     0     0
            gptid/ce06b19f-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/ce69a75d-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/b1f3389f-5382-11e6-885d-f07959376c84  ONLINE       0     0     0
            gptid/cf2dd08e-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/cf91d6e8-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0

errors: No known data errors

-- End of daily output --

Then another mail at 1310 on 24/01/17
Code:
The volume bhoot (ZFS) state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

Next mail at 1352 on 24/01/17
Code:
The volume bhoot (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.


No disks have been replaced. I did, however, remove and reseat the cables, but that was on the day this thread was created. I am not sure what suddenly happened to the server/disk. Before that there wasn't any option to online the disk; as you can see, it was shown as UNAVAIL (not OFFLINE).
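
For reference, the CLI equivalents look roughly like this. A sketch only: the pool name, old-member ID, and gptid are taken from the outputs above, the new-partition gptid is a placeholder, and on FreeNAS the GUI Replace button is the supported path since it also recreates the swap partition:
Code:
# If the device only dropped off the bus and is back, try onlining it
zpool online bhoot gptid/cd427285-e4d8-11e4-b39d-f07959376c84

# If the disk is genuinely dead, replace the missing member
# (placeholder gptid; on FreeNAS prefer the GUI Replace button)
zpool replace bhoot 10479856730608632472 gptid/<new-partition-gptid>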

Now the output of zpool status -v
Code:
[root@freenas] ~# zpool status -v
  pool: bhoot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 43.5G in 0h43m with 0 errors on Tue Jan 24 13:51:54 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        bhoot                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/5663b940-bdde-11e5-9e00-f07959376c84  ONLINE       0     0     0
            gptid/cd427285-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     3
            gptid/ec0f7827-2d2c-11e6-b1de-f07959376c84  ONLINE       0     0     0
            gptid/ce06b19f-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/ce69a75d-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/b1f3389f-5382-11e6-885d-f07959376c84  ONLINE       0     0     0
            gptid/cf2dd08e-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/cf91d6e8-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Fri Jan 20 03:46:22 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        freenas-boot                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/40460acb-cf27-11e5-b12b-f07959376c84  ONLINE       0     0     0
            da1p2                                       ONLINE       0     0     0

errors: No known data errors



There is a figure of 3 checksum errors. Does this qualify for RMA? I am still waiting for my previous disk to return from RMA; that should take another week or so.
I read the topic related to this error. It says the error can occur because:
  • A network attached device lost connectivity but has now recovered
  • A device suffered from a bit flip, an expected event over long periods of time
  • An administrator accidentally wrote over a portion of the disk using another program
Is there any way I could prove, for the RMA, that it just lost power or suffered from a bit flip (silent corruption)?

Any and all help will be appreciated. :)
 

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
Could you please post the complete HW/SW setup and how old it is? How old are your drives? How many do you have?


Sent from my iPhone using Tapatalk

First of all, I would like to put in a disclaimer that I do not promote/support the hardware I use; I use it as per my capabilities. A few components may be (and are shown as) incompatible with one another on both manufacturers' websites.
I would have preferred a Supermicro board, but sadly their stock/models/support are limited to proprietary boards, with cases that fit neither my budget nor the hole in the wall I wanted to place them in.

My system config:

CASE: Fractal Define R5 (with 2x 140mm fans)
MOTHERBOARD: ASUS Z10PA-U8
PROCESSOR: Intel Xeon E5-2609 v3 + Intel fan
RAM: Crucial 32GB (2x 16GB) DDR4-2133 1.2V RDIMM 288-pin (CT16G4RFD4213)
HDD: 8x WD Red 4TB (WD-WCC4E)
OS USB: 2x Cruzer Fit 16GB (mirror)
PSU: Cooler Master 750W Gold
FreeNAS 9.3 Stable
2x Corsair SP120 front fans

I live in a warm-to-hot place next to the sea. I cannot afford AC running 24x7, and temperatures do touch 50C in summer; the humidity is also alarmingly high (my house is less than a km from the beach). The winter months go down to barely 20C. Yes, that is Celsius and not Fahrenheit.
I put the system together on 30/03/15. The max number of hours the system could have to date is about 16,000, and I am currently at 14,100 hours on a few disks. The temperatures on the disks are impossible to keep below 40C, especially during a scrub, I won't lie about that. I do, however, keep them below 45C with whatever is required. I don't have them spinning down at idle, and I don't think my system is ever idle. :P
Also note my system has had <180 restarts in close to 2 years of operation. Many times the restarts were done one after another while troubleshooting, and I think I have made the server run for 60 days at a stretch.
The system is scrubbed once a month and SMART-tested as per @cyberjock's guide. The scrub frequency has been reduced because of the time taken and the heat generated by the system. It's a personal choice and I don't suggest it.
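
For reference, the manual equivalents of those scheduled jobs are roughly the following; FreeNAS 9.3 drives them from the GUI scheduler, so this is just a sketch:
Code:
# Start a scrub of the data pool by hand
zpool scrub bhoot

# Run a long SMART self-test on one member; repeat for each disk
smartctl -t long /dev/ada0

# Check the self-test result once it completes
smartctl -l selftest /dev/ada0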
 

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
You say you have (8) drives but all your status logs show 7.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
There is a figure of 3 chksum error(s). Does this qualify for RMA?
Checksum errors reported by ZFS are not grounds for RMA. A failed SMART test or a SMART attribute reaching threshold would be grounds for RMA.
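
Roughly what's worth checking before deciding, as a sketch (device name taken from the outputs above):
Code:
# Overall health verdict; a FAILED verdict is clear RMA grounds
smartctl -H /dev/ada1

# Attribute table: compare VALUE against THRESH, and watch the raw
# counts of 5/197/198 (reallocated / pending / uncorrectable)
smartctl -A /dev/ada1

# Self-test log: a failed long test is also grounds for RMA
smartctl -l selftest /dev/ada1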
 

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
You say you have (8) drives but all your status logs show 7.
If you look at the second post, after the resilver finished it shows 8 again.
Checksum errors reported by ZFS are not grounds for RMA. A failed SMART test or a SMART attribute reaching threshold would be grounds for RMA.
So should I just zpool clear and move on?
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
So should I just zpool clear and move on?
I suppose you could do that for now and see what happens; it depends how paranoid you are. If the problem persists, you'll have to figure out the underlying cause.
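
Something like this, using the pool name from this thread:
Code:
# Reset the pool's error counters
zpool clear bhoot

# Then scrub so any lingering damage surfaces immediately
zpool scrub bhoot

# Keep an eye on the counters afterwards
zpool status -v bhoot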
 