SOLVED Disk UNAVAIL

Status: Not open for further replies.

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
I got a mail from my FreeNAS box:
Code:
Checking status of zfs pools:
NAME           SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bhoot           29T  21.6T  7.35T         -    39%    74%  1.00x  DEGRADED  /mnt
freenas-boot  14.2G  1.05G  13.2G         -      -     7%  1.00x    ONLINE  -

  pool: bhoot
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
		the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 132h59m with 0 errors on Fri Jan  6 16:59:14 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        bhoot                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/5663b940-bdde-11e5-9e00-f07959376c84  ONLINE       0     0     0
            10479856730608632472                        UNAVAIL      0     0     0  was /dev/gptid/cd427285-e4d8-11e4-b39d-f07959376c84
            gptid/ec0f7827-2d2c-11e6-b1de-f07959376c84  ONLINE       0     0     0
            gptid/ce06b19f-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/ce69a75d-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/b1f3389f-5382-11e6-885d-f07959376c84  ONLINE       0     0     0
            gptid/cf2dd08e-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/cf91d6e8-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0

errors: No known data errors

-- End of daily output --


Woke up the next morning and checked the basics of the FreeNAS box. One of the disks showed a really long (about 20-digit) number in place of its gptid, with the status UNAVAIL. I also tried reseating the connections, to no avail. Not sure why, but the disk just dropped out of the array. I am ready with a cold spare, so that's not an issue. I am just wondering if I was supposed to get a few alerts before this happened.
I did get a few outputs over SSH:
Code:
[root@freenas] ~# zpool status -v
  pool: bhoot
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
  the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
  see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 132h59m with 0 errors on Fri Jan  6 16:59:14 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        bhoot                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/5663b940-bdde-11e5-9e00-f07959376c84  ONLINE       0     0     0
            10479856730608632472                        UNAVAIL      0     0     0  was /dev/gptid/cd427285-e4d8-11e4-b39d-f07959376c84
            gptid/ec0f7827-2d2c-11e6-b1de-f07959376c84  ONLINE       0     0     0
            gptid/ce06b19f-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/ce69a75d-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/b1f3389f-5382-11e6-885d-f07959376c84  ONLINE       0     0     0
            gptid/cf2dd08e-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/cf91d6e8-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Wed Jan  4 03:46:18 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        freenas-boot                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/40460acb-cf27-11e5-b12b-f07959376c84  ONLINE       0     0     0
            da1p2                                       ONLINE       0     0     0

errors: No known data errors

Code:
[root@freenas] ~# camcontrol devlist
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus1 target 0 lun 0 (ada0,pass0)
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus4 target 0 lun 0 (ada1,pass1)
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus5 target 0 lun 0 (ada2,pass2)
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus6 target 0 lun 0 (ada3,pass3)
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus7 target 0 lun 0 (ada4,pass4)
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus8 target 0 lun 0 (ada5,pass5)
<WDC WD40EFRX-68WT0N0 82.00A82>  at scbus9 target 0 lun 0 (ada6,pass6)
<SanDisk Ultra Fit 1.00>  at scbus11 target 0 lun 0 (pass7,da0)
<SanDisk Ultra Fit 1.00>  at scbus12 target 0 lun 0 (pass8,da1)

Code:
[root@freenas] ~# gpart show
=>  34  30031183  da0  GPT  (14G)
  34  1024  1  bios-boot  (512k)
  1058  6  - free -  (3.0k)
  1064  30030152  2  freebsd-zfs  (14G)
  30031216  1  - free -  (512B)

=>  34  30031183  da1  GPT  (14G)
  34  1024  1  bios-boot  (512k)
  1058  6  - free -  (3.0k)
  1064  30030152  2  freebsd-zfs  (14G)
  30031216  1  - free -  (512B)

=>  34  7814037101  ada0  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

=>  34  7814037101  ada1  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

=>  34  7814037101  ada2  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

=>  34  7814037101  ada3  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

=>  34  7814037101  ada4  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

=>  34  7814037101  ada5  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

=>  34  7814037101  ada6  GPT  (3.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  7809842696  2  freebsd-zfs  (3.7T)
  7814037128  7  - free -  (3.5k)

Code:
[root@freenas] ~# glabel status
  Name  Status  Components
gptid/403b7529-cf27-11e5-b12b-f07959376c84  N/A  da0p1
gptid/40460acb-cf27-11e5-b12b-f07959376c84  N/A  da0p2
gptid/9623e4df-cf29-11e5-a539-f07959376c84  N/A  da1p1
gptid/cf91d6e8-e4d8-11e4-b39d-f07959376c84  N/A  ada0p2
gptid/b1f3389f-5382-11e6-885d-f07959376c84  N/A  ada1p2
gptid/5663b940-bdde-11e5-9e00-f07959376c84  N/A  ada2p2
gptid/ec0f7827-2d2c-11e6-b1de-f07959376c84  N/A  ada3p2
gptid/ce06b19f-e4d8-11e4-b39d-f07959376c84  N/A  ada4p2
gptid/ce69a75d-e4d8-11e4-b39d-f07959376c84  N/A  ada5p2
gptid/cf2dd08e-e4d8-11e4-b39d-f07959376c84  N/A  ada6p2


//edit: the GUI only shows a Replace option when the said disk is selected.
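
For anyone hitting the same symptom, here is roughly how the gptid that zpool reports can be traced back to a physical disk. This is a sketch using the device names from the outputs above, so substitute your own:
Code:
# Map gptid labels to partitions; zpool's gptid/... entries
# correspond to adaXp2 partitions here
glabel status

# If the UNAVAIL member's old gptid is missing from glabel output,
# check whether the disk is even visible on the bus
camcontrol devlist

# Once the adaX device is known, read its serial number so the
# right physical drive can be pulled from the case
smartctl -i /dev/ada1 | grep -i 'serial number'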
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Sometimes a drive just dies, no warnings, nothing.
 

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
Sometimes a drive just dies, no warnings, nothing.
The surprising thing is that the mail alert came in the same minute as the security report. I just went and checked the SMART reports that come to me by mail. The disk had ~14,000 hours on it. So yes, I do understand, but that's nowhere close to the MTBF WD claims for Reds.
Code:
########## SMART status report summary for all drives ##########

+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+-------+----+
|Device|Serial         |Temp|Power|Start|Spin |ReAlloc|Current|Offline |UDMA  |Seek  |High  |Command|Last|
|      |               |    |On   |Stop |Retry|Sectors|Pending|Uncorrec|CRC   |Errors|Fly   |Timeout|Test|
|      |               |    |Hours|Count|Count|       |Sectors|Sectors |Errors|      |Writes|Count  |Age |
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+-------+----+
|ada0  |WD-WCC4EXXXXXX |  36|14033|  170|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada1  |WD-WCC4EXXXXXX |  36| 4203|    9|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada2  |WD-WCC4EXXXXXX |  37| 7882|   33|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada3  |WD-WCC4EXXXXXX |  38| 5371|   15|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada4  |WD-WCC4EXXXXXX |  38|14033|  176|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada5  |WD-WCC4EXXXXXX |  36|14033|  177|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada6  |WD-WCC4EXXXXXX |  37|14033|  177|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   0|
|ada7  |               |    |     |     |     |       |       |        |      |   N/A|   N/A|    N/A|   0|
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+-------+----+
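
That summary comes from one of the community report scripts; as a rough sketch (not the actual script), the same attributes can be pulled per drive with smartctl, something like:
Code:
#!/bin/sh
# Sketch: print serial, power-on hours (attribute 9), and the usual
# failure-predicting attributes: reallocated (5), pending (197),
# offline-uncorrectable (198), UDMA CRC errors (199)
for d in ada0 ada1 ada2 ada3 ada4 ada5 ada6; do
  echo "=== ${d} ==="
  smartctl -i /dev/${d} | grep 'Serial Number'
  smartctl -A /dev/${d} | awk '$1==5 || $1==9 || $1==197 || $1==198 || $1==199'
done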
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
It happens; I had one drive fail on me last year, but none before and none since.

I guess yours is still under warranty, so you can RMA it; it shouldn't be a big deal.
 

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
I have RMA'd about 5-6 hard disks already. Not sure why, but I am getting a lot of pending (unreadable) sectors. I am just scared of losing 2 of them together. :/
 

snaptec

Guru
Joined
Nov 30, 2015
Messages
502
Could you please post the complete HW/SW setup and how old it is? How old are your drives? How many do you have?


Sent from my iPhone using Tapatalk
 

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
I just realized I get emails every night from my server, and I really thought that one of the disks was kaput. For most of the mail scripts I have to thank @Bidule0hm and @cyberjock. However, I just noticed a mail I had skipped a few days ago.
Mail at 0301 local on 24/01/17:
Code:
Checking status of zfs pools:
NAME           SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bhoot           29T  21.3T  7.65T         -    38%    73%  1.00x  DEGRADED  /mnt
freenas-boot  14.2G  1.05G  13.2G         -      -     7%  1.00x    ONLINE  -

  pool: bhoot
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
		the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 132h59m with 0 errors on Fri Jan  6 16:59:14 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        bhoot                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/5663b940-bdde-11e5-9e00-f07959376c84  ONLINE       0     0     0
            10479856730608632472                        UNAVAIL      0     0     0  was /dev/gptid/cd427285-e4d8-11e4-b39d-f07959376c84
            gptid/ec0f7827-2d2c-11e6-b1de-f07959376c84  ONLINE       0     0     0
            gptid/ce06b19f-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/ce69a75d-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/b1f3389f-5382-11e6-885d-f07959376c84  ONLINE       0     0     0
            gptid/cf2dd08e-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/cf91d6e8-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0

errors: No known data errors

-- End of daily output --

Then another mail at 1310 on 24/01/17
Code:
The volume bhoot (ZFS) state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

Next mail at 1352 on 24/01/17
Code:
The volume bhoot (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.


No disks have been replaced. I did, however, remove and reseat the cables, but that was on the day this thread was created. I am not sure what suddenly happened to the server/disk. Before that there wasn't any option to online the disk; as you can see, it was shown as UNAVAIL (not OFFLINE).
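
For reference, the CLI equivalents look roughly like this. A sketch only: the pool name, old-member ID, and gptid are taken from the outputs above, the new-partition gptid is a placeholder, and on FreeNAS the GUI Replace button is the supported path since it also recreates the swap partition:
Code:
# If the device only dropped off the bus and is back, try onlining it
zpool online bhoot gptid/cd427285-e4d8-11e4-b39d-f07959376c84

# If the disk is genuinely dead, replace the missing member
# (placeholder gptid; on FreeNAS prefer the GUI Replace button)
zpool replace bhoot 10479856730608632472 gptid/<new-partition-gptid>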

Now the output of zpool status -v
Code:
[root@freenas] ~# zpool status -v
  pool: bhoot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 43.5G in 0h43m with 0 errors on Tue Jan 24 13:51:54 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        bhoot                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/5663b940-bdde-11e5-9e00-f07959376c84  ONLINE       0     0     0
            gptid/cd427285-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     3
            gptid/ec0f7827-2d2c-11e6-b1de-f07959376c84  ONLINE       0     0     0
            gptid/ce06b19f-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/ce69a75d-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/b1f3389f-5382-11e6-885d-f07959376c84  ONLINE       0     0     0
            gptid/cf2dd08e-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0
            gptid/cf91d6e8-e4d8-11e4-b39d-f07959376c84  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Fri Jan 20 03:46:22 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        freenas-boot                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/40460acb-cf27-11e5-b12b-f07959376c84  ONLINE       0     0     0
            da1p2                                       ONLINE       0     0     0

errors: No known data errors



There is a figure of 3 checksum errors. Does this qualify for RMA? I am still waiting for my previous disk to return from RMA; that should take another week or so.
I read the topic related to this error. It says the error can occur because:
  • A network attached device lost connectivity but has now recovered
  • A device suffered from a bit flip, an expected event over long periods of time
  • An administrator accidentally wrote over a portion of the disk using another program
Is there any way I could prove, for the RMA, that it just lost power or suffered from a bit flip (silent corruption)?

Any and all help will be appreciated. :)
 

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
Could you please post the complete HW/SW setup and how old it is? How old are your drives? How many do you have?


Sent from my iPhone using Tapatalk

First of all, I would like to put in a disclaimer that I do not promote/support the hardware I use; I use it as per my capabilities. A few components may be (and are shown as) incompatible with one another on both manufacturers' websites.
I would have preferred a Supermicro board, but sadly their stock/models/support are limited to proprietary boards, with cases that fit neither my budget nor the hole in the wall I wanted to place them in.

My system config:

CASE: Fractal Define R5 (with 2x 140mm fans)
MOTHERBOARD: ASUS Z10PA-U8
PROCESSOR: Intel Xeon E5-2609 v3 + Intel fan
RAM: Crucial 32GB (2x 16GB) DDR4-2133 1.2V RDIMM 288-pin (CT16G4RFD4213)
HDD: 8x WD Red 4TB (WD-WCC4E)
OS USB: 2x Cruzer Fit 16GB (mirror)
PSU: Cooler Master 750W Gold
FreeNAS 9.3 Stable
2x Corsair SP120 front fans

I live in a warm-to-hot place next to the sea. I cannot afford AC running 24x7, and temperatures do touch 50C in summer; the humidity is also alarmingly high (my house is less than a km from the beach). The winter months go down to barely 20C. Yes, that is Celsius and not Fahrenheit.
I put the system together on 30/03/15. The max number of hours the system could have to date is about 16,000, and I am currently at 14,100 hours on a few disks. The temperatures on the disks are impossible to keep below 40C, especially during a scrub, I won't lie about that. I do, however, keep them below 45C with whatever is required. I don't have them spinning down at idle, and I don't think my system is ever idle. :P
Also note my system has had <180 restarts in close to 2 years of operation. Many times the restarts were done one after another while troubleshooting, and I think I have made the server run for 60 days at a stretch.
The system is scrubbed once a month and SMART-tested as per @cyberjock's guide. The scrub frequency has been reduced because of the time taken and the heat generated by the system. It's a personal choice and I don't suggest it.
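
For reference, the manual equivalents of those scheduled jobs are roughly the following; FreeNAS 9.3 drives them from the GUI scheduler, so this is just a sketch:
Code:
# Start a scrub of the data pool by hand
zpool scrub bhoot

# Run a long SMART self-test on one member; repeat for each disk
smartctl -t long /dev/ada0

# Check the self-test result once it completes
smartctl -l selftest /dev/ada0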
 

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
You say you have (8) drives but all your status logs show 7.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
There is a figure of 3 chksum error(s). Does this qualify for RMA?
Checksum errors reported by ZFS are not grounds for RMA. A failed SMART test or a SMART attribute reaching threshold would be grounds for RMA.
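
Roughly what's worth checking before deciding, as a sketch (device name taken from the outputs above):
Code:
# Overall health verdict; a FAILED verdict is clear RMA grounds
smartctl -H /dev/ada1

# Attribute table: compare VALUE against THRESH, and watch the raw
# counts of 5/197/198 (reallocated / pending / uncorrectable)
smartctl -A /dev/ada1

# Self-test log: a failed long test is also grounds for RMA
smartctl -l selftest /dev/ada1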
 

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
You say you have (8) drives but all your status logs show 7.
If you look at the second post, after the resilver finished it shows 8 again.
Checksum errors reported by ZFS are not grounds for RMA. A failed SMART test or a SMART attribute reaching threshold would be grounds for RMA.
So should I just zpool clear and move on?
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
So should I just zpool clear and move on?
I suppose you could do that for now and see what happens; it depends how paranoid you are. If the problem persists, you'll have to figure out the underlying cause.
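
Something like this, using the pool name from this thread:
Code:
# Reset the pool's error counters
zpool clear bhoot

# Then scrub so any lingering damage surfaces immediately
zpool scrub bhoot

# Keep an eye on the counters afterwards
zpool status -v bhoot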
 