Zpool unavailable: might be dead, might not be

Status
Not open for further replies.

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Running FreeNAS-9.1.1-RELEASE-x64 (a752d35) with 16GB RAM.

I had one pool called MEDIA with a stripe of three 2-disk mirrors plus one spare 1TB drive.

The 2-disk mirrors were all 3TB drives.

So, like this (from gpart show):

[CODE]=> 34 5860533101 ada0 GPT (2.7T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338696 2 freebsd-zfs (2.7T)
5860533128 7 - free - (3.5k)

=> 34 5860533101 ada1 GPT (2.7T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338696 2 freebsd-zfs (2.7T)
5860533128 7 - free - (3.5k)

=> 34 5860533101 ada2 GPT (2.7T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338696 2 freebsd-zfs (2.7T)
5860533128 7 - free - (3.5k)

=> 34 5860533101 ada3 GPT (2.7T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338696 2 freebsd-zfs (2.7T)
5860533128 7 - free - (3.5k)

=> 34 5860533101 ada4 GPT (2.7T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338696 2 freebsd-zfs (2.7T)
5860533128 7 - free - (3.5k)

=> 34 5860533101 ada5 GPT (2.7T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338696 2 freebsd-zfs (2.7T)
5860533128 7 - free - (3.5k)

=> 63 15638465 da3 MBR (7.5G)
63 1930257 1 freebsd [active] (942M)
1930320 63 - free - (31k)
1930383 1930257 2 freebsd (942M)
3860640 3024 3 freebsd (1.5M)
3863664 41328 4 freebsd (20M)
3904992 11733536 - free - (5.6G)

=> 0 1930257 da3s1 BSD (942M)
0 16 - free - (8.0k)
16 1930241 1 !0 (942M)[/CODE]
Then zdb shows the pool, which is encouraging:

[CODE]
MEDIA:
    version: 5000
    name: 'MEDIA'
    state: 0
    txg: 220477
    pool_guid: 9173378823382608771
    hostid: 1822453256
    hostname: 'freenas.basement.local'
    vdev_children: 4
    vdev_tree:
        type: 'root'
        id: 0
        guid: 9173378823382608771
        create_txg: 4
        children[0]:
            type: 'mirror'
            id: 0
            guid: 4578489925772993350
            metaslab_array: 34
            metaslab_shift: 34
            ashift: 12
            asize: 2998440558592
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 7034236002078255833
                path: '/dev/gptid/e5cda0d2-5ba7-11e3-a9f0-6805ca1adbd8'
                phys_path: '/dev/gptid/e5cda0d2-5ba7-11e3-a9f0-6805ca1adbd8'
                whole_disk: 1
                DTL: 193
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 8131083936012038549
                path: '/dev/gptid/e6533050-5ba7-11e3-a9f0-6805ca1adbd8'
                phys_path: '/dev/gptid/e6533050-5ba7-11e3-a9f0-6805ca1adbd8'
                whole_disk: 1
                DTL: 192
                create_txg: 4
        children[1]:
            type: 'mirror'
            id: 1
            guid: 7287074714158299330
            metaslab_array: 38
            metaslab_shift: 34
            ashift: 12
            asize: 2998440558592
            is_log: 0
            create_txg: 220
            children[0]:
                type: 'disk'
                id: 0
                guid: 18438491820313231629
                path: '/dev/gptid/0a02dfab-65aa-11e3-b5ba-6805ca1adbd8'
                phys_path: '/dev/gptid/0a02dfab-65aa-11e3-b5ba-6805ca1adbd8'
                whole_disk: 1
                DTL: 212
                create_txg: 220
            children[1]:
                type: 'disk'
                id: 1
                guid: 5740885520923749251
                path: '/dev/gptid/6175d11f-5baa-11e3-a9f0-6805ca1adbd8'
                phys_path: '/dev/gptid/6175d11f-5baa-11e3-a9f0-6805ca1adbd8'
                whole_disk: 1
                DTL: 195
                create_txg: 220
        children[2]:
            type: 'mirror'
            id: 2
            guid: 7549782153430211792
            metaslab_array: 149
            metaslab_shift: 34
            ashift: 12
            asize: 2998440558592
            is_log: 0
            create_txg: 45718
            children[0]:
                type: 'disk'
                id: 0
                guid: 15446892777379765105
                path: '/dev/gptid/4776dabb-65ab-11e3-b5ba-6805ca1adbd8'
                phys_path: '/dev/gptid/4776dabb-65ab-11e3-b5ba-6805ca1adbd8'
                whole_disk: 1
                DTL: 214
                create_txg: 45718
            children[1]:
                type: 'disk'
                id: 1
                guid: 8708169765859944444
                path: '/dev/gptid/d9fcbbaf-65a5-11e3-b5ba-6805ca1adbd8'
                phys_path: '/dev/gptid/d9fcbbaf-65a5-11e3-b5ba-6805ca1adbd8'
                whole_disk: 1
                DTL: 210
                create_txg: 45718
        children[3]:
            type: 'disk'
            id: 3
            guid: 9809961858194575065
            path: '/dev/gptid/ca8978cb-5db9-11e3-b5ba-6805ca1adbd8'
            phys_path: '/dev/gptid/ca8978cb-5db9-11e3-b5ba-6805ca1adbd8'
            whole_disk: 1
            metaslab_array: 151
            metaslab_shift: 34
            ashift: 12
            asize: 1998246641664
            is_log: 0
            DTL: 194
            create_txg: 45726
    features_for_read:
[/CODE]

When I discovered that one of the 3TB drives was on a SATA-1 port while all the others were on SATA-3 ports, I got suspicious about the performance and decided to:
1. Remove the 1TB drive, since it was on a good SATA-3 port.
2. Unplug the 3TB drive from the SATA-1 port and put it onto that SATA-3 port.
I knew it would resilver from there, and saw it start.
Inadvertently, the box was restarted; children... ugh.
OK, I rebooted, and now it shows a detached volume MEDIA, so nothing anyone can reach, but I cannot auto-import it:
freenas manage.py: [middleware.exceptions :38] [MiddlewareError : The volume "MEDIA" failed to import, for further details check pool status]
OK, here's a little tell from zpool status:

Code:
no pools available


I suspect "available" is the key, and I hope that it means it will be available at some point.
Problem is I cannot try and see how long it will take to be available or if it will be at all.
Any help given the above outputs would be greatly appreciated.

Oh, adding this, which I believe overstates my situation, because resilvering was in progress and that's when the PC cycled:

zpool import
[CODE]
  pool: MEDIA
    id: 9173378823382608771
  state: UNAVAIL
status: One or more devices were being resilvered.
action: The pool cannot be imported due to damaged devices or data.
config:
 
MEDIA                                           UNAVAIL  missing device
  mirror-0                                      ONLINE
    gptid/e5cda0d2-5ba7-11e3-a9f0-6805ca1adbd8  ONLINE
    gptid/e6533050-5ba7-11e3-a9f0-6805ca1adbd8  ONLINE
  mirror-1                                      ONLINE
    gptid/0a02dfab-65aa-11e3-b5ba-6805ca1adbd8  ONLINE
    gptid/6175d11f-5baa-11e3-a9f0-6805ca1adbd8  ONLINE
  mirror-2                                      ONLINE
    gptid/4776dabb-65ab-11e3-b5ba-6805ca1adbd8  ONLINE
    gptid/d9fcbbaf-65a5-11e3-b5ba-6805ca1adbd8  ONLINE

[/CODE]

But for anyone wanting to help (please, please), I'm also showing:

camcontrol devlist

[CODE]
<WDC WD30EFRX-68EUZN0 80.00A80> at scbus5 target 0 lun 0 (ada0,pass0)
<WDC WD30EFRX-68EUZN0 80.00A80> at scbus6 target 0 lun 0 (ada1,pass1)
<WDC WD30EFRX-68EUZN0 80.00A80> at scbus7 target 0 lun 0 (ada2,pass2)
<WDC WD30EFRX-68EUZN0 80.00A80> at scbus8 target 0 lun 0 (pass3,ada3)
<WDC WD30EFRX-68EUZN0 80.00A80> at scbus9 target 0 lun 0 (ada4,pass4)
<WDC WD30EFRX-68EUZN0 80.00A80> at scbus10 target 0 lun 0 (ada5,pass5)
<Generic USB SD Reader 1.00> at scbus14 target 0 lun 0 (pass6,da0)
<Generic USB CF Reader 1.01> at scbus14 target 0 lun 1 (pass7,da1)
<Generic USB xD/SM Reader 1.02> at scbus14 target 0 lun 2 (pass8,da2)
<Generic USB MS Reader 1.03> at scbus14 target 0 lun 3 (pass10,da4)
<Kingston DataTraveler G3 PMAP> at scbus15 target 0 lun 0 (pass9,da3)
[/CODE]
As well as
glabel status
[CODE]
Name Status Components
gptid/d9fcbbaf-65a5-11e3-b5ba-6805ca1adbd8 N/A ada0p2
gptid/e6533050-5ba7-11e3-a9f0-6805ca1adbd8 N/A ada1p2
gptid/6175d11f-5baa-11e3-a9f0-6805ca1adbd8 N/A ada2p2
gptid/e5cda0d2-5ba7-11e3-a9f0-6805ca1adbd8 N/A ada4p2
gptid/0a02dfab-65aa-11e3-b5ba-6805ca1adbd8 N/A ada5p2
ufs/FreeNASs3 N/A da3s3
ufs/FreeNASs4 N/A da3s4
ufs/FreeNASs1a N/A da3s1a
gptid/4765aca5-65ab-11e3-b5ba-6805ca1adbd8 N/A ada3p1
gptid/4776dabb-65ab-11e3-b5ba-6805ca1adbd8 N/A ada3p2
[/CODE]
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
When I discovered that one of the 3TB drives was on a SATA-1 port while all the others were on SATA-3 ports, I got suspicious about the performance and decided to:
1. Remove the 1TB drive, since it was on a good SATA-3 port.
2. Unplug the 3TB drive from the SATA-1 port and put it onto that SATA-3 port.
I knew it would resilver from there, and saw it start.
I assume you did this while the server was running. Had you shut down the server and swapped the drives, no resilver would have been needed. Also, SATA-1 transfer speed is 150MB/s and the WD Reds can sustain 147MB/s; SATA-2 and SATA-3 only really make sense with high-RPM drives or SSDs.

What is the output of "zpool import -Fn MEDIA"?
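For reference, -F asks ZFS to attempt recovery by discarding the last few transactions, and -n makes it a dry run that only reports whether that recovery would work without actually importing anything, so it is safe to try:

[CODE]
# -F: recovery mode -- roll back the last few transactions if that makes the pool importable
# -n: dry run -- only report whether the -F recovery would succeed, do not actually import
zpool import -Fn MEDIA
[/CODE]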
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Yes, this was while running, but removing the 1TB drive from its SATA port and power is what created the need for resilvering.

Fixed by adding back the 1TB drive.

Odd: when I powered the drive (before attaching the SATA cable), glabel got reshuffled even though camcontrol devlist did not. That is, the power cable for the SATA drives had the 1TB drive first, then two other drives after it. Plugging it into power put the glabels back to, I guess, what they used to be; then, when I did attach the SATA cable (to a 2-port PCIe card), the pool was importable, and now it's back and resilvered.

Notice the glabel differences in a run from just now:

[CODE]
Name Status Components
gptid/d9fcbbaf-65a5-11e3-b5ba-6805ca1adbd8 N/A ada0p2
gptid/6175d11f-5baa-11e3-a9f0-6805ca1adbd8 N/A ada2p2
ufs/FreeNASs3 N/A da3s3
ufs/FreeNASs4 N/A da3s4
ufs/FreeNASs1a N/A da3s1a
gptid/4765aca5-65ab-11e3-b5ba-6805ca1adbd8 N/A ada3p1
gptid/4776dabb-65ab-11e3-b5ba-6805ca1adbd8 N/A ada3p2
gptid/e5b7a05b-5ba7-11e3-a9f0-6805ca1adbd8 N/A ada1p1
gptid/e5cda0d2-5ba7-11e3-a9f0-6805ca1adbd8 N/A ada1p2
gptid/e63feb3b-5ba7-11e3-a9f0-6805ca1adbd8 N/A ada4p1
gptid/e6533050-5ba7-11e3-a9f0-6805ca1adbd8 N/A ada4p2
gptid/09f2b9f5-65aa-11e3-b5ba-6805ca1adbd8 N/A ada5p1
gptid/0a02dfab-65aa-11e3-b5ba-6805ca1adbd8 N/A ada5p2
gptid/ca7e7820-5db9-11e3-b5ba-6805ca1adbd8 N/A ada6p1
gptid/ca8978cb-5db9-11e3-b5ba-6805ca1adbd8 N/A ada6p2
[/CODE]
Comparing to earlier (from post #1), you can see, for example, that what is now ada4p2 was ada1p2. So it does matter how the drives are chained to power.
Happy to show that now; zpool status shows:
[CODE]
  pool: MEDIA
 state: ONLINE
  scan: resilvered 362G in 21h6m with 0 errors on Mon Dec 16 09:16:53 2013
config:

NAME                                            STATE     READ WRITE CKSUM
MEDIA                                           ONLINE       0     0     0
  mirror-0                                      ONLINE       0     0     0
    gptid/e5cda0d2-5ba7-11e3-a9f0-6805ca1adbd8  ONLINE       0     0     0
    gptid/e6533050-5ba7-11e3-a9f0-6805ca1adbd8  ONLINE       0     0     0
  mirror-1                                      ONLINE       0     0     0
    gptid/0a02dfab-65aa-11e3-b5ba-6805ca1adbd8  ONLINE       0     0     0
    gptid/6175d11f-5baa-11e3-a9f0-6805ca1adbd8  ONLINE       0     0     0
  mirror-2                                      ONLINE       0     0     0
    gptid/4776dabb-65ab-11e3-b5ba-6805ca1adbd8  ONLINE       0     0     0
    gptid/d9fcbbaf-65a5-11e3-b5ba-6805ca1adbd8  ONLINE       0     0     0
  gptid/ca8978cb-5db9-11e3-b5ba-6805ca1adbd8    ONLINE       0     0     0

errors: No known data errors
[/CODE]
Now I think the thing to do is:
1. Back up everything externally.
2. Blow away all disks.
3. Remove the 1TB drive again.
4. Restart and build the 6 drives as RAIDZ2.
5. Copy the data back into datasets.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
How the power cables are connected definitely does not matter. ZFS also does not care about device names. If you export/detach the pool, make your changes, and then Auto Import, ZFS will figure everything out properly.

However, looking at your zpool status output I just noticed a much bigger problem with your pool. It shows that your pool contains three mirrored vdevs and a single-drive vdev without any redundancy. The zdb output in the original post shows the same. What you thought was a spare is actually an active vdev (also, a 1TB spare in a pool full of 3TB drives makes absolutely no sense; if a 3TB drive fails, a 1TB spare won't help a bit). If that drive fails, your entire pool is gone. You should immediately add another drive to the last vdev to create a mirror (you need to do that via the command line).
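From the shell it would be roughly the sketch below; the second gptid is only a placeholder for the new drive's freebsd-zfs partition, which you would have to create and label the same way the existing drives are partitioned, and the new disk must be at least as large as the current one:

[CODE]
# Attach a second disk to the lone single-drive vdev, turning it into a mirror.
# Replace the placeholder with the gptid of the new drive's freebsd-zfs partition.
zpool attach MEDIA gptid/ca8978cb-5db9-11e3-b5ba-6805ca1adbd8 gptid/<new-drive-p2-gptid>
[/CODE]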
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Dusan,

While I believe the power cables shouldn't matter, I can concretely say that:

1. The 1TB drive was left in the cage, with no power and no SATA cable.
2. Plugging in the power cable, making it first in the chain, immediately changed the drives as noted in the glabel differences. This was just the power cable, mind you. I believe I saw three drive changes noted on the monitor, with the usual "was formerly" message denoting the old label.
3. Plugging in the SATA cable then changed zpool import from MEDIA unavailable (and auto-import erroring) to MEDIA available, with auto-import working.

I agree the "spare" 1TB drive in the stripe is poor and I will be removing it from the pool, but I believe doing so, even by offlining it in the GUI, shutting down, removing it, and rebooting, will result in glabel changes and again leave the MEDIA pool unavailable.

But is there a way to remove it from the command line, so I can do that immediately and then pull it physically once I get home?
 

warri

Guru
Joined
Jun 6, 2011
Messages
1,193
You cannot easily remove that drive from the pool. You need to back up your data and recreate the pool. This is covered in cyberjock's presentation about ZFS. As Dusan said, if that drive fails your complete pool is gone, so better to add some redundancy there if you don't have a backup of your data yet.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
Please understand that the 1TB drive is not a spare! It is an active part of the pool and can't be removed! I repeat: if that drive fails, your entire pool is gone.

The last drive in the pool (the final gptid line below, the one not under any mirror) is a vdev without any redundancy. Notice that the output does not mention a spare anywhere:
[PRE]  pool: MEDIA
 state: ONLINE
  scan: resilvered 362G in 21h6m with 0 errors on Mon Dec 16 09:16:53 2013
config:

NAME                                            STATE     READ WRITE CKSUM
MEDIA                                           ONLINE       0     0     0
  mirror-0                                      ONLINE       0     0     0
    gptid/e5cda0d2-5ba7-11e3-a9f0-6805ca1adbd8  ONLINE       0     0     0
    gptid/e6533050-5ba7-11e3-a9f0-6805ca1adbd8  ONLINE       0     0     0
  mirror-1                                      ONLINE       0     0     0
    gptid/0a02dfab-65aa-11e3-b5ba-6805ca1adbd8  ONLINE       0     0     0
    gptid/6175d11f-5baa-11e3-a9f0-6805ca1adbd8  ONLINE       0     0     0
  mirror-2                                      ONLINE       0     0     0
    gptid/4776dabb-65ab-11e3-b5ba-6805ca1adbd8  ONLINE       0     0     0
    gptid/d9fcbbaf-65a5-11e3-b5ba-6805ca1adbd8  ONLINE       0     0     0
  gptid/ca8978cb-5db9-11e3-b5ba-6805ca1adbd8    ONLINE       0     0     0

errors: No known data errors[/PRE]

This is how a spare looks in zpool status:
[PRE]  pool: test
 state: ONLINE
  scan: none requested
config:

NAME                                          STATE     READ WRITE CKSUM
test                                          ONLINE       0     0     0
  gptid/c34f87ea-6484-11e3-89dd-0800278c9e58  ONLINE       0     0     0
spares
  gptid/c383a4a0-6484-11e3-89dd-0800278c9e58  AVAIL

errors: No known data errors[/PRE]
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
Appreciate the posts. I'm opting not to use a spare, as having it hot in the box seems risky once I move to RAIDZ2; I might as well keep it cold, somewhat distant from the box.

OK, so instead of trying CrashPlan, I found I had sufficient local storage on an Ubuntu client I use moderately.

I ordered an external drive I'll use in the future, but in the meantime I'm copying all data from the pool to the Ubuntu box at varying speeds (~25MB/s to ~58MB/s). When done, I will destroy the pool, remove the 1TB drive completely, and reboot.
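If it helps anyone doing the same, one way to do that copy is a plain rsync pull over SSH; this is only a sketch, the destination path is made up, and mounting the share over CIFS/NFS and copying from there would work just as well:

[CODE]
# Pull the pool's data down to the Ubuntu box.
# /mnt/MEDIA is the usual FreeNAS mountpoint for a pool named MEDIA;
# /srv/media-backup is just a placeholder destination on the Ubuntu client.
rsync -avh --progress root@freenas.basement.local:/mnt/MEDIA/ /srv/media-backup/
[/CODE]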

When it comes back up, I will create a single RAIDZ2 vdev with six 3TB drives, four drives' worth of usable space plus two of parity (as it were), and be totally safe from a one-drive crash and only slightly afraid of a two-drive crash. I suspect having one drive outside the machine as a cold spare will be good insurance.
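For the record, the raw ZFS command for that layout would look something like the sketch below (device names assumed); in FreeNAS the Volume Manager GUI is the normal way to do it, since it also creates the swap partitions and gptid labels on each disk:

[CODE]
# Rough CLI equivalent only -- use the FreeNAS Volume Manager for the real thing,
# since it handles partitioning, swap and gptid labels for you.
zpool create MEDIA raidz2 ada0 ada1 ada2 ada3 ada4 ada5
[/CODE]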

For backup I'll use an external drive (already ordered), and if I see a clear path for CrashPlan, I'll do that. It seems a bit buggy at the moment, though.
 