Vdev went poof!

futex (Dabbler)
Hi,

I have a problem with a VDEV disappearing after a failed disk.

Recap:
* The system in question is a striped mirror of 4 VDEVS of 2 drives each
* The system with the issue alerted me via email that a disk (/dev/ada7) FAILED SMART self-check. BACK UP DATA NOW!
* So I took a look at Storage -> Pool -> Status, where /dev/ada7 was shown as DEGRADED
* I offlined it according to https://www.ixsystems.com/documentation/freenas/11.3-U3.1/storage.html#replacing-a-failed-disk
* /dev/ada7 was removed from the pool and does not show up in Storage -> Pool -> Status anywhere
* It does show up in Storage -> Disks as UNUSED
* However, /dev/ada7 belonged to a mirror vdev of 2 drives, its partner being /dev/ada6
* /dev/ada6 now shows up as a single disk in the VDEV
* I don't see a way to replace the failed /dev/ada7
 

futex (Dabbler)
[Screenshot attached: Screenshot from 2020-06-01 13-41-04.png]
 

Patrick M. Hausen (Hall of Famer)
You can use the zpool attach command to re-create the mirror from a single disk vdev.
zpool attach poolname single-disk new-disk

Make sure to use the appropriate gptid devices, not the ones like ada4p2 ...
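
Roughly, the shape of that is as follows (a sketch only; the pool name and the gptid placeholders are not from this system and have to be filled in from your own zpool status and glabel status output):

Code:
# find the gptid of the remaining single-disk vdev member
zpool status poolname | grep gptid

# attach the new disk's data partition by gptid to re-form the mirror
zpool attach poolname gptid/<gptid-of-remaining-disk> gptid/<gptid-of-new-disk>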
 

futex (Dabbler)
Make sure to use the appropriate gptid devices, not the ones like ada4p2 ...

How do I retrieve the gptid of the failed device? Its partner still shows up in glabel status and zpool status, but the replacement disk for the failed one does not. Do I have to format the disk beforehand or create the glabel manually? If so, how do I do that?
 

Patrick M. Hausen (Hall of Famer)
The easiest way is to copy the entire partition table from an active disk:
gpart backup <working disk> | gpart restore -F <new disk>

These commands take device names like ada0. Don't confuse the devices!

Then you can use gpart list <new disk> to get the UUID.
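
Put together, that could look roughly like this (the device names ada6/ada7 are examples; be certain which disk is the healthy one before copying the partition table):

Code:
# copy the partition table from the healthy mirror member to the blank replacement
gpart backup ada6 | gpart restore -F ada7

# show the rawuuid of the new data partition (usually p2 on a FreeNAS data disk)
gpart list ada7 | grep rawuuid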
 

futex (Dabbler)
The easiest way is to copy the entire partition table from an active disk:
gpart backup <working disk> | gpart restore -F <new disk>

These commands take device names like ada0. Don't confuse the devices!

Then you can use gpart list <new disk> to get the UUID.
I have followed your suggestion and executed

Code:
gpart backup ada6 | gpart restore -F ada7


Now ada7 shows up in gpart list.

However, if I now do
Code:
zpool attach silo gptid/<ada6p2.gptid> gptid/<ada7p2.gptid>


I get
Code:
cannot attach gptid/<ada7p2.gptid> to gptid/<ada6p2.gptid>: no such device in pool
 

Patrick M. Hausen (Hall of Famer)
Then it's p1 you are using, possibly? Check the current gptid via zpool status and compare to the gpart list ada6 output. Then use that and the matching one for ada7. It's the rawuuid value from gpart list.
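
One way to compare the two (assuming the pool is named silo and ada6 is the surviving mirror member):

Code:
# gptids currently known to the pool
zpool status silo | grep gptid

# partition names and rawuuids on the surviving disk; match against the output above
gpart list ada6 | grep -E 'Name:|rawuuid'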
 

futex (Dabbler)
Then it's p1 you are using, possibly? Check the current gptid via zpool status and compare to the gpart list ada6 output. Then use that and the matching one for ada7. It's the rawuuid value from gpart list.

I have verified that I used the correct IDs. I am using p2.

However, I have meanwhile examined the pool with

zdb -eC pool

and the output greatly surprised me, to the point that I don't think I can trust FreeNAS with my data anymore.

Code:
MOS Configuration:
        version: 5000
        state: 0
        vdev_children: 4
        vdev_tree:
            type: 'root'
            children[0]:
                type: 'mirror'
                children[0]:
                    type: 'disk'
                children[1]:
                    type: 'disk'
            children[1]:
                type: 'mirror'
                children[0]:
                    type: 'disk'
                children[1]:
                    type: 'disk'
            children[2]:
                type: 'mirror'
                children[0]:
                    type: 'disk'
                children[1]:
                    type: 'disk'
            children[3]:
                type: 'disk'

Shortened for brevity. As you can see, instead of the expected type 'mirror' for children[3] with two children of type 'disk', the pool now contains only a single disk there.

My jaw is dropping here. Maybe I don't understand something here, but to me it looks like FreeNAS silently dropped the failed disk, completely forgot about it, and even silently changed the pool layout from mirror to (single) disk.

I hope that this is not what happened here and the original pool state can be restored, but at this point I am considering alternatives to FreeNAS.
 

Patrick M. Hausen (Hall of Famer)
That's precisely the reason why I recommended attaching another device to that single disk to build a mirror. So please post the output of zpool status silo and gpart list.
 

Patrick M. Hausen (Hall of Famer)
BTW ... what does zpool history show?
 

futex (Dabbler)
zpool history:

2020-06-01.13:13:20 zpool detach silo /dev/gptid/97903767-5486-11ea-804c-ac1f6bb1efd4.eli
2020-06-01.14:02:20 zpool scrub silo
[snapshot operations omitted]
2020-06-03.18:20:31 zpool import 7023851173393642101 silo

The detach operation was me removing the disk as per the docs.
The import operation was not directly triggered by me; I assume this happened automatically after rebooting. I suppose 7023851173393642101 is the new disk, but it does not seem to show up anywhere in the pool.

zpool status silo:

Code:
  pool: silo
 state: ONLINE
  scan: scrub repaired 0 in 0 days 07:48:21 with 0 errors on Mon Jun 1 21:50:16 2020
config:

    NAME                                                STATE     READ WRITE CKSUM
    silo                                                ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        gptid/5a79ebf7-4b82-11ea-bad2-ac1f6bb1efd4.eli  ONLINE       0     0     0
        gptid/5a866535-4b82-11ea-bad2-ac1f6bb1efd4.eli  ONLINE       0     0     0
      mirror-1                                          ONLINE       0     0     0
        gptid/3db69fe7-5486-11ea-804c-ac1f6bb1efd4.eli  ONLINE       0     0     0
        gptid/3ed268d7-5486-11ea-804c-ac1f6bb1efd4.eli  ONLINE       0     0     0
      mirror-2                                          ONLINE       0     0     0
        gptid/6d8f54c6-5486-11ea-804c-ac1f6bb1efd4.eli  ONLINE       0     0     0
        gptid/6d9bbfd2-5486-11ea-804c-ac1f6bb1efd4.eli  ONLINE       0     0     0
      gptid/978275f3-5486-11ea-804c-ac1f6bb1efd4.eli    ONLINE       0     0     0

errors: No known data errors

The output of gpart list is attached.
 

futex (Dabbler)
Ok, I seem to have pieced the puzzle together.

  1. I needed to restore the disk label, as it seems to become invalid after a detach operation.
    Code:
    gpart backup ada6 | gpart restore -F ada7
  2. With a working disk label back in place, ada7 shows up in
    Code:
    gpart list ada7
    again. I noted the rawuuid.
  3. Then I ran
    Code:
    zdb -eC silo > silo.txt
    and looked up the guid of the remaining disk of the vdev. Importantly, I had to rerun this command every time I operated on the pool, as the guid does not seem to stay constant.
  4. Using the guid of the remaining disk and the rawuuid of the new disk, I was finally able to attach the disk/partition.
    Code:
    zpool attach silo 17083522215895285368 /dev/gptid/6c9ef33d-a5e6-11ea-8c88-ac1f6bb1efd4
Code:
# zpool status silo
  pool: silo
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jun  4 00:08:02 2020
    2.42T scanned at 2.04G/s, 2.21T issued at 1.51G/s, 3.33T total
    25.1G resilvered, 66.55% done, 0 days 00:12:31 to go
config:

    NAME                                                STATE     READ WRITE CKSUM
    silo                                                ONLINE       0     0     0
     mirror-0                                          ONLINE       0     0     0
       gptid/5a79ebf7-4b82-11ea-bad2-ac1f6bb1efd4.eli  ONLINE       0     0     0
       gptid/5a866535-4b82-11ea-bad2-ac1f6bb1efd4.eli  ONLINE       0     0     0
     mirror-1                                          ONLINE       0     0     0
       gptid/3db69fe7-5486-11ea-804c-ac1f6bb1efd4.eli  ONLINE       0     0     0
       gptid/3ed268d7-5486-11ea-804c-ac1f6bb1efd4.eli  ONLINE       0     0     0
     mirror-2                                          ONLINE       0     0     0
       gptid/6d8f54c6-5486-11ea-804c-ac1f6bb1efd4.eli  ONLINE       0     0     0
       gptid/6d9bbfd2-5486-11ea-804c-ac1f6bb1efd4.eli  ONLINE       0     0     0
     mirror-3                                          ONLINE       0     0     0
       gptid/978275f3-5486-11ea-804c-ac1f6bb1efd4.eli  ONLINE       0     0     0
       gptid/6c9ef33d-a5e6-11ea-8c88-ac1f6bb1efd4      ONLINE       0     0     0

errors: No known data errors


This seems to work. The only difference I see is that the new disk in mirror-3 is missing the .eli at the end. No idea if that will cause errors down the road. Why this has to be so complicated is beyond me.
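
For reference, the steps above condense into a handful of commands (same device names and IDs as in the list above; see the encryption caveat that comes up below):

Code:
# 1. copy the partition table from the surviving mirror member to the new disk
gpart backup ada6 | gpart restore -F ada7

# 2. note the rawuuid of the new data partition
gpart list ada7 | grep rawuuid

# 3. dump the pool config and look up the guid of the remaining vdev member
#    (I had to rerun this after every pool operation)
zdb -eC silo > silo.txt

# 4. attach the new partition to that member (IDs here are from my pool)
zpool attach silo 17083522215895285368 /dev/gptid/6c9ef33d-a5e6-11ea-8c88-ac1f6bb1efd4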
 

Samuel Tai (Moderator)
Don't forget to rekey and generate a recovery key. That should enable the .eli extension on the added disk.
 

Patrick M. Hausen (Hall of Famer)
Does the documentation really say to zpool detach the broken device? Because that is why you ended up with a single-disk vdev instead of the mirror. No bug, expected behaviour.
I just offline failing disks, pull them, and then do a zpool replace with the new one. The GUID will then still be present in the pool state.

I completely missed that your pool is encrypted, sorry. So I would detach that new disk once more, create the geli device manually, and then re-attach it. Can someone running an encrypted pool confirm?
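
A rough sketch of that sequence (hedged: the pool key file name is a placeholder, the geli parameters must match what FreeNAS used for the other members of this pool, and this should be confirmed before running it):

Code:
# detach the freshly added, not-yet-encrypted partition from the pool again
zpool detach silo gptid/6c9ef33d-a5e6-11ea-8c88-ac1f6bb1efd4

# initialise geli on it with the pool's key file, then attach it
geli init -s 4096 -B none -K /data/geli/<pool-key>.key /dev/gptid/6c9ef33d-a5e6-11ea-8c88-ac1f6bb1efd4
geli attach -k /data/geli/<pool-key>.key /dev/gptid/6c9ef33d-a5e6-11ea-8c88-ac1f6bb1efd4

# re-attach the resulting .eli device to its mirror partner
zpool attach silo gptid/978275f3-5486-11ea-804c-ac1f6bb1efd4.eli gptid/6c9ef33d-a5e6-11ea-8c88-ac1f6bb1efd4.eli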
 

Patrick M. Hausen (Hall of Famer)
Ah ... and about that error message at your first try - no such device in pool:

It should have been gptid/<active disk p2>.eli of course. That's why it is so important to post the output of zpool status and not a screenshot of the UI. I would not have missed the encryption if I had seen all those .eli devices.
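
In other words, the first attempt would presumably have needed to look like this (placeholder IDs as before):

Code:
# the existing pool member is the .eli device; the new, not-yet-encrypted partition is not
zpool attach silo gptid/<ada6p2.gptid>.eli gptid/<ada7p2.gptid>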
 

futex (Dabbler)
I completely missed that your pool is encrypted, sorry.

Yeah, that's because I forgot to mention it. Sorry. :)
The documentation is a bit vague here and the GUI is not entirely clear about it either: when trying to offline an encrypted disk, it says something to the effect of "You cannot offline an encrypted disk. Continue anyway? Y/N", and if you hit Yes, the disk is detached instead. Good to know!

So yes, this explains the single disk vdev.
 

futex (Dabbler)
create the geli device manually and then re-attach it
I have one more question about that. I want to create the encryption with the following command:
geli init -s 4096 -B none -K /data/geli/7ded15b2-3a67-1c1d-4a08-f44f1eda9c70.key /dev/gptid/e9bffad5-2b95-11e3-94b2-0023543761d8
All uuids are example values.
Now as I understand it the keys stored in /data/geli/ are the pool keys. The pool key is not the recovery key and not the encryption key, correct?
In any case, I have several keys in /data/geli/, probably stemming from other/earlier pools. Now my question is this: how do I identify the correct pool key? It looks like the keys in /data/geli/ are named after a UUID, but where do I look up that UUID?
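
In case it helps, one thing that might work is looking the key file name up in the FreeNAS config database. This is an assumption: that the 11.3 schema still has a vol_encryptkey column on storage_volume holding exactly that UUID; please verify before relying on it.

Code:
# map pool names to the UUID-named key files in /data/geli/ (assumed schema, see above)
sqlite3 /data/freenas-v1.db "SELECT vol_name, vol_encryptkey FROM storage_volume;"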
 

Patrick M. Hausen (Hall of Famer)
Sorry, I simply don't know. Let's hope someone using geli can fill in.
 

futex (Dabbler)
The saga continues. Meanwhile I figured that the newest key located in /data/geli is probably the one I am looking for.
So I used the geli init command from my last post and voilà, I had an .eli device I could attach to the pool. Everything solved.

Or so I thought.

FreeNAS was having none of it. After a maintenance reboot for an unrelated issue, I entered my passphrase to unlock the pool and ... it failed:
Code:
GEOM_ELI: Device gptid/5a79ebf7-4b82-11ea-bad2-ac1f6bb1efd4.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: hardware
GEOM_ELI: Device gptid/5a866535-4b82-11ea-bad2-ac1f6bb1efd4.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: hardware
GEOM_ELI: Device gptid/3db69fe7-5486-11ea-804c-ac1f6bb1efd4.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: hardware
GEOM_ELI: Device gptid/3ed268d7-5486-11ea-804c-ac1f6bb1efd4.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: hardware
GEOM_ELI: Device gptid/6d8f54c6-5486-11ea-804c-ac1f6bb1efd4.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: hardware
GEOM_ELI: Device gptid/6d9bbfd2-5486-11ea-804c-ac1f6bb1efd4.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: hardware
GEOM_ELI: Device gptid/978275f3-5486-11ea-804c-ac1f6bb1efd4.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: hardware
GEOM_ELI: Device gptid/381c2f11-a77b-11ea-b692-ac1f6bb1efd4.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: hardware
ZFS WARNING: Unable to attach to ada7.
ZFS WARNING: Unable to attach to ada7.
GEOM_ELI: Device gptid/5a79ebf7-4b82-11ea-bad2-ac1f6bb1efd4.eli destroyed.
GEOM_ELI: Device gptid/5a866535-4b82-11ea-bad2-ac1f6bb1efd4.eli destroyed.
GEOM_ELI: Device gptid/3db69fe7-5486-11ea-804c-ac1f6bb1efd4.eli destroyed.
GEOM_ELI: Device gptid/3ed268d7-5486-11ea-804c-ac1f6bb1efd4.eli destroyed.
GEOM_ELI: Device gptid/6d8f54c6-5486-11ea-804c-ac1f6bb1efd4.eli destroyed.
GEOM_ELI: Device gptid/6d9bbfd2-5486-11ea-804c-ac1f6bb1efd4.eli destroyed.
GEOM_ELI: Device gptid/978275f3-5486-11ea-804c-ac1f6bb1efd4.eli destroyed.
GEOM_ELI: Device gptid/381c2f11-a77b-11ea-b692-ac1f6bb1efd4.eli destroyed.


So the disk I replaced was now failing to be attached. I took to the CLI and did

Code:
geli attach -k /data/geli/<pool.key> ada*p2


and the pool unlocked and could be imported just fine. Except that it's now mounted under /silo instead of /mnt/silo, but details. Why should FreeNAS care about mountpoints when it doesn't even care about its own keys?

I am in the process of backing up the entire pool again, as my trust in FreeNAS is in free fall. I'd really like some advice on making unlocking via the GUI work again, preferably even at the old mount location, but I don't want to ask for too much.
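
For the mountpoint at least, re-importing with an altroot of /mnt should land the pool back under /mnt/silo (this is plain zpool behaviour and does not fix the GUI unlock itself):

Code:
# export the manually imported pool, then import it again with /mnt as altroot
zpool export silo
zpool import -R /mnt silo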
 