can't remove vdev from mirror after failure

Status
Not open for further replies.

xiaolonguk

Cadet
Joined
Sep 4, 2018
Messages
6
Guys

I had a drive die in my mirror. I followed the replace procedure, swapping out the disk, but my array is still degraded. If I look at zpool status, it's telling me the disk is UNAVAIL (the previously failed one). I can't detach this from the mirror as it says there are no valid replicas, yet it looks to have introduced the new disk into the mirror. Any ideas?

Code:
root@nas01:~ # zpool status HDD
  pool: HDD
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
		corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
		entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 136G in 0 days 01:07:33 with 6 errors on Wed Sep  5 05:28:49 2018
config:

		NAME											  STATE	 READ WRITE CKSUM
		HDD											   DEGRADED	 0	 0	16
		  mirror-0										ONLINE	   0	 0	 0
			gptid/d477972a-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
			gptid/d4c852a7-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
		  mirror-1										DEGRADED	 0	 0	33
			replacing-0								   DEGRADED	33	 0	 0
			  11686339191773454234						UNAVAIL	  0	 0	 0  was /dev/gptid/d529eca1-7e42-11e8-ba06-0015175a7c6c
			  gptid/71066f3a-aece-11e8-9513-0015175a7c6c  ONLINE	   0	 0	33
			gptid/d58c84f2-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	33
		  mirror-2										ONLINE	   0	 0	 0
			gptid/d5d9d204-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
			gptid/d6314f51-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
		  mirror-3										ONLINE	   0	 0	 0
			gptid/d683af31-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
			gptid/d6e9726d-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
		  mirror-4										ONLINE	   0	 0	 0
			gptid/d753dc6b-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
			gptid/d7a34b43-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
		  mirror-5										ONLINE	   0	 0	 0
			gptid/d7fd699f-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
			gptid/d85a7f45-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
		logs
		  gptid/d89f33c6-7e42-11e8-ba06-0015175a7c6c	  ONLINE	   0	 0	 0
		cache
		  gptid/d8e1c172-7e42-11e8-ba06-0015175a7c6c	  ONLINE	   0	 0	 0

root@nas01:~ # zpool offline HDD 11686339191773454234	
cannot offline 11686339191773454234: no valid replicas
root@nas01:~ # zpool detach HDD 11686339191773454234	  
cannot detach 11686339191773454234: no valid replicas
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
First idea: Use [CODE][/CODE] tags when pasting console output so that it's readable.

Second: You're not trying to remove a vdev, you're trying to remove a disk (a leaf vdev if you must).

Third: What's the output of zpool status -v?
 

xiaolonguk

Cadet
Joined
Sep 4, 2018
Messages
6
Hi, sorry for the confusion and bad formatting; lashes taken.

Code:
root@nas01:~ # zpool status -v HDD
  pool: HDD
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
		corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
		entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 136G in 0 days 01:07:33 with 6 errors on Wed Sep  5 05:28:49 2018
config:

		NAME											  STATE	 READ WRITE CKSUM
		HDD											   DEGRADED	 0	 0	16
		  mirror-0										ONLINE	   0	 0	 0
			gptid/d477972a-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
			gptid/d4c852a7-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
		  mirror-1										DEGRADED	 0	 0	33
			replacing-0								   DEGRADED	33	 0	 0
			  11686339191773454234						UNAVAIL	  0	 0	 0  was /dev/gptid/d529eca1-7e42-11e8-ba06-0015175a7c6c
			  gptid/71066f3a-aece-11e8-9513-0015175a7c6c  ONLINE	   0	 0	33
			gptid/d58c84f2-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	33
		  mirror-2										ONLINE	   0	 0	 0
			gptid/d5d9d204-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
			gptid/d6314f51-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
		  mirror-3										ONLINE	   0	 0	 0
			gptid/d683af31-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
			gptid/d6e9726d-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
		  mirror-4										ONLINE	   0	 0	 0
			gptid/d753dc6b-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
			gptid/d7a34b43-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
		  mirror-5										ONLINE	   0	 0	 0
			gptid/d7fd699f-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
			gptid/d85a7f45-7e42-11e8-ba06-0015175a7c6c	ONLINE	   0	 0	 0
		logs
		  gptid/d89f33c6-7e42-11e8-ba06-0015175a7c6c	  ONLINE	   0	 0	 0
		cache
		  gptid/d8e1c172-7e42-11e8-ba06-0015175a7c6c	  ONLINE	   0	 0	 0

errors: Permanent errors have been detected in the following files:

		<metadata>:<0x1>
		<metadata>:<0x56>
		/mnt/HDD/nps01/nps01-flat.vmdk

 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yeah, looks like you have metadata corruption, which is probably messing up your pool. A failure in a single vdev should definitely not cause that, since you have multiple vdevs and metadata is stored in two or three different places (on different vdevs as much as possible), so I'm not sure what happened.

You'll have to recreate the pool from backups...
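
If you don't already have backups, replication to a spare pool is the usual way out. Very roughly, something like the following, where BACKUP is a placeholder name for a second pool with enough space; expect the send to abort if it trips over the corrupted metadata:

Code:
# snapshot everything and replicate it to the spare pool
zfs snapshot -r HDD@evacuate
zfs send -R HDD@evacuate | zfs receive -Fu BACKUP/HDD

# once you're happy with the copy, destroy and recreate HDD, then restore
zpool destroy HDD
# ... recreate the pool (zpool create HDD mirror <disk> <disk> ...) ...
zfs send -R BACKUP/HDD@evacuate | zfs receive -F HDD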
 

xiaolonguk

Cadet
Joined
Sep 4, 2018
Messages
6
Is there a way I can migrate the data off the mirror that is causing the issues onto the remaining disks, then remove the mirror, rebuild it and add it back into the array?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
You have metadata corruption caused by errors across at least two vdevs. That alone means you're lucky if you can access all your data read-only.
Your file at /mnt/HDD/nps01/nps01-flat.vmdk is also corrupted.

Unfortunately, the pool is really toast.
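
If you want to try salvaging what you can first, a read-only import is sometimes better behaved than a writable pool. Entirely at your own risk, and it assumes nothing else is using the pool while you do it:

Code:
zpool export HDD
zpool import -o readonly=on HDD
# then copy the data off to other storage before destroying the pool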
 

xiaolonguk

Cadet
Joined
Sep 4, 2018
Messages
6
See, that's the odd thing. The array is online and the VMware VMs are all running with no errors at all. It appears the array is operating fine; it's just the disk in the mirror that seems to be generating errors.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Is there a way I can migrate the data off the mirror that is causing the issues onto the remaining disks, then remove the mirror, rebuild it and add it back into the array?

No, at this point vdev removal isn't an option in OpenZFS; and it's not that the disk is causing issues, it's that the data on it already appears to have corruption.

I'm similarly puzzled as to how this happened since metadata is mirrored across vdevs for this reason. Yes, you might have lost that one .vmdk but your pool shouldn't be this damaged from just one disk failing in a vdev, even if you were unlucky enough to have corruption on the other mirror member.

Can you post full system specs including CPU/motherboard/memory/HBA used?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
No, at this point vdev removal isn't an option in OpenZFS
Actually, it is, in FreeNAS 11.2 and newer. But only for pools consisting solely of mirrors/single disks. And I think the width of the mirrors might have to be the same on all vdevs.

That said, it will not solve this problem. If there was no metadata corruption, deleting the offending file would've been enough. With metadata corruption, your options are very limited.
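
For reference, the happy path for a single corrupt file (which doesn't apply here because of the metadata damage) would be something like:

Code:
rm /mnt/HDD/nps01/nps01-flat.vmdk
zpool clear HDD
zpool scrub HDD
zpool status -v HDD   # check that the error list clears after the scrub (sometimes it takes a second scrub)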
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Actually, it is, in FreeNAS 11.2 and newer.

Guess I need to try that beta version out then. Sure won't be with data I care about though. ;)

But only for pools consisting solely of mirrors/single disks. And I think the width of the mirrors might have to be the same on all vdevs.

Roadmap on the OpenZFS page says:

Current status: feature complete for singleton vdevs only; in internal production at Delphix. Expected to be extended to removal of mirror vdevs next. Removal of top-level RAIDZ vdevs technically possible, but ONLY for pools of identical raidz vdevs - ie 4 6-disk RAIDZ2 vdevs, etc. You will not be able to remove a raidz vdev from a "mutt" pool.

However, that all applies to "removal from a functional pool", not "removal when there's already damage", including damage to the metadata. I'm still puzzled as to how that occurred, so I'm hoping for that hardware info dump.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Current status: feature complete for singleton vdevs only; in internal production at Delphix. Expected to be extended to removal of mirror vdevs next. Removal of top-level RAIDZ vdevs technically possible, but ONLY for pools of identical raidz vdevs - ie 4 6-disk RAIDZ2 vdevs, etc. You will not be able to remove a raidz vdev from a "mutt" pool.
That doesn't match the information I gathered, but this one is indeed poorly-documented. Mirror removal is definitely in.
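
For the record, the operation itself is just a top-level remove on 11.2, along these lines (not that it helps a pool in this state):

Code:
zpool remove HDD mirror-1
zpool status HDD   # shows the evacuation progress, and an indirect vdev afterwards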
 

xiaolonguk

Cadet
Joined
Sep 4, 2018
Messages
6
So the system spec is as follows:

Dual Intel(R) Xeon(R) CPU L5420 @ 2.50GHz, providing 8 physical cores in total
32GB RAM
Intel 5000 Series motherboard
9650SE SATA-II 24-port RAID card, with all the disks running standalone.
The HDD pool is 12 x 1Gb HDD, with one SSD handling log and cache duties over two partitions.

The SSD pool is 2 x 500GB SSDs.


With respect to the VMDK it's saying has errors, that VM is running A-OK; I'm just in the process of vMotioning it to another disk pool in the same array.

Many thanks for all the help thus far. Are my only options to move all the VMs off, destroy the array and start again?
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
So the system spec is as follows:

Dual Intel(R) Xeon(R) CPU L5420 @ 2.50GHz, providing 8 physical cores in total
32GB RAM
Intel 5000 Series motherboard
9650SE SATA-II 24-port RAID card, with all the disks running standalone.
The HDD pool is 12 x 1Gb HDD, with one SSD handling log and cache duties over two partitions.

The SSD pool is 2 x 500GB SSDs.


With respect to the VMDK it's saying has errors, that VM is running A-OK; I'm just in the process of vMotioning it to another disk pool in the same array.

Many thanks for all the help thus far. Are my only options to move all the VMs off, destroy the array and start again?
It sounds like it. But I would also run badblocks on all of the disks in that pool while they're still on the same controller they're on now. Something caused more damage than it should have; you need to find out why before you put everything back on that pool.
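
Something along these lines per disk, once the pool has been evacuated or destroyed, since the write test wipes the drive; the device name below is an example, map your own first:

Code:
# map the gptid labels from zpool status to device names
glabel status

# DESTRUCTIVE write-mode test, one disk at a time - only on disks with no data you still need
badblocks -b 4096 -ws /dev/da2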
 

xiaolonguk

Cadet
Joined
Sep 4, 2018
Messages
6
Are there any commands specifically that you can recommend?

Thanks

Sent from my SM-G935F using Tapatalk
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
9650SE SATA-II RAID 24 Port RAID card
Suddenly we have a suspect. I am extremely suspicious of that controller and highly recommend you switch to an LSI SAS HBA as quickly as possible. Yes, I realize yours is also an LSI card, but it's an old, odd product and the words "SATA hardware RAID" trigger even more alarm bells than just "hardware RAID".
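
In the meantime it's worth checking whether the drives themselves are logging errors. smartmontools can usually talk through the 3ware driver if you pass the 3ware device type; the device node and port numbers below are guesses for a 9650SE, so adjust to what your system actually shows:

Code:
# one port per physical drive, 0 through 23 on a 24-port card
smartctl -a -d 3ware,0 /dev/twa0
smartctl -a -d 3ware,1 /dev/twa0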
 