Disk UNAVAIL after trying to expand RAIDZ1 VDEV: "label missing or invalid"

jhl

Dabbler
Joined
Mar 5, 2023
Messages
27
TrueNAS SCALE 23.10.1

VM in Proxmox, PCI passthrough 9207 HBA

I have reproduced this issue multiple times in a row

After replacing the smallest disk in a VDEV with a larger disk, the pool works fine and keeps working after a reboot and an export/import... until I try to expand the VDEV to fill the available capacity.

"Expand" gives an error that I need to reboot because the partitions were changed but the kernel was not notified of the change

After the reboot I get the following error: "Pool is DEGRADED: One or more devices could not be used because the label is missing or invalid."

This "UNAVAIL" disk is still attached to the system and apparently working

The GUI actually lets me replace the "missing" disk with the exact same physical disk, but it refuses to recognize that it's the same one and resilvers from scratch as if the drive had never been in the pool.

A search on this issue turned up a fix: exporting the pool and then re-importing it with "sudo zpool import disk-pool -d /dev/disk/by-id".

This does work but the pool doesn't show up in all locations in the web UI, and it breaks again on the next reboot.
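For reference, the sequence is roughly the following (Backup is my pool's name; substitute your own):

  sudo zpool export Backup
  sudo zpool import -d /dev/disk/by-id Backup   # search the by-id paths instead of whatever the kernel enumerated

As noted, though, the fix doesn't survive the next normal import or reboot.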

Any advice?
 

jhl

Dabbler
Joined
Mar 5, 2023
Messages
27
This is my zpool status, after the problem started, which might give a clue:

NAME                                        STATE     READ WRITE CKSUM
Backup                                      DEGRADED     0     0     0
  raidz1-0                                  DEGRADED     0     0     0
    wwn-0x5000c5009431ed43-part1            ONLINE       0     0     0
    b212c54c-ae40-4263-854e-942559362b75    ONLINE       0     0     0
    9960309339138369302                     UNAVAIL      0     0     0  was /dev/disk/by-id/wwn-0x5000cca05c996c3c-part1
  raidz1-1                                  ONLINE       0     0     0
    wwn-0x5000c5005917cb33-part2            ONLINE       0     0     0
    wwn-0x5000c5005917ca8b-part2            ONLINE       0     0     0
    wwn-0x5000c50059182dfb-part2            ONLINE       0     0     0
  raidz1-2                                  ONLINE       0     0     0
    wwn-0x5000c500833fc2d7-part2            ONLINE       0     0     0
    wwn-0x5000c5005917baeb-part2            ONLINE       0     0     0
    wwn-0x5000c50058f9910b-part2            ONLINE       0     0     0

I can see the problem disk is not identified by its unique ID, but it's unclear how that happened. I built this pool and replaced that disk into it via the web UI, so I didn't take any unsupported steps.

Also the unique ID of the drive didn't change because the "was" name still resolves to this disk.
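(A quick check, something like:

  ls -l /dev/disk/by-id/ | grep 5000cca05c996c3c   # the wwn-...-part1 symlink still points at a live /dev/sdX device

is how I can tell the WWN is still being presented by the same drive.)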

I did not check until now so I don't know if the different naming was already there before I attempted to Expand the pool.

I am currently trying to replace using the unique ID to see if it behaves any better.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
You posted in the wrong section: virtualized instances require specific knowledge and thus have their own section... but a moderator will hopefully move the thread.

Edit: Just noticed this is a COBIA issue; section might be appropriate. Mods will decide.

Please provide a full hardware list and read the following resource.

Did you try replacing the drive a second time in order to see if it fixes things?
 

jhl

Dabbler
Joined
Mar 5, 2023
Messages
27
I am passing the entire controller through via PCI passthrough.

The hardware is a Cisco C240 M5, and TrueNAS is connected via the HBA to an SA120 disk shelf.

If anything, I suspect it's the disk shelf, not the virtualization, that is causing disks to get "mixed up"; I have noticed disks enumerating differently each time I reboot. But I also thought TrueNAS could handle this: physically swapping disks around causes no issues, for example, yet once I use the Expand function this problem crops up.
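For what it's worth, the reshuffling is easy to see with something like:

  lsblk -o NAME,SIZE,SERIAL,WWN   # the sdX names move around between reboots, but the serial/WWN stay with the physical disk

which is why I assumed ZFS would keep tracking the disks by their stable IDs rather than by sdX name.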

Edit: Replacing the drive multiple times has "worked" until I go to expand the VDEV; then the exact same problem crops back up.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
This is not something that should happen. Have you perhaps changed any swap space setting?
IIRC there was a similar bug in a previous version, but I recall it having been addressed.
 

jhl

Dabbler
Joined
Mar 5, 2023
Messages
27
Huh... I actually did change a swap space setting: I noticed it was only 2 GB and upped it to 8 GB. But I did this while I was moving the system dataset to the boot-pool, because I was going to be exporting/importing this pool, so I think that was after this problem started.

I will try this on a fresh install and see if I can still reproduce though.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
A search on this issue turned up a fix: exporting the pool and then re-importing it with "sudo zpool import disk-pool -d /dev/disk/by-id".

This does work but the pool doesn't show up in all locations in the web UI, and it breaks again on the next reboot.
What if you export it again, from the CLI, and then try to import it normally?
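Something along these lines, assuming the pool name from your status output; ideally the "normal" import would be done from the web UI so the middleware knows about it:

  sudo zpool export Backup
  sudo zpool import Backup   # plain import, no -d search path (or import via the web UI instead)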
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
After replacing the smallest disk in a VDEV with a larger disk, the pool works fine and keeps working after a reboot and an export/import... until I try to expand the VDEV to fill the available capacity.
This is not something that should normally be a manual action... it's automatic if the defaults for the pool are set (autoexpand=on... you can check that with zpool get autoexpand).
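A quick sketch, using the pool name from your output; <device> is a placeholder for a member name as shown in zpool status, and the set command only matters if the property comes back off:

  zpool get autoexpand Backup
  sudo zpool set autoexpand=on Backup    # only if it was off
  sudo zpool online -e Backup <device>   # manual per-device expansion, roughly what the Expand button is meant to drive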
 

jhl

Dabbler
Joined
Mar 5, 2023
Messages
27
So I tried it on a newly-imaged system, and after resilvering the same disk back into the pool, some things are different. The Expand button no longer gives an error, no longer changes the disk IDs in the pool, and no longer breaks anything... but it also doesn't appear to do anything. Also, upon importing this same pool into the new system, the disks are all identified differently than they were on the problem system, and these IDs persist even after using the Expand button.
NAME                                        STATE     READ WRITE CKSUM
Backup                                      ONLINE       0     0     0
  raidz1-0                                  ONLINE       0     0     0
    1d36cf4a-49ae-46af-a5cb-81a4c107eb23    ONLINE       0     0     0
    b212c54c-ae40-4263-854e-942559362b75    ONLINE       0     0     0
    67ba421b-35af-4c8b-8097-cf6c6fd00bcc    ONLINE       0     0     0
  raidz1-1                                  ONLINE       0     0     0
    0dd6cbc1-9ba8-485f-9157-a4abd7cb6883    ONLINE       0     0     0
    cbda5ab6-005b-43d4-a218-c220eef8b94f    ONLINE       0     0     0
    733e0ceb-7852-456a-a2c1-3acfcd44ccdb    ONLINE       0     0     0
  raidz1-2                                  ONLINE       0     0     0
    47cfe4cd-fe0b-4dab-949c-bb61dc715c9e    ONLINE       0     0     0
    3b073618-5813-440a-896d-c4b46255a09b    ONLINE       0     0     0
    444437f1-01a4-4779-8b11-0ce30b52a3e7    ONLINE       0     0     0
The problem system, by contrast, was using wwn- IDs for most of the disks, this type of ID for one disk, and a totally different ID format for the problem disk.
What if you export it again, from the CLI, and then try to import it normally?
Once this problem occurs, the disk is never seen again with a normal (web UI) import or after any reboot - at least not until I replace it back in as if it were a new disk. I can always see the missing disk in the pool by importing with "sudo zpool import disk-pool -d /dev/disk/by-id", but this only lasts until the next normal import.
This is not something that should normally be a manual action... it's automatic if the defaults for the pool are set (autoexpand=on... you can check that with zpool get autoexpand).
According to "zpool get autoexpand", autoexpand is enabled on this pool.
This is a weird configuration: I have 3 VDEVs; #1 has 3 matching larger drives, while #2 and #3 each have 3 matching smaller drives. I guess TrueNAS might not support expanding to fill this free space because VDEV #1 would end up larger than the other VDEVs if I expanded it? My understanding is that mixing dissimilar VDEVs is not optimal for performance but should work.
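For what it's worth, the per-vdev sizes can also be checked from the CLI with something like:

  zpool list -v Backup   # the EXPANDSZ column shows extra space ZFS can see on a device but isn't using yet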
 

jhl

Dabbler
Joined
Mar 5, 2023
Messages
27
I think I found the reason I can't expand the VDEV with the larger disks. These are 4 TB disks (~3.6 TB usable), and I'm trying to expand from an original usable size of 1.8 TB per disk (2 TB disks). One of them has, in order:
1.8 TB of empty space
2 GB Linux swap
1.8 TB ZFS partition

The 2 GB swap looks like it came either from the new system, or from the old system before I messed with the swap size, but this partition layout looks wrong: you can't grow the ZFS partition with the swap sitting right in the middle of the drive.

The other 2 disks from the larger VDEV have, in order:
1.8 TB ZFS partition
8 GB Linux swap
1.8 TB of empty space

That looks like a remnant from the problem system where I set the 8 GB swap size. Again, because of the layout of these partitions, it is not possible to expand the ZFS partition on those disks.

The other VDEVs have 2 GB swap partitions at the start of the disks, not in the middle or at the end. Assuming you replaced one of those with a larger disk and it inherited the same partition layout, that would be correct: the ZFS partition would sit next to the free space, and you'd be able to expand it once every disk in the VDEV was replaced with a larger one.
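For anyone wanting to check their own layout, something like this shows it (the by-id name below is just one of my larger disks as an example):

  sudo parted /dev/disk/by-id/wwn-0x5000cca05c996c3c unit GiB print free
  lsblk -o NAME,SIZE,PARTTYPENAME   # quick overview of every disk's partitions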

So only the VDEV where I replaced disks and/or tried to Expand on the bugged system was affected by this bug.

I think it might be possible to fix this by manually changing the partition tables, or by replacing each of the weirdly partitioned drives in order on an up-to-date system with a default configuration, but I'm probably going to destroy the pool and restore from backup, as that seems like the least hassle at this point.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The best/simplest way to fix that is to take out the drives (offline, then wipe) and replace them with themselves one by one.

The automatic repartitioning should take care of the problem you mention, and you should end up with larger available capacity at the end.
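Roughly, per disk (the member name and /dev/sdX are placeholders; the offline, wipe and replace steps can also all be done from the web UI):

  sudo zpool offline Backup <member-name-from-zpool-status>
  sudo wipefs -a /dev/sdX                 # or: sudo sgdisk --zap-all /dev/sdX
  # then use the Replace action to put the same disk back, and let it resilver before moving on to the next drive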
 