[help] After clicking Expand, the storage pool's hard disks show as unallocated (LABEL 2 (Bad label cksum))

Akito1

Cadet
Joined
Jul 9, 2023
Messages
4
2*18T HDD stripe pool

I clicked Expand on the page and it damaged MainPool. Please tell me how to recover the data.

1. After clicking Expand, the system reported this error (a sketch for checking what it means follows step 4):
[EFAULT] Command partprobe /dev/sdc failed (code 1): Error: Partition(s) 2 on /dev/sdc have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making further changes.
QQ截图20240319144856.png

2. My MainPool is a stripe of 2 hard disks. When the system reported the error, the page showed one disk as unassigned. I felt something was wrong and immediately restarted the system. After the restart, both disks were shown as uninitialized and the pool status was offline.


3. I then performed Export/disconnect on MainPool.
QQ截图20240319150144.png

4. Import Pool shows nothing. Does this mean my data is damaged?
QQ截图20240319145856.png
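A note on the partprobe error in step 1: it means the new partition table was written to disk, but the kernel kept using the old in-memory layout because the partitions were busy. A minimal sketch for checking that mismatch, assuming the device name from the error message (/dev/sdc; adjust to your system):

cat /sys/block/sdc/sdc2/start /sys/block/sdc/sdc2/size   # kernel's view, in 512-byte sectors
sfdisk -d /dev/sdc                                       # what is actually written in the GPT
# On a 4096-byte-sector disk, multiply the sfdisk values by 8 before comparing.
# If the two disagree, the old layout is still in use, exactly as the error says.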

I would also like to know what Expand actually does. I clicked it on backupPool, which consists of a single hard disk, and that storage pool did not change.



root@truenas[/dev/disk/by-partuuid]# zpool import
   pool: MainPool
     id: 8760900616562481857
  state: UNAVAIL
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
 config:

        MainPool                                UNAVAIL  insufficient replicas
          5dcf2847-0f8f-42c9-a75a-33ce8e7b78e9  UNAVAIL  invalid label
          e99f8385-9e1c-44e4-91c7-4b1b8a6b160f  ONLINE
root@truenas[/dev/disk/by-partuuid]#

root@truenas[~]# zdb -l /dev/sdd2
failed to unpack label 0
failed to unpack label 1
------------------------------------
LABEL 2 (Bad label cksum)
------------------------------------
    version: 5000
    name: 'MainPool'
    state: 0
    txg: 13065955
    pool_guid: 8760900616562481857
    errata: 0
    hostid: 2139740830
    hostname: 'truenas'
    top_guid: 6877940474972169588
    guid: 6877940474972169588
    vdev_children: 2
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 6877940474972169588
        path: '/dev/disk/by-partuuid/5dcf2847-0f8f-42c9-a75a-33ce8e7b78e9'
        metaslab_array: 143
        metaslab_shift: 34
        ashift: 12
        asize: 17998054948864
        is_log: 0
        DTL: 117795
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
        com.klarasystems:vdev_zaps_v2
    labels = 2 3


root@truenas[~]# sfdisk -d /dev/sdd
label: gpt
label-id: 8F674207-71A3-4B4C-A9CB-64C24ED98A3C
device: /dev/sdd
unit: sectors
first-lba: 6
last-lba: 4394582010
sector-size: 4096

/dev/sdd1 : start=        128, size=     524161, type=0657FD6D-A4AB-43C4-84E5-0933C84B4F4F, uuid=24C14E32-7AE7-4D5D-87EF-C03AFB3B55C2
/dev/sdd2 : start=    4195328, size= 4390386683, type=6A898CC3-1DD2-11B2-99A6-080020736631, uuid=5DCF2847-0F8F-42C9-A75A-33CE8E7B78E9

root@truenas[~]# sfdisk -d /dev/sde
label: gpt
label-id: B4605A92-EDDE-4447-B71E-553E08A9E197
device: /dev/sde
unit: sectors
first-lba: 6
last-lba: 4394582010
sector-size: 4096

/dev/sde1 : start=        128, size=     524161, type=0657FD6D-A4AB-43C4-84E5-0933C84B4F4F, uuid=9FE3FA6C-6C5D-4386-BC53-850CDE38A42D
/dev/sde2 : start=     524416, size= 4394057595, type=6A898CC3-1DD2-11B2-99A6-080020736631, uuid=E99F8385-9E1C-44E4-91C7-4B1B8A6B160F
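Comparing the two dumps already shows the problem: both disks have identical geometry, but partition 2 on /dev/sdd now starts at sector 4195328 while the intact member /dev/sde starts at 524416. ZFS keeps two of its four labels at the start of the partition and two at the end, so shifting the start is consistent with zdb only finding the trailing labels. A quick way to see the difference, using the device names above:

# Diff the partition dumps of the two pool members; apart from the per-disk
# identifiers, only the rewritten partition 2 line should differ.
diff <(sfdisk -d /dev/sdd) <(sfdisk -d /dev/sde)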
 

Akito1

Cadet
Joined
Jul 9, 2023
Messages
4
1. Fortunately, I have repaired the storage. The cause of this problem is that TrueNAS modified the starting position of one hard disk's partition. As you can see in the picture, the two hard disks make up the pool.
Administrators should pay attention to this issue; it is very serious.
QQ截图20240319211651.png


2. The part below is the code the backend runs when Expand is clicked on the web page. From that code, the correct partitioning command can be reconstructed:
-d 2 deletes partition 2; -n 2:524416:0 recreates partition 2 starting at sector 524416 and ending at the end of the disk; -t 2:BF01 sets partition 2's type; -u 2:8760900616562481857 sets partition 2's UUID (which can be obtained with blkid /dev/sde2); /dev/sde is the physical disk.

sgdisk -d 2 -n 2:524416:0 -t 2:BF01 -u 2:8760900616562481857 /dev/sde

3. Finally, run partprobe /dev/sde to make the kernel re-read the disk, use zdb -l /dev/sde2 to confirm the data is correct, and then import the pool from the web UI.
QQ截图20240319212705.png
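To sum up, the recovery described in steps 2 and 3 boils down to the sketch below. It assumes, as above, that /dev/sde is the rewritten member, that the original start sector (524416 here) is read from the intact member, and that the original partition UUID comes from blkid; verify every value against your own disks before running anything, because recreating the partition with wrong numbers makes things worse.

blkid /dev/sde2        # note the PARTUUID of the damaged member's data partition
# Delete partition 2 and recreate it at the original start, extending to the end
# of the disk, with the ZFS type code and the original partition UUID:
sgdisk -d 2 -n 2:524416:0 -t 2:BF01 -u 2:<original-partuuid> /dev/sde
partprobe /dev/sde     # make the kernel re-read the partition table
zdb -l /dev/sde2       # the ZFS labels should now be readable
zpool import           # MainPool should be visible again; import it from the web UI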




@private
async def expand_partition(self, part_data):
    # Number and current start sector of the partition to expand,
    # as supplied by the caller.
    partition_number = part_data['partition_number']
    start = part_data['start_sector']
    # Delete the partition and recreate it from `start` to the end of the disk,
    # keeping the ZFS type code (BF01) and the original partition UUID.
    await run(
        'sgdisk', '-d', str(partition_number), '-n', f'{partition_number}:{start}:0', '-t',
        f'{partition_number}:BF01', '-u', f'{partition_number}:{part_data["partition_uuid"]}',
        os.path.join('/dev', part_data['disk'])
    )
    # Ask the kernel to re-read the partition table.
    await run('partprobe', os.path.join('/dev', part_data['disk']))
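For what it's worth, the numbers in the sfdisk dumps hint at how start_sector went wrong: the intact partition starts at 524416 in that disk's 4096-byte sectors, which is 4195328 in 512-byte sectors, and 4195328 is exactly where the damaged partition had been recreated. The kernel's sysfs view can be used to cross-check such values (sysfs always counts 512-byte sectors); device names below are just the ones from this thread:

cat /sys/block/sde/sde2/start    # current start of sde2, in 512-byte sectors
# Divide by 8 to compare with the 4096-byte-sector values printed by sfdisk -d.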
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
The cause of this problem is that TrueNAS modified the starting position of one hard disk's partition. As you can see in the picture, the two hard disks make up the pool.
Administrators should pay attention to this issue; it is very serious.

The documentation says:
Select Expand Pool to increase the pool size to match all available disk space. Users with pools using virtual disks use this option to resize these virtual disks apart from TrueNAS.
Implicitly one could argue that this is more or less expected behavior, although I'm not a hundred percent sure it is expected to break anything when virtual disks are not involved. I see your point, though. Maybe you can use the feedback button in the documentation to ask for an additional note that this option should be used with caution.
 

PhilD13

Patron
Joined
Sep 18, 2020
Messages
203
I think the OP has/had a pool of 2*18TB drives and ran the command on the pool. It was not clear to me whether the 18TB drives replaced smaller drives or whether there was some other reason to attempt an expand.
 

Akito1

Cadet
Joined
Jul 9, 2023
Messages
4
I think the OP has/had a pool of 2*18TB drives and ran the command on the pool. It was not clear to me whether the 18TB drives replaced smaller drives or whether there was some other reason to attempt an expand.
I mistakenly assumed there would be further selection steps behind the button, so I clicked OK and it ran immediately. The system wrongly changed the partition table of one of my hard disks, which took the pool offline.
 

PhilD13

Patron
Joined
Sep 18, 2020
Messages
203
If you got into this mess by replacing smaller drives with larger ones in a data pool and the pool did not auto-expand, you can make it expand. You can check whether autoexpand is on for the data pool with:
sudo zpool get autoexpand
If autoexpand is on for the pool you had issues with, there is a known bug that should be fixed in 23.10.2. If not, see the following to make the pool expand:
Manual disk replacement in TrueNAS SCALE

Even if you don't want or need the information in the article, I think it is worth reading.
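For reference, a minimal sketch of those checks, assuming the pool from this thread (MainPool) and running as root or with sudo:

zpool get autoexpand MainPool          # is automatic expansion enabled?
zpool set autoexpand=on MainPool       # enable it if it is off
# After a larger disk has been resilvered in, ask ZFS to claim the extra space:
zpool online -e MainPool /dev/disk/by-partuuid/<member-partuuid>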
 