replacing drive in data-pool

RJ_fr33

Dabbler
Joined
Jan 10, 2020
Messages
26
Hello,

I want to replace a disk in data-pool because it has many bad sectors. I did the following, but now I am stuck.

before:
Code:
      pool: data-pool
     state: ONLINE
    config:

            NAME                                            STATE     READ WRITE CKSUM
            data-pool                                       ONLINE       0     0     0
              raidz2-0                                      ONLINE       0     0     0
                gptid/492ae752-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0
                gptid/49a746fb-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0
                gptid/49b38914-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0
                gptid/49cbabd6-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0



The disk being replaced is ada2:
ada2 = rawuuid: 49b38914-ff00-11eb-859f-00241d1d87a3

Code:
# zpool offline data-pool gptid/49b38914-ff00-11eb-859f-00241d1d87a3
# zpool detach data-pool gptid/49b38914-ff00-11eb-859f-00241d1d87a3
cannot detach gptid/49b38914-ff00-11eb-859f-00241d1d87a3: only applicable to mirror and replacing vdevs


Next, I physically removed and replaced the disk.
Now the new disk is in, but I am not able to add it to the pool.
Code:
# camcontrol devlist
<MB2000EBUCF MK7OHPG3>             at scbus0 target 0 lun 0 (pass0,ada0)
<MB2000EBUCF MK7OHPG3>             at scbus2 target 0 lun 0 (pass1,ada1)
<MB2000EBUCF MK7OHPG3>             at scbus3 target 0 lun 0 (pass2,ada2)   <<<< New Disk
<MD2000GSA3272DVR 1 MKAOA840>      at scbus5 target 0 lun 0 (pass3,ada3)
<WDC WD1600BJKT-75F4T0 11.01A11>   at scbus6 target 0 lun 0 (pass4,ada4)
<JAJS600M128C T1125A0>             at scbus8 target 0 lun 0 (pass5,ada5)
<JAJS600M128C T1125A0>             at scbus9 target 0 lun 0 (pass6,ada6)

# zpool replace data-pool ada2
cannot replace ada2 with ada2: no such device in pool

# gpart list ada2
gpart: Class 'PART' does not have an instance named 'ada2'.

# gpart recover ada2
gpart: arg0 'ada2': Invalid argument


I do not think I should copy the partition table from another disk in the pool to this new disk.

What should I do?

Thank you.
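
On the `cannot replace ada2 with ada2` error above: when `zpool replace` is given only one device, it uses that name as both the old and the new device, and `ada2` is not a name this pool knows, since its members are listed by gptid. The sketch below is only an illustration, not the method used in this thread. It checks a member name against `zpool status` output and prints, without running, a replace command that names both devices. The helper names `is_pool_member` and `plan_replace` are made up for this example, and passing the whole disk `ada2` as the new device is an assumption.

```shell
#!/bin/sh
# Hypothetical helpers; nothing here modifies a pool.

# Succeed if NAME appears in the first column of `zpool status` output
# read from stdin (a rough membership check).
is_pool_member() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage: zpool status POOL | plan_replace POOL OLD_MEMBER NEW_DEVICE
# Prints the replace command instead of running it.
plan_replace() {
    if is_pool_member "$2"; then
        echo "zpool replace $1 $2 $3"
    else
        echo "$2 is not listed as a member of $1" >&2
        return 1
    fi
}
```

With the "before" status above, `zpool status data-pool | plan_replace data-pool ada2 ada2` would report that ada2 is not listed, while naming the old member by its gptid would print a complete command.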
 

RJ_fr33

OK, I did the following. It seems there is a new problem now...

Code:
# gpart backup ada1 | gpart restore ada2
# zpool replace data-pool gptid/49b38914-ff00-11eb-859f-00241d1d87a3 /dev/ada2p2


Now, instead of showing the gptid, it shows the full disk partition name. I am not sure how this is going to affect the pool. I tried to offline/remove and re-add it, but that is not working...

Code:
# zpool replace data-pool /dev/ada2p2 gptid/f5f77e32-0152-11ec-b246-00241d1d87a3
cannot open 'gptid/f5f77e32-0152-11ec-b246-00241d1d87a3': no such device in /dev
must be a full path or shorthand device name

# zpool offline data-pool ada2p2      (a grinding sound comes...)
# echo $?
0

# zpool status data-pool
  pool: data-pool
 state: ONLINE
  scan: resilvered 48K in 00:00:01 with 0 errors on Thu Aug 19 20:10:40 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        data-pool                                       ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/492ae752-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0
            gptid/49a746fb-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0
            ada2p2                                      ONLINE       0     0     0
            gptid/49cbabd6-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0

errors: No known data errors
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Why are you doing this at the CLI? Basically none of the pool management should be done there; the GUI is the way to go. I don't think you've broken anything, necessarily, but you've definitely gone about it the hard way.
 

RJ_fr33

I have found that /dev/gptid does not have the gptid of the new disk...

Code:
[/dev/gptid]# for d in 0 1 2 3 5 6;do ruid=`gpart list ada${d}|grep rawuuid|tail -1|awk '{print $NF}'`;echo "ada${d} ... `ls -l $ruid 2>/dev/null`";done
ada0 ... crw-r-----  1 root  operator  0x77 Aug 19 20:30 49cbabd6-ff00-11eb-859f-00241d1d87a3
ada1 ... crw-r-----  1 root  operator  0x86 Aug 19 20:30 492ae752-ff00-11eb-859f-00241d1d87a3
ada2 ...
ada3 ... crw-r-----  1 root  operator  0x8a Aug 19 20:30 49a746fb-ff00-11eb-859f-00241d1d87a3
ada5 ... crw-r-----  1 root  operator  0x8d Aug 19 20:30 ebc78f1d-fef8-11eb-859f-00241d1d87a3
ada6 ... crw-r-----  1 root  operator  0x90 Aug 19 20:30 f194e7a7-fef8-11eb-859f-00241d1d87a3


How can I remove this disk now? Physically disconnect it? Remove and re-add it?

How do I do it from the GUI? I removed the disk from the CLI in the first place.

I like the use of gptid internally, as it makes it easy to shuffle disk slots...
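
The one-liner above can be written out more readably. This is a sketch of the same check, not a replacement for anything the GUI does: `last_rawuuid` picks the rawuuid of the last partition from `gpart list` output, and `check_gptid_nodes` reports whether a matching node exists under /dev/gptid. Both function names and the `GPTID_DIR` override are made up for this example. A missing node does not necessarily mean a missing partition: FreeBSD may hide a partition's gptid node while the partition is open under another name such as ada2p2.

```shell
#!/bin/sh
# Print the rawuuid of the last partition found in `gpart list` output
# read from stdin. Prints nothing if no rawuuid line is present.
last_rawuuid() {
    awk '$1 == "rawuuid:" { uuid = $2 } END { if (uuid != "") print uuid }'
}

# For each disk name given, report whether a node named after that
# rawuuid exists. GPTID_DIR is a hypothetical override of /dev/gptid.
check_gptid_nodes() {
    dir=${GPTID_DIR:-/dev/gptid}
    for disk in "$@"; do
        uuid=$(gpart list "$disk" 2>/dev/null | last_rawuuid)
        if [ -z "$uuid" ]; then
            echo "$disk: no rawuuid found"
        elif [ -e "$dir/$uuid" ]; then
            echo "$disk: gptid/$uuid present"
        else
            echo "$disk: gptid/$uuid missing"
        fi
    done
}
```

It would be called as `check_gptid_nodes ada0 ada1 ada2 ada3 ada5 ada6`.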
 

RJ_fr33

More steps taken:


Shut down the server. Removed the SATA connection to the disk.

Restart: offlined the disk in the GUI.

Shut down. Reconnected the cable.

Restart: the disk now shows as offline in the pool. Surprisingly, its gptid now shows up in the /dev/gptid folder...

Code:
# zpool status data-pool
  pool: data-pool
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 72K in 00:00:00 with 0 errors on Thu Aug 19 20:33:58 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        data-pool                                       DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/492ae752-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0
            gptid/49a746fb-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0
            5790169899629969467                         OFFLINE      0     0     0  was /dev/ada2p2
            gptid/49cbabd6-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0

errors: No known data errors


# for d in 0 1 2 3 5 6;do ruid=`gpart list ada${d}|grep rawuuid|tail -1|awk '{print $NF}'`;echo "ada${d} ... `ls -l $ruid 2>/dev/null`";done
ada0 ... crw-r-----  1 root  operator  0x77 Aug 19 20:59 49cbabd6-ff00-11eb-859f-00241d1d87a3
ada1 ... crw-r-----  1 root  operator  0x86 Aug 19 20:59 492ae752-ff00-11eb-859f-00241d1d87a3
ada2 ... crw-r-----  1 root  operator  0x88 Aug 19 20:59 f5f77e32-0152-11ec-b246-00241d1d87a3
ada3 ... crw-r-----  1 root  operator  0x8a Aug 19 20:59 49a746fb-ff00-11eb-859f-00241d1d87a3
ada5 ... crw-r-----  1 root  operator  0x8d Aug 19 20:59 ebc78f1d-fef8-11eb-859f-00241d1d87a3
ada6 ... crw-r-----  1 root  operator  0x90 Aug 19 20:59 f194e7a7-fef8-11eb-859f-00241d1d87a3
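
In the status output above, the long number shown in place of a device name is the member's GUID, followed by the path it used to have (`was /dev/ada2p2`). As a small sketch, the filter below pulls such members out of `zpool status` output. The function name `unhealthy_members` and the list of states it matches are choices made for this example.

```shell
#!/bin/sh
# Print the name (or GUID) and state of every pool member whose state is
# OFFLINE, FAULTED, UNAVAIL or REMOVED, reading `zpool status` output
# from stdin. The pool and raidz rows show DEGRADED and are not matched.
unhealthy_members() {
    awk '$2 ~ /^(OFFLINE|FAULTED|UNAVAIL|REMOVED)$/ { print $1, $2 }'
}
```

It would be called as `zpool status data-pool | unhealthy_members`.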



Tried a few more things:

Replace from the GUI fails:

Code:
Error Replacing Disk
[EINVAL] options.force: Disk is not clean, partitions were found.
Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 367, in run
    await self.future
  File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 403, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 973, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool_/replace_disk.py", line 77, in replace
    raise verrors
middlewared.service_exception.ValidationErrors: [EINVAL] options.force: Disk is not clean, partitions were found.

Errors from the CLI:

Code:
# zpool replace  data-pool 5790169899629969467 gptid/f5f77e32-0152-11ec-b246-00241d1d87a3
invalid vdev specification
use '-f' to override the following errors:
/dev/gptid/f5f77e32-0152-11ec-b246-00241d1d87a3 is part of active pool 'data-pool'

# zpool replace -f data-pool 5790169899629969467 gptid/f5f77e32-0152-11ec-b246-00241d1d87a3
invalid vdev specification
the following errors must be manually repaired:
/dev/gptid/f5f77e32-0152-11ec-b246-00241d1d87a3 is part of active pool 'data-pool'

Destroyed the partition table from the CLI:
Code:
# gpart destroy -F ada2
ada2 destroyed

Then I tried again from the GUI. It got to about 15% and then failed with an error:
Code:
Error Replacing Disk
Could not replace disk.

Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 277, in replace
    target.replace(newvdev)
  File "libzfs.pyx", line 391, in libzfs.ZFS.__exit__
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 277, in replace
    target.replace(newvdev)
  File "libzfs.pyx", line 2060, in libzfs.ZFSVdev.replace
libzfs.ZFSException: /dev/gptid/1c4cf14a-015c-11ec-9eb6-00241d1d87a3 is busy, or device removal is in progress

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 94, in main_worker
    res = MIDDLEWARE._run(*call_args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 45, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 977, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 279, in replace
    raise CallError(str(e), e.code)
middlewared.service_exception.CallError: [EZFS_BADDEV] /dev/gptid/1c4cf14a-015c-11ec-9eb6-00241d1d87a3 is busy, or device removal is in progress
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 367, in run
    await self.future
  File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 403, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 973, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool_/replace_disk.py", line 122, in replace
    raise e
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool_/replace_disk.py", line 102, in replace
    await self.middleware.call(
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1248, in call
    return await self._call(
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1213, in _call
    return await self._call_worker(name, *prepared_call.args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1219, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1146, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1120, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [EZFS_BADDEV] /dev/gptid/1c4cf14a-015c-11ec-9eb6-00241d1d87a3 is busy, or device removal is in progress
 

RJ_fr33

SUCCESS at LAST !!!!!!


I tried the replace from the GUI again. This time I "forced" it.

Code:
Replacing Disk 
Successfully replaced disk /dev/ada2p2.


Now the pool is ready, with a gptid!

Code:
# zpool status data-pool
  pool: data-pool
 state: ONLINE
  scan: resilvered 3.25M in 00:00:02 with 0 errors on Thu Aug 19 21:16:26 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        data-pool                                       ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/492ae752-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0
            gptid/49a746fb-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0
            gptid/9ddddd93-015c-11ec-9eb6-00241d1d87a3  ONLINE       0     0     0
            gptid/49cbabd6-ff00-11eb-859f-00241d1d87a3  ONLINE       0     0     0

errors: No known data errors


I hope this is helpful to someone. I was able to do all these changes because my pool is empty right now. :)

Final thoughts:
I think the main thing is to do the 'replace' from within the GUI. It creates a partition table on the disk, saves its gptid in whichever locations/files it needs, and then adds the disk to the pool. You need a disk with NO partition table on it. It is OK if you do the removal steps from the CLI.
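
The point about needing a disk with no partition table can be checked before starting the replacement. A minimal sketch, assuming `gpart list` exits non-zero for a disk without a partition table, as the error output earlier in the thread suggests. It prints the cleanup command rather than running it, because `gpart destroy -F` erases the partition table on whatever disk it is given. The function names are hypothetical.

```shell
#!/bin/sh
# Succeed if gpart can see a partition table on the given disk.
# Assumption: `gpart list DISK` fails when the disk has none.
has_partition_table() {
    gpart list "$1" >/dev/null 2>&1
}

# Print (do not run) the command that would wipe the partition table,
# or a comment line if there is nothing to wipe.
wipe_plan() {
    if has_partition_table "$1"; then
        echo "gpart destroy -F $1"
    else
        echo "# $1 has no partition table; nothing to wipe"
    fi
}
```

For example, `wipe_plan ada2` prints the command to review. Double-check the disk name before running anything it prints.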
 

danb35

"It is ok if you do the removal steps from CLI."
It's OK, but still not recommended. Really, there's no reason to use the CLI here at all, and doing so runs the risk of confusing the GUI/middleware. In the past, you had to use the CLI if you wanted to add a disk as a mirror of a single-disk vdev, but TN12 finally included that feature in the GUI, so it's now a pretty rare case where you'd need to use the CLI--and if you don't need to use the CLI, it's really best that you not.

But with all that said, glad you got it all working. I don't think the system should give a traceback on a situation like this, though; you may want to file a bug about that (the Report a Bug link at the top of the page).
 