SOLVED: After replacing a disk and resilvering, the pool is still in DEGRADED status and the drive still shows as replacing

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
Hi,

In recent weeks, two of my eight drives (/dev/ada3 & /dev/ada5) started to report more and more bad sectors and errors, so I decided to replace them.
I followed the official instructions for replacing a failed disk: https://www.ixsystems.com/documentation/freenas/11.3-U1/storage.html#replacing-a-failed-disk

I've done all of those steps via the GUI (a rough CLI equivalent is sketched below for reference):
* changed the faulty /dev/ada3 disk's status to OFFLINE
* shut down the system
* physically replaced the failing drive with a new one
* replaced the OFFLINE drive with the new one in the GUI, which started resilvering
* resilvering took roughly 10 hours and finished at Sun Apr 26 05:58:46 2020
* during the resilvering a short SMART test ran and found two new issues with the other faulty drive, /dev/ada5 (which I want to replace next)

Code:
CRITICAL
Device: /dev/ada5, ATA error count increased from 65 to 70.
Sun, 26 Apr 2020 12:20:08 AM (Europe/London)


Code:
CRITICAL
Device: /dev/ada5, Self-Test Log error count increased from 8 to 9.
Sun, 26 Apr 2020 12:50:07 AM (Europe/London)
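
For reference, my rough understanding of the CLI equivalent of those GUI steps is sketched here (just a sketch; on FreeNAS the middleware normally partitions the new disk and adds it by gptid for you, and <new-partition-gptid> is only a placeholder):

Code:
# offline the failing member (its gptid comes from zpool status)
zpool offline nasferatuvolume gptid/36387c1c-9605-11e5-b66a-382c4abd4614

# ...shut down, physically swap the drive, boot back up...

# attach the new disk's ZFS partition in place of the offlined member
zpool replace nasferatuvolume gptid/36387c1c-9605-11e5-b66a-382c4abd4614 gptid/<new-partition-gptid>

# watch the resilver
zpool status -v nasferatuvolume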


Once the resilvering finished (with 3 errors), the status of the new drive changed to ONLINE, but the drive is now in a strange "REPLACING/OFFLINE" mode.

[Screenshot: Selection_999(155).png]


I tried to detach that OFFLINE "/dev/gptid/36387c1c-9605-11e5-b66a-382c4abd4614" but got this error:

Code:
[EZFS_NOTSUP] Can detach disks from mirrors and spares only


PS: The full stack trace for that error is at the end of this post.


I then checked the pool status via the CLI and it showed me the same 3 permanent errors as before I replaced the drive.
PS: I don't care much about losing those files.

Code:
nasferatu# zpool status -v nasferatuvolume
  pool: nasferatuvolume
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 3.09T in 0 days 14:09:18 with 3 errors on Sun Apr 26 05:58:46 2020
config:

    NAME                                              STATE     READ WRITE CKSUM
    nasferatuvolume                                   DEGRADED     0     0     3
      raidz2-0                                        DEGRADED     0     0     6
        gptid/271e756e-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        replacing-3                                   OFFLINE      0     0     0
          854069389795236902                          OFFLINE      0     0     0  was /dev/gptid/36387c1c-9605-11e5-b66a-382c4abd4614
          gptid/f60625cd-8703-11ea-9f29-001b21275bb9  ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/40a65355-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        nasferatuvolume/photo@auto-20200420.0500-1w:/2016/2016-06-04_02/VID_20160404.mp4
        nasferatuvolume/photo@auto-20200420.0500-1w:/2017/2017-03-04_05/VID_20170304.mp4
        nasferatuvolume/photo@auto-20200420.0500-1w:/2017/2017-05-05_01/VID_20170505.mp4



I did a lot of googling but could not find a clear set of instructions for what to do next :(



Full stacktrace for "ZFSException: Can detach disks from mirrors and spares only"
Code:
Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 247, in __zfs_vdev_operation
    op(target, *args)
  File "libzfs.pyx", line 369, in libzfs.ZFS.__exit__
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 247, in __zfs_vdev_operation
    op(target, *args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 256, in <lambda>
    self.__zfs_vdev_operation(name, label, lambda target: target.detach())
  File "libzfs.pyx", line 1764, in libzfs.ZFSVdev.detach
libzfs.ZFSException: Can detach disks from mirrors and spares only

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 95, in main_worker
    res = loop.run_until_complete(coro)
  File "/usr/local/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 51, in _run
    return await self._call(name, serviceobj, methodobj, params=args, job=job)
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 43, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 43, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.7/site-packages/middlewared/schema.py", line 965, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 256, in detach
    self.__zfs_vdev_operation(name, label, lambda target: target.detach())
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 249, in __zfs_vdev_operation
    raise CallError(str(e), e.code)
middlewared.service_exception.CallError: [EZFS_NOTSUP] Can detach disks from mirrors and spares only
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 130, in call_method
    io_thread=False)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1077, in _call
    return await methodobj(*args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/schema.py", line 961, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/pool.py", line 1194, in detach
    await self.middleware.call('zfs.pool.detach', pool['name'], found[1]['guid'])
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1127, in call
    app=app, pipes=pipes, job_on_progress_cb=job_on_progress_cb, io_thread=True,
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1074, in _call
    return await self._call_worker(name, *args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1094, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1029, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1003, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [EZFS_NOTSUP] Can detach disks from mirrors and spares only
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
Today I decided to clear the errors and try to replace that OFFLINE disk 854069389795236902 with the same /dev/ada3 disk.
Running zpool clear nasferatuvolume did nothing: all three errors were still visible.
Then I replaced that OFFLINE disk, which triggered resilvering again.
10 hours later, nothing has changed.

Code:
# zpool status -v
  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      da0p2     ONLINE       0     0     0

errors: No known data errors

  pool: nasferatuvolume
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 3.09T in 0 days 13:48:02 with 3 errors on Mon Apr 27 22:19:09 2020
config:

    NAME                                              STATE     READ WRITE CKSUM
    nasferatuvolume                                   DEGRADED     0     0     3
      raidz2-0                                        DEGRADED     0     0     6
        gptid/271e756e-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        replacing-3                                   DEGRADED     0     0     0
          854069389795236902                          OFFLINE      0     0     0  was /dev/gptid/36387c1c-9605-11e5-b66a-382c4abd4614
          gptid/f60625cd-8703-11ea-9f29-001b21275bb9  ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/40a65355-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        nasferatuvolume/photo@auto-20200420.0500-1w:/2016/2016-06-04_02/VID_20160404.mp4
        nasferatuvolume/photo@auto-20200420.0500-1w:/2017/2017-03-04_05/VID_20170304.mp4
        nasferatuvolume/photo@auto-20200420.0500-1w:/2017/2017-05-05_01/VID_20170505.mp4
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
Should I run a scrub, or somehow forcefully remove that old drive from the pool?
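
In case it helps, I guess the two options would look roughly like this (untested; the guid is the stale OFFLINE member from the status output above):

Code:
# option 1: scrub the pool and see whether the errors clear
zpool scrub nasferatuvolume

# option 2: drop the stale OFFLINE half of replacing-3 by its guid
zpool detach nasferatuvolume 854069389795236902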
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
I tried again to detach that OFFLINE "/dev/gptid/36387c1c-9605-11e5-b66a-382c4abd4614" drive and again got the same error:
[EZFS_NOTSUP] Can detach disks from mirrors and spares only

Here's a short screencast:
[Screencast: cant_detach_disks_from_mirros_and_spares_only.gif]
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
I wonder why it says "Detach disk /dev/g?" in that detach modal.
There's no such device on my machine.
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
Hmmmm... after restarting the machine, resilvering kicked in again.
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
Maybe I should've listed my config earlier.

ATM I'm running 11.3-RELEASE-p6 on:
  • Mobo: ASUS A88XM-PLUS
  • CPU: AMD A10-7870K
  • RAM: 32GB non-ECC DDR3 1600MHz
  • NIC: PCI: Intel(R) PRO/1000 & on-board RealTek 8168/8111
  • HDD: 6x4TB Seagate ST4000VN003 + 2x4TB ST4000DM000
My upgrade plan was to:
  1. first replace the two slowly failing ST4000VN003 drives (ada3 & ada5) with brand new 8TB Seagate IronWolf ST8000VN004 drives
  2. then replace the two ST4000DM000 desktop drives
  3. and finally replace the remaining 4x4TB ST4000VN003 drives

But as you can see from the previous messages, I got stuck at step 1 :)

Below are the outputs of "camcontrol devlist" & "gpart show", run while resilvering was still in progress:
Code:
# camcontrol devlist
<ST4000VN003-1T5168 SC46>          at scbus0 target 0 lun 0 (pass0,ada0)
<ST4000VN003-1T5168 SC46>          at scbus1 target 0 lun 0 (pass1,ada1)
<ST4000VN003-1T5168 SC46>          at scbus2 target 0 lun 0 (pass2,ada2)
<ST8000VN004-2M2101 SC60>          at scbus3 target 0 lun 0 (pass3,ada3)
<ST4000VN003-1T5168 SC46>          at scbus4 target 0 lun 0 (pass4,ada4)
<ST4000VN003-1T5168 SC46>          at scbus5 target 0 lun 0 (pass5,ada5)
<ST4000DM000-1F2168 CC54>          at scbus6 target 0 lun 0 (pass6,ada6)
<ST4000DM000-1F2168 CC54>          at scbus7 target 0 lun 0 (pass7,ada7)
<SanDisk Dual Drive 1.00>          at scbus9 target 0 lun 0 (pass8,da0)



Code:
# gpart show
=>        34  7814037101  ada0  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>        34  7814037101  ada1  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>        34  7814037101  ada2  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>         40  15628053088  ada3  GPT  (7.3T)
           40           88        - free -  (44K)
          128      4194304     1  freebsd-swap  (2.0G)
      4194432  15623858696     2  freebsd-zfs  (7.3T)

=>        34  7814037101  ada4  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>        34  7814037101  ada5  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>        34  7814037101  ada6  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>        34  7814037101  ada7  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>       40  120176560  da0  GPT  (57G) [CORRUPT]
         40     532480    1  efi  (260M)
     532520  119635968    2  freebsd-zfs  (57G)
  120168488       8112       - free -  (4.0M)
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
Today I tried "zpool replace {pool} {old_drive} {new_drive}" but it didn't work:

Code:
# zpool replace nasferatuvolume /dev/gptid/36387c1c-9605-11e5-b66a-382c4abd4614 gptid/f60625cd-8703-11ea-9f29-001b21275bb9
invalid vdev specification
use '-f' to override the following errors:
/dev/gptid/f60625cd-8703-11ea-9f29-001b21275bb9 is part of active pool 'nasferatuvolume'


Code:
# zpool replace nasferatuvolume /dev/gptid/36387c1c-9605-11e5-b66a-382c4abd4614 ada3p2
invalid vdev specification
use '-f' to override the following errors:
/dev/ada3p2 is part of active pool 'nasferatuvolume'


Any ideas how to fix the issue I'm facing? :(
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
After swapping the new drive back out for the old one (I followed the same steps as in the first message: https://www.ixsystems.com/documentation/freenas/11.3-U1/storage.html#replacing-a-failed-disk ) and after resilvering, I guess I'm back to square one yet again. :(
The old drive is ONLINE, whereas the partition for the new drive is OFFLINE.

Code:
# zpool status -v nasferatuvolume
  pool: nasferatuvolume
 state: DEGRADED
  scan: resilvered 37.7M in 0 days 02:22:52 with 0 errors on Fri May  1 20:43:36 2020
config:

    NAME                                              STATE     READ WRITE CKSUM
    nasferatuvolume                                   DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        gptid/271e756e-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        replacing-3                                   DEGRADED     0     0     0
          gptid/36387c1c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
          9511904525519547236                         OFFLINE      0     0     0  was /dev/gptid/f60625cd-8703-11ea-9f29-001b21275bb9
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/40a65355-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0

errors: No known data errors
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
An attempt to detach the new (now physically disconnected) drive yields the same "Can detach disks from mirrors and spares only" error as before:

Code:
Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 247, in __zfs_vdev_operation
    op(target, *args)
  File "libzfs.pyx", line 369, in libzfs.ZFS.__exit__
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 247, in __zfs_vdev_operation
    op(target, *args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 256, in <lambda>
    self.__zfs_vdev_operation(name, label, lambda target: target.detach())
  File "libzfs.pyx", line 1764, in libzfs.ZFSVdev.detach
libzfs.ZFSException: Can detach disks from mirrors and spares only
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
Finally I made a little progress:
I've detached the new (OFFLINE) drive via the CLI:
Code:
zpool detach  nasferatuvolume 9511904525519547236


And the pool is finally in a normal state :)

Code:
# zpool status -v nasferatuvolume
  pool: nasferatuvolume
 state: ONLINE
  scan: resilvered 37.7M in 0 days 02:22:52 with 0 errors on Fri May  1 20:43:36 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    nasferatuvolume                                 ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/271e756e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/36387c1c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/40a65355-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0

errors: No known data errors
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
I've decided to restart the whole task, this time replacing ada5 first, as it has started to report more bad sectors than the ada3 from the previous messages.

I took it OFFLINE

Code:
zpool offline nasferatuvolume gptid/40a65355-9605-11e5-b66a-382c4abd4614

zpool status -v nasferatuvolume
  pool: nasferatuvolume
state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 824K in 0 days 13:36:43 with 3 errors on Sun May  3 05:23:18 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    nasferatuvolume                                 DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/271e756e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/36387c1c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        5961868542700033192                         OFFLINE      0     0     0  was /dev/gptid/40a65355-9605-11e5-b66a-382c4abd4614
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0



Then I physically replaced it with a new one, and used zpool replace
Code:
zpool replace nasferatuvolume 5961868542700033192 ada5

zpool status -v nasferatuvolume
  pool: nasferatuvolume
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun May  3 11:34:41 2020
    3.48G scanned at 510M/s, 751M issued at 107M/s, 25.3T total
    0 resilvered, 0.00% done, 2 days 20:42:58 to go
config:

    NAME                                            STATE     READ WRITE CKSUM
    nasferatuvolume                                 DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/271e756e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/36387c1c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        replacing-5                                 OFFLINE      0     0     0
          5961868542700033192                       OFFLINE      0     0     0  was /dev/gptid/40a65355-9605-11e5-b66a-382c4abd4614
          ada5                                      ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0



Once resilvering finished, the pool was still in a DEGRADED state:
Code:
zpool status -xv
  pool: nasferatuvolume
state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 3.09T in 0 days 14:20:08 with 3 errors on Mon May  4 01:54:49 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    nasferatuvolume                                 DEGRADED     0     0     3
      raidz2-0                                      DEGRADED     0     0     6
        gptid/271e756e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/36387c1c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        replacing-5                                 OFFLINE      0     0     0
          5961868542700033192                       OFFLINE      0     0     0  was /dev/gptid/40a65355-9605-11e5-b66a-382c4abd4614
          ada5                                      ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0


I then detached the old drive from the pool and resilvering kicked in once again:
Code:
zpool detach nasferatuvolume 5961868542700033192

zpool status -xv
  pool: nasferatuvolume
state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 3.09T in 0 days 14:20:08 with 3 errors on Mon May  4 01:54:49 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    nasferatuvolume                                 ONLINE       0     0     3
      raidz2-0                                      ONLINE       0     0     6
        gptid/271e756e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/36387c1c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        ada5                                        ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0


I don't understand why the new drive is listed in the pool by its device name (ada5) rather than by a gptid? :confused:
gpart says there's no such geom:
Code:
gpart show ada5
gpart: No such geom: ada5.


The new drive doesn't show up in the glabel listing either:
Code:
glabel status
                                      Name  Status  Components
gptid/271e756e-9605-11e5-b66a-382c4abd4614     N/A  ada0p2
gptid/2c32e27b-9605-11e5-b66a-382c4abd4614     N/A  ada1p2
gptid/31362dba-9605-11e5-b66a-382c4abd4614     N/A  ada2p2
gptid/36387c1c-9605-11e5-b66a-382c4abd4614     N/A  ada3p2
gptid/3b7b207e-9605-11e5-b66a-382c4abd4614     N/A  ada4p2
gptid/45e1321a-9605-11e5-b66a-382c4abd4614     N/A  ada6p2
gptid/4b337f7c-9605-11e5-b66a-382c4abd4614     N/A  ada7p2
gptid/2621435c-8701-11ea-9bf7-001b21275bb9     N/A  da0p1
gptid/270982c1-9605-11e5-b66a-382c4abd4614     N/A  ada0p1


Code:
gpart show
=>        34  7814037101  ada0  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>        34  7814037101  ada1  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>        34  7814037101  ada2  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>        34  7814037101  ada3  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>        34  7814037101  ada4  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>        34  7814037101  ada6  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>        34  7814037101  ada7  GPT  (3.6T)
          34          94        - free -  (47K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
I had a look at /var/log/messages and the only new entries that showed up after I removed the old drive from the pool are these:
Code:
May  4 08:47:50 nasferatu ZFS: vdev state changed, pool_guid=3287432541985472907 vdev_guid=11240258443983704252
May  4 08:47:50 nasferatu ZFS: vdev state changed, pool_guid=3287432541985472907 vdev_guid=6835768105345244814
May  4 08:47:50 nasferatu ZFS: vdev state changed, pool_guid=3287432541985472907 vdev_guid=11015975618392894410
May  4 08:47:50 nasferatu ZFS: vdev state changed, pool_guid=3287432541985472907 vdev_guid=854069389795236902
May  4 08:47:50 nasferatu ZFS: vdev state changed, pool_guid=3287432541985472907 vdev_guid=8970694967798805425
May  4 08:47:50 nasferatu ZFS: vdev state changed, pool_guid=3287432541985472907 vdev_guid=12067865134856464115
May  4 08:47:50 nasferatu ZFS: vdev state changed, pool_guid=3287432541985472907 vdev_guid=5680956802126248264
May  4 08:47:50 nasferatu ZFS: vdev state changed, pool_guid=3287432541985472907 vdev_guid=17770588696763289835



Interestingly enough, /dev/ada5 appears among the vdev labels for my pool.
Does this mean I have to somehow assign a gptid to the drive?
Code:
zdb -l /dev/ada5
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'nasferatuvolume'
    state: 0
    txg: 8028346
    pool_guid: 3287432541985472907
    hostid: 1963954595
    hostname: 'nasferatu.local'
    top_guid: 10942199742888776326
    guid: 12067865134856464115
    vdev_children: 1
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 10942199742888776326
        nparity: 2
        metaslab_array: 35
        metaslab_shift: 38
        ashift: 12
        asize: 31989077901312
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 11240258443983704252
            path: '/dev/gptid/271e756e-9605-11e5-b66a-382c4abd4614'
            whole_disk: 1
            DTL: 604
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 6835768105345244814
            path: '/dev/gptid/2c32e27b-9605-11e5-b66a-382c4abd4614'
            whole_disk: 1
            DTL: 595
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 11015975618392894410
            path: '/dev/gptid/31362dba-9605-11e5-b66a-382c4abd4614'
            whole_disk: 1
            DTL: 594
            create_txg: 4
        children[3]:
            type: 'disk'
            id: 3
            guid: 854069389795236902
            path: '/dev/gptid/36387c1c-9605-11e5-b66a-382c4abd4614'
            whole_disk: 1
            DTL: 593
            create_txg: 4
        children[4]:
            type: 'disk'
            id: 4
            guid: 8970694967798805425
            path: '/dev/gptid/3b7b207e-9605-11e5-b66a-382c4abd4614'
            whole_disk: 1
            DTL: 592
            create_txg: 4
        children[5]:
            type: 'disk'
            id: 5
            guid: 12067865134856464115
            path: '/dev/ada5'
            whole_disk: 1
            DTL: 4119
            create_txg: 4
            resilver_txg: 8014710
        children[6]:
            type: 'disk'
            id: 6
            guid: 5680956802126248264
            path: '/dev/gptid/45e1321a-9605-11e5-b66a-382c4abd4614'
            whole_disk: 1
            DTL: 590
            create_txg: 4
        children[7]:
            type: 'disk'
            id: 7
            guid: 17770588696763289835
            path: '/dev/gptid/4b337f7c-9605-11e5-b66a-382c4abd4614'
            whole_disk: 1
            DTL: 589
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
I've compared the gptids
Code:
ls -ltr /dev/gptid
total 0
crw-r-----  1 root  operator  0x9d May  3 11:28 4b337f7c-9605-11e5-b66a-382c4abd4614
crw-r-----  1 root  operator  0x9b May  3 11:28 45e1321a-9605-11e5-b66a-382c4abd4614
crw-r-----  1 root  operator  0x93 May  3 11:28 3b7b207e-9605-11e5-b66a-382c4abd4614
crw-r-----  1 root  operator  0x91 May  3 11:28 36387c1c-9605-11e5-b66a-382c4abd4614
crw-r-----  1 root  operator  0x8f May  3 11:28 31362dba-9605-11e5-b66a-382c4abd4614
crw-r-----  1 root  operator  0x8d May  3 11:28 2c32e27b-9605-11e5-b66a-382c4abd4614
crw-r-----  1 root  operator  0x80 May  3 11:28 271e756e-9605-11e5-b66a-382c4abd4614
crw-r-----  1 root  operator  0x9e May  3 11:28 2621435c-8701-11ea-9bf7-001b21275bb9
crw-r-----  1 root  operator  0xa9 May  4 08:47 270982c1-9605-11e5-b66a-382c4abd4614


with the gptids in the zpool status output:
Code:
zpool status -v nasferatuvolume
  pool: nasferatuvolume
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon May  4 08:47:56 2020
    5.12T scanned at 1.45G/s, 2.87T issued at 832M/s, 25.3T total
    358G resilvered, 11.34% done, 0 days 07:51:04 to go
config:

    NAME                                            STATE     READ WRITE CKSUM
    nasferatuvolume                                 ONLINE       0     0     3
      raidz2-0                                      ONLINE       0     0     6
        gptid/271e756e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/36387c1c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        ada5                                        ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0



and it looks like my new drive has a gptid 270982c1-9605-11e5-b66a-382c4abd4614
[Screenshot: 1588582546483.png]



So now my question is: how do I make zpool use that gptid instead of the device name,
so that it shows up among all the other glabels?
Code:
glabel status
                                      Name  Status  Components
gptid/271e756e-9605-11e5-b66a-382c4abd4614     N/A  ada0p2
gptid/2c32e27b-9605-11e5-b66a-382c4abd4614     N/A  ada1p2
gptid/31362dba-9605-11e5-b66a-382c4abd4614     N/A  ada2p2
gptid/36387c1c-9605-11e5-b66a-382c4abd4614     N/A  ada3p2
gptid/3b7b207e-9605-11e5-b66a-382c4abd4614     N/A  ada4p2
gptid/45e1321a-9605-11e5-b66a-382c4abd4614     N/A  ada6p2
gptid/4b337f7c-9605-11e5-b66a-382c4abd4614     N/A  ada7p2
gptid/2621435c-8701-11ea-9bf7-001b21275bb9     N/A  da0p1
gptid/270982c1-9605-11e5-b66a-382c4abd4614     N/A  ada0p1
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
So, another resilvering has just finished:
Code:
# zpool status -v nasferatuvolume
  pool: nasferatuvolume
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 3.09T in 0 days 13:47:18 with 3 errors on Mon May  4 22:35:14 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    nasferatuvolume                                 ONLINE       0     0     6
      raidz2-0                                      ONLINE       0     0    12
        gptid/271e756e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/36387c1c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        ada5                                        ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
...


The newly replaced drive still doesn't have a geom:
Code:
gpart show ada5
gpart: No such geom: ada5.


but it somehow shows up here:
Code:
sysctl kern.geom.disk
kern.geom.disk.da0.led:
kern.geom.disk.ada7.led:
kern.geom.disk.ada6.led:
kern.geom.disk.ada5.led:
kern.geom.disk.ada4.led:
kern.geom.disk.ada3.led:
kern.geom.disk.ada2.led:
kern.geom.disk.ada1.led:
kern.geom.disk.ada0.led:


and here
Code:
geom disk list | grep "Geom name"
Geom name: ada0
Geom name: ada1
Geom name: ada2
Geom name: ada3
Geom name: ada4
Geom name: ada5
Geom name: ada6
Geom name: ada7
Geom name: da0


and here
Code:
geom disk list ada5
Geom name: ada5
Providers:
1. Name: ada5
   Mediasize: 8001563222016 (7.3T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   descr: ST8000VN004-2M2101
   lunid: 5000c500c3ef47be
   ident: WKD0DG0Q
   rotationrate: 7200
   fwsectors: 63
   fwheads: 16


Can someone please help me with this? :(
It's been more than a week since I decided to follow the official manual for the simple task of replacing a failing drive.


PS: I came across this post: https://www.penguinpunk.net/blog/tag/log/
It looks kinda similar, but I'm not sure about the commands he recommends; my guess at what they'd look like for my layout is below them:
Code:
gpart create -s gpt ada5
gpart add -a 4k -b 128 -t freebsd-zfs -s 10G ada5
gpart add -a 4k -t freebsd-zfs ada5
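
Comparing with my own gpart show output above, the layout on my other members is a 2 GiB swap partition plus the rest as freebsd-zfs, not a 10G freebsd-zfs partition like in his second command. So for my setup I guess the equivalent would be something like this (untested, and I'd have to take ada5 offline / out of the pool first, since it's currently an active raw-device member):

Code:
gpart create -s gpt ada5                            # new GPT on the (offlined) disk
gpart add -a 4k -b 128 -t freebsd-swap -s 2g ada5   # 2 GiB swap, matching the other members
gpart add -a 4k -t freebsd-zfs ada5                 # rest of the disk as the ZFS partition
gpart list ada5 | grep rawuuid                      # rawuuid of ada5p2 gives the /dev/gptid/<uuid> name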
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
Following @sretalla's advice in this post https://www.ixsystems.com/community/threads/replacing-failed-drive.84525/#post-584663
I removed the files listed in the resilvering & scrub error list.
After that I ran a scrub, and this time no errors were reported.

Code:
zpool status -v nasferatuvolume
  pool: nasferatuvolume
 state: ONLINE
  scan: scrub repaired 0 in 0 days 13:25:32 with 0 errors on Wed May  6 23:44:55 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    nasferatuvolume                                 ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/271e756e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/36387c1c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        ada5                                        ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0

errors: No known data errors


Dunno why, but after the last scrub the GUI shows device names instead of gptids:
[Screenshot: 1588950120852.png]



Code:
glabel status
                                      Name  Status  Components
gptid/271e756e-9605-11e5-b66a-382c4abd4614     N/A  ada0p2
gptid/2c32e27b-9605-11e5-b66a-382c4abd4614     N/A  ada1p2
gptid/31362dba-9605-11e5-b66a-382c4abd4614     N/A  ada2p2
gptid/36387c1c-9605-11e5-b66a-382c4abd4614     N/A  ada3p2
gptid/3b7b207e-9605-11e5-b66a-382c4abd4614     N/A  ada4p2
gptid/45e1321a-9605-11e5-b66a-382c4abd4614     N/A  ada6p2
gptid/4b337f7c-9605-11e5-b66a-382c4abd4614     N/A  ada7p2
gptid/2621435c-8701-11ea-9bf7-001b21275bb9     N/A  da0p1
gptid/270982c1-9605-11e5-b66a-382c4abd4614     N/A  ada0p1


Code:
geom disk list ada5
Geom name: ada5
Providers:
1. Name: ada5
   Mediasize: 8001563222016 (7.3T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   descr: ST8000VN004-2M2101
   lunid: 5000c500c3ef47be
   ident: WKD0DG0Q
   rotationrate: 7200
   fwsectors: 63
   fwheads: 16
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
That ada5 would worry me. gptid is used so the pool will be usable regardless of how the system labels physical drive locations.

I don't know why the FreeNAS UI / middleware wouldn't have partitioned the drive and resilvered the partition in using its gptid. This sounds like a bug.

errors: Permanent errors have been detected in the following files:

I think ZFS resilver may have issues when it can't rebuild files. Those three files would need to be removed (and restored from backup afterwards), and/or the ZFS errors cleared, before trying another resilver.
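
Roughly, since the damaged copies are referenced by a snapshot, that would mean destroying that snapshot (or also deleting the files from the live dataset, if they exist there), then clearing and scrubbing. A sketch only:

Code:
zfs destroy nasferatuvolume/photo@auto-20200420.0500-1w   # drops the snapshot holding the damaged file copies
zpool clear nasferatuvolume                               # reset the pool's error counters
zpool scrub nasferatuvolume                               # verify the pool comes back clean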

That's from reading, not from experience. Maybe someone with hands-on experience will stop by.
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
Thanks @Yorick

Just before dealing with the resilvering errors I upgraded to FreeNAS-11.3-U2.1.

In my noobish understanding of the situation, I'm thinking that maybe I could take ada5 offline, wipe it, and "replace it with itself", hoping that this time the disk replacement will finish properly, since resilvering shouldn't find any errors.
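
A minimal sketch of that plan, assuming the GUI handles the actual REPLACE step so the middleware partitions the disk and puts it back in by gptid:

Code:
# take the raw-device member offline
zpool offline nasferatuvolume ada5

# wipe the stale ZFS labels from the bare disk (Storage -> Disks -> Wipe in the GUI,
# or roughly this from the CLI; quick-and-dirty, only clears the start of the disk)
dd if=/dev/zero of=/dev/ada5 bs=1m count=32

# then use the pool's Status page in the GUI to Replace the OFFLINE ada5 with itself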
 
