Error Replacing Failing Drive

csshepard

Cadet
Joined
Jul 25, 2021
Messages
5
I'm running TrueNAS-12.0-U4, and I'm having trouble replacing a failing drive on my system. The drive hadn't failed yet, but it was starting to show current_pending_sector warnings. I offlined the drive from the Pool Status page, shut down the server, and replaced the drive with a new, higher-capacity drive. I ran some burn-in tests following the recommendations at https://www.truenas.com/community/resources/hard-drive-burn-in-testing.92/. After the tests finished, I tried using the replace option on the Pool Status page and got the following error message:
Code:
Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 277, in replace
target.replace(newvdev)
File "libzfs.pyx", line 391, in libzfs.ZFS.__exit__
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 277, in replace
target.replace(newvdev)
File "libzfs.pyx", line 2060, in libzfs.ZFSVdev.replace
libzfs.ZFSException: already in replacing/spare config; wait for completion or use 'zpool detach'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 94, in main_worker
res = MIDDLEWARE._run(*call_args)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 45, in _run
return self._call(name, serviceobj, methodobj, args, job=job)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
return methodobj(*params)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
return methodobj(*params)
File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 977, in nf
return f(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 279, in replace
raise CallError(str(e), e.code)
middlewared.service_exception.CallError: [EZFS_BADTARGET] already in replacing/spare config; wait for completion or use 'zpool detach'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 367, in run
await self.future
File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 403, in __run_body
rv = await self.method(*([self] + args))
File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 973, in nf
return await f(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool_/replace_disk.py", line 122, in replace
raise e
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool_/replace_disk.py", line 102, in replace
await self.middleware.call(
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1241, in call
return await self._call(
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1206, in _call
return await self._call_worker(name, *prepared_call.args)
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1212, in _call_worker
return await self.run_in_proc(main_worker, name, args, job)
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1139, in run_in_proc
return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1113, in run_in_executor
return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [EZFS_BADTARGET] already in replacing/spare config; wait for completion or use 'zpool detach'
The output of `zpool status -v shepard_media` is:
Code:
  pool: shepard_media
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 1M in 02:05:37 with 0 errors on Sun Jul 18 02:05:37 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        shepard_media                                   DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/2781e554-9730-11e6-bed5-4ccc6a0b9724  ONLINE       0     0     0
            gptid/284bc412-9730-11e6-bed5-4ccc6a0b9724  ONLINE       0     0     0
            gptid/29176154-9730-11e6-bed5-4ccc6a0b9724  ONLINE       0     0     0
            gptid/29ec2715-9730-11e6-bed5-4ccc6a0b9724  ONLINE       0     0     0
            gptid/2ac32abc-9730-11e6-bed5-4ccc6a0b9724  ONLINE       0     0     0
            gptid/2b963e0f-9730-11e6-bed5-4ccc6a0b9724  OFFLINE      0     0     0

errors: No known data errors



What can I do to get this new drive added into the pool? Thanks in advance.
 

Tony-1971

Contributor
Joined
Oct 1, 2016
Messages
147
Hello,

Have you tried the action suggested in the zpool status output?
I'm not sure whether it's better to do it from the command line or the web interface.
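If you go the command line route, it would be something along these lines (untested sketch; the gptid is the offlined member from your status output, and da6 is just a placeholder for whatever device name the new disk gets):
Code:
# bring the offlined member back online (only works if the old disk is still attached)
zpool online shepard_media gptid/2b963e0f-9730-11e6-bed5-4ccc6a0b9724

# or replace the offlined member with the new disk
zpool replace shepard_media gptid/2b963e0f-9730-11e6-bed5-4ccc6a0b9724 da6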

Best Regards,
Antonio
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Using the GUI is generally safer and avoids dealing directly with gptids.
Can you put the old drive back? Do you have a spare port so you can connect both the old drive and the new drive at the same time? If so, put the old one back in and then replace it.
 

csshepard

Cadet
Joined
Jul 25, 2021
Messages
5
I have not tried the action suggested in the zpool status output. I do not feel comfortable manually configuring the ZFS pool without a little more guidance.
I do not have a spare port to put the old drive back in. If having a spare port is going to make this much easier, I have a PCIe slot I could drop an expansion card into. If I go this route, would I need to plug the old drive back into the original SATA port it was previously connected to?
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
ZFS identifies drives by gptid, so they can be shuffled around without worrying about ports and controllers.
It should be possible to replace a missing or offline drive, but since that didn't work the first time, and replacing with full redundancy is always safer, I suggest you install the expansion card, attach the old drive to it, bring the drive back online, and then replace it.
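If you want to check which device node a given gptid currently lives on, something like this should do it (using the offlined member from your status output as the example):
Code:
# list all gptid labels and the devices they sit on
glabel status | grep gptid

# or just the offlined member
glabel status | grep 2b963e0f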
 

csshepard

Cadet
Joined
Jul 25, 2021
Messages
5
Thanks for the information. I've got the expansion card coming on Wednesday; I'll update with results after installing it.
 

csshepard

Cadet
Joined
Jul 25, 2021
Messages
5
I've installed the expansion card and connected the old drive to it. I onlined the disk, and TrueNAS resilvered the pool. After that finished, I attempted to replace the disk using the GUI and got the same error message and Python stack trace.
 

csshepard

Cadet
Joined
Jul 25, 2021
Messages
5
I was finally able to resolve this problem. I was having the same problem referenced in this thread. `zpool get ashift` reports an ashift of 0 (the default, meaning the property was never explicitly set), but `zdb` reports an ashift of 9 for the vdev.
I manually partitioned the new drive and added `-o ashift=9` to my `zpool replace` command. After that, the pool began resilvering and completed successfully.
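For anyone who hits this later, the procedure looked roughly like the sketch below (da6 is a placeholder for the new disk's device name, the 2 GiB swap partition just mirrors the usual TrueNAS layout rather than being required, and the last gptid is the offlined member from the status output above):
Code:
# wipe and partition the new disk the way TrueNAS normally would (da6 is a placeholder)
gpart create -s gpt da6
gpart add -t freebsd-swap -a 4k -s 2g da6
gpart add -t freebsd-zfs -a 4k da6

# find the gptid that was assigned to the new ZFS partition (da6p2)
glabel status | grep da6p2

# replace the offlined member, forcing the pool's original ashift of 9
zpool replace -o ashift=9 shepard_media gptid/2b963e0f-9730-11e6-bed5-4ccc6a0b9724 gptid/<new-partition-gptid>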
 