Error Replacing Failed Boot Mirror Drive with Partitions

EarthBoundX5

Cadet
Joined
Dec 8, 2018
Messages
8
I've seen a lot of posts similar to mine while trying to scope out a resolution, but nothing so far has quite matched my situation or ended with a proper resolution, so please forgive me for adding another similar, but different, inquiry to the pile.



Issue: I'm unable to attach a replacement USB drive of the exact same size, make, and model to my mirrored boot pool, after having detached the failed old one. I get the following error:

Error: [EFAULT] concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 220, in extend
i['target'].attach(newvdev)
File "libzfs.pyx", line 402, in libzfs.ZFS.__exit__
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 220, in extend
i['target'].attach(newvdev)
File "libzfs.pyx", line 2117, in libzfs.ZFSVdev.attach
libzfs.ZFSException: can only attach to mirrors and top-level disks

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 111, in main_worker
res = MIDDLEWARE._run(*call_args)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 45, in _run
return self._call(name, serviceobj, methodobj, args, job=job)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
return methodobj(*params)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
return methodobj(*params)
File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 985, in nf
return f(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 223, in extend
raise CallError(str(e), e.code)
middlewared.service_exception.CallError: [EZFS_BADTARGET] can only attach to mirrors and top-level disks
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 355, in run
await self.future
File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 386, in __run_body
rv = await self.middleware._call_worker(self.method_name, *self.args, job={'id': self.id})
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1250, in _call_worker
return await self.run_in_proc(main_worker, name, args, job)
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1169, in run_in_proc
return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1152, in run_in_executor
return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [EZFS_BADTARGET] can only attach to mirrors and top-level disks



Background: My primary TrueNAS CORE 13.0-U5.1 system recently reported a FAULTED USB drive in my mirrored boot pool. I took a backup of my TrueNAS config and ordered a replacement. I noted the serial number of the functional boot USB drive, shut the system down, pulled one of the USB drives, and booted back up, confirming by serial that the drive still plugged in was indeed the working one. I inserted the replacement drive of the same size, Wiped it, and attempted to select "Replace" from the Boot Pool Status page. It gave an error that the drive was too small; it was indeed displaying as ~0.08 GB smaller. So I went to the store and bought 2 more drives of the same size, but a different brand/model than the one that was technically smaller. I also bought an additional drive of larger capacity.
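
(For anyone wanting to compare exact capacities, this is roughly how the sizes can be checked from the CORE shell. Just a sketch, with da12 as my surviving drive and da13 as the replacement; adjust the device names to whatever camcontrol devlist reports on your box.)

diskinfo -v /dev/da12 | grep "mediasize in bytes"   # exact byte count of the working boot drive
diskinfo -v /dev/da13 | grep "mediasize in bytes"   # exact byte count of the replacement
gpart show da12                                     # partition layout the pool member actually sits on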

I attempted this same process with both of the same-size drives, discovering that they, too, were technically a smaller total size. I've seen this inconsistency in exact sizing among flash memory before, so it wasn't a surprise... however, giving up and using the larger-capacity drive still failed to Replace the FAULTED drive in the mirrored boot pool, giving an error about something needing to be detached first (I neglected to note the exact error, as it seemed innocuous, given that there was a Detach option next to Replace in the context menu). I instead Detached the failed USB drive (no longer plugged into the system) from the mirrored boot pool, and then attempted to Attach the new larger-capacity USB drive... at which point I got the error above.

I then scoured my home office for my stash of spare USB drives of the exact same model I used when I first built the (then FreeNAS) system, which I'd held onto in case I ever ran into a failed boot drive like this... and found one. Unfortunately, it gave the exact same error.

I've since tried various forms of zpool attach commands, formatting, dd'ing, etc., and none of it has worked. The closest thing I found to a resolution that seemed related to my situation was a reddit post... but its suggestion to use ashift=9 just resulted in the command reporting "too many arguments".
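
(For reference, the general shape of those attach attempts looked something like the lines below. This is only a sketch: da12p2 is the surviving pool partition, da13p2 is the partition I created on the replacement, and the second line shows where the reddit post's ashift override would go as a -o property on newer OpenZFS.)

zpool attach freenas-boot da12p2 /dev/da13p2
zpool attach -o ashift=9 freenas-boot da12p2 /dev/da13p2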

From what I can understand, this looks like some legacy design that was changed with no backwards compatibility in place?

My devices are /dev/da12 (original) and /dev/da13 (replacement).

My zpool status is...
pool: freenas-boot
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:12:13 with 0 errors on Sun Jul 2 03:57:14 2023
config:

NAME            STATE     READ WRITE CKSUM
freenas-boot    ONLINE       0     0     0
  da12p2        ONLINE       0     0     0

errors: No known data errors

My zpool history for creation time is....
2018-12-16.15:46:24 zpool create -f -o cachefile=/tmp/zpool.cache -o version=28 -O mountpoint=none -O atime=off -O canmount=off freenas-boot mirror da12p2 da14p2
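
(Since the pool was created on the p2 partitions rather than whole disks, my understanding is that the replacement stick needs a matching GPT layout before anything can attach. Below is a rough sketch of what that would look like on FreeBSD. The partition types and sizes are assumptions on my part, so check gpart show da12 on the real system and mirror whatever it reports; gpart destroy is obviously destructive to the replacement.)

gpart show da12                                      # reference layout from the surviving drive
gpart destroy -F da13                                # wipe any existing table on the replacement
gpart create -s gpt da13
gpart add -t freebsd-boot -s 512k -i 1 da13          # assumed p1 type/size; copy whatever da12 shows
gpart add -t freebsd-zfs -i 2 da13                   # rest of the stick becomes the pool partition
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da13   # assumed legacy BIOS boot code
zpool attach freenas-boot da12p2 da13p2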



Is there anyone who can advise me on the next steps? I feel like I'm standing at a crumbling wall at this point, and that each step will bring the wall down on me!

I'd prefer a solution that doesn't involve a reinstall. Would the next logical step be to restore my TrueNAS config backup from a few days ago and try a "Replace" again on the FAULTED drive with the exact same model replacement? Is the boot pool even stored in the config? Would that incur any data loss?

Help me, TrueNAS Forums, you're my only hope.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Rather than flailing around with replacing and detaching, since this is the boot pool, I suggest you:
1. Make a backup of your config
2. Reinstall TN Core 13.0-U5.1
3. Restore the backup. After the reboot, the data pool and everything else should be present.

It should take 10-20 minutes at most.

Simple, easy, despite you not wanting a reinstall option.
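
If you want a shell-level copy of the config in addition to the GUI download (System -> General -> Save Config), the config on CORE is just an sqlite database plus the encryption seed. A rough sketch, assuming a stock install and with "tank" standing in for one of your data pools:

mkdir -p /mnt/tank/config-backup
cp /data/freenas-v1.db /mnt/tank/config-backup/   # the actual config database
cp /data/pwenc_secret /mnt/tank/config-backup/    # seed used to decrypt sensitive fields in the config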
 

EarthBoundX5

Cadet
Joined
Dec 8, 2018
Messages
8
1. Make a backup of your config
2. Reinstall TN Core 13.0-U5.1
3. Restore the backup. After the reboot, the data pool and everything else should be present.
I've never gone through the restore process, and I wish this had happened on my backup array, as I don't like experimenting on my primary. Does the backed-up config contain the boot pool config as well? Meaning, if I restore my backup from before detaching, would the system be in a state where I could attempt the Replace again with the exact same model drive?

I'm very wary about doing a reinstall on my main array, especially given how many versions I've upgraded through. I know it "should" be fine, but at the end of the day, it only takes one unintended bug to wreck it all.

If that's the only route that's going to be suggested, I could try to replicate the setup in a VM. I have 2x VM instances of TrueNAS that I've maintained in parity with my physical systems for whatever testing comes up... but I didn't set them up with a boot mirror at the start, since they're virtual, and I never considered needing to test this scenario. I've never had a problem replacing disks in a pool, so I didn't expect the boot pool to be any different.

flailing around
I wouldn't exactly call it that...
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I suggest you rebuild onto a new boot drive, first removing the original boot drive.
Then, if there is an issue, you can simply remove the new boot drive and put the old one back in again.
 

EarthBoundX5

Cadet
Joined
Dec 8, 2018
Messages
8
I suggest you rebuild onto a new boot drive, first removing the original boot drive.
Then, if there is an issue, you can simply remove the new boot drive and put the old one back in again.

So I spun up an instance of FreeNAS 11 with mirrored boot devices and upgraded it to TrueNAS 13. I removed one of the devices and added another. Simply selecting Replace gave an error similar to the one I'd gotten when Replacing with the larger USB device:

[EZFS_BADTARGET] already in replacing/spare config; wait for completion or use 'zpool detach'

Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 283, in replace
target.replace(newvdev)
File "libzfs.pyx", line 402, in libzfs.ZFS.__exit__
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 283, in replace
target.replace(newvdev)
File "libzfs.pyx", line 2147, in libzfs.ZFSVdev.replace
libzfs.ZFSException: already in replacing/spare config; wait for completion or use 'zpool detach'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 111, in main_worker
res = MIDDLEWARE._run(*call_args)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 45, in _run
return self._call(name, serviceobj, methodobj, args, job=job)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
return methodobj(*params)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
return methodobj(*params)
File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 985, in nf
return f(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 285, in replace
raise CallError(str(e), e.code)
middlewared.service_exception.CallError: [EZFS_BADTARGET] already in replacing/spare config; wait for completion or use 'zpool detach'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 139, in call_method
result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self)
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1236, in _call
return await methodobj(*prepared_call.args)
File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 981, in nf
return await f(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/boot.py", line 143, in replace
await self.middleware.call('zfs.pool.replace', BOOT_POOL_NAME, label, zfs_dev_part['name'])
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1279, in call
return await self._call(
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1244, in _call
return await self._call_worker(name, *prepared_call.args)
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1250, in _call_worker
return await self.run_in_proc(main_worker, name, args, job)
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1169, in run_in_proc
return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1152, in run_in_executor
return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [EZFS_BADTARGET] already in replacing/spare config; wait for completion or use 'zpool detach'

So at least that's why I'd previously jumped, not flailed, to Detaching vs Replacing.

What's interesting here is that the pool remains with a removed disk, but I'm unable to re-add the device after this error. Furthermore, the replacement disk oddly now carries the boot pool's label.
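
(Side note in case it helps anyone else: the stale label can at least be inspected and wiped from the shell. A sketch only, using da13p2 as the replacement's pool partition like on my physical box; the VM device names will differ, and labelclear destroys the ZFS metadata on whatever partition you point it at, so aim it only at the replacement.)

zdb -l /dev/da13p2                 # dump any ZFS labels still present on the partition
zpool labelclear -f /dev/da13p2    # clear the stale freenas-boot label so the partition can be reused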

In any event, this seems like a clear-cut bug: I did nothing but follow the intended, documented TrueNAS behavior, and it resulted in an unintended outcome.

EDIT: I've opted to submit a bug report. https://ixsystems.atlassian.net/browse/NAS-122892
 
Last edited: