Degraded Volume issue

strikermed

Dabbler
Joined
Mar 22, 2017
Messages
29
Hey guys, a week or two ago I had a drive with some errors. I went ahead and replaced it with a spare drive I had installed.

The spare went forward with resilvering and nearly completed, but I wanted to clear the old hard drive and do some SMART testing on it, so I pulled it. When I reinserted it, it took over again and the pool remained degraded.

Right now I have the old drive offlined and removed, but I got errors when I tried. With the spare installed, I still get a degraded state, and I'm unsure whether it has replaced the old disk... Under Pool Status I have a Spare section listed with my new drive (da9p2), and it says it's unavailable. There is another section labeled SPARE, and under that is /dev/gptid/60857222-46cc-11e9-b4e3-a0369f5050d4, which is listed as offline, in addition to da9p2, which is listed as online.

When I try to remove da9p2, I get a long series of errors.

Under "Spare", if I try to remove da9p2 (listed as unavailable), here is what I get:

Code:
Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 247, in __zfs_vdev_operation
    op(target, *args)
  File "libzfs.pyx", line 369, in libzfs.ZFS.__exit__
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 247, in __zfs_vdev_operation
    op(target, *args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 279, in <lambda>
    self.__zfs_vdev_operation(name, label, lambda target: target.remove())
  File "libzfs.pyx", line 1788, in libzfs.ZFSVdev.remove
libzfs.ZFSException: Pool busy; removal may already be in progress

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 97, in main_worker
    res = loop.run_until_complete(coro)
  File "/usr/local/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 53, in _run
    return await self._call(name, serviceobj, methodobj, params=args, job=job)
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 45, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 45, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.7/site-packages/middlewared/schema.py", line 965, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 279, in remove
    self.__zfs_vdev_operation(name, label, lambda target: target.remove())
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 249, in __zfs_vdev_operation
    raise CallError(str(e), e.code)
middlewared.service_exception.CallError: [EZFS_BUSY] Pool busy; removal may already be in progress
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 130, in call_method
    io_thread=False)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1084, in _call
    return await methodobj(*args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/schema.py", line 961, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/pool.py", line 1347, in remove
    await self.middleware.call('zfs.pool.remove', pool['name'], found[1]['guid'])
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1141, in call
    app=app, pipes=pipes, job_on_progress_cb=job_on_progress_cb, io_thread=True,
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1081, in _call
    return await self._call_worker(name, *args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1101, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1036, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1010, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [EZFS_BUSY] Pool busy; removal may already be in progress




How do I fix this issue? I'd like to get the pool back to a good status. I should mention this is RAIDz3.

Ideally, I'd like to be able to remove the spare drive, wipe it, and then resilver it to replace my bad drive.

I'd like to be able to reinsert my drive that had errors and wipe it clean of any configuration and data.

Help?
 

strikermed

Dabbler
Joined
Mar 22, 2017
Messages
29
This may help as well:
Code:
root@freenas[~]# zpool status
  pool: RAIDz2
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 24K in 0 days 00:00:04 with 0 errors on Mon May 25 23:47:00 2020
config:

        NAME                                              STATE     READ WRITE CKSUM
        RAIDz2                                            DEGRADED     0     0   0
          raidz3-0                                        DEGRADED     0     0   0
            gptid/52db8e07-46cc-11e9-b4e3-a0369f5050d4    ONLINE       0     0   0
            gptid/55575c7e-46cc-11e9-b4e3-a0369f5050d4    ONLINE       0     0   0
            gptid/57412135-46cc-11e9-b4e3-a0369f5050d4    ONLINE       0     0   0
            gptid/598674ca-46cc-11e9-b4e3-a0369f5050d4    ONLINE       0     0   0
            gptid/5c0b79c6-46cc-11e9-b4e3-a0369f5050d4    ONLINE       0     0   0
            gptid/5e2c1e20-46cc-11e9-b4e3-a0369f5050d4    ONLINE       0     0   0
            spare-6                                       DEGRADED     0     0   0
              14917948582218402958                        UNAVAIL      0     0   0  was /dev/gptid/60857222-46cc-11e9-b4e3-a0369f5050d4
              gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4  ONLINE       0     0   0
            gptid/62b604e0-46cc-11e9-b4e3-a0369f5050d4    ONLINE       0     0   0
            gptid/64be15cf-46cc-11e9-b4e3-a0369f5050d4    ONLINE       0     0   0
            gptid/6742e6d4-46cc-11e9-b4e3-a0369f5050d4    ONLINE       0     0   0
        spares
          16617298622180424174                            UNAVAIL   was /dev/gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:25 with 0 errors on Fri May 22 05:45:25 2020
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE

 
Joined
Oct 18, 2018
Messages
969
The spare went forward with resilvering and nearly completed, but I wanted to clear the old hard drive and do some SMART testing on it, so I pulled it. When I reinserted it, it took over again and the pool remained degraded.
This was likely a mistake. When you resilver a disk, allow the process to finish. Was there a reason you could not wait for the resilver to complete before doing that? Generally you're better off being patient and letting one process finish before you jump into something else. Resilvering is a very important data-integrity step; trying to cut it short or mess with it is likely not a great idea.
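
If you want to verify the state of a resilver, a minimal sketch, assuming the pool is named RAIDz2 as in your output (run from the FreeNAS shell):

Code:
# Full pool status; the "scan:" line reports resilver progress or completion.
zpool status RAIDz2

# Or just pull out the scan/resilver line for a quick check.
zpool status RAIDz2 | grep scan: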

When you post code, please surround it in code tags like the following.

[code]
Some code pasted here
[/code]

Would you mind editing your posts above and adding code tags for readability?

How do I fix this issue? I'd like to get the pool back to a good status. I should mention this is RAIDz3.

Ideally, I'd like to be able to remove the spare drive, wipe it, and then resilver it to replace my bad drive.
Why do you want to wipe the spare drive? Why not insert it into your system and let the resilver complete?

A quick note about spares. Many folks use them, but lots of folks don't. A spare is only useful if a drive fully bites the dust; in that case the spare will automatically kick in and start to resilver. However, many folks opt to replace disks before they outright fail. The process is something like this: have a spare drive on hand; run SMART tests and badblocks on it so you know it is good (a rough sketch of those commands is below); keep it in safe storage; have regular scrubs and SMART tests run on each drive; and when a drive shows early failure warnings, replace it with that extra drive.
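
For the testing step, a minimal sketch, assuming the spare shows up as /dev/da9 (substitute your actual device; badblocks in write mode destroys everything on the disk):

Code:
# Start a long SMART self-test; the drive runs it internally, so check back after the estimated time.
smartctl -t long /dev/da9

# Review the results and SMART attributes once the test has finished.
smartctl -a /dev/da9

# Destructive full-surface write test; only run on a disk with no data you care about.
badblocks -b 4096 -ws /dev/da9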

A couple of questions
Are you using encryption?
Do you have backups?
What drive options are available in the GUI for 14917948582218402958 UNAVAIL 0 0 0 was /dev/gptid/60857222-46cc-11e9-b4e3-a0369f5050d4?
What drive options are available in the GUI for 16617298622180424174 UNAVAIL was /dev/gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4?


Note: it is not recommended to start resilvers, etc., from the command line. If you're able to get the spare drive removed from the system, I recommend you try to use the GUI to do the actual drive replacement.
 

strikermed

Dabbler
Joined
Mar 22, 2017
Messages
29
A couple of questions
Are you using encryption?
Do you have backups?
What drive options are available in the GUI for 14917948582218402958 UNAVAIL 0 0 0 was /dev/gptid/60857222-46cc-11e9-b4e3-a0369f5050d4?
What drive options are available in the GUI for 16617298622180424174 UNAVAIL was /dev/gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4?


Note: it is not recommended to start resilvers, etc., from the command line. If you're able to get the spare drive removed from the system, I recommend you try to use the GUI to do the actual drive replacement.


Thanks for the response.

Answers:
Encryption - no
I have full backups in two places, except for my Plex library, which lives in only one other location.

I'll need to get back to you on the GUI items. I shut down FreeNAS to run a Clear operation in UNRAID on that newer spare drive to see if I could get it back to "new" status again. It's at 44% currently.

As for my other drive that was showing signs of failure, it's struggling in UNRAID, so I suspect that it was on the brink of failure.



As for resilvering... I've learned my lesson to be patient. Ugh, I don't know why I even attempted to do what I did... I thought maybe the interface would be more intuitive, but I guess not. I'll wait next time, and I wish I knew more of the shell commands to troubleshoot and issue when "Remove" doesn't function properly in the GUI. Offline and Online appear to work, at least the GUI says they did, although I got the error when trying to "Remove" the drive, or at least its entry in the pool.
 
Joined
Oct 18, 2018
Messages
969
As for my other drive that was showing signs of failure, it's struggling in UNRAID, so I suspect that it was on the brink of failure.
Generally, I suggest you not use UNRAID or other utilities to manage your drives. FreeNAS has all of the utilities you need to manage your disks. At best you perform an action in UNRAID or another OS that you could have done in FreeNAS; at worst you can irrevocably harm your data disks. I think folks are often tempted to turn to tools they are more familiar with, but in the case of FreeNAS, sticking with the tools provided is usually best.

As for resilvering... I've learned my lesson to be patient. Ugh, I don't know why I even attempted to do what I did... I thought maybe the interface would be more intuitive, but I guess not. I'll wait next time, and I wish I knew more of the shell commands to troubleshoot and issue when "Remove" doesn't function properly in the GUI. Offline and Online appear to work, at least the GUI says they did, although I got the error when trying to "Remove" the drive, or at least its entry in the pool.
Generally speaking, you want to use the GUI as much as possible. Occasionally the GUI will have an issue where it is unable to complete a specific task; in those cases I suggest you first post here and then consider filing a bug report with iXsystems. Often a restart of the system can fix the issue when it is related to a busy device, etc. I try to avoid unnecessary restarts, though.

In your case I suspect you will want to remove the spare drive (/dev/gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4) from your pool manually using the command line, and then in the GUI select the drive (/dev/gptid/60857222-46cc-11e9-b4e3-a0369f5050d4) that is part of your pool but offline/missing, choose Replace, and select your backup drive (/dev/gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4). If it gives you an error saying the backup drive cannot be used because it was used by another FreeNAS system, you can try erasing the disk using the FreeNAS GUI or the command line (a rough sketch of the command-line erase is below). I do not recommend that you use the command line to perform the replacement step itself. It is good you have backups! At worst you don't have to worry about losing data. :)
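
If you do end up erasing the old spare from the command line, a rough sketch, assuming it shows up as da9 (double-check the device name against the gptid with glabel status before running anything destructive):

Code:
# Map gptid labels to device names so you wipe the right disk.
glabel status
gpart show da9

# Destroy the partition table (destructive; the wrong device here means data loss).
gpart destroy -F da9

# Optionally zero the start of the disk to clear any leftover labels.
dd if=/dev/zero of=/dev/da9 bs=1m count=100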

FWIW, while I find the GUI lacking in certain departments, in general it makes MUCH more sense once you have more knowledge of how ZFS works. For example, what it means to offline, online, or replace/resilver a disk, etc.
 

strikermed

Dabbler
Joined
Mar 22, 2017
Messages
29
Generally, I suggest you not use UNRAID or other utilities to manage your drives. FreeNAS has all of the utilities you need to manage your disks. At best you perform an action in UNRAID or another OS that you could have done in FreeNAS; at worst you can irrevocably harm your data disks. I think folks are often tempted to turn to tools they are more familiar with, but in the case of FreeNAS, sticking with the tools provided is usually best.
I wish I knew more about the utilities and how to troubleshoot them when they don't function as I expect them to. For example, I couldn't remove the drives from the pool, and thus no wipe/clear option was available... As for the console, I wish I knew more commands and how to utilize it more... I'm an amateur in that department at best.

In your case I suspect you will want to remove the spare drive (/dev/gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4) from your pool manually using the command line, and then in the GUI select the drive (/dev/gptid/60857222-46cc-11e9-b4e3-a0369f5050d4) that is part of your pool but offline/missing, choose Replace, and select your backup drive (/dev/gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4). If it gives you an error saying the backup drive cannot be used because it was used by another FreeNAS system, you can try erasing the disk using the FreeNAS GUI or the command line. I do not recommend that you use the command line to perform the replacement step itself. It is good you have backups! At worst you don't have to worry about losing data. :)

I'm still learning how to use the command line. It remains a semi-mystery to me. Every time I think I have something figured out, I'm confused by the next step... Without some googling, I don't think I would be able to figure out how to remove the drive. I guess I at least have the keywords now to do my Google search.


FWIW, while I find the GUI lacking in certain departments in general it makes MUCH more sense if you have more knowledge of how zfs works. For example, what it means to offline, online, replace/resilver a disk, etc.
Yes, I think I have some basic understanding of this.... But there are tons of things outside of this that I'm still learning.

I've always expected a day like this to come, where those dual backups will come in handy in case I screw something up... This gives me experimental flexibility I guess...
 
Joined
Oct 18, 2018
Messages
969
As for the console, I wish I knew more commands and how to utilize that more... I'm an amateur in that department at best.
No worries, gotta start somewhere!

Without some googling, I don't think I would be able to figure out how to remove the drive. I guess I at least have the keywords now to do my Google search.
Google "remove spare drive from pool". You'll find that you can do something like zpool remove <pool> <spare-device> where you replace <pool> with the name of your pool, in this case RAIDz2 and spare-device as /dev/gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4.

In your case, I would not be surprised if the above command gave an error. The reason is that the system already tried to use the spare drive as part of the pool; you see that in the following bit of your zpool status.

Code:
spare-6 DEGRADED 0 0 0
  14917948582218402958 UNAVAIL 0 0 0 was /dev/gptid/60857222-46cc-11e9-b4e3-a0369f5050d4
  gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4 ONLINE 0 0 0


Notice that this is the same gptid as the one listed under the spares section:

Code:
spares
  16617298622180424174    UNAVAIL   was /dev/gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4

For this reason, you may need to instead try to "offline" that drive with zpool offline RAIDz2 /dev/gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4, and then, if zpool status still lists it but shows it as "offline" or "unavailable", try the remove from above (a sketch of the full sequence is below).
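
Putting that together, a rough sketch of the sequence, again assuming the pool name and gptid from your output:

Code:
# Offline the spare's active copy inside spare-6 first.
zpool offline RAIDz2 /dev/gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4

# Check how it is listed now (offline/unavailable).
zpool status RAIDz2

# Then retry removing the spare entry.
zpool remove RAIDz2 /dev/gptid/c7fb24b8-9156-11ea-a13c-a0369f5050d4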

If any of the above does not work, please copy the exact commands you tried and the exact error messages, as well as the output of zpool status.

I've always expected a day like this to come, where those dual backups will come in handy in case I screw something up... This gives me experimental flexibility I guess...
This is the most important thing. FreeNAS is not a replacement for a backup. Having backups is important to keep your data safe. :)

When all of that is complete, you can try to do the replacement via the GUI once more.
 

strikermed

Dabbler
Joined
Mar 22, 2017
Messages
29
Thanks for the guidance. I'll give these a try in a few days when my resilvering is complete, and my array isn't a drive down. I'll post back when I have results.
 
Joined
Oct 18, 2018
Messages
969
Ah, perhaps I misunderstood. If the system is still resilvering, you're in good shape. You won't need to remove/offline the disk once the resilver completes, provided it completes successfully. I thought the resilver had stalled and didn't restart. Does zpool status list the resilver as in progress?
 

strikermed

Dabbler
Joined
Mar 22, 2017
Messages
29
Ah, perhaps I misunderstood. If the system is still resilvering, you're in good shape. You won't need to remove/offline the disk once the resilver completes, provided it completes successfully. I thought the resilver had stalled and didn't restart. Does zpool status list the resilver as in progress?

The pool does say resilvering is in progress. Prior to that, in the GUI, the "spare" and the actual drive that needed replacing were both "offline", but I couldn't remove either of them through the GUI.

Now that I've cleared the drive whose resilver I interrupted, I'm able to use that same drive to resilver.

Once it's done, I'll give you another status update with what can and cannot be removed (referring to the old drive and the "Spare" entry that no longer exists).
 
Joined
Oct 18, 2018
Messages
969
Great! Once it's done you may be all set. Send us an update and let us know.
 