Terminate hung scrub and/or export without rebooting?

FrankWard

Explorer
Joined
Feb 13, 2023
Messages
71
I am looking for any and all solutions that can TERMINATE an existing scrub and/or export process in TrueNAS SCALE WITHOUT rebooting. The TrueNAS UI is hanging on two tasks for one disk/pool. If I try to initiate an export for another pool, it just hangs there and does nothing. Clearly something is hung in the OS, but TN has no clue how to fix it.

I've tried the following, but the tasks are still visible.
  • zpool clear POOL (hangs the shell)
  • zpool export -f POOL (hangs the shell)
  • zpool scrub -s POOL (hangs the shell)
  • remove the drive via UI (hangs at 20%, TN still thinks the tasks are running)

Surely, with the almighty powerful Linux, there's a way to terminate whatever processes are hanging TN.
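
For commands like the ones listed above that hang the shell, one option is to launch them with a deadline so the calling session gets control back. The following is a minimal Python sketch, assuming a Linux host with Python 3. `run_with_deadline` and its parameters are hypothetical names, not part of TrueNAS or OpenZFS. It only stops waiting. It does not terminate the stuck command, and a process blocked inside the kernel generally ignores signals until its I/O returns.

```python
import subprocess
import time


def run_with_deadline(argv, deadline_s, poll_s=0.1):
    """Start argv and wait at most deadline_s seconds for it to exit.

    Returns the exit code, or None if the command was still running when
    the deadline passed. The child is deliberately NOT waited on after the
    deadline, so the caller is never blocked by an unkillable process.
    """
    proc = subprocess.Popen(argv)
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        code = proc.poll()  # non-blocking check of the exit status
        if code is not None:
            return code
        time.sleep(poll_s)
    return proc.poll()  # None means "still running, gave up waiting"


# Example call (POOL is the placeholder from the post, not a real pool name):
# run_with_deadline(["zpool", "scrub", "-s", "POOL"], deadline_s=30)
```

The sketch polls instead of using `subprocess.run(..., timeout=...)` because `run` kills the child and then waits for it after a timeout, and that wait could itself block if the child cannot be killed.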

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I guess I'm a bit confused as to why you have so many operations going on at once; it's amazing anything is happening at all.

I'm not sure why you do not want to reboot; maybe you meant a hard reboot. At the CLI, run 'reboot' and it will perform a soft reboot (maybe, in your case). If the system hangs or will not reboot, then you may be forced to perform a hard reset.

What version of TrueNAS are you running? Saying SCALE does not help much, since there are several versions of SCALE.

Look into the 'htop' and 'top' commands. You can manually terminate running processes, but do not expect your system to keep working fine afterwards, since you may stop an important process.
 

FrankWard

Explorer
Joined
Feb 13, 2023
Messages
71
Hi Joe. All of the systems I am running are listed in detail in my signature. Does the sig not show up on your device?

I had a disk that produced ZFS errors, and I possibly did things in the wrong order with this drive. The first thing I tried was a scrub, but that sat there at 0%. After about 10 minutes, I tried to clear any issues using zpool clear, which hung in the shell. When I refreshed the UI, though, I guess it had cleared, because the drive was not suspended anymore. I then tried to export the drive, which also sat there useless in the UI. Since that didn't work, I tried to export a different pool, and that also just sat there with the export dialog useless in the UI. This is how I arrived here.

TrueNAS has issues either terminating processes that hang or properly reporting them in the UI, which is why there are so many hung processes. Either that, or TrueNAS doesn't error out properly when it encounters an issue and leaves the UI in limbo. I'm guessing this is because it hasn't matured fully, so let's hope future builds make the UI friendlier with regard to ZFS and other errors (such as not being able to edit the VM properties, which was recently fixed).

I have yet to find a way, short of a reboot, to identify the exact processes behind the UI's stuck progress reporting, and I do not want to hose the system by terminating random processes via top. That is why I am asking whether anyone has a better method of recovering the TrueNAS UI once it hoses itself.
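
To narrow down which processes are actually stuck, rather than picking from top at random, one approach is to list the processes in uninterruptible sleep (state D in /proc/<pid>/stat), which is where a command waiting on an unresponsive disk typically sits. The following is a sketch under that assumption. `parse_stat` and `uninterruptible_processes` are illustrative names, and a process seen in state D at one instant is not proven to be hung.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first "(" and the last ")".
    """
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_dir="/proc"):
    """Return a sorted list of (pid, comm) for processes in state D."""
    found = []
    for name in os.listdir(proc_dir):
        if not name.isdigit():
            continue  # only numeric entries are process directories
        try:
            with open(os.path.join(proc_dir, name, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the entry was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


# Example: print candidates instead of killing anything.
# for pid, comm in uninterruptible_processes():
#     print(pid, comm)
```

Seeing zpool or related entries in that list would be consistent with commands blocked on disk I/O. Signals sent to such processes usually have no effect until the I/O completes or fails.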

I have since removed the drive, rebooted, then removed the pool successfully, but the question still remains. If an export/scrub process hangs like this, are there specific actions that can fix it aside from a reboot?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hey @FrankWard

Thanks for the detail. It's likely that the ZFS processes were the ones hanging up because there was a drive that was in a "Schrodinger's Failure" state - alive enough to be present on the device bus, but absent enough to hang up when issued ZFS commands. Some upstream improvements have been made that should land in OpenZFS 2.3 and should help with this:


Assuming there's enough redundancy, pulling (or possibly setting offline in ZFS?) the offending drive should cause the tasks to complete (or abort) successfully, but there's always the challenge of how to know you're pulling the right drive.
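
On the question of knowing which drive to pull, one possible aid is to map kernel device names to the persistent names under /dev/disk/by-id, which on Linux usually embed the model and serial number printed on the drive label. The following is a sketch only. `disks_by_id` is a hypothetical helper, and how the pool itself names its member devices is an assumption that varies by system.

```python
import os


def disks_by_id(by_id_dir="/dev/disk/by-id"):
    """Map each resolved device path (e.g. /dev/sda) to its by-id names."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if not os.path.islink(link):
            continue  # by-id entries are expected to be symlinks
        target = os.path.realpath(link)  # follow the link to the device node
        mapping.setdefault(target, []).append(name)
    return mapping


# Example: print every device with the identifiers that point at it.
# for device, names in sorted(disks_by_id().items()):
#     print(device, names)
```

Matching the serial in the by-id name against the drive label before pulling or offlining anything reduces the chance of removing a healthy member.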
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
"All of the systems I am running are listed in detail in my signature. Does the sig not show up on your device?"
It shows up, but System 1 or System 2?

"I possibly did things in the wrong order with this drive."
You are not alone. Been there myself.

"The first thing I tried was scrub, but that sat there at 0%. After about 10 minutes"
Scrubs take a long time if you have a lot of data. Some scrubs take almost a week, and you have some high-capacity drives. How long does a SMART Long test take? Now almost double it. Okay, that is not really true, but during a scrub the drive is working for the computer rather than just running an internal test, so it takes much longer than a Long test.

"I have since removed the drive, rebooted, then removed the pool successfully, but the question still remains. If an export/scrub process hangs like this, are there specific actions that can fix it aside from a reboot?"
Not that I'm aware of. I suspect one of the iXsystems developers might be able to help, but they are rarely on the forums.

ZFS is robust; a reboot should not cause any harm that couldn't be recovered from, hence the redundancy.

Now you have @HoneyBadger so you are in good hands.
 

FrankWard

Explorer
Joined
Feb 13, 2023
Messages
71
Thanks for the info. I've rarely run into this before, so I'll be better equipped for the next battle.
 
