What happens when a disk fails, and why is it resilvering?

emsicz

Explorer
Joined
Aug 12, 2021
Messages
78
I'm having trouble following TrueNAS SCALE procedures when a disk fails. One of the disks in my NAS has failed. From the UI I can't figure out how many disks are actually in that NAS, which one has failed, or what is happening now. You could say I'm a little surprised at how badly the UI communicates with me (except for the email notification that a disk failed; that worked perfectly fine).

The home screen does say the pool is degraded, but there is no notification in the top right corner. I would expect a failed disk to light up all kinds of warnings, but no. All I see is a small orange status indicator on the home screen.


2023-06-25_043330.png


When I click the pool status icon there, I am presented with a screen that says two things are unavail. So did one drive fail, or two? Are there 8 drives total, or 9? The screen above says total disks: 8 (data), spares: 1. So is the total actually 9?

2023-06-25_043617.png


From this screen I still can't figure out what is actually wrong. There are two spare branches, although I'm pretty sure I only had one spare disk in that pool when I set it up. Also, in one SPARE branch it says sda is online, but in the second spare branch it says sda is unavail. What do I make of this? And finally, TrueNAS is running "resilvering." Why?

I guess what I expected is a pretty straightforward instruction that tells me the SN of the failing disk so I can just replace it, but instead I am presented with all of the above and I can't figure anything out. What am I missing here?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
If you're using spares, you need to understand what you're doing...

When a spare is in use, it shows as unavail in the spare VDEV, and it forms its own VDEV in the pool together with the failed disk, taking over that disk's function.

As you can see, your disk had a good number of read and write errors, so the spare was brought in to avoid data loss. (sda, your spare, appears twice)
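To illustrate (the device names below are placeholders, not your actual disks), a pool with a hot spare in use looks roughly like this in zpool status:

  raidz2-0                  DEGRADED
    ...
    spare-5                 DEGRADED
      <failed data disk>    UNAVAIL
      <hot spare>           ONLINE
    ...
  spares
    <hot spare>             INUSE    currently in use

Once you either detach the failed disk or replace it and then detach the spare, that spare-5 group disappears again.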

You now need to decide what to do... either remove the spare designation and continue with that disk in the pool (with the option to add/designate a new spare afterwards), or replace the failed disk and return the spare to stand by for the next failure.

If you haven't already taken note of which disks are which in some meaningful way, you'll need to work by elimination from the list of serial numbers in the Disks view... the only one you won't see will be the broken disk (if it's still broken enough not to show up in the list).
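If some disks are no longer visible in the GUI at all, you can also pull serials straight from the shell (just a sketch; sdX stands for whichever device you want to check):

lsblk -o NAME,MODEL,SERIAL,SIZE        # serial numbers of every disk the OS still sees
smartctl -i /dev/sdX | grep -i serial  # serial number of one specific disk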
 

emsicz

Explorer
Joined
Aug 12, 2021
Messages
78
Thanks. I've done some reading on the forum and I think I have some understanding now.
  1. There is no way for TrueNAS to give me a simple table of present/missing disks with their serial numbers. I'm blown away by this. Windows Storage Spaces has this.
  2. I can't find any iX knowledge base article that simply goes over the process of replacing a failed disk. I am told I need to understand what I'm doing. Where do I obtain this understanding?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The process of replacing spares isn't specifically covered in the official docs (linked at the top of this site), but they do cover disk replacement (if that's the option you're choosing, it works the same whether spares are involved or not).

I don't know (since I don't do it a lot myself) whether recent versions support it fully in the GUI, but the principles are simple... detach the failed disk to continue using the spare permanently, or replace the disk as per the standard process and then detach the spare to put it back to "spare".
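In raw zpool terms the two paths look roughly like this (POOL and the disk names are placeholders; use the GUIDs/partuuids shown in your own zpool status, and note that on TrueNAS the replace step is normally done from the GUI so the new disk gets partitioned properly):

# Option A: keep the spare as a permanent member of the data VDEV
zpool detach POOL <failed-disk>
zpool add POOL spare <new-disk>              # optional: designate a fresh spare

# Option B: replace the failed disk, then free the spare
zpool replace POOL <failed-disk> <new-disk>  # resilvers onto the new disk
zpool detach POOL <spare-disk>               # returns the spare to AVAIL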
 

emsicz

Explorer
Joined
Aug 12, 2021
Messages
78
Thanks. The resilvering finished today and I am even more confused by the UI. I have finally received a notification bubble:

20230627 - Failed disks.png

I rejoiced, because this is exactly what I wanted to see in the first place -- a list of failed disks. However, when I checked the disks present in the pool, I noticed that the 1EH1PABN drive is displayed as ONLINE:

20230627 - SDA disk.png

20230627 - SDA ONLINE.png

I think this is because this was the spare disk and TrueNAS is trying to tell me that it now has no spare disk. Correct? So I am told two drives are not healthy, but in actuality only one drive needs replacing. Correct?

For the sake of completeness, here is the whole disk view now:

20230627 - Deadpool disks.png
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
That's a spare that was once in use but is now not available to the system, so you see it both where it was first put (in the "spare" VDEV) and in the data VDEV itself inside a SPARE group (where it kicked in as a spare).

You now need to detach that spare from the pool (maybe possible from the 3 dots menu in the GUI, or with zpool detach DEADPOOL /dev/disk/by-partuuid/4767... you can type the rest).
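If you're unsure of the full partuuid path, either of these should show it:

zpool status -P DEADPOOL    # -P prints the full /dev/disk/by-partuuid/... paths
lsblk -o NAME,PARTUUID      # maps sdX partitions to their partuuids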
 

emsicz

Explorer
Joined
Aug 12, 2021
Messages
78
Thanks for confirming, but do we agree that in the GUI this is not communicated very well? I mean, I can make a ticket, invest the time, describe the issue and propose some solutions, but I don't want to do all that if the veterans disagree and it's just me being a noob. I don't use the CLI to command TrueNAS; I use the GUI precisely because I have no idea what happens under the hood. The GUI is my guarantee that what I click on executes correctly and in the right order, and it seems to me that a disk failure is something that should be idiot-proof to handle through the GUI. My points during this process were:
  1. Can I cancel the resilvering? The UI offers this option. What happens if I cancel it? I would want to cancel it, for example, to gracefully shut down and reboot, because the disk might come back online then.
  2. Since I have RAIDZ2 with 1 spare unit and 1 failed disk, I'm still super safe, because even if yet another unit fails I'll still be OK, right?
  3. If I replace the broken disk, mistakenly unplug some other disks, and TrueNAS boots with an insufficient number of disks to reconstruct the data, am I OK to power everything down, recheck the connections and reboot to a successful mount?
    1. This point is super important to me. I realized during this process that TrueNAS gives me no option to pause boot if there is an issue; it just attempts to mount and automatically runs whatever process follows if that fails. The reason I am paranoid about this is that if you do this with Windows Storage Spaces (boot the machine with too few disks to fully mount the volume), Windows will grab whatever disks it does see, mark the volume as failed, and that's it. There is no way to put it back into a working state, even if you reboot with the rest of the disks connected.
  4. I found there was actually no problem with any of the disks, so I de-dusted the box, made sure no wires were visibly damaged, and rebooted, and everything sprang back to life. I'm confused as to how this can be -- if the data has changed, even slightly, since the last moment the drive was online, doesn't the entire content of the drive need to be recalculated?
And here is my disk view now, when all disks are back online:

20230628 - Deadpool disks after everything back online.png


Inexplicably, the home screen says everything is healthy, everything is fine, all is green. But here I see that the spare `sda` is not available, and within the RAIDZ2 there are two disks in the spare branch. Here are the actions available for both `sda` and `sdj`:

20230628 - Disk actions.png


I think the correct state is when there is no `SPARE` vdev inside the `RAIDZ2` and the `spare` vdev in `DEADPOOL` contains just the `sda` unit. Am I correct? And how would I get there? Thanks again.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I think the correct state is when there is no `SPARE` vdev inside the `RAIDZ2` and the `spare` vdev in `DEADPOOL` contains just the `sda` unit. Am I correct? And how would I get there?
Yes.

sda is currently "in use" as a spare, so unavailable to be a spare to anything else.

You can use detach on it (in the upper section) to return it to the spares as available.
 

emsicz

Explorer
Joined
Aug 12, 2021
Messages
78
Yes.

sda is currently "in use" as a spare, so unavailable to be a spare to anything else.

You can use detach on it (in the upper section) to return it to the spares as available.
I tried detaching. For about an hour, the "Please wait" activity indicator kept spinning, so I closed the window. No operation is running when I open the task manager in the top right. If I try to click detach again, I receive an error:

[EZFS_NOTSUP] Cannot detach root-level vdevs

Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs.py", line 260, in __zfs_vdev_operation
    op(target, *args)
  File "libzfs.pyx", line 411, in libzfs.ZFS.__exit__
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs.py", line 260, in __zfs_vdev_operation
    op(target, *args)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs.py", line 273, in impl
    getattr(target, op)()
  File "libzfs.pyx", line 2143, in libzfs.ZFSVdev.detach
libzfs.ZFSException: Cannot detach root-level vdevs

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 114, in main_worker
    res = MIDDLEWARE._run(*call_args)
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 45, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1276, in nf
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs.py", line 269, in detach
    self.detach_remove_impl('detach', name, label, options)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs.py", line 276, in detach_remove_impl
    self.__zfs_vdev_operation(name, label, impl)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs.py", line 262, in __zfs_vdev_operation
    raise CallError(str(e), e.code)
middlewared.service_exception.CallError: [EZFS_NOTSUP] Cannot detach root-level vdevs
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 177, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1294, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1272, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1140, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool.py", line 1077, in detach
    await self.middleware.call('zfs.pool.detach', pool['name'], found[1]['guid'])
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1345, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1302, in _call
    return await self._call_worker(name, *prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1308, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1223, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1206, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [EZFS_NOTSUP] Cannot detach root-level vdevs

I honestly have no idea what is happening with the data right now.
 
Joined
Jul 3, 2015
Messages
926
Can we see the output of zpool status from the CLI please?
 

emsicz

Explorer
Joined
Aug 12, 2021
Messages
78
root@TEXAS[~]# zpool status
  pool: DEADPOOL
 state: ONLINE
  scan: resilvered 160M in 00:06:16 with 0 errors on Tue Jun 27 12:28:10 2023
config:

        NAME                                        STATE     READ WRITE CKSUM
        DEADPOOL                                    ONLINE       0     0     0
          raidz2-0                                  ONLINE       0     0     0
            0ad9f7bf-1e28-4fa7-b1af-85f75d778cdc    ONLINE       0     0     0
            874ca29e-e170-4c19-957d-01e4bdb63485    ONLINE       0     0     0
            358ab6fb-b6ee-4f51-81bb-ed80e4686891    ONLINE       0     0     0
            fcbe1ffa-66e7-42cf-a207-e3328caa9f87    ONLINE       0     0     0
            3b870d94-87bb-43f0-bad4-10ae626ee72b    ONLINE       0     0     0
            spare-5                                 ONLINE       0     0     0
              47673709-75b0-406e-a183-0f6ce281e0c0  ONLINE       0     0     0
              19702139-b077-4aaa-83f8-4a1e8304759f  ONLINE       0     0     0
            f52d2515-e8b2-4259-88f4-9c724dc8ad2a    ONLINE       0     0     0
            e82906ef-a691-49eb-baff-7cc2984b43ac    ONLINE       0     0     0
        spares
          19702139-b077-4aaa-83f8-4a1e8304759f      INUSE     currently in use

errors: No known data errors

20230701 - Zpool status.png
 
Joined
Jul 3, 2015
Messages
926
How about zpool detach DEADPOOL 1970…………
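If that goes through, a quick check afterwards should show the spare-5 group gone and the spare back as available:

zpool status DEADPOOL    # the spare should now be listed as AVAIL under "spares"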
 

emsicz

Explorer
Joined
Aug 12, 2021
Messages
78
20230701 - zpool detach deadpool.png


This seems to have worked, thanks. The UI pool status screen now also lists sda as available, and the sda/sdj spare branch has disappeared. I reckon everything is operating nominally now?
 
Joined
Jul 3, 2015
Messages
926
Great, looks good from here.
 
Joined
Jul 3, 2015
Messages
926
I have finally received a notification bubble:
I do find it odd that you only receive an alert AFTER your hot spare has completed its resilver, and not as soon as it's activated. I did raise a ticket about this but guess the team didn't agree with me.

Correction: I just checked my ticket and apparently they have fixed it in 22.12.3 and Cobia. It would be great if they could fix it in Core too, as that's what I use.
 

emsicz

Explorer
Joined
Aug 12, 2021
Messages
78
Thanks, yes, the process wasn't very straightforward. My biggest surprise was that at no point am I told REPLACE DISK SN: XXXX. Honestly, it's a NAS box that sits in the corner; replacing a failed drive is, like, the one thing I am expected to do with it, and it's a weirdly complex task to achieve.
 