SOLVED drive faulted, can't offline it to replace it

digity

Contributor
Joined
Apr 24, 2016
Messages
156
I have a drive that has faulted and has degraded the pool. In the web UI, I try to offline the disk, but nothing happens - the status still says "FAULTED". I tried several times, even logging out and in, nothing. I then noticed the drive ID in the alerts and the drive status page are different (da26, da10, respectively), with the latter being its old ID from a pool export/import months ago. Both IDs point to the same drive (verified by the serial number). I'm assuming the ID issue is messing with TrueNAS' ability to properly offline the faulted drive...?

With these issues, how can I replace this failing drive?

P.S. - I haven't used the "Replace" function, because I don't have a free slot available until that faulted drive comes out.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I'm assuming the ID issue is messing with TrueNAS' ability to properly offline the faulted drive...?

With these issues, how can I replace this failing drive?
Faulted is already a kind of offline status (the system isn't using it in the pool already).

You can just pull the drive and replace it when it's shown as unavailable (and you insert an appropriate replacement drive to replace it with)

I then noticed the drive ID in the alerts and the drive status page are different (da26, da10, respectively),
Don't get hung up on those designations. They aren't used by ZFS to identify the right member disk of a pool/vdev. ZFS uses the gptid to identify a disk no matter where it appears in the device order.
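
To see which daN device currently sits behind a given gptid, the label table can be cross-referenced against the pool members. This is only a minimal sketch, assuming a TrueNAS CORE (FreeBSD) shell running `sh` and the usual three-column `glabel status` layout (Name, Status, Components). The function name `gptid_to_dev` is made up for illustration.

```shell
# Hypothetical helper: look up the device behind a gptid label.
# Reads `glabel status` output on stdin; $1 is the label, with or
# without the leading "gptid/". Prints the component, e.g. da10p2.
gptid_to_dev() {
    awk -v id="$1" '
        BEGIN { if (id !~ /^gptid\//) id = "gptid/" id }
        $1 == id { print $3 }
    '
}
```

It would be used as `glabel status | gptid_to_dev <gptid shown in zpool status>`. If the daN number changes after a reboot or re-import, the gptid stays the same and only the component column moves.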
 

digity

Contributor
Joined
Apr 24, 2016
Messages
156
Faulted is already a kind of offline status (the system isn't using it in the pool already).

You can just pull the drive and replace it when it's shown as unavailable (and you insert an appropriate replacement drive to replace it with)


Don't get hung up on those designations. They aren't used by ZFS to identify the right member disk of a pool/vdev. ZFS uses the gptid to identify a disk no matter where it appears in the device order.

I pulled the faulted drive and inserted the replacement drive. It doesn't show up with a new disk ID, but with the old disk ID, da26, so I can't use the replace function (Storage -> Pools -> Pool -> Status) to get it into the pool. It has the old ID even though the make, model number and serial number reflect the new drive (Storage -> Disks).

Any ideas?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
so I can't use the replace function (StoragePools -> Pool -> Status
Have you actually tried that? You can indeed replace a disk from there regardless of its identifier. Is it that the list is empty when you try the replace?
 

digity

Contributor
Joined
Apr 24, 2016
Messages
156
Have you actually tried that? You can indeed replace a disk from there regardless of its identifier. Is it that the list is empty when you try the replace?

Heh, it's working now... I think.

But, yes I did try the Replace function initially, but I got an error. It actually listed the drive in question as multipath/disk15 (Storage -> Pools -> Pool Status -> Replace). I thought the multipath thing was causing the issue so I performed "gmultipath destroy disk15" and "gpart recover /dev/da26" (and "gpart recover /dev/da10"), but I got the invalid argument error with the gpart commands. Went back to try the Replace function anyway and the drive wasn't listed as available at all this time (da26/10 or multipath/disk15). I popped the drive out and then back in and it still listed it as da26/10 (Storage -> Disks). Got frustrated and I left it at that.

I saw your reply and went to Replace to check if it actually lists da26 (or da10) as available and it didn't, but now had a new entry listed as multipath/disk1. I started to run the gmultipath and gpart commands again for this multipath drive, but figured I should try to add it as is (multipath/disk1) and BAM! It added without spitting an error and started resilvering.

I don't know what was wrong or why it's working now, but I'll take it.

Thanks for your help
 

digity

Contributor
Joined
Apr 24, 2016
Messages
156
Just ran into this problem again (replacing a faulted drive via the GUI not working smoothly, TrueNAS Core 12.0-U8). Once again, popping the drive out while it was still in FAULTED status (instead of waiting for the status to go to OFFLINE like the official KB says), popping in the replacement drive, wiping it, and then performing the replace function worked well. Resilvering once again started automatically.
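
For anyone who wants to double-check from the shell which member ZFS has already stopped using before pulling anything, something like the following may help. It is a rough sketch, not an official procedure: `unhealthy_members` is a made-up name, and it assumes the standard `zpool status` config table where the second column is the state.

```shell
# Hypothetical helper: list pool members that ZFS has stopped using.
# Reads `zpool status` output on stdin and prints the name column of
# every row whose state is FAULTED, UNAVAIL, REMOVED or OFFLINE.
# The "state:" summary line is skipped so only table rows are printed.
unhealthy_members() {
    awk '$1 != "state:" && $2 ~ /^(FAULTED|UNAVAIL|REMOVED|OFFLINE)$/ { print $1 }'
}
```

Usage would be `zpool status tank | unhealthy_members`, with `tank` standing in for the real pool name. After the replace, the scan line of plain `zpool status` shows the resilver progress.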
 
Joined
Jun 2, 2019
Messages
591
My biggest problem with replacing failing drives is making sure I pull the correct drive. My NAS appliances have hot swap bays, but the device enumeration does not follow the physical order of the bays. Sure I could add tunables to force the drive enumeration, but what if it's wrong and I pull the wrong drive? I've tried the dd trick while watching the drive activity lights, but that does not work on my appliances. So it's a [redacted] shoot. The only way to be sure is to get the serial number of the failed drive from within the TrueNAS web GUI, shut down the NAS, then keep pulling drives until I find the correct serial number. Kinda defeats the benefit of having hot swap bays. The OEM QNAP QTS had the ability to "identify" a drive, but I have not figured out the TrueNAS CORE or SCALE equivalent.
 

digity

Contributor
Joined
Apr 24, 2016
Messages
156
My biggest problem with replacing failing drives is making sure I pull the correct drive. My NAS appliances have hot swap bays, but the device enumeration does not follow the physical order of the bays. Sure I could add tunables to force the drive enumeration, but what if it's wrong and I pull the wrong drive? I've tried the dd trick while watching the drive activity lights, but that does not work on my appliances. So it's a [redacted] shoot. The only way to be sure is to get the serial number of the failed drive from within the TrueNAS web GUI, shut down the NAS, then keep pulling drives until I find the correct serial number. Kinda defeats the benefit of having hot swap bays. The OEM QNAP QTS had the ability to "identify" a drive, but I have not figured out the TrueNAS CORE or SCALE equivalent.

To avoid this issue I write the last 4 characters of the serial number on painter's tape and stick it on the drive's caddy or alongside the server chassis. That way I can quickly ID the faulty drive and pull it out.
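
To get the same last-4 labels out of the system for comparison against the tape, a small filter over `geom disk list` could work. This is a sketch under assumptions: the name `serial_suffixes` is hypothetical, and it relies on each disk's block having a "Geom name:" line followed later by an "ident:" line carrying the serial, as on TrueNAS CORE / FreeBSD.

```shell
# Hypothetical helper: print "<device> <last 4 of serial>" per disk.
# Reads `geom disk list` output on stdin, which is assumed to contain a
# "Geom name: daN" line followed later by an "ident: <serial>" line.
serial_suffixes() {
    awk '
        /^Geom name:/ { dev = $3 }
        $1 == "ident:" {
            s = $2
            # Keep only the last four characters of longer serials.
            if (length(s) > 4) s = substr(s, length(s) - 3)
            print dev, s
        }
    '
}
```

Run as `geom disk list | serial_suffixes`. The device name next to the matching suffix is the one the alert is talking about at that moment.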
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The OEM QNAP QTS had the ability to "identify" a drive, but I have not figured out the TrueNAS CORE or SCALE equivalent.
Run dd if=/dev/da1 of=/dev/null bs=1024k count=5000 for each of the disks (changing da1 each time) and the activity light should help you see which disk is which.

If the failed device is still alive, you can go straight to that one; otherwise, work through the rest and find it by elimination.
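
Working through every disk by hand gets tedious, so the same dd read can be wrapped in a loop. This is just a sketch in `sh`: the names `light_up` and `light_up_each` are invented here, and it only helps on enclosures whose bay LEDs actually reflect disk activity.

```shell
# Hypothetical helper: read from one disk so its activity LED lights up.
# $1 = device node, $2 = number of 1 MiB blocks (5000 as in the post).
light_up() {
    dd if="$1" of=/dev/null bs=1024k count="${2:-5000}" 2>/dev/null
}

# Hypothetical loop: walk the given disk names one at a time, announcing
# each so the blinking bay can be matched to a device name.
light_up_each() {
    for d in "$@"; do
        echo "reading /dev/$d ..."
        light_up "/dev/$d" || echo "/dev/$d could not be read"
    done
}
```

For example, `light_up_each da1 da2 da3`, or on FreeBSD `light_up_each $(sysctl -n kern.disks)` to cover every disk the kernel knows about. A disk that reports "could not be read" is a likely candidate for the dead one.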
 