Replacing a drive

Joined
Jul 13, 2013
Messages
286
This is supposed to be easy and fail-safe. I've done it before multiple times.

And yet...the documentation is actively confusing, and I remember ending up in trouble every single time I've done this in the past. So I'm kinda nervous!

So, reading the official docs at https://www.ixsystems.com/documentation/freenas/11.3-U5/storage.html#replacing-a-failed-disk . First thing I see -- reference to "the failed drive". There is no failed drive. I'm replacing a drive that has been reporting through SMART, more than once, that it's remapping sectors. I'm pretty confident in my decision to replace it, and multiple people here backed me up on that, so this shouldn't be a rare case; in fact, most drives probably show they're getting ready to fail this way before actually failing. The documentation doesn't seem to allow for that possibility, though. But that means I have to identify it from how SMART labels it, rather than just going to the disk management screen and clicking on the one drive the system already knows has failed.

Then, when many people talk about doing this, they assume you connect the new drive first. No can do -- no available controller ports. I must identify and offline (remove from the pool) the drive first, then physically remove it, then install the new drive, then add it to the pool somehow as a replacement for the old drive that isn't there any more. It's not at all clear how to do this!

So...the drive throwing SMART errors is /dev/ada5. I can find that on the GUI page, and get the serial number there. I can use SMART and other things to get all the other info on the disk (I want to cross-check that I'm finding the right physical drive! Unfortunately /dev/ada5 isn't physical on FreeBSD, it's just a label). That page doesn't let me set the disk offline though.

On the GUI storage / pools / status page I can find the drive, again. Under the three vertical dots menu at the right of that line I get the choices edit, offline, and replace. I think what I have to do is offline at this point; I think replace requires the new drive to already be connected (that seems to be what the GUI thinks, there are no disks available if I click replace). Can anybody confirm that? Is that somewhere I'm missing it in the documentation? I think that fulfills step 1 as listed in the documentation.

Wait...step 2 starts "after the disk is replaced"??? Which meaning of "replace" is that supposed to be? But before that, I think I need to shut down the system, make sure I've found the right drive, disconnect it and take it out, put in the new one, hook it up, and boot. Now....what drive do I then "replace", this time meaning in the GUI interface? What should I expect to see? What will /dev/ada5 think it is?

The question of what I will see after offlining a drive, shutting down, removing the old drive, installing the new drive, and rebooting is what is worrying me. Will it be obvious what to do? The docs don't show any examples of that. (They also talk a lot about leaving the old disk in place while adding the new one...which is only valid for a mirror vdev, not for a parity vdev, and I would suggest that should be made MUCH more clear.)

The docs say AHCI drivers support hot-swap. I've got hot-swap bays for the first 4 drives in the array but not the last two (this is a $35 case, not a $2000 case). I think I'm more comfortable just shutting down, anyway.

This is not an encrypted pool, and while the new drive is SED I do not intend to use it (people said before that if I just ignore it, it'll behave fine).
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
> reference to "the failed drive"
This means "the drive that you're going to replace" (dead/dying/unwanted/too small/whatever).

> But that means I have to identify it from how SMART labels it
> the drive throwing SMART errors is /dev/ada5. I can find that on the GUI page, and get the serial number there. I can use SMART and other things to get all the other info on the disk (I want to cross-check that I'm finding the right physical drive! Unfortunately /dev/ada5 isn't physical on FreeBSD, it's just a label). That page doesn't let me set the disk offline though
Right. The serial number is the way to link the label with a physical disk... as you said, it can typically be found in Storage | Disks. It can also be seen with dmesg | grep Serial. You may also find it useful to link the zpool status output with the disks by checking glabel status and using the gptids to cross-reference.
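A minimal sketch of that cross-reference from the shell, assuming the disk in question is ada5 as in the post above. The helper name gptid_for_disk is made up for illustration; it only filters the output of glabel status, which lists each label next to the partition it lives on.

```shell
# Print the gptid label(s) that sit on partitions of the given disk.
# Reads `glabel status` output on stdin; columns are Name, Status, Components,
# and partitions of ada5 show up as ada5p1, ada5p2, and so on.
gptid_for_disk() {
    awk -v disk="$1" '$1 ~ /^gptid\// && $3 ~ ("^" disk "p[0-9]+$") { print $1 }'
}

# Usage on the NAS itself (ada5 is the device name from this thread):
#   glabel status | gptid_for_disk ada5      # gptid to look for in zpool status
#   smartctl -i /dev/ada5 | grep -i serial   # serial number printed on the drive label
```

The gptid it prints should match one of the members shown by zpool status, and the serial number is what to check against the sticker on the physical drive before pulling it.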

> On the GUI storage / pools / status page I can find the drive, again. Under the three vertical dots menu at the right of that line I get the choices edit, offline, and replace. I think what I have to do is offline at this point; I think replace requires the new drive to already be connected (that seems to be what the GUI thinks, there are no disks available if I click replace). Can anybody confirm that? Is that somewhere I'm missing it in the documentation? I think that fulfills step 1 as listed in the documentation.
You're right that the documentation assumes the disk has already failed, so indeed you need to first offline the disk to be replaced (before you then physically replace it in the chassis... and replace it in the pool in the GUI).

> Wait...step 2 starts "after the disk is replaced"??? Which meaning of "replace" is that supposed to be? But before that, I think I need to shut down the system, make sure I've found the right drive, disconnect it and take it out, put in the new one, hook it up, and boot. Now....what drive do I then "replace", this time meaning in the GUI interface? What should I expect to see? What will /dev/ada5 think it is?
The physical replacement comes after offlining (and the optional shutdown). Step 2 then refers to the replacement in ZFS, done through the GUI.
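As for what to expect after the reboot, here is a rough sketch for checking from the shell. It assumes the usual zpool status layout, where each member line starts with the device name followed by its state. The function name offline_members and the pool name tank are placeholders, not anything from this system.

```shell
# List pool members that zpool status reports as OFFLINE.
# Reads `zpool status` output on stdin; on member lines the first field
# is the device (or gptid) and the second field is its state.
offline_members() {
    awk '$2 == "OFFLINE" { print $1 }'
}

# Usage on the NAS (the pool name "tank" is a placeholder):
#   zpool status tank | offline_members
```

The disk that was offlined should still be listed there after the swap, and that entry is the one to pick "replace" on in the GUI.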

> The question of what I will see after offlining a drive, shutting down, removing the old drive, installing the new drive, and rebooting is what is worrying me. Will it be obvious what to do? The docs don't show any examples of that. (They also talk a lot about leaving the old disk in place while adding the new one...which is only valid for a mirror vdev, not for a parity vdev, and I would suggest that should be made MUCH more clear.)
>
> The docs say AHCI drivers support hot-swap. I've got hot-swap bays for the first 4 drives in the array but not the last two (this is a $35 case, not a $2000 case). I think I'm more comfortable just shutting down, anyway.
>
> This is not an encrypted pool, and while the new drive is SED I do not intend to use it (people said before that if I just ignore it, it'll behave fine).
All that you're suggesting will be fine.

You're right that the document could be clearer, but I don't expect much more effort to be put into the FreeNAS documentation... TrueNAS Core is already due to have improvements made to the documentation, but we've been waiting for that for a while.
 
Joined
Jul 13, 2013
Messages
286
Thanks! I'm also realizing that each time I've needed to change a disk, it has been under a different software version (going all the way back to Sun OpenSolaris, which is what drew me into ZFS and a home server in the first place). So not only are my memories years old each time, but the software is usually different, too :smile: . And screwing it up is not to be contemplated (a lifetime of photos: scans of old film photos for the first 40 years, then digital originals for the last 20, and now other people's photos as well, in a photo archive I'm running for a local science fiction group; yes, there are backups).

Okay, will go ahead as planned then. Heck, I can even try to remember to take screenshots and write up the process when I'm done, in case anybody else is as timid as I am about mucking with my disk array.
 
Joined
Jul 13, 2013
Messages
286
Okay. So, recorded everything I could find about the bad disk (GPTID, serial, also make/model).

  1. Set the bad drive offline in the GUI
  2. Shut down the system
  3. Found the physical drive matching the recorded information :cool:
  4. Removed it
  5. Installed the replacement drive
  6. Rebooted (it still appalls me how long it takes for FreeNAS to boot; several 9s of availability gone right there!)
  7. Looked at pool status in the GUI
  8. It shows the old disk, which I removed, as being offline (by GPTID, while showing the others by device), and the pool as degraded
  9. In the 3-dot menu for the old disk, selected "replace" (NOT "online")
  10. In the dialog that came up, I got a dropdown of possible disks, which listed only the replacement disk (it says "members only", but nothing had been done to associate that disk with this pool; maybe it also lists unassigned disks?)
  11. Selected the new disk and confirmed. Exactly what you see here will of course depend on your system; my configuration uses all 6 SATA ports for the 6 drives in one pool
  12. It spun a working display for a while, then admitted it had replaced the disk
  13. Pool Status now shows a resilver in progress (and the disk IO lights are on solidly)
The resilver will take a while; 32000 seconds estimated remaining just now, a bit less than 9 hours if the estimate holds. Seems reasonable for a pool 47% full on 6TB drives.
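For the record, the arithmetic behind that estimate, as a small shell sketch (secs_to_hm is just a throwaway helper name, not a FreeNAS tool):

```shell
# Convert a duration in seconds to whole hours and minutes.
secs_to_hm() {
    printf '%dh %02dm\n' $(( $1 / 3600 )) $(( ($1 % 3600) / 60 ))
}

secs_to_hm 32000   # prints "8h 53m"
```

The remaining-time figure is only an estimate and tends to move around as the resilver goes on, so this is a ballpark rather than a deadline.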

Did not, obviously, actually manage screenshots of each step.

While there are ways for this to fail still, they seem to mostly involve hardware failures (or a meteor coming through the house and hitting the server); I think it's past the point where I can do anything seriously wrong by mistake.

Thanks so much for the friendly hand-holding!
 