David Dyer-Bennet
Patron
- Joined
- Jul 13, 2013
- Messages
- 286
This is supposed to be easy and fail-safe. I've done it before multiple times.
And yet...the documentation is actively confusing, and I remember ending up in trouble every single time I've done this in the past. So I'm kinda nervous!
So, reading the official docs at https://www.ixsystems.com/documentation/freenas/11.3-U5/storage.html#replacing-a-failed-disk . First thing I see -- reference to "the failed drive". There is no failed drive. I'm replacing a drive that has reported through SMART, more than once, that it's remapping sectors. I'm pretty confident in my decision to replace it, and multiple people here backed me up on that. This shouldn't be a rare case; in fact, most drives probably show they're getting ready to fail this way before actually failing. But the documentation doesn't seem to allow for that possibility. That means I have to identify the drive from how SMART labels it, rather than just clicking on the disk management screen and picking out the one the system knows has failed.
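To make "remapping sectors" concrete, here is a rough sketch of pulling the relevant raw counters out of the attribute table that `smartctl -A /dev/ada5` prints. The sample text is made up in the usual table layout (it is not output from my drive), and the function name `raw_values` is just something I picked for illustration.

```python
# Sketch only: SAMPLE is invented text in the usual `smartctl -A` table layout,
# not real output from the drive discussed in this post.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       24
  9 Power_On_Hours          0x0032   072   072   000    Old_age   Always       -       20731
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def raw_values(smart_text, wanted=("Reallocated_Sector_Ct", "Current_Pending_Sector")):
    """Return {attribute_name: raw_value} for the attributes named in `wanted`."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() check.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in wanted:
            found[fields[1]] = int(fields[9])
    return found


if __name__ == "__main__":
    print(raw_values(SAMPLE))
```

A non-zero and growing raw value for Reallocated_Sector_Ct is the kind of report I mean; the pool itself still shows the drive as healthy.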
Then, when many people talk about doing this, they assume you connect the new drive first. No can do -- no available controller ports. I must identify and offline (remove from the pool) the drive first, then physically remove it, then install the new drive, then add it to the pool somehow as a replacement for the old drive that isn't there any more. It's not at all clear how to do this!
So...the drive throwing SMART errors is /dev/ada5. I can find that on the GUI page, and get the serial number there. I can use SMART and other things to get all the other info on the disk (I want to cross-check that I'm finding the right physical drive! Unfortunately /dev/ada5 isn't tied to a physical location on FreeBSD; it's just a name the system assigns). That page doesn't let me set the disk offline, though.
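A rough sketch of the cross-check I have in mind: read the serial number from `smartctl -i /dev/ada5` and compare it with what is printed on the label of the drive I'm about to pull. The sample text and the serial `ZZ0EXAMPLE` are invented, and the helper names are my own; only the "Serial Number:" line of the `smartctl -i` output is relied on.

```python
import re
import subprocess

# Invented sample in the style of `smartctl -i` output; not a real drive.
SAMPLE_INFO = """\
Device Model:     EXAMPLE HDD 6TB
Serial Number:    ZZ0EXAMPLE
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
"""


def serial_number(info_text):
    """Extract the serial from `smartctl -i` text, or None if it is absent."""
    match = re.search(r"^Serial Number:\s*(\S+)", info_text, re.MULTILINE)
    return match.group(1) if match else None


def is_right_drive(info_text, serial_on_label):
    """True only if the serial smartctl reports matches the one on the drive label."""
    found = serial_number(info_text)
    return found is not None and found.upper() == serial_on_label.strip().upper()


def read_info(device="/dev/ada5"):
    """Run smartctl on the real system (needs root); not called in this sketch."""
    return subprocess.run(
        ["smartctl", "-i", device], capture_output=True, text=True
    ).stdout


if __name__ == "__main__":
    print(serial_number(SAMPLE_INFO))
    print(is_right_drive(SAMPLE_INFO, " zz0example "))
```

The point is to match on the serial, which stays with the physical drive, and not on the adaN name.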
On the GUI storage / pools / status page I can find the drive again. Under the three-vertical-dots menu at the right of that line I get the choices edit, offline, and replace. I think what I have to do at this point is offline; I think replace requires the new drive to already be connected (that seems to be what the GUI thinks, since there are no disks available if I click replace). Can anybody confirm that? Is it somewhere in the documentation and I'm missing it? I think that fulfills step 1 as listed in the documentation.
Wait...step 2 starts "after the disk is replaced"??? Which meaning of "replace" is that supposed to be: physically swapping the drive, or the replace action in the GUI? Before that, though, I think I need to shut down the system, make sure I've found the right drive, disconnect it and take it out, put in the new one, hook it up, and boot. Now....what drive do I then "replace", this time meaning in the GUI? What should I expect to see? What will /dev/ada5 think it is?
The question of what I will see after offlining a drive, shutting down, removing the old drive, installing the new drive, and rebooting is what is worrying me. Will it be obvious what to do? The docs don't show any examples of that. (They also talk a lot about leaving the old disk in place while adding the new one...which is only valid for a mirror vdev, not for a parity vdev, and I would suggest that should be made MUCH more clear.)
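To pin down the order of operations I'm assuming, here is a sketch that only builds and prints the sequence; it runs nothing. The `zpool offline`, `zpool status`, and `zpool replace` commands are the plain ZFS equivalents of the GUI actions as I understand them, and I'd still expect to do the real thing from the GUI, since I believe it also handles partitioning the new disk. The pool name `tank` and the member name `gptid/OLD-MEMBER-ID` are placeholders, not my actual values.

```python
def replacement_plan(pool, old_member, new_device):
    """Return (description, command) steps; command is None for manual steps."""
    return [
        ("take the suspect disk offline",
         ["zpool", "offline", pool, old_member]),
        ("shut down, swap the physical drives, boot",
         None),
        ("check the pool (I'd expect DEGRADED, with the old member shown OFFLINE)",
         ["zpool", "status", pool]),
        ("replace the offlined member with the new disk, which starts a resilver",
         ["zpool", "replace", pool, old_member, new_device]),
        ("watch the resilver until it completes",
         ["zpool", "status", pool]),
    ]


if __name__ == "__main__":
    # Placeholder names: substitute the real pool and member identifiers.
    plan = replacement_plan("tank", "gptid/OLD-MEMBER-ID", "ada5")
    for number, (description, command) in enumerate(plan, start=1):
        shown = " ".join(command) if command else "(manual step)"
        print(f"{number}. {description}: {shown}")
```

If that ordering is wrong for a parity vdev with no spare port, that is exactly the part I'd like corrected.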
The docs say AHCI drivers support hot-swap. I've got hot-swap bays for the first 4 drives in the array but not the last two (this is a $35 case, not a $2000 case). I think I'm more comfortable just shutting down, anyway.
This is not an encrypted pool, and while the new drive is an SED, I do not intend to use the self-encryption feature (people said before that if I just ignore it, the drive will behave fine).