
Replacing a failed/failing disk

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
danb35 submitted a new resource:

Replacing a failed/failing disk - Click-by-click, how to replace a failed or failing disk

So you've had a disk fail, or you know it's about to fail, and you need to replace it. And for whatever reason, the manual's instructions weren't clear enough. So, step by step, here's how to do it.

DO NOT USE THE VOLUME MANAGER

Seriously, no part of replacing a failed disk involves clicking the Volume Manager button. Don't do it.

Log in to the FreeNAS web GUI, and click on the Storage button. On the...

 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The one point that needs to be kept in mind is the big warning:

DO NOT USE THE VOLUME MANAGER
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
The User Guide Docs need to link to this post!
*** calling @dlavigne ***
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
But they already have the instructions, they're just not read as much as we'd like. The idea behind this resource is to try to catch a few people who check out the resources section but not the manual.

I have no idea what kind of thought process would be involved in such a course of action, but I have the feeling this might help a handful of people.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
There are many links in the documentation that point back to threads in this forum. I just thought someone should take a look at this resource and either place a link to it in the docs or update the "replace a failed drive" section of the docs with some of its improved content. I too am looking to help the handful of users who seem to have difficulty with the documentation's directions on this subject.
I'm like you, in that I'm confused as to why there is confusion. The user's guide/manual is already a large book, the addition of a bunch of images...
 

garzjoe

Cadet
Joined
May 8, 2018
Messages
1

Great guide! Thanks for sharing it! I found in my logs that my ada1 drive was starting to report unreadable and unrecoverable sectors even though its SMART tests had passed. I noticed this by accident: we had a power outage, and the security report I have e-mailed to me included the drive errors.

I confirmed the drive errors when I got home and looked at the logs; it looks like ada1 is going south but has yet to fail. I wanted to confirm the drive's serial number via the GUI before replacing it. However, on my primary server, View Disks in the GUI does not show any of my hard drive serial numbers for some reason. On my secondary FreeNAS server running the same version (9.2.1.5), I do see all the serial numbers. I'm not overly concerned with why the serial numbers aren't showing at the moment, but if others are having the same or a similar issue, below is a basic Unix command to retrieve that data from the command line.

smartctl -a /dev/<drive> | grep -i serial

Just substitute the device name of the actual failing drive, as I did below.

smartctl -a /dev/ada1 | grep -i serial
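
When the GUI shows no serial numbers at all, the same lookup can be looped over every disk so each device name can be matched to the label on the physical drive. This is only a minimal sketch: it assumes a FreeBSD-based FreeNAS shell where `sysctl -n kern.disks` lists the disk device names, that smartmontools is installed, and that you run it as root. The helper names `serial_of` and `list_serials` are made up for illustration.

```shell
#!/bin/sh
# serial_of: read `smartctl -i` output on stdin and print only the serial number.
# The match is case-insensitive because the label's capitalization can vary.
serial_of() {
    awk -F': *' 'tolower($1) ~ /serial number/ { print $2; exit }'
}

# list_serials: print "device serial" for every disk the kernel reports.
# Assumption: FreeBSD's kern.disks sysctl; run as root on the NAS itself.
list_serials() {
    for disk in $(sysctl -n kern.disks); do
        printf '%s %s\n' "$disk" "$(smartctl -i "/dev/$disk" | serial_of)"
    done
}
```

Calling `list_serials` should print one line per disk; a device with an empty serial column probably does not answer SMART queries.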
 

tom__w

Explorer
Joined
Mar 26, 2013
Messages
87
Must I power the server off and then back on? I've got a ton of virtual machines on here and I hate powering them down.

Thanks
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
tom__w said: "Must I power the server off and then back on?"
Not if you have suitable hotswap hardware. Though even then, it's safer to do so.
 

MondoTV

Cadet
Joined
Nov 5, 2017
Messages
2
Excellent guide. One question - if you have a hot spare do you have to take the failing disk offline or can you just replace it?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
MondoTV said: "if you have a hot spare do you have to take the failing disk offline or can you just replace it?"
You should be able to just replace it. The old disk will drop offline when the replacement is finished.
 

MondoTV

Cadet
Joined
Nov 5, 2017
Messages
2
Hi - replaced my first disk over the weekend (I had two drives starting to throw unreadable or uncorrectable sector errors in a 12-disk RAIDZ2 pool). I replaced the one throwing 16 uncorrectable sector errors first. Came back in Monday and it was still resilvering but finished without an issue. I'm now replacing the second drive (with 1 pending, unreadable sector).
The second hard drive seems to be taking a lot longer. I'm assuming that is because I'm still using the pool during the day?
Anyway - thanks heaps for the guide - it was invaluable.
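
To see how far a resilver has got, and whether it slows down while the pool is busy, the scan section of `zpool status` can be pulled out and checked from time to time. A minimal sketch, assuming a pool named `tank` (hypothetical) and the usual `zpool status` layout, where progress lines follow the `scan:` line; the helper name `scan_status` is made up for illustration.

```shell
#!/bin/sh
# scan_status: read `zpool status` output on stdin and print the "scan:" line
# plus the continuation lines under it (amount scanned, rate, percent done).
scan_status() {
    awk '
        /^[[:space:]]*scan:/   { show = 1; print; next }  # start of scan section
        /^[[:space:]]*[a-z]+:/ { show = 0 }               # next section, e.g. config:
        show                   { print }
    '
}

# Example use on the NAS (pool name is an assumption):
#   zpool status tank | scan_status
```

Running it a few times an hour apart, once during working hours and once overnight, gives a rough idea of how much normal pool use is slowing the resilver.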
 

reub

Dabbler
Joined
Sep 22, 2018
Messages
14
"replace it with a good, burned-in, tested disk"

In a scenario where you don't have a free bay in which to burn in a replacement drive and no handy pre-burned-in drive, would the recommended procedure here be to:
  1. Shut down, remove the failed/failing disk.
  2. Insert the replacement disk into the open bay, power on the server.
  3. Run the battery of burn-in tests (as in https://www.ixsystems.com/community/resources/hard-drive-burn-in-testing.92/) on the newly inserted disk.
  4. Proceed with replacing the disk in the pool as specified.
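
For step 3, a burn-in pass might look roughly like the sequence below. It is a sketch only, loosely modeled on the SMART self-tests and `badblocks` write test that burn-in guides of this kind usually describe; the exact order and the 4096-byte block size are assumptions, and `burnin_plan` is a made-up helper that only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# burnin_plan: print (do not run) a burn-in sequence for one disk.
# WARNING: `badblocks -w` is a destructive write test. Only ever point it at
# a new disk that is not yet a member of any pool.
burnin_plan() {
    dev="/dev/$1"
    echo "smartctl -t short $dev"       # quick self-test
    echo "smartctl -t long $dev"        # full surface self-test
    echo "badblocks -b 4096 -ws $dev"   # destructive write/verify pass
    echo "smartctl -t long $dev"        # re-test after the write pass
    echo "smartctl -A $dev"             # review reallocated/pending sector counts
}

# Example: burnin_plan ada3   (device name is hypothetical)
```

The SMART self-tests run in the background on the drive, so each one has to finish before the next command is started.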
 

reub

Dabbler
Joined
Sep 22, 2018
Messages
14
"Better yet, if the failing disk is still usable, would be to burn in the replacement on different hardware."
Makes sense for sure. I suppose I should have also qualified that no outside capacity is available for the burn-in process.
 

LittleGreyCat

Dabbler
Joined
Jun 5, 2020
Messages
13
Newbie here, but this assumes that you are running a mirror.

I think that RAIDZ1 is self healing, but isn't constructed from mirrored pairs.
It can survive the loss of one drive but runs degraded as it has to reconstruct the missing data from parity on the remaining drives.
So is the process much the same for a failing RAIDZ1 array?
That is, mark a disc for removal if it isn't already offline, then physically replace?

In the dim and distant past I worked with a RAID5 array where you could just pull one disc and replace it with a fresh disc and everything would run on whilst the new disc was rebuilt.
However this would be with a proper hardware RAID controller.
Can you do similar with FreeNAS?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
LittleGreyCat said: "this assumes that you are running a mirror."
It assumes no such thing, only that you're running a pool that can survive having a disk offline. That can be a mirror or RAIDZ1/2/3.

What are you seeing that makes you think anything here depends on the pool being a mirror? I'd like to change it to clarify.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
LittleGreyCat said: "proper hardware RAID controller"
One important point: In no way is a crummy, overpriced SoC running black box code with bugs proportional to the number of tentacles it tries to grow (A webGUI for the RAID controller? What a horrible idea...) more proper than ZFS.
 