How to replace a faulty drive in a backup FreeNAS box, no RAID

be7taname · Jan 12, 2021

Hello, one of the drives in my backup FreeNAS server is showing checksum errors.
Since this server is only the backup of my main FreeNAS server, none of its drives are in a RAID setup.
So my question is: how do I replace this faulty drive? Because there is no RAID, does that mean all the data on the faulty drive will be gone if I replace it?
Thanks.

UdoB · Jan 13, 2021

be7taname said:
Because there is no RAID, does that mean all the data on the faulty drive will be gone if I replace it?

Be careful: the complete POOL will be gone if the failing drive is the only disk in its vdev. I have no pointer, but there should be documentation on how to deal with this...

Please post the output of zpool status (in Code-Tags).

Best regards

danb35 · Jan 13, 2021

be7taname said:
So my question is: how do I replace this faulty drive?

Pretty much like it says in the manual, but don't offline the failing disk at the beginning:

Power off the machine and install the replacement disk. If you don't have a drive bay for a replacement disk, you probably aren't using very suitable hardware for FreeNAS, but that's OK for these purposes--just set it somewhere for now. As long as you have power and SATA connections, you'll be OK. Start up the machine.
Log into the web GUI, go to Storage -> Pools, click the gear, go to Settings, as per the manual.
Click the kebab menu next to the failing disk and select Replace.
In the window that comes up, choose the new disk as the replacement.
When resilvering is complete, power down the system and remove the old disk.
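
For reference, the GUI steps above correspond roughly to the plain ZFS command `zpool replace <pool> <old-device> <new-device>`. The sketch below only builds and prints that command; it does not run it. The new device name `/dev/da4` is a made-up placeholder, and on FreeNAS the GUI is still the safer route because it also prepares the new disk for you, which this sketch does not do.

```python
import shlex
from typing import List


def zpool_replace_cmd(pool: str, old_dev: str, new_dev: str) -> List[str]:
    """Build the argv for 'zpool replace' without executing anything."""
    for arg in (pool, old_dev, new_dev):
        # Refuse empty values and anything that could be read as an option.
        if not arg or arg.startswith("-"):
            raise ValueError(f"suspicious argument: {arg!r}")
    return ["zpool", "replace", pool, old_dev, new_dev]


# Pool name and old device are taken from the zpool status output in this thread.
# "/dev/da4" is a hypothetical name for the newly installed disk.
cmd = zpool_replace_cmd(
    "bckpool",
    "gptid/fe419d76-7d7e-11e8-b937-7085c26d2ac0",
    "/dev/da4",
)

# Print the command for review instead of running it.
print(shlex.join(cmd))
```

Because the old disk stays attached and online, ZFS can read from it while it fills the new one, which is why the failing disk should not be offlined first.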

Chris Moore · Jan 13, 2021

be7taname said:
Because there is no RAID, does that mean all the data on the faulty drive will be gone if I replace it?

If you do it wrong... Yes. Just do it right... No pressure... I would suggest having RAIDz1 as a minimum, even in a backup system. It isn't just about drive failure: without redundancy, there is no way for ZFS to correct data errors. It can only tell you that a data error happened, not fix it.
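
To illustrate the "detect but not fix" point, here is a toy model in Python. It is not how ZFS is implemented; it only assumes that each block has a stored checksum and that a read succeeds if any available copy matches it. The function names and sample data are invented for the illustration.

```python
import hashlib
from typing import Sequence


def checksum(block: bytes) -> str:
    """Checksum of a block (SHA-256 here, purely for illustration)."""
    return hashlib.sha256(block).hexdigest()


def read_block(copies: Sequence[bytes], expected: str) -> bytes:
    """Return the first copy whose checksum matches, or raise if none does."""
    for data in copies:
        if checksum(data) == expected:
            return data
    raise IOError("checksum mismatch on every copy: detected, not repairable")


good = b"backup data"
expected = checksum(good)   # checksum recorded when the block was written
corrupt = b"backup dat\x00"  # what a failing disk might hand back

# Stripe / single disk: only one copy exists, so the error is only detected.
try:
    read_block([corrupt], expected)
    single_result = "ok"
except IOError:
    single_result = "unrecoverable"

# Redundant layout: a second copy exists, so the good data can be returned.
mirror_result = read_block([corrupt, good], expected)
```

With a single copy the checksum can only report the damage; with a second copy there is something to repair from.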

be7taname · Jan 13, 2021

All of the articles I found are about how to replace a faulty disk in some kind of RAID setup.
Since my backup server has no RAID, I don't think the same procedure (which swaps the faulty disk for the new disk before resilvering) applies here, because it is not possible to resilver the data without the original faulty disk.
Below is the zpool status output. Any suggestions are very welcome!


  pool: bckpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 5 days 17:22:56 with 0 errors on Wed Jan  6 19:23:08 2021
config:

    NAME                                          STATE     READ WRITE CKSUM
    bckpool                                       ONLINE       0     0     0
      gptid/8381a4ab-4c52-11ea-8100-d8cb8aa2e8f7  ONLINE       0     0     0
      gptid/fc7d91c9-7d7e-11e8-b937-7085c26d2ac0  ONLINE       0     0     0
      gptid/e28abd1c-52a2-11ea-925a-d8cb8aa2e8f7  ONLINE       0     0     0
      gptid/fe419d76-7d7e-11e8-b937-7085c26d2ac0  ONLINE       0     0     2

errors: No known data errors
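
If you want to check output like this from a script, a minimal sketch follows. It assumes the layout shown above (a `NAME STATE READ WRITE CKSUM` header followed by one row per device) and that the counters are plain integers; very large counters may be printed in abbreviated form, which this sketch does not handle. The function name is made up for the example.

```python
from typing import Dict, Tuple

SAMPLE = """\
config:

    NAME                                          STATE     READ WRITE CKSUM
    bckpool                                       ONLINE       0     0     0
      gptid/8381a4ab-4c52-11ea-8100-d8cb8aa2e8f7  ONLINE       0     0     0
      gptid/fc7d91c9-7d7e-11e8-b937-7085c26d2ac0  ONLINE       0     0     0
      gptid/e28abd1c-52a2-11ea-925a-d8cb8aa2e8f7  ONLINE       0     0     0
      gptid/fe419d76-7d7e-11e8-b937-7085c26d2ac0  ONLINE       0     0     2

errors: No known data errors
"""


def devices_with_errors(status_text: str) -> Dict[str, Tuple[int, int, int]]:
    """Map device name -> (read, write, cksum) for rows with a nonzero counter."""
    flagged: Dict[str, Tuple[int, int, int]] = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True  # device rows start after the header
            continue
        if not in_table:
            continue
        if len(fields) < 5:
            break  # a blank line ends the device table
        name, _state, read, write, cksum = fields[:5]
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


flagged = devices_with_errors(SAMPLE)
for name, (read, write, cksum) in flagged.items():
    print(f"{name}: read={read} write={write} cksum={cksum}")
```

On the output above, this flags only the last device, the one with 2 in the CKSUM column.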

UdoB said:
Be careful: the complete POOL will be gone if the failing drive is the only disk in its vdev. I have no pointer, but there should be documentation on how to deal with this...

Please post the output of zpool status (in Code-Tags).

Best regards

be7taname · Jan 14, 2021

danb35 said:
Pretty much like it says in the manual, but don't offline the failing disk at the beginning:

Power off the machine and install the replacement disk. If you don't have a drive bay for a replacement disk, you probably aren't using very suitable hardware for FreeNAS, but that's OK for these purposes--just set it somewhere for now. As long as you have power and SATA connections, you'll be OK. Start up the machine.

Log into the web GUI, go to Storage -> Pools, click the gear, go to Settings, as per the manual.

Click the kebab menu next to the failing disk and select Replace.

In the window that comes up, choose the new disk as the replacement.

When resilvering is complete, power down the system and remove the old disk.

I think this is exactly the solution I am looking for. Thank you so much!
Just one more question: I think that in my no-RAID case the resilvering is done by copying all user data from the old disk to the new disk.
Does that mean it will also copy the data with checksum errors from the old disk to the new disk? If so, is there any benefit to replacing the faulty disk with a new one?

be7taname · Jan 14, 2021

Chris Moore said:
If you do it wrong... Yes. Just do it right... No pressure... I would suggest having RAIDz1 as a minimum, even in a backup system. It isn't just about drive failure: without redundancy, there is no way for ZFS to correct data errors. It can only tell you that a data error happened, not fix it.

Thank you for your reply. I will definitely set up some RAID to avoid my current awkward situation. I have the same questions that I asked in my reply to danb35's post.
Following his suggestion,
I think that in my no-RAID case the resilvering is done by copying all user data from the old disk to the new disk.
Does that mean it will also copy the data with checksum errors from the old disk to the new disk? If so, is there any benefit to replacing the faulty disk with a new one?
Thanks in advance for your opinions.

Chris Moore · Jan 14, 2021

be7taname said:
If so, is there any benefit to replacing the faulty disk with a new one?

The benefit is that the faulty disk would be removed before it causes more damage. The faulty disk could fully fail and take all your data with it, so replacing it saves the undamaged data. Running a scrub of the pool might identify the affected files so you can know how badly this impacts your data. It could be that your actual data is not damaged, just the metadata.

be7taname said:
I think that in my no-RAID case the resilvering is done by copying all user data from the old disk to the new disk.

That is essentially what happens, in this case, but the details of how ZFS handles it are a little more complicated. If this were a RAIDz pool, ZFS would be able to correct data errors during the resilver by checking the parity / checksum data on the rest of the pool. In a resilver, you will see ZFS reading intensely from all the disks in the vdev with the disk being replaced as well as periodic checks across all the other pool disks as the system recomputes what data is supposed to be on the target (new) disk. Even when you do an 'in place' replacement, it is not as simple as copying the data from one disk to the other, not for ZFS, because it is constantly analyzing the data and looking for mistakes that it will correct, if it can.

be7taname said:
Does that mean it will also copy the data with checksum errors from the old disk to the new disk?

With no parity data being stored in a simple stripe, I don't think there is any way for ZFS to correct the data that has been damaged by the failing disk. The best thing for you to do is delete the affected files so that when you push the next backup from your main system, that data can be copied over in undamaged form.

be7taname · Jan 31, 2021

@danb35 @Chris Moore @UdoB I finally tried it and it worked perfectly. Thank you very much for your kind help!

be7taname · Mar 1, 2021

There is a new development with my backup FreeNAS server. This time another drive has died completely.

Let me summarize the previous posts. I have a FreeNAS server used for backups. One of its storage pools consists of 4 hard disks in a simple stripe, with no RAID (or RAID0). One of the disks showed checksum errors (it was still alive, though), and I followed @danb35's procedure to replace it. The data on the old disk could be resilvered to the new disk because the old disk was still alive.

This time one of the disks is simply dead. Should I just replace the dead disk like last time? Resilvering cannot happen this time since the old disk is dead. What will happen in this case, and do you have any suggestions? Thanks! @danb35 @Chris Moore @UdoB

Chris Moore · Mar 1, 2021

be7taname said:
Should I just replace the dead disk like last time? Resilvering cannot happen this time since the old disk is dead. What will happen in this case, and do you have any suggestions?

Sorry. There is no rebuild from this. You put another disk in, make a new pool and copy all your data back onto the pool again. This is the reason we strongly discourage pools with no redundancy. When a disk catastrophically fails, the pool fails with it. The only thing to do is start over.

Chris Moore · Mar 1, 2021

be7taname said:
any suggestions? Thanks!

I suggest building a pool with redundancy this time. RAIDz1 at a minimum, but I would rather have RAIDz2 if it were me.

be7taname · Mar 1, 2021

Thanks for the quick reply @Chris Moore ! I thought about building a pool with redundancy, but the existing backup server has run out of space for installing more drives. Luckily, after several hard reboots, the disk came back to life and now works as if nothing happened.
zpool status does report the following, though.
errors: Permanent errors have been detected in the following files:
<0x1551>:<0x5098b5>

Instead, I am considering building a new file server with TrueNAS soon, since the old file server is also running out of storage space. I will post to ask for advice and suggestions before taking action, and I hope to get your advice then. Again, thank you so much!
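
As a rough aid for reading a "Permanent errors" list like the one above, the sketch below separates entries that are file paths from entries printed as `<0x...>:<0x...>`. As far as I know, ZFS prints the hex form when it cannot resolve the damaged object to a file name, but treat that interpretation as an assumption. The path in the sample is hypothetical and only there to show both cases.

```python
import re
from typing import List, Tuple

# Matches entries such as <0x1551>:<0x5098b5>
OBJ_REF = re.compile(r"^<(0x[0-9a-fA-F]+)>:<(0x[0-9a-fA-F]+)>$")

SAMPLE = """\
errors: Permanent errors have been detected in the following files:

        <0x1551>:<0x5098b5>
        /mnt/bckpool/example/file.bin
"""  # the second entry is a hypothetical path, not from this thread


def permanent_errors(status_text: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Return (file paths, unresolved hex pairs) from a zpool status -v error list."""
    paths: List[str] = []
    unresolved: List[Tuple[int, int]] = []
    in_list = False
    for line in status_text.splitlines():
        entry = line.strip()
        if entry.startswith("errors: Permanent errors"):
            in_list = True  # entries follow this header line
            continue
        if not in_list or not entry:
            continue
        match = OBJ_REF.match(entry)
        if match:
            unresolved.append((int(match.group(1), 16), int(match.group(2), 16)))
        else:
            paths.append(entry)
    return paths, unresolved


paths, unresolved = permanent_errors(SAMPLE)
print("files to delete and re-copy from the main server:", paths)
print("entries with no file name:", unresolved)
```

Entries that resolve to a path are the ones that can be deleted and pushed again from the main server, as suggested earlier in the thread.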

ChrisRJ · Mar 1, 2021

@be7taname , are you aware that this disk has serious problems and can die any second? If it contains any meaningful data, that data should be moved to a separate disk right now.

be7taname · Mar 2, 2021

@ChrisRJ , you are absolutely correct. As a matter of fact, the disk disconnected and died again less than half an hour after the hard reboot.
I didn't take it seriously because this disk is actually the new one that I used to replace the faulty one mentioned earlier in this thread.
It seems that I have to replace this disk again with another new one.
