How to replace a faulty drive in a backup FreeNAS box, no RAID

be7taname

Dabbler
Joined
Jan 12, 2021
Messages
15
Hello, one of the drives in my backup FreeNAS server is showing checksum errors.
Since this server is only the backup of my main FreeNAS server, its drives are not set up with any RAID.
So my question is: how do I replace this faulty drive? Because there is no RAID, does that mean all the data on the faulty drive is gone if I replace it?
Thanks.
 

UdoB

Dabbler
Joined
Dec 6, 2014
Messages
39
Because there is no raid, does it mean all the data on this fault drive are gone if I replaced it?

Be careful: the complete pool will be lost if a failing drive is the only device in its vdev. I don't have a link handy, but there should be documentation on how to deal with this...

Please post the output of zpool status (in Code-Tags).

Best regards
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,458
So my question is how to replace this fault drive?
Pretty much like it says in the manual, but don't offline the failing disk at the beginning:
  • Power off the machine and install the replacement disk. If you don't have a drive bay for a replacement disk, you probably aren't using very suitable hardware for FreeNAS, but that's OK for these purposes--just set it somewhere for now. As long as you have power and SATA connections, you'll be OK. Start up the machine.
  • Log into the web GUI, go to Storage -> Pools, click the gear, go to Settings, as per the manual.
  • Click the kebab menu next to the failing disk and select Replace.
  • In the window that comes up, choose the new disk as the replacement.
  • When resilvering is complete, power down the system and remove the old disk.
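For anyone curious what the GUI steps above roughly correspond to on the command line, here is a minimal dry-run sketch in Python. It only builds and prints the commands and executes nothing. The pool name and the old device's gptid are taken from the zpool status output posted later in this thread; `da4` is a made-up name for the new disk, and `replace_commands` is just an illustrative helper. On FreeNAS the GUI is still the recommended route, so treat this as an illustration rather than a procedure to follow.

```python
import shlex


def replace_commands(pool, old_dev, new_dev):
    """Return the zpool commands for an in-place replacement (nothing is executed)."""
    return [
        ["zpool", "status", pool],                      # confirm which device shows errors
        ["zpool", "replace", pool, old_dev, new_dev],   # old disk stays attached as the copy source
        ["zpool", "status", pool],                      # check resilver progress afterwards
    ]


if __name__ == "__main__":
    cmds = replace_commands(
        "bckpool",
        "gptid/fe419d76-7d7e-11e8-b937-7085c26d2ac0",
        "da4",  # hypothetical device name of the new disk
    )
    for cmd in cmds:
        print(shlex.join(cmd))
```

The key point is the same as in the list: the failing disk is not offlined or pulled first, because in a pool without redundancy it is the only source for the data being copied.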
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Because there is no raid, does it mean all the data on this fault drive are gone if I replaced it?
If you do it wrong... Yes. Just do it right... No pressure... I would suggest having RAIDz1 as a minimum, even in a backup system. It isn't just about drive failure: without redundancy, there is no way for ZFS to correct data errors. It can only tell you that a data error happened, not fix it.
 

be7taname

Dabbler
Joined
Jan 12, 2021
Messages
15
All of the articles I found describe how to replace a faulty disk in some kind of RAID setup.
Since my backup server has no RAID, I don't think the usual procedure (which swaps out the faulty disk for the new one before resilvering) applies here, because the data cannot be resilvered without the original faulty disk.
Below is the zpool status output. Any suggestions are very welcome!

  pool: bckpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 5 days 17:22:56 with 0 errors on Wed Jan 6 19:23:08 2021
config:

        NAME                                          STATE     READ WRITE CKSUM
        bckpool                                       ONLINE       0     0     0
          gptid/8381a4ab-4c52-11ea-8100-d8cb8aa2e8f7  ONLINE       0     0     0
          gptid/fc7d91c9-7d7e-11e8-b937-7085c26d2ac0  ONLINE       0     0     0
          gptid/e28abd1c-52a2-11ea-925a-d8cb8aa2e8f7  ONLINE       0     0     0
          gptid/fe419d76-7d7e-11e8-b937-7085c26d2ac0  ONLINE       0     0     2

errors: No known data errors
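The per-device error counters in that output can also be picked out programmatically. Below is a minimal Python sketch, assuming the usual multi-line layout of zpool status (a `NAME STATE READ WRITE CKSUM` header followed by one row per device). `SAMPLE` is the config section from this post laid out one row per line, and `devices_with_errors` is a hypothetical helper name. It only handles plain integer counters; abbreviated values are skipped.

```python
SAMPLE = """\
config:

        NAME                                          STATE     READ WRITE CKSUM
        bckpool                                       ONLINE       0     0     0
          gptid/8381a4ab-4c52-11ea-8100-d8cb8aa2e8f7  ONLINE       0     0     0
          gptid/fc7d91c9-7d7e-11e8-b937-7085c26d2ac0  ONLINE       0     0     0
          gptid/e28abd1c-52a2-11ea-925a-d8cb8aa2e8f7  ONLINE       0     0     0
          gptid/fe419d76-7d7e-11e8-b937-7085c26d2ac0  ONLINE       0     0     2

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True        # device rows start after the header
            continue
        if line.startswith("errors:"):
            in_config = False       # end of the device table
        if not in_config or len(fields) < 5:
            continue
        name, counters = fields[0], fields[2:5]
        if all(c.isdigit() for c in counters):
            read, write, cksum = (int(c) for c in counters)
            if read or write or cksum:
                flagged[name] = (read, write, cksum)
    return flagged


if __name__ == "__main__":
    for dev, (r, w, c) in devices_with_errors(SAMPLE).items():
        print(f"{dev}: read={r} write={w} cksum={c}")
```

Here the only row with a nonzero counter is the last disk, with 2 checksum errors, which is the one to replace.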

 

be7taname

Dabbler
Joined
Jan 12, 2021
Messages
15
Pretty much like it says in the manual, but don't offline the failing disk at the beginning:
  • Power off the machine and install the replacement disk. If you don't have a drive bay for a replacement disk, you probably aren't using very suitable hardware for FreeNAS, but that's OK for these purposes--just set it somewhere for now. As long as you have power and SATA connections, you'll be OK. Start up the machine.
  • Log into the web GUI, go to Storage -> Pools, click the gear, go to Settings, as per the manual.
  • Click the kebab menu next to the failing disk and select replace.
  • In the window that comes up, replace with the new disk
  • When resilvering is complete, power down the system and remove the old disk.
I think this is exactly the solution I am looking for. Thank you so much!
Just one more question: I think that in my no-RAID case the resilvering is done by copying all user data from the old disk to the new disk.
Does that mean it will also copy the data with checksum errors from the old disk to the new disk? If so, is there any benefit to replacing the faulty disk with a new one?
 

be7taname

Dabbler
Joined
Jan 12, 2021
Messages
15
If you do it wrong... Yes. Just do it right... No pressure... I would suggest having RAIDz1 as a minimum, even in a backup system. It isn't just about drive failure, without redundancy, there is no way for ZFS to correct data errors. It can only tell you that a data error happened, not fix it.
Thank you for your reply. I will definitely set up some RAID to avoid my current awkward situation. I have the same questions that I asked in my reply to danb35's post.
Following his suggestion,
I think that in my no-RAID case the resilvering is done by copying all user data from the old disk to the new disk.
Does that mean it will also copy the data with checksum errors from the old disk to the new disk? If so, is there any benefit to replacing the faulty disk with a new one?
Thanks in advance for your opinions.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
If that's the case, is there any benefit to replace the fault disk with new one?
The benefit is that the faulty disk would be removed before it causes more damage. The faulty disk could fully fail and take all your data with it, so replacing it saves the undamaged data. Running a scrub of the pool might identify the affected files so you can know how badly this impacts your data. It could be that your actual data is not damaged, just the metadata.
I think that the resilvering is done by copying all user data from the old disk to the new disk in my no-raid case.
That is essentially what happens, in this case, but the details of how ZFS handles it are a little more complicated. If this were a RAIDz pool, ZFS would be able to correct data errors during the resilver by checking the parity / checksum data on the rest of the pool. In a resilver, you will see ZFS reading intensely from all the disks in the vdev with the disk being replaced as well as periodic checks across all the other pool disks as the system recomputes what data is supposed to be on the target (new) disk. Even when you do an 'in place' replacement, it is not as simple as copying the data from one disk to the other, not for ZFS, because it is constantly analyzing the data and looking for mistakes that it will correct, if it can.
Does that mean it will copy the error check-sum data from old disk to the new disk as well?
With no parity data being stored in a simple stripe, I don't think there is any way for ZFS to correct the data that has been damaged by the failing disk. The best thing for you to do is delete the affected files so that when you push the next backup from your main system, that data can be copied over in undamaged form.
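The difference between detecting and repairing can be shown with a toy model. This is only a sketch of the general idea (a checksum per block, plus zero or more extra copies), not how ZFS is actually implemented. SHA-256 is used here as a stand-in checksum, and `read_block` is a made-up helper.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Stand-in checksum for this toy model."""
    return hashlib.sha256(data).hexdigest()


def read_block(copies, expected):
    """Return a copy whose checksum matches and overwrite bad copies with it.

    Raises IOError when every copy is damaged: the error is detected,
    but there is nothing to repair it from.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("checksum mismatch and no good copy to repair from")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # self-heal the damaged copy
    return good


if __name__ == "__main__":
    original = b"backup data"
    expected = checksum(original)

    stripe = [b"backup dat\x00"]             # one copy, corrupted
    mirror = [b"backup dat\x00", original]   # two copies, one corrupted

    print("mirror:", read_block(mirror, expected))
    try:
        read_block(stripe, expected)
    except IOError as err:
        print("stripe:", err)
```

With a second copy the bad block gets rewritten from the good one. With a single copy the best the reader can do is report the mismatch, which is why deleting the affected files and letting the next backup run recreate them is the practical fix.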
 

be7taname

Dabbler
Joined
Jan 12, 2021
Messages
15
There is a new development with my backup FreeNAS server: another drive has just died completely.

Let me summarize the previous posts. Basically, I have a FreeNAS server used for backups. One of its storage pools consists of 4 hard disks in a simple stripe with no redundancy (RAID 0). One of the disks showed checksum errors (though it was still alive), so I followed @danb35's procedure to replace it. Because the old disk was still alive, its data could be resilvered onto the new disk.

This time one of the disks is simply dead. Should I just replace the dead disk as I did last time? Resilvering cannot happen this time since the old disk is dead. What will happen in this case, and do you have any suggestions? Thanks! @danb35 @Chris Moore @UdoB
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Should I just replace the dead disk as last time? But the resilvering will not happen this time since the old disk is dead. What will happen in this case and any suggestions?
Sorry. There is no rebuild from this. You put another disk in, make a new pool and copy all your data back onto the pool again. This is the reason we strongly discourage pools with no redundancy. When a disk catastrophically fails, the pool fails with it. The only thing to do is start over.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
suggestions? Thanks!
I suggest building a pool with redundancy this time. RAIDz1 at a minimum, but I would rather have RAIDz2 if it were me.
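To put rough numbers on that advice, here is a small Python sketch comparing how likely a single 4-disk vdev is to survive as a stripe, RAIDz1, or RAIDz2. It assumes disks fail independently with the same probability, which is a simplification, and the 5% figure is purely hypothetical. `pool_survival` is an illustrative helper, not anything from ZFS.

```python
from math import comb


def pool_survival(n_disks, p_fail, parity):
    """Probability that at most `parity` of `n_disks` fail (independent failures)."""
    return sum(
        comb(n_disks, k) * p_fail**k * (1 - p_fail) ** (n_disks - k)
        for k in range(parity + 1)
    )


if __name__ == "__main__":
    p = 0.05  # hypothetical per-disk failure probability over some period
    for label, parity in [("stripe", 0), ("RAIDz1", 1), ("RAIDz2", 2)]:
        print(f"{label}: {pool_survival(4, p, parity):.4f}")
```

A stripe tolerates zero failures, so its survival probability is just (1 - p) to the power of the number of disks, and it gets worse with every disk added. Each level of parity adds the cases where one more disk can fail without losing the pool.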
 

be7taname

Dabbler
Joined
Jan 12, 2021
Messages
15
Thanks for the quick reply, @Chris Moore! I thought about building a pool with redundancy, but the existing backup server has run out of room for more drives. Luckily, after several hard reboots, the disk came back to life and now works as if nothing had happened.
zpool status does report the following, though:
errors: Permanent errors have been detected in the following files:
<0x1551>:<0x5098b5>

Instead, I am considering building a new file server with TrueNAS soon, since the old file server is also running out of storage space. I will post to ask for advice and suggestions before taking action. I hope I can get your advice then. Again, thank you so much!
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,906
@be7taname, are you aware that this disk has serious problems and can die at any second? If it contains any meaningful data, that data should be moved to a separate disk right now.
 

be7taname

Dabbler
Joined
Jan 12, 2021
Messages
15
@ChrisRJ, you are absolutely correct. As a matter of fact, the disk disconnected and died again less than half an hour after the hard reboot.
I didn't take it seriously because this disk is actually the new one that I used to replace the faulty one mentioned in this thread.
It seems that I have to replace this disk again with another new one.
 