Lost pool while swapping drives

Daryle

Dabbler
Joined
Jan 26, 2017
Messages
13
Hello -
I was upgrading a 3-disk pool (RAIDZ1) and had an unexpected disk failure on disk 2 while resilvering disk 3 (the last one, too). I'm not sure what happened to disk 2, but I am no longer able to read it on any system; maybe its logic board went bad.

I powered the system down and replaced all the new disks with the original 3, thinking that when the OS booted it would see the older disks and maybe mount the pool. There was nothing wrong with the original disks; I was simply upgrading to larger ones. It seems it doesn't work that way.

With the 3 original disks, this is what I see:
Code:
root@freenas[~]# zpool import
   pool: personalpool
     id: 2428394965905809582
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: http://illumos.org/msg/ZFS-8000-3C
 config:

        personalpool                                    UNAVAIL  insufficient replicas
          raidz1-0                                      UNAVAIL  insufficient replicas
            10914419384285364653                        UNAVAIL  cannot open
            gptid/cc293029-739f-11e7-a1f4-0cc47ae203ea  ONLINE
            4823201479263356504                         UNAVAIL  cannot open
root@freenas[~]#


The pool name still showed up in the OS but was listed as "unknown." I tried detaching it, thinking that might clear out a cache that was looking for the new disks.
The new disks were all bought together and are the same make, model, firmware revision, etc. I swapped the logic boards between disks 3 and 2, since disk 2 had completed the resilvering process and disk 3 was just starting.

I am currently at this point: I have the new disks 1 and 2 attached (since they did complete the resilver process) along with the original disk 3 (which shouldn't really matter, as two disks should still see the pool data).
Code:
root@freenas[~]# zpool import
   pool: personalpool
     id: 2428394965905809582
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: http://illumos.org/msg/ZFS-8000-3C
 config:

        personalpool                                    UNAVAIL  insufficient replicas
          raidz1-0                                      UNAVAIL  insufficient replicas
            gptid/eeb377af-6560-11ea-9186-0cc47ae203ea  ONLINE
            17284048339211250722                        OFFLINE
            4823201479263356504                         UNAVAIL  cannot open


I searched for a way to force the "offline" disk back online, thinking that if I could just get one disk back, I could get my data off and move on.

Code:
root@freenas[~]# zpool online personalpool 17284048339211250722
cannot open 'personalpool': no such pool
root@freenas[~]# zpool online personalpool 4823201479263356504
cannot open 'personalpool': no such pool
root@freenas[~]#
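
From what I've found, zpool online only works on a pool that's already imported, so I'd presumably need a forced, read-only import to succeed first. This is what I found suggested (untested on my pool so far, since I understand the rewind options can be risky):
Code:
# Found while searching -- flags are from the zpool man page, untested here.
# Forced, read-only import (non-destructive):
zpool import -f -o readonly=on -R /mnt personalpool
# -F rewinds to an earlier transaction group; -n only reports whether the
# rewind would succeed, without modifying anything:
zpool import -F -n personalpool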


Beyond that, are there commands I can run to piece these disks back together and get the data off, or is this just toast?
 
Basil

Joined
Jan 4, 2014
Messages
1,644
Replacing disks to grow a pool without a reliable backup is a recipe for disaster. The User Guide makes this point very clear.

If an unused disk port or bay is not available, a drive can be replaced with a larger one as shown in Replacing a Failed Disk. This process is slow and places the system in a degraded state. Since a failure at this point could be disastrous, do not attempt this method unless the system has a reliable backup. Replace one drive at a time and wait for the resilver process to complete on the replaced drive before replacing the next drive. After all the drives are replaced and the final resilver completes, the added space appears in the pool.
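
In practice, the procedure the Guide describes boils down to a replace-and-wait loop. As a sketch, with placeholder pool and device names, each disk goes like this:
Code:
# One disk at a time (placeholder names); never start the next replacement
# until the previous resilver has finished:
zpool set autoexpand=on personalpool
zpool replace personalpool gptid/OLD-DISK gptid/NEW-DISK
zpool status personalpool    # wait here until the resilver completes
# After the final disk is replaced and resilvered, the added space
# appears in the pool:
zpool list personalpool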

It's been said over and over on this forum: disk redundancy is not a substitute for a reliable backup. A backup is necessary to protect your data from situations like the one you've experienced. It's a tough lesson to learn (I've been there!).

The situation you have experienced also reinforces the need to validate new disks before using them on a production system. For further details, refer to the resource Hard Drive Burn-in Testing.

Finally, this situation highlights the weakness of RAIDZ1. With an error on one disk and a resilver of another, what you have is the equivalent of a two-disk failure, which RAIDZ1 cannot survive. Had this occurred under RAIDZ2, the disaster might well have been averted.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Grasping at straws here: when you had the three original drives in the system, did you import a previous configuration file?
 

Daryle

Dabbler
Joined
Jan 26, 2017
Messages
13
Replacing disks to grow a pool without a reliable backup is a recipe for disaster. The User Guide makes this point very clear. ...

Basil,
I understand these points. Thank you.

The disks were vetted prior to installation. They were brand new out of the box and tested fine. Admittedly, I didn't run a long test on the four 12TB disks, so I relied on the initial SMART tests.

This pool had config info for some of my test Docker containers and some VMware iSCSI mount points. The data on the disks, while it would be nice to have it back and save myself some time re-creating it, isn't all that important. I have a second, primary pool that has active replication for my important data.

Grasping at straws here: when you had the three original drives in the system, did you import a previous configuration file?
No, these were new-out-of-the-box disks and a newly created pool from when I initially installed FreeNAS, back around version 9.3. They've been around.

It feels like the system is looking for other disks that aren't present. When I do a "zpool import" (with -f or -FX), it sees the pool. I almost feel like it just can't mount the disks because they're not part of that pool ID? Anyway, I didn't know if there was a way to access the metadata or header info on the disks to correct the IDs, UUIDs, whatever-IDs, or data needed to restore the pool.
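
For what it's worth, I gather the on-disk labels can at least be read with zdb, which should show the pool GUID, each disk's vdev GUID, and the last transaction group written. Something like this against one of my members:
Code:
# Dump the ZFS labels on a member disk; the output includes pool_guid,
# the vdev guid, and the txg the label was last written at:
zdb -l /dev/gptid/cc293029-739f-11e7-a1f4-0cc47ae203ea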

The data on these disks isn't all that important. While I don't have a backup for this pool, it's just config data for Docker stuff. It won't take long to re-create.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
When you say they "tested fine", what is that based on? A SMART short test? Or did you burn them in? Burn-in takes some time, but it is worth it.

Unfortunately, RAIDZ1 failed you. Sucks. Pretty sure your pool is garbage now.

On another note, if you have any spare slots, you can replace disks without needing to offline any disks. This reduces the risk. Also, if you have lots of spare slots, you can replace more than one disk at a time with no increased risk.

Cheers,
 

Daryle

Dabbler
Joined
Jan 26, 2017
Messages
13
When you say they "tested fine", what is that based on? A SMART short test? Or did you burn them in? ...
Correct, SMART short tests on all 4 disks.

Lessons learned all around yesterday. I've done similar hardware upgrades on other ZFS pools without a hitch. Yesterday just wasn't my day.

Unfortunately I don't have empty ports in my chassis. Solid advice though, and much appreciated.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The disks were vetted prior to installation. They were brand new out of the box and tested fine.

Correct, SMART short tests on all 4 disks.

So, not vetted, and not even more than tritely tested. I'm sorry this has happened to you, but this is why we encourage things like burn-in testing.

Initial burn-in testing of a system, or of any new component such as an HDD, needs to be extensive and long. A minimum of 1,000 hours of testing is what is preferred here in the shop. This gives sufficient opportunity for infant mortality to rear its awful and ugly head, as it seems to have done to you.
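
In concrete terms, a basic burn-in pass on a new drive looks something like the sketch below; the device name is a placeholder, and badblocks in write mode destroys everything on the disk:
Code:
# Basic burn-in for a brand-new drive (placeholder device; badblocks -w
# is DESTRUCTIVE -- only run it on a disk with nothing on it):
smartctl -t long /dev/ada3           # extended SMART self-test
badblocks -ws -b 4096 /dev/ada3      # four-pattern destructive write/verify
smartctl -t long /dev/ada3           # re-test after the write passes
smartctl -A /dev/ada3                # check SMART attributes for new errors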

People are all like "that stuff's for paranoids" right up to the point where it bites them. Now, let's be clear, many people might say I'm paranoid, but then again that's served me well over many years.

Your data may not be lost, but recovering it may be a bit beyond the scope of this forum. Ideally, you should make copies of the drives involved, set the originals aside, and then see what you can do about recovering from the drive copies.
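
Mechanically, that copying step can be as simple as a sector-level dd per drive, or ddrescue for a disk that throws read errors. Device names below are placeholders; triple-check if= and of= before running anything:
Code:
# Clone a suspect member onto a spare of equal or larger size (placeholder
# devices -- getting if=/of= backwards destroys the source):
dd if=/dev/ada1 of=/dev/ada4 bs=1m conv=noerror,sync
# For a drive with read errors, ddrescue retries and logs bad regions:
ddrescue -f /dev/ada1 /dev/ada4 /root/ada1.map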
 
Joined
Feb 13, 2021
Messages
3
When you say they "tested fine", what is that based on? A SMART short test? Or did you burn them in? ...

Sorry for bringing up an old thread, but I was wondering: if I have an 8x8TB RAIDZ2 pool, could I replace all my drives at once with 8x16TB drives, since I have enough ports? Otherwise, I think I will do one at a time, or two if I can.

I was going to create a new vdev with the new drives, but I think it would be better to replace them instead.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Sorry for bringing up an old thread, but I was wondering: if I have an 8x8TB RAIDZ2 pool, could I replace all my drives at once with 8x16TB drives, since I have enough ports? ...
If you have the bays, you can replace them all at the same time. I have done that before; works fine.
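
As a rough sketch with placeholder names, that's just one replace per disk, and the resilvers run concurrently:
Code:
# All eight new drives attached alongside the old ones (placeholder names);
# the old disks stay online, so redundancy never drops:
zpool set autoexpand=on tank
zpool replace tank gptid/OLD-1 gptid/NEW-1
zpool replace tank gptid/OLD-2 gptid/NEW-2
# ...and so on for the remaining disks. Once the last resilver finishes,
# the pool grows into the new capacity:
zpool list tank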
 