Lost pool while swapping drives

Daryle

Dabbler
Joined
Jan 26, 2017
Messages
13
Hello -
I was upgrading a 3-disk pool (RAIDZ1) and had an unexpected disk failure on disk 2 while resilvering disk 3 (the last one, too). I'm not sure what happened to disk 2, but I am no longer able to read it on any system; maybe its logic board went bad.

I powered the system down and replaced all the new disks with the original 3, thinking that when the OS booted it would see the older disks and maybe mount the pool. There was nothing wrong with the original disks; I was simply upgrading to larger ones. It seems it doesn't work that way.

With the 3 original disks, this is what I see:
Code:
root@freenas[~]# zpool import
   pool: personalpool
     id: 2428394965905809582
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: http://illumos.org/msg/ZFS-8000-3C
 config:

        personalpool                                    UNAVAIL  insufficient replicas
          raidz1-0                                      UNAVAIL  insufficient replicas
            10914419384285364653                        UNAVAIL  cannot open
            gptid/cc293029-739f-11e7-a1f4-0cc47ae203ea  ONLINE
            4823201479263356504                         UNAVAIL  cannot open
root@freenas[~]#


The pool name still showed up in the OS but was listed as "unknown." I tried detaching it, thinking that might clear out a cache that was looking for the new disks.
The new disks were all bought together and are the same make, model, firmware revision, etc. I swapped the logic boards between disks 3 and 2, since disk 2 had completed the resilvering process and disk 3 was just starting.

I am currently at this point: I have the new disks 1 and 2 attached (since they did complete the resilver process) along with the original disk 3 (which shouldn't really matter, as two disks should still see the pool data).
Code:
root@freenas[~]# zpool import
   pool: personalpool
     id: 2428394965905809582
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: http://illumos.org/msg/ZFS-8000-3C
 config:

        personalpool                                    UNAVAIL  insufficient replicas
          raidz1-0                                      UNAVAIL  insufficient replicas
            gptid/eeb377af-6560-11ea-9186-0cc47ae203ea  ONLINE
            17284048339211250722                        OFFLINE
            4823201479263356504                         UNAVAIL  cannot open


I searched for a way to force the "offline" disk back online, thinking that if I could just get one disk back, I could get my data off and move on.

Code:
root@freenas[~]# zpool online personalpool 17284048339211250722
cannot open 'personalpool': no such pool
root@freenas[~]# zpool online personalpool 4823201479263356504
cannot open 'personalpool': no such pool
root@freenas[~]#
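
From what I've found, zpool online only works on a pool that's already imported, so I'd presumably need a forced, read-only import to succeed first. This is what I found suggested (untested on my pool so far, since I understand the rewind options can be risky):
Code:
# Found while searching -- flags are from the zpool man page, untested here.
# Forced, read-only import (non-destructive):
zpool import -f -o readonly=on -R /mnt personalpool
# -F rewinds to an earlier transaction group; -n only reports whether the
# rewind would succeed, without modifying anything:
zpool import -F -n personalpool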


Beyond that, are there commands I can run to piece these disks back together and get the data off, or is this just toast?
 
Basil

Joined
Jan 4, 2014
Messages
1,644
Replacing disks to grow a pool without a reliable backup is a recipe for disaster. The User Guide makes this point very clear.

If an unused disk port or bay is not available, a drive can be replaced with a larger one as shown in Replacing a Failed Disk. This process is slow and places the system in a degraded state. Since a failure at this point could be disastrous, do not attempt this method unless the system has a reliable backup. Replace one drive at a time and wait for the resilver process to complete on the replaced drive before replacing the next drive. After all the drives are replaced and the final resilver completes, the added space appears in the pool.
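
In practice, the procedure the Guide describes boils down to a replace-and-wait loop. As a sketch, with placeholder pool and device names, each disk goes like this:
Code:
# One disk at a time (placeholder names); never start the next replacement
# until the previous resilver has finished:
zpool set autoexpand=on personalpool
zpool replace personalpool gptid/OLD-DISK gptid/NEW-DISK
zpool status personalpool    # wait here until the resilver completes
# After the final disk is replaced and resilvered, the added space
# appears in the pool:
zpool list personalpool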

It's been said over and over on this forum: disk redundancy is not a substitute for a reliable backup. A backup is necessary to protect your data from situations like the one you've experienced. It's a tough lesson to learn (I've been there!).

The situation you have experienced also reinforces the need to validate new disks before using them on a production system. For further details, refer to the resource Hard Drive Burn-in Testing.

Finally, this situation highlights the weakness of RAIDZ1. With an error on one disk and a resilver of another, what you have is the equivalent of a two-disk failure, which RAIDZ1 cannot survive. Had this occurred under RAIDZ2, the disaster might well have been averted.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Grasping at straws here: when you had the three original drives in the system, did you import a previous configuration file?
 

Daryle

Dabbler
Joined
Jan 26, 2017
Messages
13
Replacing disks to grow a pool without a reliable backup is a recipe for disaster. The User Guide makes this point very clear. ...

Basil,
I understand these points. Thank you.

The disks were vetted prior to installation. They were brand new out of the box and tested fine. Admittedly, I didn't run a long test on the four 12TB disks, so I relied on the initial SMART tests.

This pool had config info for some of my test Docker containers and some VMware iSCSI mount points. The data on the disks, while it would be nice to have it back and save myself some time re-creating it, isn't all that important. I have a second, primary pool that has active replication for my important data.

Grasping at straws here: when you had the three original drives in the system, did you import a previous configuration file?
No, these were new-out-of-the-box disks and a newly created pool from when I initially installed FreeNAS, back around version 9.3. They've been around.

It feels like the system is looking for other disks that aren't present. When I do a "zpool import" (with -f or -FX), it sees the pool. I almost feel like it just can't mount the disks because they're not part of that pool ID? Anyway, I didn't know if there was a way to access the metadata or header info on the disks to correct the IDs, UUIDs, whatever-IDs, or data needed to restore the pool.
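
For what it's worth, I gather the on-disk labels can at least be read with zdb, which should show the pool GUID, each disk's vdev GUID, and the last transaction group written. Something like this against one of my members:
Code:
# Dump the ZFS labels on a member disk; the output includes pool_guid,
# the vdev guid, and the txg the label was last written at:
zdb -l /dev/gptid/cc293029-739f-11e7-a1f4-0cc47ae203ea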

The data on these disks isn't all that important. While I don't have a backup for this pool, it's just config data for Docker stuff. It won't take long to re-create.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
When you say they "tested fine", what is that based on? A SMART short test? Or did you burn them in? Burn-in takes some time, but it is worth it.

Unfortunately, RAIDZ1 failed you. Sucks. Pretty sure your pool is garbage now.

On another note, if you have any spare slots, you can replace disks without needing to offline any disks. This reduces the risk. Also, if you have lots of spare slots, you can replace more than one disk at a time with no increased risk.

Cheers,
 

Daryle

Dabbler
Joined
Jan 26, 2017
Messages
13
When you say they "tested fine", what is that based on? A SMART short test? Or did you burn them in? ...
Correct, SMART short tests on all 4 disks.

Lessons learned all around yesterday. I've done similar hardware upgrades on other ZFS pools without a hitch. Yesterday just wasn't my day.

Unfortunately I don't have empty ports in my chassis. Solid advice though, and much appreciated.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The disks were vetted prior to installation. They were brand new out of the box and tested fine.

Correct, SMART short tests on all 4 disks.

So, not vetted, and not even more than tritely tested. I'm sorry this has happened to you, but this is why we encourage things like burn-in testing.

Initial burn-in testing of a system, or of any new component such as an HDD, needs to be extensive and long. A minimum of 1,000 hours of testing is what is preferred here in the shop. This gives sufficient opportunity for infant mortality to rear its awful and ugly head, as it seems to have done to you.
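
In concrete terms, a basic burn-in pass on a new drive looks something like the sketch below; the device name is a placeholder, and badblocks in write mode destroys everything on the disk:
Code:
# Basic burn-in for a brand-new drive (placeholder device; badblocks -w
# is DESTRUCTIVE -- only run it on a disk with nothing on it):
smartctl -t long /dev/ada3           # extended SMART self-test
badblocks -ws -b 4096 /dev/ada3      # four-pattern destructive write/verify
smartctl -t long /dev/ada3           # re-test after the write passes
smartctl -A /dev/ada3                # check SMART attributes for new errors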

People are all like "that stuff's for paranoids" right up to the point where it bites them. Now, let's be clear, many people might say I'm paranoid, but then again that's served me well over many years.

Your data may not be lost, but recovering it may be a bit beyond the scope of this forum. Ideally, you should make copies of the drives involved, set the originals aside, and then see what you can do about recovering from the drive copies.
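
Mechanically, that copying step can be as simple as a sector-level dd per drive, or ddrescue for a disk that throws read errors. Device names below are placeholders; triple-check if= and of= before running anything:
Code:
# Clone a suspect member onto a spare of equal or larger size (placeholder
# devices -- getting if=/of= backwards destroys the source):
dd if=/dev/ada1 of=/dev/ada4 bs=1m conv=noerror,sync
# For a drive with read errors, ddrescue retries and logs bad regions:
ddrescue -f /dev/ada1 /dev/ada4 /root/ada1.map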
 
Joined
Feb 13, 2021
Messages
3
When you say they "tested fine", what is that based on? A SMART short test? Or did you burn them in? ...

Sorry for bringing up an old thread, but I was wondering: if I have an 8x8TB RAIDZ2 pool, could I replace all my drives at once with 8x16TB drives, since I have enough ports? Otherwise, I think I will do one at a time, or two if I can.

I was going to create a new vdev with the new drives, but I think it would be better to replace them instead.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Sorry for bringing up an old thread, but I was wondering: if I have an 8x8TB RAIDZ2 pool, could I replace all my drives at once with 8x16TB drives, since I have enough ports? ...
If you have the bays, you can replace them all at the same time. I have done that before; works fine.
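
As a rough sketch with placeholder names, that's just one replace per disk, and the resilvers run concurrently:
Code:
# All eight new drives attached alongside the old ones (placeholder names);
# the old disks stay online, so redundancy never drops:
zpool set autoexpand=on tank
zpool replace tank gptid/OLD-1 gptid/NEW-1
zpool replace tank gptid/OLD-2 gptid/NEW-2
# ...and so on for the remaining disks. Once the last resilver finishes,
# the pool grows into the new capacity:
zpool list tank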
 