Pool gone, 4 drives missing suddenly

Joined
Jul 31, 2018
Messages
6
The RAIDZ2 volume on my system will no longer mount. Suddenly, overnight, the pool began failing to import due to 4 of the 8 drives being "missing". I find it highly unlikely (though not impossible) that four drives failed at the exact same time.

I am on an older version of FreeNAS (11.7-U7) with an encrypted volume. I'd like to change both of those things (upgrade and remove the encryption), but first I need to get this back up and running.

Here is the output from `zpool status` and `zpool import`:

Code:
root@gibibyte:~ # zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:07 with 0 errors on Sun Sep  3 03:45:07 2023
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada5p2    ONLINE       0     0     0

errors: No known data errors
root@gibibyte:~ # zpool import
   pool: gibibyte-data
     id: 11272315687565251573
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: http://illumos.org/msg/ZFS-8000-3C
 config:

        gibibyte-data                                       UNAVAIL  insufficient replicas
          raidz2-0                                          UNAVAIL  insufficient replicas
            gptid/88b275e9-30b3-11ee-84f6-d05099c3831a.eli  ONLINE
            gptid/129c2efc-9846-11e8-9db3-d05099c3831a.eli  ONLINE
            12583690904832772633                            UNAVAIL  cannot open
            gptid/143c34af-9846-11e8-9db3-d05099c3831a.eli  ONLINE
            12473765029923333105                            UNAVAIL  cannot open
            10274762815153801241                            UNAVAIL  cannot open
            gptid/16ce3ffa-9846-11e8-9db3-d05099c3831a.eli  ONLINE
            9609449009153103517                             UNAVAIL  cannot open


All eight drives are detected by the system. I have tried swapping cables etc. to see if it could be a cabling problem, but it doesn't seem to be (the four "missing" drives are still missing and the four others are seen as available and valid if plugged into ports previously belonging to missing ones).

Where can I start? Thanks!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Notably missing are details of your hardware... It's hard to intelligently discuss this without knowing what we're discussing.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
In particular, how the disks are attached to the motherboard: direct SATA, mini-SAS, HBA, etc.
Also the size and type of PSU, and how the PSU is powering the disks - via how many and what type of cables.
 
Joined
Jul 31, 2018
Messages
6
The system is an ASRock Intel Avoton C2750 board, installed in a case with a hot-swap backplane and room for 8 spinners. The drives, which are 8 TB shucked Western Digital MyBook drives, are all plugged directly into the board. As I said though, moving the drives around has no effect (swapping a visible drive with an invisible one doesn't result in the visible one becoming invisible), so I don't believe it's a cabling or backplane issue. Can I provide any more detail about the hardware that might be helpful?
 
Joined
Jul 31, 2018
Messages
6
I also wanted to add that the PSU has plenty of wattage, has not failed, and has been in use in this system for years. Additionally, the four "missing" drives are even available to create a new volume with in the UI, so they're present, accounted for, and working.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Additionally, the four "missing" drives are even available to create a new volume with in the UI,
Ooh, that sounds bad.
Suddenly, overnight, the pool began failing to import due to 4 of the 8 drives being "missing".
Not that I can see how that would cause this, but are you frequently rebooting your server?
 
Joined
Jul 31, 2018
Messages
6
No, I don't typically reboot it. This actually happened a month or so ago, so I don't have 100% clear memory of what happened in the lead-up to this, or if the uptime was reset when I noticed the problem.

If it helps, I don't recall exactly what I did, but I looked at the data on the drives in hex format, and the "missing" ones appear to be valid encrypted ZFS volumes, just like the others. I don't believe any data has been lost.
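If it would be more useful than my hex-editor check, I believe something along these lines would dump the on-disk GELI metadata without attaching anything (the gptid here is a placeholder, not one of my real disks):

Code:
# List GPT labels to find the partition behind each "missing" disk
glabel status | grep gptid

# Dump the GELI metadata stored on one of those partitions
geli dump /dev/gptid/<gptid-of-a-missing-partition>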
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
I would not rely on the PSU being good unless you can test it in a different system, with a higher load, or have a load tester.

I had similar issues to yours when my 6-yo PSU died slowly and took the pool into a sufficiently-degraded mode where ZFS suggested I start over after destroying the pool.

You may also have a failing motherboard, which could be explained if all drives were attached to a particular HBA / SATA port collection.

Once I replaced my PSU, all was well again.
 
Joined
Jul 31, 2018
Messages
6
Pardon my ignorance, but how could it be the PSU or motherboard if a) the drives are all powered on, detected by the OS, and have data, and b) I've tried swapping the cables into different ports and moving things around (e.g. moving bad drives to known-good ports, and vice versa) with no change? There are also multiple SATA port collections; I've tried both.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, steady states are nice to analyze because they're simple, but the real world has moving parts.

What I mean by that is that perhaps the PSU is in good enough shape to stay in spec with a constant load, but in bad enough shape to not handle substantial transients very well. Still, I hate to blame everything on PSUs - they're easy targets and the accusations they're under might as well be philosophical because nobody's going to be able to prove it either way...

So let's examine other avenues first (unless you happen to have a known-good PSU you can try it out with):
  1. Do you have backups of your keys and GELI metadata? (this is GELI and not ZFS native encryption, right?)
  2. Have you tried manually unlocking one of the "missing" disks using the aforementioned backups? A rough sketch of what that looks like is below.
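Roughly speaking - and assuming the usual FreeNAS layout where the per-pool GELI key lives under /data/geli/ and the pool partitions carry gptid labels (the key filename and gptid below are placeholders, so adjust them to your system) - a manual unlock attempt looks something like this:

Code:
# Find the gptid of one of the "missing" partitions:
glabel status | grep gptid

# Attach it by hand with the backed-up key. Drop -p to be prompted
# for the passphrase instead, if one is set on the pool:
geli attach -p -k /data/geli/<pool-key-uuid>.key /dev/gptid/<missing-disk-gptid>

# If it succeeds, a new .eli device appears and `geli status` lists it.


If the attach complains about corrupt or missing metadata, restoring the GELI metadata backup with geli restore would be the next step - but let's see what the attach attempt says first.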
 
Joined
Jul 31, 2018
Messages
6
What I mean by that is that perhaps the PSU is in good enough shape to stay in spec with a constant load, but in bad enough shape to not handle substantial transients very well.

I'm open to the idea, especially since I find it highly suspect that exactly half of the drives can't be seen by ZFS. But, I'm honestly skeptical - I've never had a failed PSU where the items powered by the PSU were still visible and 100% OK to the OS.

  1. Do you have backups of your keys and GELI metadata? (this is GELI and not ZFS native encryption, right?)

I do.

  2. Have you tried manually unlocking one of the "missing" disks using the aforementioned backups?

I do not know how to do that, which is why I'm here.

FWIW, this is what it looks like in `/var/log/messages` when I try to decrypt using the UI:

Code:
Oct 19 19:40:13 gibibyte GEOM_ELI: Device gptid/129c2efc-9846-11e8-9db3-d05099c3831a.eli created.
Oct 19 19:40:13 gibibyte GEOM_ELI: Encryption: AES-XTS 256
Oct 19 19:40:13 gibibyte GEOM_ELI:     Crypto: hardware
Oct 19 19:40:15 gibibyte GEOM_ELI: Device gptid/143c34af-9846-11e8-9db3-d05099c3831a.eli created.
Oct 19 19:40:15 gibibyte GEOM_ELI: Encryption: AES-XTS 256
Oct 19 19:40:15 gibibyte GEOM_ELI:     Crypto: hardware
Oct 19 19:40:17 gibibyte GEOM_ELI: Device gptid/16ce3ffa-9846-11e8-9db3-d05099c3831a.eli created.
Oct 19 19:40:17 gibibyte GEOM_ELI: Encryption: AES-XTS 256
Oct 19 19:40:17 gibibyte GEOM_ELI:     Crypto: hardware
Oct 19 19:40:19 gibibyte GEOM_ELI: Device gptid/88b275e9-30b3-11ee-84f6-d05099c3831a.eli created.
Oct 19 19:40:19 gibibyte GEOM_ELI: Encryption: AES-XTS 256
Oct 19 19:40:19 gibibyte GEOM_ELI:     Crypto: hardware
Oct 19 19:40:19 gibibyte ZFS: vdev state changed, pool_guid=11272315687565251573 vdev_guid=9263826588309144475
Oct 19 19:40:19 gibibyte ZFS: vdev state changed, pool_guid=11272315687565251573 vdev_guid=12084348549073075469
Oct 19 19:40:19 gibibyte ZFS: vdev state changed, pool_guid=11272315687565251573 vdev_guid=12583690904832772633
Oct 19 19:40:19 gibibyte ZFS: vdev state changed, pool_guid=11272315687565251573 vdev_guid=4241459707677346508
Oct 19 19:40:19 gibibyte ZFS: vdev state changed, pool_guid=11272315687565251573 vdev_guid=12473765029923333105
Oct 19 19:40:19 gibibyte ZFS: vdev state changed, pool_guid=11272315687565251573 vdev_guid=10274762815153801241
Oct 19 19:40:19 gibibyte ZFS: vdev state changed, pool_guid=11272315687565251573 vdev_guid=17678928252760578401
Oct 19 19:40:20 gibibyte ZFS: vdev state changed, pool_guid=11272315687565251573 vdev_guid=9609449009153103517
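Looking at that, only four .eli devices actually get created, which matches the four disks ZFS can see. If it helps, I can run something like the following to cross-check which providers are attached and what ZFS labels they carry (the gptid below is just one of the ones that did attach):

Code:
# List the GELI providers that are currently attached
geli status

# Read the ZFS label (pool and vdev GUIDs) from one attached device
zdb -l /dev/gptid/129c2efc-9846-11e8-9db3-d05099c3831a.eli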
 

jlpellet

Patron
Joined
Mar 21, 2012
Messages
287
I also don't have much hope this will help, but I suggest the following.

1. Replace/rearrange the drive power cables - maybe a PSU branch cable died, and I did not see mention of trying this.

2. Reseat all data cables, cards, power cables, RAM, etc. on the motherboard - I've seen this cause weird issues.

3. Physically pull the power cable from the wall - the only way to really power-cycle everything.

4. Reinstall the OS - no real hope, but easy to do.

Hope this helps, and I hope you have a good backup.

John
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Has your "drive swapping" exercise eliminated any questions about the backplane? I believe I would want to directly connect the drives to power and SATA cables and go through the exchange process to test ports, cables and drives.

As an ex-FreeNAS Mini owner, I note that your motherboard is one of those subject to at least two potential failure modes reported in multiple posts here. Your board and BIOS versions might help the diagnosis, as, IIRC, there was version-dependent variability in failure experience.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I highly recommend that you track the drives by serial number. I don't like to assume anything, but when you said:
As I said though, moving the drives around has no effect (swapping a visible drive with an invisible one doesn't result in the visible one becoming invisible)
That does not tell me that the same drives were recognized - see my point? So, from my perspective, this could still be a backplane issue until you confirm that the working drives always show the same serial numbers when you move them around.
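For example, something like this will pair each device node with its serial number so you can confirm which physical drive is which after a swap (exact output formatting varies a bit between versions; ada0 is just an example device):

Code:
# List every disk with its device name and serial (ident) field
geom disk list | grep -E 'Name|ident'

# Or per drive, via smartmontools
smartctl -i /dev/ada0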
installed in a case with a backplane for hot swapping and room for 8 spinners.
This also has me making assumptions; the model of this case/backplane would help out greatly. Are the SATA ports passed directly through to the drives? I could assume so, but making assumptions does not help anyone.

You have some good advice in this thread but make sure you are clear with what you are trying to convey. Don't let us make assumptions so better advice can be provided to you.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
All eight drives are detected by the system. I have tried swapping cables etc. to see if it could be a cabling problem, but it doesn't seem to be (the four "missing" drives are still missing and the four others are seen as available and valid if plugged into ports previously belonging to missing ones).

Where can I start? Thanks!
This suggests a power problem for the four drives that stay consistently offline despite you swapping the data cables - the data cables themselves clearly seem to be functioning.

Perhaps the power to the backplane got interrupted? The PSU outlet that powered the four unavailable drives failed (in case you have a PSU with discrete SATA power outlets)?

Losing a block of drives usually points to a loose power connector, bad PSU, or the failure of a major component like the Mini-SAS connector on the motherboard.

The power connectors on the backplane can go bad, speaking from personal experience.
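One cheap check, if the logs go back far enough: power dropouts usually leave CAM timeout/reset chatter behind, so a quick search of the system log may show whether those four disks were complaining before they disappeared (older rotated logs may be compressed, so this only covers the current file):

Code:
# Look for disk timeouts/resets/retries in the current system log
grep -iE 'timeout|reset|retrying' /var/log/messages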
 