RAIDZ2-60-Disk-Pool Unavailable after a RAIDZ2 vdev Failed

adeelleo

Dabbler
Joined
Mar 28, 2021
Messages
38
Hi Dice,

After a couple of reboots the machine was not booting at all, since it could not even find the boot-pool.

So I entered the two commands below to import the pools directly from the console in front of the server:

zpool import -f "boot-pool"
zpool import -f "RAIDZ2-60-Disk-Pool"

boot-pool was available immediately.
RAIDZ2-60-Disk-Pool took several hours to import. It's visible in the GUI once more, but now it's showing lots of disks as degraded. This many disks can't go faulty so quickly. This looks like a bug in TrueNAS SCALE.

Below is the output of the zpool status command.
Not sure why the formatting gets mangled each time, so I'm sharing it in the attached text file.
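For reference, the full output can be captured to a file with something like this (the path is just an example):

zpool status -v RAIDZ2-60-Disk-Pool > /tmp/zpool_status.txt   # -v also lists per-device error details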
 

Attachments

  • zpool status.txt
    7 KB · Views: 113

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
RAIDZ2-60-Disk-Pool took several hours to import. It's visible in the GUI once more, but now it's showing lots of disks as degraded. This many disks can't go faulty so quickly. This looks like a bug in TrueNAS SCALE.

This is looking more like a failure of some hardware component common to multiple disks. You should stop trying to bring the pool online and take a good long look at your hardware.
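As a first pass, and assuming smartmontools is installed, the kernel log and SMART data usually point at a shared component fairly quickly (device names below are only examples):

dmesg | grep -iE 'sas|reset|i/o error'                  # link resets or transport errors hint at cabling/expander/HBA trouble
smartctl -a /dev/sda | grep -iE 'health|defect|error'   # repeat for a few drives; healthy disks plus pool-wide errors point away from the drives themselves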
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
I agree with Alex.

I've never come across a system acting up in a similar way.
I too think there might be a hardware problem, other than drives.
Especially since the boot drive refused to import automatically.

I understand it may not be an easy task, but in case you really want to try to save the data here, I'd look at putting the drives in another enclosure/system altogether and see what happens.

That might stop the masses of errors being thrown, but will likely not fix this from the zpool status:
status: One or more devices could not be used because the label is missing
or invalid.

The amount of errors, scattered across various drives, suggests something along the lines of an enclosure failure, cable failure, or SAS controller failure.
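If you want to see which drives still carry a readable ZFS label, zdb can dump them per device (a rough sketch; the partition numbering may differ on your system):

zdb -l /dev/sdX1   # prints labels 0-3; 'failed to unpack label' means that copy is unreadable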
 

adeelleo

Dabbler
Joined
Mar 28, 2021
Messages
38
I agree with Alex.

I've never come across a system acting up in a similar way.
I too think there might be a hardware problem, other than drives.
Especially since the boot drive refused to import automatically.

I understand it may not be an easy task, but in case you really want to try to save the data here, I'd look at putting the drives in another enclosure/system altogether and see what happens.

That might stop the masses of errors being thrown, but will likely not fix this from the zpool status:


The amount of errors, scattered across various drives, suggests something along the lines of an enclosure failure, cable failure, or SAS controller failure.
Thanks,

Everything seems to be working fine from a hardware standpoint, though, and there are no errors reported by the hardware either.

But I will look into the suggestion of installing the disks in a different expansion enclosure and see what happens.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Thanks,

Everything seems to be working fine from a hardware standpoint, though, and there are no errors reported by the hardware either.

But I will look into the suggestion of installing the disks in a different expansion enclosure and see what happens.
Once you do, I'd also try to import the pool on a different installation - of TrueNAS CORE.
Install CORE on something temporary, another SSD or even a USB stick, just to have something to work from.
No need to wipe your current SCALE installation.
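If you go that route, importing the pool read-only from the temporary install keeps the on-disk state untouched while you poke around (a sketch; the altroot given with -R is optional):

zpool import -o readonly=on -f -R /mnt RAIDZ2-60-Disk-Pool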

I have a lot of doubts about SCALE at this point - but that's me being the typical ZFS guy - data paranoia overshadows the desire for cutting edge.

Was this Gigantic pool created on SCALE too?
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
But I will look into the suggestion of installing the disks in a different expansion enclosure and see what happens.

Since you mentioned boot-pool also failing to import, is the boot pool in the same expansion enclosure as the data pool disks? It is not quite obvious to me which component is suspect. I would start by drawing a component diagram to figure out what is connected to what, including CPU, memory, mainboard, specific slots on the mainboard, disk controller(s), expander(s), cables, power supply (and its cables), backplane(s), and down to the disks. Then remove disks one by one and test them on a separate machine, at least some of them (say 10 or so of the 60), to eliminate the possibility that the disks themselves are at fault (for whatever common reason: heat, vibration, or a bad batch).
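For the per-disk testing, a long SMART self-test on each pulled drive is a reasonable starting point (assuming smartmontools on the test machine; substitute the actual device node):

smartctl -t long /dev/sdX        # starts the extended self-test; can take many hours on large drives
smartctl -l selftest /dev/sdX    # check the result once the test completes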

Fault isolation in a machine of this size takes weeks (and sometimes achieves no result, or only something vague like "there is a driver problem"), but you can't really reuse the machine without isolating the problem, or else you are setting yourself up for a repeat performance. Also, any attempt at recovery on an unstable machine risks further damaging what's left of the pool.
 