RAIDZ2-60-Disk-Pool Unavailable after a RAIDZ2 vdev Failed

adeelleo

Dabbler
Joined
Mar 28, 2021
Messages
38
Hi Dice,

After a couple of reboots the machine was not booting at all, since it could not even find the boot-pool.

So I entered the two commands below to import the pools directly from the console in front of the server:

zpool import -f "boot-pool"
zpool import -f "RAIDZ2-60-Disk-Pool"

boot-pool was available immediately.
RAIDZ2-60-Disk-Pool took several hours to import. It's visible in the GUI once more, but now it's showing lots of disks as degraded. This many disks can't go faulty so quickly. This looks like a bug in TrueNAS SCALE.

Below is the output of the zpool status command.
Not sure why the formatting gets mangled each time, so I'm sharing it in the attached text file.
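For reference, the full output can be captured to a file with something like this (the path is just an example):

zpool status -v RAIDZ2-60-Disk-Pool > /tmp/zpool_status.txt   # -v also lists per-device error details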
 

Attachments

  • zpool status.txt
    7 KB · Views: 113

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
RAIDZ2-60-Disk-Pool took several hours to import. It's visible in the GUI once more, but now it's showing lots of disks as degraded. This many disks can't go faulty so quickly. This looks like a bug in TrueNAS SCALE.

This is looking more like a failure of some hardware component common to multiple disks. You should stop trying to bring the pool online and take a good long look at your hardware.
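As a first pass, and assuming smartmontools is installed, the kernel log and SMART data usually point at a shared component fairly quickly (device names below are only examples):

dmesg | grep -iE 'sas|reset|i/o error'                  # link resets or transport errors hint at cabling/expander/HBA trouble
smartctl -a /dev/sda | grep -iE 'health|defect|error'   # repeat for a few drives; healthy disks plus pool-wide errors point away from the drives themselves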
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
I agree with Alex.

I've never come across a system acting up in a similar way.
I too think there might be a hardware problem, other than drives.
Especially since the boot drive refused to import automatically.

I understand it may not be an easy task, but in case you really want to try to save the data here, I'd look at putting the drives in another enclosure/system altogether and see what happens.

That might stop the masses of errors being thrown, but will likely not fix this from the zpool status:
status: One or more devices could not be used because the label is missing
or invalid.

The amount of errors, scattered across various drives, suggests something along the lines of an enclosure failure, cable failure, or SAS controller failure.
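If you want to see which drives still carry a readable ZFS label, zdb can dump them per device (a rough sketch; the partition numbering may differ on your system):

zdb -l /dev/sdX1   # prints labels 0-3; 'failed to unpack label' means that copy is unreadable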
 

adeelleo

Dabbler
Joined
Mar 28, 2021
Messages
38
I agree with Alex.

I've never come across a system acting up in a similar way.
I too think there might be a hardware problem, other than drives.
Especially since the boot drive refused to import automatically.

I understand it may not be an easy task, but in case you really want to try to save the data here, I'd look at putting the drives in another enclosure/system altogether and see what happens.

That might stop the masses of errors being thrown, but will likely not fix this from the zpool status:


The amount of errors, scattered across various drives, suggests something along the lines of an enclosure failure, cable failure, or SAS controller failure.
Thanks,

Everything seems to be working fine from a hardware standpoint, though, and there are no errors reported by the hardware either.

But I will look into the suggestion of installing the disks in a different expansion enclosure and see what happens.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Thanks,

Everything seems to be working fine from a hardware standpoint, though, and there are no errors reported by the hardware either.

But I will look into the suggestion of installing the disks in a different expansion enclosure and see what happens.
Once you do, I'd also try to import the pool on a different installation - of TrueNAS CORE.
Install CORE on something temporary, another SSD or even a USB stick, just to have something to work from.
No need to wipe your current SCALE installation.
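If you go that route, importing the pool read-only from the temporary install keeps the on-disk state untouched while you poke around (a sketch; the altroot given with -R is optional):

zpool import -o readonly=on -f -R /mnt RAIDZ2-60-Disk-Pool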

I have a lot of doubts about SCALE at this point - but that's me being the typical ZFS guy - data paranoia overshadows the desire for cutting edge.

Was this Gigantic pool created on SCALE too?
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
But I will look into the suggestion of installing the disks in a different expansion enclosure and see what happens.

Since you mentioned boot-pool also failing to import, is the boot pool in the same expansion enclosure as the data pool disks? It is not quite obvious to me which component is suspect. I would start by drawing a component diagram to figure out what is connected to what, including CPU, memory, mainboard, specific slots on the mainboard, disk controller(s), expander(s), cables, power supply (and its cables), backplane(s), and down to the disks. Then remove disks one by one and test them on a separate machine, at least some of them (say 10 or so of the 60), to eliminate the possibility that the disks themselves are at fault (for whatever common reason: heat, vibration, or a bad batch).
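For the per-disk testing, a long SMART self-test on each pulled drive is a reasonable starting point (assuming smartmontools on the test machine; substitute the actual device node):

smartctl -t long /dev/sdX        # starts the extended self-test; can take many hours on large drives
smartctl -l selftest /dev/sdX    # check the result once the test completes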

Fault isolation in a machine of this size takes weeks (and sometimes achieves no result, or only something vague like "there is a driver problem"), but you can't really reuse the machine without isolating the problem, or else you are setting yourself up for a repeat performance. Also, any attempt at recovery on an unstable machine risks further damaging what's left of the pool.
 