SOLVED Getting degraded pool online after hardware issue

EnKo

Dabbler
Joined
Jan 9, 2022
Messages
32
Last weekend, 3 disks in my 16-disk RAIDZ3 pool suddenly faulted and 12 became degraded. The system reported errors and I was not able to resilver the pool.

Pool state is DEGRADED: One or more devices are faulted in response to IO failures.
Pool state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected.

After moving the disks to different hardware, no error messages appeared anymore. The faulted disks now show as ONLINE, but the degraded disks stay DEGRADED, even after a scrub task with no errors and several restarts. Taking them offline and online again also did not solve the issue.

[Attachment: Screenshot_20230626_175630.png]


A very rough look at the system lets me hope there is actually no data corruption. I think I would be able to bring the pool back online if I forced a replacement, but I am afraid of causing more damage. I could imagine there is a better solution.
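To be clear about what I mean by forcing a replacement: something along these lines, replacing each degraded member with itself so it resilvers in place (pool name and gptid are placeholders, not my actual values; I have not run this yet):

zpool replace -f <poolname> gptid/<degraded-disk-gptid>   # resilver the disk in place; -f forces it even if the disk appears in use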

Thank you very much in advance for your support.
 

EnKo

Dabbler
Joined
Jan 9, 2022
Messages
32
Finally, I remembered something I read years ago:
Warning: the supported mechanisms for making configuration changes
are the TrueNAS WebUI and API exclusively. ALL OTHERS ARE
NOT SUPPORTED AND WILL RESULT IN UNDEFINED BEHAVIOR AND MAY
RESULT IN SYSTEM FAILURE.

zpool status -v
  pool: ...
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 204K in 00:00:05 with 0 errors on Mon Jun 26 17:55:22 2023
config:

        NAME          STATE     READ WRITE CKSUM
        raidz3-0      DEGRADED     0     0     0
          gptid/...   ONLINE       0     0     0
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   ONLINE       0     0     0
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   ONLINE       0     0     0
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          ...         DEGRADED     0     0     0  too many errors
          ...         DEGRADED     0     0     0  too many errors

errors: No known data errors
Is zpool clear a wise action?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The good part is the "errors: No known data errors".

But I am not sure a "zpool clear" will do what you want. I have not seen that amount of problems before. Perhaps someone else can answer that question. It would probably not harm anything, though don't take my word for it.

I see 2 problems:

1. You appear to have 8 USB-attached disks (but I could be wrong?). This is not recommended.

2. Wide stripes in a RAID-Zx vdev are not recommended. Usually a maximum of 10 to 12 disks is acceptable. Using 16 is a bit too much and could lead to problems in the future, including slower writes as free space gets fragmented. For example, 2 x 8-disk RAID-Z2 is a good compromise, though that does add 1 more disk of parity, reducing overall pool size by 1 disk.
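Just to illustrate the layout, not something to run against your existing pool: at the zpool level a 2 x 8-disk RAID-Z2 pool corresponds to roughly the following (pool and disk names are placeholders, and on TrueNAS you would build it through the WebUI anyway):

zpool create <poolname> \
    raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
    raidz2 da8 da9 da10 da11 da12 da13 da14 da15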
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
What's the clue for USB-attached disks? I'd rather suspect 8 drives on a -8i HBA.
But more details on the hardware would help: system specifications, drive model (SMR?), and how the drives are attached and powered.
Any reason why da1, da3 and da6 would be the only unaffected drives?

Look into the SMART reports, and launch long SMART tests on all drives if there have been no recent tests.
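From the shell that is something along these lines, repeated for each drive (device names are just examples):

smartctl -a /dev/da0        # current SMART attributes and self-test log
smartctl -t long /dev/da0   # start a long self-test (takes several hours)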
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
What I used as a hint that the OP has USB-attached disks is that there are 2 different naming conventions for the disks: half are "adaX" and the other half "daX". If I remember my FreeBSD correctly (I don't have a TrueNAS Core / FreeBSD server handy), the "daX" devices are USB.

But, as I said in my earlier post, the OP "appears to have", so I freely admit I could be wrong.

Your request for the hardware configuration and drive models (SMR?) is probably a more helpful suggestion.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
If I remember my FreeBSD, (I don't have a TrueNAS Core / FreeBSD server handy), the "daX" are USB.

daX is for any SCSI Direct Access device, which USB also presents itself through.
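If you want to check how a given daX is actually attached, the CAM device list shows the parent bus for each device (umass for USB, mps/mpr for an LSI SAS HBA, ahci for onboard SATA), for example:

camcontrol devlist -v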

@EnKo Can you provide some information on the "old" and "new" hardware, specifically around the motherboard and disk controller(s) used? Pool-wide faults usually result from a cable, midplane, or controller/HBA failure.
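If it is easier than digging through documentation, the controllers can usually be identified from the shell with the stock FreeBSD tools, along these lines:

pciconf -lv | grep -B3 -i storage   # PCI devices whose class is "mass storage"
dmesg | grep -iE 'mps|mpr|ahci'     # which driver claimed which controller at boot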
 

EnKo

Dabbler
Joined
Jan 9, 2022
Messages
32
Thank you very much for your answers.

Indeed, there are 16 SATA drives, previously connected via a SAS expander to a SAS backplane and powered by one power supply. Now 8 drives are connected via two other ports of the same SAS expander and 8 drives are connected to the mainboard via a SATA-SAS adapter, all on different SAS backplanes and powered by another power supply. The two power supplies are connected via a Y-cable to share the power-on signal.

The cables were the first thing I replaced, after which I tried to resilver the pool. I think that is why those 3 drives are not DEGRADED: they FAULTED instead of being recognised as DEGRADED.

I still have to investigate the root cause, but the only single device left from the old configuration that could affect all disks at the same time is the power supply. I will therefore test it as soon as I have solved the data issue.
 

EnKo

Dabbler
Joined
Jan 9, 2022
Messages
32
I tried zpool clear and it fixed the issue.
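For anyone finding this later, the whole fix was essentially just (pool name omitted here, as in the status output above):

zpool clear <poolname>      # reset the error counters on all vdev members
zpool status -v <poolname>  # verify that everything reports ONLINE again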

Unfortunately, I have not been able to figure out the hardware issue yet, since I cannot reproduce the initial failure anymore.
 