SOLVED Getting degraded pool online after hardware issue

EnKo

Dabbler
Joined
Jan 9, 2022
Messages
32
Last weekend, 3 disks in my 16-disk RAIDZ3 pool suddenly faulted and 12 became degraded. The system reported errors and I was not able to resilver the pool.

Pool state is DEGRADED: One or more devices are faulted in response to IO failures.
Pool state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected.

After moving the disks to different hardware, no error messages appeared anymore. The faulted disks now show as ONLINE, but the degraded disks stay DEGRADED, even after a scrub task with no errors and several restarts. Taking them offline and online again also did not solve the issue.

[Attachment: Screenshot_20230626_175630.png]


A very rough look at the system lets me hope there is actually no data corruption. I think I would be able to bring the pool back online if I forced a replacement, but I am afraid of causing more damage. I could imagine there is a better solution.
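To be clear about what I mean by forcing a replacement: something along these lines, replacing each degraded member with itself so it resilvers in place (pool name and gptid are placeholders, not my actual values; I have not run this yet):

zpool replace -f <poolname> gptid/<degraded-disk-gptid>   # resilver the disk in place; -f forces it even if the disk appears in use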

Thank you very much in advance for your support.
 

EnKo

Dabbler
Joined
Jan 9, 2022
Messages
32
Finally, I remembered something I read years ago:
Warning: the supported mechanisms for making configuration changes
are the TrueNAS WebUI and API exclusively. ALL OTHERS ARE
NOT SUPPORTED AND WILL RESULT IN UNDEFINED BEHAVIOR AND MAY
RESULT IN SYSTEM FAILURE.

zpool status -v
  pool: ...
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 204K in 00:00:05 with 0 errors on Mon Jun 26 17:55:22 2023
config:

        NAME          STATE     READ WRITE CKSUM
        raidz3-0      DEGRADED     0     0     0
          gptid/...   ONLINE       0     0     0
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   ONLINE       0     0     0
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   ONLINE       0     0     0
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          gptid/...   DEGRADED     0     0     0  too many errors
          ...         DEGRADED     0     0     0  too many errors
          ...         DEGRADED     0     0     0  too many errors

errors: No known data errors
Is zpool clear a wise action?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The good part is the "errors: No known data errors".

But I am not sure a "zpool clear" will do what you want. I have not seen that amount of problems before. Perhaps someone else can answer that question. It would probably not harm anything, though don't take my word for it.

I see 2 problems:

1. You appear to have 8 USB-attached disks (but I could be wrong?). This is not recommended.

2. Wide stripes in a RAID-Zx vdev are not recommended. Usually a maximum of 10 to 12 disks is acceptable. Using 16 is a bit too much and could lead to problems in the future, including slower writes as free space gets fragmented. For example, 2 x 8-disk RAID-Z2 is a good compromise, though that does add 1 more disk of parity, reducing overall pool size by 1 disk.
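Just to illustrate the layout, not something to run against your existing pool: at the zpool level a 2 x 8-disk RAID-Z2 pool corresponds to roughly the following (pool and disk names are placeholders, and on TrueNAS you would build it through the WebUI anyway):

zpool create <poolname> \
    raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
    raidz2 da8 da9 da10 da11 da12 da13 da14 da15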
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
What's the clue for USB-attached disks? I'd rather suspect 8 drives on a -8i HBA.
But more details on the hardware would help: system specifications, drive model (SMR?), and how the drives are attached and powered.
Any reason why da1, da3 and da6 would be the only unaffected drives?

Look into the SMART reports, and launch long SMART tests on all drives if there have been no recent tests.
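From the shell that is something along these lines, repeated for each drive (device names are just examples):

smartctl -a /dev/da0        # current SMART attributes and self-test log
smartctl -t long /dev/da0   # start a long self-test (takes several hours)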
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
What I used as a hint that the OP has USB-attached disks is that there are 2 different naming conventions for the disks: half are "adaX" and the other half "daX". If I remember my FreeBSD correctly (I don't have a TrueNAS Core / FreeBSD server handy), the "daX" devices are USB.

But, as I said in my earlier post, the OP "appears to have", so I freely admit I could be wrong.

Your request for the hardware configuration and drive models (SMR?) is probably a more helpful suggestion.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
If I remember my FreeBSD, (I don't have a TrueNAS Core / FreeBSD server handy), the "daX" are USB.

daX is for any SCSI Direct Access device, which USB also presents itself through.
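If you want to check how a given daX is actually attached, the CAM device list shows the parent bus for each device (umass for USB, mps/mpr for an LSI SAS HBA, ahci for onboard SATA), for example:

camcontrol devlist -v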

@EnKo Can you provide some information on the "old" and "new" hardware, specifically around the motherboard and disk controller(s) used? Pool-wide faults usually result from a cable, midplane, or controller/HBA failure.
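If it is easier than digging through documentation, the controllers can usually be identified from the shell with the stock FreeBSD tools, along these lines:

pciconf -lv | grep -B3 -i storage   # PCI devices whose class is "mass storage"
dmesg | grep -iE 'mps|mpr|ahci'     # which driver claimed which controller at boot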
 

EnKo

Dabbler
Joined
Jan 9, 2022
Messages
32
Thank you very much for your answers.

Indeed, there are 16 SATA drives, previously connected via a SAS expander to a SAS backplane and powered by one power supply. Now 8 drives are connected via two other ports of the same SAS expander and 8 drives are connected to the mainboard via a SATA-SAS adapter, all on different SAS backplanes and powered by another power supply. The two power supplies are connected via a Y-cable to share the power-on signal.

The cables were the first thing I replaced, after which I tried to resilver the pool. I think that is why those 3 drives are not DEGRADED: they FAULTED instead of being recognised as DEGRADED.

I still have to investigate the root cause, but the only single device left from the old configuration that could affect all disks at the same time is the power supply. I will therefore test it as soon as I have solved the data issue.
 

EnKo

Dabbler
Joined
Jan 9, 2022
Messages
32
I tried zpool clear and it fixed the issue.
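For anyone finding this later, the whole fix was essentially just (pool name omitted here, as in the status output above):

zpool clear <poolname>      # reset the error counters on all vdev members
zpool status -v <poolname>  # verify that everything reports ONLINE again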

Unfortunately, I have not been able to figure out the hardware issue yet, since I cannot reproduce the initial failure anymore.
 