Pool died.... now arisen again... but why/how did it die?

Constantin · Dec 16, 2021

Until last night, my system was running fine. No pool degradation, no SMART errors, etc.

I then upgraded TrueNAS from 12.0-U6.1 to 12.0-U7. The system came back up, but the pool had a big red dot with a white X. The only option: disconnect / export. Trying to import the dead pool afterwards resulted in an alphabet soup of middleware errors that likely only benefit the customer support team at iXsystems.

Dropping into the shell, "zpool import XXX" yielded "I/O Error, destroy and recreate the pool from backup".

Looking through the UI Storage/Disks submenu, the apparent failure mode is that all three mirrored SVDEV SSDs disappeared. The eight primary HDDs were still listed. I presume my first task should be to pull the SSDs, swap them around and see what happens?

Reverting to U6.1 did not fix the issue, so it's likely more to do with the reboot than the U7 upgrade.

Constantin · Dec 16, 2021

... so I hot-swapped the three 1.46TB SSDs around, and they now *do* register with the system: they show up in the Storage/Disks UI window, for example. If I then drop into the CLI and enter "zpool import XXX", the pool imports cleanly and "zpool status XXX" shows 0 errors, yet the pool does not show up in the UI. So I rebooted again...

After the reboot, importing the pool into TrueNAS via the UI was no problem. Everything is fine again, pool is online, no errors.

Quite a delta from 10 minutes ago when the pool was listed as dead. Now... for the $64,000 question: Why would the SSDs go offline in the first place? ... and why would swapping them around SATA ports make them go back online again?

When I hot-swapped them, the usual multiple screenfuls of status updates started scrolling down the console, so clearly they became "alive" once I pulled them and reinserted them into the SATA backplanes. On the one hand I am glad the system is back... yet I am also perplexed that 3 mirrored SSDs could simply vanish from the SATA bus without the system giving better feedback.

Constantin · Dec 16, 2021

So here is my suggestion to @morganL and other folks at iXsystems: The pool config presumably contains a list of the drives it expects to connect to (by UUID, capacity, type, intended use, etc.). When a pool import fails, why can't the UI tell the user: "Hey, I cannot import your pool because the following disks (by SN, capacity, and SATA slot) are missing from VDEV XXX in the pool."? Then the user has a better starting point than "your pool is so dead that it needs to be destroyed and rebuilt from backups", which is all I got.
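
A minimal Python sketch of the kind of report suggested above. It assumes the text printed by a bare "zpool import" (which lists importable pools and their device tree) has already been captured. The sample text and device names are made up for illustration and are not output from the system in this thread, and `find_unhealthy_devices` is a hypothetical helper, not part of TrueNAS or ZFS.

```python
# Device states that ZFS reports for members that are not healthy.
UNHEALTHY = {"UNAVAIL", "FAULTED", "DEGRADED", "REMOVED", "OFFLINE"}


def find_unhealthy_devices(import_text):
    """Return (device, state, reason) tuples from each pool's config section."""
    results = []
    in_config = False
    for line in import_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("pool:"):
            in_config = False  # a new pool header ends the previous config
            continue
        if stripped.startswith("config:"):
            in_config = True
            continue
        if not in_config or not stripped:
            continue
        parts = stripped.split(None, 2)
        if len(parts) >= 2 and parts[1] in UNHEALTHY:
            reason = parts[2] if len(parts) > 2 else ""
            results.append((parts[0], parts[1], reason))
    return results


# Illustrative sample only: layout and device names are assumptions.
SAMPLE = """\
   pool: XXX
  state: UNAVAIL
 config:

        XXX           UNAVAIL  insufficient replicas
          raidz2-0    ONLINE
            da0p2     ONLINE
            da1p2     ONLINE
          mirror-1    UNAVAIL  insufficient replicas
            ada0p2    UNAVAIL  cannot open
            ada1p2    UNAVAIL  cannot open
            ada2p2    UNAVAIL  cannot open
"""

for name, state, reason in find_unhealthy_devices(SAMPLE):
    print(f"{name}: {state} ({reason})")
```

Even a listing this crude would point at the missing mirror members rather than at "destroy and recreate".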

Rabinovitch · Jan 9, 2022

That's all I can say...

Constantin · Jan 9, 2022

Good suggestion. I reported the bug via Jira in December 2021, along with suggestions on how to improve the user experience. No response yet. They seem preoccupied with other bugs (see the upcoming 12.0-U7.1 CORE release).

flashdrive · Jan 9, 2022

Hello @Constantin

Having to deal with this:

pool degraded - HD204UI - has been removed by the administrator - HDD compatibility list?

Hello, all too often the zfs data pool will have this message: CRITICAL Pool 123 state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state. The following devices are not healthy: • Disk...

www.truenas.com

I now use the following "workflow" to check for the missing drives:

- check the TrueNAS GUI error report, which may give the serial number of the drive. I have labeled my drives, so I can check inside the case without having to pull them out

shell:
- zpool status
- glabel status
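
The two shell commands above complement each other: on TrueNAS CORE, "zpool status" typically names pool members by gptid label, and "glabel status" maps those labels to device nodes. A rough Python sketch of that lookup follows, assuming the "glabel status" text has already been captured. The sample rows are invented for illustration, and `parse_glabel` is a hypothetical helper.

```python
def parse_glabel(text):
    """Map each label name to the component (partition) behind it."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and the "Name  Status  Components" header row.
        if len(parts) != 3 or parts[0] == "Name":
            continue
        label, _status, component = parts
        mapping[label] = component
    return mapping


# Illustrative sample only: these gptid values are made up.
SAMPLE = """\
                                      Name  Status  Components
gptid/11111111-aaaa-11ec-0000-000000000001     N/A  ada0p2
gptid/22222222-bbbb-11ec-0000-000000000002     N/A  ada1p2
"""

labels = parse_glabel(SAMPLE)
for label, component in labels.items():
    print(f"{label} -> {component}")
```

A gptid that appears in "zpool status" but has no entry in this mapping is a candidate for the drive that dropped off the bus.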

I second your wish for an easier overview.
