SSD pool unhealthy, disks all online, unable to view SMART results

Eds89 · Aug 1, 2021

Hello,

I have a RAIDZ1 pool of 4x Samsung EVO 860 SSDs, and the pool shows as unhealthy.
The status of all four disks is green, all showing online.

When I try to select one of the disks and view the SMART results, the page looks like it is trying to load the results but sits there indefinitely and produces no output.

Is there another way I can check the health of these disks to see what's going on to cause this pool alert?

Thanks
Eds

Arwen · Aug 1, 2021

Log in via the "root" shell and run zpool status -v. Then copy and paste the output here in CODE tags.

Plus, it's always helpful (and in the forum rules) to have a complete listing of the hardware. For example, it's not clear what controller chip is used for your SSDs, SATA or SAS, and if SAS, what model. Also include the version of TrueNAS you are using.

Eds89 · Aug 2, 2021

Thanks. Output below:

Code:

  pool: ESXi
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

        NAME                                            STATE     READ WRITE CKSUM
        ESXi                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/e4117ba2-ef0b-11eb-b23d-ac1f6b60b47c  ONLINE       0     0     2
            gptid/e44f7d57-ef0b-11eb-b23d-ac1f6b60b47c  ONLINE       0     0     2
            gptid/e4fa9eba-ef0b-11eb-b23d-ac1f6b60b47c  ONLINE       0     0     0
            gptid/e50426bc-ef0b-11eb-b23d-ac1f6b60b47c  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        ESXi/ESXi:<0x1>

I see checksum errors reported on a couple of the SSDs. How can I map those gptids to the disk names, so I know which ones I presumably need to run further checks on or replace?

Many thanks
Eds
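
As a side note, the devices with non-zero checksum counters can be picked out of that output with a small filter. This is only a sketch, assuming the column layout shown above (NAME, STATE, READ, WRITE, CKSUM); the function name cksum_errors is made up for illustration.

```shell
# Print every gptid vdev whose CKSUM column (5th field) is non-zero.
# Reads `zpool status` output on stdin; assumes the column layout above.
cksum_errors() {
    awk '$1 ~ /^gptid\// && $5 + 0 > 0 { print $1, $5 }'
}

# Usage (on the TrueNAS host):
#   zpool status -v ESXi | cksum_errors
```

Against the output posted above, this should list the two gptid entries that show 2 in the CKSUM column.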

Eds89 · Aug 16, 2021

Can anyone help me map these gptids to disk IDs so I can figure out which ones need replacing?

Many thanks.
Eds

Patrick M. Hausen · Aug 16, 2021

gpart list will help you map the uuids to disk devices. You can then try to use smartctl on the device.

Eds89 · Aug 16, 2021

Patrick M. Hausen said:
gpart list will help you map the uuids to disk devices. You can then try to use smartctl on the device.

Excellent thanks, will give that a crack.

Eds

sretalla · Aug 17, 2021

Patrick M. Hausen said:
gpart list will help you map the uuids to disk devices

glabel status gives it to you a little nicer, but either option works.
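
A rough sketch of how that lookup could be scripted, assuming glabel status prints its usual three columns (Name, Status, Components) with a partition name such as daXpN in the third. The helper name gptid_to_disk is hypothetical, and the suffix stripping assumes GPT-style partition names.

```shell
# Map one gptid label to its underlying disk, given `glabel status` on stdin.
# Assumption: third column holds the partition (daXpN); the trailing pN is
# stripped to leave the disk device name.
gptid_to_disk() {
    awk -v id="$1" '$1 == id { sub(/p[0-9]+$/, "", $3); print $3 }'
}

# Usage (on the TrueNAS host):
#   glabel status | gptid_to_disk gptid/e4117ba2-ef0b-11eb-b23d-ac1f6b60b47c
```

The device name it prints can then be passed to smartctl.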

Patrick M. Hausen · Aug 17, 2021

sretalla said:
glabel status gives it to you a little nicer, but either option works.

Didn't know that one. I thought this was only for labels set explicitly with gpart add -l foo ...
Thanks.

Eds89 · Aug 21, 2021

Thanks all for your input on getting the status and identification of disks.

I've identified the two disks with checksum errors from zpool status as da9 and da10.
My daily SMART report, though, shows neither of those SSDs reporting any SMART errors.

If it's not a SMART issue on the SSDs, is there anything else that would cause these checksum errors on the ZPOOL?
Alternatively, is there a way I can reset the checksum count to see if the issue comes back?

I'm hesitant to buy and replace the SSDs if they themselves are not at fault.

Cheers
Eds

NugentS · Aug 21, 2021

I think zpool clear "poolname" will clear checksum errors.
If they come back, you may have an issue.
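
A minimal sketch of that sequence, using the pool name ESXi from the output earlier in the thread. The scrub step is an addition not mentioned above: it is assumed here as a way to make ZFS re-read and verify the data, so that recurring errors show up sooner. The wrapper name recheck_pool is made up.

```shell
# Reset a pool's error counters, then start a scrub to re-verify its data.
# Assumption: the scrub is added here to provoke any recurring errors.
recheck_pool() {
    pool="$1"
    # Zero the READ/WRITE/CKSUM counters; stop here if that fails.
    zpool clear "$pool" || return 1
    # Start a scrub; it runs in the background.
    zpool scrub "$pool"
}

# Usage (on the TrueNAS host, as root):
#   recheck_pool ESXi
#   zpool status -v ESXi    # check the counters again once the scrub is done
```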

Eds89 · Aug 22, 2021

I've now had the same warning on a pool of 4 almost brand new disks.
Is it possible checksum errors can be issues with the controller or SAS expander?

NugentS · Aug 22, 2021

Yes - or a cable
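
One hedged way to look for a link or cable problem from the disk side is SMART attribute 199, the interface CRC error count on ATA drives. The sketch below assumes the drives report that attribute and that smartctl -A prints the standard ATA attribute table (ID in the first column, raw value in the tenth); the function name crc_errors is hypothetical.

```shell
# Print the name and raw value of SMART attribute 199 (interface CRC errors)
# from `smartctl -A` output on stdin.
# Assumption: standard ATA attribute table, ID in column 1, raw value in 10.
crc_errors() {
    awk '$1 == 199 { print $2, $10 }'
}

# Usage (on the TrueNAS host):
#   smartctl -A /dev/da9  | crc_errors
#   smartctl -A /dev/da10 | crc_errors
```

A non-zero or rising raw value would point towards the link rather than the flash itself. A value of zero does not rule out the controller or expander.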

Eds89 · Aug 23, 2021

Any way to know other than replacing the hardware bit by bit?

NugentS · Aug 23, 2021

Not really. Reseat the cable first, then change the cable, then the controller card.
If you have a spare disk then you could remove your pool disks and try a new set of disks and see if the issue remains. You can always put the pool back.

Of course, I am assuming that you aren't using a RAID controller, so you might, as various posts say, tell us what the hardware is.
