SSD pool unhealthy, disks all online, unable to view SMART results

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Hello,

I have a RAIDZ1 pool of 4x Samsung 860 EVO SSDs, and the pool shows as unhealthy.
The status of all four disks is green, with all of them showing online.

When I try to select one of the disks and view the SMART results, the page looks like it is trying to load the results but sits there indefinitely and produces no output.

Is there another way I can check the health of these disks to see what's going on to cause this pool alert?

Thanks
Eds
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,600
Log in to a shell as "root" and run zpool status -v. Then copy and paste the output here in CODE tags.

Plus, it's always helpful (and in the forum rules) to have a complete listing of the hardware. For example, it's not clear which controller your SSDs are attached to, SATA or SAS, and if SAS, what model. Also include the version of TrueNAS you are using.
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Thanks. Output below:
Code:
  pool: ESXi
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

        NAME                                            STATE     READ WRITE CKSUM
        ESXi                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/e4117ba2-ef0b-11eb-b23d-ac1f6b60b47c  ONLINE       0     0     2
            gptid/e44f7d57-ef0b-11eb-b23d-ac1f6b60b47c  ONLINE       0     0     2
            gptid/e4fa9eba-ef0b-11eb-b23d-ac1f6b60b47c  ONLINE       0     0     0
            gptid/e50426bc-ef0b-11eb-b23d-ac1f6b60b47c  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        ESXi/ESXi:<0x1>


I see some mention of checksum errors on a couple of the SSDs. How can I map those gptids to the disk names, so I know which ones I presumably need to check further or replace?
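For a pool with many devices, the rows with non-zero counters can be pulled out of that output with a small filter. This is only a sketch: the function name nonzero_error_rows is made up for illustration, and it assumes the usual zpool status layout with a NAME/STATE/READ/WRITE/CKSUM header followed by an "errors:" line.

```shell
# Sketch: list device rows from `zpool status` output (read on stdin)
# whose READ, WRITE or CKSUM counter is not 0.
# `nonzero_error_rows` is an illustrative name, not a ZFS tool.
nonzero_error_rows() {
    awk '
        # The column header marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # The "errors:" summary marks the end of it.
        /^errors:/ { in_table = 0 }
        # Inside the table, report any row with a non-zero counter.
        in_table && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Usage on a live system (pool name taken from the output above):
#   zpool status -v ESXi | nonzero_error_rows
```

Run against the output above, it should print only the two gptid rows that show CKSUM 2.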

Many thanks
Eds
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Can anyone help me map these gptids to disk IDs so I can figure out which ones need replacing?

Many thanks.
Eds
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
gpart list will help you map the UUIDs to disk devices. You can then try running smartctl on the device.
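A rough sketch of that lookup, assuming the usual gpart list layout in which each provider entry starts with "N. Name: <partition>" and later has a "rawuuid:" line. The function name gpart_uuid_map is invented for the example.

```shell
# Sketch: print "<rawuuid> <partition>" pairs from `gpart list` output
# read on stdin. `gpart_uuid_map` is an illustrative name.
gpart_uuid_map() {
    awk '
        # Remember the most recent "N. Name: daXpY" provider line.
        $1 ~ /^[0-9]+\.$/ && $2 == "Name:" { part = $3 }
        # Pair each rawuuid with the provider it belongs to.
        $1 == "rawuuid:" { print $2, part }
    '
}

# Usage on a live system:
#   gpart list | gpart_uuid_map | grep e4117ba2
#   smartctl -a /dev/<disk>    # <disk> is whatever device the uuid maps to
```

The part of the gptid after "gptid/" in zpool status is the rawuuid to search for.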
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
sretalla said:
glabel status gives it to you a little nicer, but either option works.

Didn't know that one. I thought this was only for labels set explicitly with gpart add -l foo ...
Thanks.
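For reference, a minimal sketch of the glabel status route. It assumes the three-column Name / Status / Components layout and that partitions are named like da9p2; the function name gptid_to_disk is hypothetical.

```shell
# Sketch: from `glabel status` output on stdin, print
# "<gptid label> <partition> <disk>" for every gptid label.
# `gptid_to_disk` is an illustrative name.
gptid_to_disk() {
    awk '
        index($1, "gptid/") == 1 {
            disk = $3
            # Strip a trailing partition suffix such as "p2" to get the disk.
            sub(/p[0-9]+$/, "", disk)
            print $1, $3, disk
        }
    '
}

# Usage on a live system:
#   glabel status | gptid_to_disk
```

The last column is the name to hand to smartctl, e.g. smartctl -a /dev/<disk>.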
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Thanks all for your input on getting the status and identification of disks.

I've identified the two disks with checksum errors in zpool status as da9 and da10.
My daily SMART report, though, shows neither of those SSDs reporting any SMART errors.

If it's not a SMART issue on the SSDs, is there anything else that would cause these checksum errors on the pool?
Alternatively, is there a way I can reset the checksum count to see if the issue comes back?

I'm hesitant to buy and replace the SSDs if they themselves are not at fault.

Cheers
Eds
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,945
I think zpool clear "poolname" will reset the checksum error counters.
If they come back, you may have an issue.
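As a sketch of that clear-and-watch cycle: the wrapper below resets the counters, starts a scrub so every block gets re-read and verified, and prints the status. recheck_pool, run_or_print and the DRY_RUN variable are names made up here, not ZFS features; with DRY_RUN=1 it only prints the commands.

```shell
# Sketch: clear a pool's error counters, then scrub and show status.
# `run_or_print`, `recheck_pool` and DRY_RUN are illustrative names.
run_or_print() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"      # dry run: show the command instead of running it
    else
        "$@"
    fi
}

recheck_pool() {
    pool=$1
    run_or_print zpool clear "$pool"      # reset READ/WRITE/CKSUM counters
    run_or_print zpool scrub "$pool"      # start a full verification pass
    run_or_print zpool status -v "$pool"  # scrub runs in the background; check again later
}

# Usage:
#   DRY_RUN=1 recheck_pool ESXi   # print the commands only
#   recheck_pool ESXi             # actually run them (as root)
```

If the CKSUM column is non-zero again after the scrub finishes, the errors are recurring rather than a one-off.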
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
I've now had the same warning on a pool of 4 almost brand-new disks.
Is it possible that checksum errors are caused by the controller or SAS expander?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,945
Yes - or a cable
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Any way to know other than replacing the hardware bit by bit?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,945
Not really. Reseat the cable first, then change the cable, then the controller card.
If you have a spare disk then you could remove your pool disks and try a new set of disks and see if the issue remains. You can always put the pool back.

Of course, I am assuming that you aren't using a RAID controller, so, as various posts say, you might tell us what the hardware is.
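One cheap data point before swapping parts, offered as a hint rather than a diagnosis: SATA drives usually expose an interface CRC error counter as SMART attribute 199 (often named UDMA_CRC_Error_Count or CRC_Error_Count). A raw value that keeps climbing tends to point at the cable or link rather than the flash itself. The sketch below assumes the standard smartctl -A attribute table with the ID in column 1 and the raw value in column 10; crc_error_count is a made-up name.

```shell
# Sketch: extract SMART attribute 199 (interface CRC errors) from
# `smartctl -A` output on stdin. Prints "<attribute name> <raw value>".
# `crc_error_count` is an illustrative name.
crc_error_count() {
    awk '$1 == "199" { print $2, $10 }'
}

# Usage on a live system (da9 and da10 are the disks identified earlier):
#   smartctl -A /dev/da9  | crc_error_count
#   smartctl -A /dev/da10 | crc_error_count
```

Whether the attribute is reported at all depends on the drive and on how the controller passes SMART data through, so an empty result proves nothing either way.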
 