Dual pathed SAS HDD - One path down - alternating path

Zappo

Cadet
Joined
Nov 2, 2020
Messages
5
Hello,

I have a Supermicro 24-bay chassis with a dual-pathed SAS2 Supermicro/LSI SAS2X36 backplane, running TrueNAS 12.0-U3. The backplane is connected to two LSI 9207-8i HBAs (one per path). The HBAs run firmware version 20.
It holds 24x 600GB 10k RPM enterprise drives, and one of them started acting strangely a couple of weeks ago.
I started getting errors like these:
"Pool Pool01 state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected."
followed by:
"Multipath multipath/disk13 connection is not optimal. Please check disk cables."

The disk in question is seen by TrueNAS like this:
multipath/disk13  DEGRADED
  da13  FAIL    50014ee7aaabefcc
  da38  ACTIVE  50014ee7aaabefcc
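
For anyone who wants to track this across reboots from a script, here is a minimal Python sketch that reads a listing like the one above and reports which member path is failed and whether both members carry the same identifier (which is what one would expect for a single disk reached over two paths). The function name `summarize_paths` and the set of member states it accepts are assumptions for illustration; the listing is pasted text, not the output of any particular command.

```python
import re

# Listing copied from the post; columns may be fused or space-separated.
LISTING = """\
multipath/disk13DEGRADED
da13FAIL50014ee7aaabefcc
da38ACTIVE50014ee7aaabefcc
"""

# Assumption: member states are ACTIVE, PASSIVE or FAIL, and the last
# column is a 16-digit hex identifier, as in the listing above.
MEMBER = re.compile(r"^\s*(da\d+)\s*(ACTIVE|PASSIVE|FAIL)\s*([0-9a-f]{16})\s*$")


def summarize_paths(listing):
    """Return the failed member devices and whether all members share one identifier."""
    members = []
    for line in listing.splitlines():
        m = MEMBER.match(line)
        if m:  # skip the multipath/diskNN header line and anything unrecognized
            members.append(m.groups())
    failed = [dev for dev, state, _ident in members if state == "FAIL"]
    idents = {ident for _dev, _state, ident in members}
    return {"failed": failed, "same_disk": len(idents) == 1}


print(summarize_paths(LISTING))
```

Logging this result after each reboot would show whether the failed path really flips at random or follows a pattern.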

So right now, since the last reboot, "da13" is the failed path.
If I reboot, "da38" can fail instead while da13 is fine. It alternates. I can reboot 50 times and get different results for which path of this HDD fails. It seems to be random. One path is always active and the other failed. Which one fails on the next reboot depends on the alignment of the stars and the moon. This was already the case in v12 U2; the upgrade to U3 (which of course meant a reboot) caused da13 to fail this time (before the reboot, da38 had failed and da13 was good).

The GUI still shows da13 as degraded. On the dashboard, the pool is marked as "unhealthy" (in v12 U2, before the U3 upgrade reboot, it was called "degraded").
When I run "zpool status", all disks are ONLINE and everything looks OK, but it does mention the "unrecoverable error":

Code:
nas01# zpool status
  pool: Pool01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 385G in 01:03:07 with 0 errors on Fri Apr 16 15:35:41 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        Pool01                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/199def9f-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1a91b995-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1a57856d-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1a72dc44-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1a923c8f-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1ac7ebfc-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1b12dc45-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1cdd56f2-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/1909a4a5-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/19be4b1d-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1b5b4c71-1c52-11eb-ad4b-a0369f19e510  ONLINE       3     0     0
            gptid/1b7d57e8-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1b6c8d69-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1bf27e07-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1c3b860c-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1c7422ff-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
          raidz2-2                                      ONLINE       0     0     0
            gptid/1ca9f7d8-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1ee056a8-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1f1d6847-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1f2a0236-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1fb46ddb-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1fc89694-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1fe0f77d-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/20015769-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
        logs
          mirror-3                                      ONLINE       0     0     0
            gptid/1d5bded5-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1d7f605a-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0

errors: No known data errors
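
In output this long the one nonzero counter is easy to miss. As a rough aid, the following Python sketch scans pasted "zpool status" text and lists every row whose READ, WRITE or CKSUM counter is not zero. The function name `devices_with_errors` is made up for illustration, the sample is trimmed from the output above, and the pattern assumes plain integer counters (abbreviated values such as "1.2K" would not match).

```python
import re

# Sample trimmed from the zpool status output above.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        Pool01                                          ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/19be4b1d-1c52-11eb-ad4b-a0369f19e510  ONLINE       0     0     0
            gptid/1b5b4c71-1c52-11eb-ad4b-a0369f19e510  ONLINE       3     0     0
"""

# name, state, then three integer counters; the header row does not match
# because READ/WRITE/CKSUM are not digits.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for every row with a nonzero counter."""
    hits = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, blank or free-text line
        name, _state, read, write, cksum = m.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            hits.append((name, *counts))
    return hits


print(devices_with_errors(SAMPLE))
```

On the sample this flags only gptid/1b5b4c71-1c52-11eb-ad4b-a0369f19e510, the member of raidz2-1 with 3 read errors.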


Is this HDD doing the funky chicken? Or is this an issue with software/firmware?
 

Zappo

Update: I just ran "zpool clear" on that pool, and the dashboard now marks the pool "Online, Disks w/Errors: 0".
 

Zappo

Update: a week later, this one disk has lost one of its two paths again. It is still online but "degraded" (one path up, one path failed).
"zpool status" shows the disk as online and "repairing".
If the disk is still online and working, why is the entire pool marked "unhealthy" when it has not lost a disk?
 