Available (AVL) – The drive might not be ready, and it is not suitable for use in a volume or a hot spare pool

q66 · Cadet · Joined Oct 13, 2020 · Messages: 9
Hello,
I'm having an issue when I try to replace a failed drive: the SAS drives don't register properly for some reason. Your help would be much appreciated.

The hardware setup is a Supermicro server running TrueNAS 12.0-RC1, with 4x SAS2308 HBAs (two in use, adapters 2 and 3) and 2x Supermicro JBODs.

SAS Cabling
JBOD #1 Port 0 -> HBA #2 Port 0
JBOD #1 Port 1 -> HBA #3 Port 0
JBOD #2 Port 0 -> HBA #2 Port 1
JBOD #2 Port 1 -> HBA #3 Port 1


camcontrol rescan all doesn't help; the drives remain in the same state. A reboot of the system didn't help either.
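For reference, here's roughly what I ran from the shell (the bus number in the targeted rescan is just an example from my box; check camcontrol devlist first):

Code:
# Re-scan every CAM bus and list the devices found
camcontrol rescan all
camcontrol devlist -v

# A targeted rescan of a single bus (bus 3 is an example)
camcontrol rescan 3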

Code:
[~]# sas2ircu list
LSI Corporation SAS2 IR Configuration Utility.
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved.


         Adapter      Vendor  Device                       SubSys  SubSys
 Index    Type          ID      ID    Pci Address          Ven ID  Dev ID
 -----  ------------  ------  ------  -----------------    ------  ------
   0     SAS2308_2     1000h    87h   00h:06h:00h:00h      1000h   3040h

         Adapter      Vendor  Device                       SubSys  SubSys
 Index    Type          ID      ID    Pci Address          Ven ID  Dev ID
 -----  ------------  ------  ------  -----------------    ------  ------
   1     SAS2308_2     1000h    87h   00h:08h:00h:00h      1000h   3040h

         Adapter      Vendor  Device                       SubSys  SubSys
 Index    Type          ID      ID    Pci Address          Ven ID  Dev ID
 -----  ------------  ------  ------  -----------------    ------  ------
   2     SAS2308_2     1000h    87h   00h:88h:00h:00h      1000h   3040h

         Adapter      Vendor  Device                       SubSys  SubSys
 Index    Type          ID      ID    Pci Address          Ven ID  Dev ID
 -----  ------------  ------  ------  -----------------    ------  ------
   3     SAS2308_2     1000h    87h   00h:8ah:00h:00h      1000h   3040h


The problem drives are replacements for failed units, installed in the same slots.
I would expect them to show up as Ready (RDY), but instead they come up as Available (AVL).
From the sas2ircu user guide: "Available (AVL) – The drive might not be ready, and it is not suitable for use in a volume or a hot spare pool."

Code:
[~]# sas2ircu 2 DISPLAY
....
Device is a Hard disk
  Enclosure #                             : 5
  Slot #                                  : 2
  SAS Address                             : 5000c50-0-b70b-b28d
  State                                   : Available (AVL)
  Manufacturer                            :
  Model Number                            :
  Firmware Revision                       :
  Serial No                               :
  GUID                                    : N/A
  Protocol                                : SAS
  Drive Type                              : SAS_HDD

Device is a Hard disk
  Enclosure #                             : 5
  Slot #                                  : 3
  SAS Address                             : 5000c50-0-b70c-2e8d
  State                                   : Available (AVL)
  Manufacturer                            :
  Model Number                            :
  Firmware Revision                       :
  Serial No                               :
  GUID                                    : N/A
  Protocol                                : SAS
  Drive Type                              : SAS_HDD
.....
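
To eyeball the state of every slot without wading through the full DISPLAY output, a quick filter helps (controller index 2 matches my setup; adjust for yours):

Code:
# Show only slot numbers and drive states; the controller index is an example
sas2ircu 2 DISPLAY | grep -E 'Slot #|State'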



sesutil show reports them as "OK, Swapped":
Code:
ses8: <LSI SAS2X36 0e12>; ID: 5003048001ff623f
Desc     Dev     Model                     Ident                Size/Status
Slot 01  da114   SEAGATE ST4000NM0023      Z1Z8LRBF0000R540WKTH 4T
Slot 02  da115   SEAGATE ST4000NM0023      Z1Z8LNYB0000R540VRWJ 4T
Slot 03  -       -                         -                    OK,  Swapped
Slot 04  -       -                         -                    OK,  Swapped
Slot 05  da116   SEAGATE ST4000NM0023      Z1Z8LM3Z0000C5403R3Z 4T
Slot 06  da117   SEAGATE ST4000NM0023      Z1Z8M2G10000R54027BL 4T
Slot 07  da118   SEAGATE ST4000NM0023      Z1Z8LMLV0000C5407DZZ 4T
Slot 08  da119   SEAGATE ST4000NM0023      Z1Z8LNRD0000R540WMCH 4T
Slot 09  da120   SEAGATE ST4000NM0025      ZC12YNXZ0000R802TLC6 4T
Slot 10  -       -                         -                    Not Installed
Slot 11  da121   SEAGATE ST4000NM0125      ZC12R0DF0000R749QR1Y 4T
Slot 12  da122   SEAGATE ST4000NM0125      ZC12R0B90000R749RLN1 4T
Slot 13  da123   SEAGATE ST4000NM0125      ZC12R0SQ0000C7522DDH 4T
Slot 14  da124   SEAGATE ST4000NM0125      ZC127LW10000R746BMCV 4T
Slot 15  da125   SEAGATE ST4000NM0125      ZC127LRD0000R746SBC2 4T
Slot 16  da126   SmrtStor SDLKOEDM200G5CA1 202A1E66             20G
Slot 17  -       -                         -                    Not Installed
Slot 18  -       -                         -                    Not Installed
Slot 19  -       -                         -                    Not Installed
Slot 20  -       -                         -                    Not Installed
Slot 21  -       -                         -                    Not Installed
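
To make sure I was looking at the right physical slots, blinking the locate LED via sesutil helps (the device name below is an example from my enclosure):

Code:
# Blink the locate LED on a device, then turn it off again
sesutil locate da115 on
sesutil locate da115 off

# Full enclosure map with element addressing
sesutil map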



One odd thing I noticed was that during install TrueNAS seems to have flashed the HBA firmware; it went by too fast to be sure, but I think that's what happened.
Strangely, the HBAs with the JBODs connected are on firmware v15 and the unused ones are on v20.
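You can also read the running firmware straight from the mps(4) driver without sas2flash; the instance numbers below are assumptions, match them to your dmesg. The sas2flash listing that follows shows the mismatch.

Code:
# Firmware version as reported by the driver, per controller instance
sysctl dev.mps.0.firmware_version
sysctl dev.mps.2.firmware_version

# Or check the boot messages
dmesg | grep -i 'mps.*firmware'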

Code:
[~]# sas2flash -listall
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2308_2(D1)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS2308_2(D1)   20.00.04.00    14.01.00.06    07.39.00.00     00:06:00:00
1  SAS2308_2(D1)   20.00.04.00    14.01.00.06    07.39.00.00     00:08:00:00
2  SAS2308_2(D1)   15.00.00.00    0f.00.00.03    07.29.00.00     00:88:00:00
3  SAS2308_2(D1)   15.00.00.00    0f.00.00.03    07.29.00.00     00:8a:00:00

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.


Any ideas on how to resolve this issue?
I'm thinking of connecting the JBODs to the HBAs with firmware version 20.00.04.00 as a test. Thoughts?

Thank you for reading
 

q66 · Cadet · Joined Oct 13, 2020 · Messages: 9
Quick update

Swapped to the HBAs with the 20.00 firmware: no change.
Removed SAS multipathing: no change.
Tried different slots for the new disks: same behavior.
Tried a different Seagate HDD model: no change.
Tried a STEC SSD: it worked fine, detected and registered.
Tried power cycling the JBODs: that worked! I can see the new Seagate drives.
Ran out of time before I could do more.

This system has been bugging me for years!

A bit of history...
Originally it was an illumos ZFS server; it was retired early this year.
During its life in production the drive failure rate was astronomical and SAS errors ran rampant.
Supermicro, the OS support team, and the HDD manufacturer worked together on troubleshooting the issue but ultimately failed.
HBAs, JBODs, and cables got replaced; JBOD firmware was flashed, as were the drives; internal cabling was verified; and much more.
Major errors were found, such as backplanes flashed with incorrect SAS addresses (so that port 0 and port 1 reported as different devices), major OS bugs, and other fun things.

Now that these systems are retired I can finally investigate them with a heavier hand.
I'm sort of relieved that they still have SAS issues under a different OS.

Please help me find the ghost in the SAS bus so that I can finally rest!
I'd like to better understand what's going on when I insert a drive. Any ideas? I'll probably swap the HBA next and then the JBOD.
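In case it helps anyone, this is roughly how I watch what happens on insertion, just a before/after diff of the device list while tailing the kernel log. The log below shows the sort of timeouts I'm chasing.

Code:
# Watch kernel/CAM messages live while inserting the drive
tail -F /var/log/messages

# In another shell: snapshot the device list before and after
camcontrol devlist > /tmp/before.txt
# ... insert the drive, wait a few seconds ...
camcontrol devlist > /tmp/after.txt
diff /tmp/before.txt /tmp/after.txt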

Code:
Oct 14 18:24:09 nex04   (pass126:mps3:0:134:0): LOG SENSE. CDB: 4d 00 6f 00 00 00 00 00 40 00 length 64 SMID 1657 Command timeout on target 134(0x0042) 60000 set, 60.144451155 elapsed
Oct 14 18:24:09 nex04 mps3: Sending abort to target 134 for SMID 1657
Oct 14 18:24:09 nex04   (pass126:mps3:0:134:0): LOG SENSE. CDB: 4d 00 6f 00 00 00 00 00 40 00 length 64 SMID 1657 Aborting command 0xfffffe017e794298
Oct 14 18:24:09 nex04 mps3: Finished abort recovery for target 134
Oct 14 18:24:09 nex04 mps3: Unfreezing devq for target ID 134
Oct 14 18:24:40 nex04   (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 90 00 00 00 10 00 00 length 8192 SMID 2057 Command timeout on target 134(0x0042) 60000 set, 60.127710699 elapsed
Oct 14 18:24:40 nex04 mps3: Sending abort to target 134 for SMID 2057
Oct 14 18:24:40 nex04   (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 90 00 00 00 10 00 00 length 8192 SMID 2057 Aborting command 0xfffffe017e7b5c18
Oct 14 18:24:40 nex04   (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 90 00 00 00 10 00 00 length 8192 SMID 1764 Command timeout on target 134(0x0042) 60000 set, 60.128081841 elapsed
Oct 14 18:24:40 nex04   (da117:mps3:0:134:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00 length 8192 SMID 2115 Command timeout on target 134(0x0042) 60000 set, 60.128233281 elapsed
Oct 14 18:24:40 nex04 mps3: Continuing abort recovery for target 134
Oct 14 18:24:40 nex04 (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 90 00 00 00 10 00 00
Oct 14 18:24:40 nex04   (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 90 00 00 00 10 00 00 length 8192 SMID 1764 Aborting command 0(da117:mps3:0:134:0): CAM status: Command timeout
Oct 14 18:24:40 nex04 xfffffe017e79d260
Oct 14 18:24:40 nex04 (da117:mps3:0:134:0): Retrying command, 3 more tries remain
Oct 14 18:24:40 nex04 mps3: mpssas_action_scsiio: Freezing devq for target ID 134
Oct 14 18:24:40 nex04 (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 90 00 00 00 10 00 00
Oct 14 18:24:40 nex04 (da117:mps3:0:134:0): CAM status: CAM subsystem is busy
Oct 14 18:24:40 nex04 (da117:mps3:0:134:0): Retrying command, 2 more tries remain
Oct 14 18:24:40 nex04 mps3: Continuing abort recovery for target 134
Oct 14 18:24:40 nex04 (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 90 00 00 00 10 00 00
Oct 14 18:24:40 nex04   (da117:mps3:0:134:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00 length 8192 SMID 2115 Aborting command 0xfffffe017e7baa08
Oct 14 18:24:40 nex04 (da117:mps3:0:134:0): CAM status: Command timeout
Oct 14 18:24:40 nex04 (da117:mps3:0:134:0): Retrying command, 3 more tries remain
Oct 14 18:24:40 nex04 mps3: Finished abort recovery for target 134
Oct 14 18:24:40 nex04 (da117:mps3:0:134:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Oct 14 18:24:40 nex04 mps3: (da117:mps3:0:134:0): CAM status: Command timeout
Oct 14 18:24:40 nex04 (da117:mps3:0:134:0): Retrying command, 3 more tries remain
Oct 14 18:24:40 nex04 Unfreezing devq for target ID 134
Oct 14 18:25:09 nex04   (pass126:mps3:0:134:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 1666 Command timeout on target 134(0x0042) 60000 set, 60.15492830 elapsed
Oct 14 18:25:09 nex04 mps3: Sending abort to target 134 for SMID 1666
Oct 14 18:25:09 nex04   (pass126:mps3:0:134:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 1666 Aborting command 0xfffffe017e794eb0
Oct 14 18:25:09 nex04 mps3: Finished abort recovery for target 134
Oct 14 18:25:09 nex04 mps3: Unfreezing devq for target ID 134
Oct 14 18:25:40 nex04   (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 90 00 00 00 10 00 00 length 8192 SMID 2110 Command timeout on target 134(0x0042) 60000 set, 60.6892721 elapsed
Oct 14 18:25:40 nex04 mps3: Sending abort to target 134 for SMID 2110
Oct 14 18:25:40 nex04   (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 90 00 00 00 10 00 00 length 8192 SMID 2110 Aborting command 0xfffffe017e7ba350
Oct 14 18:25:40 nex04   (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 90 00 00 00 10 00 00 length 8192 SMID 2010 Command timeout on target 134(0x0042) 60000 set, 60.7235949 elapsed
Oct 14 18:25:40 nex04   (da117:mps3:0:134:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00 length 8192 SMID 1804 Command timeout on target 134(0x0042) 60000 set, 60.7379825 elapsed
Oct 14 18:25:40 nex04 mps3: Continuing abort recovery for target 134
Oct 14 18:25:40 nex04 (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 90 00 00 00 10 00 00
Oct 14 18:25:40 nex04   (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 90 00 00 00 10 00 00 length 8192 SMID 2010 Aborting command 0(da117:mps3:0:134:0): CAM status: Command timeout
Oct 14 18:25:40 nex04 xfffffe017e7b1cf0
Oct 14 18:25:40 nex04 (da117:mps3:0:134:0): Retrying command, 1 more tries remain
Oct 14 18:25:40 nex04 mps3: mpssas_action_scsiio: Freezing devq for target ID 134
Oct 14 18:25:40 nex04 (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 90 00 00 00 10 00 00
Oct 14 18:25:40 nex04 (da117:mps3:0:134:0): CAM status: CAM subsystem is busy
Oct 14 18:25:40 nex04 (da117:mps3:0:134:0): Retrying command, 0 more tries remain
Oct 14 18:25:40 nex04 mps3: Continuing abort recovery for target 134
Oct 14 18:25:40 nex04 (da117:mps3:0:134:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 90 00 00 00 10 00 00
Oct 14 18:25:40 nex04   (da117:mps3:0:134:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00 length 8192 SMID 1804 Aborting command 0xfffffe017e7a0820
Oct 14 18:25:40 nex04 (da117:mps3:0:134:0): CAM status: Command timeout
Oct 14 18:25:40 nex04 (da117:mps3:0:134:0): Retrying command, 2 more tries remain
Oct 14 18:25:40 nex04 mps3: (da117:mps3:0:134:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Oct 14 18:25:40 nex04 Finished abort recovery for target 134
 

q66 · Cadet · Joined Oct 13, 2020 · Messages: 9
Success!
Flashed the HBAs to v20 and all the hard drives to their latest firmware. I'm now able to replace drives as expected.

I flashed the hard drives from a live Linux boot using seaflashlin, but the HBAs I flashed from within TrueNAS.
I'm sure I could have flashed the drives from within TrueNAS too, using sg3_utils.
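For what it's worth, FreeBSD's native camcontrol can also push drive firmware, so something like this should work from within TrueNAS (the device and file name are placeholders; flash at your own risk):

Code:
# Download firmware to a drive via CAM; add -y to skip the confirmation prompt
# (the .lod file name is a placeholder for whatever Seagate supplies)
camcontrol fwdownload da117 -f /root/ST4000NM0023_latest.lod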

TrueNAS panicked after flashing each HBA that was on v15, despite the pool being exported at the time. Flashing the v20 controllers didn't cause a panic or any issue, even with the pool mounted.

Code:
root@nex03[~]# zpool export -f  pool0
root@nex03[~]# sas2flash -listall
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2308_2(D1)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS2308_2(D1)   20.00.04.00    14.01.00.06    07.39.00.00     00:06:00:00
1  SAS2308_2(D1)   20.00.04.00    14.01.00.06    07.39.00.00     00:08:00:00
2  SAS2308_2(D1)   15.00.00.00    0f.00.00.03    07.29.00.00     00:84:00:00
3  SAS2308_2(D1)   15.00.00.00    0f.00.00.03    07.29.00.00     00:86:00:00

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
root@nex03[~]# sas2flash -c 2 -f /usr/local/lib/firmware/mps_SAS9207-8e_p20.firmware.bin
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2308_2(D1)

        Executing Operation: Flash Firmware Image

                Firmware Image has a Valid Checksum.
                Firmware Version 20.00.07.00
                Firmware Image compatible with Controller.

                Valid NVDATA Image found.
                NVDATA Version 14.01.00.00
                Checking for a compatible NVData image...

                NVDATA Device ID and Chip Revision match verified.
                NVDATA Versions Compatible.
                Valid Initialization Image verified.
                Valid BootLoader Image verified.

                Beginning Firmware Download...
                Firmware Download Successful.

                Verifying Download...

                Firmware Flash Successful.

                Resetting Adapter...

Panic!


TrueNAS identifies the outdated firmware during install and upgrades and attempts to flash the HBAs, but it fails because the files aren't where it expects them to be.
The firmware files are in /usr/local/lib/firmware/* but it tries /usr/local/share/firmware/*; in my case it tried /usr/local/share/firmware/mps_SAS9207-8e_p20.firmware.bin.
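An untested workaround, purely a guess on my part, would be to symlink the path TrueNAS expects to the directory that actually holds the files:

Code:
# Point the expected path at the real firmware directory
ln -s /usr/local/lib/firmware /usr/local/share/firmware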

Active/Active SAS multipathing is working nicely, as is Fibre Channel round-robin multipathing to an ESXi cluster.
ESXi was even able to transition from iSCSI to FC without a single IO being dropped; iSCSI is now the backup path for the LUNs, while primary access round-robins across the 2x FC ports.

I'm pleased with the performance from within an ESXi VM using this old hardware as its datastore. The limit now seems to be the 2x 8Gbps FC ports; 8GFC delivers roughly 800 MB/s per port after encoding overhead, so about 1.6 GB/s combined.

[Benchmark screenshots attached: nex03-bench-peak.JPG and nex03-bench-real.JPG]


I'll try to post how I keep track of drives and the replacement procedure next time; it's not terribly simple ensuring JBOD/enclosure redundancy.
I may be dreaming, but is there a way to bind a ZFS spare disk so that it's only used for a failed disk in the same enclosure?
 

JaimieV · Guru · Joined Oct 12, 2012 · Messages: 742
Spares are attached to pools, so only if you keep a pool within an enclosure - which is the opposite of what you want of course.
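For illustration, attaching a spare is strictly a pool-level operation (the pool and disk names here are made up):

Code:
# A spare vdev belongs to the whole pool; there's no per-enclosure binding
zpool add pool0 spare da130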
 