HELP! Not sure how to proceed. Scrub shows degraded and faulted drive(s)

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
I had a similar problem, resolved by connecting HBA to a different port on my expander.

Supermicro H12SSL-CT (LSI3008) motherboard with BPN-SAS2-836EL1 expander.
Similar read and write errors, often working for a few hours, then bursts of errors.
Tried different HBAs, cables, drives, HBA and expander firmware, and fans on the HBA.

My expander has 3 x ports PRI_J1, PRI_J2, PRI_J3. The manual instructs to use PRI_J1 for HBA, PRI_J2 for cascade and no mention of what to do with PRI_J3. As instructed I was only using PRI_J1, had no cascade and I was getting errors. I now have the HBA connected to PRI_J2 and PRI_J3 and all working fine with no errors. I have no idea if all those SAS lanes are active but it's been working fine like that for a few weeks now.
I'm not sure which expander is in my machine but it only has one port.
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
This is the second time I've replaced the drives within this same hardware.
I had the exact same issue; the disks are fine. It is whatever you use as a SAS controller that is not seated properly. If it were defective, you would not see any disks. In my case, the PERC H710 contacts were not clean; a little alcohol fixed everything.
The alcohol did something, no more issues! The PERC was seated properly; I'm familiar with how delicate they are.
It was pretty scary to see all my disks damaged like in your screenshot. :)
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
All his drives are connected to the backplane, and backplane is connected to 9211-8i
That backplane does work with a mix of SAS and SATA, but due to protocol differences it is suboptimal and somewhat risky due to different voltages: SAS requiring higher voltage _I think_ would make all the disks operate at higher voltage, pls someone correct me if I'm wrong
It's the other way around: A mix of SAS and SATA drops voltage to SATA level, which is perfectly fine for the SAS drives provided that the cables/traces are short (per SATA specification).

Assuming I get all 12 drives SATA and then connect them all to the backplane and then the backplane into the 9211-8i will this work? Or am I still missing something by using the 8i with a 12 drive backplane?
If the backplane has a SAS expander, it will work with a single -8i.
If the backplane has no expander, you need to feed it with 12 SAS/SATA lanes from the HBA, so one -16i HBA or two -8i.
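
The lane arithmetic above can be sketched roughly as follows. This is only an illustration: it assumes an "-8i" HBA exposes 8 internal lanes and a "-16i" exposes 16, and the function name `hbas_needed` is made up for this example.

```python
import math

# Assumption: an "-8i" HBA exposes 8 internal SAS/SATA lanes.
LANES_PER_8I = 8


def hbas_needed(drive_bays, has_expander, lanes_per_hba=LANES_PER_8I):
    """Return how many HBAs of the given width are needed to reach every bay."""
    if has_expander:
        # An expander fans a single HBA link out to all bays.
        return 1
    # Direct-attach backplane: one HBA lane per drive bay.
    return math.ceil(drive_bays / lanes_per_hba)


print(hbas_needed(12, has_expander=True))                     # 1
print(hbas_needed(12, has_expander=False))                    # 2
print(hbas_needed(12, has_expander=False, lanes_per_hba=16))  # 1
```

So a 12-bay backplane without an expander needs two -8i cards or one -16i, which matches the advice above.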
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
If the backplane has a SAS expander, it will work with a single -8i.
If the backplane has no expander, you need to feed it with 12 SAS/SATA lanes from the HBA, so one -16i HBA or two -8i.
How do I know if the backplane has a SAS expander?
 

Alex_K

Explorer
Joined
Sep 4, 2016
Messages
64
How do I know if the backplane has a SAS expander?
Multiple ways, not to mention I already wrote that it has one:
1st, if it has 12 bays and connects with 2 or fewer SFF-8087 connectors, it has an expander
2nd, server documentation
3rd, the sas2ircu output shows it has a DL18xG6BP; it even has upgradable firmware, but yours is already on the up-to-date version
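
The first check can be written as a simple heuristic. This is a sketch only, assuming each SFF-8087 connector carries 4 lanes; `likely_has_expander` is a hypothetical helper name, and the documentation or a visual inspection remains the definitive answer.

```python
# Assumption: one SFF-8087 connector carries 4 SAS/SATA lanes.
LANES_PER_SFF8087 = 4


def likely_has_expander(drive_bays, sff8087_connectors):
    """More bays than directly wired lanes implies an expander on the backplane."""
    return drive_bays > sff8087_connectors * LANES_PER_SFF8087


print(likely_has_expander(12, 2))  # True: 12 bays but only 8 direct lanes
print(likely_has_expander(12, 3))  # False: 12 lanes can feed 12 bays directly
```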
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
4th, inspect the backplane; an expander is a chip which needs a heatsink.
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
Multiple ways, not to mention I already wrote that it has one:
1st, if it has 12 bays and connects with 2 or fewer SFF-8087 connectors, it has an expander
This is the case with my server. I'm thinking about replacing all of the drives and rebuilding the pool from scratch. I'll then do a burn-in on all the drives to confirm their integrity.
 

Alex_K

Explorer
Joined
Sep 4, 2016
Messages
64
Yep, do not forget to check the contacts on the controller and in the PCIe slots involved, as @TECK said, if you haven't yet.
I assumed you had made sure the contacts are OK, but you may have thought that the seller took care of it when they didn't.
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
That's the problem @Demonlinx has.

Focusing on disks is a waste of time; we already told you where the issue is and how to fix it. Did you read my reply?
Hey @TECK I was able to clean the contacts and re-seat the HBA card. My pool is still degraded and I still have a failed drive. What is the process for "repairing" the pool?

The disk that was FAILED is now OFFLINE. Is this better or worse?
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
I went ahead and "online"'d the drive. I then went through the resilvering process and the pool still shows as degraded. What should I do next?
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
This is with the same drives that showed errors since the beginning?

If so, I'm open to being wrong and having someone else's observations save the day here. But, imo, it's time to give up on those drives.
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
I think it would be best to test these drives one by one with a separate system to put that question to rest. Then you can move on to testing next issues if problems persist (with tested good drives).
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
Hey @TECK I was able to clean the contacts and re-seat the HBA card. My pool is still degraded and I still have a failed drive.
I'm 100% sure it is a server hardware issue; the disks are not faulty. For example, in a Dell R720 or similar, the PERC H710 is known to be very sensitive about contacts when you seat it. Even though there are 2 blue check connectors to make sure the card is seated properly, there is a half-millimeter gap between the HBA and the motherboard that allows a bad contact. I know about this, and every time I seat cards like that, I make sure I press them down well until they cannot move anymore. Look how small the motherboard connector is; it is very easy to have a bad contact:

22751831-2657320526.jpeg


The disk that was FAILED is now OFFLINE. Is this better or worse?
That's another confirmation of what I said earlier: you have a bad contact. If you read my original post, I was getting different drives failed or offline every time I played with the array trying to repair it, or rebooted the server. There is a bad contact somewhere in your server, so the HBA freaks out and throws a bunch of disk errors. That can be because the HBA is not seated properly, the motherboard connector is damaged, soldering issues, etc. All of these translate to a bad contact. Your goal is to determine which server part is the source of the problems. Stop focusing on disks, and use one single disk you know is 100% functional. If, for example, you install the HBA card into several PCIe slots and experience the same issues, you can assume the card might be defective, with one connector pin not soldered properly. Let me give you a perfect example: I had to purchase 3 PCIe HBA cards from eBay until I found a good one.

If you still have doubts after I have repeated myself many times with clear examples, there is not much more I can do.
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
I'm 100% sure it is a server hardware issue; the disks are not faulty. For example, in a Dell R720 or similar, the PERC H710 is known to be very sensitive about contacts when you seat it. Even though there are 2 blue check connectors to make sure the card is seated properly, there is a half-millimeter gap between the HBA and the motherboard that allows a bad contact. I know about this, and every time I seat cards like that, I make sure I press them down well until they cannot move anymore. Look how small the motherboard connector is; it is very easy to have a bad contact:

View attachment 58414


That's another confirmation of what I said earlier: you have a bad contact. If you read my original post, I was getting different drives failed or offline every time I played with the array trying to repair it, or rebooted the server. There is a bad contact somewhere in your server, so the HBA freaks out and throws a bunch of disk errors. That can be because the HBA is not seated properly, the motherboard connector is damaged, soldering issues, etc. All of these translate to a bad contact. Your goal is to determine which server part is the source of the problems. Stop focusing on disks, and use one single disk you know is 100% functional.
How do you go about isolating the other components? Motherboard connector damaged, soldering issues, etc. Those are all things that I don't understand or have experience with, yet.
If, for example, you install the HBA card into several PCIe slots and experience the same issues, you can assume the card might be defective, with one connector pin not soldered properly. Let me give you a perfect example: I had to purchase 3 PCIe HBA cards from eBay until I found a good one.
This I was not aware of. I'll try the other slots. Only have 2 more to try. I'll see if I can get another HBA card purchased.
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
I moved the HBA card to another slot and that didn't change anything. The same drives are still showing as degraded.
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
I moved the HBA card to another slot and that didn't change anything. The same drives are still showing as degraded.
That's good news, it means the card probably has some damaged contacts you cannot see. I've made sure I purchase cards from vendors who offer full refund on eBay. But first, let's check the card details and firmware.

Post below the following information:
  • Server brand and model you run Scale on
  • Exact HBA card model and LSI firmware version (output of sas2flash commands listed below)
TrueNAS is very picky with the version, you need 20.00.07.00-IT (note the IT part). Maybe you just need to update the firmware.

1661032436956.png


SSH into your server and run the sas2flash command as root, this is my output:
Code:
$ sudo -i
# sas2flash -listall
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

    Adapter Selected is a LSI SAS: SAS2308_2(D1)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS2308_2(D1)   20.00.07.00    14.01.00.06    07.39.02.00     00:02:00:00
1  SAS2308_2(D1)   20.00.07.00    14.01.00.06    07.39.02.00     00:44:00:00

    Finished Processing Commands Successfully.
    Exiting SAS2Flash.

ID 0 is PERC H710 Mini flashed to LSI 9207i firmware and ID 1 is PERC H810 PCIe flashed to LSI 9207e firmware. To see if you are in IT mode, check the Firmware Product ID (example for my HBA PCIe card, ID 1):
Code:
# sas2flash -list -c 1
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

    Adapter Selected is a LSI SAS: SAS2308_2(D1)

    Controller Number              : 1
    Controller                     : SAS2308_2(D1)
    PCI Address                    : 00:44:00:00
    SAS Address                    : [removed]
    NVDATA Version (Default)       : 14.01.00.06
    NVDATA Version (Persistent)    : 14.01.00.06
    Firmware Product ID            : 0x2214 (IT)
    Firmware Version               : 20.00.07.00
    NVDATA Vendor                  : LSI
    NVDATA Product ID              : SAS9207-8e
    BIOS Version                   : 07.39.02.00
    UEFI BSD Version               : 07.27.01.01
    FCODE Version                  : N/A
    Board Name                     : SAS9207-8e
    Board Assembly                 : N/A
    Board Tracer Number            : N/A

    Finished Processing Commands Successfully.
    Exiting SAS2Flash.
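
If you would rather check the output with a script than by eye, here is a minimal sketch. It assumes the `sas2flash -list` text looks like my output above (fields separated by " : "); the helper names are made up for illustration, and the sample text is just a few lines copied from my listing.

```python
# A few lines copied from the sas2flash -list output above.
SAMPLE = """\
    Adapter Selected is a LSI SAS: SAS2308_2(D1)

    Controller Number              : 1
    PCI Address                    : 00:44:00:00
    Firmware Product ID            : 0x2214 (IT)
    Firmware Version               : 20.00.07.00
    Board Name                     : SAS9207-8e
"""

# Version mentioned earlier in this post.
WANTED_VERSION = "20.00.07.00"


def parse_sas2flash_list(text):
    """Turn 'Key : Value' lines into a dict, skipping everything else."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # only lines that actually contain the separator
            fields[key.strip()] = value.strip()
    return fields


def is_it_mode(fields):
    """IT firmware shows '(IT)' at the end of the Firmware Product ID."""
    return fields.get("Firmware Product ID", "").endswith("(IT)")


fields = parse_sas2flash_list(SAMPLE)
print(is_it_mode(fields))                             # True
print(fields["Firmware Version"] == WANTED_VERSION)   # True
```

To use it on your own card, paste the text printed by `sas2flash -list -c <id>` in place of SAMPLE.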
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
That's good news, it means the card probably has some damaged contacts you cannot see. I've made sure I purchase cards from vendors who offer full refund on eBay. But first, let's check the card details and firmware.

Post below the following information:
  • Server brand and model you run Scale on
  • Exact HBA card model and LSI firmware version (output of sas2flash commands listed below)
Replaced the HBA card with a replacement unit we had:
1663262135137.png


The pool still shows as degraded and the same drives are still degraded.
 