LSI-9500 HBA resets and backplanes/drives disappear during high load/scrub

katbyte · Dec 5, 2023

Hi all,

Not sure how to fully debug this but I have a LSI Broadcom 9500-16i in a Supermicro 847BE1C4-R1K23LPB case with 32 SAS disks in use with a newish (couple months) old truenas SCALE VM on promox using PCI passthrough. It seems to be dropping the connection from the backplanes (BPN-SAS3-826EL1-N4, BPN-SAS3-846EL1) under a high load, and this can be repeatably triggered by scrubbing the ZFS pool. It will be fine for days, but then happen hours after starting a scrub, and multiple timed during said scrub till it finishes.

When it happens I can still use storcli to talk to the card and query it via storcli which shows the backplanes are missing, and from the dmesg it seems like there is a controller reset? After which the backplanes (connected via minisas HDx2 cable -> front backplane -> cascading to back backplane)

I have updated the 9500 to the latest version, checked the SAS cables/connections, and that the card is well seated.

Does anyone have any ideas/where to start to debug this?

additionally I have considered not cascading the two backplanes but worry if only 1 drops (12 disks) what will happen to the pool?

storcli /c0 show all - https://pastebin.com/TWLAYRre
here is where things go sideways in dmesg (https://pastebin.com/Q9Eu90BF)

[Mon Dec 4 23:10:33 2023] mpt3sas_cm0: mpt3sas_ctl_pre_reset_handler: Releasing the trace buffer due to adapter reset.
[Mon Dec 4 23:10:43 2023] mpt3sas_cm0: sending diag reset !!
[Mon Dec 4 23:10:44 2023] mpt3sas_cm0: diag reset: SUCCESS
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: _base_display_fwpkg_version: complete
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: FW Package Ver(28.00.00.00)
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: SAS3816: FWVersion(28.00.00.00), ChipRevision(0x00), BiosVersion(09.51.00.00)
[Mon Dec 4 23:10:46 2023] NVMe
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Diag Trace Buffer,Task Set Full,NCQ)
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: Enable interrupt coalescing only for first 8 reply queues
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: performance mode: balanced
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: sending port enable !!
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: port enable: SUCCESS
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: search for end-devices: start
[Mon Dec 4 23:10:57 2023] scsi target0:0:0: handle(0x0030), sas_addr(0x5000cca2c76775f9)
[Mon Dec 4 23:10:57 2023] scsi target0:0:0: enclosure logical id(0x5003048020bd1c3f), slot(0)
[Mon Dec 4 23:10:57 2023] scsi target0:0:1: handle(0x0031), sas_addr(0x5000cca2c7a9bce1)
[Mon Dec 4 23:10:57 2023] scsi target0:0:1: enclosure logical id(0x5003048020bd1c3f), slot(1)
....
[Mon Dec 4 23:10:57 2023] scsi target0:0:32: handle(0x0051), sas_addr(0x5000cca2c7ac1b09)
[Mon Dec 4 23:10:57 2023] scsi target0:0:32: enclosure logical id(0x50030480211a293f), slot(9)
[Mon Dec 4 23:10:57 2023] scsi target0:0:33: handle(0x0052), sas_addr(0x5000cca2c7ac0cb9)
[Mon Dec 4 23:10:57 2023] scsi target0:0:33: enclosure logical id(0x50030480211a293f), slot(10)
[Mon Dec 4 23:10:57 2023] scsi target0:0:34: handle(0x0053), sas_addr(0x50030480211a293d)
[Mon Dec 4 23:10:57 2023] scsi target0:0:34: enclosure logical id(0x50030480211a293f), slot(12)
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: search for end-devices: complete
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: search for end-devices: start
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: search for PCIe end-devices: complete
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: search for expanders: start
[Mon Dec 4 23:10:57 2023] expander present: handle(0x002f), sas_addr(0x5003048020bd1c3f), port:0
[Mon Dec 4 23:10:57 2023] expander present: handle(0x003c), sas_addr(0x50030480211a293f), port:0
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: search for expanders: complete
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: mpt3sas_base_hard_reset_handler: SUCCESS
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: removing unresponding devices: start
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: removing unresponding devices: end-devices
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: Removing unresponding devices: pcie end-devices
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: removing unresponding devices: expanders
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: removing unresponding devices: complete
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: Update devices with firmware reported queue depth
[Mon Dec 4 23:10:58 2023] sd 0:0:0:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:1:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:2:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:3:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:4:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
...
[Mon Dec 4 23:10:58 2023] sd 0:0:29:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:30:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:31:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:32:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:33:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] ses 0:0:34:0: qdepth(64), tagged(1), scsi_level(6), cmd_que(1)
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: start
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: hba_port entry: 00000000df9f5468, port: 8 is added to hba_port list
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: expanders start
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: expanders complete
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: end devices start
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: end devices complete
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: pcie end devices start
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: break from pcie end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: pcie devices: pcie end devices complete
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: complete
[Mon Dec 4 23:12:16 2023] mpt3sas_cm0: sending diag reset !!
[Mon Dec 4 23:12:16 2023] mpt3sas_cm0: diag reset: SUCCESS
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: _base_display_fwpkg_version: complete
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: FW Package Ver(28.00.00.00)
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: SAS3816: FWVersion(28.00.00.00), ChipRevision(0x00), BiosVersion(09.51.00.00)
[Mon Dec 4 23:12:18 2023] NVMe
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Diag Trace Buffer,Task Set Full,NCQ)
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: Enable interrupt coalescing only for first 8 reply queues
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: performance mode: balanced
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: sending port enable !!
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: port enable: SUCCESS
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: search for end-devices: start
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: search for end-devices: complete
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: search for end-devices: start
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: search for PCIe end-devices: complete
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: search for expanders: start
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: search for expanders: complete
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: mpt3sas_base_hard_reset_handler: SUCCESS
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[Mon Dec 4 23:14:19 2023] sd 0:0:13:0: [sdo] tag#2946 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=133s
[Mon Dec 4 23:14:19 2023] sd 0:0:8:0: [sdj] tag#3254 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=133s
[Mon Dec 4 23:14:19 2023] sd 0:0:32:0: [sdag] tag#3032 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=133s
[Mon Dec 4 23:14:19 2023] sd 0:0:32:0: [sdag] tag#3032 CDB: Read(16) 88 00 00 00 00 01 68 91 e2 00 00 00 07 f0 00 00
[Mon Dec 4 23:14:19 2023] I/O error, dev sdag, sector 6049358336 op 0x0:(READ) flags 0x0 phys_seg 111 prio class 2

kingtool · Dec 10, 2023

I've been having the same issue for years with various Broadcom cards + mpt3sas. It just never ends, and I have to zpool clear the devices regularly and hope that I'm not clearing anything serious. It's become so incredibly tedious and janky.

The one thing I can say which may help you identify the problem: I am currently using a SAS3816-based Dell HBA355e (equivalent to a 9500-16e), very similar to you. My HBA355e has connections to two very different types of backplanes, Supermicro ones like yours, and a WD Data60. The HBA actually has two modules on board, cm0 and cm1, and I can tell you that I *only* ever get resets on cm1, the one plumbed to the Supermicro backplanes. Never to the Data60 (cm0). If I swap the cables, it follows.

Can you check your firmware revision on your backplane with CLIXTL? Here's mine, a BPN-SAS3-846EL2:

Code:

ENCLOSURE INFORMATION:
    PLATFORM NAME  - SMC846ELSAS3P     
    SERIAL NUMBER  -       
    VENDOR ID      - LSI   
    PRODUCT ID     - SAS3x40       

VERSION INFORMATION:
    FLASH REGION 0 - 66.06.01.02
    FLASH REGION 1 - 66.06.01.02
    FLASH REGION 2 - 66.06.01.02
    FLASH REGION 3 - 06.02

Edit: I just realized something. On this machine this BP is actually on an old FW revision. I had updated all my other backplanes a while back, but not this one. The latest revision as of last year was 66.16.11.00, which I'm applying now. I'll start a new scrub and see what happens, and update here with results.

Still, please share your backplane firmware revision so we can compare notes.

kingtool · Dec 11, 2023

Updating my backplanes resolved my issue. Zero drops for the other half of my scrub.

OP, I think you may want to make sure both your backplane firmware and HBA firmware are up to date. Particularly the backplane firmware made the difference for me.

kingtool · Jan 24, 2024

Reporting back a little over a month later and can confirm my next couple of scrubs ran without issue. Always make sure to update backplane and HBA firmware. The backplane firmware can be tricky to track down -- Supermicro support will provide it if you give them your model # and serial numbers. They are surprisingly responsive, albeit weird, as they will send you temporary FTP credentials to some employee FTP server they maintain.

Kailee71 · Jan 24, 2024

Thank you for reporting back, especially after such a long time. I'm sure this will help others down the line.

Ericloewe · Jan 24, 2024

kingtool said:
Reporting back a little over a month later and can confirm my next couple of scrubs ran without issue. Always make sure to update backplane and HBA firmware. The backplane firmware can be tricky to track down -- Supermicro support will provide it if you give them your model # and serial numbers. They are surprisingly responsive, albeit weird, as they will send you temporary FTP credentials to some employee FTP server they maintain.

That's far closer to the metaphorical "hosted on that one engineer's home directory" than it has any right to be and it's simultaneously hilarious and terrifying. At least the HBA firmware is just hidden away in a corner, but publicly accessible.

katbyte · Jan 24, 2024

Thanks for the tip and update @kingtool ! I still haven't resolved this yet so it is appreciated - was going to just swap the cables and not cascade through the front backplane when I had some time. When i reached out to super micro they said "there was no updates for backplane X" but maybe they missed one of them/I should ask for "the current firmware versions" and go from there.

Looks like they are quite up to date, butdifferent, so maybe it is that they are not the same causing issues?

[14:29:54] root@data:~/tools/clixtl6# ./CLIXTL -i -t 50030480211A293F

UNIT SPECIFIC INFORMATION:
SAS ADDRESS - 50030480211A293F
ENCLOSURE ID - 50030480211A293F

ENCLOSURE INFORMATION:
PLATFORM NAME - SMC826ELSAS3N4P
SERIAL NUMBER -
VENDOR ID - SMC
PRODUCT ID - SC826N4

VERSION INFORMATION:
FLASH REGION 0 - 66.16.13.00
FLASH REGION 1 - 66.16.13.00
FLASH REGION 2 - 66.16.13.00
FLASH REGION 3 - 16.13

DEVICE INFORMATION:
DEVICE NAME - /dev/sg36
BMC IP - NULL

[14:31:42] root@data:~/tools/clixtl6# ./CLIXTL -i -t 5003048020BD1C3F

UNIT SPECIFIC INFORMATION:
SAS ADDRESS - 5003048020BD1C3F
ENCLOSURE ID - 5003048020BD1C3F

ENCLOSURE INFORMATION:
PLATFORM NAME - SMC846ELSAS3P
SERIAL NUMBER -
VENDOR ID - SMC
PRODUCT ID - SC846-P

VERSION INFORMATION:
FLASH REGION 0 - 66.16.11.00
FLASH REGION 1 - 66.16.11.00
FLASH REGION 2 - 66.16.11.00
FLASH REGION 3 - 16.11

DEVICE INFORMATION:
DEVICE NAME - /dev/sg26
BMC IP - NULL

[14:31:59] root@data:~/tools/clixtl6#

EDIT: SMC support confirmed these are the latest version for my backplanes.

kingtool · Jan 27, 2024

OK, looks like it's not the firmware after all. So many variables often lead to similar behavior, so it was worth a shot. FWIW, my similar HBA (the Dell HBA355e) is still on P24 (24.15.05.00). I am less inclined to believe it's firmware with you on P28 but if you are feeling particularly adventurous you could try downgrading to the P24 series and see how it goes. But, see below first.

Is there nothing interesting above the first couple of lines of dmesg you shared? The HBA is definitely resetting, and what are you seeing past that is the fallout. Are you sure it's not overheating? I keep track of my HBA temps with "storcli64 /call show temperature nolog" (you can also pass "j" as an argument to get json output, parse it, and put it into influxdb or similar to track it over time). Could you kick off a scrub and check the temp after an hour? You can also simulate a high IO environment on the HBA by just running a bunch of parallel dd reads to all the disks. My similar HBA is holding around 40C, load or not. Overheating usually manifests as individual disks getting reset due to timeouts, or sometimes the entire controller will reset on its own (which causes long enough timeouts that the kernel resets them anyway)

A friend of mine 3D printed a little bracket for a slim blower fan for my card which works very well. You'll find a lot of people getting creative on these forums with cooling strategies if they aren't already running really loud screamer servers with tons of airflow in the addon card area. If you don't have a fan strapped on there or a LOT of airflow through the general AIO area your HBA will cook itself.

Ericloewe · Jan 28, 2024

kingtool said:
I keep track of my HBA temps with "storcli64 /call show temperature nolog"

Man, I hate storcli. Spent an hour the other day trying to figure that one out, but the manual is a disaster and people don't seem to give straight answers to this question. If nothing else good happens from this thread, at least know that there's that little snippet.

In more general terms, HBA overheating is a longstanding issue that's often overlooked. It's enough of an issue that LSI added a temperature readout to the SAS2.5 generation ICs, like the SAS2308 (and newer ones, obviously). Since the 9500s are all pretty new, it should be just a matter of airflow (no old thermal paste to worry about).

kingtool said:
You'll find a lot of people getting creative on these forums with cooling strategies if they aren't already running really loud screamer servers with tons of airflow in the addon card area.

That's a polite way of saying "a 40 mm PWM fan attached to the heat sink with two zip ties", which is a solution so simple that it overflows the "janky" category right into "reasonably elegant" territory.

kingtool · Jan 28, 2024

Oh, I hate it too. Very much. Before they added the "show temperature" command, older versions of storcli would only show the temp data with a "show all". The problem with "show all" is that is does quite a bit of querying to the disks, and if you run it frequently enough (like ingesting the data into influxdb) it can cause timeouts on the disks -- and thus resets. I had quite an odyssey tracking that one down a couple of years ago. When they added "show temperature" it returns much, much faster and doesn't bother the disks, so I run it every 10 seconds now.

I usually do the zip tie thing too. It gets trickier when you're using all your PCIe slots and don't have the clearance for the zip tie solution. Here's a pic of what my friend came up with for one of the cards in my box (not the 355e, but a 9300-8e). There's a bracket for the fan and a duct to guide it through the heatsink fins. All while keeping it inside the envelope for a single slot.

kingtool · Jan 28, 2024

I hate posting things in random forum threads, but if anyone wants to re-share in a top level thread, here are the 3D models for the 9300-8e and the Dell HBA355e. Fan models are included and both are friction fit to the cards.

Printables

www.printables.com

Printables

www.printables.com

Ericloewe · Jan 28, 2024

I would suggest that you post those to the Resources section. I don't think we have any mechanical designs there yet, but I'm sure someone will find them useful.

kingtool · Jan 28, 2024

Ericloewe said:
I would suggest that you post those to the Resources section. I don't think we have any mechanical designs there yet, but I'm sure someone will find them useful.

Done

katbyte · Feb 1, 2024

kingtool said:
Is there nothing interesting above the first couple of lines of dmesg you shared? The HBA is definitely resetting, and what are you seeing past that is the fallout. Are you sure it's not overheating? I keep track of my HBA temps with "storcli64 /call show temperature nolog" (you can also pass "j" as an argument to get json output

nothing in dmesg before that I could see, pretty much just resets outta nowhere. in the dmesg I attached above you can see there is nothing for 2 hours from the previous message and then it resets.

as for temperature it sits around 50-60 - the boardcom support engineer I spoke to said anything under 70 is fine for the card but maybe he was wrong? it wouldn't be to hard to just pull the top off and point a high flow fan at it and see what happens to rule it out.

I plan this weekend to finally swap the cables and no longer cascade the rear backplane so *fingers crossed* that does it - if not I'll give keeping the card much cooler a shot

kingtool · Feb 1, 2024

Can you actually run a scrub to get it heated up and post the temperatures at 30 minute intervals?

sretalla · Feb 1, 2024

Not sure if it's interesting, but I have a resource available to log temperatures (disks and LSI HBA) to an influxdb (then paired with Grafana to chart)...

InfluxDB Disk Temperature Logging by Serial Number (or chosen substitution)

It was brought to my attention that it may be useful for some of the forum members to log disk temperatures periodically to an influxDB (and then chart that with Grafana like the below example) to help with trends for individual disks or...

www.truenas.com

That may help you to keep an more consistent eye on the temps.

katbyte · Feb 2, 2024

That looks super useful thanks, setting up influxdb and graphana have been on my list of things to get to for a while now (as is finishing Loki for logs) and then spinning fans up and down as needed, but I've just not had time. It's a shame truenas does not grab SAS disk temps natively.

but for now I'll just use storcli & bash to spit out the temps every couple min to see what is happening - but speaking of storcli, how do I get an updated version on scale? such as the storcli64 command you are using, I've been using storcli /c0 show all | egrep temp - version 007.1504.0000.0000.

katbyte · Feb 2, 2024

figured it out, for anyone else: download from https://docs.broadcom.com/docs/1232743397 - unzip then dpkg -i Ubuntu/storcli_007.2705.0000.0000_all.deb and add either /opt/MegaRAID/storcli or /opt/MegaRAID/storcli/storcli64 to your path

WI_Hedgehog · Feb 2, 2024

I use StarTech Expansion Slot Rear Exhaust Cooling Fans, they do the job. Sometimes I'll form a custom cage, depending on the server temps. and airflow.

katbyte · Feb 2, 2024

sadly I do not have a free slot, I would need to do something like kingtools friend made - that is if temperature is the problem (and regardless of how hard it might be to cool it would be a nice to have a definite solution)

I did try to find a antec spot cool(https://aphnetworks.com/reviews/antec_spot_cool) but they don't make them anymore

Important Announcement for the TrueNAS Community.

LSI-9500 HBA resets and backplanes/drives disappear during high load/scrub

Dabbler

Dabbler

Dabbler

Dabbler

Contributor

Server Wrangler

Dabbler

Dabbler

Server Wrangler

Dabbler

Dabbler

Server Wrangler

Dabbler

Dabbler

Dabbler

Powered by Neutrality

Dabbler

Dabbler

Guru

Dabbler

Similar threads