Hi all,
Not sure how to fully debug this, but I have an LSI Broadcom 9500-16i in a Supermicro 847BE1C4-R1K23LPB chassis with 32 SAS disks in use, on a fairly new (a couple of months old) TrueNAS SCALE VM on Proxmox using PCI passthrough. The card seems to be dropping the connection to the backplanes (BPN-SAS3-826EL1-N4, BPN-SAS3-846EL1) under high load, and this can be repeatably triggered by scrubbing the ZFS pool. It will be fine for days, then the drop happens hours after starting a scrub, and multiple times during said scrub until it finishes.
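For reference, this is roughly how I trigger it and watch it happen (using "tank" as a placeholder for my actual pool name): start the scrub, then follow the kernel log in a second shell.

zpool scrub tank
dmesg -wT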
When it happens I can still use storcli to talk to the card, and querying it shows the backplanes as missing; from dmesg it looks like the controller resets itself, after which the backplanes (connected via a MiniSAS HD x2 cable -> front backplane -> cascaded to the rear backplane) are rediscovered.
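For the record, these are the sort of queries I run when it happens (controller index 0 assumed; exact sub-command syntax may differ between storcli versions):

storcli /c0 show all
storcli /c0/eall show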
I have updated the 9500 to the latest firmware, checked the SAS cables and connections, and verified that the card is well seated.
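The running firmware level can be cross-checked against the FW Package Ver in the dmesg output below (a quick sketch that just filters the full storcli dump):

storcli /c0 show all | grep -iE 'fw|firmware'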
Does anyone have any ideas on where to start debugging this?
Additionally, I have considered not cascading the two backplanes, but I worry about what would happen to the pool if only one backplane (12 disks) drops.
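For context on that worry: as I understand it, whether the pool would survive depends on how those 12 disks map onto the vdevs, which the pool layout shows (again using "tank" as a placeholder pool name):

zpool status -v tank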
storcli /c0 show all - https://pastebin.com/TWLAYRre
Here is where things go sideways in dmesg (full snippet: https://pastebin.com/Q9Eu90BF):
[Mon Dec 4 23:10:33 2023] mpt3sas_cm0: mpt3sas_ctl_pre_reset_handler: Releasing the trace buffer due to adapter reset.
[Mon Dec 4 23:10:43 2023] mpt3sas_cm0: sending diag reset !!
[Mon Dec 4 23:10:44 2023] mpt3sas_cm0: diag reset: SUCCESS
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: _base_display_fwpkg_version: complete
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: FW Package Ver(28.00.00.00)
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: SAS3816: FWVersion(28.00.00.00), ChipRevision(0x00), BiosVersion(09.51.00.00)
[Mon Dec 4 23:10:46 2023] NVMe
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Diag Trace Buffer,Task Set Full,NCQ)
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: Enable interrupt coalescing only for first 8 reply queues
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: performance mode: balanced
[Mon Dec 4 23:10:46 2023] mpt3sas_cm0: sending port enable !!
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: port enable: SUCCESS
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: search for end-devices: start
[Mon Dec 4 23:10:57 2023] scsi target0:0:0: handle(0x0030), sas_addr(0x5000cca2c76775f9)
[Mon Dec 4 23:10:57 2023] scsi target0:0:0: enclosure logical id(0x5003048020bd1c3f), slot(0)
[Mon Dec 4 23:10:57 2023] scsi target0:0:1: handle(0x0031), sas_addr(0x5000cca2c7a9bce1)
[Mon Dec 4 23:10:57 2023] scsi target0:0:1: enclosure logical id(0x5003048020bd1c3f), slot(1)
....
[Mon Dec 4 23:10:57 2023] scsi target0:0:32: handle(0x0051), sas_addr(0x5000cca2c7ac1b09)
[Mon Dec 4 23:10:57 2023] scsi target0:0:32: enclosure logical id(0x50030480211a293f), slot(9)
[Mon Dec 4 23:10:57 2023] scsi target0:0:33: handle(0x0052), sas_addr(0x5000cca2c7ac0cb9)
[Mon Dec 4 23:10:57 2023] scsi target0:0:33: enclosure logical id(0x50030480211a293f), slot(10)
[Mon Dec 4 23:10:57 2023] scsi target0:0:34: handle(0x0053), sas_addr(0x50030480211a293d)
[Mon Dec 4 23:10:57 2023] scsi target0:0:34: enclosure logical id(0x50030480211a293f), slot(12)
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: search for end-devices: complete
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: search for end-devices: start
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: search for PCIe end-devices: complete
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: search for expanders: start
[Mon Dec 4 23:10:57 2023] expander present: handle(0x002f), sas_addr(0x5003048020bd1c3f), port:0
[Mon Dec 4 23:10:57 2023] expander present: handle(0x003c), sas_addr(0x50030480211a293f), port:0
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: search for expanders: complete
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: mpt3sas_base_hard_reset_handler: SUCCESS
[Mon Dec 4 23:10:57 2023] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: removing unresponding devices: start
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: removing unresponding devices: end-devices
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: Removing unresponding devices: pcie end-devices
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: removing unresponding devices: expanders
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: removing unresponding devices: complete
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: Update devices with firmware reported queue depth
[Mon Dec 4 23:10:58 2023] sd 0:0:0:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:1:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:2:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:3:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:4:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
...
[Mon Dec 4 23:10:58 2023] sd 0:0:29:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:30:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:31:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:32:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] sd 0:0:33:0: qdepth(64), tagged(1), scsi_level(8), cmd_que(1)
[Mon Dec 4 23:10:58 2023] ses 0:0:34:0: qdepth(64), tagged(1), scsi_level(6), cmd_que(1)
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: start
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: hba_port entry: 00000000df9f5468, port: 8 is added to hba_port list
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: expanders start
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: expanders complete
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: end devices start
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: end devices complete
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: pcie end devices start
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: break from pcie end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: pcie devices: pcie end devices complete
[Mon Dec 4 23:10:58 2023] mpt3sas_cm0: scan devices: complete
[Mon Dec 4 23:12:16 2023] mpt3sas_cm0: sending diag reset !!
[Mon Dec 4 23:12:16 2023] mpt3sas_cm0: diag reset: SUCCESS
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: _base_display_fwpkg_version: complete
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: FW Package Ver(28.00.00.00)
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: SAS3816: FWVersion(28.00.00.00), ChipRevision(0x00), BiosVersion(09.51.00.00)
[Mon Dec 4 23:12:18 2023] NVMe
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Diag Trace Buffer,Task Set Full,NCQ)
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: Enable interrupt coalescing only for first 8 reply queues
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: performance mode: balanced
[Mon Dec 4 23:12:18 2023] mpt3sas_cm0: sending port enable !!
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: port enable: SUCCESS
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: search for end-devices: start
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: search for end-devices: complete
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: search for end-devices: start
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: search for PCIe end-devices: complete
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: search for expanders: start
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: search for expanders: complete
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: mpt3sas_base_hard_reset_handler: SUCCESS
[Mon Dec 4 23:14:19 2023] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[Mon Dec 4 23:14:19 2023] sd 0:0:13:0: [sdo] tag#2946 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=133s
[Mon Dec 4 23:14:19 2023] sd 0:0:8:0: [sdj] tag#3254 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=133s
[Mon Dec 4 23:14:19 2023] sd 0:0:32:0: [sdag] tag#3032 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=133s
[Mon Dec 4 23:14:19 2023] sd 0:0:32:0: [sdag] tag#3032 CDB: Read(16) 88 00 00 00 00 01 68 91 e2 00 00 00 07 f0 00 00
[Mon Dec 4 23:14:19 2023] I/O error, dev sdag, sector 6049358336 op 0x0:(READ) flags 0x0 phys_seg 111 prio class 2