Troubleshoot MPS Freezing - Drives Don't Populate

ere109

Contributor
Joined
Aug 22, 2017
Messages
190
Note: I've identified the LSI controller as the fault - now testing solutions.

I need help troubleshooting drives that don't populate from a second (JBOD) chassis. I can't tell whether the HP SAS expander (487738-001) is malfunctioning, or whether it's the LSI 2308 or my TQ backplane.
I want to add six new 14 TB disks, but my 836 chassis (System 1) is out of space. I built a second server in a 743 chassis with an X11SSM-F, so I decided to set up a smarter JBOD (a second server to experiment with, with all disks tied back to the first server through TQ backplane -> HP SAS expander -> LSI 4e). I bought an HP SAS expander on eBay - it didn't come with a bracket. I built the system, installed SCALE just because, connected the 743 TQ backplane to the HP expander, and ran it to the LSI 4e card in my main system.
Nothing populates in "Disks" on my main system (maybe - see below). I don't even see drive activity lights in the 743 when I power on. I can see all three LSI controllers using "dmesg | grep mps" (posted below), and "sudo sas2flash -listall" hangs (UPDATE: Ctrl-C said "packet_write_wait: Connection to 192.168.xxx port 22: Broken pipe").
The three easiest answers would be: bad LSI controller, bad HP expander, or trouble with the backplane. How do I verify which?
I did connect the SGPIO cables on System 2. Could they send some power-down signal, since no drives are connected to the second board?
Is there a way to query the HP SAS expander from either CLI?
Is there a disk size limit on the HP expander?
Future step: would iSCSI work more reliably - how much learning curve/setup is there?

sudo dmesg | grep mps
mps0: <Avago Technologies (LSI) SAS2308> port 0xd000-0xd0ff mem 0xfbe40000-0xfbe4ffff,0xfbe00000-0xfbe3ffff irq 32 at device 0.0 on pci2
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
mps1: <Avago Technologies (LSI) SAS2308> port 0xc000-0xc0ff mem 0xfbc40000-0xfbc4ffff,0xfbc00000-0xfbc3ffff irq 40 at device 0.0 on pci3
mps1: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps1: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
mps2: <Avago Technologies (LSI) SAS2308> port 0xb000-0xb0ff mem 0xfba40000-0xfba4ffff,0xfba00000-0xfba3ffff irq 16 at device 0.0 on pci6
mps2: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps2: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
mps2: mpssas_prepare_remove: Sending reset for target ID 19
mps2: mpssas_prepare_remove: Sending reset for target ID 16
mps2: mpssas_prepare_remove: Sending reset for target ID 17
mps2: mpssas_prepare_remove: Sending reset for target ID 13
da16 at mps2 bus 0 scbus2 target 17 lun 0
da14 at mps2 bus 0 scbus2 target 15 lun 0
ses0 at mps2 bus 0 scbus2 target 13 lun 0
da18 at mps2 bus 0 scbus2 target 19 lun 0
da17 at mps2 bus 0 scbus2 target 18 lun 0
da15 at mps2 bus 0 scbus2 target 16 lun 0
(ses0:mps2:0:13:0): Periph destroyed
mps2: No pending commands: starting remove_device
mps2: Unfreezing devq for target ID 15
mps2: No pending commands: starting remove_device
mps2: Unfreezing devq for target ID 18
mps2: No pending commands: starting remove_device
mps2: Unfreezing devq for target ID 13
mps2: No pending commands: starting remove_device
mps2: Unfreezing devq for target ID 19
mps2: No pending commands: starting remove_device
mps2: Unfreezing devq for target ID 17
mps2: No pending commands: starting remove_device
mps2: Unfreezing devq for target ID 16
(da18:mps2:0:19:0): Periph destroyed
(da17:mps2:0:18:0): Periph destroyed
(da16:mps2:0:17:0): Periph destroyed
(da15:mps2:0:16:0): Periph destroyed
(da14:mps2:0:15:0): Periph destroyed
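A log like this is easier to read once each device's last known state is pulled out. The following is a minimal Python sketch, not part of the original post, that scans dmesg text for the attach and "Periph destroyed" messages shown above and reports the most recent event per device. The function name device_states and the shortened sample are illustrative assumptions; the regular expressions only cover the two message shapes seen in this thread.

```python
import re

# Shortened sample using lines copied from the dmesg output in this thread.
DMESG = """\
da16 at mps2 bus 0 scbus2 target 17 lun 0
da14 at mps2 bus 0 scbus2 target 15 lun 0
ses0 at mps2 bus 0 scbus2 target 13 lun 0
(ses0:mps2:0:13:0): Periph destroyed
(da16:mps2:0:17:0): Periph destroyed
"""

# "da16 at mps2 bus 0 scbus2 target 17 lun 0"
ATTACH = re.compile(r"^(\w+) at (mps\d+) bus \d+ scbus\d+ target (\d+) lun \d+")
# "(da16:mps2:0:17:0): Periph destroyed"
DESTROY = re.compile(r"^\((\w+):(mps\d+):\d+:(\d+):\d+\): Periph destroyed")


def device_states(text):
    """Map device -> (controller, target, state), keeping the last event seen."""
    states = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = ATTACH.match(line)
        if m:
            states[m.group(1)] = (m.group(2), int(m.group(3)), "attached")
            continue
        m = DESTROY.match(line)
        if m:
            states[m.group(1)] = (m.group(2), int(m.group(3)), "destroyed")
    return states


if __name__ == "__main__":
    for dev, (ctrl, target, state) in sorted(device_states(DMESG).items()):
        print(f"{dev}: {ctrl} target {target} -> {state}")
```

Run against the full output, every da device on mps2 would end in the "destroyed" state, which matches the drives vanishing from the main system.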
 

ere109

Update - after the system hang, I went down and re-seated both cards, re-seated both ends of the SFF cable, and switched from Port 0 to Port 1. After a reset, all the new drives appear. I'll let it rest overnight and check them again in the morning. I'm not sure if it's heat (I have fans blowing on both cards), the LSI port, or maybe the card without a bracket vibrating loose over time.
 

ere109

Update 2: Within 10 minutes of starting back up, Server 1 hung again. I've shut down System 2 for the night, and tomorrow I'll remove the HP expander and connect the drives directly.
Is there a good iSCSI guide for beginners?
 

ere109

With System 2 off (and therefore the HP SAS expander off, but still plugged in), the server ran fine all night. However, I got up this morning and tried "sudo sas2flash -listall" again, and the CLI hung, though the system stayed up. Checking dmesg gave this:
mps2: IOC Fault 0x40002622, Resetting
mps2: Reinitializing controller
mps2: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps2: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
mps2: Calling Reinit from mps_wait_command, timeout=60, elapsed=60
mps2: Reinitializing controller
This leads me to believe it's the LSI card having trouble. I've read a bit about heatsink paste going bad over time - might check that, unless someone else has a suggestion.
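To see whether faults keep recurring on the same card, the "IOC Fault" lines can be counted per controller. This is a rough Python sketch under the assumption that the fault message always has the shape shown in the dmesg excerpt above; the function name ioc_faults is made up for illustration, and the sketch does not try to interpret the fault code itself.

```python
import re
from collections import Counter

# Sample lines copied from the dmesg excerpt in this post.
SAMPLE = """\
mps2: IOC Fault 0x40002622, Resetting
mps2: Reinitializing controller
mps2: Calling Reinit from mps_wait_command, timeout=60, elapsed=60
mps2: Reinitializing controller
"""

# Matches e.g. "mps2: IOC Fault 0x40002622, Resetting"
FAULT = re.compile(r"^(mps\d+): IOC Fault (0x[0-9a-fA-F]+), Resetting")


def ioc_faults(text):
    """Count IOC Fault messages, keyed by (controller, fault code)."""
    counts = Counter()
    for raw in text.splitlines():
        m = FAULT.match(raw.strip())
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (ctrl, code), n in sorted(ioc_faults(SAMPLE).items()):
        print(f"{ctrl}: fault {code} seen {n} time(s)")
```

If only one controller ever shows up in the counts while the others stay clean, that points at the card (or its slot and cooling) rather than the expander or backplane.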
 

ere109

I found an article with the same LSI fault code. My strong suspicion is that this is a heat-related controller error.
I just found a post from jgrego discussing mpsutil and read through the man page. This allowed me to check stats on the currently running cards.
mpsutil show adapters
Device Name  Chip Name   Board Name  Firmware
/dev/mps0    LSISAS2308  SAS9207-8i  14000700
/dev/mps1    LSISAS2308  LSI2308-IT  14000700

Notice the third card didn't populate. Another command gave me current data on my first and second cards, then hung the system while attempting to talk to the third card - mps2. I'll continue to investigate heatsink and thermal paste replacement.
mpsutil show all
Adapter:
mps0 Adapter:
  Board Name: SAS9207-8i
  Board Assembly:
  Chip Name: LSISAS2308
  Chip Revision: ALL
  BIOS Revision: 7.39.02.00
  Firmware Revision: 20.00.07.00
  Integrated RAID: no
  SATA NCQ: ENABLED
  PCIe Width/Speed: x8 (8.0 GB/sec)
  IOC Speed: Full
  Temperature: 56 C
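Since heat is the suspect, it can help to pull the Temperature field out of this output and watch it over time. The sketch below is a hedged Python example that parses saved "mpsutil show all" text in the layout shown above. The function names and the limit passed to hot_adapters are illustrative assumptions, not values from the LSI documentation, so pick a limit that suits your own card and airflow.

```python
import re

# Sample trimmed from the "mpsutil show all" output in this post.
SAMPLE = """\
Adapter:
mps0 Adapter:
  Board Name: SAS9207-8i
  Chip Name: LSISAS2308
  Firmware Revision: 20.00.07.00
  Temperature: 56 C
"""

ADAPTER = re.compile(r"^(mps\d+) Adapter:$")
TEMP = re.compile(r"^Temperature:\s*(\d+)\s*C$")


def adapter_temperatures(text):
    """Return {adapter: temperature in degrees C} from mpsutil-style text."""
    temps = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = ADAPTER.match(line)
        if m:
            current = m.group(1)
            continue
        m = TEMP.match(line)
        if m and current is not None:
            temps[current] = int(m.group(1))
    return temps


def hot_adapters(temps, limit_c):
    """List adapters at or above limit_c (a user-chosen, hypothetical limit)."""
    return sorted(name for name, t in temps.items() if t >= limit_c)


if __name__ == "__main__":
    temps = adapter_temperatures(SAMPLE)
    print(temps)
    print(hot_adapters(temps, limit_c=50))
```

Note that a card that hangs when queried, as mps2 did here, will never appear in the parsed result, so a missing adapter is itself a useful signal.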
 

ere109

This afternoon I shut down System 1, pulled the 9207-8e card, and removed its heatsink. The thermal compound was dry, and I had to "crack" the heatsink off. I spent half an hour cleaning both surfaces with rubbing alcohol, then applied fresh thermal paste and reassembled the card. I've been testing six drives concurrently with badblocks for 4 hours and the card is working well, currently running at 52 degrees Celsius. I call this solved.
 