LSI-9500 HBA resets and backplanes/drives disappear during high load/scrub

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703

kingtool

Dabbler
Joined
Sep 11, 2022
Messages
16
Not sure if it's interesting, but I have a resource available to log temperatures (disks and LSI HBA) to an influxdb (then paired with Grafana to chart)...


That may help you to keep a more consistent eye on the temps.
A note on this: you are using "storcli show all" when "storcli show temperature" is enough to get the ROC temp. "show all" can stress the disks if you have a lot of them on the controller, because it pulls detailed data for every disk in a seemingly synchronous way. If your command runs frequently enough, it can hit the goldilocks moment: drives have reads/writes in flight, the flurry of queries hits, a timeout occurs, and mpt3sas resets the controller. Your util doesn't seem to need anything besides "show temperature", so it might be an easy patch, and it would speed things up too.
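For anyone who wants to try the narrower query, here is a rough Python sketch of the idea, not a patch to the actual script. Several things in it are assumptions on my part: the binary name `storcli`, the `/c0` controller selector, and the output line format `ROC temperature(Degree Celsius) <value>`. The helper names `parse_roc_temp` and `read_roc_temp` are made up for illustration. Check what your storcli build really prints before relying on it.

```python
import re
import subprocess

# Assumed format of the temperature line; verify against your storcli build.
ROC_TEMP_RE = re.compile(
    r"ROC temperature\s*\(Degree Celsius\)\s*[:=]?\s*(\d+)", re.IGNORECASE
)


def parse_roc_temp(output):
    """Return the ROC temperature in degrees Celsius as an int, or None if absent."""
    match = ROC_TEMP_RE.search(output)
    return int(match.group(1)) if match else None


def read_roc_temp(storcli="storcli", controller="/c0", timeout=10):
    """Run only the temperature query instead of the much heavier 'show all'.

    The binary name and the controller selector are assumptions; adjust
    them for your system.
    """
    result = subprocess.run(
        [storcli, controller, "show", "temperature"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # a failed query simply yields no match below
    )
    return parse_roc_temp(result.stdout)
```

Calling `read_roc_temp()` from the logging loop returns an integer, or `None` when the line is not found, for example when the installed storcli does not support the temperature query.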

As I mentioned in a previous post in this thread, I went through *hell* troubleshooting this because it seemed like the very act of measuring the temps on my controllers was causing random resets.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
If you read the post directly before yours, I mention how to get the version of storcli that supports the "show temperature" switch. The version included in TrueNAS doesn't support it, hence my script uses what it has access to. It's easy enough for anybody who wants to change that line of the script and use the right version once they have worked that out.

The script allows (and actually defaults to) an opt-out of taking those temps.
 