LSI-9201-16e resetting under heavy loads

Gerk · Jan 14, 2017

Hi there, hoping someone can possibly point me in the right direction. Sorry in advance as this post is likely to be long as I've done a lot of things to try and chase these issues down so far. Here's my hardware setup:

Running 9.10 on a Dell r710 2x6 core Xeon X5650 @ 2.67GHz, w/96GB ram, LSI-9201-16e (flashed with 20.00.07.00 IT firmware), multiple Dell MD1000 enclosures, SATA drives (mostly WD Red). One pool has 18 x WD Red 4TB (3 x vdevs, each 6 drives in RaidZ2), other pool has 14 x WD Red 2TB drives (with at least one non WD drive, 2 x vdevs, 7 drives each RaidZ2). I have the vdevs spread out across 3 MD1000 units for both pools. Each MD1000 unit is running as a single controller (all 15 drives on one controller output -- not split).

The issue I've been having, which cropped up a couple of days ago (about 5 days after I expanded the 4TB based pool with another vdev), is that when it's under heavy load I get IOC Faults and the card resets. This causes all kinds of chaos as you might expect when things are re-detected if they do not get re-detected fast enough. So I've spent the last couple of days trying to chase down the root of the problem.

Code:

Jan 13 23:22:56 sinestro mps0: IOC Fault 0x40007e23, Resetting
Jan 13 23:22:56 sinestro mps0: Reinitializing controller,
Jan 13 23:22:56 sinestro mps0: Warning: io_cmds_active is out of sync - resynching to 0
Jan 13 23:22:56 sinestro mps0: Warning: io_cmds_active is out of sync - resynching to 0
(snip, removed a ton of the sync lines)
Jan 13 23:22:56 sinestro mps0: Firmware: 20.00.07.00, Driver: 21.00.00.00-fbsd
Jan 13 23:22:56 sinestro mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Jan 13 23:22:56 sinestro mps0: mps_reinit finished sc 0xfffffe0000efc000 post 4 free 3
Jan 13 23:22:58 sinestro mps0: SAS Address for SATA device = 496a4469f3b4df6d
Jan 13 23:22:58 sinestro mps0: SAS Address for SATA device = 497a5b5bdda2be80
(snip, lots more drives detect)

What I've done so far is:
- Changed to a different MD1000 for the new vdev I added (since I had not noticed this issue before adding the new vdev I thought that was a good place to start) and changed to a new/different cable
- Change to a completely different 9201-16e (I have two of them, one as a spare)
- Also changed the Intel X520 NIC that was in the box (one error I saw seemed to take out both PCIe cards when it stopped working so it was suspect) -- I had another spare unit for this as well
- Changed to a completely different server (I also have 2 r710s so I swapped the cards and boot devices to another box)
- I also tried doing the latest 9.10 updates (which resulted in kernel panics with my intel X520 nic so I reverted).
- re-flashed the card firmware
- checked smartctl on all the attached drives for potential issues, they are all fine

None of this has resolved the issue. I do not have any other cables to try and swap out or other enclosures to use, but I'm thinking at this point that there may be something else going on as aside from swapping out each and every drive I'm not sure what else could be going on at the hardware level.

My question is that I've read in a few places about P20 possibly having issues regarding exactly this and a lot of suggestions to use P19-IT firmware instead. But from everything I've read I should always be using the firmware that matches the FBSD driver (which is 21 currently). Is this something that I should try or am I asking for trouble reverting to P19 firmware.

I'm kind of at my wits end here. Any suggestions would be appreciated. So far I've managed to not destroy any data doing this testing but given the nature of it there's always the potential ... I of course have backups of the crucial stuff but not things like my huge media collection that I spent a long time ripping, etc (well I still do have most of the optical media that I could re-rip but that's a huge time commitment!) so am hoping to not kill my pools in the process.

Lastly I'm wondering if splitting up the load across multiple HBAs would help my situation. I don't have another spare x8 slot to put in my other 9201-16e but I do have an x4 slot and an LSI-9200-8e -- but again not sure if this is a load/bandwidth issue or something else going on ... any reports I've found seem to point at hardware issues but I think I've almost ruled them out at this point in my situation.

SweetAndLow · Jan 14, 2017

The firmware you have is what most people use. 20.00.02.00 was the buggy firmware I think. .04 and. 07 are safe.

I'm not sure what could be causing your issues, it seems like you have tried just about everything.

Sent from my Nexus 5X using Tapatalk

Ericloewe · Jan 14, 2017

Does the card get decent airflow?

Gerk · Jan 14, 2017

Shouldn't be any air flow problems from what I can tell. I can feel air movement at the back of the cards.

Gerk · Jan 16, 2017

I'm still trying to track this own (without killing my pools), does anyone know of a way I can get any extra debugging out of the drivers? There's literally just nothing in the logs at all before any of the IOC error messages and not much useful when it happens.

At this point I'm just not sure if it's a hardware issue -- which I think I've mostly ruled out but there's still potential for it I guess -- or a software issue. If it was a software issue you would think that I would not be alone having these problems. And it's very odd that it just started happening recently (as far as I could tell, I didn't really pay that close of attention before while things were scrubbing and I never got any notices of failures).

SweetAndLow · Jan 16, 2017

Some things I would try. Remove all disks and test one at a time using bad blocks. Try skipping the expander backplane. Might be a bad backplane. Remove the md1000 since it seems like the strange part of your system.

I'm not sure if any of these are possible to test but there are some ideas for you.

Sent from my Nexus 5X using Tapatalk

Gerk · Jan 16, 2017

Thanks for the ideas. Can't really skip the backplane but with the new cables I have coming I should be able to at least test things in smaller groups.

Gerk · Jan 16, 2017

And just for completeness sake here's a bug that perfectly describes what I'm seeing:

https://bugs.freenas.org/issues/11629

Gerk · Jan 17, 2017

As requested I've filed a new bug with more details on this one:

https://bugs.freenas.org/issues/20341

snaptec · Jan 17, 2017

I've read somewhere that the .07 isn't the perfect choice (don't ask me where..)
I've the same card running in one of my servers with 20.00.06.00 without problems. Maybe give it a try?

Gesendet von iPhone mit Tapatalk

Gerk · Jan 18, 2017

snaptec said:
I've read somewhere that the .07 isn't the perfect choice (don't ask me where..)
I've the same card running in one of my servers with 20.00.06.00 without problems. Maybe give it a try?

Gesendet von iPhone mit Tapatalk

That sounds like good advice. Here's the official word I just got from the bug report:

Updated by Alexander Motin about 3 hours ago

Status changed from Unscreened to 3rd party to resolve

I am sorry, but I don't see what we can do about this. This message: "mps0: IOC Fault 0x40007e23, Resetting" means that firmware of the HBA crashed and was restarted. We have no any relation to the firmware development, so the only thing I can recommend is contacting LSI support.

So at least I know what that error is now. And considering I'm having it on two cards in two different machines ... I'm inclined to believe it is indeed a firmware issue. I will try some downgrades and see how it goes.

Gerk · Jan 18, 2017

snaptec said:
I've read somewhere that the .07 isn't the perfect choice (don't ask me where..)
I've the same card running in one of my servers with 20.00.06.00 without problems. Maybe give it a try?

Gesendet von iPhone mit Tapatalk

Just downgraded to 20.00.06.00 and fired up a scrub, will see how things work out. Crossing my fingers.

Gerk · Jan 18, 2017

Had the exact same issues. I dug through the LSI/Avago/Broadcom archive and they don't even have any other versions of the 20.00.xx.00 firmware, only 06 and 07 and I've had the same issue with both now. I'm open to further suggestions if anyone has them. Is it possible that I have two bad cards that only crap out when they hit heavy loads??

bigphil · Jan 18, 2017

I'd try the following:
-flash to the latest P19 just to see if it works.
-try splitting the load across multiple HBA's.

Gerk · Jan 18, 2017

I'm planning on splitting up the load so I will start with that one. I'm very reluctant to go to P19 unless I absolutely have to. Have read too many situations about data loss with mismatched firmware/drivers.

snaptec · Jan 19, 2017

The load can't be the problem.
I've wired 16 drives to the same hba in 2x 8-wide z2 vdevs under heavy usage

Faulty mobo? Pcie Slot?

Gesendet von iPhone mit Tapatalk

Gerk · Jan 19, 2017

This is happening to me with two different cards, in two different servers (both Dell r710s). The only thing in common are the enclosures and drives. I tried it last night with the two cards installed and splitting the load across them and with brand new cables, and still ended up with a card reset.

At this point it's down to either drive(s) or enclosure(s), and it's going to be painful figuring that part out I think. I will try tonight by offlining one whole pool and then using my small pool (14 drives) to try and test with. At least that way I can limit things to a single enclosure at a time.

I wish I could go back in time and look closely at my logs before I added this latest vdev to the big pool to see if this has been happening before or not ...

Gerk · Jan 19, 2017

Or it is heat issues, I'm going to try out some more things tonight and see what happens.

Gerk · Jan 20, 2017

I rigged up a fan to go on top of the passive heatsink on the card and it's more stable. Still not perfect but only two little tiny blips on a scrub of a large pool. Running a scrub on the second pool right now. I think I'm going to try removing the heatsink on the second card and redoing the thermal paste and resetting it (it seems like it's not difficult to remove). Then will probably still re-attach the fan (because why not) and see if I can then get through a scrub with no resets ...

Gerk · Jan 22, 2017

So I'm crossing my fingers that I've gotten things sorted out here, reporting back for anyone like me in the future having an issue like this. It looks like the proper thermal paste has sorted out the issues, or at the very least I've now been able to run under very heavy load for > 24 hours with no hiccups so far. I in fact redid the thermal paste on both of my cards and they have both been running at heavy load for > 24 hours with no issues. The old thermal goo was in terrible shape (on one it was rock hard and crusty on the other it was very very gummy). Either way with the new thermal paste things have been running well so far. I still have the improvised heatsink on my main setup, but not sure it would be needed now. But I'm not messing with it now that things seem happy and stable again.

Important Announcement for the TrueNAS Community.

LSI-9201-16e resetting under heavy loads

Explorer

Sweet'NASty

Server Wrangler

Explorer

Explorer

Sweet'NASty

Explorer

Explorer

Explorer

Guru

Explorer

Explorer

Explorer

Patron

Explorer

Guru

Explorer

Explorer

Explorer

Explorer

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "LSI-9201-16e resetting under heavy loads"

Similar threads