Hi there, hoping someone can point me in the right direction. Apologies in advance, as this post is likely to be long -- I've already done a lot of things to try to chase these issues down. Here's my hardware setup:
Running 9.10 on a Dell R710, 2 x six-core Xeon X5650 @ 2.67GHz, 96GB RAM, LSI 9201-16e (flashed with 20.00.07.00 IT firmware), multiple Dell MD1000 enclosures, SATA drives (mostly WD Red). One pool has 18 x WD Red 4TB (3 vdevs, each 6 drives in RAIDZ2); the other pool has 14 x WD Red 2TB (2 vdevs, 7 drives each in RAIDZ2, with at least one non-WD drive). The vdevs for both pools are spread across 3 MD1000 units. Each MD1000 runs as a single controller (all 15 drives on one controller output -- not split).
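To be concrete, the 4TB pool's layout is equivalent to something like this (a sketch only -- the pool name and da device numbers are invented, not my actual devices):
Code:
# hypothetical example -- pool name and device numbers are made up
zpool create tank \
  raidz2 da0  da1  da2  da3  da4  da5 \
  raidz2 da6  da7  da8  da9  da10 da11 \
  raidz2 da12 da13 da14 da15 da16 da17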
The issue, which cropped up a couple of days ago (about 5 days after I expanded the 4TB pool with another vdev), is that under heavy load I get IOC Faults and the card resets. This causes all kinds of chaos, as you might expect, when everything gets re-detected -- especially if the drives don't come back fast enough. So I've spent the last couple of days trying to chase down the root of the problem.
Code:
Jan 13 23:22:56 sinestro mps0: IOC Fault 0x40007e23, Resetting
Jan 13 23:22:56 sinestro mps0: Reinitializing controller,
Jan 13 23:22:56 sinestro mps0: Warning: io_cmds_active is out of sync - resynching to 0
Jan 13 23:22:56 sinestro mps0: Warning: io_cmds_active is out of sync - resynching to 0
(snip, removed a ton of the sync lines)
Jan 13 23:22:56 sinestro mps0: Firmware: 20.00.07.00, Driver: 21.00.00.00-fbsd
Jan 13 23:22:56 sinestro mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Jan 13 23:22:56 sinestro mps0: mps_reinit finished sc 0xfffffe0000efc000 post 4 free 3
Jan 13 23:22:58 sinestro mps0: SAS Address for SATA device = 496a4469f3b4df6d
Jan 13 23:22:58 sinestro mps0: SAS Address for SATA device = 497a5b5bdda2be80
(snip, lots more drives detect)
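You can see the firmware/driver pairing right in that log. For anyone who wants to check their own, this is a quick way to confirm it (the controller index may differ on your box):
Code:
# show firmware/BIOS versions for every attached LSI SAS2 controller
sas2flash -listall
# what the FreeBSD mps driver reported at attach time
dmesg | grep -i mps0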
What I've done so far:
- Changed to a different MD1000 for the new vdev I added (since I hadn't noticed this issue before adding the new vdev, that seemed like a good place to start), and swapped in a new/different cable
- Changed to a completely different 9201-16e (I have two of them, one as a spare)
- Changed the Intel X520 NIC that was in the box (one error I saw seemed to take out both PCIe cards when it stopped working, so it was suspect) -- I had a spare unit for this as well
- Changed to a completely different server (I have two R710s, so I swapped the cards and boot devices into the other box)
- Tried the latest 9.10 updates (which resulted in kernel panics with my Intel X520 NIC, so I reverted)
- Re-flashed the card firmware
- Checked smartctl on all the attached drives for potential issues; they are all fine (a quick loop for this is sketched right after this list)
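In case it's useful, here's roughly how I check them all in one pass (a minimal sketch, assuming the disks show up via kern.disks; adjust the egrep pattern to taste):
Code:
# loop over every disk and pull the SMART health verdict plus the
# attributes that usually flag a failing drive or a bad link
for d in $(sysctl -n kern.disks); do
  echo "== $d =="
  smartctl -H -A /dev/$d | egrep "overall-health|Reallocated|Pending|CRC"
done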
None of this has resolved the issue. I don't have any other cables to swap in or other enclosures to try, but at this point I'm thinking there may be something else going on; aside from swapping out each and every drive, I'm not sure what else it could be at the hardware level.
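One more thing that might be worth checking before swapping drives wholesale: the SATA phy event counters, which tend to climb on marginal cables or backplane links rather than on bad drives (a sketch; not every drive reports these):
Code:
# rising CRC / comm-wake counts here usually mean a link problem
# (cable, backplane, expander), not a failing disk
for d in $(sysctl -n kern.disks); do
  echo "== $d =="
  smartctl -l sataphy /dev/$d
done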
My actual question: I've read in a few places that P20 may have issues with exactly this, along with a lot of suggestions to use the P19 IT firmware instead. But everything else I've read says I should always run the firmware that matches the FreeBSD driver (which is currently 21). Is reverting to P19 something I should try, or am I asking for trouble?
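If I did try it, my understanding is the downgrade itself would just be a normal sas2flash run (sketch only -- the filename is an example; it would be the P19 IT image for the 9201-16e, and obviously don't power-cycle mid-flash):
Code:
# -o enables advanced mode, which is needed to write an older revision
sas2flash -o -f 9201-16e_P19_IT.bin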
I'm kind of at my wits' end here; any suggestions would be appreciated. So far I've managed not to destroy any data doing this testing, but given the nature of it there's always the potential... I do have backups of the crucial stuff, but not of things like my huge media collection that I spent a long time ripping (I still have most of the optical media and could re-rip, but that's a huge time commitment!), so I'm hoping not to kill my pools in the process.
Lastly, I'm wondering if splitting the load across multiple HBAs would help. I don't have another spare x8 slot for my second 9201-16e, but I do have an x4 slot and an LSI 9200-8e. Then again, I'm not sure this is a load/bandwidth issue at all; every report I've found points at hardware, and I think I've nearly ruled that out in my case.
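For what it's worth, some back-of-the-envelope math (assuming the MD1000 EMMs link at 3Gb/s SAS and the 9201-16e is in a PCIe 2.0 x8 slot) suggests raw bandwidth shouldn't be the bottleneck:
Code:
# per enclosure: one SFF-8088 wide port = 4 lanes x 3 Gb/s = 12 Gb/s ~ 1.2 GB/s
# three enclosures:                       3 x 1.2 GB/s            ~ 3.6 GB/s
# PCIe 2.0 x8 slot:                       8 lanes x ~500 MB/s     ~ 4.0 GB/s
# -> one x8 HBA can roughly absorb all three shelves at full tilt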