io_cmds_active is out of sync

Fireball81 · Oct 26, 2017

Hey guys,

I recently purchased an IBM M1015 HBA and flashed it to an LSI 9211-8i (P20 IT firmware without mptsas2.rom, i.e. no BIOS) to expand our current backup server.
I installed the HBA in the PCIe x8 slot of my Supermicro X11SSL-CF.
Now I have 7 disks attached to the embedded LSI/Broadcom 3008 and another 7 disks attached to the LSI 9211-8i (the M1015).
All disks are 8TB Seagate IronWolf drives (ST8000VN0022).

Since I upgraded the system with the M1015 and the additional 7 disks, I occasionally get the following error.

This cycle repeats itself over and over again (io_cmds_active is out of sync -> SAS Address for SATA device -> io_cmds_active ...).
Can you guys point me in the right direction here? In my research I found a few mailing list discussions and a bug report that seem to relate this problem to a firmware bug.
Does that make sense to you? I am currently on P20 and can easily go back to P19 if that's reasonable.

Thank you for your help.
Dennis

Ericloewe · Oct 26, 2017

First time I've seen this.

Fireball81 said:
I can easily go back to P19 if that's reasonable.

No, don't do that.

You could try 11.1 BETA and see if the issue persists there. I think there were some LSI SAS changes.

cyberjock · Oct 26, 2017

I had to do an RMA for someone with this error earlier this year. The bottom line is that you are looking at what is likely instability between the LSI controller and the disks. So the LSI controller, the disks, and the associated cabling, backplanes, etc. are all suspect. There's a possibility, but it is very unlikely, that the power to one or more of the disks is fluctuating, causing disk instability. But I wouldn't expect that, as there are typically other indicators that result from power supply problems.

Yeah, I know you probably don't want to hear that as that leaves a lot of hardware open, but you'll need to start narrowing down the problem more on your own.

I wouldn't expect much of anything to ever change with the 6Gb SAS stuff as LSI has basically discontinued development of the firmware and drivers. They actually stopped work on the 6Gb stuff about a year or so ago.

HTH.

Fireball81 · Oct 27, 2017

Thanks for all of your replies.
Sounds like quite a bit of work, and I am currently thinking about the best way to narrow down the problem.

I found these, which look similar to the problems I have:
https://lists.freebsd.org/pipermail/freebsd-scsi/2014-October/006506.html
https://bugs.freenas.org/issues/11629

It seems to indicate a firmware problem, and reverting to P19 would be something I could do without any further trouble, but you guys wouldn't recommend that, right?
I assume a general incompatibility between the M1015 and the ST8000VN0022 is probably out of the question, because otherwise this would be a frequently occurring issue here on the forums, I guess.

I want to avoid replacing hardware and spending money just based on a hunch. I will first try to check the SMART values of every single drive that is attached to the M1015 and see if I find anything out of the ordinary there. After that I could disconnect one of the SAS ports and boot the system with only 4 drives instead of 7.
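
For checking the SMART values of many drives, it may help to pull the attribute table into something comparable instead of reading 7 or 14 reports by eye. The following is only a rough sketch: it assumes the usual column layout of `smartctl -A` output, and the `SAMPLE` text, its numbers, and the helper names `parse_attributes` and `read_device` are made up for illustration, not taken from the drives in this thread.

```python
import subprocess

# Hypothetical sample of `smartctl -A` attribute-table output; the numbers are
# invented for illustration and do not come from any drive in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       200000000
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return {attribute_name: (value, worst, thresh, raw)} from smartctl -A text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # Skip rows whose normalized columns are not plain numbers.
        if not all(f.isdigit() for f in fields[3:6]):
            continue
        value, worst, thresh = (int(f) for f in fields[3:6])
        # Keep the raw column as a string; only its first token is taken.
        attrs[fields[1]] = (value, worst, thresh, fields[9])
    return attrs


def read_device(dev):
    """Run smartctl -A on one device (needs smartmontools and root privileges)."""
    result = subprocess.run(["smartctl", "-A", dev], capture_output=True, text=True)
    return parse_attributes(result.stdout)


if __name__ == "__main__":
    for name, (value, worst, thresh, raw) in parse_attributes(SAMPLE).items():
        print(f"{name:24s} value={value} worst={worst} thresh={thresh} raw={raw}")
```

On a real system, `read_device` would be called once per disk (the device names depend on the system), and running it twice a few minutes apart shows which raw values are actually moving.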

Is it possible to disconnect the 7 hard drives from the LSI/Broadcom 3008 on my Supermicro motherboard and reconnect them to the M1015 without running into trouble?
This would help a lot, because I know that the 7 drives attached to the 3008 were working fine before, and if they don't when I connect them to the M1015, that might indicate an HBA/cabling problem.

Of course, I want to avoid corrupting any data on my pool while trying to solve the problem.

Fireball81 · Oct 27, 2017

OK, I found some time to dig a little deeper and I might have found something that worries me a little (to say the least).
Take a look at the following picture, which I took as an example via smartctl -a /dev/dax

You can easily see the very high raw value of the parameter Raw_Read_Error_Rate.
This looks pretty similar on every hard drive (all 14 of them), and since the raw value is already so large, I checked it a couple of times and found that it is increasing minute by minute. That also applies to every single hard drive.
I might be wrong here, but given that 7 drives are attached to the embedded LSI 3008 and the other 7 to the M1015, I don't think the controller or cabling is responsible for this. Maybe it's the backplane of my X-Case that doesn't like the 8TB Seagates?
What do you think about this?

Ericloewe · Oct 27, 2017

Raw read errors are internal to the drive and have nothing to do with the interface.

Fireball81 · Oct 27, 2017

Isn't that a little bit suspicious though? I mean, they occur on every single drive in an amount comparable to what I showed you in the picture.
It can't be the case that every single drive has an issue of its own, right?
Maybe that's not related to my problem at all, but it surely is something that makes me worry even more.
I don't know, could what smartctl is reporting be wrong, or do I not need to worry about it at all?

edit:
It appears I was on the wrong track here. I found this in the meantime.
https://wiki.lime-technology.com/Understanding_SMART_Reports

The interesting bit is:

PLEASE completely ignore the RAW_VALUE number! Only Seagates report the raw value, which yes, does appear to be the number of raw read errors, but should be ignored, completely. All other drives have raw read errors too, but do not report them, leaving this value as zero only.
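
One way to look at such a raw value more closely is to split it into its two halves. This is only a sketch under an assumption that is widely repeated in the community but not confirmed by Seagate or by anyone in this thread: that the 48-bit raw value packs an error count into the upper 16 bits and an operation counter into the lower 32 bits. The function name `split_seagate_raw` is hypothetical.

```python
def split_seagate_raw(raw):
    """Split a 48-bit raw value into (errors, operations).

    ASSUMPTION: upper 16 bits = error count, lower 32 bits = operation
    counter. This is an unofficial interpretation, not a documented format.
    """
    raw = int(raw)
    errors = (raw >> 32) & 0xFFFF      # upper 16 bits
    operations = raw & 0xFFFFFFFF      # lower 32 bits
    return errors, operations


if __name__ == "__main__":
    # Made-up value for illustration, not a reading from this thread.
    errors, operations = split_seagate_raw(200000000)
    print(f"errors={errors} operations={operations}")
```

If that assumption holds, a large raw value that keeps climbing while the upper half stays at zero would just be the operation counter ticking, which is worth checking before drawing conclusions either way.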

Ericloewe · Oct 27, 2017

It could be caused by vibration or something like that, but it's definitely not an interface issue.

Fireball81 said:
The interesting bit is:

That's not quite true, either. Higher is worse on WD drives and zero is good. It's highly correlated with other failures.

Fireball81 · Oct 28, 2017

OK, so do you think the high readings of Raw_Read_Error_Rate, which are increasing minute by minute, are something I need to be concerned about, and do you think this could be related to the "io_cmds_active is out of sync" issue I told you about at the beginning?

I was thinking of buying two SFF-8087 to 4x SATA cables to connect the hard drives directly, and seeing if that changes anything.
Other than that, I could swap the M1015 HBA for another one from a friend. The error is difficult to reproduce because it doesn't show up that often.

Ericloewe · Oct 28, 2017

Fireball81 said:
OK, so do you think the high readings of Raw_Read_Error_Rate, which are increasing minute by minute, are something I need to be concerned about

Definitely.

Fireball81 said:
do you think this could be related to the "io_cmds_active is out of sync" issue I told you about at the beginning?

It's vaguely possible, at least.

Fireball81 said:
the error is difficult to reproduce because it doesn't show up that often.

Those are the worst errors.
