CAM status: SCSI status error - what does it mean?

JohnDigital · Jan 25, 2017

I also found this, where it's basically stated that no, you can't use a 9211 on that model Dell. They claim only the H200 works. It's HERE; I'm sure you have seen it.

Have you fired up all the disks without the backplane to see how it goes yet? The backplane is the biggest variable for me now. Can you use the backplane to power all the disks in place and run SFF-8087s right off the HBA's/MB SATA ports to see if the errors continue?

BetYourBottom · Jan 26, 2017

John Digital said:
So my next question is how much time passes before you consider it "working" or "not working", and how are you determining this? Badblocks or dd reads? Could it be that it is always doing it, just more (not working) or less (working) frequently? What are the constants, and what are the variables? It seems you have tried different variables; now try changing some constants (if you can): boards, power supplies, RAM, etc. Maybe try booting into a Linux environment and seeing if the errors persist. Not sure what else to add here. Some unseen compatibility issues, maybe? It's a mystery when these show up sometimes.

I'm pulling for you. You will figure it out eventually, and when you do, please let us all know.

I am testing primarily with dd because it's simple and fast to use. The results are virtually instant in all cases. The instant I start a dd write (via SSH), command errors spew out on the FreeNAS screen (video output). Badblocks is how I initially noticed this problem, as many drives were going excessively slowly.
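For anyone wanting to reproduce this kind of test, here is a minimal sketch of a dd write test. The function name `write_zeros`, the 1 MiB block size, and the count are all assumptions; the exact dd settings used in this thread were not posted. Writing to a raw disk such as /dev/daX destroys whatever is on it, so only point this at a drive you are willing to wipe.

```shell
#!/bin/sh
# Hypothetical helper: write COUNT_MB mebibytes of zeroes to TARGET with dd.
# WARNING: if TARGET is a raw disk (e.g. /dev/da3), its contents are destroyed.
# The block size is given in bytes so it works with both BSD and GNU dd.
write_zeros() {
    target=$1
    count_mb=$2
    dd if=/dev/zero of="$target" bs=1048576 count="$count_mb"
}

# Example (destructive, assumed device name):
#   write_zeros /dev/da3 1024
```

While the write runs, watching the console or `tail -f /var/log/messages` in another session should show whether CAM errors appear.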

I seriously doubt it's doing it at all when connected through the backplane to the motherboard. I ran a full badblocks scan of 4 drives recently and it seemed to go full speed the whole time. While I can't confirm there were no errors (I didn't think to check since they were working at a rate that seemed full speed), I seriously doubt it was still erroring.

I have tried running the same test on a Debian Live CD and saw similar issues, except Debian doesn't outright say what FreeNAS does. I'll post an image of the output of one of the log files below (I don't remember the specific one m0nkey_ told me to check).

John Digital said:
I also found this, where it's basically stated that no, you can't use a 9211 on that model Dell. They claim only the H200 works. It's HERE; I'm sure you have seen it.

Have you fired up all the disks without the backplane to see how it goes yet? The backplane is the biggest variable for me now. Can you use the backplane to power all the disks in place and run SFF-8087s right off the HBA's/MB SATA ports to see if the errors continue?

That is talking about the expander backplane so I'm unsure if it still applies. Also the creator of the Ode to C2100 thread has his working with H700s and crossflashed mezzanine H200s like I'm using.

There isn't a way to use the backplane to power the disks while connecting the drives' data lines some other way.

There aren't any errors when I use extra power lines running off the PSU to power 2 drives and connect the drives to the H200 via reverse breakout cable.

Also my seller confirmed that the replacement backplane they sent me was tested before sending it and based on how I asked, it should have been tested with a 9211.

I'm not opposed to this all being some really strange compatibility issue, but it feels so strange to get this particular error across multiple devices and cables. I'm starting to wonder if the issue lies somehow in the motherboard or something weird, since I have had some weird incompatibility with my DDR3L RAM on only a specific processor and only when the RAM was running at low voltage instead of the standard 1.5V.

Thank you for all the suggestions. I hope you don't feel discouraged by me shooting any of them down. I really appreciate the help and support.

Bezerker · Jan 31, 2017

I just want to comment here to let you know I am having similar issues on slightly different hardware, with similar errors. I'll be opening up my own post regarding it, but I'm at my wits' end just like yourself. Hopefully, we can figure it out. :)

BetYourBottom · Feb 1, 2017

Bezerker said:
I just want to comment here to let you know I am having similar issues on slightly different hardware, with similar errors. I'll be opening up my own post regarding it, but I'm at my wits' end just like yourself. Hopefully, we can figure it out. :)

This has just gotten a heck of a lot more like your problem. I just remembered a trashy old 2.5" drive I had stored in a drawer that I decided to randomly throw into my server. Guess what... of course it worked without a single SCSI error.

Sure it's got a couple hundred unreadable sectors but not a single SCSI read or write error when connected just like all of my brand new WD Reds.

So that leaves only the WD Reds as the problem children. I have a feeling that I'm going to be talking with WD support a lot for the next couple days.

Weirdest part is that the drives have tested good on SATA connections. Not a bad sector or SMART error in sight.

Bezerker · Feb 1, 2017

In regards to WD Reds, have you tested disabling the parking that they do?

See https://forums.freenas.org/index.php?threads/hacking-wd-greens-and-reds-with-wdidle3-exe.18171/

My disks are older WD Blacks, but I read that older Blacks also use IntelliPark, so I'm first testing setting APM to 254 in FreeNAS to see if that makes any difference, then will test wdidle3 later (my IPMI sucks on this mobo, so mounting a DOS ISO is going to be difficult).

I wonder if the internal power-saving crap is somehow triggering when it shouldn't.

Will try wdidle3 later, but tested just setting APM to 254... didn't fix it or make any difference.

BetYourBottom · Feb 3, 2017

Bezerker said:
Will try wdidle3 later, but tested just setting APM to 254... didn't fix it or make any difference.

According to

Code:

camcontrol identify daX

I don't seem to have APM support on these drives anyways.

I'm going to test wdidle3 and the more updated version in a little bit and update with the results.
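The APM check mentioned above can be scripted, roughly like this. It is only a sketch: the helper name `apm_support` is made up, and it assumes the feature table printed by `camcontrol identify` contains a line starting with "advanced power management" followed by a Support column. Check the output on your own system before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: read `camcontrol identify` output on stdin and print
# the Support column of the APM line ("yes" or "no"), or "unknown" if no
# such line is found. The line format is an assumption, see above.
apm_support() {
    awk '/^advanced power management/ { print $4; found = 1 }
         END { if (!found) print "unknown" }'
}

# Example (assumed device name):
#   camcontrol identify da0 | apm_support
```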

Bezerker · Feb 3, 2017

I have a new controller + shorter cable arriving today, so I'll be trying those. At this point, I'm starting to be at a loss.

BetYourBottom · Feb 4, 2017

I've called WD customer support, but their level 1 techs weren't familiar with SAS at all and couldn't help me. The issue has been escalated and I should hopefully get to talk with a level 2 tech on Monday.

I'll provide updates and whatever information I gather from the call then.

BetYourBottom · Feb 9, 2017

I talked to the level 2 techs. I gave them all the information I had and they couldn't dig anything up.

I'm going to see if I can RMA them for drives from HGST (since they are part of WD), since I have tested and know that HGST drives work.

BetYourBottom · Feb 10, 2017

Welp, I double checked the HGST drive and it has the errors as well. Looks like I'm down to Seagate, or maybe 2TB drives will work. Either way, I'm losing my mind.

BetYourBottom · Feb 12, 2017

I had a crazy idea that I tried out a night ago and it seems to have resolved the issue. I put jumpers on the drives to set them to PHY mode so that they run as SATA-II devices.

What I noticed is that when I ran the drives in the bays connected directly to the motherboard SATA ports, they were forced to run as SATA-II and worked fine. Then I noticed that the other drives, which were older of course, worked fine on the SAS ports but were also running as SATA-II. I realized that this might be a correlation I had overlooked and decided to test with jumpers. Using the jumpers seemed to immediately remove any errors: if I took a drive that was erroring, the jumpers would make the errors go away, and removing the jumpers immediately brought the errors back. I tested on one drive and wrote 100GB of zeroes using dd at settings that would previously cause errors to show up immediately. Once I bought enough jumpers for all the drives, I tested each of them individually with 1GB of zeroes, in a way that still consistently led to errors without the jumpers. Not a single error appeared.

I think what may have been happening is that while the HBA is new and supports SATA-III, the backplane doesn't. Since the backplane is 1:1, it's stupid and doesn't report that it is only compatible with SATA-II speeds. So every drive that can negotiates a higher speed than the backplane supports, and suddenly there are errors galore when the backplane can't properly pass the messages along.
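One way to check which link speed each drive ended up at is to look at the boot messages. The sketch below assumes FreeBSD-style dmesg lines of the form "da3: 600.000MB/s transfers" (roughly 300 for SATA-II, 600 for SATA-III); the function name `link_speeds` is made up, and the exact line format may differ between versions and drivers.

```shell
#!/bin/sh
# Hypothetical helper: read dmesg-style text on stdin and print
# "<device> <MB/s>" for every "<dev>: <rate>MB/s transfers" line.
# The line format is an assumption based on typical FreeBSD boot output.
link_speeds() {
    sed -n 's/^\([a-z]*[0-9]*\): \([0-9.]*\)MB\/s transfers.*/\1 \2/p'
}

# Example:
#   dmesg | link_speeds
```

A drive with the jumper fitted should then report the lower rate, while an unjumpered SATA-III drive on the same port would report the higher one.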

Bezerker · Feb 14, 2017

Heyas,

Strangely, my issues are the opposite: my SATA III devices work better than my older SATA II ones.

That said, I suspected queueing was the issue. So far, I appear to be right.

If I disable NCQ with camcontrol (camcontrol tags device -N1) on my WD 750s, I can use all 8 in a pool without issue. Of course, this has a performance impact.

If I run 4 of them at a time with NCQ turned on, they also function correctly.

Alternatively, I split my 8 disk raidz2 into two 4 disk raidz1s for my 5400 rpm 4TBs, no issues so far.
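As a rough sketch, the camcontrol approach above can be applied to several disks in one go. The function name `limit_tags`, the `DRY_RUN` switch, and the device names are assumptions for illustration; `camcontrol tags <dev> -N 1` limits the device to one outstanding tagged command, which effectively turns queueing off for it.

```shell
#!/bin/sh
# Hypothetical helper: limit each listed device to a single queued command.
# With DRY_RUN=1 the commands are only printed, not executed.
limit_tags() {
    for dev in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "camcontrol tags $dev -N 1"
        else
            camcontrol tags "$dev" -N 1
        fi
    done
}

# Example (assumed device names):
#   DRY_RUN=1 limit_tags da0 da1 da2 da3
```

Running it with DRY_RUN=1 first is a cheap way to check the device list before changing anything.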

Bezerker · Feb 14, 2017

Confirmed: I found the command I needed to disable NCQ (and subsequently TCQ) at the driver level. Setting kern.cam.sort_io_queues to 0 solved my issue. My expander sucks at SATA queueing. Confirmed with Supermicro. Test whether it solves yours, as our errors are similar even though our backplanes are different.

Edit: setting sort_io_queues doesn't actually fix it. I'm trying to find a kernel parameter that does. The camcontrol settings do, but those don't stick.

It's worth noting that I can confirm for sure that SSDs do not have this issue, so it's somehow related.

paradoxiom · Mar 6, 2017

BetYourBottom said:
I had a crazy idea that I tried out a night ago and it seems to have resolved the issue. I put jumpers on the drives to set them to PHY mode so that they run as SATA-II devices.

What I noticed is that when I ran the drives in the bays connected directly to the motherboard SATA ports, they were forced to run as SATA-II and worked fine. Then I noticed that the other drives, which were older of course, worked fine on the SAS ports but were also running as SATA-II. I realized that this might be a correlation I had overlooked and decided to test with jumpers. Using the jumpers seemed to immediately remove any errors: if I took a drive that was erroring, the jumpers would make the errors go away, and removing the jumpers immediately brought the errors back. I tested on one drive and wrote 100GB of zeroes using dd at settings that would previously cause errors to show up immediately. Once I bought enough jumpers for all the drives, I tested each of them individually with 1GB of zeroes, in a way that still consistently led to errors without the jumpers. Not a single error appeared.

I think what may have been happening is that while the HBA is new and supports SATA-III, the backplane doesn't. Since the backplane is 1:1, it's stupid and doesn't report that it is only compatible with SATA-II speeds. So every drive that can negotiates a higher speed than the backplane supports, and suddenly there are errors galore when the backplane can't properly pass the messages along.

Could this be why my log fills up with:

Code:

Mar  6 14:57:07 OrbitalHub (da3:mps0:0:13:0): Retrying command (per sense data)
Mar  6 14:57:08 OrbitalHub	 (da3:mps0:0:13:0): WRITE(10). CDB: 2a 00 05 02 9d 00 00 01 00 00 length 131072 SMID 792 terminated ioc 804b scsi 0 state c xfer 0
Mar  6 14:57:08 OrbitalHub (da3:mps0:0:13:0): WRITE(10). CDB: 2a 00 05 02 9d 00 00 01 00 00 
Mar  6 14:57:08 OrbitalHub (da3:mps0:0:13:0): CAM status: CCB request completed with an error
Mar  6 14:57:08 OrbitalHub (da3:mps0:0:13:0): Retrying command
Mar  6 14:57:08 OrbitalHub (da3:mps0:0:13:0): WRITE(10). CDB: 2a 00 05 02 9d 00 00 01 00 00 
Mar  6 14:57:08 OrbitalHub (da3:mps0:0:13:0): CAM status: SCSI Status Error
Mar  6 14:57:08 OrbitalHub (da3:mps0:0:13:0): SCSI status: Check Condition
Mar  6 14:57:08 OrbitalHub (da3:mps0:0:13:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)

when using a 'Cremax ICY DOCK MB454SPF-B 4 Hot Swap SATA 2 HDD in 3 X 5.25 Drive Bay'?
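To see which disks are affected and how often, the log can be summarized per device. This is a minimal sketch that assumes the line format in the excerpt above, with the device in parentheses such as "(da3:mps0:0:13:0)"; the function name `count_cam_errors` is made up.

```shell
#!/bin/sh
# Hypothetical helper: read syslog-style text on stdin and print
# "<device> <count>" for lines containing "CAM status: SCSI Status Error".
# Assumes the device appears as "(daN:controller:...)" as in the excerpt.
count_cam_errors() {
    grep 'CAM status: SCSI Status Error' |
        sed -n 's/.*(\([a-z]*[0-9]*\):[^)]*).*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example:
#   count_cam_errors < /var/log/messages
```

If only the drives behind the dock show up in the summary, that would point at the enclosure rather than the HBA or the disks themselves.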
