CAM Status : SCSI Status Error

Cougar014 · Jun 4, 2017

Hello everyone,

For quite some time I have been pestered by the daily security run output email, which contains the following message:
(da1:mpr0:0:6:0): WRITE(10). CDB: 2a 00 02 4c ca e0 00 00 18 00
(da1:mpr0:0:6:0): CAM status: SCSI Status Error
(da1:mpr0:0:6:0): SCSI status: Check Condition
(da1:mpr0:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mpr0:0:6:0): Retrying command (per sense data)
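
For readers wondering what the log actually says: the sense text means the drive reported a power-on or reset since the previous command, and the last line shows the kernel retrying the write. The CDB bytes can be decoded by hand. The sketch below is only an illustration, assuming the standard 10-byte WRITE(10) layout (opcode 0x2a, bytes 2-5 the starting LBA, bytes 7-8 the transfer length in blocks) and 512-byte logical sectors; the helper name `decode_write10` is made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Write10:
    lba: int     # starting logical block address
    blocks: int  # transfer length in logical blocks


def decode_write10(cdb_hex: str) -> Write10:
    """Decode a WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: LBA
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return Write10(lba, blocks)


cmd = decode_write10("2a 00 02 4c ca e0 00 00 18 00")
# Assumption: 512-byte logical sectors.
print(f"LBA {cmd.lba}, {cmd.blocks} blocks, {cmd.blocks * 512} bytes")
```

For the CDB above this gives a write of 24 blocks, so it is an ordinary small write that got interrupted and retried, not a report of bad data on the disk.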

Sometimes I receive this email 3 days in a row, sometimes I don't receive it for several weeks.
But it is annoying me big time because I don't know what it actually means.

I have a Supermicro X11SSL-CF motherboard with an onboard LSI 3008 HBA in IT mode, connected to two 3TB WD Reds in a RAID 1 setup. One drive is called da0 and the other drive is da1.
The number of errors I receive from both drives is about the same. Sometimes they come from da0, other times from da1.

Also, I am running mpr0 Firmware 12.00.02.00 (p12) with Driver 15.01.00.00 (p15) instead of matching firmware and driver versions.
But that is due to this issue:
https://forums.freenas.org/index.php?threads/lsi-3008-it-firmware-mismatch.49587/

I have asked this question several times before and got several replies, ranging from:
1) "update FreeNAS to 9.10-U3" - which I did. But it didn't fix it.
2) "It is a faulty SATA cable, switch it to another connector" - I use a mini-SAS to 4x SATA cable. So I did, but it didn't fix it.

Can someone tell me:
Is this just a random read error which occurs sometimes or is this something serious?
Even though I have had this issue for months, I am not noticing any performance problems or missing files.

Thanks in advance

PS:
I am not very familiar with FreeNAS, so if you want the output of a command or something, please post the command in your reply and I will run it.

Cougar014 · Jun 7, 2017

Today I received this mail AGAIN!

Is there really no one who has any knowledge of this??

I could really use the help.

Thanks!

Ericloewe · Jun 7, 2017

How long is the cable?

Cougar014 · Jun 7, 2017

The cable is 0.6m long, I believe.

Stux · Jun 8, 2017

I've seen posts which indicate that SAS3 is more susceptible to cabling issues.

I'd suggest trying a different brand of cable.

Cougar014 · Jun 8, 2017

Ok thanks, I will give that a try.
Currently I am using a Broadcom SFF8643 0.6m 4x SATA cable.
https://www.alternate.nl/Broadcom/SFF8643-4x-SATA-0-6m-Kabel/html/product/1264312?

Do you have a suggestion for a good brand for this kind of cable?

Also, I wonder whether the firmware could be the issue here. Do you have any idea about that?

Thanks for your help!

Cougar014 · Jun 30, 2017

Okay,

I thought I had fixed it, but apparently I didn't.

I replaced the SAS->SATA cable, swapping the Broadcom one for a brand new Inter-Tech one.
The error disappeared after I replaced it two weeks ago.

But this morning the same error was back.
The only difference is that it went from an almost daily error to one that shows up about once every two weeks (for now). Small note: with the older cable it started as a once-a-week error and slowly grew into an almost daily error.

Here is the error again:
freenas.local kernel log messages:
(da0:mpr0:0:4:0): WRITE(10). CDB: 2a 00 11 45 0f 98 00 01 00 00 length 131072 SMID 612 terminated ioc 804b scsi 0 state c xfer 0
(da0:mpr0:0:4:0): WRITE(10). CDB: 2a 00 11 45 0f 98 00 01 00 00
(da0:mpr0:0:4:0): CAM status: CCB request completed with an error
(da0:mpr0:0:4:0): Retrying command
(da0:mpr0:0:4:0): WRITE(10). CDB: 2a 00 11 45 0f 98 00 01 00 00
(da0:mpr0:0:4:0): CAM status: SCSI Status Error
(da0:mpr0:0:4:0): SCSI status: Check Condition
(da0:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da0:mpr0:0:4:0): Retrying command (per sense data)

-- End of security output --
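
To compare how often each device shows up in these mails, the kernel lines can be tallied per device. This is a rough sketch, assuming every relevant line starts with the `(daN:mprN:bus:target:lun):` prefix seen above; the function name `count_cam_errors` and the sample text (stitched together from lines in this thread) are only for illustration. Keep in mind that names like da0 and da1 are not tied to a physical drive, so counts per name only help when paired with serial numbers.

```python
import re
from collections import Counter

# Matches the "(da0:mpr0:0:4:0): message" prefix used in the kernel log.
LINE_RE = re.compile(
    r"^\((?P<dev>[a-z]+\d+):(?P<ctrl>[a-z]+\d+):\d+:\d+:\d+\): (?P<msg>.*)$"
)


def count_cam_errors(log_text: str) -> Counter:
    """Count 'CAM status:' lines per device name (da0, da1, ...)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group("msg").startswith("CAM status:"):
            counts[m.group("dev")] += 1
    return counts


sample = """\
(da0:mpr0:0:4:0): WRITE(10). CDB: 2a 00 11 45 0f 98 00 01 00 00
(da0:mpr0:0:4:0): CAM status: CCB request completed with an error
(da0:mpr0:0:4:0): Retrying command
(da0:mpr0:0:4:0): CAM status: SCSI Status Error
(da1:mpr0:0:6:0): CAM status: SCSI Status Error
"""
for dev, n in sorted(count_cam_errors(sample).items()):
    print(dev, n)
```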

Does anyone have a clue about what is going on??

Thanks in advance!

hugovsky · Jun 30, 2017

You should be using firmware 14. Have you checked power connections? Please post your full hardware configuration.

Cougar014 · Jun 30, 2017

I have a
Supermicro X11SSL-CF
Xeon E3-1240 v5
32GB ECC memory (can't recall which, but it was supported according to Supermicro)
Be Quiet Straight Power 10 500W
1x 250GB Samsung 850 EVO
2x 3TB WD Red in RAID 1 connected to the onboard LSI 3008 HBA by an Inter-Tech SAS -> 4x SATA cable.

I run VMware ESXi 6.5 with 13GB and 2 cores for FreeNAS. The HBA is passed through to the FreeNAS VM.
Everything else matches the following instructions.
http://www.freenas.org/blog/yes-you-can-virtualize-freenas/

Most of the time these errors occur overnight, when the system is idle (still running, but doing "nothing").

I have not checked the power source yet. I am not sure if I have another set of PSU-to-SATA power cables.
But I will take a look at it tomorrow.

tobiasbp · Jul 3, 2017

Cougar014 said:
Can someone tell me:
Is this just a random read error which occurs sometimes or is this something serious?

I'm seeing something very similar in my thread over here.

Cougar014 · Jul 7, 2017

tobiasbp said:
I'm seeing something very similar in my thread over here.

Yeah, it is a strange thing. I don't get it either.
I think it is either a faulty PSU (or PSU cables) or a faulty HBA....
But the strange thing is that replacing the SAS->SATA cable with a brand new one dropped the error rate to once every 2 weeks, instead of almost every day....

joeschmuck · Jul 7, 2017

The first posting listed the drive da1 as faulty, then after replacing the cable the failure is da0. Are you tracking the drives by serial number? You absolutely need to do this as the assignments of da0, da1, etc... are not fixed to the hard drive. If you know for certain that it's the same drive serial number then you could in good faith state the drive itself has an issue.

In order to move forward on this I think you will need to do the following...
1) Write down each drive serial number when there is a failure.
2) Power down and place a piece of tape on the cable connector attached to the drive in question, then rotate the cables around.
3) When the next failure occurs, did it happen to the same drive serial number? Yes = Likely bad drive but see below for other testing.
4) If the failure follows the cable then either the cable is faulty or the HBA is faulty.
5) If the failure is a different drive and cable then write it all down. This is where I feel you need to do some other testing...
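
The decision in steps 3 to 5 can be written down as a small sketch. It assumes you note the drive serial number and the tape label of the cable for each failure; the `Failure` record, the `diagnose` function and the serial numbers and labels in the usage lines are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    serial: str  # drive serial number noted when the error appeared
    cable: str   # label on the tape of the cable connector, e.g. "A"


def diagnose(before: Failure, after: Failure) -> str:
    """Compare a failure seen before the cable swap with one seen after."""
    if after.serial == before.serial:
        return "same drive: likely a bad drive"
    if after.cable == before.cable:
        return "follows the cable: cable or HBA suspect"
    return "different drive and cable: write it down, do the other testing"


# Hypothetical serial numbers and cable labels.
first = Failure(serial="WD-AAAA", cable="A")
second = Failure(serial="WD-BBBB", cable="A")
print(diagnose(first, second))
```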

Other testing: So let's do some real troubleshooting, this will either isolate the HBA or the Drive as the failing item.
1) You need to drop your FreeNAS to bare metal and use the onboard SATA ports. Yes you will lose your ESXi but this is only for testing and then you can bring back your ESXi.
2) Post the output of the SMART data for all of your hard drives. We are looking for communications errors (ID 199).
3) Run your system until another failure occurs.
4) Post the output of the SMART results for the suspect drive, again looking for communications failures.
5) If the problem is the same drive (as identified by serial number) then consider an RMA.
6) If the problem is a different drive then we will need to evaluate your hardware: run the basic test MEMTest86 and some CPU stress test, and ensure they pass. This could reveal a power supply issue.
7) If the problem never seems to come back then it's your HBA.
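
For steps 2 and 4, the value to compare between the two SMART dumps is the raw count of attribute 199. A minimal sketch, assuming the usual ten-column attribute table printed by `smartctl -A /dev/da0`; the sample rows below are invented, not taken from the drives in this thread, and `crc_error_count` is a hypothetical helper name.

```python
from typing import Optional


def crc_error_count(smart_attr_text: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199, or None if absent."""
    for line in smart_attr_text.splitlines():
        parts = line.split()
        # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(parts) >= 10 and parts[0] == "199":
            return int(parts[9])
    return None


# Invented sample rows in the assumed smartctl layout.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4100
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       3
"""
print(crc_error_count(sample))
```

If the count goes up between the dump taken before a failure and the one taken after it, that points at the link between the drive and the controller (cable, connector, port) rather than at the platters.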

Good Luck

Cougar014 · Jul 9, 2017

Hi, thanks for your info!
There are some useful tricks I will probably use later in the game....
Right now I'm trying to find the answer in the PSU or its cables. I really hope it is that, since that would be the cheapest option....

I also figured out that the errors don't only appear during scrubbing or other "hard" work for the HDDs.
Today I checked the console (just randomly, don't know why) around 12:30 and I had a clean console. But when I took a look at the console around 20:00 to see if there was a SCSI error, there was one.
And the whole system was idle the whole day....

I am starting to think it is an HBA issue, but I don't have any proof of that.
It is just strange that it is perfectly operational, but just randomly drops some of those errors on both drives, sometimes on da1, other times on da0.....

Scampicfx · Nov 2, 2017

Dear Cougar,

in case you still suffer from this error: check if your scrub tasks overlap with SMART tests...
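
Checking for such an overlap comes down to comparing two time windows. A small sketch, assuming you know roughly when each task starts and how long it runs; the start times and durations below are invented examples, not values from this system.

```python
from datetime import datetime, timedelta


def overlaps(start_a: datetime, dur_a: timedelta,
             start_b: datetime, dur_b: timedelta) -> bool:
    """True if window A and window B share any time."""
    return start_a < start_b + dur_b and start_b < start_a + dur_a


# Invented example: scrub starts at 00:00 and runs 6 hours,
# long SMART test starts at 03:00 and runs 5 hours.
scrub_start = datetime(2017, 11, 5, 0, 0)
smart_start = datetime(2017, 11, 5, 3, 0)
print(overlaps(scrub_start, timedelta(hours=6), smart_start, timedelta(hours=5)))
```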

Cougar014 · Nov 2, 2017

Hi scampicfx,

Thanks for your reply.
But I think I got it fixed.

When I started working at my PC I decreased the case fan speed from 1500 to 1000 rpm, and that caused the errors.
I think that the HBA got a little overheated when I did that.
I stopped doing that and I haven't had any of those errors since.

EDIT:
The above might be a little unclear.

My server stands next to my PC. So when I sat at my PC I got annoyed by the sound of the fans. I have a 3-way fan-speed switch on my case, so I lowered them from high to medium whenever I was sitting next to it.
After several months I started to see a pattern and I stopped lowering the case fan speed.
Since then I have never had any issues again.
