CAM Status Error during SCRUB

JohnDigital · Mar 6, 2017

Cables are a great start because they are cheap, however, I wrestled with these issues myself once. Now I have no less than 30 sata cables of different lengths and configurations and such in my "cable box", lol. The different HBAs and SAS cables I tried were sold off on eBay so no loss other than time there. While I hope cables are the issue, I suspect the odds of having two sata cables bad, along with faulty SFF cables, has got to be in the getting struck by lightning range.

The HBA will be warm, thats expected. Sideband cables shouldnt effect you. Are you using any molex to sata power adapters, splitters? While my errors dealt with cam status device error power on, reset, HBA, yours suggests something different. Mine was ultimately fixed when I replaced everything power related. PSU, molex to sata power, and the splitters. Your errors do not suggest power related problems after further review, but something else (cabling hopefully)-- (Unrecovered read error). This definitely could be evidence of cable issues. If you want to try a different board/chip/ram just to rule it out, I have my old X8SIL-F sitting idle I can lend you, it served me well and is 100% working. LMK

Good luck sir, keep us posted.

tealcomp · Mar 6, 2017

Well we do live in the land of lightning strikes LOL, so go figure. Sorry my twisted sense of humor after another long day dealing with technology. Ordering the new cables tonight and am going to order some different power adapters from monoprice. Those 2 drives in the motherboard side are not mounted in a way that is overly exciting, so tolerances with cables are very tight to non-existent. To address the other question, most of this power design is without the use of molex connectors. Where possible, the power cables are direct run modular cables from the PSU. The exceptions that come to mind are a molex to 2x SATA split on the (2) WD's and another set powering the OS mirrored SSD's. Haven't reseated the HBA but saving that for a later moment of desperation :) One thing that bears mentioning is there may be a pattern to how the errors manifest in the system log, will be taking a closer look at that shortly. Also contemplating ordering another PSU and also going to reach out to Seasonic to see if this has been mentioned before as a PSU fault.

I really appreciate your offer to borrow your motherboard, but let's hold off on that gesture of good will just a bit while the other items are worked through.

So tonight:

1) Order new power and SFF cables
2) Check all existing connections to PSU and components and try scrub again; it normally takes about an hour for this problem to show up and am pretty sure it starts with one particular drive each time and cascades from there. That's the part needing further investigation. Spent time mapping out the serial number to port to gptit labels to ensure drives aren't swapping around after a reboot.
3) Potentially order new PSU
4) Reach out to Seasonic with a support query
5) Turn it off and go relax to deal with the next day.

Anything overlooked :) ?

Thanks!
-Dan

tealcomp · Mar 13, 2017

tealcomp said:
Well we do live in the land of lightning strikes LOL, so go figure. Sorry my twisted sense of humor after another long day dealing with technology. Ordering the new cables tonight and am going to order some different power adapters from monoprice. Those 2 drives in the motherboard side are not mounted in a way that is overly exciting, so tolerances with cables are very tight to non-existent. To address the other question, most of this power design is without the use of molex connectors. Where possible, the power cables are direct run modular cables from the PSU. The exceptions that come to mind are a molex to 2x SATA split on the (2) WD's and another set powering the OS mirrored SSD's. Haven't reseated the HBA but saving that for a later moment of desperation :) One thing that bears mentioning is there may be a pattern to how the errors manifest in the system log, will be taking a closer look at that shortly. Also contemplating ordering another PSU and also going to reach out to Seasonic to see if this has been mentioned before as a PSU fault.

I really appreciate your offer to borrow your motherboard, but let's hold off on that gesture of good will just a bit while the other items are worked through.

So tonight:

1) Order new power and SFF cables
2) Check all existing connections to PSU and components and try scrub again; it normally takes about an hour for this problem to show up and am pretty sure it starts with one particular drive each time and cascades from there. That's the part needing further investigation. Spent time mapping out the serial number to port to gptit labels to ensure drives aren't swapping around after a reboot.
3) Potentially order new PSU
4) Reach out to Seasonic with a support query
5) Turn it off and go relax to deal with the next day.

Anything overlooked :) ?

Thanks!
-Dan

Well hello fellow forum members :)

I have been a busy camper since we last spoke.

All of the above items have been completed, including ordering a new PSU (eVGA P2 850 this time), new SFF cables directly from Supermicro, new more streamlined Molex to 2x SATA power for the drives in the motherboard side.
Additionally, all fans are now powered from the motherboard and the 3-way switch has been removed from the equation, as well as removal of the SATA power cable from the PSU itself.

I used this as an excuse too to buy a nice Fluke meter for doing some basic testing.

I also noticed there was a questionably loose Molex to 2x SATA providing power to the OS SSD's, so of course I fixed that.

While waiting for the various "new" components to arrive, conducted several more tests.

1) After checking everything from a wiring integrity point of view (power, data cables), I decided to re-try the scrub. It took about an hour, but sure enough, right around that timeframe, the errors started occurring. Well that was discouraging, but I already decided, I would not be defeated by this.
2) I decided to dismantle the current Z3 volume and build a slightly smaller Z3 volume, using only 8 drives (in the 2 hanging racks). Of course it required re-populating the volume with data (luckily that wasn't an issue, RSYNC to the the rescue). At first things were looking better and the first time it completed the scrub with no issues, but there was less than a TB on the volume at the time. I decided to add more data to it and run another scrub. Once again, after going beyond about 1.5TB, a scrub revealed the same issues as before.
3) The next objective was to create a mirror on the 2 drives I had taken out of the original volume pool. I decided to create a mirror, load them up with data and see if I could reproduce the error. Surprise! Everything worked as it should.
4) I decided to re-create the volume again (with the 8 drives) but this time create as a Z2 volume. I copied data over and EUREKA, the scrub completed. Not only did it complete once, it completed multiple times, including while I was actively performing a copy. It completed multiple times with different levels of data on the volume.
So, the next question is WHY? Why would it work with a RAIDZ2 volume and not a RAIDZ3 volume? There is more stress on the drives and on the CPU due to the increased parity but beyond that, I cannot offer a reason why I am seeing this behavior.
Today, my new PSU and cables arrived, so I will be taking the server down in a moment and replacing the cables. But after seeing this behavior, I am not convinced this is a power problem as originally thought. I will be back in a while to post more results. Time to replace those crappy cables right quick.

JohnDigital · Mar 13, 2017

Very interesting. Im seriously intrigued to see what this ends up being.

tealcomp · Mar 13, 2017

John Digital said:
Very interesting. Im seriously intrigued to see what this ends up being.

Well first thing I noticed LOL, the new cables from SM are shall we say beefier :) So I am encouraged, perhaps prematurely so, that this could have potential :0 Running a new scrub against both volumes right now; if that succeeds, I will recreate a new Z3 volume and reload it with data and try several scrubs :) One thing is for certain, once I get this thing seasoned, it should purr like a kitten!

JohnDigital · Mar 13, 2017

Ive seen these errors all over the forum and had to deal with it myself in the past. But the methodical way you have approached this will be valuable to members of the forum for who knows how long. I spent many a minute banging my head as well wondering "What else could it be?". Although with more cuss words.

I got them with an amalgam of parts of questionable origin both used and new, but the fact you have started with top shelf parts, will go a long way to determining the true culprit behind these errors. It didnt help that when I finally "fixed" it, I had done several things, so the burden of proof of fix, fell back to a possibility of things. At that point I was just happy to have never seen the error again.

Good night sir, and good luck!

tealcomp · Mar 14, 2017

Well, some good and some not good news LOL, or glass half full or half empty. After replacing the cables, I ran another scrub against about 7TB of my data. It ran but one drive, da1, threw that same annoying error. That is the bad news; the good news is I ran the same test again this morning and it ran to completion with ZERO errors, LOL. So, being the OCD type personality that is me, I blew away the RAIDZ2 volume, created a new RAIDZ3 volume and I am now reloading the volume. Once it is reloaded, I will test multiple scrubs. Still haven't taken out the new PSU for a spin, saving that for after the rest of this excitement I have created for myself :)
More news as it develops..

tealcomp · Mar 24, 2017

Sorry it has taken so long to provide an update. I have had more than a few things pop up (not related to this problem). Anyway, I have tried various combinations of drive configurations, tried moving drives around, tried different RAIDZ levels and I am just not having good luck with these REDS handling Z3 beyond a certain drive count (referring to scrubs). Now today, most likely unrelated but noticed, IPMI sensor for 3.3 Standby (I presume) is showing a low critical. This is new behavior and made an appearance on 3/17. Of course I never saw that because for some reason email notification from IPMI is also not functioning, so I am looking into that. I will be trying out the new eVGA PSU today and seeing if I still have errors with the drives or if everything suddenly just works :) Nah, I cannot be that lucky.

-Dan

tobiasbp · Aug 28, 2017

tealcomp said:
As promised, here is the output for smartctl -a from all drives, ada0-ada1, da0-da3

Code:
========== Device Model: WDC WD60EFRX-68L0BN1

I think we have the same problem. After having replaced almost everything in my machine, it turns out, the drive with worst behaviour is of type WD60EFRX-68L0BN1 just like yours. Other disks with device type WD60EFRX-68MYMN1 in my zpool, do not exibit the same behaviour (Throwing SCSI errors).

fireheadman · Jul 16, 2019

I know I know... dead thread.
....But you have no idea how many forums and threads I have combed through to find this one! I'm replying because I like the troubleshooting steps performed and there is a lot of value to be had here.

I'm having the same CAM related error on scrubs (only seems to be on scrubs). I am running 2x - LSI 9211-8i SAS PCIe cards in IT mode. One of the cards has 2x SAS Mini breakout cables (each cable splits to 4x SAS drives, so 8 total). The second card has 1 SAS Mini breakout cable going to 4 more drives. Bring the drive count to 12x on the system.

My NAS sits in the crawl space (a finished area).....but it has also become home to a really nasty spider nest (spider killer). I have webs all over the front of the case from the drive leds attracting other spiders (skeletal remains everywhere from challengers). So needless to say, it takes a few minutes to locate/monitor her whereabouts before entering the area and opening the case.

Steps performed:

re-seated both cards LSI cards
moved 1 of the SAS Mini connections to the other card.
also replaced 2 drives with new drives (same size/type)
rebooted after resilver completed (replacing 2 drives forced a z2 parity condition)
- 1 drive replace per z2 set, so no crash.

This seemed to move CAM error to another drive (da08 and now da04). I haven't mapped out serial numbers to see if I have a relabeling event took place on reboot issue (could be the same drive?).

TODO:

print SN labels and apply to drives
move SAS Mini connection back to original LSI card
re-run zpool/glabel cmds to see if CAM error is from same drive...
- This would make me thing the SAS Mini Cable is bad (SFF-8087) or I might have a faulty back plane/SAS port.

Ruling out power at the moment and heat has not been an issue (crawl space is like an ice box)
Spiders however... unsure, hoping webs are not conductive to arc across any/everything.

Zpool status

Code:

[root@maximus ~ 502]# zpool status
  pool: abyss
 state: ONLINE
  scan: scrub repaired 0 in 0 days 02:09:45 with 0 errors on Tue Jul 16 12:31:13 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        abyss                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/bee5edcc-a77b-11e9-a273-000423df85ec  ONLINE       0     0     0
            gptid/13925837-aca3-11e6-b8b1-000423df8756  ONLINE       0     0     0
            gptid/140839a6-aca3-11e6-b8b1-000423df8756  ONLINE       0     0     0
            gptid/148e902b-aca3-11e6-b8b1-000423df8756  ONLINE       0     0     0
            gptid/15145307-aca3-11e6-b8b1-000423df8756  ONLINE       0     0     0
            gptid/158c8f83-aca3-11e6-b8b1-000423df8756  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/160853d8-aca3-11e6-b8b1-000423df8756  ONLINE       0     0     0
            gptid/168749dc-aca3-11e6-b8b1-000423df8756  ONLINE       0     0     0
            gptid/90794795-a77b-11e9-a273-000423df85ec  ONLINE       0     0     0
            gptid/1782b12a-aca3-11e6-b8b1-000423df8756  ONLINE       0     0     0
            gptid/1806e3fa-aca3-11e6-b8b1-000423df8756  ONLINE       0     0     0
            gptid/1892db15-aca3-11e6-b8b1-000423df8756  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:52 with 0 errors on Wed Jul 10 03:45:52 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors

Glabel status

Code:

[root@maximus ~ 503]# glabel status
                                      Name  Status  Components
gptid/417078d9-ac64-11e6-a443-000423df8756     N/A  ada0p1
gptid/13925837-aca3-11e6-b8b1-000423df8756     N/A  da0p2
gptid/140839a6-aca3-11e6-b8b1-000423df8756     N/A  da1p2
gptid/148e902b-aca3-11e6-b8b1-000423df8756     N/A  da2p2
gptid/90794795-a77b-11e9-a273-000423df85ec     N/A  da3p2
gptid/15145307-aca3-11e6-b8b1-000423df8756     N/A  da4p2
gptid/158c8f83-aca3-11e6-b8b1-000423df8756     N/A  da5p2
gptid/160853d8-aca3-11e6-b8b1-000423df8756     N/A  da6p2
gptid/168749dc-aca3-11e6-b8b1-000423df8756     N/A  da7p2
gptid/1782b12a-aca3-11e6-b8b1-000423df8756     N/A  da8p2
gptid/1806e3fa-aca3-11e6-b8b1-000423df8756     N/A  da9p2
gptid/1892db15-aca3-11e6-b8b1-000423df8756     N/A  da10p2
gptid/bee5edcc-a77b-11e9-a273-000423df85ec     N/A  da11p2
gptid/13fc69ec-aca3-11e6-b8b1-000423df8756     N/A  da1p1
gptid/1385c333-aca3-11e6-b8b1-000423df8756     N/A  da0p1

SAS PCIe card versions

Code:

[root@maximus ~ 506]# sas2flash -listall
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS2008(B2)     20.00.07.00    14.01.00.08    07.39.02.00     00:02:00:00
1  SAS2008(B2)     20.00.07.00    14.01.00.08    07.39.02.00     00:03:00:00

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.

JohnDigital · Jul 16, 2019

I have gotten these several times in the past. Try different drives, different power, different cables in whatever order is easiest. There seems to be no one right answer to these issues. Something different works for different people.

Out of curiosity what is the exact error displayed?

fireheadman · Jul 16, 2019

John Digital said:
I have gotten these several times in the past. Try different drives, different power, different cables in whatever order is easiest. There seems to be no one right answer to these issues. Something different works for different people.

Out of curiosity what is the exact error displayed?

Looks like this (spews this during a scrub).
I believe the scrub is causing a massive surge of I/O and Power consumption.
If I watch a movie on Plex (uses the NAS), I do not get any of the errors....

Code:

Jul 16 15:48:35 maximus (da4:mps0:0:5:0): READ(10). CDB: 28 00 8a d0 dc 4a 00 01 00 00
Jul 16 15:48:35 maximus (da4:mps0:0:5:0): CAM status: SCSI Status Error
Jul 16 15:48:35 maximus (da4:mps0:0:5:0): SCSI status: Check Condition
Jul 16 15:48:35 maximus (da4:mps0:0:5:0): SCSI sense: ABORTED COMMAND asc:4b,4 (NAK received)
Jul 16 15:48:35 maximus (da4:mps0:0:5:0): Info: 0x8ad0dc75
Jul 16 15:48:35 maximus (da4:mps0:0:5:0): Retrying command (per sense data)

...and the mount of these errors??... 8694
and that is after serveral log rotations

Code:

[root@maximus ~ 503]# grep "CAM status: SCSI Status Error" /var/log/messages |wc -l
    8694

JohnDigital · Jul 16, 2019

Shut your server down and switch the cables between that drive and a different drive. Then boot up and start a scrub. Does the error follow the drive or the cable?

JohnDigital · Jul 16, 2019

I suspect the scrub is exposing an issue with the drive reading a particular block. Thus throwing a CAM read error. To rule out the HBA you could try connecting the same suspect drive to a port on the motherboard, if it still throws the error then, it's time for drive RMA.

For the record I have never had new cables fix this. Once I think it was the drive itself and once it was power related with multiple CAM errors.

fireheadman · Jul 16, 2019

I replaced the drive and switched a cable to the other card and got a CAM error, however I need to just do 1 thing at a time....
Since the cables are cheap, I am buying a new one today and will replace the cable I moved....

My thoughts (after thinking more on it) is the cable is going bad. Card A originally had 4 drives (da0-da3)... and Card B had 8 drives(da4-da11). When I moved the cable over, card A registered da0-da7 and card B registered da8-da11. I still get a little lost on the mappings.... cables are cheap so gonna start there.

fireheadman · Jul 21, 2019

UPDATE:
looks like it ended up being a bad cable (in addition to failing disk that was replaced last week)
I purchased 1 of these https://www.amazon.com/gp/product/B018YHS8BS and re-started another scrub. Its been running for 10-15min without any errors so far. (/var/log/messages is quite as a mouse)

Inspecting the cable I took off I cannot see any physical damage to it...

JohnDigital · Jul 24, 2019

Let us know what you find out.

fireheadman · Jul 24, 2019

3 days without error
Going with a bad cable causing the CAM errors. I’ve performed 3 scrubs also to stress test

JohnDigital · Jul 24, 2019

That's great. Love those $12 fixes.

Important Announcement for the TrueNAS Community.

CAM Status Error during SCRUB

Guru

Explorer

Explorer

Guru

Explorer

Guru

Explorer

Explorer

Patron

Dabbler

Guru

Dabbler

Guru

Guru

Dabbler

Dabbler

Guru

Dabbler

Guru

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "CAM Status Error during SCRUB"

Similar threads