CAM Status, Medium Error, Unretryable?

BlueMagician · May 9, 2017

Dear all,

I wonder if anyone could shed some light on an error I've received from my FreeNAS box overnight (during a scheduled Scrub):

Code:

(da5:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 01 0e a0 06 40 00 00 01 00 00 00 length 131072 SMID 152 terminated ioc 804b scsi 0 state 0 xfer 0
(da5:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 01 0e a0 06 40 00 00 01 00 00 00
(da5:mps0:0:6:0): CAM status: CCB request completed with an error
(da5:mps0:0:6:0): Retrying command
(da5:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 01 0e a0 05 40 00 00 01 00 00 00
(da5:mps0:0:6:0): CAM status: SCSI Status Error
(da5:mps0:0:6:0): SCSI status: Check Condition
(da5:mps0:0:6:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da5:mps0:0:6:0): Info: 0x10ea00540
(da5:mps0:0:6:0): Error 5, Unretryable error

da5 is one of the six WD Red drives that make up my RAIDZ2 vdev.

I ran zpool status that morning whilst the scrub was still in progress. It showed that 128 KB of data had been repaired, and one of the drives in the status list had '(REPAIRED)' or something similar tagged onto it.

I ran an Extended (long) SMART test on da5 the following evening, and the test completed with no errors.

Most interestingly, the drive is not showing any pending nor any reallocated sectors - so does this mean that the read error was more likely a controller issue than a sector/disc problem?

The drive in question has about 3 months warranty left - so I'm wondering if I need to poke it harder to see if there's an underlying issue, or whether I should just forget this ever happened, and move on.

Sanity check much appreciated! Thank you in advance,

S.

m0nkey_ · May 9, 2017

Did this error only occur once, or did it appear many times? I've seen something similar with one of my drives which turned out to be a bad cable.

BlueMagician · May 9, 2017

m0nkey_ said:
Did this error only occur once, or did it appear many times? I've seen something similar with one of my drives which turned out to be a bad cable.

It has only happened once in recent times - although I couldn't "hand on heart" say that it hasn't cropped up once before.

My PERC card was new when I bought it, as were the ludicrously expensive LSI-branded 8087 cables. I know that's no guarantee of fault-free operation, but I do tend to over-buy on things like that, to try and avoid stuff like this happening!

I'd take the chassis apart and give all the connectors a good bit of compressed-air and a re-seat -- but I fear the powering down of all the drives may be more dangerous to the pool than this error.

Not sure. Thoughts appreciated though, thank you.

S.

Dice · May 9, 2017

BlueMagician said:
My PERC card

1. What model?
2. Is it IT-mode flashed?

BlueMagician · May 9, 2017

Dice said:
1. What model?
2. Is it IT-mode flashed?

As per the system spec in my signature, it's a Dell PERC H200, crossflashed to IT mode with P20 firmware.
The system has been running for about 2 years, and I've changed nothing recently except for updating from 9.10_U2 to 9.10_U3 about 2 weeks ago.

Thank you,
S.

Dice · May 9, 2017

As @Jailer pointed out, this is the typical error message for cable problems.
Does your SMART output register anything >0 on ID #199?
Either way, I'd poke around and re-seat the SATA cables. Then wait to see if the error returns.
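
For anyone wanting to script the check suggested above, here is a minimal sketch that pulls the raw values of the usual suspect counters out of `smartctl -A` output. It assumes the standard ten-column attribute table and plain integer raw values; the function name and the sample text are made up for illustration and are not from the poster's drive.

```python
# Attributes commonly checked when separating cable faults from media faults.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}


def watched_raw_values(smartctl_text):
    """Return {attribute_id: raw_value} for the watched attributes.

    Assumes the ten-column table printed by `smartctl -A`, with
    RAW_VALUE in the last column as a plain integer.
    """
    found = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row ("ID# ATTRIBUTE_NAME ...") is skipped by this check.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in WATCHED:
            found[attr_id] = int(fields[9])
    return found


# Hypothetical sample text for illustration only (NOT the poster's data).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for attr_id, raw in sorted(watched_raw_values(SAMPLE).items()):
        print(attr_id, WATCHED[attr_id], raw)
```

A non-zero raw value on 199 would lean towards cabling, while non-zero values on 5, 197 or 198 would lean towards the disc itself.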

BlueMagician · May 9, 2017

Dice said:
Does your SMART output register anything >0 on ID #199?

No, all the usual suspect SMART counters are zero - hence in my first post making the assumption that this was more likely controller or bus related.

I'm not sure whether to be happy that my drive is probably OK, or sad that the only way to possibly find out for sure is to risk a power-down to fiddle with potentially-dodgy/potentially-fine premium cables. Hmm.

S.

Robert Trevellyan · May 9, 2017

What is it about powering down that makes you so nervous?

Stux · May 9, 2017

I'd keep an eye on it. Could've been a one-off. If it happens again, then investigate further.

BlueMagician · May 9, 2017

Robert Trevellyan said:
What is it about powering down that makes you so nervous?

Only the fact that the discs have been powered up for a year without interruption, and then another year before that. Just a bit of paranoia really.

EDIT: After a quick email search - my suspicions at having seen this error before were confirmed - albeit more recently than I first thought...

It seems that a near identical error was flagged up a few weeks ago during a previous scrub.

Line for line, the error from last month is the same as this latest one, except that in the first line the SMID was 973 in one and 152 in the other. EVERY other detail of the error is the same.

That's two emails, a few weeks apart, both produced whilst performing a scrub - showing a single unretryable READ error from the same channel/disc, with identical error detail including all the long CDB numbers - except for that SMID.

I don't know the significance of the SMID number, but if this were truly a randomly failing part/connector/controller then I would expect a little more randomness to the failure?

S.
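
On what the "long CDB numbers" mean: the following is a minimal sketch that decodes a READ(16) command block into its starting LBA and transfer length, using the standard SCSI layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2 to 9, a 4-byte transfer length in bytes 10 to 13). The function name and the 512-byte sector size are assumptions for illustration.

```python
def decode_read16(cdb_hex):
    """Decode a READ(16) CDB given as space-separated hex bytes.

    Returns (lba, blocks): the starting logical block address and
    the number of blocks requested.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: LBA
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: transfer length
    return lba, blocks


# Assumption: 512-byte logical sectors.
SECTOR_BYTES = 512

if __name__ == "__main__":
    # The CDB from the command that returned MEDIUM ERROR in the log above.
    lba, blocks = decode_read16("88 00 00 00 00 01 0e a0 05 40 00 00 01 00 00 00")
    print(hex(lba), blocks, blocks * SECTOR_BYTES)
```

Decoded this way, the failing command's LBA agrees with the `Info: 0x10ea00540` line, and 256 blocks of 512 bytes comes to 131072 bytes, which lines up with the `length 131072` field in the log and with the 128KB that the scrub reported as fixed. An identical CDB in both emails would therefore mean the same 128KB region of da5 was being read each time.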

Robert Trevellyan · May 10, 2017

It seems to me that if powering down is going to hurt the system, that's just a problem waiting to happen anyway. The only reason to delay would be to refresh your backup first.
