Chronic "CAM status: Command timeout" errors

mdreed

Dabbler
Joined
Nov 18, 2018
Messages
13
Hello there,

I've got a several-year-old server that has been throwing "CAM status: Command timeout" errors basically every day at 3 am (whenever it is powered on). It throws a few of them over several days, then eventually freezes up and requires a reboot. This has been going on for a long time (over a year).

The server is an ASRock C2550D4I w/ 16 GB of ECC memory and 2x5 TB & 2x3 TB WD Red NAS hard drives. I recently replaced one of the 5 TB drives because it was throwing some smartctl errors. It is running the latest stable release of FreeNAS 11.2.

Some example errors are:

Code:
May 12 03:02:36 freenas ahcich3: Timeout on slot 17 port 0
May 12 03:02:36 freenas ahcich3: is 00000008 cs 00000000 ss 00000000 rs 00020000 tfd 40 serr 00000000 cmd 10009117
May 12 03:02:36 freenas (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 c0 77 d9 40 fa 00 00 00 00 00
May 12 03:02:36 freenas (ada3:ahcich3:0:0:0): CAM status: Command timeout
May 12 03:02:36 freenas (ada3:ahcich3:0:0:0): Retrying command
May 12 03:04:18 freenas ahcich2: Timeout on slot 30 port 0
May 12 03:04:18 freenas ahcich2: is 00000008 cs 00000000 ss 00000000 rs 40000000 tfd 40 serr 00000000 cmd 10009e17
May 12 03:04:18 freenas (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 b8 30 32 40 83 00 00 00 00 00
May 12 03:04:18 freenas (ada2:ahcich2:0:0:0): CAM status: Command timeout
May 12 03:04:18 freenas (ada2:ahcich2:0:0:0): Retrying command
May 12 03:08:53 freenas ahcich2: Timeout on slot 20 port 0
May 12 03:08:53 freenas ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00100000 tfd 40 serr 00000000 cmd 10009417
May 12 03:08:53 freenas (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 48 d4 32 40 83 00 00 00 00 00
May 12 03:08:53 freenas (ada2:ahcich2:0:0:0): CAM status: Command timeout
May 12 03:08:53 freenas (ada2:ahcich2:0:0:0): Retrying command
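
To see whether the timeouts cluster on particular channels or at a particular hour, the "Timeout on slot" lines can be tallied. The following is only a minimal sketch: the function name `tally_timeouts` and the regular expression are illustrative assumptions built around the log format shown above, and the sample data is copied from that excerpt. On a real system the lines would come from the system log instead (assumed here to be /var/log/messages).

```python
import re
from collections import Counter

# Matches lines such as:
#   "May 12 03:02:36 freenas ahcich3: Timeout on slot 17 port 0"
# The pattern is an assumption based on the excerpt above.
TIMEOUT_RE = re.compile(
    r"^\w{3}\s+\d+\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+\S+\s+"
    r"(?P<channel>ahcich\d+): Timeout on slot \d+ port \d+"
)


def tally_timeouts(lines):
    """Count AHCI timeout events per channel and per hour of day."""
    per_channel = Counter()
    per_hour = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match:
            per_channel[match["channel"]] += 1
            per_hour[int(match["hour"])] += 1
    return per_channel, per_hour


# Sample lines taken from the excerpt above.
SAMPLE = """\
May 12 03:02:36 freenas ahcich3: Timeout on slot 17 port 0
May 12 03:02:36 freenas (ada3:ahcich3:0:0:0): CAM status: Command timeout
May 12 03:04:18 freenas ahcich2: Timeout on slot 30 port 0
May 12 03:08:53 freenas ahcich2: Timeout on slot 20 port 0
""".splitlines()

if __name__ == "__main__":
    channels, hours = tally_timeouts(SAMPLE)
    print("timeouts per channel:", dict(channels))
    print("timeouts per hour:", dict(hours))
```

If every affected channel turns out to hang off the same controller, that points at the controller rather than at the individual drives or cables.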


Smartctl output for both drives (ada2 and ada3) is totally clean: no errors, and attributes 1, 7, and 199 are all 0. Zpool status is also clean.

I have replaced the SATA cables on both of these drives. I noticed that the plastic on one of the drives is a little chewed up and so the cable doesn't click in quite right, but other than that they seemed fine.

I'd really like this server to be reliable again. I've looked into getting a new drive controller (HBA in IT mode), but a guy on eBay who sells them told me he was skeptical that it would help (he was otherwise very helpful). I suppose I could also replace the drives, but they haven't shown any SMART errors and that's also a lot more expensive. I'd really appreciate any advice as to what I should do.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
How often are you running long SMART tests?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
ASRock C2550D4I
This model system board is known to have problems:
https://www.ixsystems.com/blog/library/asrock-rack-c2750d4i-bmc-watchdog-issue/
Has it been replaced under warranty?
This has been going on for a long time (year+).
You don't care about your data much, do you?
2x5 TB & 2x3 TB WD Red NAS hard drives.
Were these new drives or did you re-purpose them from a previous system?
The server is an ASRock C2550D4I w/ 16 gigs ECC memory
Are the drives connected to the integrated SATA controller?
 

mdreed

Dabbler
Joined
Nov 18, 2018
Messages
13
Thanks for the replies. Let me respond to each question:

How often are you running long SMART tests?
I believe they were scheduled to run every day, but I checked this morning and the configuration was a little hard to understand, so I just updated them to run every day for sure. Once I get home tonight I'll force a long test.

This model system board is known to have problems:
https://www.ixsystems.com/blog/library/asrock-rack-c2750d4i-bmc-watchdog-issue/
Has it been replaced under warranty?
It has not been replaced, but my understanding of that particular error is that it causes the board to not boot at all anymore. I have updated the firmware to the latest version, however. I've had this board for ~4 years, so I imagine it's out of warranty.

You don't care about your data much, do you?
Well, I haven't had the system on for that entire time. I've gone through several periods of trying to fix it, giving up, and turning it off. But recently I've been working on a personal project that involves the server, so fixing it has been elevated to the top of the pile again.

Were these new drives or did you re-purpose them from a previous, something?
No, they were brand new.

Are the drives connected to the integrated SATA controller?
Yes, connected to the integrated Marvell controller. Is that concerning? I was thinking about buying a replacement HBA, as I mentioned in the original post.
 

colmconn

Contributor
Joined
Jul 28, 2015
Messages
174
The board may be out of warranty, but when it fails it's likely you'll get a new one to replace it from the manufacturer. I had my C2750 replaced a few months ago with very little hassle. It was roughly the same vintage as yours. When it fails, your first port of call should be to email william @ asrockrack.com and describe the problem. (Include the serial number as well.)

Have you tried replacing the SATA cables from the board to the drives? If you haven't got the SATA ports on the board maxed out, have you tried connecting ada2 and ada3 to different SATA ports on the board to see if the errors go away? (Your log excerpt only shows two offending drives, so I can only assume that only two are giving you problems.)
 

mdreed

Dabbler
Joined
Nov 18, 2018
Messages
13
Oh great! I'll email William right away. Thanks for the suggestion.

I have replaced the SATA cables on the two problematic drives, but I haven't permuted them around to check if the error follows some pattern. I'll plan on doing that as well.
 

mdreed

Dabbler
Joined
Nov 18, 2018
Messages
13
Just to follow up on this in case anyone else has a similar problem: it turned out to be one of the SATA controllers. Following William's advice, I switched the drives that were connected to the Marvell SE9230 controller to the Intel SATA controller, and the error went away. I am SO relieved.
 