SOLVED A long journey to solve "CAM status: Command timeout" issues


fabiob

Dabbler
Joined
Dec 4, 2017
Messages
15
Hello community,
this is my first post here, so please forgive any mistakes (and my bad English! :D)

This is my third FreeNAS build: I'm a happy user of two 9.10.2-U6 servers and a (less happy) user of the 11.0-U4 box I'm describing below:
MB: Gigabyte GA-X150M-PRO-ECC
CPU: Intel(R) Xeon(R) CPU E3-1230 v5 @ 3.40GHz
RAM: 4 * ECC Kingston ValueRAM KVR21E15D8K2/32I
PSU: Corsair VS650 CP-9020098-EU 650W

Boot:
ada0: <DREVO X1 SSD Q0111A> ACS-2 ATA SATA 3.x device

Storage: 4 × Seagate 8 TB Archive HDDs in a RAID 10 pool:
ada1: <ST8000AS0002-1NA17Z AR17> ACS-2 ATA SATA 3.x device
ada2: <ST8000AS0002-1NA17Z AR17> ACS-2 ATA SATA 3.x device
ada3: <ST8000AS0002-1NA17Z AR17> ACS-2 ATA SATA 3.x device
ada4: <ST8000AS0002-1NA17Z AR17> ACS-2 ATA SATA 3.x device

I'm using the onboard HBA, which dmesg describes as:
ahci0: <Intel Sunrise Point AHCI SATA controller> port 0xf050-0xf057,0xf040-0xf043,0xf020-0xf03f mem 0xeff48000-0xeff49fff,0xeff4c000-0xeff4c0ff,0xeff4b000-0xeff4b7ff irq 16 at device 23.0 on pci0

Here is the status of the zpools:
Code:
  pool: STORAGE
 state: ONLINE
  scan: resilvered 9.77G in 0h5m with 0 errors on Tue Dec  5 09:39:08 2017
config:

	NAME                                            STATE     READ WRITE CKSUM
	STORAGE                                         ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    gptid/93567432-67a3-11e7-9dac-1c1b0da2bc7d  ONLINE       0     0     0
	    gptid/bd90ba0c-67a2-11e7-9dac-1c1b0da2bc7d  ONLINE       0     0     0
	  mirror-1                                      ONLINE       0     0     0
	    gptid/9f0647e1-6700-11e7-be5e-1c1b0da2bc7d  ONLINE       0     0     0
	    gptid/7c2543c6-67a3-11e7-9dac-1c1b0da2bc7d  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Dec  3 03:45:31 2017
config:

	NAME          STATE     READ WRITE CKSUM
	freenas-boot  ONLINE       0     0     0
	  ada0p2      ONLINE       0     0     0

errors: No known data errors

This server was first booted on July 12, 2017; the very same day, the first CRITICAL email message arrived:
Code:
Jul 12 16:07:13 freenas ahcich5: Timeout on slot 13 port 0
Jul 12 16:07:13 freenas ahcich5: is 00000000 cs 00002000 ss 00000000 rs 00002000 tfd c0 serr 00000000 cmd 0000cd17
Jul 12 16:07:13 freenas (ada4:ahcich5:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jul 12 16:07:13 freenas (ada4:ahcich5:0:0:0): CAM status: Command timeout
Jul 12 16:07:13 freenas (ada4:ahcich5:0:0:0): Retrying command
Jul 12 16:07:46 freenas ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul 12 16:07:52 freenas collectd[2843]: aggregation plugin: Unable to read the current rate of "freenas.local/cpu-2/cpu-user".
Jul 12 16:07:52 freenas collectd[2843]: utils_vl_lookup: The user object callback failed with status 2.
Jul 12 16:08:16 freenas ahcich5: Timeout on slot 14 port 0
Jul 12 16:08:16 freenas ahcich5: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 80 serr 00000000 cmd 0000ce17
Jul 12 16:08:16 freenas (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jul 12 16:08:16 freenas (aprobe0:ahcich5:0:0:0): CAM status: Command timeout
Jul 12 16:08:16 freenas (aprobe0:ahcich5:0:0:0): Retrying command
Jul 12 16:08:49 freenas ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul 12 16:09:19 freenas ahcich5: Timeout on slot 15 port 0
Jul 12 16:09:19 freenas ahcich5: is 00000000 cs 00008000 ss 00000000 rs 00008000 tfd 80 serr 00000000 cmd 0000cf17
Jul 12 16:09:19 freenas (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jul 12 16:09:19 freenas (aprobe0:ahcich5:0:0:0): CAM status: Command timeout
Jul 12 16:09:19 freenas (aprobe0:ahcich5:0:0:0): Error 5, Retries exhausted
Jul 12 16:09:52 freenas ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul 12 16:10:22 freenas ahcich5: Timeout on slot 16 port 0
Jul 12 16:10:22 freenas ahcich5: is 00000000 cs 00010000 ss 00000000 rs 00010000 tfd 80 serr 00000000 cmd 0000d017
Jul 12 16:10:22 freenas (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jul 12 16:10:22 freenas (aprobe0:ahcich5:0:0:0): CAM status: Command timeout
Jul 12 16:10:22 freenas (aprobe0:ahcich5:0:0:0): Error 5, Retry was blocked
Jul 12 16:10:22 freenas ada4 at ahcich5 bus 0 scbus5 target 0 lun 0
Jul 12 16:10:22 freenas ada4: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached

I saw lots of CKSUM errors. My first attempt was to replace the ada4 SATA cable and launch a scrub. That worked for almost two months.
Then it started again, on other disks:
Code:
Sep 22 10:08:48 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached

At this point I replaced all the SATA cabling, but the problem intensified:
Code:
Nov 17 13:13:51 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached
Nov 19 05:45:40 freenas ada1: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached
Nov 20 09:05:52 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached
Nov 25 14:31:30 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached

I was panicking: the machine hosts a Windows Server 2016 VM with a lot of services on it.
We migrated the VM to a failover machine and continued with more relaxed troubleshooting.

I checked the PSU voltages: all fine, just +5 V slightly low (4.97 V) but within ATX spec.
Next I replaced all the SATA power cabling, changed "Y" cables to straight ones, and switched to a Marvell 88SE9215 PCIe HBA... guess what?

Code:
Nov 26 07:36:14 freenas ada4: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached
Nov 29 01:46:42 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached
Nov 30 11:54:42 freenas ada3: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached

Finally, I replaced the PSU, even though the voltages were reported as correct; I thought there might be some weird noise on the +5 V rail. I also checked the BIOS settings, upgraded the BIOS to the latest revision, and disabled ALPM and hot swap, hoping for the best.
Nope.
Code:
Dec  1 14:00:59 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached
Dec  5 02:27:17 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached

The CAM status error is always the same. It looks like the HDD drops off the SATA channel at the FLUSHCACHE48 command. Note that disconnected HDDs come back online only after a power cycle; if I soft-reboot the machine, the disk stays offline.

The same problem on 4 disks... could all 4 disks really be defective?

Any suggestions, prayers or magic rituals will be much appreciated!


EDIT: attached smartctl outputs
 

Attachments

  • smart output disk #4.txt
  • smart output disk #3.txt
  • smart output disk #1.txt
  • smart output disk #2.txt

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Jul 12 16:07:52 freenas collectd[2843]: aggregation plugin: Unable to read the current rate of "freenas.local/cpu-2/cpu-user".
Jul 12 16:07:52 freenas collectd[2843]: utils_vl_lookup: The user object callback failed with status 2.
The above portion of the log can be ignored; it's just "noise" from a non-critical bug that will be fixed when 11.1-STABLE comes out.
Click the link below to read more.
https://redmine.ixsystems.com/issues/23857
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
I looked at your SMART results; the only issue I saw there was that you are not performing a Long/Extended test. I also saw that the drives had a COM reset a few times.

I suspect this is an interface compatibility issue.

So let's be very specific here: exactly which connectors are you using for the SATA ports? Reference your user manual to identify them. Are you using the SATA Express ports at all? I hope not. And do you have an M.2 card installed? Please describe everything; leave nothing out. Also refer to page 15 of your user manual to ensure you are not violating a configuration limitation.

That is all I have time for at the moment; I need to get back to work and pop a few reprogrammed CPU modules into a system.
 

fabiob

Dabbler
Joined
Dec 4, 2017
Messages
15
Thanks for the pointers, joeschmuck.
EDIT: Extended (long) tests were set up to run twice a month, but wait... looking at the FreeNAS interface, they were disabled.
Maybe FreeNAS disables SMART checks after a disk failure. I'll verify and re-run them manually right now.
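
In case it helps anyone following along, re-running them by hand is straightforward (one drive shown; I'll repeat it for ada2-ada4, and an 8 TB drive can take a day or more to finish):
Code:
root@freenas:~ # smartctl -t long /dev/ada1      # start the extended self-test in the background
root@freenas:~ # smartctl -l selftest /dev/ada1  # check progress and results later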

According to the manual, the SATA ports in use are 0 for the SSD and 2, 3, 4, 5 for the HDDs. No SATA Express and no M.2.

Note that the SSD has never disconnected...
The problem was identical even when switching to a PCIe HBA with a different chipset (Marvell 88SE9215).
Two different HBAs, same problems...?

I'm looking into this:
Code:
root@freenas:~ # smartctl -l scterc /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control command not supported

Could it generate this kind of problem? It's the first time (and maybe the last) I'm using this Seagate Archive stuff.
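
From what I've read, on drives that do support SCT ERC you can cap the recovery time below the CAM timeout, something like this (useless on these Archive drives, which reject the command; the value is in units of 100 ms, so 70 = 7.0 seconds):
Code:
root@freenas:~ # smartctl -l scterc,70,70 /dev/ada1   # set read/write error recovery to 7.0 s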
 

fabiob

Dabbler
Joined
Dec 4, 2017
Messages
15
Code:
root@freenas:~ # smartctl -a /dev/ada1 | grep Extended
Extended self-test routine
# 1  Extended offline    Completed without error       00%      3101         -
root@freenas:~ # smartctl -a /dev/ada2 | grep Extended
Extended self-test routine
# 1  Extended offline    Completed without error       00%      3101         -
root@freenas:~ # smartctl -a /dev/ada3 | grep Extended
Extended self-test routine
# 1  Extended offline    Completed without error       00%      3101         -
root@freenas:~ # smartctl -a /dev/ada4 | grep Extended
Extended self-test routine
# 1  Extended offline    Completed without error       00%      3101         -

All four drives completed the extended test with no errors. Any ideas?

EDIT: After talking with Seagate support, I'm seriously thinking about CAM debugging. They predictably told me: no SMART failures, no RMA. I'd like to dig deeper into this CAM timeout. Can anybody help me with this?
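
My first idea (untested, so take it as a sketch) is CAM's built-in tracing via camcontrol, pointed at ada4's bus from the log above (scbus5). As far as I understand, it only produces output if the kernel was built with options CAMDEBUG, which may not be the case for the stock FreeNAS kernel:
Code:
root@freenas:~ # camcontrol debug -I -T -S 5:0:0   # trace CCBs and command flow on scbus5
root@freenas:~ # camcontrol debug off              # turn tracing back off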
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
I don't see anything wrong with your hard drives, so no RMA.

All of this data leaves me with only one thing right now: the SATA cables, which you said you have replaced, correct? And the problem went away for a while.

So, how long are your SATA cables? Can you use shorted ones? Do they have locking tabs? Are they good quality?

And this leads me to only one last remaining thing... This could be our first real instance of SMR drives not being usable in a ZFS file system. This could be the reason for the timeout errors. I am purely speculating. No one here recommends SMR drives for normal use, or should I say, I will not recommend them until they become more robust.

1) Let's try some more things. First, what is your used/free space? Post the output of zfs list STORAGE so we can see how much space is used and free.

2) How long have you had these drives, and how do you use this storage? What I'm looking for is how you use your storage on a daily basis. If you are making computer backups and then making more backups and replacing them, I suspect that will kill the performance of an SMR drive. You have four SMR drives, and when you make a change like updating several files, all of the drives take a long time to rewrite that data. Again, I am just speculating; I am not the authority on SMR drives. Maybe a change of the pool type from RAIDZ2 to mirrors would help? Did I say I was just speculating...

The real first thing to do is ensure your SATA cables are top notch. If that doesn't fix it, then I think it's an SMR issue, but I can't be 100% sure about that. I'd like to hear what @Arwen has to say; she is the only person I know who uses an SMR drive with FreeNAS.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,600
@joeschmuck, he is already using striped mirrors (aka RAID 10).

As for the Seagate SMRs, writes do get slower over time. I've probably hit the bottom of performance at 30 MB/s write speed. They have a nasty habit of long pauses during large writes, while the drive flushes its non-SMR write buffer to the SMR tracks. This may be what is causing the problem. (I read somewhere that the current write buffer is 20 GB of non-SMR tracks.)

My use case is a single SMR disk with ZFS as a backup device. It works perfectly at present (for the use intended). Slow, but seemingly reliable.

I wish these had a flash-based write cache, perhaps 16 GB. This would be an ideal case for a hybrid drive.
And while I am making wishes, they should introduce an 18-20 TB SMR drive based on current regular 10 TB drives.
 

fabiob

Dabbler
Joined
Dec 4, 2017
Messages
15
Joe, Arwen, many many thanks. :)

My mistake was not digging into SMR pros and cons properly before buying. Very bad (and expensive) news for me.

Let's go on with brainstorming:
Code:
NAME     USED   AVAIL  REFER  MOUNTPOINT
STORAGE  9.62T  4.43T  10.1G  /mnt/STORAGE


1) This machine is a UrBackup server for 8 workstations/laptops + a Windows Server 2016 VM.
Actually, yes: it writes, deletes and rewrites image and file backups (+ snapshots for the VM).
The zpool is already configured as RAID 10; I chose mirroring in order to speed up the slow 5600 rpm drives. And it works fine; the VM runs very fast. Also, with 32 GB RAM, ARC helps a lot with that.

2) SATA cables: after the first error I replaced just the ada4 SATA cable with another from the same pack. They were a cheap Amazon 5-pack, albeit with locking tabs (ref. https://www.amazon.com/Mudder-Pack-Inch-SATA-Cable/dp/B01CZD91MW/ ). No more errors for almost a month. Then another disk (not ada4) disconnected. I suspected crappy SATA cables and thought about cable length, so a month ago I replaced ALL of them with Wentronic 0.5 m cables. It's not possible to go shorter: the case is a Corsair Obsidian 750D.

Here are my Qs:

Q1) FFS... how could a drive become so slow that it exceeds a 31-second timeout?! I'm shocked.
Is there a way to raise the CAM timeout to 60 seconds, for example? Just for testing; I'm not really going to leave it at 60 seconds.

Q2) Is there a way to "reset" performance? I'm thinking of TRIM for SSDs: before TRIM support, it was enough to clone the disk, issue a SECURE ERASE, and put the image back on it. Could periodically zeroing a drive be useful?
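
To be clear, what I have in mind is nothing fancier than this (obviously destructive, so only ever on a disk already detached from the pool):
Code:
root@freenas:~ # dd if=/dev/zero of=/dev/ada1 bs=1m   # WARNING: wipes the whole disk; Ctrl+T (SIGINFO) prints progress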

If this is the case, why did ada4 disconnect on the very first day? Maybe the first error really was a defective cable... I'm going nuts.

EDIT: Arwen made me think about cache flushing... what if I enable the vfs.zfs.cache_flush_disable=1 sysctl and let the drives manage "big writes" at their own pace? I'm trying it, fingers crossed :D
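
For anyone wanting to replicate the experiment, this is how I'm setting it (as a FreeNAS tunable so it survives reboots; the runtime sysctl should work too, assuming it's writable on this build):
Code:
# FreeNAS GUI: System -> Tunables -> Add Tunable
#   Variable: vfs.zfs.cache_flush_disable, Value: 1, Type: sysctl
root@freenas:~ # sysctl vfs.zfs.cache_flush_disable=1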
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
Q1) FFS... how could a drive become so slow that it exceeds a 31-second timeout?! I'm shocked.
Is there a way to raise the CAM timeout to 60 seconds, for example? Just for testing; I'm not really going to leave it at 60 seconds.
I too was thinking about trying to extend the CAM timeout; however, I have not done any searching to find out whether that is even possible. I also thought about a log drive; I don't think that would help either, but it might be worth a try.

1) This machine is a UrBackup server for 8 workstations/laptops + a Windows Server 2016 VM.
Actually, yes: it writes, deletes and rewrites image and file backups (+ snapshots for the VM).
Maybe I called it.

Good thing you already replaced the SATA cables. Honestly, I would have expected a different error had it been the cables, but it was just one of those things that needed to be ruled out.

I'll bet that joe meant "shorter"!:)
But I like "shorted" cables, they make Rainbow colored sparks!

@joeschmuck, he is already using striped mirrors (aka RAID 10).
I saw that and don't know why it didn't register in my brain.

It's possible a tunable or two could help with this problem, assuming it is an issue due to SMR causing a slow response.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,600
I don't know how to "restore" the speed of an SMR-type disk. Someone suggested simply removing the partition table and re-creating the file system. My opinion is that this will do nothing on the drive side. It will cause ZFS to consider the disk empty, but do nothing for the shingled tracks (in my opinion).

Possibly detaching a mirror drive and writing zeros or performing a SECURE ERASE would do it. But with only 2-way mirrors, I would not risk a plain detachment. I would want a 5th disk prepared and added as a 3rd mirror member, wait for the re-sync, THEN detach one of the other disks. Done that way, I would never lose redundancy.
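
Roughly, the sequence would look like this (using one of the mirror-0 gptids from the pool status above as an example; gptid/NEW-DISK is a placeholder for the 5th disk's partition, and the FreeNAS GUI is the safer way to do the attach, since it creates the partitions for you):
Code:
root@freenas:~ # zpool attach STORAGE gptid/93567432-67a3-11e7-9dac-1c1b0da2bc7d gptid/NEW-DISK
root@freenas:~ # zpool status STORAGE   # wait here until the resilver completes
root@freenas:~ # zpool detach STORAGE gptid/93567432-67a3-11e7-9dac-1c1b0da2bc7d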

Last, to make sure everyone reading the thread understands: when I mentioned cache, it's the DRIVE's cache. It may not be possible to disable that cache, since it's an integral part of SMR.
 

fabiob

Dabbler
Joined
Dec 4, 2017
Messages
15
Heartbeat message: everything is running smoothly for now... stay tuned :)

At the moment, our theory is:
- SMR drives have a non-SMR write buffer of 20 GB or so.
- On long, massive operations, forcefully flushing that buffer (as ZFS requires) takes so long that the AHCI timeout is exceeded, so the box marks the drive as failed and disconnects it from the bus.

My workaround is to disable flush commands (via a ZFS tunable) and let the drives empty their loooong buffers at their own pace.

A bad side effect is risking the consistency of the entire ZFS pool in case of an extended power loss (the box is UPS-backed) or another hardware failure.
This is going to push me to a full disk replacement ASAP.


Someone suggested simply removing the partition table and re-creating the file system. My opinion is that this will do nothing on the drive side. It will cause ZFS to consider the disk empty, but do nothing for the shingled tracks (in my opinion).

I agree, it's useless. The GUID partition structures cover only a very small part of the disk; everything else stays untouched. I'm going to replace these SMR drives with my beloved WD Reds in a couple of months; I'll run some experiments then and post the results.

I too was thinking about trying to extend the CAM timeout; however, I have not done any searching to find out whether that is even possible.

About the CAM/AHCI timeout, I've found nothing in the tunables that covers this reset wait. Digging into the FreeBSD kernel source, I'm afraid it's hardcoded in /sys/dev/ahci/ahci.c:
Code:
...
/* Wait for the device to become ready: up to 31 s when crash-dumping,
 * otherwise return immediately if it is not ready yet. */
if (ahci_wait_ready(ch, dumping ? 31000 : 0, 0)) {
	if (dumping)
		ahci_clo(ch);		/* dump path: force a command-list override */
	else
		ch->resetting = 310;	/* normal path: poll for 310 ticks of 100 ms = 31 s */
}
...


I'm definitely not going to hack that and possibly destroy my box without proper testing first.
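
The closest knob I could find is the ada(4) driver's per-command timeout, which can be raised from loader.conf. If I read the driver correctly it governs CAM command timeouts, not the hardcoded reset wait above, so take this as a sketch of a partial measure at best:
Code:
# /boot/loader.conf -- raise the ada(4) per-command CAM timeout from the 30 s default to 60 s
kern.cam.ada.default_timeout="60"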

I also thought about a log drive; I don't think that would help either, but it might be worth a try.

Nice! I'll try a spare SSD for that.
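
If I get the zpool syntax right, it should be as simple as the line below (ada5p2 is a placeholder for the SSD's partition; the GUI can do this too). As far as I understand, a SLOG only absorbs synchronous writes, so it may or may not dodge the flush stalls:
Code:
root@freenas:~ # zpool add STORAGE log ada5p2   # placeholder device: attach an SSD partition as the SLOG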
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
Remember that I am only speculating that your problem is SMR; I don't have any proof. I guess if you were to buy new non-SMR drives and they worked perfectly, then you could say the drives were at fault.
Digging into the FreeBSD kernel source, I'm afraid it's hardcoded in /sys/dev/ahci/ahci.c
Then you could submit a bug report requesting a feature change to support SMR drives, with an option to increase this value to something higher, maybe even double it.

Maybe a combination of disabled flushes and an SSD log would work? It would be nice to have someone with serious experience down this road jump in and give some advice.

One other thing you could do is jump on a FreeBSD forum, ask this timeout question, and provide the details we have discussed here to see if they have any advice. I'd label the thread title something like "SMR Drives Drop from Pool" or something catchy.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
Thanks for taking that to the FreeBSD folks. Hopefully it will be updated in the near future.
 