SOLVED AHCI timeouts caused by smartctl

Status
Not open for further replies.

HeinzApfel

Hello,

I see a lot of AHCI timeouts like:
Code:
ahcich1: Timeout on slot 10 port 0
ahcich1: is 00000000 cs 00000400 ss 0000000 rs 00000400 tfd 40 serr 00000000 cmd 10008a17

on the console.
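The second line is the ahci(4) driver's register dump for that channel. As far as I can tell from FreeBSD's sys/dev/ahci/ahci.c, "is" is the port interrupt status, "cs" the command-issue register (PxCI), "ss" the SATA-active register (PxSACT), "rs" the driver's own mask of running slots, and the low byte of "tfd" the ATA status register. Treat those field meanings as an assumption to check against your driver version. Each command slot is one bit, so cs 00000400 (bit 10) matches "slot 10" in the first line. A minimal Python sketch that decodes such a line (decode_dump and set_slots are hypothetical helper names):

```python
import re

# Register dump printed by ahci(4) after a timeout, e.g.
# "ahcich1: is 00000000 cs 00000400 ss 0000000 rs 00000400 tfd 40 serr 00000000 cmd 10008a17"
# Field meanings are assumed from the FreeBSD driver source.
DUMP_RE = re.compile(
    r"^(?P<chan>ahcich\d+): is (?P<is>[0-9a-f]+) cs (?P<cs>[0-9a-f]+) "
    r"ss (?P<ss>[0-9a-f]+) rs (?P<rs>[0-9a-f]+) tfd (?P<tfd>[0-9a-f]+) "
    r"serr (?P<serr>[0-9a-f]+) cmd (?P<cmd>[0-9a-f]+)$"
)


def set_slots(mask: int) -> list[int]:
    """Return the indices of the bits set in a 32-bit slot mask."""
    return [bit for bit in range(32) if (mask >> bit) & 1]


def decode_dump(line: str) -> dict:
    """Parse one register-dump line into channel, slot lists and status."""
    m = DUMP_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not an ahci register dump: {line!r}")
    regs = {k: int(v, 16) for k, v in m.groupdict().items() if k != "chan"}
    return {
        "channel": m.group("chan"),
        "issued_slots": set_slots(regs["cs"]),   # commands the HBA has not completed
        "running_slots": set_slots(regs["rs"]),  # commands the driver still tracks
        "tfd_status": regs["tfd"] & 0xFF,        # low byte = ATA status (0x40 = DRDY)
        "serr": regs["serr"],                    # SATA error register
    }
```

For the dump above this gives issued and running slot [10], status 0x40 (DRDY set, BSY and ERR clear) and no SError bits, which suggests the drive reported itself ready while the command in slot 10 never completed.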

The problem can be reproduced by copying data to my RAID while running smartctl from the command line. After several calls such as smartctl -a /dev/ada0, smartctl suddenly hangs, and then the timeout appears.
The timeout occurs on different drives, not always the same one, and it does not matter which drive I run smartctl against.
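One way to confirm that the timeouts are spread over several drives is to count the timeout messages per channel. A minimal sketch, assuming the kernel messages end up in /var/log/messages (the usual FreeBSD location); timeouts_per_channel is a hypothetical helper. Each ahcichN is a single SATA port, and camcontrol devlist -v should show which adaN sits behind it.

```python
import re
from collections import Counter

# Matches e.g. "ahcich1: Timeout on slot 10 port 0"; any syslog prefix before it is allowed.
TIMEOUT_RE = re.compile(r"(ahcich\d+): Timeout on slot (\d+) port (\d+)")


def timeouts_per_channel(lines) -> Counter:
    """Count ahci(4) timeout messages per channel (ahcich0, ahcich1, ...)."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Usage (assumed log path):
# with open("/var/log/messages") as log:
#     print(timeouts_per_channel(log))
```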

Does anyone have an idea what I could try to avoid these hangs and timeouts?

My system runs FreeNAS 9.3-BETA 2014-11-22 on an ASRock C2550D4I. SATA cables and power supply have already been replaced. RAM is ECC, and the RAID consists of 11 disks in a RAID-Z3.

Thanks for your ideas and help,
Heinz
 
dlavigne

SMART isn't causing the errors, only reporting them. Are you getting errors in your scheduled smart tests or ZFS scrubs?
 

solarisguy

Do you have an add-on SATA card?

If yes, does it have a Marvell chipset?

If yes, what is the speed of the slot it is in? 1x? 2x? 4x? 8x? 16x?

If it is not 1x, try moving it to a 1x slot.

Out of ideas...
 

HeinzApfel

dlavigne said:
SMART isn't causing the errors, only reporting them. Are you getting errors in your scheduled smart tests or ZFS scrubs?

It's smartctl that is causing the problem.
The timeouts appear when:
  • I call smartctl -a /dev/ada0 in the shell
  • the SMART service is enabled: then every 30 minutes, i.e. whenever smartd checks the drives
The timeout does not appear when the system is idle, only while I copy files from the network or copy a huge directory in the shell.
Scrubs trigger the same problem: I start a scrub, call smartctl in the shell, and the timeout appears.
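The reproduction described above (load on the pool plus repeated smartctl calls) can be scripted so that hangs are counted rather than spotted by eye. A minimal sketch, assuming smartctl is on the PATH and the copy or scrub is started separately; hammer_smartctl and its parameters are hypothetical. A run that exceeds timeout_s is counted as a hang and killed by subprocess.run, although a process blocked inside the kernel may not exit right away, so the call can still stall.

```python
import subprocess
import time


def hammer_smartctl(device, runs, timeout_s, base_cmd=("smartctl", "-a")):
    """Run `smartctl -a <device>` repeatedly and count runs that hang.

    A run counts as hung if it does not finish within timeout_s seconds.
    Returns (number_of_hangs, list_of_durations_in_seconds).
    """
    hangs = 0
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        try:
            # smartctl uses non-zero exit bitmasks for warnings, so don't raise on them.
            subprocess.run([*base_cmd, device], stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL, timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            hangs += 1  # run() has already killed the child at this point
        durations.append(time.monotonic() - start)
    return hangs, durations


# Usage while a scrub or large copy is running (values are illustrative):
# hangs, times = hammer_smartctl("/dev/ada0", runs=200, timeout_s=60)
```

timeout_s only needs to be comfortably above the few seconds smartctl -a normally takes; the exact value is not critical.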
 

HeinzApfel

solarisguy said:
Do you have an add-on SATA card?
If yes, does it have a Marvell chipset?...

Yes, Marvell: the C2550D4I has two onboard controllers based on Marvell (4 + 2 SATA3 ports) and one in the Intel chipset (4 SATA2 ports).
The timeouts appear on disks attached to all three controllers.

Are there any known problems with Marvell controllers?
I thought the C2550D4I was a good choice, as it is the 4-core version of the motherboard used in the FreeNAS Mini.
 

joeschmuck

I had these timeouts too last week. Typically I would suspect the SATA cable first, then the hard drive, then the SATA controller. In my case it was the hard drive, even though it passed ALL of the SMART tests with flying colors. I will be doing more tests on that drive, because the issues also started right after I upgraded FreeNAS.
 

solarisguy

joeschmuck said:
I had the timeouts too last week and typically I would suspect the SATA cable, then the hard drive, then the SATA controller. For myself it was the hard drive even though it passed ALL of the SMART tests with flying colors. I will be doing more tests on that hard drive because the issues also started right after I upgraded the FreeNAS version.
HeinzApfel's last post implies that there are errors on all the drives (or at least on one disk on each of the three controllers), so smartmontools could be the guilty party...

@joeschmuck, does your system also have a Marvell controller? It seemed to me that in earlier reports such errors never appeared on systems without a Marvell controller.
 

joeschmuck

My test rig has an Intel® ICH10 southbridge, which provides the SATA ports. Also, my problem did go away once I replaced the hard drive.
 

HeinzApfel

I disabled all onboard SATA controllers and replaced them with an LSI 9201-16i controller.
This can be done without re-formatting the drives. See here: Can a HBA SATA controller be exchanged by LSI9201-16i ?

The C2550D4I has only one 8x PCIe slot, so the alternative 16x PCIe RocketRAID 2740 can't be used.

All problems are gone. I can run smartctl in an endless loop while copying files and scrubbing the zpool.
No more timeouts! No more file corruption!

Best regards,
Heinz
 