AHCI timeouts

Status
Not open for further replies.

jesusjd

Cadet
Joined
Feb 28, 2012
Messages
3
As I'm experiencing this problem too, I've put together a summary of the tests I've done, in case they help anyone. These are my own experiences with this error, and I'm still investigating how to fix, or at least minimize, the AHCI timeouts while I try to shake the rust off my written English (sorry, pals!)

What I've found so far...

Specs: Gigabyte GA990FXA-UD5 w/ AMD Phenom II 955 & 16GB RipjawsX@1600MHz and an Intel NC380T PCIe 4x dual gigabit LAN adapter

Using the onboard SATA controller plus two additional Dell SAS 6/iR cards (both flashed with LSI 1068IT firmware). The onboard controller runs 4x Samsung HD154UI drives in RAIDZ. The other two boards have 8x Samsung HD204UI and 8x Samsung HD753LJ drives, all of them in RAIDZ2 config. Everything sits in a Norco RPC-4020 chassis and is powered by an OCZ 700W PSU.

The SAS boards are working like a charm, but the integrated one (AMD SB950/IXP700) is another story (more like a nightmare, I should say...). When copying single files everything's OK with this controller, but when it is under heavy load... Ta-daaa! The AHCI timeouts appear.

So far, all I've found about this problem is that the affected mobos use an AMD SB*** integrated SATA controller, and that processor speed or RAM amount doesn't seem to affect the problem for better or for worse...

Now, what I've tried so far...

-Set SATA to IDE mode instead of AHCI: Timeouts
-Tested with other drives (HD753LJ instead of the installed HD154UI): Timeouts
-Added the line 'hint.ahci.0.msi=0' to loader.conf to see if there was a problem with IRQ handling: Timeouts
-Disabled all on-board hardware in BIOS except the essentials: Timeouts
-Swapped the graphics card for a PCI one: Timeouts
-Removed all cards (network, additional SAS controllers, etc): Timeouts
-Updated BIOS from F5 to the latest one (F7): Timeouts
-Plugged the disks directly into the onboard controller (bypassing the Norco backplanes, I mean): Timeouts
-Disabled all services in FreeNAS (CIFS, SMART, etc): .....Well, guess what!


After that, I tested the 8.0.4-p1 amd64 FreeNAS version on another mobo (Asus Crosshair + AMD Phenom 9950 + 8 GB 800MHz DDR2 RAM), and the drives worked perfectly, without any errors or timeouts...

Then I tried plugging 6 drives into the onboard controller instead of 4, in order to fill all the onboard ports, and the timeouts decreased noticeably. Even with the timeouts, the system kept copying/moving data without hanging completely like before...

BUT!!! (pero! - mais! - aber! - porem!), since I can't feel satisfied until I've tortured this config enough to see where the problem could be.............

I tested the new RAIDZ2 of 6 disks by creating some data with the 'dd' command. Sadly, I found the same situation as before: when the dataset is under heavy load... (dd if=/dev/zero of=/mnt/TEST/TEST/testfile01 bs=100m count=200) + (dd if=/mnt/TEST/TEST/testfile02 of=/dev/null bs=100m count=200) = TIMEOUT!!

After that, I destroyed the RAIDZ2 dataset and created a striped dataset with the 6 drives. Next, I ran the following test (all lines executing simultaneously):

dd if=/dev/zero of=/mnt/TEST/TEST/test01 bs=100m count=300 &
dd if=/dev/zero of=/mnt/TEST/TEST/test02 bs=100m count=300 &
dd if=/dev/zero of=/mnt/TEST/TEST/test03 bs=100m count=300 &
dd if=/dev/zero of=/mnt/TEST/TEST/test04 bs=100m count=300 &
dd if=/dev/zero of=/mnt/TEST/TEST/test05 bs=100m count=300 &
dd if=/mnt/TEST/TEST/test06 of=/dev/null bs=100m count=300 &
dd if=/mnt/TEST/TEST/test07 of=/dev/null bs=100m count=300 &
dd if=/mnt/TEST/TEST/test08 of=/dev/null bs=100m count=300 &
dd if=/mnt/TEST/TEST/test09 of=/dev/null bs=100m count=300 &
dd if=/mnt/TEST/TEST/test10 of=/dev/null bs=100m count=300 &
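
For anyone repeating this test, here is a minimal sketch of the same load written as a loop. The function name dd_load and its arguments are hypothetical, made up for illustration; it only prints the dd commands so they can be reviewed before being piped to sh. It assumes, as the list above does, that the files the readers open (test06 to test10 here) already exist.

```sh
#!/bin/sh
# Sketch: print N concurrent dd writers followed by N concurrent dd readers.
# dd_load is a hypothetical helper, not part of FreeNAS or FreeBSD.
# Arguments: target directory, number of writers/readers, dd block count.
dd_load() {
    dir=$1
    n=$2
    count=$3
    i=1
    # Writers: test01 .. testN, filled from /dev/zero.
    while [ "$i" -le "$n" ]; do
        printf 'dd if=/dev/zero of=%s/test%02d bs=100m count=%s &\n' \
            "$dir" "$i" "$count"
        i=$((i + 1))
    done
    # Readers: test(N+1) .. test(2N); these files must already exist.
    while [ "$i" -le $((2 * n)) ]; do
        printf 'dd if=%s/test%02d of=/dev/null bs=100m count=%s &\n' \
            "$dir" "$i" "$count"
        i=$((i + 1))
    done
    # Block until every background dd has finished.
    echo wait
}
```

Running `dd_load /mnt/TEST/TEST 5 300` prints the ten commands listed above plus a final `wait`; piping that output to `sh` would start the load.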

Did I mention that I ran these tests during a scrub? So please, don't ask how long I sat staring at the screen like an idiot to see if any errors came up. It's too depressing... :D
...
...
...
...
...
...
...And after a looong wait, the errors finally appeared again, but I don't know if stressing the filesystem at this level could produce timeouts even on a "working" system...


As a final test, I ran the whole system from a FreeNAS 0.7.2.8191 amd64 LiveCD, and after many hours of repeating the 'dd' + scrub combination the problems haven't appeared, so I think this rules out a problem with the disks, or even with the mobo SATA controller (does this point to a 'problem' between the AMD chipset and the new AHCI driver, perhaps?)

Any ideas or suggestions about this, or should I test the law of physics that everything can fly... given an adequate dose of bad mood and brute force? :D
 

jesusjd

Cadet
Joined
Feb 28, 2012
Messages
3
A small bump with my latest tests...

I've tried some things in loader.conf, such as forcing the ahci driver to load (ahci_load="YES"), or some variations on 'hint.ahci.x.msi=0', without any luck...

Then I changed ahci_load="YES" to "NO" and set the onboard controller to IDE mode in BIOS (I had already tested IDE mode without adding ahci_load="NO" to loader.conf, with the same results as always). Currently I'm testing a RAIDZ dataset by creating a continuous file (dd if=/dev/zero of=/mnt/TEST/TEST/testfile bs=1m), running a simultaneous scrub and copying about 500GB through CIFS with TeraCopy.

During the last five hours I've got no timeout errors, so I'll keep testing this setup for a day or so, and we'll see if it could be a temporary fix for the timeout issue...


UPDATE:

After >24h of testing, the errors have returned, but more randomly (about 5 timeout errors in 24h), and the system continues working...

This leads me to the following question: I've set the onboard controller to IDE mode and set ahci_load="NO", but during bootup FreeNAS still detects the AMD SATA controller as an AHCI device, so isn't there a way to stop FreeNAS from loading the ahci driver?

Best regards!
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
I started to notice those errors also, since I upgraded to 8.2.0-p1 x64 from 8.0.4 x64. Never saw them before with any previous release.
Code:
Jul 20 23:41:04 pluto kernel: ahcich0: Timeout on slot 17 port 0
Jul 20 23:41:04 pluto kernel: ahcich0: is 00000000 cs 00020000 ss 00020000 rs 00020000 tfd 40 serr 00000000
Jul 21 02:21:13 pluto kernel: ahcich0: Timeout on slot 6 port 0
Jul 21 02:21:13 pluto kernel: ahcich0: is 00000000 cs 000003c0 ss 000003c0 rs 000003c0 tfd 40 serr 00000000
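
A minimal sketch for reading lines like these, under two assumptions worth stating loudly: that every timeout is logged with the "ahcichN: Timeout on slot" wording shown above, and that the "cs" field is the port's command-issue bitmask with one bit per command slot (my reading of the driver output, not something confirmed in this thread). The function names timeouts_per_channel and slots_in_mask are hypothetical helpers made up for illustration.

```sh
#!/bin/sh
# Count "Timeout on slot" lines per AHCI channel in a log read from stdin.
# Prints one "channel count" pair per line, sorted by channel name.
timeouts_per_channel() {
    awk '/ahcich[0-9]+: Timeout on slot/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ahcich[0-9]+:$/) {
                     sub(/:$/, "", $i)   # drop the trailing colon
                     n[$i]++
                 }
         }
         END { for (c in n) print c, n[c] }' | sort
}

# List the bit positions set in a 32-bit hex mask (given without "0x"),
# for example the value printed after "cs" in the log lines above.
slots_in_mask() {
    mask=$((0x$1))
    bit=0
    out=""
    while [ "$bit" -lt 32 ]; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            out="$out${out:+ }$bit"
        fi
        bit=$((bit + 1))
    done
    echo "$out"
}
```

With the two entries above, `slots_in_mask 00020000` prints `17` and `slots_in_mask 000003c0` prints `6 7 8 9`. If the bitmask reading is right, the second timeout happened with four commands outstanding at once, which would fit a problem with queued commands. Something like `grep ahcich /var/log/messages | timeouts_per_channel` would show whether one channel is hit more than the others.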

IMO, this is definitely not related to disks or motherboard. I transfer large files (5-10GB each) to the NAS all the time and I don't see the above timeouts then. They appear randomly, when there is no activity on the NAS. For kicks, I just transferred 100GB to my NAS with no errors. I ran a "# smartctl -a /dev/adaX" on each disk, no errors found.

Running a scrub (which is intensive on disks) for a few hours produces no timeout errors; I've done it several times.

My pool status:
Code:
# zpool status nas
  pool: nas
 state: ONLINE
 scrub: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        nas                                             ONLINE       0     0     0
          raidz2                                        ONLINE       0     0     0
            gptid/3cf03778-1dc5-11e1-8396-002590382e1e  ONLINE       0     0     0
            gptid/3d98e472-1dc5-11e1-8396-002590382e1e  ONLINE       0     0     0
            gptid/3e44f17a-1dc5-11e1-8396-002590382e1e  ONLINE       0     0     0
            gptid/3ef3f815-1dc5-11e1-8396-002590382e1e  ONLINE       0     0     0
            gptid/3f9211d5-1dc5-11e1-8396-002590382e1e  ONLINE       0     0     0
            gptid/4033d56c-1dc5-11e1-8396-002590382e1e  ONLINE       0     0     0
        cache
          gptid/41575359-1dc5-11e1-8396-002590382e1e    ONLINE       0     0     0

errors: No known data errors
 

jesusjd

Cadet
Joined
Feb 28, 2012
Messages
3
I agree. I suspect this has more to do with the ahci driver, since I've tested the same board, with the same disks, SAS cards, RAM, etc. under a FreeNAS 7 version and it doesn't give any errors. In my case the errors are limited to the integrated SATA controller (an AMD SB***), but that doesn't mean the problem can't affect SATA controllers from other vendors...

If I can, I'll try the new NAS4Free release, which is based on FreeBSD 9, and then we'll see if the problem persists...
 

Daflibble

Cadet
Joined
Aug 9, 2012
Messages
4
I have exactly the same problem. AHCI timeouts on ATI IXP700 AHCI SATA controller. Created another thread earlier today..
http://forums.freenas.org/showthread.php?7996-Pls-help-with-disk-dropouts

I think the solution is to force ata(4) driver for the controller (Check this: http://forums.freebsd.org/showthread.php?t=24189). The problem is I don't know how to do it. Any ideas?

I'm getting the same problem with an HP Proliant Microserver NL40 and 4x Samsung 1.5TB HDDs (HD154UI 1AG01119). I originally had 4x 500GB WD HDDs connected, which worked great except for the lack of space. The Samsung disks had been in a Windows 2008 R2 server for over a year and worked fine. When I tried to resilver onto the 1.5TB disks, the disk on port 3 developed lots of bad blocks and Seagate replaced it (with an HD154UI 1AG01118); SeaTools did detect the errors. I rebuilt the RAIDZ and copied the data back over a USB 2.0 connection okay. Now I'm getting timeouts, always on port 0 (not the one that was replaced), when copying via the Gigabit Ethernet connection. I'm unable to detect any errors on the disks when I scan them with SeaTools.

As others on the internet have said, this problem is very annoying. Wish I could afford a new set of large disks to see if it worked with them.
 

hellok

Cadet
Joined
Jul 26, 2012
Messages
8
So here is how I got the problem solved (at least it seems like it is).

1) I have switched the controller into the IDE mode. It helps with 2 of 6 ports on my motherboard - ports 5 and 6 are loaded as atapci.
2) I have played with my loader.conf. Added the following:

ahci_load="NO"
ata_load="YES"
atacard_load="YES"
ataisa_load="YES"
atapci_load="YES"
ataahci_load="NO"
ataamd_load="YES"
ataati_load="YES"
hint.ahci.disabled=1

First four ports are still loaded as AHCI (WHY????). But I don't see any timeouts anymore. I am doing nightly scrubs, have 17 days of uptime (before it was only a couple of days before I had a dropout) and haven't seen timeouts. You probably will want to change
ataamd_load="YES"
ataati_load="YES"
to match your controller.
 

Daflibble

Cadet
Joined
Aug 9, 2012
Messages
4
Well, changing to IDE in BIOS and changing ataamd_load= and ataati_load= to YES has just made it even less stable. DOH!
Thanks for the suggestion though ; )
 

hellok

Cadet
Joined
Jul 26, 2012
Messages
8
Well.. guess what. Last night I had a timeout, again in the middle of the scrub process. The timeout was on a drive that was picked up by the AHCI driver.

Also I should note that I've tried buying a pair of HighPoint RocketRAID 620 controllers, but it didn't help at all. The same timeouts.

There is one thing that I've noticed. I have 3 Samsung F4EG drives and one Hitachi 7k.2000 drive. Only the Samsung drives get dropped. Maybe another coincidence... maybe not.

I am getting tired of this. My friend is buying a pair of 3TB drives soon. I think I will borrow the drives to try Nexenta.
 

poster72

Dabbler
Joined
Aug 1, 2012
Messages
10
I'm getting these suddenly also, how to fix?

So I've had my FreeNAS 8.2 system running fine for about 3-4 months, copied all types of data w/o issue, but last night I started getting AHCI timeouts.

System basics:
Dell Optiplex 780 box
16GB RAM
four SAMSUNG EcoGreen F4 ST2000DL004 2TB drives
ZFS pool

So far I've only noticed AHCI timeouts when copying data, not when sitting idle. I've noticed them on CH0 and CH3 so far. I would appreciate any troubleshooting tips anyone has. This puts me near the end of my wits with FreeNAS. After all my initial kernel panics, and now these AHCI timeouts!!!

I receive nightly emails from FreeNAS, which may highlight the issue. Can anyone tell me what all this might mean?

The day before (daily run output):
Disk status:
Filesystem             Size    Used   Avail Capacity  Mounted on
/dev/ufs/FreeNASs2a    927M    353M    499M    41%    /
devfs                  1.0K    1.0K      0B   100%    /dev
/dev/md0               4.6M    1.8M    2.4M    43%    /etc
/dev/md1               824K    2.0K    756K     0%    /mnt
/dev/md2               149M     14M    124M    10%    /var
/dev/ufs/FreeNASs4      20M    1.3M     17M     7%    /data
Data                   5.3T    2.8T    2.5T    53%    /mnt/Data

Last dump(s) done (Dump '>' file systems):

Checking status of zfs pools:
all pools are healthy

Checking status of ATA raid partitions:

Checking status of gmirror(8) devices:

Checking status of graid3(8) devices:

Checking status of gstripe(8) devices:


Checking status of 3ware RAID controllers:
Alarms (most recent first):
No new alarms.




Today's (after the AHCI timeouts started):
Checking status of zfs pools:
  pool: Data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        Data        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada3p2  ONLINE       5 22.2K     0

errors: No known data errors

Checking status of ATA raid partitions:

Checking status of gmirror(8) devices:

Checking status of graid3(8) devices:

Checking status of gstripe(8) devices:


Checking status of 3ware RAID controllers:
Alarms (most recent first):
+++ /var/log/3ware_raid_alarms.today 2012-09-12 03:01:01.000000000 -0600
@@ -0,0 +1 @@
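
The part of that email that matters is the row with non-zero counters (ada3p2, with 5 read and 22.2K write errors). A minimal sketch for pulling such rows out of `zpool status` output follows; the function name zpool_errors is a hypothetical helper, and it assumes the column order NAME STATE READ WRITE CKSUM shown above.

```sh
#!/bin/sh
# Print every pool member whose READ, WRITE or CKSUM counter is not "0".
# Reads `zpool status` output on stdin; zpool_errors is a made-up name.
zpool_errors() {
    awk 'NF >= 5 &&
         $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ &&
         ($3 != "0" || $4 != "0" || $5 != "0") {
             # Field 1 is the device, fields 3-5 are the error counters.
             print $1, "read=" $3, "write=" $4, "cksum=" $5
         }'
}
```

Usage would be something like `zpool status Data | zpool_errors`, which for the report above should print only the ada3p2 row. A clean pool prints nothing.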
 

hellok

Cadet
Joined
Jul 26, 2012
Messages
8
I think the problem is in Samsung disks. Or more specifically, Samsung's NCQ implementation. The reason is that the timeouts disappear if you switch the controller to IDE mode (which disables NCQ). You can also try to disable ahci.ko (which is impossible with stock FreeNAS - seems like it is compiled into the kernel and not loaded as a module - please correct me if I'm wrong here).

For sure I can say that I've never seen timeouts on my Hitachi drive even with AHCI, nor on my Samsung drives when hooked up to a controller in IDE mode (make sure it really is in IDE mode - the drive should be picked up by atapci or something like that).

Anyway, I got tired of this and have switched to Nexenta. I've been running it for 3 days already. NO timeouts yet. I've tried rsyncing a 1TB folder to an empty folder on the same volume while doing a scrub at the same time - rock solid for now. This isn't enough to conclude that there are no timeouts on Nexenta/Solaris, but it's promising at least =)
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
I just ran across this: problems with AHCI on FreeBSD 8.2. Assuming switching to the ata driver stops the timeouts, you can try:
Code:
camcontrol tags ada0 -N 1
If it works for your drive this will be lost on reboot I believe.
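
Since the setting does not survive a reboot, it would have to be reapplied at every boot. A minimal sketch, assuming you know which device names your affected drives have; the function name limit_tags and the DRY_RUN switch are hypothetical, added here so the commands can be checked before anything is changed.

```sh
#!/bin/sh
# Apply "camcontrol tags <disk> -N 1" to each disk named on the command line.
# limit_tags and DRY_RUN are made-up names for this sketch.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 runs them.
limit_tags() {
    for disk in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "camcontrol tags $disk -N 1"
        else
            camcontrol tags "$disk" -N 1
        fi
    done
}
```

For example, `limit_tags ada0 ada1 ada2` prints one camcontrol line per drive, and `DRY_RUN=0 limit_tags ada0 ada1 ada2` would run them. Where to hook this into the boot sequence on FreeNAS is left open here.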

You could also test out 8.3.0-BETA2 and see whether or not you still have timeouts.
 

poster72

Dabbler
Joined
Aug 1, 2012
Messages
10
hellok said:
I think the problem is in Samsung disks.... This isn't enough to make a conclusion that there are no timeouts on Nexenta//Solaris but promising at least =)

Thanks for the info. I'm kind of a noob in this world, not sure what Nexenta is.

I am testing an individual drive out now, as FreeNAS was finally saying it was unavailable/unreachable. Figures I'd pick the type of drive that may have issues! So far no issues with the test, which makes me think it really is something like you said.

Is it OK to run IDE mode in BIOS? I see that I do have the option for ATA mode in BIOS. Would it affect anything performance-wise?

Honestly I'm hoping this drive is bad and that after replacing it things go back to normal. I was pretty happy with it after working out the kernel panics initially!
 

hellok

Cadet
Joined
Jul 26, 2012
Messages
8
paleoN said:
I just ran across this: problems with AHCI on FreeBSD 8.2. Assuming switching to the ata driver stops the timeouts, you can try:
Code:
camcontrol tags ada0 -N 1
If it works for your drive this will be lost on reboot I believe.

You could also test out 8.3.0-BETA2 and see whether or not you still have timeouts.

This actually makes perfect sense!! The link says that the command basically disables NCQ for the drive. Nexenta has a field to adjust this value and recommends setting it to 1 for SATA disks!!! (Which I did)

poster72 said:
Is it ok to run IDE mode in BIOS? [...] would it affect anything performance wise?

Honestly im hoping this drive is bad and that after replacing things go back to normal. [...]

I don't think the problem is in the drive itself, and as you said, you had timeouts on different channels. It's just those awesome Samsung programmers who wrote the firmware (have you heard about the potential bug on your drives which can lead to data loss??
Check this:
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks)

From what I've read, disabling NCQ can cause a performance penalty if you have a lot of random reads/writes. I wouldn't worry about this too much. If random access mattered, you wouldn't buy 5.4k drives anyway)))

I have switched my controller to IDE/ATA in BIOS. The problem is that only 2 out of 6 ports got picked up by the atapci driver instead of ahci in my case.. But definitely give it a shot!
 

poster72

Dabbler
Joined
Aug 1, 2012
Messages
10
Well, after the drive tested out with SeaTools I put it back in the FreeNAS box and it had a checksum error. I ran zpool clear and then the drive got resilvered. It now says it's all healthy! But I feel uneasy about it all for sure. I've been copying data over during the day and so far haven't seen any timeouts. I don't know why they would have started suddenly unless there was a h/w fault.

I keep seeing this in the daily email, but I'm not sure what it means as I don't have a 3ware controller?
Checking status of 3ware RAID controllers:
Alarms (most recent first):
+++ /var/log/3ware_raid_alarms.today 2012-09-13 03:01:00.000000000 -0600
@@ -0,0 +1 @@
+
 

eddieone

Cadet
Joined
Oct 8, 2017
Messages
1
I was getting the OP error on boot. I had to unplug the l2arc cache thing. :(

I guess that is a one time deal and can't be rebuilt.

I didn't read all the comments, just wanted to put my 0.02 in. 2012 :o
 