AHCI vs. IDE mismatch (was: Mixing 4K and 512-byte sector sizes in a vdev?)


deafen

Explorer
Joined
Jan 11, 2014
Messages
71
EDIT: I was totally, completely wrong in my diagnosis of the problem. See post #2 for what really happened.

I've been slowly investigating some performance issues on my system, triggered by a resilver operation that took over four days to complete.

The vdev is composed of six Seagate ST3000DM001 (3TB SATA) drives in RAIDZ2. They are all connected to motherboard SATA ports on an ASUS M5A78L-M LX+ board.

As part of my investigation, I started a scrub and pulled up gstat. Clearly, two of the drives are using 4K sectors, while the others are using 512-byte sectors, as shown by the number of ops/s:

Code:
dT: 1.001s  w: 1.000s  filter: gpt
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| gptid/4c6e0af2-1083-11e4-8b26-50465d6afb74
    2   2336   2336  75549    0.4      0      0    0.0   64.8| gptid/f645c671-dfe0-11e2-b96b-50465d6afb74
    2   2308   2308  75337    0.4      0      0    0.0   58.0| gptid/f6959517-dfe0-11e2-b96b-50465d6afb74
    2   2253   2253  75361    0.4      0      0    0.0   61.7| gptid/f6e96295-dfe0-11e2-b96b-50465d6afb74
    2   2295   2295  75481    0.4      0      0    0.0   62.0| gptid/95942e35-0c5d-11e4-934c-50465d6afb74
    2    620    620  76229    3.0      0      0    0.0   92.5| gptid/f7c2859e-dfe0-11e2-b96b-50465d6afb74
    2    629    629  76157    3.5      0      0    0.0   96.9| gptid/f8333b58-dfe0-11e2-b96b-50465d6afb74


(Ignore the first line; it's an SSD that I was using for L2ARC and have since removed during testing.)

Now, mind you, the scrub is proceeding at a relatively fast speed (>200 MB/s). But I'm concerned that in a non-synthetic read scenario this might cause performance issues, once we're doing random reads and sector alignment comes into play.

The fact that the two 4K-using drives are almost 50% busier is also concerning, because it means I'm bottlenecking on just these two drives.

So is this something worth dealing with? Starting from scratch isn't really an option, but I could remove, wipe, and reinsert one drive at a time. Any thoughts or comments are welcome.
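
For reference, the reported sector sizes can also be checked directly rather than inferred from ops/s. A minimal sketch, assuming the pool members are ada1 through ada6 (substitute whatever your system actually shows):

Code:
# Print the logical sector size and stripe size (the physical-sector hint) per drive.
# Device names ada1-ada6 are assumptions; adjust them to match your own disks.
for d in ada1 ada2 ada3 ada4 ada5 ada6; do
    echo "== $d =="
    diskinfo -v /dev/$d | egrep 'sectorsize|stripesize'
done

A 512e/Advanced Format drive will typically report sectorsize 512 with stripesize 4096.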
 

deafen

Explorer
Joined
Jan 11, 2014
Messages
71
Okay, I did some more digging, and it looks like two of my SATA ports are misconfigured in IDE mode (as opposed to AHCI). This means that two of the drives are hanging off of the same bus, as well as operating with the legacy IDE command set. Per camcontrol:

Code:
[root@delta ~]# camcontrol devlist -v
scbus0 on ahcich0 bus 0:
<C300-MTFDBAK128MAG 0006>          at scbus0 target 0 lun 0 (ada0,pass0)
<>                                 at scbus0 target -1 lun -1 ()
scbus1 on ahcich1 bus 0:
<>                                 at scbus1 target -1 lun -1 ()
scbus2 on ata2 bus 0:
<>                                 at scbus2 target -1 lun -1 ()
scbus3 on ahcich2 bus 0:
<ST3000DM001-1CH166 CC26>          at scbus3 target 0 lun 0 (ada1,pass1)
<>                                 at scbus3 target -1 lun -1 ()
scbus4 on ahcich3 bus 0:
<ST3000DM001-1CH166 CC26>          at scbus4 target 0 lun 0 (ada2,pass2)
<>                                 at scbus4 target -1 lun -1 ()
scbus5 on ahcich4 bus 0:
<ST3000DM001-1CH166 CC26>          at scbus5 target 0 lun 0 (ada3,pass3)
<>                                 at scbus5 target -1 lun -1 ()
scbus6 on ahcich5 bus 0:
<ST3000DM001-1E6166 SC48>          at scbus6 target 0 lun 0 (ada4,pass4)
<>                                 at scbus6 target -1 lun -1 ()
scbus7 on ata0 bus 0:
<>                                 at scbus7 target -1 lun -1 ()
scbus8 on ata1 bus 0:
<ST3000DM001-1E6166 CC45>          at scbus8 target 0 lun 0 (ada5,pass5)
<ST3000DM001-1E6166 CC45>          at scbus8 target 1 lun 0 (ada6,pass6)
<>                                 at scbus8 target -1 lun -1 ()
scbus9 on ctl2cam0 bus 0:
<>                                 at scbus9 target -1 lun -1 ()
scbus10 on umass-sim0 bus 0:
<PNY USB 2.0 FD 8192>              at scbus10 target 0 lun 0 (pass7,da0)
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun -1 (xpt0)
[root@delta ~]#


This probably explains my slow resilver operation, as well. I'm 3000 miles away from the server, but this weekend I'll be back and I'll see what happens when I twiddle that setting in the BIOS.
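
For anyone wanting to check the same thing on their own box, the attachment driver and negotiated transfer mode also show up in the boot messages; a quick sketch (ada5 is just an example taken from the list above):

Code:
# Drives on an ahcichN channel normally report a SATA transfer rate (e.g. 300 or 600 MB/s);
# drives attached to a legacy ataN channel fall back to a UDMA mode instead.
dmesg | egrep '^ada[0-9]+(:| at )'
# ATA identify data for one of the two drives sitting on the legacy ata1 channel:
camcontrol identify ada5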
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Uhh, before you go into this any more, you should read up on %busy. It doesn't mean anything even close to what you think it means. Look at the throughput and see how much the drives differ (hint: it's less than 1 MB/sec). So before you get all upset about this, realize that your drives are all performing at approximately the same speed, so you don't have anything to worry about.
 

deafen

Explorer
Joined
Jan 11, 2014
Messages
71
Actually, I know exactly what it means (the percentage of time that there's at least one outstanding transaction). Those drives are taking longer to return a raw read (3.5 ms vs 0.4 ms), so the ZFS layer has to wait for them before it can reconstruct the logical block. So yes, unless the fundamental concepts of RAID have changed, it is bottlenecking on those drives.
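
A rough back-of-envelope on those numbers, using the L(q) and ms/r columns from the gstat sample above (Little's law: ops/s is roughly queue depth divided by service time):

Code:
# With ~2 requests in flight (the L(q) column):
#   slow pair (3.0-3.5 ms/r):  2 / 0.0030-0.0035 s  ~  570-670 ops/s  (gstat showed ~620-630)
#   the others (0.4 ms/r):     2 / 0.0004 s         ~ 5000 ops/s      (gstat showed only ~2300,
#                              i.e. they sit idle part of the time, waiting on the slow pair's
#                              piece of each stripe)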

But I do appreciate the irony of you being both condescending and wrong. Look, I made a dumb mistake and assumption when I started this, and am appropriately sheepish. And I get that you really know your stuff, 12K posts and 600+ likes in. But that doesn't mean you need to be jerky about it.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I wasn't being jerky, but good luck on your problem.
 

Nick Howard

Contributor
Joined
May 20, 2014
Messages
129
deafen said:
Actually, I know exactly what it means (the percentage of time that there's at least one outstanding transaction). Those drives are taking longer to return a raw read (3.5 ms vs 0.4 ms), so the ZFS layer has to wait for them before it can reconstruct the logical block. So yes, unless the fundamental concepts of RAID have changed, it is bottlenecking on those drives.

But I do appreciate the irony of you being both condescending and wrong. Look, I made a dumb mistake and assumption when I started this, and am appropriately sheepish. And I get that you really know your stuff, 12K posts and 600+ likes in. But that doesn't mean you need to be jerky about it.

He's pretty good at that, and seems totally oblivious about it too. How can so many people be wrong?
 

deafen

Explorer
Joined
Jan 11, 2014
Messages
71
Well, I'd say that tweak made just a hair of difference ...

Code:
dT: 1.001s  w: 1.000s  filter: gpt
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| gptid/4c6e0af2-1083-11e4-8b26-50465d6afb74
    2    984    984 117076    1.9      0      0    0.0   96.3| gptid/f645c671-dfe0-11e2-b96b-50465d6afb74
    2    986    986 117056    1.9      0      0    0.0   96.2| gptid/f6959517-dfe0-11e2-b96b-50465d6afb74
    2    982    982 116928    1.9      0      0    0.0   96.2| gptid/f6e96295-dfe0-11e2-b96b-50465d6afb74
    2   1004   1004 117815    1.9      0      0    0.0   96.2| gptid/95942e35-0c5d-11e4-934c-50465d6afb74
    2    953    953 117072    2.1      0      0    0.0  100.1| gptid/f7c2859e-dfe0-11e2-b96b-50465d6afb74
    2    995    995 117420    1.9      0      0    0.0   97.4| gptid/f8333b58-dfe0-11e2-b96b-50465d6afb74


... but what's a 56% increase in throughput between friends? :)
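
(For the record, that comes from summing the kBps columns: roughly 454 MB/s aggregate across the six disks in the earlier gstat sample versus roughly 703 MB/s here, a bit over 1.5x.)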

So the disk subsystem is healthier. That's some good news. Now on to tackle the other nagging issue ... why my CIFS performance sucks. Onward!
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
What are you doing with CIFS that makes it suck?
 

deafen

Explorer
Joined
Jan 11, 2014
Messages
71
Well, "sucks" is kinda overstating it. I'm getting about 50-60 MB/s transfer speeds on big files (1GbE). Might be limited by the array itself. So my first step, now that the actual disk subsystem is as healthy as it can be, is to do some local benchmarking. Sometime this weekend.

In theory, I should be able to easily saturate GbE with this setup (or come close - right now my Intel NIC is on the PCI bus, not PCIe, so I'm actually somewhat more limited, but not down to 600 Mbit/s). Possible that the RAIDZ2 overhead prevents that. Only way to find out is to be methodical, eh?
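
For the local benchmarking pass, a rough first cut would be something like the following. This is a sketch, not a rigorous benchmark: the dataset name tank/test is made up, compression has to be off or /dev/zero gives meaningless numbers, and the file should be larger than RAM so the read-back isn't served from ARC.

Code:
# Sequential write: ~20 GB of zeroes into a throwaway dataset with compression disabled.
zfs create -o compression=off tank/test
dd if=/dev/zero of=/mnt/tank/test/ddfile bs=1m count=20000
# Sequential read: stream the same file back.
dd if=/mnt/tank/test/ddfile of=/dev/null bs=1m
# Clean up afterwards.
zfs destroy tank/test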

Edit: You're asking about workload. Right. Still not completely awake.

Mostly the issue is that my wife is using this as storage for her photography business. She's a wedding photographer who starts with raw files (25 MB each) and ends up with PSD files that can get enormous (sometimes up to 300 MB). Now, I'm obviously not expecting local disk speeds, but I do want to get as close to wire speed as I can. She's on a Mac, but I see similar speeds with both CIFS and AFP.

In a perfect world, I'd be able to do transparent local caching that was automatically propagated back to the NAS (think FusionIO). But we ain't got that kinda money.
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
OK. Just a couple of thoughts:
- 600 Mbps is about what I see from an old 1 Gbps Intel NIC in a PCI slot, so running an iperf test may quickly answer your question (see the sketch below);
- A fix was recently committed to FreeNAS 9.3 that significantly improves the file creation rate for CIFS. It may not be a significant factor for 300 MB files, but for 300 KB files it possibly is. You may want to try a recent nightly build after you've done the other tests.
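
A minimal iperf run along those lines (assuming iperf is available on both ends; the address below is a placeholder):

Code:
# On the FreeNAS box:
iperf -s
# On the workstation (substitute the server's IP), 30-second test:
iperf -c 192.168.1.10 -t 30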
 

deafen

Explorer
Joined
Jan 11, 2014
Messages
71
Good point. I've got a PCIe NIC coming next week (also Intel), so I'm not going to sweat it until then.
 