Read speeds impacted by changing vdev.max_pending?


titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Was helping a friend of mine set up a fresh FreeNAS build.

Hardware overview:

Supermicro socket 1155 motherboard (don't remember exactly which one).
i3-3xxx (3rd gen with ECC support, don't remember exactly which model).
8 GB ECC RAM (yeah, I would have liked to see 16 GB).

4x WD 3 TB Red in RAID-Z2 (hooked up to the motherboard SATA ports).

Everything set up, and FreeNAS 9.1.1 x64 installed.

Went to test local pool bandwidth using dd with bs=1m and count=20k
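
Roughly like this (the pool name and test file path are just examples, not necessarily what was used on this box):

# write test: ~20 GB of zeros into the pool
dd if=/dev/zero of=/mnt/tank/ddtest bs=1m count=20k

# read test: the same file back out
dd if=/mnt/tank/ddtest of=/dev/null bs=1m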

Write speeds were good, ~250 MB/sec.

Read speeds were really bad. About 30 MB/sec.

Verified CPU usage was OK. Checked gstat, and one of the hard drives was showing 100% utilization while the other three were mostly idle. It wasn't always the same single drive pegged at 100%; it would change between drives.
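
(gstat was run with a short refresh interval to watch the per-disk %busy column, something like:)

# refresh once a second; -f optionally filters to the disk devices
gstat -I 1s -f ada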

Destroyed the pool to test individual drives. Simultaneous dd's to the drives were fine, and simultaneous dd's from the drives were fine. Each drive could read at a steady ~140 MB/sec.
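
The simultaneous read test looked roughly like this (ada0-ada3 are example device names; the write pass was done the same way with if/of swapped, which is destructive and was only safe because the pool had been destroyed):

# read from all four disks at once, in the background
for d in ada0 ada1 ada2 ada3; do
    dd if=/dev/$d of=/dev/null bs=1m count=10k &
done
wait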

Recreated a stripe pool for testing. Reads were still very low, 30-40 MB/sec. gstat still showed one or two drives at 100% while the others were at 10-20%.

Destroyed the stripe and recreated the Z2 pool, since by then I was fairly sure RAID-Z2 itself wasn't causing the issue.

Tried disabling read-ahead (prefetch). Reads instantly went from 30 MB/sec to about 120 MB/sec. I thought that was weird; read-ahead is normally only disabled for highly random read I/O, like databases, where ZFS can't predict the next read. But I also noticed that with read-ahead disabled, the current queue depth on the drives was only hitting 3-4 instead of 10.

Re-enabled read-ahead, but set vfs.zfs.vdev.min_pending=1 and vfs.zfs.vdev.max_pending=1.

That got me back to ~250 MB/sec reads, with all four drives showing a constant, even load in gstat. Tried setting max_pending to 2, and read performance tanked again, with one hard drive constantly pinned at 100%. Set max_pending back to 1 and everything was fine.
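
For anyone following along, these were the knobs involved. On this box they were changed at runtime with sysctl; whether a given FreeBSD/FreeNAS build allows that, or needs them set as tunables instead, may vary:

# disable / re-enable file-level prefetch (read-ahead)
sysctl vfs.zfs.prefetch_disable=1
sysctl vfs.zfs.prefetch_disable=0

# per-vdev I/O queue depth limits (defaults are 4 and 10)
sysctl vfs.zfs.vdev.min_pending=1
sysctl vfs.zfs.vdev.max_pending=1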

Tested scrub speed, and it seemed fine: about 450 MB/sec for the 100 GB of test data on the pool.
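
(The scrub rate was just read off zpool status while the scrub ran; the pool name here is an example:)

zpool scrub tank
zpool status tank    # the 'scan:' line shows the current scrub rate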

This is the first time I've done anything with WD Reds, but I've never seen max_pending have that much of an effect. My main NAS has 11 Seagate 3 TB drives, and the default max_pending=10 works great; sequential reads from a file run a little over 1 GB/sec.

Anybody else had to force max_pending to 1 for decent read performance?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
It sounds like you might not have AHCI enabled in the BIOS. I'd expect that setting the pending values to 1 like you did will give you great dd tests, but worse actual performance overall. You should investigate what is actually wrong instead of trying to adjust the tunables to make it work "right".

Something else to check is to verify that the x64 version of FreeNAS is installed. If you go with the x86 version you're instantly limited to 4 GB of RAM, which disables prefetch, and then performance tanks badly.
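
For what it's worth, both of those are quick to check from the shell, something along these lines:

dmesg | grep -i ahci     # disks should attach via the ahci driver (ada devices)
uname -m                 # should report amd64 for the x64 build
sysctl hw.physmem        # RAM actually visible to the OS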
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Forgot to mention: yes, AHCI was enabled. Verified in the BIOS.

x64 version used. No problems with memory.

In this case, disabling prefetch actually helped performance immensely, from 30 MB/sec to about 120 MB/sec. That led me to investigate the queue depth; re-enabling prefetch but limiting the max queue depth brought it back to 250 MB/sec.

A post in this thread:

http://forums.freebsd.org/showthread.php?t=17941

actually seems to indicate a max queue depth of 1 ought to be optimal with AHCI and NCQ. Isn't it a software (OS) queue depth, independent of the device? If the disk handles its own queuing natively, then not using the OS queue depth would make sense.
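
One way to look at what the device side advertises and is actually doing (ada0 is just an example device name) would be camcontrol:

camcontrol identify ada0 | grep -i queuing   # NCQ support and tag count (typically 32)
camcontrol tags ada0 -v                      # current/maximum outstanding tags for the device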

I don't know what you mean by 'worse performance overall', but if the zpool can't even read at more than 30 MB/sec locally, before the network is even involved, any CIFS performance is going to top out at no more than 30 MB/sec. That's far less than gigabit, and more akin to USB 2.0. I'd say the 30 MB/sec is the 'worse performance overall'.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
titan_rw said:

http://forums.freebsd.org/showthread.php?t=17941

actually seems to indicate a max queue depth of 1 ought to be optimal with AHCI and NCQ. Isn't it a software (OS) queue depth, independent of the device? If the disk handles its own queuing natively, then not using the OS queue depth would make sense.

I'm pretty sure that forcing a min and max of 1 means NCQ/TCQ really doesn't do you any good, as you'll only have one transaction outstanding at any given time; you've essentially circumvented it. There are times when you'd want only one, and times when you'd want multiple. Factors I can think of that affect it are the location of the files on the disk, the quantity of data to be read/written, how full your pool is, and your read-ahead cache variable (and a whole slew of other tunables).

For "most" home users I'd expect 4 and 10 should provide pretty good performance as reading your movie file for streaming out of sequence, then reassembling it in order in RAM is preferred to a bunch of out-of-order seeks. This is where NCQ/TCQ can provide its benefit since it optimizes the head seeks. Even of you had a 100MB text file to open I can't imagine the defaults providing a potential penalty that is significant enough that you'd ask yourself why its taking so long unless it HEAVILY fragmented from a pool thats 99% full or something. And at that point you'd be assuming its your own fault for letting the pool get filled.

Just as a disclaimer, there's very little documentation on those two parameters, so I'm not even 100% sure what counts as a "transaction" as it relates to the tunable. Normally a single transaction can create/modify/delete multiple files simultaneously. The relationship between transactions and the queue (and queue depth) is a big grey area for me and not well documented as far as I can tell.

titan_rw said:

I don't know what you mean by 'worse performance overall', but if the zpool can't even read at more than 30 MB/sec locally, before the network is even involved, any CIFS performance is going to top out at no more than 30 MB/sec. That's far less than gigabit, and more akin to USB 2.0. I'd say the 30 MB/sec is the 'worse performance overall'.

As I tried to discuss above, different types of loads favor different settings. This is why my pool gives 600 MB/sec on local reads, but if I had an ESXi datastore via NFS I could expect horrendous performance. I'm sure you've seen the constant influx of threads regarding NFS; it's a different type of load and requires different parameters. Someone bought some hardware and tuned his system, and he did a bunch of things that most people would consider crazy: he disabled prefetch, shrank the read-ahead cache by a lot, and changed several other things. The goal was to trade throughput for improved latency.

It could be that read performance is bad with a queue depth >1 because you are constantly reading ahead, then discarding and re-reading the just-discarded data, because the queue depth is too big for the amount of RAM you have. But on the flip side, forcing a small transaction queue adds some inefficiencies to ZFS (in both performance and actual disk space usage) with regard to writes (and future reads of that data). In effect your server may be constantly biting off more than it can chew, then trying to fix it by spitting it all out and taking another oversized bite.
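
If you wanted to check that theory, the ZFS prefetch counters show up via sysctl (exact counter names vary a bit between versions):

sysctl kstat.zfs.misc.arcstats | grep prefetch   # prefetch hits/misses in the ARC
sysctl kstat.zfs.misc.zfetchstats                # file-level prefetcher (zfetch) stats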

I'd actually be curious to see this server in action. It's quite odd to see the results you are seeing considering your hardware. It could be related to your 8 GB of RAM; I wonder what would happen if you upped it to 16 GB. This kind of reminds me of my pool: I had less RAM than I should have for the pool size, and pool read performance suffered big time. I couldn't stream a single movie from my server. But write speeds were good as long as I wasn't modifying a file (which would result in reads, which would kill performance overall).

If you have an extra stick of RAM I'd LOVE to see you test it in the system, if only to see what the end result is.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
It's not a box I have physical access to, being that it's ~2,000 miles away. I'll ask if there's any more RAM that can be plugged in, even temporarily. I think there were an extra 2x 2 GB sticks, which would make 12 GB total. I'll let you know.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Never mind. The NAS has been put in place, and the spare RAM has been returned.

I had suggested 16 GB, but the builder went with 8. The zpool is ~5.1 TB of usable storage, so I would have thought 8 GB would be sufficient. It is working fine with max_pending set to 1. It's not going to be doing any NFS, iSCSI, or anything else that would put a huge random I/O load on it.
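
To keep the setting across reboots it would normally go in through the FreeNAS tunables GUI rather than by hand-editing files; the equivalent /boot/loader.conf entries would look something like:

vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"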

I still wonder how max/min pending interact with AHCI and NCQ-capable drives.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Actually, even copying large files is somewhat "random", as ZFS doesn't necessarily store large files consecutively. But as long as you are happy with the performance and disk space usage, I'd leave it like it is.
 

datnus

Contributor
Joined
Jan 25, 2013
Messages
102
Check the HDD's IOPS performance at different queue depths.
My HDD peaks in IOPS at an I/O queue depth of 32, so I set max_pending to 32, and IOPS increased by at least 10-20%.
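
A rough way to measure that, assuming a benchmarking tool like fio is installed (it's not part of a stock FreeNAS install) and using ada0 purely as an example device:

# random-read IOPS at increasing queue depths
for qd in 1 2 4 8 16 32; do
    echo "queue depth $qd:"
    fio --name=qd$qd --filename=/dev/ada0 --ioengine=posixaio \
        --rw=randread --bs=4k --iodepth=$qd --runtime=30 --time_based | grep -i iops
done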
 