vfs.zfs.vdev.max_active (previously vfs.zfs.vdev.max_pending)


EEE3

Dabbler
Joined
Nov 27, 2013
Messages
12
Hi everyone,
I know this topic has been discussed a few times previously but there doesn't seem to be a consensus so I figured I'd throw in my experience. I have the following setup:

Supermicro X10SLM-LN4F motherboard
Intel Xeon E3-1230 v3 CPU
32GB ECC RAM
LSI 9211-8I PCIe running IT firmware
10x 3TB WD Reds

8 of the drives are connected to the LSI adapter and the other 2 are connected to the integrated motherboard SATA3 ports. It is set up as a single 10-disk RAIDZ2 pool.

My main focus is on read/write throughput, not IOPS or concurrency. I'm trying to evaluate whether it's worth investing in 10GigE, because 1GigE is currently my bottleneck.

Using a completely untuned "default" FreeNAS 9.1.1 or 9.2.1.2 install, my local dd tests achieve about 225MB/s for both reads and writes. Looking at gstat I see the individual drives all hitting around 35MB/s very uniformly. Not abysmal, but also not very impressive for the hardware I'm using.
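For reference, the local tests are roughly of this form (the dataset path and file size are just examples; note that if compression is enabled on the test dataset, the /dev/zero numbers are meaningless):

dd if=/dev/zero of=/mnt/tank/testfile bs=1M count=32768   # sequential write, large enough that ARC caching doesn't skew the read-back
dd if=/mnt/tank/testfile of=/dev/null bs=1M               # sequential read
gstat                                                     # per-disk throughput, watched in another session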

In FreeNAS 9.1.1, if I set vfs.zfs.vdev.max_pending to 1, or in 9.2.1.2 if I set vfs.zfs.vdev.max_active (which seems to have replaced vfs.zfs.vdev.max_pending) to 1, and change nothing else, my read and write speeds instantly more than double. Reads are in the 550MB/s range and writes are around 475MB/s. Looking at gstat again, all of the drives are now sustaining 70+MB/s. I see no negative impact in my typical usage and IOPS seem to remain exactly the same, although as I said that doesn't particularly matter to me.
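For anyone wanting to reproduce this, the change itself is just a sysctl, something along these lines (set it live to test, then add it as a sysctl in the GUI if you want it to persist across reboots; if either one turns out to be read-only on your release, set it as a loader tunable instead):

sysctl vfs.zfs.vdev.max_active=1     # FreeNAS 9.2.1.2
sysctl vfs.zfs.vdev.max_pending=1    # FreeNAS 9.1.1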

What's bugging me is why this isn't affecting more people, and figuring out what's unique to the handful of us who are seeing this behavior. Are WD Reds the common link? Perhaps the integrated motherboard SATA controller has some weird NCQ bug? I would love to take that out of the equation, but I have data on the pool now so that would be a huge hassle.

I have checked the drives and cables and can't find any sign of anything being faulty. SMART reports are pristine, and if I use dd to read and write from the raw devices (e.g. /dev/da0), they all shoot right up to the advertised 150MB/s with no problem. Obviously I ran that particular test while the drives were not part of a pool.
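(For completeness, the raw-device check was simply along these lines; the write direction obviously destroys data, which is why it was only done before the drives were part of a pool:)

dd if=/dev/da0 of=/dev/null bs=1M count=8192    # raw sequential read
dd if=/dev/zero of=/dev/da0 bs=1M count=8192    # raw sequential write - DESTRUCTIVE, empty disks only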

Some additional things I've tried (admittedly somewhat just poking randomly) after going back to the default value of vfs.zfs.vdev.max_active=1000, trying to get the performance back up:

1) I've used camcontrol to try various NCQ queue depths, from the minimum of 2 up through 32 for the integrated controller and 255 for the LSI (a sketch of the commands is after this list). Zero impact.

2) I've tried both disabling TLER and running the default value of 7.0s for reads and writes; nothing changes.

3) I've toggled all of the BIOS PCIe settings for power saving and such and it has made no impact.
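For reference, the queue-depth poking mentioned in point 1 was done roughly like this (da0 is just an example device; repeat for each drive):

camcontrol tags da0 -v        # show the current and allowed tag (queue) depth
camcontrol tags da0 -N 32     # set the tagged/NCQ queue depth to 32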

Does anyone have any other suggestions or theories? I completely realize I could change hardware or redesign the pool to achieve greater performance -- I'm not asking for that. To reiterate, I'm just curious why this tweak is so helpful in some setups and not others. Thanks.
 

EEE3

Dabbler
Joined
Nov 27, 2013
Messages
12
As a minor follow up, in camcontrol I tried setting the NCQ queue depth to 1, despite it saying the allowable range begins at 2, and I'm seeing great performance. In other words:

(NCQ disabled on all drives and vfs.zfs.vdev.max_active=1000) == (NCQ enabled on all drives and vfs.zfs.vdev.max_active=1) == great performance.

So this is far from conclusive, but I'm starting to believe the NCQ implementation on the WD Reds just doesn't play well with the behavior of ZFS.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
You have to consider what that setting does. Let me ask this, because the answer is very long and I really don't want to write a book if I don't have to: how do you *think* it works? Don't look at the numbers, just the theory behind how it works and how you expect the pool to respond to the change. In particular, imagine you have 1 user versus 5 users simultaneously using the system.
 

EEE3

Dabbler
Joined
Nov 27, 2013
Messages
12
My assumption is that it essentially limits the pending I/O operations per vdev to 1. This defeats the internal drive buffering completely as well as the internal kernel buffering. If that's wrong, please correct me.

Before you even say it, yes, this seems nuts and I'd expect it to harm a single stream and completely decimate concurrent access. But my experience is the complete opposite: it seems to be helping both. As I sort of concluded in my second post, the only theory I have is that the buffering on the WD Reds either just stinks altogether or is somehow counterproductive to the way ZFS wants to work. Or maybe I have one drive where the internal buffering is screwed up... who knows. I realize there are a gazillion holes in that theory.

I'm not at all suggesting setting that sysctl to 1 is a good thing. Quite the opposite; it's just that it seems to be nothing but good in my setup, and I'm actually bothered by that and want to understand why!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
You are right, but it's more complex at deeper levels. If you have only a single user, a depth of 1 is going to give the best performance. If you plan to have 5 concurrent users, 5 is better. There's a performance penalty, from several factors in both the code and the physical hardware, when the setting doesn't match the actual number of concurrent pending transactions. In short, 5 is generally considered the "best" tradeoff for most servers: it minimizes the inefficiencies of very low values while also being high enough that the code and physical limitations don't become a concern.

It's a "this" for "that" type of teeter totter. You are trading one thing for another. If you have 10 different things writing to your server at the same time you'd be here whining very very loudly at your sub 10MB/sec transfer rates. My advice is to leave it at 5. There's many reasons for this, and I really don't feel like writing a book on the topic tonight. If you really want to know the story and have Mumble or Skype we can talk about it, but the bottom line is that there's drawbacks you probably aren't aware of by choosing 1 instead of 5. It's not a "something for nothing", and those tradeoffs are potentially debts that you may not be able to pay back later to the ZFS gods. Note that your data is just as safe, but you may not like the performance later.
 

aufalien

Patron
Joined
Jul 25, 2013
Messages
374
Hey Cyber, curious where you got 5, as the current default value in 9.2.1.2 is 1000?

Hey EEE3, can you test using 8 drives off the HBA and eliminate the onboard stuff? The drives and card support SATA 6Gb/s but the mobo is 3Gb/s, and I wouldn't mix-n-match for anything other than home use.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Crap. You are right. I swore we were talking about a different one. Guess I'm tired. It is 1000. I'm out to lunch and you can ignore all that crap since I thought you were talking about a transaction setting.

Honestly, I'm not sure I know what max_active does. I'll have to look it up. Message me tomorrow if I haven't answered (in case I forget).
 

aufalien

Patron
Joined
Jul 25, 2013
Messages
374
Ah cool, yea I'm unsure what it is as well and looking forward to some knowledge about it.
 

EEE3

Dabbler
Joined
Nov 27, 2013
Messages
12
aufalien - sorry my initial description was unclear. The two drives which are attached to the motherboard controller are on the SATA 3.0 ports running at 6Gb/s. dmesg confirms all 10 drives are running at 6Gb/s. Regardless of that, it's an excellent suggestion and I would love to eliminate the motherboard controller to simplify the test. Unfortunately I have real data in the pool right now so I can't just disconnect those drives. I'm going to see if maybe I can borrow another 9211 adapter from someone local to see if that has any impact. FWIW this is just for home use.
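(The quick way to confirm the negotiated link speed is just to grep the boot messages; device names here are examples, and the two onboard drives will typically show up as ada rather than da:)

grep -E '(ada|da)[0-9]+: .*MB/s' /var/run/dmesg.boot
# each drive should report 600.000MB/s transfers for SATA 6Gb/s (300.000MB/s would indicate 3Gb/s)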

Cyberjock - I'm glad you were thinking of another setting. When you implied that it has an impact not just on speed but literally on how the data is written to the pool, I was confused. I did not see that in a cursory scan of the source code... hopefully that's not the case and the danger of playing with this setting is pretty much nil.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Sorry, I've been pretty busy. Maybe tomorrow I'll get to this. I'll bookmark this and hopefully remember to look into it tomorrow. :P
 

EEE3

Dabbler
Joined
Nov 27, 2013
Messages
12
This is a home server and I was away on business all last week so I didn't get an opportunity to do much more myself.

The one thing I did do is continue to look for any sign of a hardware problem, and I did find one. I ran 8 parallel "dd if=/dev/daN of=/dev/null bs=1M" reads across the 9211-8i drives (da0 through da7) and noticed that I was getting a sustained 150MB/s per drive up through 5 drives, but as soon as I went beyond that, speeds fell: an obvious bandwidth limit. It turns out the 9211-8i was in what I *thought* was a PCIe 2.0 x8 slot, but it was one of those stupid "x2 (in x8)" configurations. I moved the card to a PCIe 3.0 x16 slot and am now seeing 150MB/s concurrently across all 8 drives.
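(In case anyone wants to check for the same mistake: the negotiated link width shows up in the PCIe capability reported by pciconf, e.g. something like the following, where "link x2" vs "link x8" on the mps0 entry is the giveaway. Exact output formatting varies by FreeBSD release.)

pciconf -lc | grep -A4 '^mps0'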

I'm happy I fixed this mistake but my ZFS results remain what I previously described.

Next I was going to try a pool with some old Seagate and WD RE drives I have lying around and see if the results are any different from the WD Reds.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Ok. So here's a cool link to read if you are really bored.. I think this explains it.

http://people.freebsd.org/~hmp/drag...ib/opensolaris/uts/common/fs/zfs/vdev_queue.c

Mind you, this appears to be OpenSolaris code, but I think it applies to FreeBSD as well. Here's what I'm getting from it. I'm not a coder, so I can't really read the code and understand it fully, but from what I read in those notes...

Here are the values on my server (defaults):

vfs.zfs.vdev.scrub_max_active: 2
vfs.zfs.vdev.async_write_max_active: 10
vfs.zfs.vdev.async_read_max_active: 3
vfs.zfs.vdev.sync_write_max_active: 10
vfs.zfs.vdev.sync_read_max_active: 10
vfs.zfs.vdev.max_active: 1000
vfs.zfs.vdev.scrub_min_active: 1
vfs.zfs.vdev.async_write_min_active: 1
vfs.zfs.vdev.async_read_min_active: 1
vfs.zfs.vdev.sync_write_min_active: 10
vfs.zfs.vdev.sync_read_min_active: 10


The I/O scheduler looks for anything above the maximum value (3 or 10, as appropriate) and when it finds something it basically does that particular task to the exclusion of anything/everything else. Basically it's an "if my to-do list gets too long I need to stop procrastinating and work on my to-do list". That makes sense. ZFS also has a slope based on those values for how to throttle back writes.

Normally ZFS "schedules" writes and reads so you typically do a series of reads, then a series of writes. By taking turns and caching in between, you can minimize lost time due to doing a head movement for a small read, then a head movement for a small write. Remember that the actual act of pulling/pushing data onto/off of the media isn't very long relative to the latency of moving the head to the correct location on the platter. Minimizing the number of head movements is the name of the game. So you do reads, then do writes. Sync reads and writes break this pattern, and that's partly where the L2ARC and ZIL help fill in the gaps so the pool can continue to operate without being interrupted. It's not much different from you not wanting to be interrupted when you are deep in thought working on some serious task.

Now, the max_active of 1000 means that the total of all I/O shouldn't exceed 1000. But that's effectively a no-op, since the sum of the individual max_actives is only 35 (10+3+10+10+2). Per those notes the maximum ideally should be >= the sum of each queue's max_active.

So now if you set the max_active to 1, you are short-circuiting all of the other max_actives (and even the min_actives), since they're all greater than 1. So any I/O that comes in will be performed right now, in the order that the scheduler prioritizes traffic. Per that link the order is: sync read, sync write, async read, async write, scrub/resilver. So any I/O that ever comes down the pipe will be performed right now and *immediately*, with the leftovers going to scrubs/resilvers.

I'd expect this to do a few things:

1. Throughput will drop, potentially to absolutely shitty values, since your disks are going to be forced to operate in the order and at the frequency I just mentioned in the paragraph above. So instead of doing a few long writes that were cached and whatnot, it'll do tons of small writes. The same goes for reads. Everything will be constantly interrupting everything else.
2. Latency will decrease since you're forcing ZFS to do small transactions and to do them *right now*.
3. Scrubs/resilvering could take a serious, serious performance hit. If your pool is constantly active, scrubs that normally take hours could suddenly take days, because scrubbing/resilvering is the lowest-priority task and you are forcing the pool to spend more time being active with reads and writes, which would constantly interrupt scrubs/resilvers. Scrubs/resilvers have their own tunables for when to temporarily suspend scrubbing/resilvering because the pool is too active (a couple of the relevant sysctls are sketched after this list), and in a worst-case scenario you may never be able to complete a scrub or resilver if the pool is being accessed frequently enough that the scrub/resilver suspends and can't ever resume. Note that when I say suspend I mean internally to the code, not that the scrub actually stops or aborts. ZFS simply decides that right now isn't a good time because of server usage and temporarily stops scrub/resilver I/O until the server is quieter (hopefully within a few seconds).
4. It sounds like the amount of actual head movement your disks will have to make will increase significantly as ZFS will no longer be able to take a bunch of small reads or writes and make them into a few bigger reads/writes.
5. As long as your pool can keep up with the I/O being requested, everything is okay. But I think that once it can't keep up, things could go horribly, horribly bad for you, as you aren't optimizing your workload for the disks.
6. Due to the increased load on the drives, I'd expect drive temps to potentially be higher, drives to potentially wear out faster, and you to have issues performing SMART tests on the disks if they are excessively active.
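From memory, the scrub/resilver throttles referred to in point 3 are sysctls along these lines on this vintage of FreeBSD/ZFS (names and defaults shift between releases, so check what exists on your own box):

sysctl -d vfs.zfs.scan_idle vfs.zfs.scrub_delay vfs.zfs.resilver_delay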

I get the impression that it's set to 1000 simply to ensure that this value isn't normally your limiting factor, as the other max_active values are deemed to be the best way to control your I/O at the moment. Perhaps it exists for some future feature that is coming to ZFS. But right now I think it is a bad idea to set vfs.zfs.vdev.max_active to a value that is less than the sum of the other min_active values, as the code notes say. Generally I've found that when recommendations are made in code notes it's smart to listen to them, as there's often hidden logic and other relationships involved that aren't always immediately obvious to the casual observer.

Now regardless of how much or how little I've gotten right or wrong, there's one thing that concerns me...


The maximum number of i/os active to each device. Ideally, this will be >= the sum of each queue's max_active. It must be at least the sum of each queue's min_active.

uint32_t zfs_vdev_max_active = 1000;

So I wouldn't under any circumstances set vdev.max_active to anything less than 23 (the sum of the min_actives), and I'm not sure there would be any change in how ZFS behaves if you set it to anything above 35 (the sum of all the max_actives).
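(A quick sanity check on any given box is to sum the individual max_actives straight out of sysctl; with the defaults listed above this prints 35:)

sysctl -n vfs.zfs.vdev.sync_read_max_active vfs.zfs.vdev.sync_write_max_active \
  vfs.zfs.vdev.async_read_max_active vfs.zfs.vdev.async_write_max_active \
  vfs.zfs.vdev.scrub_max_active | awk '{ s += $1 } END { print s }'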
 

LubomirZ

Dabbler
Joined
Sep 9, 2015
Messages
12
Resurrecting a dead topic: "The maximum number of i/os active to each device"

ummmm, does this mean SSDs are also limited to 1000 IOPS ???
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
This is about the number of simultaneous I/Os, not IOPS. The second includes time, the first doesn't: the setting caps how many requests can be outstanding to a device at any instant, not how many can complete per second.
 