nfsd using too much CPU when L2ARC is in the pool


dniq

Explorer
Joined
Aug 29, 2012
Messages
74
I've recently upgraded RAM in my FreeNAS server, from 32G to 128G.

After that, the nfsd process, which normally rarely uses more than 25% of CPU, started using 300-400%!

At the same time I noticed that the activity on the L2ARC is very low, and the hit ratio is below 20%.
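
In case it's useful, this is roughly how I'm checking the L2ARC hit rate - the kstat sysctl names are from a stock FreeBSD ZFS build, so treat it as a sketch:

# Rough L2ARC hit ratio (counters are cumulative since boot)
hits=$(sysctl -n kstat.zfs.misc.arcstats.l2_hits)
misses=$(sysctl -n kstat.zfs.misc.arcstats.l2_misses)
echo "scale=1; 100 * $hits / ($hits + $misses)" | bc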

So I tried to remove the L2ARC (which is a 400G SAS SSD drive) from the pool. And what do you know - CPU load caused by nfsd dropped down to 20-30%! As soon as I add the SSD back as L2ARC - the CPU use skyrockets.

What's happening?

The pool contains 12x3T drives in a RAID10 setup (6x2), plus a 200G SSD for ZIL, and used to have a 400G SSD for L2ARC (which I have now removed). The server is a Dell R720xd.
 
L

Guest
That is a lot of L2ARC for that much RAM. Is there a ton of read activity going on on the box?
 
L

Guest
One more thing: can you rule out nfsd? My guess would be that it is spinning on locks. If you turn it off, is the CPU still high?
 

dniq

Explorer
Joined
Aug 29, 2012
Messages
74
Not at the time. There are occasional spikes of reads or writes (or both simultaneously). I use it for many things - VMware images, DB storage, web content, user homes, et cetera - so both the amount and the type of load vary. The web content is a lot of files (about 16 million), mostly small (1-2k), plus some rather large ones - up to a gigabyte. MySQL's InnoDB data is about 1T in size, split into about two dozen files. There's also a Couchbase data store (and Couchbase REALLY loves to abuse its data store!). The developers' homes have dedup enabled (all of them have copies of the entire codebase in their homes, so dedup works very well, reducing the dataset from 2T to about 200G :) ). Et cetera...
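
For what it's worth, this is roughly how I check those dedup numbers - "tank" is just a placeholder for the actual pool name:

zpool get dedupratio tank    # pool-wide dedup ratio (dedup savings only show up at pool level)
zpool status -D tank         # DDT summary; entry counts give a feel for dedup-table RAM use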

As a matter of fact, the box's performance improved quite a lot after I removed the L2ARC and added it to the ZIL. So I now have 600G of ZIL (the old 200G plus the former 400G L2ARC) - pretty wasteful, I'd say, but what can I do - 200G was the smallest SLC SSD I had available :(

But still, why would nfsd go bananas when there's L2ARC???
 

dniq

Explorer
Joined
Aug 29, 2012
Messages
74
I can't turn it off - it's my primary production NFS server :(

The high load is the WCPU of the nfsd process. top shows the total CPU load as nearly 0% user, 20-30% system, single digits for the rest, and about 70% idle. I'm guessing that the summary line treats 100% as all cores combined (I have a single 16-core CPU in the server), while for individual processes 100% means one core, so it kind of adds up: 300% for nfsd means it's using about 3 cores, which is roughly the 20-30% of the whole machine (16x100% = 1600%) shown in the summary...
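
A rough way to double-check that math (the top flags are from stock FreeBSD, so treat this as a sketch):

sysctl -n hw.ncpu    # number of cores top is summing over (16 here)
top -SH              # -S: include kernel processes, -H: one line per thread
# 300% WCPU for nfsd ~= 3 busy cores; 3/16 ~= 19% of the whole box,
# which lines up with the 20-30% "system" in top's summary line.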
 

dniq

Explorer
Joined
Aug 29, 2012
Messages
74
To be honest, I actually have 256G of RAM in the server, but the other 128G can't be used until I put a second CPU in it... So I suppose it's OK if I don't have L2ARC.
 
L

Guest
Wow.. let me poke through my dtrace scripts and see if I can find one that might help see what it is doing. I'm actually a Solaris-turned-FreeNAS person. In Solaris there is a tunable for the number of nfsd threads, and it is typically set far too low for multi-core systems. The Solaris rule of thumb is 2-4 threads per client mount. On big-ish systems (8+ cores) I would typically amp that up to 512 or 1024, and I would typically see CPU come down once there are enough threads.

I am not saying this is what is happening, but in the past I have seen cases where nfsd takes a read request, passes it to ZFS, and ZFS takes a little time to get the data (like scanning through the L2). nfsd then acts like a small child spinning up and asking "are you done yet??" about a thousand times a second. I saw this with dtrace... but I can't remember where I put the script.

I would think this would most likely be something that could be tuned.
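
In the meantime, a quick profile one-liner like this (standard dtrace profile provider, so it should work on FreeBSD too - a sketch, not a polished script) would at least show which kernel paths the nfsd threads are burning time in:

# Sample nfsd's on-CPU kernel stacks ~997 times/sec for 10 seconds; lots of
# samples in lock/spin routines would support the "spinning on locks" theory.
dtrace -n 'profile-997 /execname == "nfsd"/ { @[stack()] = count(); } tick-10s { exit(0); }'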
 

aufalien

Patron
Joined
Jul 25, 2013
Messages
374
To be honest, I actually have 256G of RAM in the server, but the other 128G can't be used until I put a second CPU in it... So I suppose it's OK if I don't have L2ARC.

Well, my take on this is that while it sounds good on paper, if you run arcstat.py and check your values - ARC size, hit %, miss %, etc. - you can determine whether L2ARC will help.

Someone correct me if I'm wrong, but if your ARC size stays lower than your RAM - or whatever you cap the usable RAM for the ARC at (arc_max) - then you don't need L2ARC.

Also, if you had 600GB of L2ARC, you'd be using roughly ~24GB of RAM just to index it. CJ recommends a 1:5 ratio of ARC to L2ARC, meaning the L2ARC should never exceed five times the ARC.
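
For example, something like this (arcstat.py ships with FreeNAS; column names can vary a bit between versions):

arcstat.py 5    # sample every 5 seconds
# Watch "arcsz" vs "c" (the target size) and the miss% columns: if arcsz never
# even reaches c, the ARC isn't using all the RAM it's allowed, and L2ARC won't add much.

The ~24GB figure is just the usual back-of-the-envelope math: every record cached in L2ARC needs an ARC header of a few hundred bytes, so 600GB of L2ARC full of smallish records can easily tie up tens of GB of ARC in bookkeeping alone.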

But still you quadrupled your RAM so this is sorta odd. Hopefully Linda's dtrace script will yield more info.

Although like I said, you may not need L2Arc.
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
Nothing about L2ARC below.

I know nothing about your NFS server activity/load. And I have very limited FreeBSD NFS experience. However, I want to point out that a modern Solaris would have a single nfsd daemon serving 1024 requests (that is the default and that number can be changed). On the other hand http://www.freebsd.org/cgi/man.cgi?query=nfsd&sektion=8 tells us
A server should run enough daemons to handle the maximum level of concurrency from its clients, typically four to six.
The default is four.

You really may want to try Linda's suggestion of increasing the number of nfsd processes. Of course, it only helps if you have more than 2-3 NFS clients making concurrent requests...
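
On FreeNAS that is the "Number of servers" setting for the NFS service; on plain FreeBSD it is the -n flag to nfsd, e.g. in /etc/rc.conf (the numbers below are only an illustration, not a recommendation):

nfs_server_enable="YES"
nfs_server_flags="-u -t -n 32"   # -u/-t: serve UDP and TCP, -n: run 32 server threads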
 

dniq

Explorer
Joined
Aug 29, 2012
Messages
74
Took me about 30 seconds to find some good scripts out there:

https://github.com/siebenmann/cks-dtrace/blob/master/nfs3-stats.d

is a nice script to measure read and write latencies on NFS requests.


Looks nice! I'll give it a go tomorrow! Thanks!

For now I just removed the L2ARC disk and added it to the ZIL as a stripe.

As a side note, when I removed the L2ARC I ran a "dd if=/dev/zero of=/dev/mfisyspd13 bs=4k" test to see how the SSD performs, and found that it only gets about 230MB/s write throughput. So I called Dell, and they found an article saying that the Dell PERC H310 does not perform well in JBOD mode. That could be at least part of the problem (although it was working fine until the RAM upgrade). So I've asked in a separate thread here whether I can reconfigure the H310 to RAID mode and configure each physical disk as a single-disk RAID0, so that ZFS still recognizes them as members of the zpool and just keeps working - without having to back up all the data (that's over 4T!), recreate the zpool and then restore everything.
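
For what it's worth, with bs=4k dd mostly measures per-request overhead through the controller rather than the SSD's raw sequential speed; a larger block size gives a better picture. Something like this - only safe here because the disk is out of the pool, since it overwrites whatever is on it:

dd if=/dev/zero of=/dev/mfisyspd13 bs=1m count=10000   # ~10GB sequential write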
 

dniq

Explorer
Joined
Aug 29, 2012
Messages
74
Well, my take on this is that while it sounds good on paper, if you run arcstat.py and check your values - ARC size, hit %, miss %, etc. - you can determine whether L2ARC will help.

Someone correct me if I'm wrong, but if your ARC size stays lower than your RAM - or whatever you cap the usable RAM for the ARC at (arc_max) - then you don't need L2ARC.

Also, if you had 600GB of L2ARC, you'd be using roughly ~24GB of RAM just to index it. CJ recommends a 1:5 ratio of ARC to L2ARC, meaning the L2ARC should never exceed five times the ARC.

But still you quadrupled your RAM so this is sorta odd. Hopefully Linda's dtrace script will yield more info.

Although like I said, you may not need L2Arc.


Well, for now it seems to work much better without the L2ARC. But there's also an issue with the Dell PERC H310 controller, when it's in JBOD mode... So I'm trying to figure out what to do about it: either buy a few additional controllers and spread the disks amongst them, or switch the controller to RAID mode and configure all physical disks as individual single-disk RAID0 "virtual disks".
 

dniq

Explorer
Joined
Aug 29, 2012
Messages
74
Nothing about L2ARC below.

I know nothing about your NFS server activity/load. And I have very limited FreeBSD NFS experience. However, I want to point out that a modern Solaris would have a single nfsd daemon serving 1024 requests (that is the default and that number can be changed). On the other hand http://www.freebsd.org/cgi/man.cgi?query=nfsd&sektion=8 tells us "A server should run enough daemons to handle the maximum level of concurrency from its clients, typically four to six." The default is four.

You really may want to try Linda's suggestion of increasing the number of nfsd processes. Of course, it only helps if you have more than 2-3 NFS clients making concurrent requests...


Well, actually there are about 30-40 clients, of which 16 make frequent stat requests about 90% of the time and have it mounted read-only (those are web servers, which read PHP scripts from the NFS, with APC, which caches and compiles them and then just stats them to see if the scripts have changed). Then there are two MySQL servers keeping their InnoDB data files on the NFS, plus two Couchbase servers that do the same, plus about 30-40 VMware images... In short, quite a lot :) But on any given day, on average, there isn't that much IO going on - usually about 5-10MB/s read/write, with occasional spikes up to 60-100MB/s. When the WCPU of nfsd started growing, I increased the number of nfsd threads to the number of cores in the CPU, which is 16, so that it doesn't slow down much when it hits 400% or more.
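
To see how much of that is pure metadata traffic, I can watch the server-side op counters - nfsstat is in the FreeBSD base system, though the exact output varies between versions:

nfsstat -s    # server-side RPC counters: getattr/access/lookup vs. read/write
# A huge getattr/access share relative to read/write would point at the stat-heavy web servers.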
 

aufalien

Patron
Joined
Jul 25, 2013
Messages
374
Hey, I just wanted to comment and advise you to ditch that POS Dell HBA and get an LSI. Since you are willing to up the RAM, I'd suggest you get a proper HBA, which will make a very good difference. How many disks do you have?
 

dniq

Explorer
Joined
Aug 29, 2012
Messages
74
Hey, I just wanted to comment and advise you to ditch that POS Dell HBA and get an LSI. Since you are willing to up the RAM, I'd suggest you get a proper HBA, which will make a very good difference. How many disks do you have?


I'd love to! In fact, I have a call with our Dell account rep for the purpose of giving him a hard time for selling me that crap, and I'll demand that they replace it with something decent. The problem is, they don't seem to have any other controllers that support JBOD :(

Do you have any specific LSI models to recommend? I'd appreciate that!

I have 12 SAS 3T disks in a RAID10 setup (done by ZFS - 6x2), plus one 200G SLC SSD for ZIL and one 400G SSD that used to be the L2ARC (I removed it from L2ARC and added it as a stripe to the ZIL).
 
L

Guest
Yeah, that InnoDB is going to want low latency - typically a database doesn't care about bandwidth, it cares about latency. As I was thinking about this today, I kept coming back to the hardware: if there is even a hair of a hiccup in a request - sitting on the controller, or long waits in buffers - that can make nfsd spin. I have seen really big RAM and really big L2s without the CPU spiking, but I have seen NFS spin on locks a lot. The Solaris nfsd is archaic (new Solaris has been updated very recently), but I believe the BSD one is much more modern.

If you wanna have some fun, google "dtrace nfs" and you will find a number of scripts. You don't need to understand dtrace to use them. The probes are typically really lightweight, so you can run them in production without adding to the load.
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
You could try doubling the number of your nfsds, if you can tolerate a short break for a restart.

It would be difficult to optimize almost anything given the diverse nature of your NFS load.
 

dniq

Explorer
Joined
Aug 29, 2012
Messages
74
Yeah, that InnoDB is going to want low latency - typically a database doesn't care about bandwidth, it cares about latency. As I was thinking about this today, I kept coming back to the hardware: if there is even a hair of a hiccup in a request - sitting on the controller, or long waits in buffers - that can make nfsd spin. I have seen really big RAM and really big L2s without the CPU spiking, but I have seen NFS spin on locks a lot. The Solaris nfsd is archaic (new Solaris has been updated very recently), but I believe the BSD one is much more modern.

If you wanna have some fun, google "dtrace nfs" and you will find a number of scripts. You don't need to understand dtrace to use them. The probes are typically really lightweight, so you can run them in production without adding to the load.


Well, right now I'll probably start migrating everything to a FreeNAS VM - it seems I can't replace the RAID controller without having to re-create the zpool :( But I'm definitely going to replace the controller, since it seems to have a known performance issue, especially in JBOD mode :( So until then, I think the nfsd issue can be "closed". I hope that after I go through all the trouble of getting the controller replaced, there won't be any more problems :)

Thanks for all your help!
 