60+ simultaneous mount requests causing listen queue overflows

Status
Not open for further replies.

ron_post

Cadet
Joined
Oct 30, 2018
Messages
2
I've got a FreeNAS box running FreeBSD 11.1-STABLE (although this happened on earlier versions as well) serving user directories over NFSv3 to a cluster of Linux machines (which are used as a compute cluster). Directories on the cluster are automounted, including home directories. When someone starts a job on many of the cluster machines at the same time (a fairly common occurrence), a number of the jobs fail to start, saying they failed to mount the user's home directory. Checking on the server, dmesg shows an error like


sonewconn: pcb 0xfffff801e5278570: Listen queue overflow: 193 already in queue awaiting acceptance (1 occurrences)


The address is close to mountd's, but doesn't exactly match anything in lsof -iTCP -sTCP:LISTEN -P when I check. netstat -s reports listen queue overflows for TCP, and with netstat -Lan I can watch the listen queue for mountd climb to 183. (I'm forcing the NFS mount requests to use TCP rather than UDP, with proto=tcp on the clients.)


I've tried increasing the size of the accept queue (kern.ipc.soacceptqueue) from 128 to 1024 and rebooting the machine, but netstat -Lan shows that the limit is still 128 for mountd:


Code:
tcp4  0/0/128                          *.763
tcp6  0/0/128                          *.763



(rpcinfo -p shows mountd is at port 763.)
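
For anyone reading the netstat -Lan lines above, here is a minimal Python sketch of how the three slash-separated counts can be pulled apart. Per the FreeBSD netstat man page they are the unaccepted connections, the unaccepted incomplete connections, and the maximum queue length. The names parse_listen_line and overflow_threshold are made up for illustration. The threshold formula is an assumption on my part: as I understand it, the kernel's sonewconn logs an overflow once the queue length exceeds 3/2 of the limit, so treat that part as a guess to be checked against the kernel source rather than a statement of fact.

```python
import re
from typing import NamedTuple, Optional


class ListenQueue(NamedTuple):
    proto: str      # e.g. "tcp4"
    qlen: int       # unaccepted connections
    incqlen: int    # unaccepted incomplete connections
    maxqlen: int    # maximum number of queued connections
    local: str      # local address, e.g. "*.763"


# Matches lines such as "tcp4  0/0/128   *.763"; header lines do not match.
_LINE = re.compile(r"^(\S+)\s+(\d+)/(\d+)/(\d+)\s+(\S+)$")


def parse_listen_line(line: str) -> Optional[ListenQueue]:
    """Parse one line of `netstat -Lan` output, or return None if it is not a queue line."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    proto, qlen, incqlen, maxqlen, local = m.groups()
    return ListenQueue(proto, int(qlen), int(incqlen), int(maxqlen), local)


def overflow_threshold(maxqlen: int) -> int:
    """ASSUMPTION: overflow is reported once qlen > 3 * maxqlen / 2 (integer division)."""
    return 3 * maxqlen // 2


# The two lines from the post, used as sample input.
sample = """\
tcp4  0/0/128                          *.763
tcp6  0/0/128                          *.763
"""

queues = [q for q in map(parse_listen_line, sample.splitlines()) if q is not None]
for q in queues:
    print(q.proto, q.local, "limit", q.maxqlen, "threshold", overflow_threshold(q.maxqlen))
```

If that assumption holds, a limit of 128 gives a threshold of 192, which would be consistent with the "193 already in queue" figure in the dmesg line and with the socket still using a backlog of 128 despite the sysctl change.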


How do I get mountd to either use a larger queue or otherwise handle the request spikes from the cluster?
 