Random "NFS server not responding" errors

Status
Not open for further replies.
Joined
Jul 2, 2013
Messages
4
At apparently random intervals, all of my Linux clients connected to a FreeNAS file server lose their connection to the NFS share. Specifically, the log files on each of the Linux servers start printing "nfs: server x.x.x.x not responding, still trying".
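To check whether the dropouts really are random, it can help to pull every occurrence of that message out of a client log and look at when they happen. A minimal sketch, assuming the message text is exactly as quoted above; the function name find_stalls is made up for illustration:

```python
import re

# Matches the client-side message quoted in the post and captures the
# server address that follows "server".
NOT_RESPONDING = re.compile(r"nfs: server (\S+) not responding, still trying")

def find_stalls(lines):
    """Return (line_number, server) for every 'not responding' message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = NOT_RESPONDING.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```

Feeding it the lines of whichever log file the client writes kernel messages to (the location varies by distribution) gives a list that can be compared across clients to see whether they all stall at the same moment.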

Log files on the FreeNAS box itself do not seem to indicate any problem at all.

While SSH-ed directly into the FreeNAS box, however, I can still interact with the ZFS datasets without an issue. The overall load average is acceptable, although I do see the nfsd process using 80-100% CPU.

The server was behaving correctly before the following upgrades, which were just applied to it:
- Upgrading from FreeNAS 8.2 to 9.1.1
- Replacing the RAID controller with an LSI 9201-16i
- Replacing all of the old hard drives

If anyone has an idea what might be causing the issue, perhaps a bug I overlooked, or what I could investigate next, I'd be grateful.
 

dlavigne

Guest
Which driver is being used by the NIC (from ifconfig)? What type of CPU?
 
Joined
Jul 2, 2013
Messages
4
NIC drivers are:
dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.3.8
dev.em.1.%desc: Intel(R) PRO/1000 Network Connection 7.3.8

From what I can tell, however, it doesn't seem to be networking related...

There are two separate issues from the client's side:
a) NFS becomes unresponsive ... ping to the box still works, SSH may or may not work
b) System completely unresponsive from the outside
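A rough sketch of how a client could tell the two cases apart automatically. The labels "a" and "b" mirror the list above; the helper names and the idea of probing TCP ports are assumptions, not something from the original setup. Note that a successful TCP connect to the NFS port is only a crude proxy, since a hung nfsd may still accept connections, so the nfs_ok flag is better derived from a real operation on the share.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def classify(ping_ok, ssh_ok, nfs_ok):
    """Map three probe results onto the cases described in the post."""
    if nfs_ok:
        return "ok"
    if ping_ok:
        # Case a: NFS is unresponsive but the box still answers ping;
        # SSH may or may not work, so ssh_ok is not consulted here.
        return "a"
    if not ssh_ok:
        # Case b: nothing answers from the outside.
        return "b"
    # SSH works without ping: not one of the two cases described.
    return "unknown"
```

For example, ssh_ok could come from port_open(host, 22), while ping_ok would have to come from the system ping tool, since ICMP needs raw sockets.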

On one occasion, I saw an indefinite wait kernel error having to do with swap on the physical machines. At other times, I don't see anything special.

So far, I've tried the following:
- Upgrading the LSI card firmware to P17
- Upgrading the kernel module for the LSI card to P17
- Rebuilding the pool from the replica server

I am, however, seeing the following errors as I boot the system. Pretty much the only errors I see...

This happens with every firmware/module version I've tried and still happens now (once per drive):
(probe754:mpslsi0:0:755:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe754:mpslsi0:0:755:0): CAM status: Invalid Target ID
(probe754:mpslsi0:0:755:0): Error 22, Unretryable error
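Since these lines repeat per drive, grouping them by target makes it easier to see whether every target reports the same status. A minimal sketch, assuming the parenthesised prefix reads as (peripheral:controller:bus:target:lun); that reading of the fields and the function name are assumptions:

```python
import re
from collections import defaultdict

# Assumed prefix layout: (peripheral:controller:bus:target:lun): message
CAM_LINE = re.compile(r"^\((\w+):(\w+):(\d+):(\d+):(\d+)\): (.*)$")
STATUS_PREFIX = "CAM status: "

def cam_status_by_target(lines):
    """Collect 'CAM status' messages keyed by (controller, target)."""
    statuses = defaultdict(list)
    for line in lines:
        match = CAM_LINE.match(line.strip())
        if not match:
            continue  # not a CAM probe line
        _periph, controller, _bus, target, _lun, message = match.groups()
        if message.startswith(STATUS_PREFIX):
            key = (controller, int(target))
            statuses[key].append(message[len(STATUS_PREFIX):])
    return dict(statuses)
```

Run over the boot messages, this yields one entry per controller/target pair with the list of statuses seen for it.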
 

Matt Reynolds

Dabbler
Joined
Sep 27, 2013
Messages
10
Did you ever find a solution to this?

I'm experiencing the same issue: NFS randomly stops working, then sorts itself out 10 minutes later. I'm considering upgrading from 9.1.1 to 9.2 to see whether that fixes the problem, specifically because we're using the same NIC and there's a newer driver in the current release.
 
Joined
Jul 2, 2013
Messages
4
Never did manage to find a real solution to this.

What I ended up doing is having a couple of app servers SSH into the box every minute. It's been a couple of months now and I've not seen any repeat incidents.
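For anyone wanting to try the same workaround, here is a rough sketch of such a keepalive loop. The post does not say what the SSH sessions actually ran, so the `true` command, the timeouts, and the function names are all assumptions; it also assumes key-based login is already set up.

```python
import subprocess
import time

def build_ssh_command(host, connect_timeout=10):
    """Build an ssh invocation that logs in, runs `true`, and exits."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                      # never prompt for a password
        "-o", f"ConnectTimeout={connect_timeout}",  # give up on a dead host
        host,
        "true",
    ]

def keepalive(host, interval=60, rounds=None):
    """SSH into host every `interval` seconds; rounds=None loops forever."""
    done = 0
    while rounds is None or done < rounds:
        try:
            result = subprocess.run(build_ssh_command(host), timeout=30)
            print(f"{time.ctime()}: ssh exit code {result.returncode}")
        except subprocess.TimeoutExpired:
            print(f"{time.ctime()}: ssh timed out")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling keepalive with the file server's hostname from one or two app servers would reproduce the once-a-minute login; a cron entry running the same ssh command would do the job just as well.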
 