NFSv3 + VMware slow disk I/O performance

Status: Not open for further replies.

MasterTacoChief (Explorer, joined Feb 20, 2017, 67 messages)
FreeNAS 11.0-U4, system uptime 60 days, 3x 300GB 10k SAS disks in a RAID-Z1 configuration plus a 240GB SAS SSD as L2ARC. This system is dedicated to providing an NFSv3 share to VMware for running a variety of VMs. Early this morning one of the drives reported an increase in bad sectors and performance dropped significantly. By the time I noticed, VMs using this datastore were running slowly, and the vCenter Server Appliance (also stored on this share) had corrupted itself (it doesn't appear to tolerate high storage latency well).
I tried a variety of things, including offlining the disk that reported bad sectors and offlining the L2ARC, but nothing improved disk access over NFS. Strangely enough, I can copy files to and from this array over SMB at 200MB/s+, so something seems to be wrong with NFS on this particular share.
I have another NFS share made up of 5x 4TB SATA disks on the same FreeNAS server, and that share isn't having the same issues. Once I migrated the VMs off the bad share (very slowly) to the other one, disk access was back to normal for those VMs. There are no errors in the FreeNAS syslog, nor any errors from VMware.
Next I'll try restarting NFS, then, if necessary, rebooting FreeNAS. I'd hoped for better uptime, though.

Below is the Disk I/O log for the three disks in the array, showing the drop in performance when it all went downhill. da10 was the disk with reported bad sectors; you can see its I/O go to zero when I offlined it, then return once I brought it back online. The spikes around 18:00 are from copying files via SMB, and the reads around that time are from migrating VMs to the other share.

[Attachment: upload_2018-1-1_21-37-8.png (Disk I/O graph for the three array disks)]
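(Not from the original post, but the condition above is scriptable. A minimal sketch, assuming the standard `zpool status` output layout, that flags any pool whose state is not ONLINE, e.g. DEGRADED after a disk drops out:)

```python
import re

def unhealthy_pools(status_text):
    """Parse `zpool status` output; return {pool: state} for every pool
    whose state is not ONLINE (e.g. DEGRADED after a disk is offlined)."""
    pools, current = {}, None
    for line in status_text.splitlines():
        m = re.match(r"\s*pool:\s*(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"\s*state:\s*(\S+)", line)
        if m and current:
            if m.group(1) != "ONLINE":
                pools[current] = m.group(1)
            current = None
    return pools

# On the server, feed it live output, e.g.:
#   import subprocess
#   print(unhealthy_pools(subprocess.check_output(["zpool", "status"], text=True)))
```

Dropped into a cron job, this would have flagged the degraded vdev as soon as da10 was offlined rather than waiting for VMs to slow down.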
 

MasterTacoChief
When I restarted NFS, this string of errors appeared in the syslog:

Code:
Jan  1 21:44:14 nas swap_pager: I/O error - pagein failed; blkno 3408901,size 4096, error 6
Jan  1 21:44:15 nas swap_pager: I/O error - pagein failed; blkno 3419365,size 4096, error 6
Jan  1 21:44:15 nas vm_fault: pager read error, pid 2625 (nfsd)
Jan  1 21:44:15 nas swap_pager: I/O error - pagein failed; blkno 3410277,size 4096, error 6
Jan  1 21:44:15 nas vm_fault: pager read error, pid 2625 (nfsd)
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x800625000 with size 0x20000 to be written at offset 0x4000 for process nfsd
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x800625000 with size 0x20000 to be written at offset 0x4000 for process nfsd
Jan  1 21:44:15 nas swap_pager: I/O error - pagein failed; blkno 3419365,size 4096, error 6
Jan  1 21:44:15 nas vm_fault: pager read error, pid 2625 (nfsd)
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x800bba000 with size 0x19000 to be written at offset 0x32000 for process nfsd
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x800bba000 with size 0x19000 to be written at offset 0x32000 for process nfsd
Jan  1 21:44:15 nas swap_pager: I/O error - pagein failed; blkno 3419392,size 8192, error 6
Jan  1 21:44:15 nas vm_fault: pager read error, pid 2625 (nfsd)
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x800c00000 with size 0x400000 to be written at offset 0x4b000 for process nfsd
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x800c00000 with size 0x400000 to be written at offset 0x4b000 for process nfsd
Jan  1 21:44:15 nas swap_pager: I/O error - pagein failed; blkno 3417015,size 65536, error 6
Jan  1 21:44:15 nas vm_fault: pager read error, pid 2625 (nfsd)
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x801600000 with size 0x1e00000 to be written at offset 0x44d000 for process nfsd
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x801600000 with size 0x1e00000 to be written at offset 0x44d000 for process nfsd
Jan  1 21:44:15 nas kernel: pid 2625 (nfsd), uid 0: exited on signal 11 (core dumped)
Jan  1 21:44:15 nas swap_pager: I/O error - pagein failed; blkno 3408900,size 4096, error 6
Jan  1 21:44:15 nas swap_pager: I/O error - pagein failed; blkno 3410022,size 4096, error 6
Jan  1 21:44:15 nas vm_fault: pager read error, pid 2144 (rpcbind)
Jan  1 21:44:15 nas swap_pager: I/O error - pagein failed; blkno 3410373,size 4096, error 6
Jan  1 21:44:15 nas vm_fault: pager read error, pid 2144 (rpcbind)
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x800e00000 with size 0x400000 to be written at offset 0x51000 for process rpcbind
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x800e00000 with size 0x400000 to be written at offset 0x51000 for process rpcbind
Jan  1 21:44:15 nas swap_pager: I/O error - pagein failed; blkno 3410022,size 4096, error 6
Jan  1 21:44:15 nas vm_fault: pager read error, pid 2144 (rpcbind)
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x80160e000 with size 0x1000 to be written at offset 0x452000 for process rpcbind
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x80160e000 with size 0x1000 to be written at offset 0x452000 for process rpcbind
Jan  1 21:44:15 nas swap_pager: I/O error - pagein failed; blkno 3410021,size 4096, error 6
Jan  1 21:44:15 nas vm_fault: pager read error, pid 2144 (rpcbind)
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x7ffffffdf000 with size 0x20000 to be written at offset 0x453000 for process rpcbind
Jan  1 21:44:15 nas kernel: Failed to fully fault in a core file segment at VA 0x7ffffffdf000 with size 0x20000 to be written at offset 0x453000 for process rpcbind
Jan  1 21:44:15 nas kernel: pid 2144 (rpcbind), uid 0: exited on signal 11 (core dumped)
Jan  1 21:44:16 nas swap_pager: I/O error - pagein failed; blkno 3408900,size 4096, error 6
Jan  1 21:44:16 nas mountd[36407]: unknown group: 
Jan  1 21:44:17 nas nfsd: can't register svc name
Jan  1 21:44:17 nas swap_pager: I/O error - pagein failed; blkno 3408901,size 4096, error 6
Jan  1 21:44:17 nas root: /etc/rc.d/lockd: WARNING: Unable to force statd. It may already be running.
Jan  1 21:44:17 nas root: /etc/rc.d/lockd: WARNING: failed precmd routine for lockd
Jan  1 21:44:17 nas swap_pager: I/O error - pagein failed; blkno 3408901,size 4096, error 6
Jan  1 21:44:17 nas swap_pager: I/O error - pagein failed; blkno 3408901,size 4096, error 6
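(A note on what those log lines mean: `error 6` is ENXIO ("Device not configured") on FreeBSD, i.e. the swap pager could not read pages back in from the swap device. FreeNAS by default puts a swap partition on each data disk, so a failing da10 can take part of swap with it; when nfsd's and rpcbind's swapped-out pages became unreadable, both daemons segfaulted. Purely as an illustration, not FreeNAS tooling, a sketch that pulls the crashed daemons out of a log like the one above:)

```python
import re

def crashed_daemons(log_text):
    """Return {process_name: pid} for every 'exited on signal 11' line,
    i.e. daemons that segfaulted when their pages could not be read back."""
    crashes = {}
    pat = re.compile(r"pid (\d+) \((\w+)\).*exited on signal 11")
    for line in log_text.splitlines():
        m = pat.search(line)
        if m:
            crashes[m.group(2)] = int(m.group(1))
    return crashes
```

Run over the syslog excerpt above, it reports nfsd (pid 2625) and rpcbind (pid 2144), which matches the NFS outage while SMB (smbd, not swapped to the dead device) kept working.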
 

MasterTacoChief
Tried a reboot; the console started constantly spitting out similar errors. Had to cold-boot the system.

[Attachment: upload_2018-1-1_22-15-21.png (console errors during reboot)]
 
dlavigne (Guest)
Does the issue persist after updating to 11.1? If so, please create a report at bugs.freenas.org and post the issue number here.
 

MasterTacoChief
I had already updated to 11.1, since I had to reboot after the last issue anyway. It ran fine until this morning, when the system completely locked up: I could ping it, but there was no connectivity via SMB, NFS, the GUI, or the local console. Strangely enough, the evening prior I had migrated my vCenter appliance from a local disk on one of my hosts over to this FreeNAS's NFS share so I could apply updates to that host, which is similar to the previous incident. vCenter had also just been patched to the latest version.
I'm wondering whether something about the way vCenter accesses its disk interacts badly with NFS/FreeNAS. I'll migrate vCenter to a different FreeNAS server that serves as a backup and see whether that machine develops the same symptoms.
 

MasterTacoChief
Once I got everything back up (including the email server), I received this message, which appears to have been sent last night. I don't know whether it's a clue to what happened:

Code:
Subject: Cron <root@nas> /usr/local/libexec/nas/scrub -t 7 freenas-boot

Traceback (most recent call last):
  File "/usr/local/bin/midclt", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 491, in main
    with Client(uri=args.uri) as c:
  File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 268, in __init__
    self._ws.connect()
  File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 159, in connect
    rv = super(WSClient, self).connect()
  File "/usr/local/lib/python3.6/site-packages/ws4py/client/__init__.py", line 216, in connect
    bytes = self.sock.recv(128)
socket.timeout: timed out
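(Reading the traceback: `midclt`, the middleware CLI, timed out opening its websocket to the middleware daemon, so the nightly boot-pool scrub job never ran. That's consistent with the middleware being wedged during the lockup rather than a scrub problem itself. Purely as an illustration of the pattern, not FreeNAS code, a generic retry-with-backoff wrapper of the kind you'd put around a flaky connect; all names here are hypothetical:)

```python
import time

def retry(fn, attempts=3, delay=0.5, exceptions=(OSError, TimeoutError)):
    """Call fn(); on a listed exception, sleep and retry with doubling
    delay. Re-raises the last exception if every attempt fails."""
    for i in range(attempts):
        try:
            return fn()
        except exceptions:
            if i == attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2

# Hypothetical usage around a connect that may time out:
#   client = retry(lambda: Client(uri=uri), attempts=5)
```

Of course, a retry only papers over a transient stall; here the middleware never came back, so the cron job failing loudly was arguably the right outcome.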
 

MasterTacoChief
Did some more checking of the logged data in the GUI and found a few interesting things:
11.1 started using swap, and it also reports a lower amount of swap available. Before the update I never really saw swap usage grow, but now it appears to be growing slowly. It looks like my issue occurred right when swap usage hit 4G. Coincidence?
swap.png
swap-zoomedout.png
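(Not part of the original post, but if you want to watch that swap number from a script rather than the GUI, here is a minimal sketch assuming the standard FreeBSD `swapinfo -k` column layout, i.e. Device, 1K-blocks, Used, Avail, Capacity, with a trailing Total row:)

```python
def swap_usage(swapinfo_text):
    """Parse `swapinfo -k` output; return (used_kb, total_kb, percent).
    Sums the per-device rows and skips the header and the Total row."""
    used = total = 0
    for line in swapinfo_text.splitlines()[1:]:        # skip the header row
        parts = line.split()
        if len(parts) < 4 or parts[0] == "Total":      # Total just repeats the sum
            continue
        total += int(parts[1])
        used += int(parts[2])
    return used, total, (round(100 * used / total, 1) if total else 0.0)
```

Polling this from cron and alerting above a threshold would have caught the slow climb toward 4G before the lockup.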

Disk usage increased across multiple volumes, which might correlate with swap usage hitting 4G. A couple of plots from disks on different volumes are attached here.
ZVOL0: General storage
da0.png

ZVOL1: VMware images
da9.png

Security: Zoneminder storage (ZM wasn't even on)
da12.png


CPU usage increased:
cpu.png

There was a drastic increase in non-sleeping/running processes just before it died. I don't know what the grey series corresponds to, so it's hard to say what state they were in:
processes.png

For reference, here is the ZVOL1 storage utilization. The slight uptick is when vCenter was migrated onto the share, and the downtick is when I migrated it off this morning:
vol1 space.png
 
dlavigne (Guest)
11.1-U1 (released yesterday) fixed some of these issues. If your issue persists after updating, please create a report at bugs.freenas.org and post the issue number here.
 