Out of the blue VMware ESX + iSCSI issues

Nick Binder
Dabbler · Joined: Nov 17, 2013 · Messages: 12
My FreeNAS (9.1.1) machine has been up for 77 days, and my VMware server has been up about the same amount of time, with everything working fine.

Nothing has changed in those 77 days. Suddenly, however, my VMs started crapping out, and VMware is giving me this error:

Device
t10.FreeBSD_iSCSI_Disk______001871ebb4960-
00_________________ performance has
deteriorated. I/O latency increased from average
value of 145564 microseconds to 6148412
microseconds.
warning
11/17/2013 2:43:32 PM

I've been googling for a while now and came across bug 1531, but I don't know why this would show up out of the blue (these VMs are basically a lab and do very little I/O once booted).
The zpool is healthy, etc. Just kind of confused about what would cause this... VMware ESX 5.1.
Thanks
 

Nick Binder
Dabbler · Joined: Nov 17, 2013 · Messages: 12
Looks like tweaking:

vfs.zfs.txg.timeout, vfs.zfs.txg.write_limit_override, and vfs.zfs.write_limit_shift

seemed to fix it. I just don't know why it started acting up in the first place... (or why it took almost 3 months of uptime to present itself)
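
For anyone following along, these are standard FreeBSD ZFS tunables on FreeNAS 9.1, so a rough sketch of experimenting with them from the shell would look something like the lines below. The values are only illustrative, not recommendations, and some of these may be loader tunables that only take effect from /boot/loader.conf (or the FreeNAS Tunables screen) rather than at runtime:

sysctl vfs.zfs.txg.timeout=5                        # seconds between transaction group commits
sysctl vfs.zfs.txg.write_limit_override=1073741824  # cap each txg at roughly 1 GiB (example value)
sysctl vfs.zfs.write_limit_shift=0                  # 0 disables the shift-based write limit (example value)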

I take this all back. As soon as I typed this, I went back to look at VMware:

Device
t10.FreeBSD_iSCSI_Disk______001871ebb4960-
00_________________ performance has
deteriorated. I/O latency increased from average
value of 119202 microseconds to 5030842
microseconds.
warning
11/17/2013 10:17:58 PM
vmesxi2.domain.com

Nothing in /var/log/messages:

Nov 17 19:38:02 nasty kernel: done.
Nov 17 19:42:40 nasty manage.py: [common.pipesubr:57] Popen()ing: /usr/local/bin/warden list -v
Nov 17 19:42:40 nasty last message repeated 2 times
Nov 17 22:08:16 nasty manage.py: [common.pipesubr:57] Popen()ing: /usr/local/bin/warden list -v
Nov 17 22:08:17 nasty last message repeated 2 times
 

jgreco
Resident Grinch · Joined: May 29, 2011 · Messages: 18,680
Check for failing disks or a sudden increase in percent used (which should be kept under 60% per "zpool list"), assuming you have already verified that there isn't a sudden jump in I/O demand via iSCSI.
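
For example, something like this (a quick sketch; the pool name "v1" is taken from the zpool output later in the thread):

zpool list v1          # check the CAP column; keep it under roughly 60% for iSCSI workloads
zpool iostat -v v1 5   # watch per-vdev read/write activity for an unexpected jump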
 

Nick Binder
Dabbler · Joined: Nov 17, 2013 · Messages: 12
Here's the odd thing: why is it saying 1.02x for dedup? I do not have dedup enabled; I've read the warnings about it, and I've stayed away!

It threw the warning at me around 3:39 AM, it appears, and there should be little to no activity at that time (this is a home network, thankfully; the several machines I have in production are not having this issue *knock on wood*). No failing disks that I can see.

zpool list
NAME  SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
v1    14.5T  6.40T  8.10T  44%  1.02x  ONLINE  /mnt
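
As a sanity check on the dedup question, something like the following should confirm it is off everywhere (a sketch; if any dataset reports "on", dedup was enabled at some point, and previously written deduplicated blocks would keep the pool ratio above 1.00x even after turning it off):

zfs get -r dedup v1    # every dataset should report dedup=off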
 

jgreco
Resident Grinch · Joined: May 29, 2011 · Messages: 18,680
3:39AM? FreeBSD machines? Maybe running periodic cron jobs? I can tell you this is definitely one of our heavy I/O periods, due to backups plus maintenance scripts running...

As for the 1.02x thing, I think I've seen that once or twice as well. If you're not actively running dedup, I am guessing it is fine but I don't know why it does that offhand.
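
If you want to rule out the overnight jobs, a quick sketch of checking what FreeBSD/FreeNAS schedules at night:

grep periodic /etc/crontab    # shows when the daily/weekly/monthly periodic runs fire
ls /etc/periodic/daily        # the scripts executed during the daily run
crontab -l                    # any additional cron jobs for the current user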
 

Nick Binder
Dabbler · Joined: Nov 17, 2013 · Messages: 12
Nope, no cron, but now it keeps happening with any activity at all. This is so strange; I see nothing in the logs on the FreeNAS side.
 

jgreco
Resident Grinch · Joined: May 29, 2011 · Messages: 18,680
So all the SMART results are good? What about individual transfer speeds from each disk? I mean, seriously, you don't go from 119ms latency to 5 or 6 seconds latency without being able to identify the culprit....
 

Nick Binder
Dabbler · Joined: Nov 17, 2013 · Messages: 12
I don't know how to test the disks on their own; SMART said all good for everything. This setup had been running just fine for 77 days until a day ago, when these problems showed up.
So far I'm glad it's just happening on my home server and not my servers at work.
 

cyberjock
Inactive Account · Joined: Mar 25, 2012 · Messages: 19,526
How about you post the full SMART data from smartctl -a? Many people run the PASS/FAIL command and assume PASS means a good drive, but that's not the proper way to check a drive.

Edit: smartctl -H /dev/XXX is not the proper way to verify a disk is in good condition.
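
In other words, something along these lines for each disk, with the device names adjusted to whatever the system actually shows (ada0 here is just an example):

camcontrol devlist       # list the attached disks and their device names
smartctl -a /dev/ada0    # full attribute table, error log, and self-test history
                         # pay attention to Reallocated_Sector_Ct, Current_Pending_Sector,
                         # and UDMA_CRC_Error_Count rather than the overall PASS/FAIL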
 

jgreco
Resident Grinch · Joined: May 29, 2011 · Messages: 18,680
Perhaps check to see if any of the disks appear to be responding slowly? Consumer-grade SATA drives in particular like to define "all good" as including "going to hell pretty soon but not quite totally broken".

Try "gstat"? Look for the statistical outlier?

Get a list of your disk devices and run dd read tests from each one, see if one is reading slow?
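
Roughly like this, substituting the actual device names (reading straight from a raw device is non-destructive; ada0 is just a placeholder):

dd if=/dev/ada0 of=/dev/null bs=1m count=4096    # ~4 GiB sequential read; note the transfer rate dd reports at the end

Repeat for each disk and look for the one that is dramatically slower than its siblings.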
 

Nick Binder
Dabbler · Joined: Nov 17, 2013 · Messages: 12
I'm not seeing any anomalies with gstat. I did disconnect my other ESX server from the iSCSI. This volume (which spans all my disks)
houses data outside the 1TB allocated for my ESX iSCSI file extent, so I don't think a dd test would be safe?
 

jgreco
Resident Grinch · Joined: May 29, 2011 · Messages: 18,680
Nick Binder said:
I'm not seeing any anomalies with gstat. I did disconnect my other ESX server from the iSCSI. This volume (which spans all my disks) houses data outside the 1TB allocated for my ESX iSCSI file extent, so I don't think a dd test would be safe?

Reading a disk is usually a low risk proposition.
 

Nick Binder
Dabbler · Joined: Nov 17, 2013 · Messages: 12
Yes, sorry, I misread that one. I'm also able to copy files via SMB at ~120MB/s to and from this server. It seems only iSCSI is acting up.
 

cyberjock
Inactive Account · Joined: Mar 25, 2012 · Messages: 19,526
What are your hardware specs for the server? Do you have more than 16GB of RAM?
 

Nick Binder
Dabbler · Joined: Nov 17, 2013 · Messages: 12
It's a Q9300 (quad core) with 8GB of RAM on a desktop board (this is a home network, remember). I actually created another iSCSI LUN today, attached it to VMware, and moved a couple of servers to it without issues. (The new LUN is on the same server.)
 

cyberjock
Inactive Account · Joined: Mar 25, 2012 · Messages: 19,526
iSCSI needs more than 8GB of RAM; it seems to work better with 16GB to start, and depending on system load it can often need more. I have 20GB of RAM and iSCSI is pretty slow, but I don't care about it being slow right now as I don't use the VM for anything. Don't be fooled by the 120MB/sec on file sharing into assuming that RAM isn't your problem on iSCSI. I can do over 600MB/sec from my pool, but iSCSI on the same pool peaks at about 30-40MB/sec on a good day.

This does mean you are looking at a whole new system, because you can't go over 8GB. But you really should be questioning the value of your data if you are going ZFS with non-ECC RAM anyway. There are stickies on the subject if you are interested.
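
For what it's worth, a rough way to see how much of that 8GB the ARC is actually getting (a sketch using the stock FreeBSD sysctls):

sysctl kstat.zfs.misc.arcstats.size    # current ARC size in bytes
sysctl vfs.zfs.arc_max                 # configured ceiling for the ARC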
 

Nick Binder
Dabbler · Joined: Nov 17, 2013 · Messages: 12
This board accepts up to 32GB of DDR3. This setup has been trouble-free for me for over a year, and I was actually living on 4GB of RAM for the first part of that, on a different board with a dual-core/DDR2 setup. It's a lab setup and in no way meant to be production. The reason I was worried is that I have 3 machines deployed in production, but those have ECC RAM and server-grade motherboards/CPUs/hardware, so maybe I've just hit a ceiling of what my stuff at home can handle. I never had a speed issue with 4/8GB of memory over iSCSI/NFS/SMB, so hopefully I was just lucky, skating by on my hardware.

Like I said, my production machines have had 0 (zero) issues related to iSCSI. Windows ACLs, maybe (but that's a whole different animal :)). Our main server in our main location has 180 people banging on it all day with ~40TB of usable space, and the box doesn't even flinch :) (0.03 load average)
 