Growing pains with ESXi

msmollin · May 11, 2014

Hey everyone. I'm delurking here with a question regarding ESXi + FreeNAS. Really it's about advice and direction with how to increase the performance of my system. Here's some background:

For the better part of 2 years I was using ESXi 4.1 with FreeNAS 9.0. I upgraded FreeNAS to 9.2.0 when that release dropped, with absolutely no issues. I really love this NAS software. The ESXi host mounts via NFS a share off the FreeNAS box for its virtual machines without issue, and I've got ZFS doing hourly snapshots with replication to an attached USB hard drive. Everything was and continues to work well.

Now recently I upgraded the host to ESXi 5.1 (yes I realize 5.5 is out, but there was some weirdness with licensing + dropping support for the C# client that pushed me in the direction of 5.1 for now). The old host only had 8GB of RAM, so it got congested once Minecraft wanted to eat 3GB all by itself. The new host has 32 GB RAM, so now I can give all my machines at least 2 if not 4 (I run a zimbra instance which also eats RAM) and everything is now chugging along very happily.

What I've noticed since moving to the new host hardware is the data store write latency seems higher than I'd like - an average of 150ms. Read latency is great, normally below 5ms and spikes to 75 (I'm guessing when replication bangs on the FreeNAS proc - more on that in a bit). I'd really like to get the write latency average under 100ms, as I notice on some of my VMs that SSH is slow to open a session, and sometimes is slow to respond to terminal, slow to write small files, and this is over the local network so it's not an internet-related problem. Also it's fairly consistent - that 150ms doesn't spike too often.

So here's the current setup:

ESXi Host:
Supermicro X7DBR-3 2x Quad-Core Xeon E5450 3.0GHz w/32GB RAM
No drives in box
ESXi v5.1

FreeNAS Server
HP MicroServer Gen7 AMD Turion w/ 8GB RAM
4x 500GB 2.5" WD Blacks in a ZFS "RAID10" aka two mirrored VDEVs striped together, effectively 1 TB in capacity
FreeNAS 9.2.0

Network:
8 port TRENDnet Gig-E unmanaged switch that has the NAS and ESXi management side of the host plugged in, and uplink to the rest of the network. The VMs are internet facing inside a DMZ, so they go through the other physical interface on the SM box.

So yeah, there's a few ways I could go here. I've read cyberjock's n00b manual front to back, and read the FreeNAS manual (albeit some of the stuff in there needs updating - like the part about root != administrator, which is no longer the case). I've also read several bug reports about ESXi + ZFS which seems to mostly have been resolved in the latest ZFS updates, which I am up to date on as far as 9.2.0 will let me go. Here's paths forward I see:

1. Obviously the Microserver is a little underpowered on processor and RAM. The replication to USB is done via the SSH subsystem in the GUI, so anytime replication hits, both little cores goto 100%, which correlates with the read latency spikes. Yeah I could fix this by setting up the replication job in the CLI, but I really don't want to go behind the GUI's back if I can avoid it. Also, read latency really isn't an issue for 90% of the time. Should I go for something with more RAM / Processor? Would that help the write latency at all?

2. The disks are 7200 RPM consumer drives. which at 2.5" is better than similar 3.5" drives, but still not great. I could move to 3.5" WD Reds, or 2.5" WD Raptors. Would that help? If I got an IBM 1015 I could goto 15K SAS drives, but that kind of cost outlay for a setup that doesn't really garner me any income would be tough to justify.

3. I've read a dedicated ZIL on a small SSD could be beneficial to ESXi loads. I'm running 5 VMs currently, and the Zimbra & Minecraft VMs are fairly busy. Downside here is I'm out of SATA ports on that Microserver, so I'd have to get a HBA to support anything - probably an IBM 1015. Thoughts here?

4. I could upgrade to 9.2.1.x series - there are some ZFS updates in it, but I'd really prefer to not goto 9.2.1.x as it seems like it's a little unstable in the Samba department, which I do turn on occasionally to move ISOs onto the datastore for ESXi to use.

Anything I'm missing? I could also get a real Intel NIC for the HP Server, but my experience in enterprise networking tells me usually that's the last thing you look at. The broadcom NIC in that server is OK - not as bad as a Realtek but certainly not a good intel chip either. I'm not having throughput issues either - I can easily pull 100 MB/s and push close to that. It's this latency that has me scratching my head and wondering if I can do better. Could be the buffer on the NIC is getting saturated, but seems unlikely.

I appreciate any direction anyone can give. Thanks!

--EDIT--
Fixed my RAID10 dyslexic description. Tis what I get for late night typing.

louisk · May 22, 2014

I had similar experience with NFS being slow and ended using iSCSI for my home setup (I would not suggest its use for production). I'm also using jumbo frames. I'm now happily cross-compiling FreeBSD for ARM on my ESXi VM.

msmollin · May 23, 2014

Thanks for the input, Louis. Just to clarify your comment - would you not run iSCSI or NFS in production? I've had NFS in prod for 2+ years now without issues... just performance seems a little worse than it should. I saw 9.2.2 brings the promise of improved iSCSI performance - so I might take a look at switching over when that comes out and appears stable.

I did have a chat with an old coworker who maintained NFS systems and he gave me a few pointers for tuning the NFS subsystem that I am going to try out.

diehard · May 23, 2014

I would absolutely increase the RAM on the FN server before trying much else. 8GB for what you are doing is just not enough.

jgreco · May 23, 2014

VM usage typically has different stressors than typical fileserver use, and the VM rule of thumb is to have double the amount of RAM that the normal rule of thumb would suggest, which means that the floor should be 16GB. You will not break anything by having only 8GB but reduced performance is not a shock. It isn't always obvious what is going on but insufficient RAM leads to the system not being able to cache as much as it ought to. The amount of space reserved for transaction groups is also tied to RAM size, and when combined with the need for ZFS to cache read data in order to more effectively rewrite blocks ... it gets messy. You appear to be doing the right thing with the mirrored vdevs but also be aware that ZFS blocksize can have a significant impact. Changing the blocksize can lead to many different not-always-intuitive performance changes, such as increased fragmentation but also improved write times.

Also be very, very, very aware of the ZFS fill rules. Almost all the time I see people believing that it's perfectly fine to fill their pools past 80%. This is not true, and it is doubly not true for VM use. A pool used for VM's should not be filled past 60% unless you really understand what you're doing. Pathological cases can actually mean that fragmentation and reduced performance start affecting you even at much lower percentage fill rates ... such as 10%! So how full are you actually filling your pool?

Depending on your answer to that, the areas I'd hit first for improvement are either RAM or disk.

msmollin · May 23, 2014

diehard / jgreco - Hmmm you could be right. I concur that VM loads are quite different from your standard fileserver, which is why I wrote this post in the first place because much of the documentation and performance testing I've found has been fileserver and not VM/SAN-type loads.

Currently I'm at 37% full according to the numbers in the Active Volumes pane of the Storage section, so I shouldn't be hitting those fill rules yet. I knew about the 80% rule, but wasn't aware that it starts to degrade past 60%... that's interesting.

So it sounds like RAM is the first place to start, which means a beefier server as the Gen7 HP Micros will only accept 8 max - will probably go with a clone of that supermicro server I'm using for my host, which subsequently will get me on Intel NICs as well.

If I have the funds for it, do you guys think putting 10k drives in like WD raptors would help too? I'm probably going to do up the new server with 9.2.1.x and just set it up as a replication target, so I won't be reusing the current scorpios in the new system. What about a small SSD for a dedicated ZIL, like something with SLC NAND?

I'm also going to try the NFS tuning suggestions and report back as sourcing the new hardware will take a little time.

msmollin · May 23, 2014

Oh I forgot to mention I didn't alter the blocksize from default because (to be honest) I had no idea what the impact would mean for VM loads, and I had read all kinds of posts about issues arising from doing that. I do have some experience managing RAID controllers and I know from mucking with the controller's stripe size you can have similar problems if your file sizes don't match up with the stripe sizes well.

jgreco · May 23, 2014

No, you misunderstand ... it's degrading at the point you're at. 60% full is the VM/SAN equivalent of 80% on a typical ZFS server - the point you should avoid reaching, the point at which it may get real bad. This isn't true for every case and every workload, it is merely a rule of thumb ... much as you can disregard 80% on a typical ZFS server and may be able to get away with it. However there is evidence that suggests fragmentation issues and performance falloff can happen even at 10% capacity, http://blog.delphix.com/uday/2013/02/19/78/ and some environments only fill their pools to 20-30% before they start feeling the pain.

The behaviour of complex systems is difficult to forecast and may not be safely predicted based on past or current performance, since VM images tend to increase fragmentation over time.

So one thing to consider is: regardless of what the HP specs say, it WILL take 16GB. I've got an N36L with two Kingston 8GB 1333 sticks in it and it is very happy;

Code:

CPU: AMD Athlon(tm) II Neo N36L Dual-Core Processor (1297.85-MHz K8-class CPU)
  Origin = "AuthenticAMD"  Id = 0x100f63  Family = 10  Model = 6  Stepping = 3
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x802009<SSE3,MON,CX16,POPCNT>
  AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x8377f<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,Prefetch,OSVW,IBS,SKINIT,WDT,NodeId>
  TSC: P-state invariant
real memory  = 17179869184 (16384 MB)
avail memory = 16404869120 (15644 MB)

Your system is never going to be made substantially faster by selecting drives with a faster rotational speed. You are likely to see the greatest impact from one of the following:
1) Increase memory to at least 16GB (the Microserver being maxxed out by that)
and/or
2) Increase the amount of free space in the pool by replacing the drives with substantially larger drives. If you do not expect to grow your storage needs, maybe something like 4 x 2TB drives.
While I cannot promise that this would solve your problem, it seems to me like the easiest things to try, and not as expensive as replacing an entire server. Also the parts would be usable even if you did upgrade your server, assuming you plan for that possibility.

msmollin · May 23, 2014

jgreco - I see what you're saying now, and honestly I missed that part about 10% being the start of the fall off point. My apologies - needed more coffee it would seem before responding.

Hmmm this is very interesting. I will definitely try boosting the ram to 16 GB - thanks for that tidbit. Silly HP and their incorrect specs of equipment. I'll report back here as I make changes over the next couple weeks as I don't have immediate access to the server.

gpsguy · May 23, 2014

Here's what I'm using in my N54L.

Kingston 16GB (2 x 8GB) 240-Pin DDR3 SDRAM DDR3 1333
ECC Unbuffered Server Memory Model KVR1333D3E9SK2/16G

jgreco · May 23, 2014

The only problem with that is that then you're buying previous generation memory...

louisk · May 23, 2014

msmollin said:
Thanks for the input, Louis. Just to clarify your comment - would you not run iSCSI or NFS in production? I've had NFS in prod for 2+ years now without issues... just performance seems a little worse than it should. I saw 9.2.2 brings the promise of improved iSCSI performance - so I might take a look at switching over when that comes out and appears stable.

I did have a chat with an old coworker who maintained NFS systems and he gave me a few pointers for tuning the NFS subsystem that I am going to try out.

I have a philosophical issue with running a block device over a best effort protocol. Systems don't like having disks disappear, or become too latent and behave erratically. It's not uncommon for kernel panics to occur. For production, I try very hard to stay far away from anything that causes kernel panics (of any kind). If you want to do block level San storage, use fiber channel.

NFS is different, it's not exporting a block device. If you can get good speed using NFS (lots of installations using NetApp are done with NFS), I'm in favor. I have no issues with NFS in production.

jgreco · May 28, 2014

Also a pointer to a more comprehensive discussion

http://forums.freenas.org/index.php?threads/horrible-performance-even-with-sync-disabled.18382/

diehard · May 29, 2014

jgreco said:
Also a pointer to a more comprehensive discussion

http://forums.freenas.org/index.php?threads/horrible-performance-even-with-sync-disabled.18382/

jgreco, in that thread you mentioned using an LSI2208 + drives as an SLOG, any more thoughts on that? Do you just present a mirror to FN to use for the ZIL? Ever tried it with SSD's? I imagine being able to use the cache on the 2208 could really help lower the write latency compared to just an SSD?

acook8103 · May 29, 2014

diehard said:
jgreco, in that thread you mentioned using an LSI2208 + drives as an SLOG, any more thoughts on that? Do you just present a mirror to FN to use for the ZIL? Ever tried it with SSD's? I imagine being able to use the cache on the 2208 could really help lower the write latency compared to just an SSD?

I can't speak specifically to the LSI2208, but I implemented this setup based on the post you mentioned. I bought a cheap PCI-Express card off eBay that had a 128MB write-cache and BBU. (I do have a UPS, but I'm too scared to turn set Sync=Disabled.) I connected an old 80GB spinning rust drive and made a 4GB disk on via controller. I don't know if the relative sizes, etc are correct, but the setup drastically reduced my latency. Pure write speed is little changed since I'm limited by how quickly the 128MB write cache can dump to the backing disk, but for random writes, it's amazing. Not doing anything nearly as intensive as you, and I can only check the last hour, but I'm averaging 4ms write latency.

Important Announcement for the TrueNAS Community.

Growing pains with ESXi

msmollin

Cadet

louisk

Patron

msmollin

Cadet

diehard

Contributor

jgreco

Resident Grinch

msmollin

Cadet

msmollin

Cadet

jgreco

Resident Grinch

msmollin

Cadet

gpsguy

Active Member

jgreco

Resident Grinch

louisk

Patron

jgreco

Resident Grinch

diehard

Contributor

acook8103

Dabbler

Similar threads

Important Announcement for the TrueNAS Community.

Growing pains with ESXi

Cadet

Patron

Cadet

Contributor

Resident Grinch

Cadet

Cadet

Resident Grinch

Cadet

Active Member

Resident Grinch

Patron

Resident Grinch

Contributor

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Growing pains with ESXi"

Similar threads