Poor iSCSI VMware ESXi 5.1 Performance


Gailahan

Cadet
Joined
Oct 5, 2013
Messages
2
FreeNAS Hardware:
Tyan S7012 Motherboard
2 x Xeon L5630 Processors
48GB RAM
Areca 1882-24 Raid Controller
10 x 60 GB SSDs
6 x Intel 1Gb NICs connected to an Extreme Networks x450E-48P switch.

Problem: I'm only seeing 40-50 MBps iSCSI disk performance on my VM.

Troubleshooting:
Removed Encryption
Removed Compression
Removed Deduplication
Changed disks on RAID controller to write-through cache instead of write-back.
Changed from ZFS RAID Z2 (10 x pass-through disks on RAID controller) to hardware RAID 6 (configured for ZFS)
Changed from file extent on ZFS RAID Z2 to device extent on RAID controller
Changed MTU from 1500 to 9000 (an end-to-end jumbo-frame check is sketched after this list)
Changed the VMware path selection policy from round robin to fixed
Removed all NICs except for 1 on ESXi host and FreeNAS box
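
For reference, the MTU step can be sanity-checked end to end with something like the following; the addresses are placeholders, and 8972 is the usual payload for a 9000 MTU once the 20-byte IP and 8-byte ICMP headers are subtracted.

    # From the FreeNAS box: don't-fragment ping with a jumbo-sized payload;
    # this fails if any hop in the path is still at MTU 1500.
    ping -D -s 8972 192.168.10.20

    # From the ESXi 5.1 shell, same idea (-d sets the don't-fragment bit).
    vmkping -d -s 8972 192.168.10.10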


Throughout all of my tests the performance has remained between 40-50 MBps. I was going to try switching from ZFS to UFS, but then I decided to just change to a device extent, which did not change the performance. Does anyone else have any ideas? I think I might switch to Windows Server 2012 since I'm running out of things to try.
 

Attachments

  • performance_thick_provisioned.PNG

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
A few things I see wrong:

1. Write caching should be disabled. Read caching should be disabled or set to the most conservative settings. The FreeNAS manual calls for HBAs, and for good reason. As soon as you add in a RAID controller, the RAID controller wants to be in control. But ZFS wants control too, so they compete. And you lose, bigtime. I could enable the write cache on my old controller and it would kill my pool performance by about 90%! Yes, it's THAT big of a deal.
2. Again, from the manual, hardware RAID arrays are a BIG no-no for ZFS. If you want to use hardware RAID (or don't have a choice), then you need to look at UFS.
3. Don't let those big MTUs fool you. They aren't the great performance booster you are probably hoping for. Most people stick with a 1500 MTU and can saturate a Gb LAN with 4-year-old hardware. Not to mention that changing the MTU can cause its own performance and reliability problems with network traffic. If you can't get good performance at 1500 MTU, then you probably have something else to worry about and shouldn't be playing with the MTU.
4. Don't use benchmarks to figure out what is "best" for your server. Benchmarks will lie to you all over the place. They can help validate performance problems, but if you aren't having performance problems you should ignore anything the benchmarks say. In short, don't benchmark your system; just start setting it up how you want it. Only when it's performing slowly should you start asking yourself what is wrong and start benchmarking things.
5. Do a pool benchmark with dd tests (see the sketch after this list). That'll tell you what the raw performance is and can rule out the disk subsystem. Expect this test to come in slower than your hardware should allow if you continue without an HBA.
6. Do an iperf test to ensure you can get at least 850Mb/sec between your FreeNAS server and the ESXi host.
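
For points 5 and 6, something along these lines; the pool name "tank", the addresses, and the sizes are just placeholders, compression must be off on the dataset you write to (zeroes compress away to nothing), and the test file should be bigger than your 48GB of RAM so the ARC can't hide the disks:

    # Raw pool write test: ~100GB of zeroes in 2MB blocks
    dd if=/dev/zero of=/mnt/tank/ddtest bs=2048k count=50000

    # Raw pool read test of the same file
    dd if=/mnt/tank/ddtest of=/dev/null bs=2048k

    # Network test: server side on the FreeNAS box...
    iperf -s
    # ...client side from the other end (another host or a VM on the ESXi box)
    iperf -c 192.168.10.10 -t 30 -i 5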
 

cheezehead

Dabbler
Joined
Oct 3, 2012
Messages
36
1) Check for misaligned VMDKs
2) Is VMware returning any errors?
3) Is your iSCSI network isolated from the rest of the network?
4) I would avoid traditional teaming; assign a single IP per FreeNAS interface and then use MPIO to balance the links (10GbE is always better). A sketch of the ESXi side follows this list.
5) To narrow things down, try creating a non-RAIDed extent and see if you can max out the link; if you can, then the networking stack is OK
6) Check the logs for errors/warnings
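
For point 4, the ESXi 5.1 side looks roughly like this; vmhba33, the vmk ports, and the naa identifier are placeholders for your own, so double-check the syntax against your build:

    # Bind one vmkernel port per physical NIC to the software iSCSI adapter
    esxcli iscsi networkportal add --nic vmk1 --adapter vmhba33
    esxcli iscsi networkportal add --nic vmk2 --adapter vmhba33

    # Confirm the FreeNAS LUN shows multiple paths and check its current policy
    esxcli storage nmp device list

    # Put the LUN on round robin so both paths carry I/O
    esxcli storage nmp device set --device naa.XXXXXXXX --psp VMW_PSP_RR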
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
3. Don't let those big MTUs fool you. They aren't the great performance booster you are probably hoping for. Most people stick with a 1500 MTU and can saturate a Gb LAN with 4-year-old hardware. Not to mention that changing the MTU can cause its own performance and reliability problems with network traffic. If you can't get good performance at 1500 MTU, then you probably have something else to worry about and shouldn't be playing with the MTU.

MTU has the potential to win, but is really only something to do once everything is working well. Too much opportunity to break things further...

http://forums.freenas.org/threads/e...ce-comparison-with-vmxnet-and-intel-em.15320/

Even on an all virtual setup, results can be interesting.

I would worry more about the slow L5630's ... what are you seeing on the filer if you run "top" when it is under load?
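
Something like the following while a copy is in flight; -S includes kernel processes and -H breaks out individual threads, so a single pegged core stands out (the iSCSI target daemon in this FreeNAS release should be istgt, if I recall correctly):

    # Watch the CPU states line and the busiest threads; one thread stuck
    # near 100% would point at a CPU bottleneck rather than the disks.
    top -SH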
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
My problem with that logic is that a NIC should be handling the TCP/IP offload. So the MTU doesn't matter for a local network as much as people want to believe. Your example is for a virtual NIC, if I'm not mistaken. In that case (as in the big push for jumbo MTUs back in the day), the CPU did the actual TCP/IP processing work, so fewer packets were a major boon. This is why I mentioned that even 4-year-old hardware can saturate a Gb NIC.

I have yet to see someone change the MTU on physical hardware and see a significant increase, with one exception. That exception was a database server that always transferred data in sets of 2048 bytes. So changing the connection between the database and the application server to 2048 bytes (plus the 14 bytes for the frame header, if I remember correctly) gave slightly better results. But even then it was only a few percent, for a server that wasn't limited by network speed anyway.

But you are absolutely right that changing the MTU when you are having problems is a big mistake. Plenty of hardware doesn't handle jumbo packets correctly, and more still only supports certain sizes.

I remember buying 4 Intel Gb NICs and a $1000 Gb network switch back in 2003ish. Disabling the TCP/IP offload would max out my CPU at the time and I couldn't hit 400Mb/sec, but enabling it saved the day. I still have 2 of those PCI cards here somewhere. But since I was stuck with a crappy 32-bit/33MHz PCI bus (one of the cards on it being my SCSI controller), I couldn't regularly get ultra-high speeds except to/from my RAM disks.

Is this assessment wrong?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680

Yes, sorry, at least in part wrong. But at least nice and logical!

I'm going to try to keep this accessible for everyone else suffering through this. I'm including some helpful links and being a little more wordy than usual (is that possible?)

First, yes, the example was for entirely virtualized interfaces, which means that there was no actual pesky physical hardware to slow things down, or on the flip side, to speed things up with offload.

Second, hardware offload at gigabit speeds was much more important when we had slower CPU's. Modern CPU's do not necessarily need an assist and can do the heavy lifting at link saturation speeds. On the other hand, gigabit has been around for about 15 years now. With the first generation of adapters, like the Tigon 1 (see: Netgear GA620; I still have some of those somewhere), it was still an era where we were happy to get interfaces smart enough to handle bus mastering efficiently. The idea of offloading the entire processing into silicon was left to the realm of Windows and Novell drivers.

Now the real big problem is that actual full offload is impractical without a manufacturer's assistance in developing it. With only a few exceptions (I'm thinking of Kip Macy's Chelsio drivers), FreeBSD doesn't support TCP Offload (TOE). It is largely seen as something that's very complicated to do in a generalized manner for only modest performance improvements. The Linux people have fought against TOE for years, and the summary on Wikipedia is more complete than anything I could whip up in a few minutes. Please do check it out!

So what you have left on FreeBSD are basically two major subcategories of non-full offload (a quick way to see and toggle them on an interface follows this list):

1) Checksum Offload - many better interfaces include support for TCP and UDP Checksum Offload. This can be a pretty big deal on slower CPU's. It is less of a factor on faster CPU's. Kind of like how we just assume ZFS doing RAID parity calculations in software is okay but it makes us prefer faster CPU's.

2a) TCP Segmentation Offload (TSO) - generally supported on the same sorts of chips that support checksum offload. The idea is that data written to a socket is handed to the device driver in larger chunks (64K?) and instead of the code doing the chopping up and updating the header fields and queuing packets for transmit, all of that is left to the silicon. It is basically handling the most common, rudimentary task... everything else, like retransmits, has to be handled by the driver.

2b) Large Receive Offload (LRO) - takes and merges packets back into larger chunks (more or less the opposite of TSO).
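
On a FreeBSD/FreeNAS box these show up as interface flags and can be toggled for testing, something like the following; em0 is a placeholder, and changes made this way don't survive a reboot:

    # Show the enabled options plus everything the hardware could support
    ifconfig -m em0

    # Turn checksum offload, TSO, and LRO off for an A/B test...
    ifconfig em0 -rxcsum -txcsum -tso -lro

    # ...and back on
    ifconfig em0 rxcsum txcsum tso lro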

So anyways, TSO and LRO do work together to significantly reduce the benefits of jumbo frames. On a network segment where all jumbo traffic is local, that's nearly the end of the story for now... but if you have intervening equipment or WAN requirements, like routers, then it is absolutely jumbo for the win - TSO and LRO are basically drug dealers enabling us to continue our addictions to these teeny packets in a less painful manner.

1500 made lots of sense at 10Mbps, but now we're coming into the era of 10Gbps, fully 1000x faster. We ought to fix CRC32 (the use of which means checksums lose effectiveness if you go much above 9000 MTU) and go for massive packets. :smile:

I will note with some interest that there's a complex set of factors that seems to have held us at around 1Gbps for commodity connections for over a decade, and some small part seems to be that it is now pretty easy to cope with gigabit-level traffic even on smallish devices. By way of comparison, we made the jump from 10Mbps (commodity in 1993) to 1Gbps (1998, actually appearing as commodity by about 2003). So what I think may end up happening is that we'll eventually see the adoption of 10Gbps as commodity over the next several years, but that means the previous two order-of-magnitude increases took ten years total, while this single order-of-magnitude increase is likely to take more like fifteen years (10Gbps commodity around 2018?).

I think part of the trouble is that 1Gbps is really fast enough for so many purposes. You can run video over it, or remote desktops, or all sorts of things. Demand to go past 1Gbps is just kind of soft. So if we're lucky, we might see 100Gbps become a commodity when we're old men. Or possibly never.

As for checksum offload, that's generally useful for the obvious reasons.

Intel's ethernet chips usually support checksum, TSO, and LRO ... and do so with a driver authored by Intel. This means that generally their desktop cards perform nearly as well under FreeBSD as their server cards.

So I rate your message as "in part wrong" because you never use TOE with FreeBSD (Chelsio excepted), but you do use the lesser offload features such as checksum, TSO, and LRO. Those three form a core that gives you a good percentage of what TOE might do for you, so you are also "in part right."
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I wasn't aware that TCP Offload wasn't supported in FreeBSD. My pfSense box (latest RELEASE on an Intel Atom) is definitely affected performance-wise by enabling/disabling the offload it does support, and not at all by bigger packets, up to 9k.

One thing you are leaving out, though, is how the protocol handles packet fragmentation and the reverse. With SMB1 it was a big deal. Packet size directly affected throughput because of how the protocol worked. Back then pre-Vista when SMB1 was king, I ran 2 networks, one for the internet and one for the intranet. I had to do this because of some hardware incompatibility issues. Anyway, thanks to SMB2 the protocol has done some things to make packet size far less important for efficiency. I don't fully understand the mechanism they use for it, but Microsoft published a paper years ago detailing that SMB2 does things to make packet size pretty inconsequential until you go above 16k. And their changes also made latency less important to performance of SMB. And by far most users do SMB. With SMB1 even 1ms of latency would kill SMB performance despite having Gb LAN. Naturally, I don't have anything that goes above 16k, so I'm pretty disappointed that I couldn't benchmark it. I've heard from several sources that NFS isn't appreciably affected by packet size anymore either, but I don't know much about NFS nor could I find a source I'd consider credible or prove it for myself.

I wish I could talk to a network engineer about this stuff. I'd have a bunch of questions for him regarding the flowchart for data packets and how much this stuff actually matters.

I'm just not convinced based on what I've seen personally that packet size matters that much. Couple that with the fact that plenty of hardware can't do MTU above 1500 and you have a recipe for potential disaster.

I spent 3 weeks experimenting with data packets back in 2010 trying to get my server above 50MB/sec when I knew I should get better speeds. Ultimately I never saw even a 10% increase, and quite frequently saw a decrease. The real decision maker for me was that my network printer supports MTU=1500 only, so anytime I wanted to print something I'd get nothing but error pages. So I ended up at 1500 just because of that.

I do appreciate the write up. Nice to confirm that the stuff I read was correct as there's very little detailed information on this stuff for idiots like myself.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
One thing you are leaving out, though, is how the protocol handles packet fragmentation and the reverse.

Because it is irrelevant to the realities. Either fragmentation falls under the auspices of existing TSO and LRO capabilities or it actually requires a full-on TOE implementation, and even there I think some vendors implement the logic control parts of it in the host's stack (technical term: "cheating by marketing dept"). But there I'm mainly talking about stuff I haven't used much, because most of my serious work is under FreeBSD.

With SMB1 it was a big deal. Packet size directly affected throughput because of how the protocol worked. Back then pre-Vista when SMB1 was king, I ran 2 networks, one for the internet and one for the intranet. I had to do this because of some hardware incompatibility issues. Anyway, thanks to SMB2 the protocol has done some things to make packet size far less important for efficiency. I don't fully understand the mechanism they use for it, but Microsoft published a paper years ago detailing that SMB2 does things to make packet size pretty inconsequential until you go above 16k. And their changes also made latency less important to performance of SMB. And by far most users do SMB. With SMB1 even 1ms of latency would kill SMB performance despite having Gb LAN. Naturally, I don't have anything that goes above 16k, so I'm pretty disappointed that I couldn't benchmark it. I've heard from several sources that NFS isn't appreciably affected by packet size anymore either, but I don't know much about NFS nor could I find a source I'd consider credible or prove it for myself.

Speaking a bit generically, a lot of it has to do with how much data you can reasonably have in flight. Sequential small writes are always problematic (a process writes file 1, then file 2, then file 3, and so the OS has to hold off on the close of file 1 to receive acknowledgement from the fileserver before reporting to the process that it has successfully written and closed). Parallel small writes are efficient either way, because you just launch off the writes for 1, 2, and 3 and are then waiting simultaneously. But for large writes, the more data you can have outstanding, the better your speeds.

The problem is that everybody ends up reinventing the packet loss recovery and congestion controls that a protocol such as TCP contains. So whereas NFS was UDP 20 years ago, it is now TCP. UDP made sense 20 years ago. It was small(er) and uncomplicated, and you had some chance of being able to code up a boot ROM so your device could netboot. With the free software movement in its infancy, code sharing and reuse was not as common... anyways, the problem is, you ended up needing to reinvent a lot of the mechanisms that already existed in TCP. With TCP, if you do a large write with NFS, basically all you do is throw it down the pipe and wait for an eventual return code. TCP is also more resistant to various attacks on the communications channel, and is easier to work with across networks.

I don't know specifically what SMB2 does different, but I imagine that larger memory on contemporary machines means it isn't unthinkable to dedicate substantially more memory and optimize for larger transfers.

But with the transition to TCP, things are also a lot muddier. With a UDP protocol and a userland daemon, you knew, for example, that there would be a syscall involved in moving a single packet, which could involve a trap and privilege context switch. More packets meant more of that. Single-threaded meant it didn't scale too well. Etc. Here especially, large packets meant lower overhead and faster service times. But with TCP, a userland daemon can hand off a much larger chunk of data to the kernel and then immediately go on to service other clients, leaving the kernel to do a task that it is optimized to do. My impression is that large packets no longer have that much to do with the actual protocol in use over the TCP channel. It is more a function of how large packets interact with TCP. Is it faster? Is it more efficient? That's why I was focused on iperf numbers with what I did in this thread.
http://forums.freenas.org/threads/e...ce-comparison-with-vmxnet-and-intel-em.15320/
I wish I could talk to a network engineer about this stuff. I'd have a bunch of questions for him regarding the flowchart for data packets and how much this stuff actually matters.

I can put that hat on, but the secret is that things are generally complicated enough that in the real world, all network engineers can actually do is know the general concepts, then experiment and test until they find out what's not letting them go faster. I don't have the depth of intimate protocol and implementation knowledge for SMB, so I'm at more of a disadvantage than people who deal with it daily on large production networks, but I can assure you that even there, the book knowledge is important but usually it is the practical implementation things that clobber you.

I'm just not convinced based on what I've seen personally that packet size matters that much. Couple that with the fact that plenty of hardware can't do MTU above 1500 and you have a recipe for potential disaster.

I spent 3 weeks experimenting with data packets back in 2010 trying to get my server above 50MB/sec when I knew I should get better speeds. Ultimately I never saw even a 10% increase, and quite frequently saw a decrease. The real decision maker for me was that my network printer supports MTU=1500 only, so anytime I wanted to print something I'd get nothing but error pages. So I ended up at 1500 just because of that.

I do appreciate the write up. Nice to confirm that the stuff I read was correct as there's very little detailed information on this stuff for idiots like myself.

Yeah you definitely want a separate segment for large MTU traffic.
 

Gailahan

Cadet
Joined
Oct 5, 2013
Messages
2
I've tried the following now with no difference in performance:
I switched from the L5630s to X5550s for the increase in clock speed even though top didn't show too much load.
I connected drives directly to the motherboard to bypass the RAID controller.
I cabled the servers directly together to bypass the Extreme Networks switch.
I tested with iPerf and I can get full 1 Gbps speed.
I connected a Windows machine to the iSCSI target (Task Manager -> Networking never shows the Local Area Connection going over 50%). A local read test that takes the network out of the picture entirely is sketched below.
I've tried Intel Pro/1000 PT, Intel 82576, and Intel 82574L NICs.
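
Building on cyberjock's dd suggestion above, one more number worth grabbing is a local read of whatever backs the extent on the FreeNAS box itself, which bypasses iSCSI and the network entirely; the path and size below are placeholders, and the target could be a file under /mnt, a zvol under /dev/zvol/<pool>/<name>, or a raw da device:

    # Read ~20GB straight off the backing store, with no iSCSI or network
    # involved; compare this against the 40-50 MBps seen through the VM.
    dd if=/dev/zvol/tank/iscsi-extent of=/dev/null bs=1m count=20000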
 

datnus

Contributor
Joined
Jan 25, 2013
Messages
102
Would you mind sharing the output of "gstat" and "zpool iostat -v 1"?
I'm going to upgrade to an all-SSD pool, and your problem scares me.

I'm a VCI (VMware Certified Instructor). :D
So if you can show me that the disks are not busy, I may know the problem.
VMware may be throttling the IOPS for each VM.
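
That is, something like this on the FreeNAS box while the VM is generating I/O; both commands refresh until you interrupt them:

    # Per-disk busy percentage and per-operation latency
    gstat

    # Per-vdev bandwidth and IOPS, sampled every second
    zpool iostat -v 1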
 