Slow iSCSI Read/Write Performance in ESXi, Fast CIFS, What's Wrong??? Veeam???

Status
Not open for further replies.

zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
Hey all-

So I'm trying to introduce a Veeam backup server into my environment. My environment consists of:

2 x FreeNAS Supermicro servers, 36 bays, 4 x 1Gb NICs, 2 bonded via LACP on one network, 2 bonded via LACP on a second network
10 x VMware ESXi hosts, 2 x 1Gb NICs bonded for Management/Guest access, 2 x 1Gb NICs bonded for iSCSI/vMotion

The ESXi hosts have two vNICs on the iSCSI bonded interfaces, one IP'd in each of the respective networks for the FreeNAS boxes.

This was done so each FreeNAS box had 2 x 1Gb NICs in a 2Gb LACP interface, dropping to two different switches, giving us MPIO.

So, fast forward to the problems ...

I set up Veeam, and when running backups (and I'll keep this simple: yes, they are doing SAN-based backups), the best "read" speed I can get is about 250-300 Mbit/sec (roughly 30 MB/sec) when reading a guest image.

I then checked into this a little further, accessing one of the datastores from my physical vCenter box, and when I try to upload/download from it, I am still only seeing 200-300 Mbit/sec.

The datastores are hooked in to ESXi hosts via iSCSI.

On FreeNAS, I also have some CIFS shares that I created. These CIFS shares reside on the same "volume" that the block VMFS iSCSI devices are on. When I read or write to a CIFS share, I can push 950 Mbit/sec (95 MB/sec) in either direction, which is what I would expect: pegging one leg of the port-channel. Thus I know that:

1) The network is clean
2) The "volume" on the FreeNAS box has the ability to read/write at the speed I'm expecting.

Some additional background on the FreeNAS boxes, in terms of drive setup for each one, they are identical:

1 x 250GB SSD For Boot Drive
2 x 100GB Intel 3700 SSDs for SLOG (shrunk to 20GB each via overprovisioning to increase lifetime) - mirrored
2 x 250GB Samsung 850 Pro SSDs for L2ARC cache - striped
28 x 3TB Seagate ES.3 7200RPM SATA Drives, setup in 14 mirrors, as one huge Volume.
2 x 3TB Seagate ES.3 7200RPM SATA Drives, hot spares

And system info :

144GB RAM Registered ECC Memory
Dual Intel Xeon E5645 CPUs
4 x 1Gb Intel network interfaces


In terms of layout, there are a couple of zvols created: one that is specifically used for VMware iSCSI, and a second that is specifically set up for CIFS.

The entire volume is currently only 41% used.

The only test I have not tried yet is to create another iSCSI zvol, attach it directly to a Windows box, and see how iSCSI performance on an NTFS-formatted volume looks. I figured that taking VMFS out of the picture might lead to something. I don't think it will matter much, though, because on my Veeam box I already have the two FreeNAS VMware iSCSI LUNs mounted to my Windows machine so that Veeam can access the VMFS volumes natively without traversing the ESXi nodes themselves. So in essence I'm already testing iSCSI to Windows.

@jgreco - I'm sure you're reading this now and laughing, thinking, "I remember this putz setting this up before, and now he's done X wrong!" LOL. My hope is that I've detailed enough that you or someone will tell me I'm on the right track and not a complete moron. Any thoughts/help would be greatly appreciated!
 


kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
4 x 1GB Nics, 2 bonded LACP on 1 network, 2 bonded LACP on a 2nd network
10 x VMware ESXi Hosts, 2 1GB Nics bonded for Management/Guest Access, 2 1GB Nics bonded for iSCSI/Vmotion
Nope, nope, nope. iSCSI needs two subnets, not LACP, for MPIO and failover; look into port binding. vMotion can be done with two vmkernel ports bound to two separate NICs on the same subnet (required).
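For reference, a minimal port-binding sketch, to be run on the ESXi host (the adapter and vmk names here are placeholders; check the first command's output for the real ones):

```shell
# Identify the software iSCSI adapter (name varies, e.g. vmhba33)
esxcli iscsi adapter list

# Bind one vmkernel port per iSCSI subnet to the software initiator
esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk3
esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk4

# Confirm both portals are bound
esxcli iscsi networkportal list
```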
The ESXi Hosts have to vnics on the iSCSI bonded interfaces, 1 of each IP'd in the respective networks for the FreeNAS boxes.
You have no failover and no MPIO... see above.
This was done so each FreeNAS had 2 1GB NICs in a 2GB LACP interface, dropping to 2 different switches, giving us MPIO.
Again, MPIO is handled at layer 3, not at layer 2 with etherchannel/LACP.
250-300Mbit/sec (20MB/sec)
In your other post you mentioned setting up a Veeam proxy on the vCenter server and Veeam B&R server is a VM. So your data is taking the following path: FreeNAS(iSCSI) - ESXi(iSCSI) - vCenter(veeam transport) - ESXi(VeeamVM) all over 1000mb links that may or may not be routed from different VLANs.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Each FreeNAS box should have two 1Gb NICs on separate network fabrics, or at least separate VLANs and subnets. Each ESXi host needs a vmkernel port in each subnet, using an override on the portgroup NIC failover order so that only ONE vmnic (physical NIC) is active per iSCSI portgroup. This means the other NICs are listed as unused, NOT in any form of failover mode. The Veeam server you figured out in another post. In an environment like this, a Veeam proxy will only add unneeded complexity.
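A rough sketch of that failover override from the ESXi shell (portgroup and vmnic names are examples; uplinks not listed as active here should end up unused, but verify in the vSwitch settings afterwards):

```shell
# Pin each iSCSI portgroup to a single active uplink
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name="iSCSI-A" --active-uplinks=vmnic0
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name="iSCSI-B" --active-uplinks=vmnic1

# Confirm the resulting active/unused uplink lists
esxcli network vswitch standard portgroup policy failover get \
    --portgroup-name="iSCSI-A"
```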
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
iSCSI and LACP/etherchannel = Oil and water
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Pro tip: once you have the above iSCSI figured out, on each host select the device that the datastore resides on, set the PSP (path selection policy) to round robin, and look into tuning the IO count for the RR rotation.

Yes, I know this is a lot of tedious configuration for 10-12 hosts. This is where host profiles and distributed vSwitches are amazing. Just don't use a vCenter appliance with ephemeral binding.
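Per host, that PSP change could look roughly like this (the `naa.XXXXXXXX` device ID is a placeholder; pull the real one from the device list first):

```shell
# Find the device backing the datastore
esxcli storage nmp device list

# Set the path selection policy for that device to round robin
esxcli storage nmp device set --device=naa.XXXXXXXX --psp=VMW_PSP_RR

# Rotate to the next path after N I/Os instead of the default 1000
esxcli storage nmp psp roundrobin deviceconfig set \
    --device=naa.XXXXXXXX --type=iops --iops=10
```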
 

zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
Pro tip: once you have the above iSCSI figured out, on each host select the device that the datastore resides on, set the PSP (path selection policy) to round robin, and look into tuning the IO count for the RR rotation.

Yes I know this is a lot of tedious configuration for 10-12 hosts. This is where host profiles and distributed vSwitches are amazing. Just don't use a vCenter appliance with ephemeral binding.


I'm actually doing MPIO right now, let me clear some things up:


FreeNAS Server 1
----------------------
LAN1+LAN2 = LACP NIC1 172.16.1.5 /24
LAN3+LAN4 = LACP NIC2 172.16.2.5 /24

FreeNAS Server 2
---------------------
LAN1+LAN2 = LACP NIC1 172.16.1.6 /24
LAN3+LAN4 = LACP NIC2 172.16.2.6 /24

Veeam Server is PHYSICAL box, with 30TB of local storage
-----------------------------------------------------------------------
LAN1+LAN2 = LACP NIC1 172.16.1.10 /24 (used for iSCSI mounting of FreeNAS stores)
LAN3+LAN4 = LACP NIC2 172.16.2.10 /24 (used for iSCSI mounting of FreeNAS stores)
LAN5+LAN6 = LACP NIC3 192.168.1.10 /24

ESXi Hosts
-------------
Vswitch0
LAN1 + LAN2 = Grouped in Vswitch0, not LACP
VMK0 - Management (LAN1 Primary, LAN2 Standby) - 192.168.1.x
VMK1 - VmotionA (LAN1 Active, LAN2 Unused) - 192.168.2.x
VMK2 - VmotionB (LAN2 Active, LAN1 Unused) - 192.168.3.x
VMK3 - FreeNAS-A (LAN1 Active, LAN2 Unused) - 172.16.1.x
VMK4 - FreeNAS-B (LAN2 Active, LAN1 Unused) - 172.16.2.x
For my iSCSI Adaptors, I have added FreeNAS-A and FreeNAS-B under VMkernel Port Bindings (vmk3 and vmk4)
I have added both the 172.16.1 and 172.16.2 addresses to the iSCSI discovery, so for my 2 devices, I have 2 paths for each showing.
I have set it up as RR, and changed the IO count from 1000 to 1

Vswitch1
LAN3 + LAN4 = Grouped in Vswitch1, not LACP
Guest VLANs Tagged

Vcenter Server is PHYSICAL box
----------------------------------------
LAN1+LAN2 = LACP NIC1 - 192.168.1.x




So as you can see, I have all of the correct logic set up for networking. The only "additional" thing here is that the LACP interfaces are there to give me redundancy in the event of a switch failure. On the switch side, it's a Cisco 3850 stack, and for my LACP interfaces, port 1 goes to switch A and port 2 goes to switch B. Obviously, a port-channel is defined for each of the LACP interfaces. That way, should I have a switch failure, I would continue to survive.

Unless LACP itself is actually crippling these interfaces, which I have a hard time believing: from the Veeam box, when I hit a CIFS share on FreeNAS1, I can pull 950 Mbit/sec, which I fully expect. I know how LACP works; a single data path only hits one of the two LACP links, so I'm expecting 1Gb of bandwidth, not 2Gb. So I can't explain why, when accessing the iSCSI mount, I am only reading about 200-300 Mbit/sec.
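One way to sanity-check, from the ESXi shell, that both paths really are active for each device (device IDs will be `naa.*` identifiers specific to your LUNs):

```shell
# Show every path ESXi sees to each device, with its state
# (look for two paths per FreeNAS LUN, both "active")
esxcli storage core path list

# Show the current PSP per device and its round-robin config
esxcli storage nmp device list
```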
 


zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
The problem I see is the SATA SSD. It is limiting the maximum rate data can be sent to the SLOG, which is throttling your synchronous write speed. SMB is always asynchronous, while iSCSI is synchronous.




Why would that be a factor here? This isn't a problem with writing to the iSCSI array; it's a problem with reading from it.
 

zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
So here is a little more food for thought... I added a zvol for testing iSCSI. This one is only 100GB in size, versus the 18TB one currently in use for ESX. It's built in the same master volume as the production ESX one.

That said, when I mount that 100GB iSCSI LUN directly to my Veeam server and format it NTFS, I am able to copy to and read from it at 950 Mbit/sec. So it's getting the full speed of Ethernet, as I would expect. This confirms again that the volume/physical disks on FreeNAS are not the limiting factor here.

So this brings me down to what the hell is going on with VMFS. Might this be a case of the 18TB zvol causing the problem?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Why would that be a factor here? This isn't a problem with writing to the iSCSI array; it's a problem with reading from it.
It may not be a factor since you are only using 1Gb networking.
So this brings me down to what the hell is going on with VMFS. Might this be a case that the Zvol of 18TB is causing the problem?
The factors that @kdragon75 pointed out are relevant, yet you appear to have dismissed them.
 

zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
It may not be a factor since you are only using 1Gb networking.

The factors that @kdragon75 pointed out are relevant, yet you appear to have dismissed them.


I am absolutely not dismissing @kdragon75

My previous post detailed the networking specifics, so that it was clear I do have correct MPIO settings. So the only question here is: does my addition of LACP cause a crippling effect, ONLY with iSCSI, and ONLY with the VMFS iSCSI mount? I have now clearly proven read/write performance works as expected with my smaller 100GB test mount. Like I said, I'm wondering if this is a problem caused by an 18TB block device.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
I am absolutely not dismissing @kdragon75
No, you're not; that was a good write-up and seems to more or less follow best practices. The one thing mentioned, IOPS=1, may be an issue under load, as you may not be taking full advantage of your ctld device queues (FreeNAS). This can be monitored with esxtop. You may also want to triple-check any VLAN CoS and be sure none of your paths are accidentally being routed.
Can you set up a VM on one of your hosts, map your 100GB zvol as an RDM, and test? This will rule out VMFS and tell us if it's the host network stack. You could also try setting your PSP back to fixed and see if that makes a difference.
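For the esxtop check, a rough recipe (interactive, run on the ESXi host; the column names below are the standard ESXi 6.x disk counters):

```shell
# Start the interactive monitor on the host
esxtop
# Press 'u' for the disk-device view, then watch per device:
#   DQLEN     device queue depth
#   ACTV      commands currently active on the device
#   QUED      commands queued in the kernel (nonzero = queue pressure)
#   DAVG/cmd  latency from the device/array
#   KAVG/cmd  latency added inside the VMkernel
```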
Like I said, I'm wondering if this is a problem caused by an 18TB block device.
What version of ESXi and vCenter are you running?
physical vCenter box
Still makes me cringe ;)
 

zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
Still makes me cringe ;)

I promise, when this project is done, I'll upgrade! Honestly, I had wanted to upgrade for some time, but was waiting until Update Manager was integrated, which I know it is now, so yes, it's time!


Anyway, back to the questions.

ESXi 6.0U3, patched with the latest updates
vCenter 6.0.0, 5318200

So I did the test you asked for: mounted the RDM to a Windows 10 instance, formatted it NTFS, and tested. I am able to push 980 Mbit/sec (110 MB/sec). So it's definitely working the way it's supposed to... At this point, I'm completely lost... It almost seems as if it's a Veeam issue...
 

zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
Some more interesting data points ...

CrystalDiskMark running to RDM drive mounted on the VM.

upload_2018-8-9_12-31-16.png


As you can see, we are hitting 95MB/sec read and 109MB/sec write, which is basically close to the theoretical max for 1Gb Ethernet. Based on how this is mounted, this is right, because it would only be using one of the two Ethernet pipes.



CrystalDiskMark running directly to VM C: drive, which is on FreeNAS VMFS mounted 18TB backend.
upload_2018-8-9_12-33-18.png



This shows me we're actually hitting multipath correctly, and I can validate this directly on FreeNAS using systat. When I read/write to the VMFS-mounted drive within ESX, I can see it is actually hitting both MPIO interfaces, thus getting 1.4-ish Gbit/sec... This is what I would be expecting...

So I guess the takeaway here is that the SAN is working like it's supposed to. So this is clearly an issue between Veeam and the SAN.

And I know it's not the network between Veeam and the SAN, because I validated that by mounting the FreeNAS iSCSI drive directly to the Veeam server: I get the same 95-105MB/sec read and write, and I can see the full 1Gb circuit getting pegged on FreeNAS when I do it.

So ... Now I'm more stumped here on the Veeam side ...

And I know it's doing SAN backups on Veeam; you can see the SAN tag.
 


HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
What is the Veeam backup's write target? Local storage, or back on the FreeNAS machine?

If it's reading and writing back to the FreeNAS machine, that would explain the slow write speeds, but not the reads.

Random musings:

Do you have MPIO claim rules set up on your Veeam server?
Are you asking Veeam to do compression/deduplication on your backups?

Also, the default volblocksize in FreeNAS is 16K, I believe - Veeam likes to write in way bigger chunks than that.

If you're writing the backups to the FreeNAS box, have you tried dumping them on an NFS exported dataset with recordsize=1M or similar?
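On the FreeNAS side that test could look roughly like this from the shell (pool and dataset names are examples; the NFS export itself is configured in the FreeNAS UI under Sharing):

```shell
# Check the block size of the zvol Veeam currently writes through
zfs get volblocksize tank/vmware-iscsi

# Create a large-recordsize dataset as a backup target
zfs create -o recordsize=1M -o compression=lz4 tank/veeam-backups

# Confirm the properties took effect
zfs get recordsize,compression tank/veeam-backups
```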
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Does Veeam report the bottleneck during backups? Is there a chance you are being limited by the CPU or memory on your Veeam server?
 

zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
Does Veeam report the bottleneck during backups? Is there a chance you are being limited by the CPU or memory on your Veeam server?

Veeam reports "Source" as the bottleneck.

upload_2018-8-9_18-26-24.png


This was the last incremental backup for a single server.

As far as the Veeam server itself, it is a single Xeon E5520 2.27GHz 4-core with 24GB of RAM. I am adding another E5520 CPU tomorrow, and another 24GB of RAM to take it to 48GB, but I don't expect that will make much of a difference. As far as local storage goes, I have fully tested it, and it definitely should be able to handle 1Gb of traffic writing to the storage array...

upload_2018-8-9_18-34-4.png
 