Hyper-V VM Storage Suggestions

Steven Sedory · Jul 26, 2015

Hey there,

I'm reaching out to other virtualization and storage engineers for suggestions on configuring my new hardware for the best Hyper V VM storage setup. I have two nodes that are setup in a failover cluster. I have no problem setting up the cluster/storage, but I'm sure there is someone out there who has achieved the best performance.

Note: this cluster is for hosting local VM's (servers) on-site for a client. There will be roughly 15 Server 2012 R2 VM's running on this cluster that each run various basic services (AD, DNS, DHCP, small SQL DB's, Exchange, etc.)

What I have for the SAN:
-Dell R720xd
-2x six core cpu's
-128GB RAM
-24x 15k 600GB SAS drives
-2x 10GB NIC's

What I have for the nodes (two of the following):
-Dell R620
-2x eight core cpu's
-64GB RAM
-8x 10k 300GB SAS drives
-2x 10GB NIC's

Right now, I have the cluster setup with each node access an iSCSI target/extent that is presented on FreeNAS (running on the 720xd).
-The iSCSI target is accessed via a single portal (single IP),
-Single IP is a LAGG interface setup with LACP on the two 10GB NIC's (with a virtual port channels on the two Nexus switches they are plugged into)
-no authentication or any advanced iSCSI settings set
-using a zvol, not file based extent
-have set 4K block in two places when asked (as NTFS uses 4K by default)
-dedup is on
-compression is off
-Single LUN from all this is 4.5TB (I made the zvol more than small enough for the "80% ZFS rule", as the I could of made it as large as 5.8TB if I remember correctly)

I'm writing this because performance is sub par. I was copying some VMs to the CSV (the cluster shared volume established on top of the iSCSI LUN), and once the ARC cache filled up, the transfer went from around 1Gbps (1GB, not 10GB) to completely halting at times, and then resuming for periods with only small spikes up to a couple hundred Mbps. Please note that I did not have jumbo frames enabled at that time (I'm going to reconfigure the LAGG here shortly and try again). I'm sure the jumbo frames will help, but I don't think it's the solution to the "locking up" of the LUN when transferring large amounts of data.

Your input is much appreciated. I'm open to SMB 3.0 for the CSV's, or whatever other solutions are out there. Thank in advance guys.

depasseg · Jul 26, 2015

How are your pools setup? I mean how many drives and what traffic type?

anodos · Jul 27, 2015

ISCSI is your only option for Hyper-V hosting. Samba does not support it.

Sounds like a sub-optimal hardware (like using a RAID card instead of an HBA). What RAID card?

How is pool configured? Note that for this use case you will need striped mirrors. RAIDZ won't be fast enough.

Lastly, performance of Intel 10 gigabit cards is not great because of lack of quality drivers in FreeBSD 9.3. You're better off with chelsio. However, this won't be so bad as what you are recounting. My money is on a RAID card.

L · Jul 27, 2015

Do people usually use dedup for hyper-v? Also i always run compression on

anodos · Jul 27, 2015

Linda Kateley said:
Do people usually use dedup for hyper-v? Also i always run compression on

Wow! I was reading on my cell phone screen and totally missed that as well. Dedup can also cause serious performance degradation.

sfcredfox · Jul 28, 2015

Here's my two cents, not an expert, but i've been running a datastore on my FreeNAS for a few years now.
Try any/none of my suggestions and see if it helps:

As many have stated, the pool needs to be mirrors. Mine is currently sets of Z1, but I'd bet money it would perform a little better if it was mirrored. Make the call if you'll get enough storage space for all your VMs. You don't want the system to go over 80%. JGRECO even recommends trying to keep more free space than that. He's said 50% or less before on some posts.

Steven Sedory said:
iSCSI target/extent that is presented on FreeNAS (running on the 720xd).
-using a zvol, not file based extent
-have set 4K block in two places when asked (as NTFS uses 4K by default)

Doing device is the correct way to do it as far as I'm concerned.
I left all the block sizes default. I forget what FreeNAS does by default, it's either 8K or 128K, but mine seems fine.
If I'm not mistaken, VMware's VMFS block size is 1M. Not sure what Microsoft uses for datastore file system. NTFS?
Maybe setting the 4K size is the right thing to do for your scenario, out of my depth on that.

Steven Sedory said:
-The iSCSI target is accessed via a single portal (single IP),
-Single IP is a LAGG interface setup with LACP on the two 10GB NIC's (with a virtual port channels on the two Nexus switches they are plugged into)
-no authentication or any advanced iSCSI settings set

Unless I'm mistaken, I don't think that sounds right. I don't think LAGGs work for iSCSI. Again, search for some posts from JGRECO, but I think they say to setup each interface with its own subnet and VLAN. Mine is set up with each iSCSI interface having it's own IP in a different subnet and segregated on the switch by VLANs. Each host system has a VM kernal port for iSCSI in each of these VLANs with the datastore doing round robin. Basically letting VMware do the load balancing and not getting into network layer stuff.

Verify doing LAGG for iSCSI is the best way, because my memory says no. I'm not sure.

Steven Sedory said:
-dedup is on
-compression is off

I'd check this too. Dedup is demanding, but you have a decent amount of RAM. No sure on this one. I'd try without for a while and see if you can spare the space of not doing it.

Compression, I'd turn it on back to default. LZ4?
Compression will definitely help in this case. The default compression puts little overhead on the processor, which your head unit has plenty of, though I don't know how much dedup is using since I don't do it.

You can absolutely find posts on here about compression being of benefit. It basically reduced the amount of data that has to be written and read from the pool, and puts little load on the CPU. I do it with no issues. My performance goes down when I turn it off.

Steven Sedory said:
-Single LUN from all this is 4.5TB (I made the zvol more than small enough for the "80% ZFS rule", as the I could of made it as large as 5.8TB if I remember correctly)

Have you tested doing multiple targets and multiple LUNs?
All your traffic is being handled by one queue at default I think is 32. This starts to get beyond my conform of speaking to, but I wonder if there is some benefit to doing multiple targets, each with their own LUN. Most storage venders have some recommended configuration of the amount of VMs per datastore and the amount of datastores/LUNs to use, etc. I don't think I've ever seen a vender say 'create just one massive datastore and one massive LUN and put everything in there'. I could be totally wrong and there is zero benefit for FreeNAS, but I'm just following (maybe blindly) what the rest of the industry is doing. I guess I don't know of any problems either way though. My system does fine the way it is, it just might be something to research and test. Hell, if you find out I'm retarded, let me know where you found better info so I can get smarter :)

Steven Sedory · Jul 28, 2015

depasseg said:
How are your pools setup? I mean how many drives and what traffic type?

22 drives are in a ZFS RAID10, with the 23rd drive being a Hot Spare, and the 24th being a warm.

Steven Sedory · Jul 28, 2015

anodos said:
ISCSI is your only option for Hyper-V hosting. Samba does not support it.

Sounds like a sub-optimal hardware (like using a RAID card instead of an HBA). What RAID card?

How is pool configured? Note that for this use case you will need striped mirrors. RAIDZ won't be fast enough.

Lastly, performance of Intel 10 gigabit cards is not great because of lack of quality drivers in FreeBSD 9.3. You're better off with chelsio. However, this won't be so bad as what you are recounting. My money is on a RAID card.

Sorry, should have mentioned that. We're using an HBA: LSI 9207-8i flashed in IT mode with R16 firmware (to match FreeNAS 9's R16 drivers).

Steven Sedory · Jul 28, 2015

Linda Kateley said:
Do people usually use dedup for hyper-v? Also i always run compression on

Compression even with dedup?

Steven Sedory · Jul 28, 2015

Just wanted to say thanks for all your input. Here I post something Sunday night in a somewhat "urgent" search for answers, and I don't even respond until Tuesday night.

Since we've only virtualized a handful of the physical servers so far, I moved the VMs off of the cluster and onto the individual host volumes, allowing me to play around with the volumes and run some IO Meter tests against them.

Again, still open for suggestions and tips.

What I'm going to try next....I know this isn't scientific, but I don't have time to test one thing at a time:
-disable dedup
-enable default compression
-use multiple portals and Multi-PathIO instead of LACP

sfcredfox · Jul 29, 2015

I reread some of the posts, but I don't see any mention of your pool's SYNC setting or a SLOG?

I assume you are running conservative and set SYNC=aways? Which means unless you have a SLOG that didn't get mentoned, the pools default ZIL is eating all the SYNC writes from HyperV.

Even with 24 disks, I'm not sure that's enough to handle the double writing effect of having the SYNC writes hit the pool.

So, do you have a SLOG setup? If yes, what is it?
If not, that might be the cause of the start/stop IO.
This post talks about start/stop IO on a system with lots of memory and disks. If I were you, I would review the basics and follow some of their performance evaluations before even thinking about the tuning settings they did.
https://forums.freenas.org/index.php?threads/freenas-9-1-performance.13506/

I run all the default settings and seem to get away with OK performance. I don't really know if its possible to say it's great, or terrible, but my system handles the load that's on it. Giggity.

mbucknell · Aug 4, 2015

I saw you were using dedup with a zvol, i have found that with iscsi or FC, this will stall out and cause disconnects, if you want to use dedup, then try a file extent, it will work and not drop out, but performance can be poor.

HoneyBadger · Aug 4, 2015

I'm way behind on this but I'll get in here anyways.

As mentioned by @sfcredfox you should have sync=always set for safety (especially because this is "hosting for a client" - do not play fast and loose with anyone's data but your own) and for this you'll want an SLOG device to absorb the writes in a timely manner.

Right now, in addition to your potential dedup/compression misconfiguration, it sounds like you're using sync=default (which in iSCSI's case is "off") and your network is outrunning your zpool's ability to absorb writes. A well-performing 10Gbps network will deliver about 1GB (gigaBYTE) per second. Your zpool can probably handle that amount of data, but only in a sequential write scenario where your disks can perform at optimal speed. Throw a read or two and a seek (checksums, anyone?) and that data-to-disk throughput is going to slow down hard; however, your network will continue to merrily ingest data into RAM (because you're running async) until ZFS goes "whoa, hold on there champ, I have two full transaction groups" and straight up blocks writes until it's able to flush Group#1 to disk. Then it will blaze along for another five seconds refilling it ... and stall again. This gives you the inconsistent, "peaky" performance you're seeing.

There's a couple ways to solve it but let's start with the more important bit: you should have an SLOG device for the safety of your client's data. Unless you opt for an NVMe-based SSD like the Intel 750 or P3700, any SLOG you choose should be slower than your network pipe and will have the knock-on effect of throttling your now-sync writes down to a level where your zpool should be able to keep pace.

Steven Sedory said:
What I'm going to try next....I know this isn't scientific, but I don't have time to test one thing at a time:
-disable dedup
-enable default compression
-use multiple portals and Multi-PathIO instead of LACP

No, it's not scientific, but I'm 100% in favor of all of these changes:

Deduplication is only worth it in very specific cases, and "general Windows servers" for the most part isn't one of them. Disable it.
Enable the default (LZ4) compression scheme. You've got more than enough processor to handle it.
Ditch the LACP/VPC setup with the Nexus switches and run them as two independent IP addresses. Set up round-robin for the Hyper-V hosts.

Now, from your initial post, you've also specified a 4K block size because "that's what NTFS likes" ... except that this is probably also making deduplication have a way bigger impact than you expect, since ZFS is now only allowed to write a maximum 4K block - and it then has to store an entry in the dedup table (ddt) for each of those blocks. The blocksize is also only an upper bound - ZFS can write smaller blocks if it determines it's a good idea. Nix the blocksize override and let ZFS decide.

Jumbo frames are probably more trouble than they're worth, skip 'em. Focus on obtaining an SLOG and fixing the dedup/compression/MPIO/blocksize setup before testing with/without jumbo frames.

(Please excuse any misrememberings, I've been out of the loop for a bit, so if some of the bugbears I'm referring to have been fixed, please correct me.)

Steven Sedory · Aug 8, 2015

Wow, thank you all. Your input is invaluable and much appreciated. This afternoon is the time I set aside to work on this (finally).

Today, I will be doing the following:
-set sync=always
-disable dedup
-enable LZ4
-ditch 4K block size (use default setting)
-play with jumbo frames later
-ditch LACP/VPC, setup MPIO

As for the SLOG, I haven't setup/worked-with one before. I just finished reading most of the first page of "Some insight into SLOG/ZIL..." found at https://forums.freenas.org/index.php?threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

Based on that post, I understand that I need to do the following: "As a result, tuning the size of a transaction group to be appropriate to a pool is advised, and since that maximum size is directly related to SLOG sizing, it is all tied together."

Can anyone give me their expertise on this? If you were me, with the setup described up top, what would you do? As all drive bays are full, should I just take two out of the normal data pool, and stripe them together to create a decent speed SLOG? If I go with a "supercapacitor SSD" as mentioned in the post I linked above, is that okay to mix with the SAS drives on the controller/backplane that I'm using?

Thanks again for all your input.

sfcredfox · Aug 9, 2015

Just based on some reading and not number crunching, there might be room for concern in that case. There's more qualified people for that.

I read this:
'Transitioning to 6Gb/s SAS (Serial-Attached SCSI)' / A Dell® White Paper / 2009
On page 7, it mentions you need 20 drives to saturate the bandwidth of SAS2.

If you're using 24 drives on that backplane, it might be an issue. What about the two 2.5" slots in the back? Is that on a separate controller? I don't know Dell much.

Steven Sedory · Aug 10, 2015

Hey all, wanted to give an update:

I did all of the following:
-set sync=always
-disable dedup
-enable LZ4
-ditch 4K block size (use default setting)
-ditch LACP/VPC, setup MPIO

The SAN is no longer locking up as before. Great news!

When I was transferring the virtual disk files from local storage (which is six 450GB 10K sas drives in a hardware RAID10), the transfers were 40-50 MBps. That volume should have a much higher read rate, so I'm assuming the bottleneck now is the lack of a SLOG, since I have sync=always set now. Is this a true assumption?

Again I ask, if anyone can give some insight as to a good SLOG setup based on our specs, it would be much appreciated.

sfcredfox, we are using all 24 slots available on the backplane. There is onboard SATA, as well as a PERC7 card that we could tap into and route cabling to the two available drive slots on the back of the chassis.

Steven Sedory · Aug 10, 2015

For the record, just copied a 12GB virtual disk file from the SAN back to local storage on the host (same RAID10 as mentioned above), and it copied at approx 150MBps. So Read is obviously much faster than write for the SAN (3-4 times faster).

zambanini · Aug 11, 2015

keep in mind, that disabling dedup does not remove the dedup arc blocks. so also mem will be still in use. the only way to solve it: recreate the pool from scratch.

like honeybadger told u: a nvme ssd would be a solution. it would be much cheaper and also faster then ZEUSram. u can also use one for a l2arc. but then u also need more ram.

sfcredfox · Aug 11, 2015

Steven Sedory said:
sfcredfox, we are using all 24 slots available on the backplane. There is onboard SATA, as well as a PERC7 card that we could tap into and route cabling to the two available drive slots on the back of the chassis.

If any of those two controller options are fast enough (probably 6G or above), you should be OK. I'm using a SATA SSD for my SLOG and it works fairly well, or at least good enough for my workload.

I'd be curious what your SYNC writes per second in MB are if I was you. If you can use zilstat or pop a drive in there for a temp SLOG, you can assess what your typical and max load is, might help plan. I am using an Intel 3500, it gets about 100MBps of writting. I was planning on striping it with up to two more drives if I start exceeding it's capability.

Since your chasis has two slots in the back, those might be perfect? Since you only get two, and it seems like you're a little more serious, maybe you can upgrade to the Intel 3700 (or equal). I believe that gets 200MBps, so if you stripe that, you're looking at 400MBps. Keep in mind that your SLOG might end up being the bottleneck of the whole pool, so plan accordingly. SLOG sets the SYNC ACK rate as you know.

Hope this is helpful or at least gives you something to experiment with.

HoneyBadger · Aug 11, 2015

Steven,

Great to see the improvements in stability there; while your peak speed has definitely dropped, it's now consistent. And 40-50MB/s (assuming that's "megabytes per second" as intended) is pretty respectable for sync writes without an SLOG device. You're on the mark with lack of SLOG being your bottleneck now.

Since you're dealing with client data I'd suggest mirrored SLOG devices. With a single SLOG there's a very, very minor data loss risk (loss of power to your FreeNAS machine and simultaneous failure of your SLOG) - so having a second one makes it that much less likely. For the cost of a second SSD, it's worth mitigating that risk.

If you have the PCIe slots available I'd suggest an Intel P3700 or Intel 750 NVMe SSD; although at this point you'll want to set the tunable value of "vfs.zfs.vdev.trim_on_init=0" otherwise FreeNAS tries to do a full TRIM pass against the device on initial load, which takes a significant amount of time (think "hours") and makes the GUI nonresponsive during.

If you don't have free PCIe slots then the next best would be the Intel S3710 or S3610 series SSDs wired to the onboard SATA controller. Get at least the 200GB models - although you'll never use all the space, they have more physical NAND (meaning faster write speeds) than the smaller ones.

Important Announcement for the TrueNAS Community.

Hyper-V VM Storage Suggestions

Explorer

FreeNAS Replicant

Sambassador

L

Guest

Sambassador

Patron

Explorer

Explorer

Explorer

Explorer

Patron

Cadet

actually does care

Explorer

Patron

Explorer

Explorer

Patron

Patron

actually does care

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Hyper-V VM Storage Suggestions"

Similar threads