ESXi iSCSI and FreeNAS - Bottleneck Identification

EnlightenCor

Cadet
Joined
Jul 28, 2020
Messages
4
First, I'll say I have been a long-time lurker on these forums and I've tried to do a fair bit of investigation and problem solving on my own, but I think I've gone as far as I can and would like some insight from those who have been around and in the mix a lot longer than I have.

I'm attempting to use FreeNAS iSCSI with ESXi 6.5 (Dell R610). I am not seeing the performance I believe I could attain. At this point I'm leaning towards my drives being the bottleneck, but before spending any further cash I would like others to take a look.

Continued.... after the Hardware section.

Needed Information:
FreeNAS Version:
FreeNAS-11.3-U3.2

Hardware:
MOBO: SuperMicro X8SIE
Intel(R) Xeon(R) CPU X3460 @ 2.80GHz amd64
32 GB ECC memory. Memory brands match and the modules are in the correct slots per the Supermicro manual.

02:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)

03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8169 PCI Gigabit Ethernet Controller (rev 10)

LSI SAS2008 PCIe HBA flashed to IT mode.

Drive Configuration:
2X4 Mirror

Drive Models:
5 of ATA ST2000DL003 1.82 TB (5900 RPM)
2 of ATA ST2000DM001 1.82 TB (7200 RPM)
1 of ATA ST32000542AS 1.82 TB (5900 RPM)

1 KINGSTON SUV400S37240G 240GB as a LOG device
(It was a spare SSD lying around. Including it seemed to improve performance)

DD_Tests:
1) DD 10GB write of Zeros
dd if=/dev/zero of=/mnt/tank/ddtest1.dat bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 3.951187 secs (2653825470 bytes/sec)

2) DD 10GB read of Zeros
dd of=/dev/null if=/mnt/tank/ddtest1.dat bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 1.702495 secs (6159055428 bytes/sec)

3) DD 20GB Write of Zeros
dd if=/dev/zero of=/mnt/tank/ddtest2.dat bs=2048k count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 7.574889 secs (2768557911 bytes/sec)

4) DD 20GB read of Zeros
dd of=/dev/null if=/mnt/tank/ddtest2.dat bs=2048k count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 3.387624 secs (6190628305 bytes/sec)

5) DD 10GB write of Random (CPU Intensive, but CPU only 14% Utilized)
dd if=/dev/random of=/mnt/tank/ddtest3.dat bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 100.059708 secs (104795029 bytes/sec)

6) DD 10GB Read of Random (Same CPU utilization)
dd of=/dev/null if=/mnt/tank/ddtest3.dat bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 3.565204 secs (2941138916 bytes/sec)

The ESXi 6.5 server (Dell R610) has 2 iSCSI datastores from the FreeNAS box. One is over the 10Gb link, and the other is over two 1Gb iSCSI links. The 1Gb setup goes through another vSwitch entirely.

When spinning up VMs, I can't seem to get beyond 30 to 40 MB/s on the disks and datastores, maybe 50 MB/s if I am lucky. It doesn't matter whether I am using the 10Gb datastore or the 1Gb datastore; I don't see much of a difference between them in this scenario.

I can see consistent writes across all disks, usually 35 to 50 MB/s.

The CPU rarely goes beyond 10% utilization and load is a consistent ~0.16.
I've tested iperf between the FreeNAS box and the ESXi host and it's a sustained 9 Gb/s either way, so networking seems okay. I haven't ruled out a better driver for the 10Gb NIC (currently the nmlx4 driver), but I'm not sure that's an issue.
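
For reference, the throughput check was along these lines - a minimal sketch assuming classic iperf on the FreeNAS side and an iperf binary available on the ESXi host; the IP address is a placeholder:

# On the FreeNAS box - start the server side
iperf -s
# On the ESXi host - push traffic over the 10Gb link for 30 seconds
iperf -c 192.168.10.10 -t 30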

I've added more RAM (was 12 GB, now 32 GB) because I know iSCSI likes a lot of RAM, but when it's cranking away I still have 15 GB of RAM unused, and I didn't see a huge difference in performance.

I added an iSCSI connection from a separate Windows box over a 1Gb link; it starts off well but then settles around 65 MB/s.

That being said, I'm pretty confident the bottleneck is in the drives I have, but before I run out and purchase more I want to do whatever further testing will pin down any outliers that may exist. IOMeter read tests are good (160 MB/s, 32k, 100% reads), but with any mix of writes the performance deteriorates.

I've tried different combinations of pools with and without the SSD.
4X2 Mirrors
4X2 Stripe within Vdev. Lots of space available, but not really reliable in the long run.

The workloads aren't anything special, mostly just testing things like Ansible, Terraform, and some shoddy Python applications. I like having the ability to test setups and spin up a VM for testing. It doesn't need to be lightning-fast, but any improvements are always appreciated.

Thanks and I hope I've listed enough info to get everyone up to speed.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
The 1Gb setup goes through another vSwitch entirely.

The recommended setup is one vSwitch per iSCSI link. Here, I use 4x 1G NICs, so I have 4 vSwitches for iSCSI.
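
A rough sketch of that layout from the ESXi CLI, for one extra link - the vSwitch/portgroup/vmk/vmnic names, the IP address, and the vmhba33 software iSCSI adapter are all placeholders, and the same thing can be done from the vSphere client:

# Dedicated vSwitch with a single uplink for this iSCSI path
esxcli network vswitch standard add --vswitch-name=vSwitch-iSCSI1
esxcli network vswitch standard uplink add --vswitch-name=vSwitch-iSCSI1 --uplink-name=vmnic2
# Port group and VMkernel interface on that vSwitch
esxcli network vswitch standard portgroup add --vswitch-name=vSwitch-iSCSI1 --portgroup-name=iSCSI-1
esxcli network ip interface add --interface-name=vmk2 --portgroup-name=iSCSI-1
esxcli network ip interface ipv4 set --interface-name=vmk2 --type=static --ipv4=10.10.1.20 --netmask=255.255.255.0
# Bind the VMkernel port to the software iSCSI adapter
esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk2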

The CPU rarely goes beyond 10% utilization and load is a consistent ~0.16

The only reason the CPU would go high is if you were doing compression too aggressively.

I've added more RAM (was 12 GB, now 32 GB) because I know iSCSI likes a lot of RAM

Yep, never too much RAM...

I added an iSCSI connection from a separate Windows box over a 1Gb link; it starts off well but then settles around 65 MB/s.

I never considered Windows as high performance, so I would never rely on it for performance testing...

Drive Configuration:
2X4 Mirror

What? You are doing 4-way mirrors??? That is more than overkill or paranoid... There is no point going beyond 3-way mirrors.

I hope you wrote it backward and that you have 4 mirrors with 2 drives each...

Anyway, your drives are not all the same. When doing redundancy like mirrors, it is always best to group identical hardware together.

Are your network cables connected directly from one server to the other, or are you going through a physical switch?

Did you configure jumbo frames?

What kind of load balancing are you doing over your network links?
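
If jumbo frames are in use, a quick way to verify they actually pass end to end (MTU 9000 assumed on both sides; the addresses are placeholders):

# From ESXi: 8972 bytes of payload + headers = 9000, with don't-fragment set
vmkping -d -s 8972 192.168.10.10
# From FreeNAS (FreeBSD ping), same idea in the other direction
ping -D -s 8972 192.168.10.20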
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Welcome. I have some thoughts and questions below.

The ARC will artificially boost your read results, and since you've used zeroes without disabling compression, ZFS will just squash those down to next-to-nothing - which is why you're seeing the inflated multi-GB/s speeds in your dd testing.
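
For anyone re-running the dd numbers, a minimal sketch of a less flattering test - assuming a throwaway dataset (the name is made up) with compression off and an incompressible source; note that /dev/urandom itself can become the bottleneck, as the earlier ~105 MB/s result shows:

# Scratch dataset with compression disabled so zeroes/patterns aren't collapsed
zfs create -o compression=off tank/ddtest
# Write incompressible data instead of zeroes
dd if=/dev/urandom of=/mnt/tank/ddtest/rand.dat bs=1024k count=10000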

Your SLOG also won't be doing anything unless you've manually set sync=always on your ZVOLs - VMware doesn't enforce sync over iSCSI writes, it expects iSCSI arrays to have non-volatile cache. However, if you enable this, expect your write speeds to crater even harder - neither your disks nor your SSD (Kingston UV400) are up to the task of absorbing sync writes at high speed.
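
A quick way to confirm what's actually in effect on the zvol backing the datastore (the zvol name here is a guess):

# Show the current sync and compression settings
zfs get sync,compression tank/vmware-zvol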

I'm pretty confident the bottleneck is in the drives I have, but before I run out and purchase more I want to do whatever further testing will pin down any outliers that may exist. IOMeter read tests are good (160 MB/s, 32k, 100% reads), but with any mix of writes the performance deteriorates.

How full is your pool? Occupancy rates can strongly impact performance on spinning disks if you have to seek around a bunch to find free space.

But even with an empty-ish pool, a disk bottleneck is what I'm feeling as well. Check the SMART stats perhaps to make sure there isn't a failing drive or loose cable causing a bunch of unnecessary error-checking to happen. You don't have the fastest drives in the world there (Barracuda Green/LP), and VM workloads trend heavily towards 4K/8K I/O, which is murder on spinning disks.
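
A sketch of the SMART check meant here - device names will differ, but disks behind the SAS2008 typically show up as da0, da1, and so on:

# Watch for Reallocated_Sector_Ct, Current_Pending_Sector and UDMA_CRC_Error_Count
# (the last one usually points at a cable/backplane problem rather than the disk itself)
smartctl -a /dev/da0
# ...repeat for da1 through da7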

Can I get a dump of your arc_summary.py results as well, either as a text attachment or in CODE tags? Also check to see if autotune was enabled and somehow is capping out your maximum ARC size.
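
On FreeNAS 11.3 that's roughly the following; the output path is arbitrary, and the sysctl is just the usual place an autotune tunable would have pinned the ARC ceiling:

# Dump ARC statistics to a file for attaching
arc_summary.py > /tmp/arc_info.txt
# Check whether a tunable is capping the ARC size
sysctl vfs.zfs.arc_max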

Random reads you can bandaid with L2ARC (which is a far better use-case for that Kingston drive) - you can cheat short bursts of writes with a fast SLOG device, but if you're running async and hitting slowdowns, it's ultimately going to come down to "you need fast vdevs" in the end.
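
If the Kingston does get repurposed as L2ARC, the CLI equivalent is a one-liner (device name is a placeholder; on FreeNAS you'd normally add it through the GUI so the partitioning is handled for you):

# Add the SSD to the pool as a cache (L2ARC) device
zpool add tank cache ada1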
 

EnlightenCor

Cadet
Joined
Jul 28, 2020
Messages
4
Thanks for the replies @HoneyBadger and @Heracles. I'll throw some answers to your questions here and will spin up everything later tonight and get the requested dumps.

I am indeed doing a 2x4, for the simple reason that out of all of the combinations I've tried thus far, this gives me the best performance, albeit a slight increase. Yes, I am paranoid.

I did have a 4x2 mirror and had tried combos (4x2, 2x4) that were striped within the vdevs; with ZFS striping across all vdevs I got the best performance, but no reliability in the long run, so I'm back to 2x4. I can easily change to whatever's suggested. Maybe put like drives together in smaller mirrors to test that pool?

Jumbo frames are enabled on the 10Gb connection. There is no switch or load balancer in the way for the 10Gb; it's a direct connection from the ESXi to the FreeNAS. The 1Gb links still have MTU set to 1500, with a 1Gb switch between the ESXi and the FreeNAS. Maybe I can set them to 9000, but with what I have thus far I think the network is good to go and could probably be refined some.

I ran the Windows iSCSI test just as a comparison outside the scope of the environment in question.

I do have sync=always on and compression is off on the zvols

The Pool is empty as I have been tearing it down and rebuilding based on different ideas/setups.

I will run that Python script later tonight and see what output I receive. I did briefly check SMART stats the other night; the reporting graphs were all pretty much identical and there were no obvious flags I could see, but I will check in depth tonight.

Maybe checking out drives online wouldn't be a bad idea either. I'd just like to get to a point where I can hone in on the culprit and be able to say X or Y is the cause of Z.

Thanks again!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I am indeed doing a 2x4, for the simple reason that out of all of the combinations I've tried thus far, this gives me the best performance, albeit a slight increase. Yes, I am paranoid.

So you have 2 vdevs that are 4-way mirrors, with 4T of usable space? You'll want to switch that to 4 vdevs of 2-way mirrors (8T usable) or add a ninth drive and make 3 vdevs of 3-way mirrors (6T usable) - ZFS stripes data across multiple vdevs, and that's where your "back-end pool speed" comes from.
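
For reference, the 4 x 2-way layout at the command line would look roughly like this - disk names are placeholders, and on FreeNAS the GUI is the right place to actually build it:

# Four 2-way mirror vdevs; ZFS stripes writes across all four
zpool create tank \
  mirror da0 da1 \
  mirror da2 da3 \
  mirror da4 da5 \
  mirror da6 da7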

I do have sync=always on and compression is off on the zvols

Definitely turn compression back on (LZ4) as this will help - but with sync=always the focus is now on that SLOG device. Temporarily disable sync enforcement (just for testing!) by setting sync=standard and see if your pool flies through VMware - if it does, then your bottleneck is the SLOG device and you'll want/need to upgrade it to something faster; see the "SLOG Thread" here:


A cheap SATA option would be the Intel DC S3700, but to go faster you should be looking at Intel DC series NVMe cards (DC P3700) or Optane cards/sticks.
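
To flip the settings above for that test, it's roughly the following (the zvol name is a placeholder):

# Re-enable LZ4 and temporarily relax sync, just for the comparison run
zfs set compression=lz4 tank/vmware-zvol
zfs set sync=standard tank/vmware-zvol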
 

EnlightenCor

Cadet
Joined
Jul 28, 2020
Messages
4
I've done a bit more with this today.

I took the SSD out of the pool altogether and built a 4x2 pool. I grouped the drives a bit better, putting the 7200 RPM drives in the same vdev. I tested the pool with sync=always and sync=standard; as expected, sync=standard was a bit faster than sync=always. Compression is on now as well.

The VMs are somewhat faster now, but I'm still in the 65 MB/s range. I know iSCSI on Windows isn't the best test, but it got a consistent 90 to 100 MB/s, which is better than before.

Because of the modest increase, I'm more apt to say those 8 old disks should probably be switched out for something a bit better, and then I can add a real SLOG later if needed. I'll leave sync=standard for now; since these aren't critical workloads, it should be fine.

I've got the arc summary info below.
 

Attachments

  • arc_Info.txt
    14.4 KB

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Your ARC maximum is properly set, but you haven't driven enough I/O to fill it yet so I can't really infer much else.

What benchmark are you using, or what metric are you observing, for the "65 MB/s" number? The FreeNAS UI network throughput graph?
 

EnlightenCor

Cadet
Joined
Jul 28, 2020
Messages
4
I'll need to run some more thoughtful tests to really know for sure. It was a quick spin-up of a VM, and that number came from the ESXi disk monitor. I didn't spend much time checking on the FreeNAS side beyond iostat and pulling that arc summary. I had noticed that disk number on ESXi being somewhat slower before; it was just a quick observation. Work's been busy; I'll do some more tinkering.
 