Another iSCSI poor performance thread

ljvb

Dabbler
Joined
Jul 14, 2014
Messages
30
I have looked through all the previous threads, and so far no solutions. Honestly, I am not even sure if this is a TrueNAS issue or a VMware issue, but the TrueNAS forum lost the coin toss, so I'm checking here first. I am using TrueNAS Core 13 (latest release) and ESXi 7.

The problem: I am topping out at between 250 and 280 MB/s writes using a simple dd if=/dev/zero on the virtual guests. On the TrueNAS server itself, I am getting 2 GB/s using the same test (unfortunately, while the enclosure and controller support SAS3, the drives I currently have are all SAS2).
Network performance appears fine: iperf3 shows 9.8 Gb/s between the TrueNAS server and the VMware servers on each VLAN. No cross-VLAN traffic is possible due to the lack of a gateway and the ACLs in place. All server traffic is on one switch. The VMware network is using distributed switches.
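For reference, the network test was roughly the following (the IP is just an example for one of my storage VLANs):

# on the TrueNAS server
iperf3 -s
# from a client on the same storage VLAN
iperf3 -c 10.10.10.5 -P 4 -t 30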

I have my storage server set up with 7 two-way mirror vdevs striped together (14 x 1TB); the 2 x 1TB SSDs are unused at this time.
The iSCSI configuration uses 2 portals on TrueNAS, each IP on its own VLAN with its own dedicated (not shared) 10Gb interface. SMB and other services are using the 2 remaining 1Gb interfaces until I pick up another Dell-branded NIC, because when I put in the Intel X520s I have on hand, the fans go from reasonable to 747-at-full-throttle-on-takeoff.

The ESXi servers are sharing a 2TB iSCSI LUN in an MPIO configuration, Round Robin, with the IOPS limit set to 1 (currently; I have tried various values between 1 and 1000 with the same result).
The iSCSI configuration uses 2 VMkernels for port binding, each on its own VLAN, each with its own assigned 10Gb interface.
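For reference, the Round Robin/IOPS settings were applied on each ESXi host with commands along these lines (the naa ID below is a placeholder for the TrueNAS LUN):

# find the device ID of the TrueNAS LUN
esxcli storage nmp device list
# force Round Robin to switch paths after every I/O
esxcli storage nmp psp roundrobin deviceconfig set --device=naa.6589cfc00000XXXX --type=iops --iops=1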

The actual virtual machines use whatever "hardware" is specified in the precanned configurations based on the OS selected when creating the machine. The machines are a mix of Ubuntu/Debian, FreeBSD, and Windows 11 and Server 2022 (no testing was done on the Windows machines).

I'm at a loss as to why I am getting such poor performance. It's usable, at least once the machines are set up, but not ideal.

Storage
Dell R730XD 24 bay SFF
2 x E5-2620 v4
64GB DDR4
Dell HBA330 mini
14 x 1TB SAS2 HDD
2 x 1TB SSD
Dell Intel X520/I350 (2 x 10Gb, 2 x 1GB) Daughter card

Virtualization

2 x HP DL160 G9 each with:
2 x E5-2630 v3
64GB DDR4
HP P440
8 x 600GB
Intel X520 PCI card (2 x 10GB)

Network
2 Brocade ICX 6610-24P POE
4 x 40Gb (currently using 4 x 10Gb breakout cables, 1 x 40Gb 200-foot fiber cable for stacking)
8 x 10GB
24 x 1GB
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
So you are writing to an HDD-based array using dd, and there is no SLOG?

I'd suggest experimenting with fio to see whether dd is causing the issue, or try adding a SLOG. The bandwidth may be limited by the latency of the setup.
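Something along these lines would be a reasonable starting point, assuming fio is available on both ends (the paths and sizes are just examples; adjust to your pool and to a directory inside the guest):

# on TrueNAS (FreeBSD), sequential write with a final fsync
fio --name=seqwrite --rw=write --bs=1M --size=8G --ioengine=posixaio --end_fsync=1 --directory=/mnt/tank/fiotest
# inside a Linux guest
fio --name=seqwrite --rw=write --bs=1M --size=8G --ioengine=libaio --direct=1 --filename=/root/fiotest.bin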
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The iSCSI configuration uses 2 VMkernels for port binding, each on its own VLAN, each with its own assigned 10Gb interface.

As a best practice, you should remove the port binding on your iSCSI VMkernel adapters. This is only needed in a network design where there are multiple initiators in the same subnet/broadcast domain, and your description here would seem to indicate you have two non-overlapping subnets.
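If you'd rather do it from the CLI than the vSphere UI, it's roughly the following on each host, followed by a rescan of the adapter (the adapter and vmk names are examples; yours will differ):

# see which VMkernel ports are currently bound to the software iSCSI adapter
esxcli iscsi networkportal list --adapter=vmhba64
# remove the bindings
esxcli iscsi networkportal remove --adapter=vmhba64 --nic=vmk1
esxcli iscsi networkportal remove --adapter=vmhba64 --nic=vmk2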

The problem: I am topping out at between 250 and 280 MB/s writes using a simple dd if=/dev/zero on the virtual guests. On the TrueNAS server itself, I am getting 2 GB/s using the same test (unfortunately, while the enclosure and controller support SAS3, the drives I currently have are all SAS2).

As mentioned by @morganL, using dd may not be accurately showing the performance results. Can you share the exact command line you're using to benchmark both in-guest and on TrueNAS itself?

Using /dev/zero as a source tends to give very inflated performance results, as the default ZFS compression will squash that down to effectively nothing.
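A quick way to see whether that's what is happening (the pool/zvol names here are placeholders):

# check the compression setting and how well the written data compressed
zfs get compression,compressratio tank/vm-zvol

Re-running the same test with incompressible data (for example fio with --refill_buffers, or a pre-generated file of random data) will usually look very different from /dev/zero.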

There are also considerations for how long the test runs, as well as the effective record size: writing to a dataset on your pool allows it to use larger records, giving better throughput numbers, but in-guest writes on the VM will likely be broken down into smaller records (16K max, if using defaults), which will reduce throughput.
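You can check what block sizes you're actually working with like so (again, the names are placeholders for your pool/zvol/dataset):

# block size the iSCSI zvol is limited to
zfs get volblocksize tank/vm-zvol
# maximum record size for datasets on the pool
zfs get recordsize tank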

We'll also likely need to visit the idea of SLOG and sync writes to ensure the safety of that VM data, but let's go one step at a time here.
 

ljvb

Dabbler
Joined
Jul 14, 2014
Messages
30
Quick update: as a test, I moved all the running VMs to one ESXi server and installed XCP-ng on the other, set up iSCSI with multipathing, and am getting much better performance using CrystalDiskMark.

As for why I used dd: it was just a quick and dirty test, not meant to get real-world performance stats, but rather to compare performance across various configurations to see whether a specific configuration was an improvement or not.

I originally had a DL380p G8 that I was using as my storage server until I broke it (and it was not even the server's fault, it was the cabinet: I was working on both the server and the cabinet with the server open, I got mad at the rack and slammed it, and the shelf screws on the server above, which apparently were loose, dropped out; the corner went right into the DL380 while it was powered on and let out the magic smoke). The DL380 gave me around 2 to 2.5 GB/s over iSCSI with 24 x 600GB SAS2 drives in a mirrored/striped configuration (I am sure it could have gone faster, but it suited my needs).

I picked up the R730xd to replace it, and it came with 14 1TB SAS2 drives.

I'm not familiar with fio, so I spun up a couple of Windows machines to run CrystalDiskMark. I'll share some numbers from the results later.

I also added the 2 x 1TB SSDs for the fun of it (one as SLOG, the other as cache; I know a single SLOG device is a bad idea), and have not noticed much difference in the tests I have run so far.
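For what it's worth, this is roughly how I checked that the log and cache devices were actually added and being touched during the tests (pool name is a placeholder):

# log and cache vdevs should show up in the pool layout
zpool status tank
# per-vdev activity, refreshed every 5 seconds while a test is running
zpool iostat -v tank 5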

Looking at the reports on TrueNAS, ARC size is maxed out based on my installed RAM (I am guessing I should probably add another 64GB at some point), and the ARC hit ratio is pegged at 100% when running the tests. The rest of the graphs are just lots of spikes. I'm an IT security professional; storage is not my forte, so I'm not sure what I am reading. I'll have to do some research.
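In case it helps anyone else, the raw numbers behind those graphs seem to be available from the TrueNAS shell as well; I was poking at something like:

sysctl kstat.zfs.misc.arcstats.size
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses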
 