iSCSI Performance suboptimal on NVMe Pool over 40GbE NIC

alhsou · Cadet · Joined: Jul 22, 2023 · Messages: 6
Hi all,

I'm not usually one to post on forums as I like to find solutions myself, but I'm at a dead end here after spending 60+ hours reading about TrueNAS and ZFS and trying to optimise my setup.

So my setup:
- HPE DL360 Gen9
- HPE FLR544 FlexLOM Dual 40GbE (Mellanox ConnectX-3 Pro)
- 2x Intel Xeon E5-2650 v4 2.2 GHz
- Sonnet Technologies Fusion M.2 4x4 PCIe card (has a PLX chip on it, so no need for bifurcation support from the motherboard)
- 128 GB (8x 16GB RDIMM 2400 MHz Samsung DIMMs)
- Running on ESXi 7.0.3
- 4x Lexar NM790 2TB NVMe SSDs

My TrueNAS Core VM has the following specs:
- 16 vCPUs
- 64GB RAM
- 60 GB Boot Disk
- The Sonnet Tech Fusion card is passed through (well, all the NVMe SSDs are passed through).
- 2 interfaces: one for MGMT and one for iSCSI/SMB/NFS etc. They are on separate VLANs.

I have 2 DSwitches: one for all my VM traffic and ESXi MGMT, and one dedicated to 'vSAN' that I use for storage. On the vSAN one, only the 40GbE NICs are present as uplinks. I have a separate VLAN for the 'vSAN' network. Please don't mind the name; it's just because I was planning on using vSAN before.

The 2 physical hosts are connected together via a DAC cable, so no switching is done for that VLAN.
Each ESXi instance has a separate VMkernel interface in that VLAN/subnet, and I have jumbo frames enabled along the whole path, including on the interface in TrueNAS.

The issue I have is that no matter what config I try, the performance is nowhere near what a single one of those SSDs can do.
I even tried installing TrueNAS Core bare metal with the full resources of the server, but to no avail. Performance stayed the same.

I am running TrueNAS Core 13-U5.2 at the moment, but I have also tried SCALE 22.12 and even the 23.12 nightly. All gave me the same results.

I have 2 VDEVs, each with 2 NVMe disks in a mirror, so they are striped together.
I also tried RAID-Z1 and Z2 just for diagnostic purposes, but performance is the same on all of them.

Here's an image of the kind of performance I get with CrystalDiskMark and the NVMe + Max Performance profile:

[Attached screenshot: Screenshot 2023-07-23 at 05.47.58.jpg (CrystalDiskMark results)]


My IOPS seem to be at the lower end of what I would expect, but the writes especially are a BIG issue for me. I tried 32K and 128K block sizes (volblocksize) on the zvol, but the results stay mostly the same; maybe a 1-5 MB/s improvement between the settings.
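
For anyone following along, checking and recreating the zvol with a different block size is just a couple of shell commands. A rough sketch (the zvol name zpool1/vm-iscsi is only a placeholder for whatever your zvol is called):

# Show the block size the existing zvol was created with
zfs get volblocksize zpool1/vm-iscsi

# volblocksize is fixed at creation time, so testing another size means creating a new (sparse) zvol, e.g.:
zfs create -s -V 500G -o volblocksize=128K zpool1/vm-iscsi-128k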

Has anyone got any ideas? I'm really hitting a dead end here.
Also, can anyone give me a correct way to test with fio locally on TrueNAS? I found a lot of different ways to test with different IO engines, but if someone could point me in the right direction for my setup...

PS: CPU utilisation is not high while performing the test over iSCSI.
I also tried NFS v4, but that gave me half the IOPS.
 

alhsou
Okay, I did some more digging and I have some interesting (to me) findings.
It seems like I needed to set up MPIO to get the most out of the iSCSI performance. Thinking about it, that seems logical, as by default you only have one TCP path towards the LUN.

I did a test and disabled Sync; now I have these results with 2 NICs assigned to the TrueNAS VM:

[Attached screenshot: Screenshot 2023-07-23 at 18.30.31.jpg (CrystalDiskMark results with Sync disabled and MPIO)]


However, I can't run in production with Sync disabled... especially as I will be using the storage purely as block storage for VMs on ESXi servers.
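
For anyone who wants to reproduce that test, toggling sync is just a property on the zvol. A rough sketch (the zvol name zpool1/vm-iscsi is a placeholder):

# Check the current setting
zfs get sync zpool1/vm-iscsi

# Testing only: acknowledge sync writes without waiting for the ZIL (unsafe for VM block storage)
zfs set sync=disabled zpool1/vm-iscsi

# Back to the default behaviour afterwards
zfs set sync=standard zpool1/vm-iscsi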

I will be testing with more NICs and reporting back.
Maybe Autotune could help.
 

alhsou
I added 2 more NICs, for a total of 6, but that didn't really improve things compared to 4.
And yes, ESXi is set to Round Robin for MPIO.
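
For reference, setting Round Robin per device from the ESXi shell looks roughly like this (the naa. identifier is a placeholder for the actual iSCSI LUN; the iops=1 tweak makes ESXi rotate paths after every command instead of the default 1000 and is optional):

# Show the device and its current path selection policy
esxcli storage nmp device list -d naa.xxxxxxxxxxxxxxxx

# Set the path selection policy to Round Robin for that device
esxcli storage nmp device set -d naa.xxxxxxxxxxxxxxxx -P VMW_PSP_RR

# Optional: switch paths after every I/O instead of every 1000
esxcli storage nmp psp roundrobin deviceconfig set -d naa.xxxxxxxxxxxxxxxx -t iops -I 1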

I also tested with upping the vCPU count from 16 to 24, with thread affinity enabled, so cores 0-23 are used.
So it doesn't seem to be CPU limited, or at least not in core count, but maybe IPC limited?

ZFS seems to have a lot of overhead... or I need to do some more digging.
Also, the ZFS cache is not full; only about 3-5GB of the 64GB is used while performing the tests.

Could it be that the ARC is not hitting?
I've set primarycache=all and secondarycache=none on zpool1, so it should be hitting the RAM.
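
For anyone wanting to check the same thing, the ARC hit/miss counters are easy to pull from the shell. A rough sketch (pool name zpool1 as above; the paths differ between CORE and SCALE):

# Confirm the cache properties on the pool
zfs get primarycache,secondarycache zpool1

# ARC hit/miss counters on CORE (FreeBSD)
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses

# The same counters on SCALE (Linux)
grep -E '^(hits|misses) ' /proc/spl/kstat/zfs/arcstats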

For me, I'm more interested in how to improve the situation when Sync is enabled. The only resources I find about that suggest a SLOG, but that advice seems outdated since I'm using all-NVMe storage for the pool. The only thing I could think of is some Optane drives, but would that really help an NVMe pool?

Another thing I want to try is NVMe over TCP or RDMA. It doesn't seem to be natively supported, and on the TrueNAS homepage I can see they are planning it for SCALE 22.22, but even in the latest 23.12 nightly that I tested it doesn't seem to be included, or at least I didn't find it.

Anyone have some suggestions? I don't just want to throw money at it for no reason.
I mean, the performance is acceptable now, but it's nowhere near what this hardware is capable of...

The last thing I could do is pull one of my home ESXi lab servers, install TrueNAS on it, and transfer the disk array into that chassis. That is an AMD Ryzen 5 5500G system with 128GB of memory, which is much more modern and has maybe double or more the IPC...

What do you guys think?
 

alhsou
Some more progress...
I found the correct combination of parameters to run fio for sequential reads. This is what I got:

Sequential reads:
root@truenas[~]# fio --name=seqread --rw=read --direct=1 --ioengine=libaio --bs=8k --numjobs=8 --size=1G --runtime=600 --group_reporting
seqread: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=1
...
fio-3.25
Starting 8 processes
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
Jobs: 8 (f=8)
seqread: (groupid=0, jobs=8): err= 0: pid=94344: Sun Jul 23 14:41:59 2023
read: IOPS=873k, BW=6821MiB/s (7152MB/s)(8192MiB/1201msec)
slat (nsec): min=1412, max=35922k, avg=7449.45, stdev=177820.22
clat (nsec): min=170, max=21506k, avg=564.40, stdev=50450.33
lat (nsec): min=1643, max=35925k, avg=8104.82, stdev=185323.41
clat percentiles (nsec):
| 1.00th=[ 191], 5.00th=[ 201], 10.00th=[ 201], 20.00th=[ 211],
| 30.00th=[ 221], 40.00th=[ 221], 50.00th=[ 231], 60.00th=[ 231],
| 70.00th=[ 231], 80.00th=[ 241], 90.00th=[ 251], 95.00th=[ 282],
| 99.00th=[ 422], 99.50th=[ 502], 99.90th=[ 19072], 99.95th=[ 36096],
| 99.99th=[254976]
bw ( MiB/s): min= 5989, max= 8148, per=100.00%, avg=7068.80, stdev=194.69, samples=14
iops : min=766658, max=1042954, avg=904806.00, stdev=24920.91, samples=14
lat (nsec) : 250=89.19%, 500=10.30%, 750=0.28%, 1000=0.05%
lat (usec) : 2=0.05%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.07%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=5.92%, sys=39.52%, ctx=15063, majf=0, minf=139
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=1048576,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: bw=6821MiB/s (7152MB/s), 6821MiB/s-6821MiB/s (7152MB/s-7152MB/s), io=8192MiB (8590MB), run=1201-1201msec
root@truenas[~]#


Sequential writes:
root@truenas[~]# fio --name=seqread --rw=write --direct=1 --ioengine=libaio --bs=8k --numjobs=8 --size=1G --runtime=600 --group_reporting
seqread: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=1
...
fio-3.25
Starting 8 processes
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
Jobs: 8 (f=8): [W(8)][75.0%][w=2820MiB/s][w=361k IOPS][eta 00m:01s]
seqread: (groupid=0, jobs=8): err= 0: pid=100266: Sun Jul 23 14:43:57 2023
write: IOPS=382k, BW=2984MiB/s (3129MB/s)(8192MiB/2745msec); 0 zone resets
slat (usec): min=2, max=130300, avg=18.56, stdev=604.29
clat (nsec): min=191, max=45514k, avg=771.95, stdev=89265.84
lat (usec): min=2, max=130303, avg=19.43, stdev=612.42
clat percentiles (nsec):
| 1.00th=[ 221], 5.00th=[ 221], 10.00th=[ 231], 20.00th=[ 231],
| 30.00th=[ 231], 40.00th=[ 241], 50.00th=[ 241], 60.00th=[ 241],
| 70.00th=[ 241], 80.00th=[ 251], 90.00th=[ 262], 95.00th=[ 270],
| 99.00th=[ 442], 99.50th=[ 628], 99.90th=[ 6624], 99.95th=[ 36096],
| 99.99th=[358400]
bw ( MiB/s): min= 2093, max= 3966, per=100.00%, avg=3082.52, stdev=92.39, samples=39
iops : min=267932, max=507668, avg=394561.20, stdev=11826.53, samples=39
lat (nsec) : 250=71.49%, 500=27.76%, 750=0.37%, 1000=0.12%
lat (usec) : 2=0.12%, 4=0.02%, 10=0.02%, 20=0.02%, 50=0.03%
lat (usec) : 100=0.02%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=2.62%, sys=22.35%, ctx=19248, majf=0, minf=130
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=2984MiB/s (3129MB/s), 2984MiB/s-2984MiB/s (3129MB/s-3129MB/s), io=8192MiB (8590MB), run=2745-2745msec
root@truenas[~]#

So the system itself is not the bottleneck in this case; I think it has more to do with iSCSI/ZFS protocol overhead.
Too bad there's no NVMe-oF support in TrueNAS. I would like to test the difference.
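
One caveat with the local numbers above: neither fio run uses --sync or --fsync, so they never exercise the ZIL/sync-write path the way iSCSI writes from ESXi do. A closer (but still rough) approximation would be something like this, with posixaio so it runs on both CORE and SCALE, and the --directory path only a placeholder for a dataset on the pool:

# Rough sync-write test; --sync=1 forces O_SYNC on every write
mkdir -p /mnt/zpool1/fio-test
fio --name=syncwrite --rw=write --bs=8k --numjobs=8 --size=1G \
    --directory=/mnt/zpool1/fio-test --ioengine=posixaio --sync=1 \
    --runtime=60 --time_based --group_reporting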

Next step is to test on my AMD system, which is much newer, to see if it is indeed a CPU/RAM bottleneck.
 

alhsou
I keep responding to myself, but I hope this helps someone in the same situation who finds this thread.
I did some more testing, and my theory of a CPU IPC bottleneck now seems very likely.

At home I have a 4U chassis running consumer AMD AM4 parts that I use as a 2nd ESXi host (and for vMotion failover), and that is where my NAS runs.
This NAS was initially meant to be a remote backup for my files at the office; it syncs certain folders that I use for my business from my GDrive every hour. This is the spec:
- AMD Ryzen 5 5500
- Gigabyte B550M-DS3H
- HP H240 Smart HBA
- 128 GB DDR4-3200 UDIMM RAM
- Inter-Tech IPC 4U-4088-S chassis with an Icy Dock 5-bay-to-3x 5.25" hot-swap drive backplane (SATA, connected via an adapter cable to the H240)
- Kingston NV1 512GB NVMe drive as a LOG vdev (SLOG) for the array
- 3x Toshiba Enterprise 14TB 7.2K 3.5" HDDs

This is running on VMware ESXi 7.0.3, and the VM has 8 vCPUs and 32/64GB RAM (it had only 16GB when I was using it purely as a backup NAS, but I upgraded it to 32GB and now to 64GB since I began testing). I'm running TrueNAS SCALE.

First I made a Windows 10 VM on the same host as the TN VM and mounted a 2TB iSCSI volume to it via ESXi. It gave me good performance, so I shut down the system and installed the NVMe as a LOG VDEV, then remounted the iSCSI datastore in ESXi and migrated the VM storage to it. BTW, it is a RAID-Z1 pool; I know that's not recommended for iSCSI, but I can't change it as it has important data on it...
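
For completeness, attaching the NVMe as a dedicated LOG vdev comes down to a single command from the shell. A rough sketch (pool name tank and device name /dev/nvme0n1 are only placeholders for your own pool and drive):

# Add the NVMe as a dedicated SLOG; note the "log" keyword, a plain "zpool add" without it
# would stripe the drive into the data vdevs instead
zpool add tank log /dev/nvme0n1

# Verify that the device now shows up under "logs"
zpool status tank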

The results are quite... impressive... considering this is running on spinning rust and a very cheap NVMe drive I had lying around...

[Attached screenshot: CrystalDiskMark results on the AMD system]


So this low-end modern consumer HW is outperforming my DL360 Gen9...

Now I'm thinking about swapping the E5-2650 v4 CPUs for lower-core-count but higher-clocked parts, but those consume a lot more power...
Still, that is the cheapest option, as those CPUs cost next to nothing now.

OR I could look for a new AMD/Intel consumer system to put in a rackmount chassis and use that for the storage, with daily backups to the other NAS I have, or look for used AMD EPYC parts and use those instead...

Anyway... glad I found the issue and learned a lot in the process. I just wish I had known this beforehand; then I could have built a beefy consumer-based TN system for this...

FYI: with MPIO enabled I even got roughly 2.1 GB/s sequential read and 2.5-2.7 GB/s sequential write.
I've decided that for my use case (iSCSI for VMs), single-threaded performance is the most important, so I'm going to build a system based on the Core i3-13100F, which has very good IPC and single-threaded performance and is one of the few SKUs in 13th gen without E-cores. I will probably run TN bare metal on it, and I'm debating between SCALE and CORE (as SCALE can only assign half of the memory to ZFS caching, but I like the GUI more). I will also probably add a Micron 7400 MAX 400GB NVMe SLOG. Why, you might ask? To assist sync write performance, as these Lexar SSDs don't have PLP. Had I known this, I would have just gone for Micron 7400 PRO 1.92TB drives instead; it's only about 200 EUR more for 4 drives... That would have saved me so many headaches.

Consumer SSDs are kinda garbage to be honest... Now I understand why.
 