iSCSI Performance suboptimal on NVMe Pool over 40GbE NIC

alhsou · Cadet · Joined: Jul 22, 2023 · Messages: 6
Hi all,

I'm not usually one to post on forums as I like to find solutions myself, but I'm at a dead end here after spending 60+ hours reading about TrueNAS and ZFS and trying to optimise my setup.

So my setup:
- HPE DL360 Gen9
- HPE FLR544 FlexLOM Dual 40GbE (Mellanox ConnectX-3 Pro)
- 2x Intel Xeon E5-2650 v4 2.2 GHz
- Sonnet Technologies Fusion M.2 4x4 PCIe card (has a PLX chip on it, so no need for bifurcation support from the motherboard)
- 128 GB (8x 16GB RDIMM 2400 MHz Samsung DIMMs)
- Running on ESXi 7.0.3
- 4x Lexar NM790 2TB NVMe SSDs

My TrueNAS Core VM has the following specs:
- 16 vCPUs
- 64GB RAM
- 60 GB Boot Disk
- The Sonnet Tech Fusion card is passed through (well, all the NVMe SSDs are passed through).
- 2 interfaces: one for MGMT and one for iSCSI/SMB/NFS etc. They are on separate VLANs.

I have 2 DSwitches: one for all my VM traffic and ESXi MGMT, and one dedicated to 'vSAN' that I use for storage. On the vSAN one, only the 40GbE NICs are present as uplinks. I have a separate VLAN for the 'vSAN' network. Please don't mind the name; it's just because I was planning on using vSAN before.

The 2 physical hosts are connected together via a DAC cable, so no switching is done for that VLAN.
Each ESXi instance has a separate VMkernel interface in that VLAN/subnet, and I have jumbo frames enabled along the whole path, including on the interface in TrueNAS.

The issue I have is that no matter what config I try, the performance is nowhere near what a single one of those SSDs can do.
I even tried installing TrueNAS Core bare metal with the full resources of the server, but to no avail. Performance stayed the same.

I am running TrueNAS Core 13-U5.2 at the moment, but I have also tried SCALE 22.12 and even the 23.12 nightly. All gave me the same results.

I have 2 VDEVs, each with 2 NVMe disks in a mirror, so they are striped together.
I also tried RAID-Z1 and Z2 just for diagnostic purposes, but performance is the same on all of them.

Here's an image of the kind of performance I get with CrystalDiskMark and the NVMe + Max Performance profile:

[Attached screenshot: Screenshot 2023-07-23 at 05.47.58.jpg (CrystalDiskMark results)]


My IOPS seem to be at the lower end of what I would expect, but the writes especially are a BIG issue for me. I tried 32K and 128K block sizes (volblocksize) on the zvol, but the results stay mostly the same; maybe a 1-5 MB/s improvement between the settings.
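
For anyone following along, checking and recreating the zvol with a different block size is just a couple of shell commands. A rough sketch (the zvol name zpool1/vm-iscsi is only a placeholder for whatever your zvol is called):

# Show the block size the existing zvol was created with
zfs get volblocksize zpool1/vm-iscsi

# volblocksize is fixed at creation time, so testing another size means creating a new (sparse) zvol, e.g.:
zfs create -s -V 500G -o volblocksize=128K zpool1/vm-iscsi-128k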

Has anyone got any ideas? I'm really hitting a dead end here.
Also, can anyone give me a correct way to test with fio locally on TrueNAS? I found a lot of different ways to test with different IO engines, but if someone could point me in the right direction for my setup...

PS: CPU utilisation is not high while performing the test over iSCSI.
I also tried NFS v4, but that gave me half the IOPS.
 

alhsou
Okay, I did some more digging and I have some interesting (to me) findings.
It seems like I needed to set up MPIO to get the most out of the iSCSI performance. Thinking about it, that seems logical, as by default you only have one TCP path towards the LUN.

I did a test and disabled Sync; now I have these results with 2 NICs assigned to the TrueNAS VM:

[Attached screenshot: Screenshot 2023-07-23 at 18.30.31.jpg (CrystalDiskMark results with Sync disabled and MPIO)]


However, I can't run in production with Sync disabled... especially as I will be using the storage purely as block storage for VMs on ESXi servers.
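
For anyone who wants to reproduce that test, toggling sync is just a property on the zvol. A rough sketch (the zvol name zpool1/vm-iscsi is a placeholder):

# Check the current setting
zfs get sync zpool1/vm-iscsi

# Testing only: acknowledge sync writes without waiting for the ZIL (unsafe for VM block storage)
zfs set sync=disabled zpool1/vm-iscsi

# Back to the default behaviour afterwards
zfs set sync=standard zpool1/vm-iscsi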

I will be testing with more NICs and reporting back.
Maybe Autotune could help.
 

alhsou
I added 2 more NICs, for a total of 6, but that didn't really improve things compared to 4.
And yes, ESXi is set to Round Robin for MPIO.
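
For reference, setting Round Robin per device from the ESXi shell looks roughly like this (the naa. identifier is a placeholder for the actual iSCSI LUN; the iops=1 tweak makes ESXi rotate paths after every command instead of the default 1000 and is optional):

# Show the device and its current path selection policy
esxcli storage nmp device list -d naa.xxxxxxxxxxxxxxxx

# Set the path selection policy to Round Robin for that device
esxcli storage nmp device set -d naa.xxxxxxxxxxxxxxxx -P VMW_PSP_RR

# Optional: switch paths after every I/O instead of every 1000
esxcli storage nmp psp roundrobin deviceconfig set -d naa.xxxxxxxxxxxxxxxx -t iops -I 1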

I also tested with upping the vCPU count from 16 to 24, with thread affinity enabled, so cores 0-23 are used.
So it doesn't seem to be CPU limited, or at least not in core count, but maybe IPC limited?

ZFS seems to have a lot of overhead... or I need to do some more digging.
Also, the ZFS cache is not full; only about 3-5GB of the 64GB is used while performing the tests.

Could it be that the ARC is not hitting?
I've set primarycache=all and secondarycache=none on zpool1, so it should be hitting the RAM.
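
For anyone wanting to check the same thing, the ARC hit/miss counters are easy to pull from the shell. A rough sketch (pool name zpool1 as above; the paths differ between CORE and SCALE):

# Confirm the cache properties on the pool
zfs get primarycache,secondarycache zpool1

# ARC hit/miss counters on CORE (FreeBSD)
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses

# The same counters on SCALE (Linux)
grep -E '^(hits|misses) ' /proc/spl/kstat/zfs/arcstats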

For me, I'm more interested in how to improve the situation when Sync is enabled. The only resources I find about that suggest a SLOG, but that advice seems outdated since I'm using all-NVMe storage for the pool. The only thing I could think of is some Optane drives, but would that really help an NVMe pool?

Another thing I want to try is NVMe over TCP or RDMA. It doesn't seem to be natively supported, and on the TrueNAS homepage I can see they are planning it for SCALE 22.22, but even in the latest 23.12 nightly that I tested it doesn't seem to be included, or at least I didn't find it.

Anyone have some suggestions? I don't just want to throw money at it for no reason.
I mean, the performance is acceptable now, but it's nowhere near what this hardware is capable of...

The last thing I could do is pull one of my home ESXi lab servers, install TrueNAS on it, and transfer the disk array into that chassis. That is an AMD Ryzen 5 5500G system with 128GB of memory, which is much more modern and has maybe double or more the IPC...

What do you guys think?
 

alhsou
Some more progress...
I found the correct combination of parameters to run fio for sequential reads. This is what I got:

Sequential reads:
root@truenas[~]# fio --name=seqread --rw=read --direct=1 --ioengine=libaio --bs=8k --numjobs=8 --size=1G --runtime=600 --group_reporting
seqread: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=1
...
fio-3.25
Starting 8 processes
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
Jobs: 8 (f=8)
seqread: (groupid=0, jobs=8): err= 0: pid=94344: Sun Jul 23 14:41:59 2023
read: IOPS=873k, BW=6821MiB/s (7152MB/s)(8192MiB/1201msec)
slat (nsec): min=1412, max=35922k, avg=7449.45, stdev=177820.22
clat (nsec): min=170, max=21506k, avg=564.40, stdev=50450.33
lat (nsec): min=1643, max=35925k, avg=8104.82, stdev=185323.41
clat percentiles (nsec):
| 1.00th=[ 191], 5.00th=[ 201], 10.00th=[ 201], 20.00th=[ 211],
| 30.00th=[ 221], 40.00th=[ 221], 50.00th=[ 231], 60.00th=[ 231],
| 70.00th=[ 231], 80.00th=[ 241], 90.00th=[ 251], 95.00th=[ 282],
| 99.00th=[ 422], 99.50th=[ 502], 99.90th=[ 19072], 99.95th=[ 36096],
| 99.99th=[254976]
bw ( MiB/s): min= 5989, max= 8148, per=100.00%, avg=7068.80, stdev=194.69, samples=14
iops : min=766658, max=1042954, avg=904806.00, stdev=24920.91, samples=14
lat (nsec) : 250=89.19%, 500=10.30%, 750=0.28%, 1000=0.05%
lat (usec) : 2=0.05%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.07%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=5.92%, sys=39.52%, ctx=15063, majf=0, minf=139
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=1048576,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: bw=6821MiB/s (7152MB/s), 6821MiB/s-6821MiB/s (7152MB/s-7152MB/s), io=8192MiB (8590MB), run=1201-1201msec
root@truenas[~]#


Sequential writes:
root@truenas[~]# fio --name=seqread --rw=write --direct=1 --ioengine=libaio --bs=8k --numjobs=8 --size=1G --runtime=600 --group_reporting
seqread: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=1
...
fio-3.25
Starting 8 processes
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
Jobs: 8 (f=8): [W(8)][75.0%][w=2820MiB/s][w=361k IOPS][eta 00m:01s]
seqread: (groupid=0, jobs=8): err= 0: pid=100266: Sun Jul 23 14:43:57 2023
write: IOPS=382k, BW=2984MiB/s (3129MB/s)(8192MiB/2745msec); 0 zone resets
slat (usec): min=2, max=130300, avg=18.56, stdev=604.29
clat (nsec): min=191, max=45514k, avg=771.95, stdev=89265.84
lat (usec): min=2, max=130303, avg=19.43, stdev=612.42
clat percentiles (nsec):
| 1.00th=[ 221], 5.00th=[ 221], 10.00th=[ 231], 20.00th=[ 231],
| 30.00th=[ 231], 40.00th=[ 241], 50.00th=[ 241], 60.00th=[ 241],
| 70.00th=[ 241], 80.00th=[ 251], 90.00th=[ 262], 95.00th=[ 270],
| 99.00th=[ 442], 99.50th=[ 628], 99.90th=[ 6624], 99.95th=[ 36096],
| 99.99th=[358400]
bw ( MiB/s): min= 2093, max= 3966, per=100.00%, avg=3082.52, stdev=92.39, samples=39
iops : min=267932, max=507668, avg=394561.20, stdev=11826.53, samples=39
lat (nsec) : 250=71.49%, 500=27.76%, 750=0.37%, 1000=0.12%
lat (usec) : 2=0.12%, 4=0.02%, 10=0.02%, 20=0.02%, 50=0.03%
lat (usec) : 100=0.02%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=2.62%, sys=22.35%, ctx=19248, majf=0, minf=130
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=2984MiB/s (3129MB/s), 2984MiB/s-2984MiB/s (3129MB/s-3129MB/s), io=8192MiB (8590MB), run=2745-2745msec
root@truenas[~]#

So the system itself is not the bottleneck in this case; I think it has more to do with iSCSI/ZFS protocol overhead.
Too bad there's no NVMe-oF support in TrueNAS. I would like to test the difference.
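
One caveat with the local numbers above: neither fio run uses --sync or --fsync, so they never exercise the ZIL/sync-write path the way iSCSI writes from ESXi do. A closer (but still rough) approximation would be something like this, with posixaio so it runs on both CORE and SCALE, and the --directory path only a placeholder for a dataset on the pool:

# Rough sync-write test; --sync=1 forces O_SYNC on every write
mkdir -p /mnt/zpool1/fio-test
fio --name=syncwrite --rw=write --bs=8k --numjobs=8 --size=1G \
    --directory=/mnt/zpool1/fio-test --ioengine=posixaio --sync=1 \
    --runtime=60 --time_based --group_reporting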

Next step is to test on my AMD system, which is much newer, to see if it is indeed a CPU/RAM bottleneck.
 

alhsou
I keep responding to myself, but I hope this helps someone in the same situation who finds this thread.
I did some more testing, and my theory of a CPU IPC bottleneck now seems very likely.

At home I have a 4U chassis running consumer AMD AM4 parts that I use as a 2nd ESXi host (and for vMotion failover), and that is where my NAS runs.
This NAS was initially meant to be a remote backup for my files at the office; it syncs certain folders that I use for my business from my GDrive every hour. This is the spec:
- AMD Ryzen 5 5500
- Gigabyte B550M-DS3H
- HP H240 Smart HBA
- 128 GB DDR4-3200 UDIMM RAM
- Inter-Tech IPC 4U-4088-S chassis with an Icy Dock 5-bay-to-3x 5.25" hot-swap drive backplane (SATA, connected via an adapter cable to the H240)
- Kingston NV1 512GB NVMe drive as a LOG vdev (SLOG) for the array
- 3x Toshiba Enterprise 14TB 7.2K 3.5" HDDs

This is running on VMware ESXi 7.0.3, and the VM has 8 vCPUs and 32/64GB RAM (it had only 16GB when I was using it purely as a backup NAS, but I upgraded it to 32GB and now to 64GB since I began testing). I'm running TrueNAS SCALE.

First I made a Windows 10 VM on the same host as the TN VM and mounted a 2TB iSCSI volume to it via ESXi. It gave me good performance, so I shut down the system and installed the NVMe as a LOG VDEV, then remounted the iSCSI datastore in ESXi and migrated the VM storage to it. BTW, it is a RAID-Z1 pool; I know that's not recommended for iSCSI, but I can't change it as it has important data on it...
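
For completeness, attaching the NVMe as a dedicated LOG vdev comes down to a single command from the shell. A rough sketch (pool name tank and device name /dev/nvme0n1 are only placeholders for your own pool and drive):

# Add the NVMe as a dedicated SLOG; note the "log" keyword, a plain "zpool add" without it
# would stripe the drive into the data vdevs instead
zpool add tank log /dev/nvme0n1

# Verify that the device now shows up under "logs"
zpool status tank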

The results are quite... impressive... considering this is running on spinning rust and a very cheap NVMe drive I had lying around...

[Attached screenshot: CrystalDiskMark results on the AMD system]


So this low-end modern consumer HW is outperforming my DL360 Gen9...

Now I'm thinking about swapping the E5-2650 v4 CPUs for lower-core-count but higher-clocked parts, but those consume a lot more power...
Still, that is the cheapest option, as those CPUs cost next to nothing now.

OR I could look for a new AMD/Intel consumer system to put in a rackmount chassis and use that for the storage, with daily backups to the other NAS I have, or look for used AMD EPYC parts and use those instead...

Anyway... glad I found the issue and learned a lot in the process. I just wish I had known this beforehand; then I could have built a beefy consumer-based TN system for this...

FYI: with MPIO enabled I even got roughly 2.1 GB/s sequential read and 2.5-2.7 GB/s sequential write.
I've decided that for my use case (iSCSI for VMs), single-threaded performance is the most important, so I'm going to build a system based on the Core i3-13100F, which has very good IPC and single-threaded performance and is one of the few SKUs in 13th gen without E-cores. I will probably run TN bare metal on it, and I'm debating between SCALE and CORE (as SCALE can only assign half of the memory to ZFS caching, but I like the GUI more). I will also probably add a Micron 7400 MAX 400GB NVMe SLOG. Why, you might ask? To assist sync write performance, as these Lexar SSDs don't have PLP. Had I known this, I would have just gone for Micron 7400 PRO 1.92TB drives instead; it's only about 200 EUR more for 4 drives... That would have saved me so many headaches.

Consumer SSDs are kinda garbage to be honest... Now I understand why.
 