NVMe pool seems slow as cold molasses

ArchatParks

Dabbler
Joined
Feb 8, 2022
Messages
28
I set up a new storage pool: a pair of NVMe drives in a mirror, with a pair of Optane drives as a mirrored log (SLOG). It does not perform as I had hoped.
System is running TrueNAS-SCALE-22.12-RC.1.
It is a VM on an ESXi install, with the NVMe controllers (and their drives) passed through to the TrueNAS VM.
Memory: 24 GB
Processor: Intel(R) Xeon(R) CPU E5-2450 0 @ 2.10GHz
Allocated 8 CPUs (2 sockets with 4 cores per socket).
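Inside the VM the passed-through NVMe controllers show up as ordinary PCIe devices; a quick sanity check that they are bound to the nvme driver (and not presented through a virtual adapter) is something like:

# List NVMe-class PCI devices and the kernel driver bound to each.
lspci -nnk | grep -iA3 'non-volatile'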
[ 0.000000] Command line: BOOT_IMAGE=/ROOT/22.12-RC.1@/boot/vmlinuz-5.15.62+truenas root=ZFS=boot-pool/ROOT/22.12-RC.1 ro console=tty1 console=ttyS0,9600 libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
[ 0.082500] smpboot: Allowing 8 CPUs, 0 hotplug CPUs
[ 0.087616] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
[ 0.136081] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
[ 0.155928] rcu: RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=8.
[ 0.191673] MDS: Mitigation: Clear CPU buffers
[ 0.244104] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2450 0 @ 2.10GHz (family: 0x6, model: 0x2d, stepping: 0x7)

Pool info:
  pool: pool0002
 state: ONLINE
  scan: scrub repaired 0B in 00:00:21 with 0 errors on Sat Nov 26 11:24:13 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        pool0002                                  ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            c7451373-4550-4d27-8621-01baf622de0b  ONLINE       0     0     0
            45e2e0a4-df54-4556-ab4f-86ae2f45e609  ONLINE       0     0     0
        logs
          mirror-1                                ONLINE       0     0     0
            36594b5b-25d1-4389-a4f2-af5157652cd9  ONLINE       0     0     0
            85643d9f-9b7f-4492-aa68-c49712466d07  ONLINE       0     0     0

" nvme list -v" results in the following:
NVM Express Subsystems

Subsystem Subsystem-NQN Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------

nvme-subsys0 nqn.1994-11.com.samsung:nvme:980PRO:M.2:S6S2NS0TA04636Y nvme0
nvme-subsys1 nqn.2014.08.org.nvmexpress:80868086BTPG128200VY512A-2 INTEL HBRPEKNL0202AHO nvme1
nvme-subsys2 nqn.1994-11.com.samsung:nvme:980PRO:M.2:S6S2NS0TA04991H nvme2
nvme-subsys3 nqn.2014.08.org.nvmexpress:80868086BTPG12820206512A-2 INTEL HBRPEKNL0202AHO nvme3

NVM Express Controllers

Device SN MN FR TxPort Address Subsystem Namespaces
-------- -------------------- ---------------------------------------- -------- ------ -------------- ------------ ----------------
nvme0 S6S2NS0TA04636Y Samsung SSD 970 EVO Plus 2TB 4B2QEXM7 pcie 0000:04:00.0 nvme-subsys0 nvme0n1
nvme1 BTPG128200VY512A-2 INTEL HBRPEKNL0202AHO HPS1 pcie 0000:0c:00.0 nvme-subsys1 nvme1n1
nvme2 S6S2NS0TA04991H Samsung SSD 970 EVO Plus 2TB 4B2QEXM7 pcie 0000:14:00.0 nvme-subsys2 nvme2n1
nvme3 BTPG12820206512A-2 INTEL HBRPEKNL0202AHO HPS1 pcie 0000:1c:00.0 nvme-subsys3 nvme3n1

NVM Express Namespaces

Device NSID Usage Format Controllers
------------ -------- -------------------------- ---------------- ----------------
nvme0n1 1 229.49 GB / 2.00 TB 512 B + 0 B nvme0
nvme1n1 1 29.26 GB / 29.26 GB 512 B + 0 B nvme1
nvme2n1 1 111.34 GB / 2.00 TB 512 B + 0 B nvme2
nvme3n1 1 29.26 GB / 29.26 GB 512 B + 0 B nvme3

"nvme list-subsys" results in the following:

nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:980PRO:M.2:S6S2NS0TA04636Y
\
 +- nvme0 pcie 0000:04:00.0 live
nvme-subsys1 - NQN=nqn.2014.08.org.nvmexpress:80868086BTPG128200VY512A-2 INTEL HBRPEKNL0202AHO
\
 +- nvme1 pcie 0000:0c:00.0 live
nvme-subsys2 - NQN=nqn.1994-11.com.samsung:nvme:980PRO:M.2:S6S2NS0TA04991H
\
 +- nvme2 pcie 0000:14:00.0 live
nvme-subsys3 - NQN=nqn.2014.08.org.nvmexpress:80868086BTPG12820206512A-2 INTEL HBRPEKNL0202AHO
\
 +- nvme3 pcie 0000:1c:00.0 live


If I issue a "lspci -vv -nn -s 0000:04:00.0|grep Lnk", I get the following:
LnkCap: Port #0, Speed 5GT/s, Width x32, ASPM L0s, Exit Latency L0s <64ns
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
LnkSta: Speed 5GT/s (ok), Width x32 (ok)
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
LnkCtl3: LnkEquIntrruptEn- PerformEqu-

(and the same holds true for the rest of the drives).
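A quick loop over the controller addresses from the nvme list above dumps the same link info for each drive (rough sketch):

# Rough sketch: show the negotiated PCIe link speed/width for each NVMe
# controller (addresses taken from the "nvme list -v" output above).
for addr in 0000:04:00.0 0000:0c:00.0 0000:14:00.0 0000:1c:00.0; do
    echo "== $addr =="
    lspci -vv -s "$addr" | grep -E 'LnkCap|LnkSta'
done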

Dataset settings for the first run:
Sync: DISABLED
Compression Level: Inherit (LZ4)
Enable Atime: OFF
ZFS Deduplication: OFF
Case Sensitivity: ON

Running Random write test for IOP/s test on: /mnt/pool0002/normaldata
test: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Run status group 0 (all jobs):
write: IOPS=886, BW=886MiB/s (929MB/s)(4096MiB/4621msec); 0 zone resets
WRITE: bw=886MiB/s (929MB/s), 886MiB/s-886MiB/s (929MB/s-929MB/s), io=4096MiB (4295MB), run=4621-4621msec
-----------------------------------------------------------------------

Running Random Read test for IOP/s test on: /mnt/pool0002/normaldata
test: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
read: IOPS=2196, BW=2196MiB/s (2303MB/s)(4096MiB/1865msec)
Run status group 0 (all jobs):
READ: bw=2196MiB/s (2303MB/s), 2196MiB/s-2196MiB/s (2303MB/s-2303MB/s), io=4096MiB (4295MB), run=1865-1865msec
-----------------------------------------------------------------------

Running Mixed Random Workload test on: /mnt/pool0002/normaldata
test: (g=0): rw=rw, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process

Run status group 0 (all jobs):
READ: bw=1094MiB/s (1147MB/s), 1094MiB/s-1094MiB/s (1147MB/s-1147MB/s), io=1992MiB (2089MB), run=1821-1821msec
WRITE: bw=1155MiB/s (1212MB/s), 1155MiB/s-1155MiB/s (1212MB/s-1212MB/s), io=2104MiB (2206MB), run=1821-1821msec
-----------------------------------------------------------------------

Running Sequential write test for throughput test on: /mnt/pool0002/normaldata
test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Run status group 0 (all jobs):
WRITE: bw=769MiB/s (806MB/s), 769MiB/s-769MiB/s (806MB/s-806MB/s), io=4096MiB (4295MB), run=5326-5326msec
-----------------------------------------------------------------------

Running Sequential Read test for throughput test on: /mnt/pool0002/normaldata
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Run status group 0 (all jobs):
READ: bw=2263MiB/s (2373MB/s), 2263MiB/s-2263MiB/s (2373MB/s-2373MB/s), io=4096MiB (4295MB), run=1810-1810msec

Dataset settings for the second run:

Sync: ALWAYS
Compression Level: Inherit (LZ4)
Enable Atime: OFF
ZFS Deduplication: OFF
Case Sensitivity: ON
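(The only change between the two runs is the sync property on the dataset. Toggling it would look something like this, assuming the dataset name matches the mount point /mnt/pool0002/normaldata:)

# Dataset name assumed from the mount point used in the tests.
zfs set sync=disabled pool0002/normaldata   # first run
zfs set sync=always pool0002/normaldata     # second run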

Running Random write test for IOP/s test on: /mnt/pool0002/normaldata
test: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
write: IOPS=240, BW=240MiB/s (252MB/s)(4096MiB/17066msec); 0 zone resets
Run status group 0 (all jobs):
WRITE: bw=240MiB/s (252MB/s), 240MiB/s-240MiB/s (252MB/s-252MB/s), io=4096MiB (4295MB), run=17066-17066msec
-----------------------------------------------------------------------
Running Random Read test for IOP/s test on: /mnt/pool0002/normaldata
test: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
read: IOPS=2184, BW=2185MiB/s (2291MB/s)(4096MiB/1875msec)
Run status group 0 (all jobs):
READ: bw=2185MiB/s (2291MB/s), 2185MiB/s-2185MiB/s (2291MB/s-2291MB/s), io=4096MiB (4295MB), run=1875-1875msec
-----------------------------------------------------------------------

Running Mixed Random Workload test on: /mnt/pool0002/normaldata
test: (g=0): rw=rw, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Run status group 0 (all jobs):
READ: bw=231MiB/s (243MB/s), 231MiB/s-231MiB/s (243MB/s-243MB/s), io=1992MiB (2089MB), run=8611-8611msec
WRITE: bw=244MiB/s (256MB/s), 244MiB/s-244MiB/s (256MB/s-256MB/s), io=2104MiB (2206MB), run=8611-8611msec
-----------------------------------------------------------------------
Running Sequential write test for throughput test on: /mnt/pool0002/normaldata
test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Run status group 0 (all jobs):
WRITE: bw=243MiB/s (254MB/s), 243MiB/s-243MiB/s (254MB/s-254MB/s), io=4096MiB (4295MB), run=16884-16884msec
-----------------------------------------------------------------------
Running Sequential Read test for throughput test on: /mnt/pool0002/normaldata
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Run status group 0 (all jobs):
READ: bw=2079MiB/s (2180MB/s), 2079MiB/s-2079MiB/s (2180MB/s-2180MB/s), io=4096MiB (4295MB), run=1970-1970msec
-----------------------------------------------------------------------

It would seem the pool is slower than cold glue once sync writes come into play, and not what I hoped to get from NVMe.
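Something I could still try is a low-queue-depth sync write job against one of the Optane namespaces (nvme1n1 from the list above) to see what latency the log devices themselves deliver. A rough sketch only, and note that it writes raw to the device, so it could only be run with that device removed from the pool:

[global]
name=optane-sync-write
time_based
ramp_time=5
runtime=30
readwrite=randwrite
bs=4k
ioengine=libaio
direct=1
sync=1
iodepth=1
numjobs=1
group_reporting=1

[slog-test]
# WARNING: raw device writes; only on a device pulled out of the pool.
filename=/dev/nvme1n1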

If I run fio against one of the raw NVMe devices, I get this:

[global]
name=nvme-seq-read
time_based
ramp_time=5
runtime=30
readwrite=read
bs=256k
ioengine=libaio
direct=1
numjobs=1
iodepth=32
group_reporting=1

[nvme0]
filename=/dev/nvme0n1

results:
iops : min=13571, max=13685, avg=13657.95, stdev=17.25, samples=60
Run status group 0 (all jobs):
READ: bw=3412MiB/s (3578MB/s), 3412MiB/s-3412MiB/s (3578MB/s-3578MB/s), io=99.0GiB (107GB), run=30003-30003msec
Disk stats (read/write):
nvme0n1: ios=477602/378, merge=0/0, ticks=1103483/11, in_queue=1103526, util=99.79%

---------------------------------------------------------------------------------------------------

[global]
name=nvme-seq-write
time_based
ramp_time=5
runtime=30
readwrite=write
bs=256k
ioengine=libaio
direct=1
numjobs=1
iodepth=32
group_reporting=1

[nvme0]
filename=/dev/nvme0n1


results:
iops : min=12854, max=13210, avg=13130.10, stdev=67.78, samples=60
Run status group 0 (all jobs):
WRITE: bw=3280MiB/s (3440MB/s), 3280MiB/s-3280MiB/s (3440MB/s-3440MB/s), io=96.1GiB (103GB), run=30003-30003msec
Disk stats (read/write):
nvme0n1: ios=74/459340, merge=0/0, ticks=40/1103274, in_queue=1103348, util=99.84%

Any ideas as to how to speed up the process?

thewizard

Dabbler
Joined
Apr 1, 2014
Messages
30
Running it as a VM isn't really recommended, I believe.

The CPU, the E5-2450, is an old chip, DDR3 I believe, socket FCLGA1356.

What motherboard? Socket LGA1356 is pretty ancient. Looking at some of the Supermicro boards of that era, you'd be in X9-series vintage territory.

It's the SATA/SAS 6Gbps era with an old Intel C606 chipset, which is pretty slow by today's standards. It does at least have PCIe 3.0 x16, which with an expansion card can theoretically move about 16GB/s each way (the 32GB/s headline figure counts both directions), but you'll be bottlenecked by that slow DDR3 RAM long before that limit.

So you have two pools?

pool 1?
nvme-subsys0 nqn.1994-11.com.samsung:nvme:980PRO:M.2:S6S2NS0TA04636Y nvme0
nvme-subsys1 nqn.2014.08.org.nvmexpress:80868086BTPG128200VY512A-2 INTEL HBRPEKNL0202AHO nvme1

Pool 2?
nvme-subsys2 nqn.1994-11.com.samsung:nvme:980PRO:M.2:S6S2NS0TA04991H nvme2
nvme-subsys3 nqn.2014.08.org.nvmexpress:80868086BTPG12820206512A-2 INTEL HBRPEKNL0202AHO nvme3

Is that correct? If so, you'll be limited to the write speed of the slower Optane drives (they are way slower in straight read/write than the Samsung PCIe Gen4 980 Pros). The Gen4 Samsungs are good for around 6000MB/s, but only on a motherboard that supports that PCIe generation, which yours would not.

A mirrored vdev also writes no faster than a single drive, so I'd say the fact you are still getting 2000-3000MB/s is actually pretty good.

I might have things confused, but I think you'd get better performance with pools of like drives, e.g. (rough CLI sketch after the list):

pool 1
Samsung pro
Samsung pro

pool 2
Optane
Optane
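
Roughly, from the CLI that layout would look like the following (device names are placeholders rather than your actual disks, creating pools wipes whatever is on them, and on TrueNAS you'd normally build this through the GUI so it partitions and labels the disks itself):

# Hypothetical layout only; pool and device names are placeholders.
zpool create fastpool mirror /dev/nvme0n1 /dev/nvme2n1   # the two Samsung drives
zpool create scratch mirror /dev/nvme1n1 /dev/nvme3n1    # the two Optane drives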

Whatever workload you have, you might see some performance gains by putting any 'temp'-type files onto the Optane pool and then having your end result written to the Samsung pool.

I don't know what your workload/use case is, but in some specific cases an L2ARC is a good choice: anything involving lots of reads of the same data over and over. Could be something to consider if that's your workload: you have one pool of two Samsung drives (in a mirror if you value your data) and then you have the Optane drives as an L2ARC cache... but again, that will only speed things up in very specific use cases. Most of the time it's slower than having no L2ARC at all. You'd want a UPS too if using L2ARC.
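If you ever want to try it, attaching a cache device is a one-liner, something like this (sketch only: the device path is a placeholder, the Optane would first have to be freed from SLOG duty, and the TrueNAS UI is the normal way to do it):

# Hypothetical sketch: attach an Optane device as L2ARC to the pool.
zpool add pool0002 cache /dev/nvme1n1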
 

ArchatParks

Dabbler
Joined
Feb 8, 2022
Messages
28
Thanks for responding!

The actual computer is a Dell PowerEdge R520... not sure about the motherboard. I tried to look it up, but it just gave random Dell part numbers.

It's one pool... the two Optane drives are the mirrored SLOG for that pool (there is another pool, but it's large hard drives only used for backup, so their speed doesn't really matter much). So the pool is the two Samsung drives mirrored for data and the two Optane drives mirrored as the SLOG.
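A quick way to confirm the Optane mirror is actually taking the sync writes is to watch per-vdev I/O while one of the sync=always tests runs, something like:

# Per-vdev I/O every 5 seconds; the "logs" mirror should light up during
# sync writes.
zpool iostat -v pool0002 5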

One other detail I left out: the four drives all sit on a Supermicro AOC-SHG3-4M2P full-height, quad-port M.2 NVMe SSD PCIe 3.0 add-on card.
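Since all four drives hang off that one card, the slot's upstream link is probably worth checking as well as the per-drive links (though with the controllers passed through, the topology the VM sees may not match the physical slot). Roughly:

# Show the PCIe tree, find the bridge the four NVMe devices sit behind,
# then check that bridge's negotiated link (address is a placeholder).
lspci -tv
lspci -vv -s <bridge address> | grep -E 'LnkCap|LnkSta'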


For the heck of it, a few weeks ago I did a temporary bare-metal install of TrueNAS on that computer (to bypass running it as a VM). These were the results:


Running Random write test for IOP/s test on: /mnt/pool0002/normaldata
test: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
write: IOPS=2142, BW=2142MiB/s (2246MB/s)(4096MiB/1912msec); 0 zone resets
WRITE: bw=2142MiB/s (2246MB/s), 2142MiB/s-2142MiB/s (2246MB/s-2246MB/s), io=4096MiB (4295MB), run=1912-1912msec
-----------------------------------------------------------------------
Running Random Read test for IOP/s test on: /mnt/pool0002/normaldata
test: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Run status group 0 (all jobs):
READ: bw=2695MiB/s (2826MB/s), 2695MiB/s-2695MiB/s (2826MB/s-2826MB/s), io=4096MiB (4295MB), run=1520-1520msec
-----------------------------------------------------------------------
Running Mixed Random Workload test on: /mnt/pool0002/normaldata
test: (g=0): rw=rw, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
READ: bw=1463MiB/s (1534MB/s), 1463MiB/s-1463MiB/s (1534MB/s-1534MB/s), io=1992MiB (2089MB), run=1362-1362msec
WRITE: bw=1545MiB/s (1620MB/s), 1545MiB/s-1545MiB/s (1620MB/s-1620MB/s), io=2104MiB (2206MB), run=1362-1362msec
-----------------------------------------------------------------------
Running Sequential write test for throughput test on: /mnt/pool0002/normaldata
test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Run status group 0 (all jobs):
WRITE: bw=2203MiB/s (2310MB/s), 2203MiB/s-2203MiB/s (2310MB/s-2310MB/s), io=4096MiB (4295MB), run=1859-1859msec
-----------------------------------------------------------------------
Running Sequential Read test for throughput test on: /mnt/pool0002/normaldata
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
READ: bw=2546MiB/s (2669MB/s), 2546MiB/s-2546MiB/s (2669MB/s-2669MB/s), io=4096MiB (4295MB), run=1609-1609msec
-----------------------------------------------------------------------
 