NVMe-oF support planned?

Glowtape

Dabbler
Joined
Apr 8, 2017
Messages
45
Are there plans to support this eventually? The Linux kernel ships with an NVMe-oF target driver that seems to perform decently. It appears to be a bit more performant over TCP than regular iSCSI. On top of that, it would also allow RDMA-backed block I/O with Windows via the Starwind NVMe-oF initiator.

Right now I'm jerry-rigging this on TrueNAS via the terminal, by loading the nvmet modules and installing an external nvmetcli package for configuration. Native support would be interesting.
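For anyone curious, the bare-bones runtime steps look roughly like this (a sketch, assuming the nvmet modules ship with the SCALE kernel; use nvmet-rdma instead of nvmet-tcp if your NICs do RDMA):

# load the target core plus a transport
modprobe nvmet
modprobe nvmet-tcp    # or: modprobe nvmet-rdma
# the target is then configured through configfs under this tree
ls /sys/kernel/config/nvmet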
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Are you pointing the NVMe-oF to a drive or a zvol?
I'd be interested to see what the performance difference is between this and iSCSI for workloads that actually go to a zvol (SSDs or HDDs).
Both real and synthetic workloads would be interesting.
 

Glowtape

Dabbler
Joined
Apr 8, 2017
Messages
45
I'm pointing it at ZVOLs.

I did some quick read tests with Diskmark. (I skipped the write ones because I want to avoid adding fragmentation to the pool through benchmarking, but I guess they'd fall in line.)

That said, I'm not 100% sure I did everything entirely right, considering the results. But the iSCSI service on TrueNAS is in its stock configuration.

My personal use case is to get data out of ARC, and sometimes L2ARC (on an NVMe SSD), as fast as possible. (Mostly to get SSD-like performance for games stored on the NAS, for streaming assets and such.)

The client system is a Threadripper 2950X with 16 cores; the NAS runs TrueNAS SCALE on a Ryzen 5750G with 8 cores and 64GB of DDR4-3200 RAM. Both systems have Mellanox ConnectX-3 cards set to 40GbE and are directly connected with a DAC cable, with 9K jumbo frames configured at both ends.
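(For reference, the jumbo frame part is just the usual MTU setting; a sketch for the Linux/TrueNAS side, with the interface name as a placeholder, the Windows side being set in the adapter properties:)

# enp65s0 is a placeholder for the ConnectX-3 interface
ip link set dev enp65s0 mtu 9000
ip link show dev enp65s0    # confirm the MTU took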

I created a blank 4GB zvol with no compression for each share type. The zvol block size is 16KB (chosen as a middle ground for on-disk compression on my other zvols).

(Considering the Q32T1 RND4K results, there might be gains to be had by lowering the zvol block size toward the 4KB cluster size of NTFS.)
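For reproducibility, the test zvols were created with something along these lines (pool and name are placeholders, not my actual layout):

# blank 4GB zvol, 16K volblocksize, compression off -- pool/name are placeholders
zfs create -V 4G -o volblocksize=16K -o compression=off tank/nvmeof-bench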

iSCSI via TCP/IP -- via Microsoft iSCSI Initiator

[benchmark screenshot]


NVMe-oF via RDMA -- via Starwind NVMe-oF Initiator

[benchmark screenshot]
 

Glowtape

Dabbler
Joined
Apr 8, 2017
Messages
45
I'd have to figure out how to do that, foremost because there's no publicly available initiator for Windows that does iSER. I guess I could set up Linux on my "spare" SSD in my workstation.

Either way, those results should be transfers from ARC. Diskmark writes a 1GB file to the disk (which ought to be entirely cached), then does its read operations on it. I wouldn't expect the iSCSI target in TrueNAS to go around the ARC and hit the disk (nvmet certainly doesn't do that).
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
The iSCSI Sequential 1M numbers are lower than expected. Perhaps it is due to the low zvol block size?

Is there any indication of where the bottleneck is? Is it the client's Windows TCP stack?


I'd like to see the test when running against a zvol on the NVMe SSD... if you can do that, it would be appreciated.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'd have to figure out how to do that, foremost because there's no publicly available initiator for Windows that does iSER. I guess I could set up Linux on my "spare" SSD in my workstation.

Either way, those results should be transfers from ARC. Diskmark writes a 1GB file to the disk (which ought to be entirely cached), then does its read operations on it. I wouldn't expect the iSCSI target in TrueNAS to go around the ARC and hit the disk (nvmet certainly doesn't do that).
Try WinOF for iSER support - I'm only linking through the NVIDIA portal since the Mellanox support site seems to be throwing SSL errors at me. The original WinOF should work for a Win10 client and a ConnectX-3.

 

Chrisputer

Cadet
Joined
Aug 16, 2022
Messages
5
Any updates? I'm wanting something like this, where I run a TrueNAS VM on each of my VMware hosts, pass through storage, hook everything up via NVMe-oF, and cluster it all together using a mirror setup.

I can dream of this, right?
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Any updates? I'm wanting something like this, where I run a TrueNAS VM on each of my VMware hosts, pass through storage, hook everything up via NVMe-oF, and cluster it all together using a mirror setup.

I can dream of this, right?
That's one dream.
Or we can dream of not needing to pay the VMware tax... (especially if Broadcom milks it).
 

bigjme

Dabbler
Joined
May 16, 2021
Messages
19
It's been a while, so I thought I'd see if you've had a chance to try this on a real workload, like loading games or something that won't be cached in memory on write?

It would be great if there was some way you could provide some details on how you set this up, so maybe iX could give it a try on their end and see if they could get it working. This could be a game changer for iSCSI workloads.

I, for example, run most of my desktop storage directly off iSCSI drives (games and even software/user data) over 25GbE links, but have held off buying too much NVMe due to the speed.
 

Glowtape

Dabbler
Joined
Apr 8, 2017
Messages
45
I haven't really done any more tests; I've been busy with other things. The synthetic benchmarks are all way faster, so I'm just assuming that applies to the rest. Most notably because I can actually use RDMA this way, which _does_ make a difference. That said, if an I/O has to bypass all caches and hit the disk, I doubt you'd notice much of a difference (which is why my personal NAS is overspecced with 64GB of RAM and has a 384GB L2ARC).

As to how to use it: create a .conf file in /etc/modules-load.d and put these entries in there, each on its own line: nvme and nvmet. If you can and want to use RDMA, also put nvmet-rdma in it.
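In other words, a file along these lines (the filename is arbitrary):

# /etc/modules-load.d/nvmet.conf -- filename is arbitrary
nvme
nvmet
# only if you can and want to use RDMA:
nvmet-rdma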

The nvmet kernel module gets configured via configfs, but there's a frontend to it called nvmetcli, which also allows making the settings persistent. I've been jerry-rigging it so far by converting an RPM to a deb and fixing up the Python bits. As soon as I upgrade to Bluefin, I'll attempt the proper way: downloading the source and running setup.py (there's also a systemd service you'd need to install manually): http://git.infradead.org/users/hch/nvmetcli.git
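A rough sketch of what that from-source route should look like (the unit file name and dependencies are from memory, so double-check against the repo):

# nvmetcli needs the Python configshell-fb module; exact package name may vary
git clone http://git.infradead.org/users/hch/nvmetcli.git
cd nvmetcli
python3 setup.py install
# install the bundled systemd unit (check the repo for the exact filename)
cp nvmet.service /etc/systemd/system/
systemctl enable --now nvmet.service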

There's a manual in Documentation/nvmetcli.txt telling you how to configure things in detail. Here's what it looks like for me: https://i.imgur.com/Yd6u1lc.png / https://pastebin.com/Xq8VhBak
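If you'd rather skip nvmetcli entirely, the same setup can be done with raw configfs writes; a minimal single-namespace example (the NQN, zvol path, address and transport below are placeholders) looks roughly like this:

# all names/addresses below are placeholders -- adjust to your setup
cd /sys/kernel/config/nvmet
mkdir subsystems/nqn.2022-11.nas.local:testvol
echo 1 > subsystems/nqn.2022-11.nas.local:testvol/attr_allow_any_host
mkdir subsystems/nqn.2022-11.nas.local:testvol/namespaces/1
echo /dev/zvol/tank/nvmeof-bench > subsystems/nqn.2022-11.nas.local:testvol/namespaces/1/device_path
echo 1 > subsystems/nqn.2022-11.nas.local:testvol/namespaces/1/enable
mkdir ports/1
echo ipv4 > ports/1/addr_adrfam
echo tcp > ports/1/addr_trtype      # or rdma
echo 10.0.0.2 > ports/1/addr_traddr
echo 4420 > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/nqn.2022-11.nas.local:testvol ports/1/subsystems/nqn.2022-11.nas.local:testvol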

Over on Windows you need Starwind's NVMe-oF Initiator, which ties into the iSCSI Initiator dialog as a separate driver. You can get a free evaluation version that's quasi-unlimited (I'd be willing to pay 50 bucks or something for a personal license, to stay out of any legal weeds, but alas, it's not to be).

On the TrueNAS side, remember that it works like an appliance and assumes the base image stays unmodified. If you do an upgrade, you need to do the whole song and dance again. Just remember to back up /etc/nvmet/config.json before an upgrade, so you can quickly restore all NVMe-oF targets.
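Concretely, the pre/post-upgrade routine would look something like this (the backup location is just an example):

# before the upgrade: stash the config somewhere on the pool
cp /etc/nvmet/config.json /mnt/tank/backups/nvmet-config.json
# after the upgrade: redo the module/nvmetcli setup, put the config back,
# then load it into the running target
cp /mnt/tank/backups/nvmet-config.json /etc/nvmet/config.json
nvmetcli restore /etc/nvmet/config.json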
 
Joined
Nov 2, 2022
Messages
7
Happened to stumble across this thread and did some testing of my own. I have a ZoL VM (Ubuntu 22.04) running in Proxmox with virtual (virtio) drives coming from TrueNAS CORE over NFS (it's just a test system). I created zvols and shared them via iSCSI (targetcli) and subsequently via nvmet, then ran some generic fio tests using a Linux client. The results for NVMe-oF are 2x, and in certain circumstances approaching 3x, the iSCSI numbers. Very promising stuff for sure.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Happened to stumble across this thread and did some testing of my own. I have a ZoL VM (Ubuntu 22.04) running in Proxmox with virtual (virtio) drives coming from TrueNAS CORE over NFS (it's just a test system). I created zvols and shared them via iSCSI (targetcli) and subsequently via nvmet, then ran some generic fio tests using a Linux client. The results for NVMe-oF are 2x, and in certain circumstances approaching 3x, the iSCSI numbers. Very promising stuff for sure.

Hi Travis, interesting. When you did these tests, was it with a dataset large enough to not all be in cache?

I think we'll get very different speed-ups for cached data, NVMe data, SATA SSD data and HDD data. Interested to understand those numbers.
 
Joined
Nov 2, 2022
Messages
7
So I created zvols like so:

# did sparse and thick for both nvme and iscsi zvols
zfs create -V 1gb -s tank/nvmezvol
zfs create -V 1gb tank/nvmezvol

# attached from the client appropriately (see the sketch after these commands) and ran some fio commands
fio --filename=<device> --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly
fio --filename=<device> --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1
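
(For reference, the "attached from the client appropriately" step with nvme-cli would look roughly like the following; the address and NQN are placeholders, and -t would be rdma instead of tcp on an RDMA setup.)

nvme discover -t tcp -a 192.168.1.10 -s 4420
nvme connect -t tcp -a 192.168.1.10 -s 4420 -n nqn.2022-11.test:nvmezvol
nvme list    # the namespace appears as /dev/nvmeXnY, which is what fio targets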

I don't claim to be an expert and would be happy to run different fio tests or other tooling if someone wishes. Based on what I understand, I think it's pretty unlikely any cache would have kicked in for the above, right?

Got any specific tests/scenarios you'd like me to run through?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The small size of the zvol (1G) means that the ZFS adaptive replacement cache (ARC) will definitely have had a chance to kick in from a read angle, but that doesn't make your benchmarks invalid - in fact, we probably want to have the reads cached in order to tease out the protocol differences.

If we're bottlenecking at your back-end device (SATA SSD or HDD), then the protocol at the network layer (NVMe-oF vs iSCSI) won't be as crucial. Having the reads come from ARC (RAM) will be best, because that will showcase any protocol superiority nicely.

Once we determine that NVMe-oF gives protocol-level advantages, we can see if alleviating that bottleneck lets us tease any more out of the back-end devices. The less latency that exists at any level of the storage pipeline, the better we can identify the spots that remain.

Will something like NVMe-oF make a pool of spinning disks as fast as flash? Of course not. But if it takes out a few hundred microseconds of combined protocol overhead, that's a few hundred microseconds better than before.
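An easy sanity check while a benchmark runs is to watch the ARC hit rate on the server side; a sketch (tool availability may vary by release):

# live ARC hit/miss counters, refreshed every second
arcstat 1
# or read the raw kstat counters directly
grep -E '^(hits|misses) ' /proc/spl/kstat/zfs/arcstats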
 
Joined
Nov 2, 2022
Messages
7
My bare metal device (TrueNAS CORE 13) has a pool of 24 WD Reds in vdevs of 6 drives with 2 parity drives each, NVMe for the ZIL/SLOG and an SSD for L2ARC, with 32GB of RAM. Proxmox uses that as a datastore over NFS.

The Ubuntu 22.04 VM has 4 drives backed by that storage: 1 for the OS and 3 extras for a RAIDZ1 pool, with 8GB of RAM. The shared zvols were created on this system, on the pool with the 3 devices.

Indeed, the testing I did was meant to keep everything equal except the protocol. The absolute numbers probably are not great, but the intention is definitely to look at the relative numbers. I'm happy to tweak any aspect within my power and run further tests if they would help.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
My bare metal device (TrueNAS CORE 13) has a pool of 24 WD Reds in vdevs of 6 drives with 2 parity drives each, NVMe for the ZIL/SLOG and an SSD for L2ARC, with 32GB of RAM. Proxmox uses that as a datastore over NFS.

The Ubuntu 22.04 VM has 4 drives backed by that storage: 1 for the OS and 3 extras for a RAIDZ1 pool, with 8GB of RAM. The shared zvols were created on this system, on the pool with the 3 devices.

Indeed, the testing I did was meant to keep everything equal except the protocol. The absolute numbers probably are not great, but the intention is definitely to look at the relative numbers. I'm happy to tweak any aspect within my power and run further tests if they would help.

Just making sure everyone understands that the 200% acceleration is only for cached datasets.

For all-flash, my guess is that the acceleration will be about 25%

For HDD pools, the acceleration may be less than 5%, especially for reads.

However, I would like to see data from anyone who can test that thesis.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Got details of how you would like the tests run? I can run through more tests.
Same test with the dataset being 4x larger than the RAM on the machine... that would give us roughly the data needed for the HDD use case.
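As a sketch (assuming the 8GB client VM mentioned earlier, with the pool/zvol name as a placeholder):

# 4x the VM's 8GB of RAM, so reads can't all be served from ARC
zfs create -V 32G tank/nvmezvol-big
# export it over both iSCSI and NVMe-oF as before, then rerun the same fio
# jobs against the new devices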
 
Joined
Nov 2, 2022
Messages
7
Want to send some specific fio commands (or any other tools/commands)? There are obviously lots of variables at play. The VM has 8GB of RAM.
 