Why NVMe SSDs are the best thing since sliced bread

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Someone asked me in private conversation about using multiple NVMe drives on an add-on card like the Supermicro AOC-SLG3-2M and the implications for passthrough in a hypervisor/TrueNAS hybrid context, e.g. ESXi. Confusion arose about the general recommendation never to pass individual drives but only HBAs as PCIe devices.

The person was concerned because the mentioned card itself does not show up as a separate device. It's a mostly passive bus adaptor (modulo some resistors and capacitors for signal stability and a handful of active components).

What is, from an "advanced home lab" point of view, an absolutely incredible feature of NVMe technology is that each "disk" is its own PCIe device. There is no controller or HBA with drives connected behind it. Theoretically that makes each individual SSD more expensive, but in this industry we have long learned that uniform interfaces and economies of scale beat component count every single time. And the SSD controller and the PCIe interface have probably long been merged into a single chip, or at least a chipset; otherwise the prices for "prosumer" NVMe SSDs would not be possible.
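A quick way to see this for yourself from a Linux shell (TrueNAS SCALE included): each NVMe SSD shows up as its own PCIe function with the kernel's nvme driver bound directly to it, and there is no HBA device anywhere in between. A minimal check, just as a sketch:

# Each NVMe SSD is listed as its own "Non-Volatile memory controller" PCIe
# function, with the nvme driver attached directly to it.
lspci -nnk | grep -i -A3 'non-volatile'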

What this means in the end is that you can cram this mainboard and the aforementioned card into this chassis. OK, you should add an active CPU cooler, but that is also available. Then add a SATA 2.5" SSD of sufficient size to install ESXi, including a small datastore for VM images.

Add three M.2 NVMe SSDs of your choice and you can run
  • ESXi
  • 3 (!) TrueNAS SCALE VMs with
  • one NVMe SSD passed through as a PCIe device to each VM
  • one network interface passed through as a PCIe device to each VM
Happy clustering!
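If you build something like this, a quick sanity check inside each SCALE VM is to confirm that the passed-through SSD really shows up as a native NVMe device rather than a virtual disk. A rough sketch, assuming nvme-cli is available in the VM (plain ls works regardless):

# The passed-through SSD should appear as a real NVMe controller and namespace.
ls /dev/nvme*        # e.g. /dev/nvme0 and /dev/nvme0n1
# If nvme-cli is installed, this shows the drive's actual model and firmware:
nvme list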

Or just a single VM with three SSDs and two network interfaces so both ESXi and TrueNAS can use link aggregation.

Or whatever suits your fancy.

When we get chassis with U.2 slots that are not 19", ridiculously deep, loud, and power hungry ... we can finally put the HBA discussion to rest.


Just my thoughts after that private exchange.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Add three M.2 NVMe SSDs of your choice and you can run
  • ESXi
  • 3 (!) TrueNAS SCALE VMs with
  • one NVMe SSD passed through as a PCIe device to each VM
  • one network interface passed through as a PCIe device to each VM
Happy clustering!
Agreed that NVMe changes the game for lab systems
You can even run TrueNAS SCALE as the Host as well as the Guest VMs. If anyone has tried this, please let us know.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Agreed that NVMe changes the game for lab systems
You can even run TrueNAS SCALE as the Host as well as the Guest VMs. If anyone has tried this, please let us know.
I moved all of my homelab's ESXi-hosted VMs over to SCALE last year. There are a bunch of other folks who did this, à la this post on Reddit:

I hadn't even thought to nest other SCALE nodes inside of SCALE. What a clever idea to play around with.
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
We recently conducted a test with 24 x 15.36TB Micron 9400 Pro drives in a stupid configuration that we would never deploy, just to see what they were capable of. In a Dell PowerEdge R7525 with a pair of Epyc 75F3 CPUs, 1TB of RAM, and a 24-vdev ZFS pool with one drive per vdev, we pulled 9.3 million IOPS and 38GB/sec of sequential read IO. I suspect they would have been able to do more, but the CPUs were likely our limiting factor, either in raw processing power or just bandwidth.
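For reference, a pool like that is just a plain 24-wide stripe, one disk per top-level vdev. Roughly something like this, as a sketch only (the pool name and device names are placeholders, and nobody should put real data on it):

# Stripe across 24 NVMe namespaces, each one becoming its own top-level vdev.
# /dev/nvme0n1 .. /dev/nvme23n1 and "testpool" are placeholders.
zpool create testpool $(for i in $(seq 0 23); do printf '/dev/nvme%sn1 ' "$i"; done)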

[Screenshot attachments showing the benchmark results]


The speed of these drives is getting insane.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
We recently conducted a test with 24 x 15.36TB Micron 9400 Pro drives in a stupid configuration that we would never deploy, just to see what they were capable of. In a Dell PowerEdge R7525 with a pair of Epyc 75F3 CPUs, 1TB of RAM, and a 24-vdev ZFS pool with one drive per vdev, we pulled 9.3 million IOPS and 38GB/sec of sequential read IO. I suspect they would have been able to do more, but the CPUs were likely our limiting factor, either in raw processing power or just bandwidth.


The speed of these drives is getting insane.
Can you run a benchmark for me? If not it's cool.
I've been tracking performance of various weird configurations over here: https://forum.level1techs.com/t/truenas-scale-performance-testing/187486
fio --bs=128k --direct=1 --directory=/mnt/POOLNAME/DATASETNAME --gtod_reduce=1 --ioengine=posixaio --iodepth=32 --group_reporting --name=randrw --numjobs=12 --ramp_time=10 --runtime=60 --rw=randrw --size=256M --time_based
and would love to add that to my dataset.
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
Can you run a benchmark for me? If not it's cool.
I've been tracking performance of various weird configurations over here: https://forum.level1techs.com/t/truenas-scale-performance-testing/187486
fio --bs=128k --direct=1 --directory=/mnt/POOLNAME/DATASETNAME --gtod_reduce=1 --ioengine=posixaio --iodepth=32 --group_reporting --name=randrw --numjobs=12 --ramp_time=10 --runtime=60 --rw=randrw --size=256M --time_based
and would love to add that to my dataset.
Unfortunately, I can't, as that device has been put into service as a secure data transport appliance and is currently at a customer site. I do have an older version of this, using a PE 740xd and 24 x 15.36TB Micron 9300 NVMe drives, that I might be able to test with. Let me see what I can do.
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
I was able to get some relatively idle time to run this. As I said before, the system listed in my previous post has been put into service and isn't available for testing. Instead, I ran this on an older system that's been in service for just over 3 years. That system's specs are as follows:
  • Dell PowerEdge R740xd
    • 2 x Intel Xeon Silver 4216 CPUs (32 cores, 64 threads total @ 2.10GHz)
    • 386GB DDR4 Registered ECC RAM @ 2400MHz
    • 24 x 15.36TB Micron 9300 Pro NVMe U.2 drives (PCIe Gen 3)
      • zpool config: 3 x 8 (RAIDZ2)
        • ~240TiB
    • 2 x Intel XXV710 dual-port 25GbE NICs
    • TrueNAS Core 12.0-U8.1
Bearing in mind that this system was seeing active though minor use during this test, I received the following results:

READ: bw=9104MiB/s (9547MB/s), 9104MiB/s-9104MiB/s (9547MB/s-9547MB/s), io=534GiB (574GB), run=60094-60094msec
WRITE: bw=9096MiB/s (9538MB/s), 9096MiB/s-9096MiB/s (9538MB/s-9538MB/s), io=534GiB (573GB), run=60094-60094msec
read: IOPS=72.8k, BW=9104MiB/s (9547MB/s)(534GiB/60094msec)
write: IOPS=72.8k, BW=9096MiB/s (9538MB/s)(534GiB/60094msec); 0 zone resets

Running this test, CPU usage was approximately 60-90% and would briefly max out.

There are a couple of possible bottlenecks here when compared to the R7525 I tested a few weeks ago. One would clearly be the CPUs: 32 cores of 2nd gen Xeon Scalable clocked at 2.1GHz with 2400MHz RAM vs 64 cores of 3rd gen Epyc clocked at 2.95GHz and 3200MHz RAM. The second might be that the R740xd requires a pair of PCIe switch cards sending only 32 lanes to the drive backplane, out of only 96 available. The R7525 doesn't need that, given that it has 256 lanes available and can send a full 96 lanes without breaking a sweat.
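For anyone who wants to check whether a backplane or switch card is constraining their drives, the negotiated PCIe link speed and width can be read per device with lspci. A rough sketch (the address is a placeholder taken from a plain lspci listing, and -vv usually needs root to show the capabilities):

# LnkCap = what the drive is capable of, LnkSta = what it actually negotiated.
# 01:00.0 is a placeholder for the NVMe drive's PCIe address.
lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'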
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
I was able to get some relatively idle time to run this. As I said before, the system listed in my previous post has been put into service and isn't available for testing. Instead, I ran this on an older system that's been in service for just over 3 years. That system's specs are as follows:
  • Dell PowerEdge R740xd
    • 2 x Intel Xeon Silver 4216 CPUs (32 cores, 64 threads total @ 2.10GHz)
    • 386GB DDR4 Registered ECC RAM @ 2400MHz
    • 24 x 15.36TB Micron 9300 Pro NVMe U.2 drives (PCIe Gen 3)
      • zpool config: 3 x 8 (RAIDZ2)
        • ~240TiB
    • 2 x Intel XXV710 dual-port 25GbE NICs
    • TrueNAS Core 12.0-U8.1
Bearing in mind that this system was seeing active though minor use during this test, I received the following results:

READ: bw=9104MiB/s (9547MB/s), 9104MiB/s-9104MiB/s (9547MB/s-9547MB/s), io=534GiB (574GB), run=60094-60094msec
WRITE: bw=9096MiB/s (9538MB/s), 9096MiB/s-9096MiB/s (9538MB/s-9538MB/s), io=534GiB (573GB), run=60094-60094msec
read: IOPS=72.8k, BW=9104MiB/s (9547MB/s)(534GiB/60094msec)
write: IOPS=72.8k, BW=9096MiB/s (9538MB/s)(534GiB/60094msec); 0 zone resets

Running this test, CPU usage was approximately 60-90% and would briefly max out.

There are a couple of possible bottlenecks here when compared to the R7525 I tested a few weeks ago. One would clearly be the CPUs: 32 cores of 2nd gen Xeon Scalable clocked at 2.1GHz with 2400MHz RAM vs 64 cores of 3rd gen Epyc clocked at 2.95GHz and 3200MHz RAM. The second might be that the R740xd requires a pair of PCIe switch cards sending only 32 lanes to the drive backplane, out of only 96 available. The R7525 doesn't need that, given that it has 256 lanes available and can send a full 96 lanes without breaking a sweat.
That’s a pretty solid result. Thanks!

What were your settings on the dataset as far as compression? LZ4?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Someone asked me in private conversation about using multiple NVMe drives on an add-on card like the Supermicro AOC-SLG3-2M and the implications for passthrough in a hypervisor/TrueNAS hybrid context, e.g. ESXi. Confusion arose about the general recommendation never to pass individual drives but only HBAs as PCIe devices.

With an NVMe SSD device, the "individual drive" IS a PCIe device. The admonition against passing individual drives applies to drives behind a controller, because what you need in that situation is to pass the CONTROLLER through to ZFS.

Some late model PCIe devices have "virtual functions" and you can pass virtual portions of the controller through to a VM (especially stuff like ethernet controllers), but LSI HBAs predate this technology, so you can either pass a complete HBA controller or not pass it. There is no halfway passing mode or virtual function support. Therefore you must pass the entire HBA.
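For the curious, whether a given device exposes virtual functions is visible in its PCIe capabilities. A rough check (the address is a placeholder, and -vv usually needs root):

# A device with virtual function support advertises SR-IOV in its PCIe config
# space; an HBA without it will print nothing here.
# 03:00.0 is a placeholder for the device's address from lspci.
lspci -vv -s 03:00.0 | grep -i 'single root'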

The person was concerned because the mentioned card itself does not show up as a separate device. It's a mostly passive bus adaptor (modulo some resistors and capacitors for signal stability and a handful of active components).

Correct. You are just running traces from the PCIe bus to the M.2 socket (or whatever); there is nothing on the board to identify it to the host as any particular thing. This is a design flaw in bifurcation because there is no good way to easily support bifurcation auto-config.

Theoretically that makes each individual SSD more expensive

It doesn't, really. We started with the basic controllers being part of the drive back in the days of SCSI and IDE, and rapidly evolved that through PATA and SATA drives to include relatively sophisticated onboard controllers. These controllers needed to both manage their physical drives and also speak a complex artificial protocol back to the host adapter, handle cache, etc. By simplifying the SAS/PATA/SATA protocol layer into NVMe, you entirely eliminate whatever controller was providing those protocols, and reduce the complexity of the onboard controller on the SSD too.

the SSD controller and the PCIe interface have long been merged into a single chip

Yes, and if you look at the evolution of SSDs, the whole thing is pretty trivial, fitting onto an M.2 gumstick or occupying a tiny amount of space in a 2.5" shell:

[Photo attachment: IMGP9045.jpg, an opened drive showing just the controller, DRAM, and flash chips]


That's an 870 Evo 4TB on the right hand side. Controller, DRAM, flash chips. That's all. You get the same thing going on with the gumsticks. The complexity of the controller is not a big deal (even though in this case it is SATA). The 'Metis' MKX controller and the NVMe controllers keep shrinking with every generation... and getting cheaper too.
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
That’s a pretty solid result. Thanks!

What were your settings on the dataset as far as compression? LZ4?
Correct, just lz4. This dataset houses several Oracle databases, including a 240TB database. We get nearly 4:1 compression with just the default settings.
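For anyone wanting to compare their own pools, ZFS reports the achieved ratio per dataset; for instance (the pool and dataset names here are placeholders):

# Shows the compression algorithm in use and the achieved compression ratio.
zfs get compression,compressratio tank/oracle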
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Correct, just lz4. This dataset houses several Oracle databases, including a 240TB database. We get nearly 4:1 compression with just the default settings.
That's pretty insane.
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
We recently conducted a test with 24 x 15.36TB Micron 9400 Pro drives in a stupid configuration that we would never deploy, just to see what they were capable of. In a Dell PowerEdge R7525 with a pair of Epyc 75F3 CPUs, 1TB of RAM, and a 24-vdev ZFS pool with one drive per vdev, we pulled 9.3 million IOPS and 38GB/sec of sequential read IO. I suspect they would have been able to do more, but the CPUs were likely our limiting factor, either in raw processing power or just bandwidth.


The speed of these drives is getting insane.

Awesome! I have the uber-poor-man's version of what you have:
the R7415 (7351P with 256GB), tested with 4x and 8x NVMe drives (each of which gets at least 1GB/s read/write).
I've not yet checked fio performance (maybe tomorrow) ... but shouldn't I get more than 850MB/s over SMB on an SFP28 network?
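I suppose one way to narrow that down would be to take the pool out of the picture and measure the raw link first, then compare against a local fio run on the pool. A rough sketch, assuming iperf3 is installed on both ends (the IP is a placeholder):

# On the TrueNAS box:
iperf3 -s
# On the client, several parallel streams to load up the 25GbE link:
iperf3 -c 192.168.1.10 -P 4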
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
It doesn't, really. We started with the basic controllers being part of the drive back in the days of SCSI and IDE, and rapidly evolved that through PATA and SATA drives to include relatively sophisticated onboard controllers. These controllers needed to both manage their physical drives and also speak a complex artificial protocol back to the host adapter, handle cache, etc. By simplifying the SAS/PATA/SATA protocol layer into NVMe, you entirely eliminate whatever controller was providing those protocols, and reduce the complexity of the onboard controller on the SSD too.

I'm going to disagree here. These are still remarkably complex devices. The complexity is different in that there's not a lot of mechanical "Physics" calculation for head seeking / parking, rotational latency, etc... There's still a 458-page base specification for NVMe, on top of the PCIe specifications, and then there are three command set specifications, three transport specifications, a boot specification, and a couple of others... There's still a lot going on under the hood on these little widgets, and PCIe itself is getting quite complicated as well.

NVMe Base Specification 2.0

The NVMe spec collection...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'm going to disagree here. These are still remarkably complex devices. The complexity is different in that there's not a lot of mechanical "Physics" calculation for head seeking / parking, rotational latency, etc... There's still a 458-page base specification for NVMe, on top of the PCIe specifications, and then there are three command set specifications, three transport specifications, a boot specification, and a couple of others... There's still a lot going on under the hood on these little widgets, and PCIe itself is getting quite complicated as well.

That conveniently misses the mark.

If you look at a modern SAS system, you have CPU -> PCIe -> SAS HBA -> SAS bus -> SAS controller -> Flash

If you look at an NVMe system, you have CPU -> PCIe -> NVMe controller -> Flash

One of those has fewer items in the list, and if we want to be reasonable, we can eliminate "Flash" at the end and "CPU" at the beginning.

We then notice that both have a SAS/NVMe controller which speaks some protocol; while not strictly identical, we could deem these sufficiently comparable and call them roughly equal, therefore we can eliminate them as well.

So what we're left with is that SAS has two extra elements: the SAS HBA controller and the SAS bus.

Therefore I think it is fair to say that my comment

you entirely eliminate whatever controller was providing those protocols,

is on-target. It is great to get rid of unnecessary bottlenecks in the chain of things that makes these gizmos work.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
So what we're left with is that SAS has two extra elements: the SAS HBA controller and the SAS bus.

Therefore I think it is fair to say that my comment



is on-target. It is great to get rid of unnecessary bottlenecks in the chain of things that makes these gizmos work.

Provided you can pack all your devices inside the chassis... Which for PCIe is at most what? A 1.5 meter sphere around the CPU socket? SAS4 will get you 5+ meters and 1400+ devices. And you can't eliminate "Flash" at the end of the NVMe chain. The devil is in the details with the various forms of "flash"... It's slowly coming together, but... Until then, YMMV.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
The devil is in the details with the various forms of "flash"...
With Optane going the way of the dinosaur, it's really just NAND at the end of the day in this market. Sure, there's QLC, TLC, MLC, SLC and whatever other crazy inventions to store exponentially higher quantities of voltage states. But it's all NAND. Sure there are some variations in controllers, and embedded DRAM, SLC caching, etc. But NAND is only one form of flash, and really, it's the only "flash" in enterprise storage now.
Which for PCIe is at most what? A 1.5 meter sphere around the CPU socket? SAS4 will get you 5+ meters and 1400+ devices.

PCIe retimers, PLX switch chips, PCIe "host bus adapters", CXL, etc., all exist to solve the distance issue. PCIe over a fabric is here; you can be several racks away, and in the future probably several miles away. So, I'm not understanding the argument here?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So, I'm not understanding the argument here?

Me either. My point is that stuff like a SAS HBA represents a significant additional hop for the data; if you can eliminate a controller, that's great. The HBA controller represents a little PowerPC CPU core that is busily servicing all the ports. This is not going to be fast.

The point here is that there are a few ways to handle data. Let's consider ethernet switches.

Basic switches do what is known as store-and-forward. They receive a frame, look at it, determine what port it is destined for, and then add it to the queue for that port. This adds a small delay to the packet.

But more expensive switches have the ability to do something known as cut-through forwarding, where the switch silicon starts analyzing the packet AS IT ARRIVES on the port, without waiting for the complete frame to arrive. The switch makes a decision about where to send the packet and then (assuming the port is free) begins cramming the packet out the egress port, so the data is literally exiting the switch while still being received. This adds a much smaller delay to the packet but you lose certain error recovery features and also it's a problem if the outgoing port is busy.

But the best option is to get rid of the switch; straight through wins every time. If you do not have an engineering need for the intermediate hop, get rid of it!

So in many ways SAS is like a store-and-forward switch, it is hugely reliable, well-understood, cheap, scalable, and very common. But you end up going through that HBA which adds latency.

PCIe architectures only scale so far and there have been innovations such as PLX switches, which I consider to be akin to cut-through switches. They add some delay but they are extremely effective at increasing the capability of your overall system. The PLX switch is relatively low latency (compared to an HBA).

But of course you can just get rid of all of it. NVMe direct to CPU is really as good as it gets. This is a real winner. If you don't need a PLX or HBA in there, why have it!
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
EXCELLENT you two.
The arguments helped me understand a couple of technologies better than a few LONG reads previously digested.

In short, please don't ever find common ground again! :smile:
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
To be clear, he knows infinitely more than me. I just poke the Grinch and hope his heart ends up growing twice its size.

We aim to please.
 