100GbE 24xNVMe High-Performance Block Storage (SAN) on TrueNAS

aglover1221

Cadet
Joined
Jan 22, 2024
Messages
1
Hi everyone, a few weeks ago I had a dream to buy a whole bunch of used hardware on eBay and build a high-performance SAN for our project on Vast.ai. Reading around the forums, it looks like there are a ton of unsuccessful attempts at similar endeavors, so I would like to create a post/blueprint that demystifies this for good. A lot of individuals on this forum have clearly put months, if not years, of effort into fine-tuning TrueNAS performance, and I'm hoping they are willing to contribute that knowledge here. We also have a budget for consultants if we need to engage iXsystems to make this fly.

At a high level here is what I want to do:
Let's build a PCIe Gen 4 SAN array with 4x100GbE interfaces and provision storage to 20+ hosts. We currently run NVMe scratch disks in all of our hosts, and I would like to move to a shared storage model where capacity, performance, and redundancy are pooled across all the servers. While we do this, let's document the setup and leave it here as a blueprint/starting point for anyone else who likes making old computers go fast.

The Hardware:
6x Dell Z9100-ON switches running SONiC (32x 100GbE ports each)
-One pair as the spine and two pairs as ToR/leaf switches
-10GbE WAN
-4x100GbE Trunk between each switch
-4x100GbE Uplink to Core
-I'd like to connect the SAN to the core so that all hosts are the same number of hops away; if that is unreasonably complicated, or it makes more sense to have a SAN on each leaf, we will build another SAN

Dell R7525 with 24x 3.84TB Samsung PM9A3 NVMe SSDs
-2x Mellanox ConnectX-5 dual-port NICs in Ethernet mode
-BOSS card for OS
-Either 2x 64-core or 2x 32-core CPUs at higher frequency (we can try both)
-1-2TB DDR4-3200

20xClients
-8xGPUs
-32-64 Cores
-512GB RAM
-Broadcom N1100G single-port 100GbE NIC
-All running Ubuntu 20.04

I found a pretty sweet deal on CWDM4 transceivers and OS2 cables, so I am hoping that doesn't add any complexity or performance issues compared to short-range modules on OM4.

TODO:
Switch Configuration
-We have a SONiC consultant starting in a week or two
-Does anyone have any specific recommendations on switch configurations for this?
100G Link Aggregation?
-QSFP28 is 4x25G links, do we just LACP all of these together to get 100G?
iSCSI or another protocol?
-Is NVMe over TCP an option with TrueNAS? (it seems like some folks have gotten this running)
Pool Configuration
-My preference would be to have all 24 drives in a RAIDZ1 or RAIDZ2
-Does it make sense to put all 24 drives in one vdev?
Tunables
-If anyone has landed on config options with similar hardware, I'd love to hear about your results here

Other posts:
https://www.truenas.com/community/r...ng-to-maximize-your-10g-25g-40g-networks.207/
https://www.truenas.com/community/t...-not-keeping-up-with-fast-nvme-drives.111940/
https://www.truenas.com/community/threads/24-nvme-ssds-slow-performance.113062/
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
100G Link Aggregation?
-QSFP28 is 4x25G links, do we just LACP all of these together to get 100G?
No, the four 25G lanes are handled transparently, just like 40GbE; no LACP is needed. If anything, make sure both sides are configured as a single 100GbE port rather than broken out into 4x25GbE.
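
On the Z9100/SONiC side that is just a per-port speed/breakout setting rather than a LAG. A minimal sketch, assuming a default config_db and the placeholder port name Ethernet0 (exact syntax varies by SONiC release):

  # run on each switch; Ethernet0 is a placeholder port name
  sudo config interface speed Ethernet0 100000
  sudo config save -y
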
-Is NVMe over TCP an option with TrueNAS? (it seems like some folks have gotten this running)
Mostly no, but it's possible that someone hacked something together.
-My preference would be to have all 24 drives in a RAIDZ1 or RAIDZ2
-Does it make sense to put all 24 drives in one vdev?
I suspect that would suck, performance-wise. You'd be throwing away much of the IOPS potential of the SSDs, and you'd need pretty large blocks for this to not be a complete disaster (need to be able to divide each block into 20ish chunks). Mirrors are almost guaranteed to be the way to go.
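
If you do go mirrors, a rough sketch of that layout from the shell (device names are placeholders; in practice you'd build this through the TrueNAS pool wizard):

  # 12 x 2-way mirrors from 24 NVMe devices (nvd0..nvd23 are placeholder names)
  zpool create tank \
    mirror nvd0 nvd1 mirror nvd2 nvd3 mirror nvd4 nvd5 mirror nvd6 nvd7 \
    mirror nvd8 nvd9 mirror nvd10 nvd11 mirror nvd12 nvd13 mirror nvd14 nvd15 \
    mirror nvd16 nvd17 mirror nvd18 nvd19 mirror nvd20 nvd21 mirror nvd22 nvd23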
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
If your aim is super-high performance, you can't just gloss over the issues of RAIDZ for block storage.

SSD/NVMe or not, the following post covers why RAIDZ isn't good for block storage:
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
So we have actually done this. We have a deployed system with the following specs:

Dell PowerEdge R7525 (at the time we put this together, Dell hadn't certified the R7625 for more than 16 NVMe drives)
  • 2 x AMD EPYC 7H12 64-core CPUs @ 2.6GHz, 128 cores total
  • 1TB 3200MT/s DDR4 RDIMMs
  • 24 x 30.72TB Micron 9400 Pro U.3 NVMe SSDs
    • 12 x 2-way mirrored vdevs; sync=standard, atime=off, lz4 compression, deduplication=off, checksum=on, recordsize=16K, autotrim enabled (see the sketch after this list)
  • 1 x Intel E810-XXV-2 dual-port 25GbE PCIe NIC
  • 2 x Chelsio T62100-LP-CR dual-port 100GbE NICs
    • We only get a theoretical 150Gb/sec out of each card. Each port will run at 100GbE individually, but with both ports active there's only about 150Gb/sec of throughput available, since these are PCIe Gen 3 cards.
  • TrueNAS Core 13.0-U6.1
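
For reference, those dataset settings translate roughly to the following (a sketch only; tank/oracle is a placeholder name, and on TrueNAS they were set through the UI rather than the shell):

  zpool set autotrim=on tank
  zfs set sync=standard tank/oracle
  zfs set atime=off tank/oracle
  zfs set compression=lz4 tank/oracle
  zfs set dedup=off tank/oracle
  zfs set checksum=on tank/oracle
  zfs set recordsize=16K tank/oracle
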
The 4 x 100GbE ports are directly connected to an Oracle database server. Crucially, we don't use any kind of link aggregation. In our testing, we ran into pretty serious overhead issues with aggregation above 40Gb/sec, and we found far better performance over NFS by giving each 100GbE link its own IP address.
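
From the client side that just means each server port lives on its own subnet and the mounts are spread across the four addresses rather than bonded. A hedged sketch (IPs, export path, and mount points are placeholders, not our actual config):

  mount -t nfs -o vers=4.1,rsize=1048576,wsize=1048576 10.0.101.10:/mnt/tank/ora /mnt/ora1
  mount -t nfs -o vers=4.1,rsize=1048576,wsize=1048576 10.0.102.10:/mnt/tank/ora /mnt/ora2
  mount -t nfs -o vers=4.1,rsize=1048576,wsize=1048576 10.0.103.10:/mnt/tank/ora /mnt/ora3
  mount -t nfs -o vers=4.1,rsize=1048576,wsize=1048576 10.0.104.10:/mnt/tank/ora /mnt/ora4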

I can look back through my notes. We purchased time from a TrueNAS consultant we've used in the past to support our non-iXsystems servers. There was a significant amount of tweaking and tuning needed beyond our general TrueNAS knowledge. Out-of-the-box performance was fairly disappointing considering the hardware present, and without the tweaking we wouldn't have met our performance goals.

I can't remember the boot parameter that needed tweaking; it had to do with NVMe queues or something like that. Without the change to this setting, TrueNAS Core would only detect a small number of the 24 drives. It's in another thread on these forums; I'll see if I can dig it up.

The zpool config for this system is 12 mirrored vdevs for a capacity of approximately 330TB. Under synthetic testing we were able to pull 40GB/sec reads using fio locally on the server with a 128K block size; 4K IOPS were approximately 1.2 million. Those numbers dropped significantly over NFS, using the 4 x 100GbE interfaces, but the system will still sustain 10GB/sec with a 50% read/write mix at 16K block sizes for the Oracle database server.
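
If anyone wants to reproduce that kind of local test, it was plain fio; something along these lines (illustrative parameters, not our exact job files):

  # 128K sequential read throughput (run locally on the TrueNAS box)
  fio --name=seqread --directory=/mnt/tank/fio --rw=read --bs=128k \
      --ioengine=posixaio --iodepth=32 --numjobs=16 --size=20G \
      --runtime=60 --time_based --group_reporting
  # 4K random read IOPS
  fio --name=randread --directory=/mnt/tank/fio --rw=randread --bs=4k \
      --ioengine=posixaio --iodepth=32 --numjobs=32 --size=20G \
      --runtime=60 --time_based --group_reporting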

In terms of lessons learned, the 2 x 64-core CPUs were likely overkill for the intended workload. We went this high because, at the time, there wasn't a lot of information about how much CPU power TrueNAS would need to operate at this performance level. Prior to purchasing and deploying this system, we tested an Intel-based 64-core system with 24 x 15.36TB PCIe Gen 3 NVMe drives. We were able to EASILY overwhelm all 64 cores with synthetic tests, so it was reasonable to assume we'd want considerably more CPU power, given that the system we deployed used newer, faster PCIe Gen 4 drives with at least 2x the performance of the Gen 3 drives we tested with.

While we were able to generate workloads that would overwhelm all 128 cores of the deployed system, they weren't realistic at all. The actual Oracle database workloads come nowhere near maxing out these CPUs. If I had to do it again, I would have gone with a pair of faster 32-core CPUs.

Now, having said that, I'll add a caveat. We're using mirrored vdevs, which have a lower CPU requirement than RAIDZ due to the lack of parity calculations. We did test RAIDZ1 and RAIDZ2 with different vdev sizes; in the 2 x 12 RAIDZ2 pool config in particular, we were able to easily overwhelm all 128 cores, and even the smaller vdev sizes we tested could still max out the system. So, depending on your requirements, what you want to do, and what your expectations are, it's entirely possible 128 cores may not be enough.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I can't remember the boot parameter that needed tweaking; it had to do with NVMe queues or something like that. Without the change to this setting, TrueNAS Core would only detect a small number of the 24 drives. It's in another thread on these forums; I'll see if I can dig it up.
That'll most likely be hw.nvme.num_io_queues=64, as you referenced in this thread (I've linked the reply from @bsdimp, which goes into more detail about interrupt exhaustion as well).
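
For anyone else who runs into this: on Core that's a loader tunable, so it can be added under System > Tunables with type "loader" (or in /boot/loader.conf on plain FreeBSD) and takes effect after a reboot:

  # caps the NVMe I/O queue pairs per controller to avoid MSI-X/interrupt exhaustion
  hw.nvme.num_io_queues="64"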

 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
That'll most likely be hw.nvme.num_io_queues=64, as you referenced in this thread (I've linked the reply from @bsdimp, which goes into more detail about interrupt exhaustion as well).

It dawned on me that we have other TrueNAS Core systems with 24 NVMe drives in them. We've had one based on a PowerEdge R740xd that's been deployed for 4 years. We had issues with that one, but they were mostly around how bleeding-edge NVMe support was in FreeBSD back then. The change to the NVMe IO queues was not required on any of the Intel-based servers but was required on the two AMD-based servers we use. Any idea why?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
It dawned on me that we have other TrueNAS Core systems with 24 NVMe drives in them. We've had one based on a PowerEdge R740xd that's been deployed for 4 years. We had issues with that one, but they were mostly around how bleeding-edge NVMe support was in FreeBSD back then. The change to the NVMe IO queues was not required on any of the Intel-based servers but was required on the two AMD-based servers we use. Any idea why?
At the risk of generalizing, I believe it's related to how the PCIe lanes are connected. Because the AMD platforms have enough lanes directly off the CPUs, the drives can connect there, and each device sets up its own interrupts and queues. The Intel platforms with 24x NVMe don't have enough lanes and have to rely on 48:16 PCIe MUX switches, which collect and fire interrupts themselves rather than each device doing it individually.
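
A quick way to sanity-check that on a Core box is to look at how many interrupt vectors each drive actually got, e.g. (a sketch; device names and output vary by platform):

  # per-device interrupt allocation on FreeBSD / TrueNAS Core
  vmstat -ia | grep nvme
  # MSI-X message count advertised by one of the drives (nvme0 as an example)
  pciconf -lvc nvme0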
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
At the risk of generalizing, I believe it's related to how the PCIe lanes are connected. Because the AMD platforms have enough lanes directly off the CPUs, the drives can connect there, and each device sets up its own interrupts and queues. The Intel platforms with 24x NVMe don't have enough lanes and have to rely on 48:16 PCIe MUX switches, which collect and fire interrupts themselves rather than each device doing it individually.
I wondered if the PCIe switches had something to do with it.
 