100GbE 24xNVMe High-Performance Block Storage (SAN) on TrueNAS

aglover1221

Cadet
Joined
Jan 22, 2024
Messages
1
Hi everyone, a few weeks ago I had a dream to buy a whole bunch of used hardware on eBay and build a high-performance SAN for our project on Vast.ai. Reading around the forums, it looks like there are a ton of unsuccessful attempts at similar endeavors, so I would like to create a post/blueprint that demystifies this for good. A lot of individuals on this forum have clearly put months, if not years, of effort into fine-tuning TrueNAS performance, and I'm hoping they are willing to contribute that knowledge here. We also have a budget for consultants if we need to engage iXsystems to make this fly.

At a high level here is what I want to do:
Let's build a PCIe Gen 4 SAN array with 4x100GbE interfaces and provision storage to 20+ hosts. We currently run NVMe scratch disks in all of our hosts, and I would like to move to a shared storage model where capacity, performance, and redundancy are pooled across all the servers. While we do this, let's document the setup and leave it here as a blueprint/starting point for anyone else who likes making old computers go fast.

The Hardware:
6x Dell Z9100-ON switches running SONiC (32x 100GbE ports each)
-One pair as the spine and two pairs as ToR/leaf switches
-10GbE WAN
-4x100GbE Trunk between each switch
-4x100GbE Uplink to Core
-I'd like to connect the SAN to the core so that all hosts are the same number of hops away; if that is unreasonably complicated, or it makes more sense to have a SAN on each leaf, we will build another SAN

Dell R7525 with 24x 3.84TB Samsung PM9A3 NVMe SSDs
-2x Mellanox ConnectX-5 dual-port NICs in Ethernet mode
-BOSS card for OS
-Either 2x 64-core or 2x 32-core CPUs at higher frequency (we can try both)
-1-2TB DDR4-3200

20xClients
-8xGPUs
-32-64 Cores
-512GB RAM
-Broadcom N1100G single-port 100GbE NIC
-All running Ubuntu 20.04

I found a pretty sweet deal on CWDM4 transceivers and OS2 cables, so I am hoping that doesn't add any complexity or performance issues compared to short-range modules on OM4.

TODO:
Switch Configuration
-We have a SONiC consultant starting in a week or two
-Does anyone have any specific recommendations on switch configurations for this?
100G Link Aggregation?
-QSFP28 is 4x25G links, do we just LACP all of these together to get 100G?
iSCSI or another protocol?
-Is NVMe over TCP an option with TrueNAS? (it seems like some folks have gotten this running)
Pool Configuration
-My preference would be to have all 24 drives in a RAIDZ1 or RAIDZ2
-Does it make sense to put all 24 drives in one vdev?
Tunables
-If anyone has landed on config options with similar hardware, I'd love to hear about your results here

Other posts:
https://www.truenas.com/community/r...ng-to-maximize-your-10g-25g-40g-networks.207/
https://www.truenas.com/community/t...-not-keeping-up-with-fast-nvme-drives.111940/
https://www.truenas.com/community/threads/24-nvme-ssds-slow-performance.113062/
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
100G Link Aggregation?
-QSFP28 is 4x25G links, do we just LACP all of these together to get 100G?
No, the four 25G lanes are handled transparently, just like 40GbE; no LACP is needed. If anything, make sure both sides are configured as a single 100GbE port rather than broken out into 4x25GbE.
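
On the Z9100/SONiC side that is just a per-port speed/breakout setting rather than a LAG. A minimal sketch, assuming a default config_db and the placeholder port name Ethernet0 (exact syntax varies by SONiC release):

  # run on each switch; Ethernet0 is a placeholder port name
  sudo config interface speed Ethernet0 100000
  sudo config save -y
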
-Is NVMe over TCP an option with TrueNAS? (it seems like some folks have gotten this running)
Mostly no, but it's possible that someone hacked something together.
-My preference would be to have all 24 drives in a RAIDZ1 or RAIDZ2
-Does it make sense to put all 24 drives in one vdev?
I suspect that would suck, performance-wise. You'd be throwing away much of the IOPS potential of the SSDs, and you'd need pretty large blocks for this to not be a complete disaster (need to be able to divide each block into 20ish chunks). Mirrors are almost guaranteed to be the way to go.
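
If you do go mirrors, a rough sketch of that layout from the shell (device names are placeholders; in practice you'd build this through the TrueNAS pool wizard):

  # 12 x 2-way mirrors from 24 NVMe devices (nvd0..nvd23 are placeholder names)
  zpool create tank \
    mirror nvd0 nvd1 mirror nvd2 nvd3 mirror nvd4 nvd5 mirror nvd6 nvd7 \
    mirror nvd8 nvd9 mirror nvd10 nvd11 mirror nvd12 nvd13 mirror nvd14 nvd15 \
    mirror nvd16 nvd17 mirror nvd18 nvd19 mirror nvd20 nvd21 mirror nvd22 nvd23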
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
If your aim is super-high performance, you can't just gloss over the issues of RAIDZ for block storage.

SSD/NVMe or not, the following post covers why RAIDZ isn't good for block storage:
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
So we have actually done this. We have a deployed system with the following specs:

Dell PowerEdge R7525 (at the time we put this together, Dell hadn't certified the R7625 for more than 16 NVMe drives)
  • 2 x AMD EPYC 7H12 64-core CPUs @ 2.6GHz, 128 cores total
  • 1TB 3200MT/s DDR4 RDIMMs
  • 24 x 30.72TB Micron 9400 Pro U.3 NVMe SSDs
    • 12 x 2-way mirrored vdevs; sync=standard, atime=off, lz4 compression, deduplication=off, checksum=on, recordsize=16K, autotrim enabled (see the sketch after this list)
  • 1 x Intel E810-XXV-2 dual-port 25GbE PCIe NIC
  • 2 x Chelsio T62100-LP-CR dual-port 100GbE NICs
    • We only get a theoretical 150Gb/sec out of each card. Each port will run at 100GbE individually, but with both ports active there's only about 150Gb/sec of throughput available, since these are PCIe Gen 3 cards.
  • TrueNAS Core 13.0-U6.1
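
For reference, those dataset settings translate roughly to the following (a sketch only; tank/oracle is a placeholder name, and on TrueNAS they were set through the UI rather than the shell):

  zpool set autotrim=on tank
  zfs set sync=standard tank/oracle
  zfs set atime=off tank/oracle
  zfs set compression=lz4 tank/oracle
  zfs set dedup=off tank/oracle
  zfs set checksum=on tank/oracle
  zfs set recordsize=16K tank/oracle
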
The 4 x 100GbE ports are directly connected to an Oracle database server. Crucially, we don't use any kind of link aggregation. In our testing, we ran into pretty serious overhead issues with aggregation above 40Gb/sec, and we found far better performance over NFS by giving each 100GbE link its own IP address.
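
From the client side that just means each server port lives on its own subnet and the mounts are spread across the four addresses rather than bonded. A hedged sketch (IPs, export path, and mount points are placeholders, not our actual config):

  mount -t nfs -o vers=4.1,rsize=1048576,wsize=1048576 10.0.101.10:/mnt/tank/ora /mnt/ora1
  mount -t nfs -o vers=4.1,rsize=1048576,wsize=1048576 10.0.102.10:/mnt/tank/ora /mnt/ora2
  mount -t nfs -o vers=4.1,rsize=1048576,wsize=1048576 10.0.103.10:/mnt/tank/ora /mnt/ora3
  mount -t nfs -o vers=4.1,rsize=1048576,wsize=1048576 10.0.104.10:/mnt/tank/ora /mnt/ora4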

I can look back through my notes. We purchased time from a TrueNAS consultant we've used in the past to support our non-iXsystems servers. There was a significant amount of tweaking and tuning needed beyond our general TrueNAS knowledge. Out-of-the-box performance was fairly disappointing considering the hardware present, and without the tweaking we wouldn't have met our performance goals.

I can't remember the boot parameter that needed tweaking; it had to do with NVMe queues or something like that. Without the change to this setting, TrueNAS Core would only detect a small number of the 24 drives. It's in another thread on these forums; I'll see if I can dig it up.

The zpool config for this system is 12 mirrored vdevs for a capacity of approximately 330TB. Under synthetic testing we were able to pull 40GB/sec reads using fio locally on the server with a 128K block size; 4K IOPS were approximately 1.2 million. Those numbers dropped significantly over NFS, using the 4 x 100GbE interfaces, but the system will still sustain 10GB/sec with a 50% read/write mix at 16K block sizes for the Oracle database server.
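
If anyone wants to reproduce that kind of local test, it was plain fio; something along these lines (illustrative parameters, not our exact job files):

  # 128K sequential read throughput (run locally on the TrueNAS box)
  fio --name=seqread --directory=/mnt/tank/fio --rw=read --bs=128k \
      --ioengine=posixaio --iodepth=32 --numjobs=16 --size=20G \
      --runtime=60 --time_based --group_reporting
  # 4K random read IOPS
  fio --name=randread --directory=/mnt/tank/fio --rw=randread --bs=4k \
      --ioengine=posixaio --iodepth=32 --numjobs=32 --size=20G \
      --runtime=60 --time_based --group_reporting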

In terms of lessons learned, the 2 x 64-core CPUs were likely overkill for the intended workload. We went this high because, at the time, there wasn't a lot of information about how much CPU power TrueNAS would need to operate at this performance level. Prior to purchasing and deploying this system, we tested an Intel-based 64-core system with 24 x 15.36TB PCIe Gen 3 NVMe drives. We were able to EASILY overwhelm all 64 cores with synthetic tests, so it was reasonable to assume we'd want considerably more CPU power, given that the system we deployed used newer, faster PCIe Gen 4 drives with at least 2x the performance of the Gen 3 drives we tested with.

While we were able to generate workloads that would overwhelm all 128 cores of the deployed system, they weren't realistic at all. The actual Oracle database workloads come nowhere near maxing out these CPUs. If I had to do it again, I would have gone with a pair of faster 32-core CPUs.

Now, having said that, I'll add a caveat. We're using mirrored vdevs, which have a lower CPU requirement than RAIDZ due to the lack of parity calculations. We did test RAIDZ1 and RAIDZ2 with different vdev sizes; in the 2 x 12 RAIDZ2 pool config in particular, we were able to easily overwhelm all 128 cores, and even the smaller vdev sizes we tested could still max out the system. So, depending on your requirements, what you want to do, and what your expectations are, it's entirely possible 128 cores may not be enough.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I can't remember the boot parameter that needed tweaking; it had to do with NVMe queues or something like that. Without the change to this setting, TrueNAS Core would only detect a small number of the 24 drives. It's in another thread on these forums; I'll see if I can dig it up.
That'll most likely be hw.nvme.num_io_queues=64, as you referenced in this thread (I've linked the reply from @bsdimp, which goes into more detail about interrupt exhaustion as well).
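
For anyone else who runs into this: on Core that's a loader tunable, so it can be added under System > Tunables with type "loader" (or in /boot/loader.conf on plain FreeBSD) and takes effect after a reboot:

  # caps the NVMe I/O queue pairs per controller to avoid MSI-X/interrupt exhaustion
  hw.nvme.num_io_queues="64"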

 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
That'll most likely be hw.nvme.num_io_queues=64, as you referenced in this thread (I've linked the reply from @bsdimp, which goes into more detail about interrupt exhaustion as well).

It dawned on me that we have other TrueNAS Core systems with 24 NVMe drives in them. We've had one based on a PowerEdge R740xd that's been deployed for 4 years. We had issues with that one, but they were mostly around how bleeding-edge NVMe support was in FreeBSD back then. The change to the NVMe IO queues was not required on any of the Intel-based servers but was required on the two AMD-based servers we use. Any idea why?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
It dawned on me that we have other TrueNAS Core systems with 24 NVMe drives in them. We've had one based on a PowerEdge R740xd that's been deployed for 4 years. We had issues with that one, but they were mostly around how bleeding-edge NVMe support was in FreeBSD back then. The change to the NVMe IO queues was not required on any of the Intel-based servers but was required on the two AMD-based servers we use. Any idea why?
At the risk of generalizing, I believe it's related to how the PCIe lanes are connected. Because the AMD platforms have enough lanes directly off the CPUs, the drives can connect there, and each device sets up its own interrupts and queues. The Intel platforms with 24x NVMe don't have enough lanes and have to rely on 48:16 PCIe MUX switches, which collect and fire interrupts themselves rather than each device doing it individually.
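
A quick way to sanity-check that on a Core box is to look at how many interrupt vectors each drive actually got, e.g. (a sketch; device names and output vary by platform):

  # per-device interrupt allocation on FreeBSD / TrueNAS Core
  vmstat -ia | grep nvme
  # MSI-X message count advertised by one of the drives (nvme0 as an example)
  pciconf -lvc nvme0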
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
At the risk of generalizing, I believe it's related to how the PCIe lanes are connected. Because the AMD platforms have enough lanes directly off the CPUs, the drives can connect there, and each device sets up its own interrupts and queues. The Intel platforms with 24x NVMe don't have enough lanes and have to rely on 48:16 PCIe MUX switches, which collect and fire interrupts themselves rather than each device doing it individually.
I wondered if the PCIe switches had something to do with it.
 