aglover1221
Hi everyone, a few weeks ago I had a dream: buy a whole bunch of used hardware on eBay and build a high-performance SAN for our project on Vast.ai. Reading around the forums, it looks like there have been a ton of unsuccessful attempts at similar endeavors, so I'd like this thread to become a post/blueprint that demystifies it for good. I can tell that a lot of people on this forum have put months, if not years, into fine-tuning TrueNAS performance, and I'm hoping they're willing to contribute that knowledge here. We also have a budget for consultants if we need to engage iX Systems to make this fly.
At a high level, here is what I want to do:
Build a PCIe Gen 4 SAN with 4x 100GbE interfaces and provision storage to 20+ hosts. We currently run local NVMe scratch disks in all of our hosts, and I'd like to move to a shared storage model where capacity, performance, and redundancy are shared across all the servers. While we do this, let's document the setup and leave it here as a blueprint/starting point for anyone else who likes making old computers go fast.
The Hardware:
6x Dell Z9100-ON switches running SONiC (32x 100GbE interfaces each)
- One pair as the spine and two pairs as ToR/leaf switches
- 10GbE WAN
- 4x 100GbE trunk between each switch
- 4x 100GbE uplink to the core
- I'd like to connect the SAN at the core so that every host sees the same number of hops; if that's unreasonably complicated, or if it makes more sense to have a SAN on each leaf, we'll build another SAN
Dell R7525 with 24x 3.84 TB Samsung PM9A3 NVMe SSDs
- 2x Mellanox dual-port ConnectX-5 NICs in Ethernet mode
- BOSS card for the OS
- Either 2x 64-core or 2x 32-core CPUs at higher frequency (we can try both)
- 1-2 TB of DDR4-3200
20x clients
- 8x GPUs each
- 32-64 cores
- 512 GB RAM
- Broadcom N1100G single-port 100GbE NIC
- All running Ubuntu 20.04
I found a pretty sweet deal on CWDM4 transceivers and OS2 cables, so I'm hoping that doesn't add any complexity or performance issues compared to short-range modules on OM4.
TODO:
Switch Configuration
- We have a SONiC consultant starting in a week or two
- Does anyone have specific recommendations on switch configuration for this?
100G Link Aggregation?
- QSFP28 carries 4x 25G lanes; do we just LACP these together to get 100G?
iSCSI or another protocol?
- Is NVMe over TCP an option with TrueNAS? It seems like some folks have gotten it running (rough client-side sketch after this list)
Pool Configuration
- My preference would be to have all 24 drives in a RAIDZ1 or RAIDZ2
- Does it make sense to put all 24 drives in one vdev? (the layouts I'm weighing are sketched after this list)
Tunables
- If anyone has landed on config options with similar hardware, I'd love to hear about your results here (the kind of host-side settings I mean are sketched after this list)
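For the NVMe over TCP question, this is roughly what I picture the client side looking like on the Ubuntu hosts, just as a sketch using the standard Linux nvme-cli tooling. The target address, port, and NQN below are made-up placeholders, not anything TrueNAS-specific:

  # install the NVMe CLI tooling on each client (address/NQN below are placeholders)
  sudo apt install nvme-cli
  # ask the target what subsystems it exports
  sudo nvme discover -t tcp -a 192.168.100.10 -s 4420
  # connect to one of the discovered subsystems
  sudo nvme connect -t tcp -a 192.168.100.10 -s 4420 -n nqn.2024-01.san.example:scratch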
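To make the pool question concrete, these are the layouts I'm weighing, written out as plain zpool commands purely for illustration (device names and pool name are placeholders; on TrueNAS the pool would actually be built through the UI):

  # option A: one wide 24-drive RAIDZ2 vdev - most capacity, but a single vdev's worth of IOPS
  zpool create tank raidz2 /dev/nvme{0..23}n1

  # option B: four 6-drive RAIDZ2 vdevs striped - less usable capacity, more vdevs to spread IOPS across
  zpool create tank \
    raidz2 /dev/nvme{0..5}n1 \
    raidz2 /dev/nvme{6..11}n1 \
    raidz2 /dev/nvme{12..17}n1 \
    raidz2 /dev/nvme{18..23}n1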
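And for the tunables item, this is the sort of host-side network tuning I expect we'll be experimenting with on the 100GbE clients; the values are starting-point guesses to benchmark, not recommendations:

  # larger socket buffers for 100GbE flows (values are guesses, to be benchmarked)
  sudo sysctl -w net.core.rmem_max=268435456
  sudo sysctl -w net.core.wmem_max=268435456
  sudo sysctl -w net.ipv4.tcp_rmem="4096 131072 268435456"
  sudo sysctl -w net.ipv4.tcp_wmem="4096 131072 268435456"
  # jumbo frames on the 100G interface (interface name is a placeholder)
  sudo ip link set dev enp65s0f0 mtu 9000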
Other posts:
https://www.truenas.com/community/r...ng-to-maximize-your-10g-25g-40g-networks.207/
https://www.truenas.com/community/t...-not-keeping-up-with-fast-nvme-drives.111940/
https://www.truenas.com/community/threads/24-nvme-ssds-slow-performance.113062/