All NVMe pool questions

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
We are looking to replace one of our all-flash servers, which currently uses 2.5-inch SATA SSDs. It's just serving VM disks to our Proxmox servers via NFS. It's been great for many years, but it's a bit long in the tooth and I just got some funding opened up, so I'd like to take advantage of that and get us something that will last for the next 8-10 years like our last box.

I'm wondering what everyone thinks of doing an all-NVMe storage pool using something like the Supermicro SuperServer SSG-110P-NTR10.
https://www.supermicro.com/en/products/system/Storage/1U/SSG-110P-NTR10

This is the board used in that machine:

I’m thinking about filling it with ten Samsung 3.8TB PM9A3 drives.
I could do mirrors, but honestly, I think if I set this up as two RAIDZ2 vdevs it would still be more than able to saturate the 10-gig NICs, and the RAIDZ2 would give us more space and redundancy. So I'm leaning in that direction.
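Rough napkin math on the usable space, just a sketch: it ignores ZFS metadata/padding overhead and the usual keep-it-under-80%-full guidance, and uses the drive's nominal 3.84TB capacity.

```python
# Back-of-envelope usable capacity for ten 3.84TB drives.
# Ignores ZFS metadata/padding overhead and free-space headroom.
DRIVE_TB = 3.84
DRIVES = 10

# Five 2-way mirrors: half the raw space; each mirror survives one failure.
mirrors_usable = (DRIVES / 2) * DRIVE_TB

# Two 5-wide RAIDZ2 vdevs: 3 data + 2 parity each; each vdev survives two failures.
raidz2_usable = 2 * (5 - 2) * DRIVE_TB

print(f"5 x mirrors : ~{mirrors_usable:.1f} TB usable")
print(f"2 x RAIDZ2  : ~{raidz2_usable:.1f} TB usable")
```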

So I guess I have a few questions:

1) If I'm filling this up with NVMe drives, what kind of processor do I need? Does it need to be crazy fast to support all of those drives and the amount of IO they could generate? Or will the lowest supported proc for this board do just fine serving up stuff over NFS or iSCSI to my VM hosts?

2) Has anyone used this system / board yet?

3) Anyone have any advice or pitfalls to watch out for when running an all-NVMe setup?

4) In the past we've always done external JBODs connected to a "head" server using an LSI HBA. But this has the NVMe ports on the board itself, so we won't be doing the JBOD setup this time. Any issues not using an HBA and just using the ports on the motherboard?

Honestly, I'd love to do another JBOD / head-unit setup like we've done in the past, but with U.2 or U.3 drives. But I'm not seeing any Supermicro shelves that seem to support that kind of setup, and I thought I read some things here in the forums that said to stay away from the Broadcom Tri-Mode HBAs, which to my knowledge are the only HBAs that would support an external JBOD with NVMe. Although I'd love to be proven wrong :)

Thanks!
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
The only comment I have is that if I were going to the expense of ten Samsung 3.8TB NVMe drives, I would not cheap out on the CPU.
As for the vdevs: you get 98K read IOPS and 30K write IOPS per vdev, and with twin vdevs you would (simplistically, in theory) get twice that.
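Napkin math on what that looks like against a 10Gb link (a sketch only; the 4K block size is just an assumption for small random IO):

```python
# Naive scaling of the quoted per-vdev figures; real results depend on
# queue depth, record size, and sync settings.
read_iops, write_iops, vdevs = 98_000, 30_000, 2
block = 4096  # bytes, assumed small-block random IO

total_read = read_iops * vdevs
print(f"~{total_read:,} theoretical read IOPS")
print(f"At 4K that's ~{total_read * block * 8 / 1e9:.1f} Gbit/s of random reads alone")
```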

I think you are at the performance level of suck it and see. Either way it should be damn quick.

Question - would you design in a SLOG or not?
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
The only comment I have is that if I were going to the expense of ten Samsung 3.8TB NVMe drives, I would not cheap out on the CPU.
I'm fine not cheaping out on the CPU. But what's the point of spending a bunch of money if it's going to sit at like 5% utilization while flooding a 10-gig link? I seriously have zero idea how fast a CPU I'll need with these. That's why I'm looking for guidance :)

Question - would you design in a SLOG or not?
We didn't seem to need a SLOG or L2ARC with the JBOD that had 24 Samsung 840 Pro SSDs, so I'm not sure we would need them with this setup either. But maybe we do?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
For NFS and iSCSI, VMware by default waits for the write to be committed to the pool (which is quite fast, I accept).
With sync=off (sync=disabled) it's as fast as it possibly can be, as TN will lie about the write having been committed.
With sync=always AND a decent SLOG, the writes go to the SLOG (which holds the ZIL), TN says "committed", and then eventually the writes get committed to the pool. If TN reboots / crashes etc., then when it comes back up it replays the SLOG for any uncommitted writes. This is slower than sync=off, but faster than sync=always without a SLOG.

Part of the issue is write amplification. Without a SLOG, the ZIL sits on the pool, so data gets written twice on the pool: once to the ZIL and then to the main pool.

I guess the SLOG would need to be very fast in order to compare to the pool. I suspect testing would be required. Test with sync=always and sync=off and see if there is a difference, then compare to, say, an Optane or one of the RAM-disk devices.
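Something like this would do for a first pass (a rough sketch only; the dataset name and fio parameters are placeholders, and ideally you would drive the workload from the Proxmox side over NFS/iSCSI, since that is the path that actually matters):

```python
import subprocess

DATASET = "tank/vmtest"                      # placeholder pool/dataset
TESTFILE = f"/mnt/{DATASET}/fio.dat"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# ZFS accepts sync=standard|always|disabled ("off" in forum shorthand).
for mode in ("standard", "always", "disabled"):
    run(["zfs", "set", f"sync={mode}", DATASET])
    run(["fio", "--name", f"sync-{mode}", "--filename", TESTFILE,
         "--size", "4G", "--rw", "randwrite", "--bs", "4k",
         "--ioengine", "posixaio", "--iodepth", "16", "--numjobs", "4",
         "--runtime", "60", "--time_based", "--group_reporting"])

run(["zfs", "set", "sync=standard", DATASET])  # restore the default
```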
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's great to see you back and that things have worked out.

Pure iSCSI or NFS service does not require ungodly amounts of CPU. The E5-1650v3 filer here with 24x HDD in mirrors puts out about 5Gbps with 95% CPU idle. I suspect a more modern CPU would not have much trouble with an all-flash array at 10Gbps speeds.
 

blanchet

Guru
Joined
Apr 17, 2018
Messages
516
@jamiejunk Did you succeed in building your all-NVMe storage server with the 1U Supermicro SuperServer SSG-110P-NTR10?
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
@jamiejunk Did you succeed in building your all-NVMe storage server with the 1U Supermicro SuperServer SSG-110P-NTR10?
No, I don't feel like I got enough real feedback on it. I'd still like to, but between that and the chip shortage, it's ended up on the back burner.
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
We run a PowerEdge R740xd with what will soon be 24 x Micron 9300 Pro 15.36TB U.2 NVMe drives. This TrueNAS has been in place, in one form or another, for well over two years. Based on our experience I would suggest the following, once you do decide to move this project forward.
  1. Make sure you are running the latest version of TrueNAS. Full NVMe support is relatively new, and hot-add/hot-remove support for NVMe is even newer. When we first set this system up, there was no hot-add/hot-remove, and the system wouldn't support more than 12 NVMe drives. Successive TrueNAS updates have mostly remedied this.
  2. Our vdev config is currently 2 x 8 RAIDZ2, soon to be 3 vdevs, because of the application this TrueNAS supports. In this config, with our testing, we discovered that the drives were easily capable of saturating the 10GbE NICs we purchased with the server. Once we realized that, we upgraded the server to 25GbE NICs. You may not have that option, but at the very least, be aware of the kind of throughput your system is capable of (rough numbers in the sketch after this list). Before being put back into service in its latest config, we tested this system at 8 mirrored vdevs. I'll dig up the screenshot, but if I remember right, it benchmarked at something like 203Gbit/sec read and 4.1 million random read IOPS.
  3. When we first looked at building this system, there wasn't a lot of information about running this kind of TrueNAS and we mostly had to guess. We built it with 2 x Xeon Silver 4216 CPUs, for a total of 32 cores @ 2.1GHz. Our thinking came from our experience with our larger 60- and 102-bay JBOD mechanical-drive TrueNAS systems, where we would see issues with the parity calculations from a failed drive during high-IO workloads. The parity calculations would consume most or all of the available CPU resources, effectively bringing the TrueNAS offline. Given the much greater IO that this system was built for, we fell on the conservative side and purchased far more CPU power than the system requires for normal operations. We have not had a drive failure yet, so I can't tell you if our choice was a good one or not.
  4. Make sure that your PCIe switch card is supported in TrueNAS. We initially had issues with the cards that Dell shipped, as they were not fully supported. Fortunately, a few weeks later, a TrueNAS update dropped that fixed this issue.
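The sketch mentioned in point 2, for a sense of scale (rough numbers only; the per-drive sequential read figure is an assumption for this class of U.2 NVMe drive, not a measurement from our system):

```python
# Rough NIC-saturation math with nominal link rates; the per-drive figure
# is an assumed ~3.5 GB/s sequential read, not a measured number.
GBIT = 1e9 / 8                 # bytes per second in one Gbit/s
drive_seq_read = 3.5e9         # assumed per-drive sequential read, B/s
drives = 16                    # 2 x 8 RAIDZ2

nic_10gbe = 10 * GBIT
nic_25gbe = 25 * GBIT

print(f"One drive alone : {drive_seq_read / nic_10gbe:.1f}x a 10GbE link")
print(f"Sixteen drives  : {drives * drive_seq_read / nic_25gbe:.1f}x a 25GbE link")
```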
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
We run a PowerEdge R740xd with what will soon be 24 x Micron 9300 Pro 15.36TB U.2 NVMe drives. This TrueNAS has been in place, in one form or another, for well over two years. Based on our experience I would suggest the following, once you do decide to move this project forward.
  1. Make sure you are running the latest version of TrueNAS. Full NVMe support is relatively new, and hot-add/hot-remove support for NVMe is even newer. When we first set this system up, there was no hot-add/hot-remove, and the system wouldn't support more than 12 NVMe drives. Successive TrueNAS updates have mostly remedied this.
  2. Our vdev config is currently 2 x 8 RAIDZ2, soon to be 3 vdevs, because of the application this TrueNAS supports. In this config, with our testing, we discovered that the drives were easily capable of saturating the 10GbE NICs we purchased with the server. Once we realized that, we upgraded the server to 25GbE NICs. You may not have that option, but at the very least, be aware of the kind of throughput your system is capable of. Before being put back into service in its latest config, we tested this system at 8 mirrored vdevs. I'll dig up the screenshot, but if I remember right, it benchmarked at something like 203Gbit/sec read and 4.1 million random read IOPS.
  3. When we first looked at building this system, there wasn't a lot of information about running this kind of TrueNAS and we mostly had to guess. We built it with 2 x Xeon Silver 4216 CPUs, for a total of 32 cores @ 2.1GHz. Our thinking came from our experience with our larger 60- and 102-bay JBOD mechanical-drive TrueNAS systems, where we would see issues with the parity calculations from a failed drive during high-IO workloads. The parity calculations would consume most or all of the available CPU resources, effectively bringing the TrueNAS offline. Given the much greater IO that this system was built for, we fell on the conservative side and purchased far more CPU power than the system requires for normal operations. We have not had a drive failure yet, so I can't tell you if our choice was a good one or not.
  4. Make sure that your PCIe switch card is supported in TrueNAS. We initially had issues with the cards that Dell shipped, as they were not fully supported. Fortunately, a few weeks later, a TrueNAS update dropped that fixed this issue.
This is SUPER helpful and is making me feel more confident about moving forward on this. We first built an all-SSD storage box for our VMs back in 2012 - 2014, I think. It's been great, but it's time to retire that machine.

It seems a bit silly to build another all-SSD machine in 2022. Like we should be moving forward, like with all NVMe. But it seems tough to find folks doing this with TrueNAS.

I've seen some things about PCIe switches online, but honestly, I'm not sure I understand them. From what I can tell it takes the place of an HBA for the drives to connect to, correct? The SuperServer SSG-110P-NTR10 has onboard support for ten U.2 NVMe drives, so I'm assuming that as long as I only go with ten drives I won't need any kind of PCIe switch, as long as the onboard one is supported by TrueNAS. Although I'm not even sure how to find out if it is.

More googling :)
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
It seems a bit silly to build another all-SSD machine in 2022. Like we should be moving forward, like with all NVMe. But it seems tough to find folks doing this with TrueNAS.
We have two hyperconverged hosts in production; see my signature.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
PCIe switches online, but honestly, I'm not sure I understand them. From what I can tell it takes the place of an HBA for the drives to connect to, correct?

A modern HBA is essentially a small computer with a bunch of SAS ports and firmware that proxies from the driver on the host system to the SAS ports. It is a protocol converter in a sense, in that it "converts" PCIe to SAS. But SAS devices are incompatible with PCIe; they do not speak PCIe, they speak SCSI. So what's really happening with an HBA is that a device driver on the system gets a request to "read block", the driver reaches out over PCIe to the HBA's CPU and says "I have a request to read block", the HBA CPU uses its firmware to talk SAS down a SAS lane and get that block, and then returns it over PCIe to the host driver. The HBA CPU is doing a translation or proxy or whatever you'd like to think of it as.

A PCIe switch is very much like a SAS expander or an Ethernet switch. It just speaks PCIe and allows the use of more lanes. It has no smarts to speak of, and it is driver-agnostic (doesn't require a driver).

In a simple case, let's say you have a Supermicro AOC-SHG3-4M2P card: you have a PCIe x8 slot but four PCIe x4 devices on the card. Ignoring real-world blocking issues, one might think that this was merely a card that broke the x8 up into four x2's and gave each M.2 slot an x2 link. (In practice x2 bifurcation is rare, of course.)

That's not what happens, though. Instead, you're talking over the x8 to the PLX chip, and if you're only using one or two of the M.2's, you get essentially full bandwidth to each device. If two devices are already communicating at "full speed" and a third spins up, you now have more demand than the x8 can handle, so you get a bottleneck where each device gets maybe 66%. If you've seen this happen on a SAS expander or Ethernet switch, surprise, you already know the general concept for a PLX switch.

So the big thing here is that a PLX switch operates at the insane speeds of the PCIe bus, so as long as you're not generating contention, a device behind a PLX is nearly as fast as the device directly connected.

But storage systems have always been faced with massive capacity, device speed, and upstream need mismatches. You probably have few cases where you actually need to be getting full speed out of a bunch of devices simultaneously. Your uplink from a server probably doesn't exceed 100Gbps, so even a shelf of 24 SATA SSDs (144Gbps) can theoretically do a pretty good job with that. This only gets more true as SSD speed increases with NVMe. You do not need 24 PCIe3 SSDs to each be able to pour their 32Gbit/sec (aggregate 768Gbit/sec) at your 100Gbps network; your CPU might not even be able to cope. So a PLX switch relies on the ability to safely oversubscribe, just like we do with SAS or Ethernet.
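To put rough numbers on that (nominal link rates, ignoring protocol overhead):

```python
# Napkin oversubscription math using nominal link rates.
SATA_GBPS = 6          # per SATA SSD
PCIE3_X4_GBPS = 32     # per PCIe 3.0 x4 NVMe SSD
UPLINK_GBPS = 100      # server network uplink
DRIVES = 24

sata_total = DRIVES * SATA_GBPS        # 144 Gbps
nvme_total = DRIVES * PCIE3_X4_GBPS    # 768 Gbps

print(f"24 x SATA SSD : {sata_total} Gbps ({sata_total / UPLINK_GBPS:.1f}x the uplink)")
print(f"24 x NVMe SSD : {nvme_total} Gbps ({nvme_total / UPLINK_GBPS:.1f}x the uplink)")
```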

Suggested reading:


Or perhaps better, if we lift this out of storage and consider a different context:

[Image: FT77B7079_GPU_example_configuration.jpg]


As an example where four PLX switches are used to increase GPU capacity.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680

Any feedback on important points I missed, @Patrick M. Hausen or other usual suspects? I've seen this question a few times and I could tack it up as a resource if my answer was decent enough. Definitely soliciting input here.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
The interrupt handling when interrupt lines (does something like this exist in PCIe, still?) are overprovisioned seems to be a bit of rough terrain in FreeBSD. Warner Losh has put a ton of work into that subsystem on the way from FreeBSD 11 to 12 and now 13.

Combine this with buggy firmware that loses completion interrupts for time-consuming operations like TRIM, even more so when operating in systems with multiple devices competing for interrupt lines (?).

When we started with the systems outlined below, we could get stable operation with 4 SSDs in the pool but not with 6. We now run 6 with the improvements done by Warner and - most importantly - Intel firmware updates for the P4510.

The symptoms you will see when this problem hits you are messages like "lost interrupt" and "controller reset". The observable behaviour of the system will be freezes of storage operations on the order of a couple of seconds. Very annoying and not usable for VMs.

HTH,
Patrick
 

firesyde424

Contributor
Joined
Mar 5, 2019
Messages
155
It seems a bit silly to build another all-SSD machine in 2022. Like we should be moving forward, like with all NVMe. But it seems tough to find folks doing this with TrueNAS.

Not necessarily. I currently manage a set of VMware-based hyperconverged clusters. One is 16 hosts with 2.3PB of mostly Micron SATA-based SSDs. We've only recently switched from SATA to NVMe because the price per TB for NVMe dropped below SATA-based flash. It all comes down to your use case and what you are planning to do.
 