High End NVME CPU requirements?

firesyde424 · Apr 27, 2023

I just received approval for what is likely to be a ridiculous project. The background is that a very large customer is paying us "stupid" money to quickly process and archive a 300TB+ proprietary database that requires a lot of processing and will require embedded images to be extracted and converted to a standard format. Given the size, estimated workload, and timeline, the only real option is NVME, never mind that NVME is at price parity with SAS. With that in mind, as I've sat down to engineer this, is occurs to me that I have no idea what kind of CPU power would be required to drive the storage of this project. I came up with the following:

Dell PowerEdge R7525(Has to be 15th gen because, for whate
- 2 x AMD 7H12 CPUs, 128 cores\256 threads total @ 2.6Ghz
- 1024GB DDR4 Registered ECC RAM @ 3200MT/sec
  - Configured at 16 x 64GB due to 8 channels of memory per CPU.
- 400\200Gbe networking
  - See below
- 24 x 30.72TB Micron 9400 Pro Gen 4 NVME U.3 drives

So.. yes, I am fully aware that 128 cores of AMD Epyc may be beyond overkill. However, is it really? While testing with a 64 core(2 x 32) Sapphire Rapids Xeon server and 24 x 15.36TB Micron 9400 pro drives, we were able to easily max all 64 cores out and still couldn't even remotely come close to what the drives were capable of. Even with the kind of CPU power we're throwing at the database side of this project, I don't expect we'll push these drives, but I'd rather not be undersized for CPU and I've not been able to find a lot of good information on how to size something like this. We have a PowerEdge R740xd in service with 24 x 15.36TB Micron 9300 Pro NVME drives with 32 cores of 2nd gen Xeon Scalable CPU power and it works fairly well. However, at times, we can see high CPU usage and that server only has 25Gbe networking.

As far as networking goes, the plan is to use either 200Gbe or 400Gbe network cards with the database and storage servers connected directly to each other. Since we're using AMD for this build, that effectively means TrueNAS Scale so networking support should be fairly good. We'll likely go with 200Gbe given that I don't know if the database server would be able to push more than that. The card we are looking at is the Nvidia ConnectX-7.

So, after that wall of text, two questions.

1: Are there any good sources out there for the kind of CPU I'd need to support these drives or does anyone have experience at this scale?

2: IXSystems does sell an NVME model of TrueNAS Scale with 100Gbe cards so I'm assuming I should be able to get 200Gbe working as well. Does anyone have good information on this?

Etorix · Apr 27, 2023

firesyde424 said:
Since we're using AMD for this build, that effectively means TrueNAS Scale so networking support should be fairly good.

Why are you assuming this? CORE should work just as fine with EPYC—and not waste 50% of the available RAM for ARC.

firesyde424 · Apr 27, 2023

Etorix said:
Why are you assuming this? CORE should work just as fine with EPYC—and not waste 50% of the available RAM for ARC.

Not saying it doesn't work, but in our testing, we have run into several issues. For example, TrueNAS Core will almost immediately alert you that your CPU temperatures are high. Check the dashboard, sure enough, a reported temperature of 90C+. Okay, time to PANIC!! Except... the server's fans aren't running like sirens and the server seems perfectly fine. Sure enough, if you check the iDRAC sensors, the CPUs are reporting at 55C and haven't been higher than 60C. Among other issues are what I hope are just incorrectly reported core clock speeds and not an issue with turbo.

Yes, these are minor issues. However, in a scenario where I have to put this server at a remote data center and rely on monitoring software to make sure the server is healthy, the incorrectly reported CPU temperatures are almost a deal breaker. Either way, I don't have these issues in TrueNAS Scale and it makes more sense to use Scale vs Core for us, in cases where we're using AMD Epyc CPUs.

Dice · May 2, 2023

This is some deeep waters very few have experience from. I'll offer a few loose nuggets:

PCIe Lanes & connectivity, ensure there is 'supporting electricals' to your desired amount of population.

A youtuber famous for dropping various hardware did some content on setting tunables on stupid fast hardware, in the context of TN.
They basically outperformed the bandwidth of CPU/RAM/PCIe ..something..something, with their blazing fast drives (in the same territory as yours..) that required some tweaking to stop capping performance. I don't recall exactly, you'll have to check, but something about using ARC for metadata only as the busses were slower than the aggregate speed of drives? This was for CORE if I recall correctly. That would also have the implication that most of your RAM will not be used.
Also I don't recall anything about particular circumstances other than "playing with hardware". Which is typically a quite bad route to making a ZFS filer work to expectation.

Here you stand at a very coarse but specific line where testing and tuning comes in, to the workload at hand. The load may act very differently from the item-dropping-youtubers scenario. Ie, how much of the 'working set' of the database will fit into RAM? Maybe significant parts of it does not see much action? That alone will have a humongous impact on the validity of YT-item-dropper's tweaks, and instead relying on ARC in your insanely powerful system.

On that route, L2ARC is likely out the window. The pool will be faster.
I've not seen nor heard about LOGs being used in this type of high performance system. I bet that is for the same reasons as L2ARC.. And that such a massive db probably would have a replica somewhere else anyhow.

From the top of my head, the default settings (for ARC) on SCALE definitely need tuning in this use case. IIRC, if not adjusted, SCALE uses far less agressive ARC:RAM ratio than CORE. (a fun feature in the proxmox/hypervisor context :P ) Ie, will leave potential on the table in the context of a purpose built filer.

There might be many more important tunables you'll need to fiddle with. Remember that TN out the box is substantially tuned for a 1GbE home NAS. Already at 10GbE you will be looking at tuning...

Regarding networking at those speeds - you will be needing tuning and testing. At least if you choose the CORE route. There is an excellent resource on the tuning for CORE. I'm not familiar with the situation on SCALE. I'd expect the driver situation to be closer to the cutting edge than CORE.

I think the most important take away from a "production standpoint" is that SCALE is not even remotely equally proven as choice, to CORE.
The differences may be bigger than you and your $$$-client is prepared for.

Looking forward to hear about this endeavor.

Cheers,

firesyde424 · May 2, 2023

Dice said:
This is some deeep waters very few have experience from. I'll offer a few loose nuggets:

PCIe Lanes & connectivity, ensure there is 'supporting electricals' to your desired amount of population.

A youtuber famous for dropping various hardware did some content on setting tunables on stupid fast hardware, in the context of TN.
They basically outperformed the bandwidth of CPU/RAM/PCIe ..something..something, with their blazing fast drives (in the same territory as yours..) that required some tweaking to stop capping performance. I don't recall exactly, you'll have to check, but something about using ARC for metadata only as the busses were slower than the aggregate speed of drives? This was for CORE if I recall correctly. That would also have the implication that most of your RAM will not be used.
Also I don't recall anything about particular circumstances other than "playing with hardware". Which is typically a quite bad route to making a ZFS filer work to expectation.

Here you stand at a very coarse but specific line where testing and tuning comes in, to the workload at hand. The load may act very differently from the item-dropping-youtubers scenario. Ie, how much of the 'working set' of the database will fit into RAM? Maybe significant parts of it does not see much action? That alone will have a humongous impact on the validity of YT-item-dropper's tweaks, and instead relying on ARC in your insanely powerful system.

On that route, L2ARC is likely out the window. The pool will be faster.
I've not seen nor heard about LOGs being used in this type of high performance system. I bet that is for the same reasons as L2ARC.. And that such a massive db probably would have a replica somewhere else anyhow.

From the top of my head, the default settings (for ARC) on SCALE definitely need tuning in this use case. IIRC, if not adjusted, SCALE uses far less agressive ARC:RAM ratio than CORE. (a fun feature in the proxmox/hypervisor context :P ) Ie, will leave potential on the table in the context of a purpose built filer.

There might be many more important tunables you'll need to fiddle with. Remember that TN out the box is substantially tuned for a 1GbE home NAS. Already at 10GbE you will be looking at tuning...

Regarding networking at those speeds - you will be needing tuning and testing. At least if you choose the CORE route. There is an excellent resource on the tuning for CORE. I'm not familiar with the situation on SCALE. I'd expect the driver situation to be closer to the cutting edge than CORE.

I think the most important take away from a "production standpoint" is that SCALE is not even remotely equally proven as choice, to CORE.
The differences may be bigger than you and your $$$-client is prepared for.

Looking forward to hear about this endeavor.

Cheers,

Thanks for the insights there.

We don't use LOG devices on our NVME TrueNAS servers for the reason you specified. About the only thing we could use that might improve performance without sacrificing power loss protection would be something like an NVDIMM, but even then, the performance benefit wouldn't genuinely be noticeable and not worth the increased complexity. The Micron 9300 drives we already have in use are rated at 150K/850K read/write IO and 3.5GB/sec read/write. The 9400 drives we're contemplating here are 300K/1.6million(Yes, you read that right) read/write IO and 7GB/sec read/write. Response times for both models are between 10 and 60 microseconds. If there are drives out there that are faster in the enterprise space, I've not heard of them.

The Epyc CPU's have 128 PCIE lanes each, and we'll have two of them. The drives would consume 96 at most. Our 2nd gen scalable PowerEdge R740xd has a pair of 16x PCIE "switches" because you only get 48 lanes per CPU from that generation of Intel Xeon. Our PowerEdge R7515 and R7525 servers don't have any obvious switching, with the backplanes directly connected to the motherboards.

Ideally, we would have purchased a single database server and used the NVME drives locally as that would be the fastest option. However, the database software requires all cores on the DB server/host to be licensed and, with the licenses retailing for $47.5K PER CORE, we would be likely spending a significant amount on licensing for performance we'd lose to the storage requirements of the drives.

Dice · May 3, 2023

I realized you never mentioned pool layout and design.
I expect you've already familiarized yourself with these;

Resource - The path to success for block storage

ZFS does two different things very well. One is storage of large sequentially-written files, such as archives, logs, or data files, where the file does not have the middle bits modified after creation. This is optimal for RAIDZ. It's what most...

www.truenas.com

Some differences between RAIDZ and mirrors, and why we use mirrors for block storage

ZFS is a complicated, powerful system. Unfortunately, it isn't actually magic, and there's a lot of opportunity for disappointment if you don't understand what's going on. RAIDZ (including Z2, Z3) is good for storing large sequential files. ZFS will allocate long, contiguous stretches of disk...

www.truenas.com

firesyde424 · May 3, 2023

Dice said:
I realized you never mentioned pool layout and design.
I expect you've already familiarized yourself with these;

Resource - The path to success for block storage

ZFS does two different things very well. One is storage of large sequentially-written files, such as archives, logs, or data files, where the file does not have the middle bits modified after creation. This is optimal for RAIDZ. It's what most...

www.truenas.com

Some differences between RAIDZ and mirrors, and why we use mirrors for block storage

ZFS is a complicated, powerful system. Unfortunately, it isn't actually magic, and there's a lot of opportunity for disappointment if you don't understand what's going on. RAIDZ (including Z2, Z3) is good for storing large sequential files. ZFS will allocate long, contiguous stretches of disk...

www.truenas.com

We aren't quite sure how to configure this. Our existing older NVME server is configured in a 3x8 RAIDZ2. Performance is well beyond sufficient and is actually limited by the 25Gbe interfaces on this server. My hope is that we can obtain sufficient performance with this new system from a 3x8 RAIDZ2 pool config. The database server has a lead time that's 2 weeks longer than the storage server so we plan to do a significant amount of testing before deploying these systems

Success for us is 120,000 x 64K block IO and/or saturation of a 200Gbe interface in sequential transfers. If we can obtain that kind of performance with the 3x8 RAIDZ2 configuration, we likely will go that route for the sake of conservative redundancy. We likely won't go with mirrored pools simply because, even with 30TB drives, we would just barely get to the capacity requirements and I like to have a little more room than that.

Dice · May 3, 2023

firesyde424 said:
Success for us is 120,000 x 64K block IO and/or saturation of a 200Gbe interface in sequential transfers.

Excellent with a clear goal rather than what from time to time happen - shoot for the moon and "hopefully the client will be happy".

firesyde424 said:
If we can obtain that kind of performance with the 3x8 RAIDZ2 configuration, we likely will go that route for the sake of conservative redundancy.

LGTM.

firesyde424 said:
Our existing older NVME server is configured in a 3x8 RAIDZ2. Performance is well beyond sufficient and is actually limited by the 25Gbe interfaces on this server.

You have more points of reference than I realized. Excellent.

I hope to hear more about your specific tuning/tunables when the time comes. It is very valuable to the forums with numbers/experience/tales from the higher end of the performance spectrum.

Davvo · May 3, 2023

The following resource might be helpful in calculating your expected performance (or simply to validate your point).

ZFS Storage Pool Layout

This amazing document, created by iXsystems in February 2022 as a "White Paper", cleanly explains how to qualify pool performance touching briefly on how ZFS stores data and presents the advantages, performance and disadvantages of each pool...

www.truenas.com

Also, make sure to check the performance of mixed IOPS and speeds: as far as I know, the only drives who don't suffer from such operations are the now dead Intel's optane.

firesyde424 · May 3, 2023

It's starting to get real. The first batch of drives has arrived.

They are far heavier than you'd think. 8 years ago, I started working for this company and our entire infrastructure, backups included, could have been contained on 3 of these.

Davvo · May 3, 2023

€ 4.438,20, without taxes. Each. Yep, not your everyday soho build.

Dice · May 6, 2023

Another thread discussed similar 'pool design ideas' which brought back some additional meat to the "why choose mirrors", which I think is quite nice as an eye opener. The idea of mirrors vs raidz for VM's is not simply a question of spindle count and IOPS.

Somehow I got invested in your success in this build, and wish to share these posts to you here.

CPU performance - Mirrors vs Raidz2 - all SSD

Hello Dear Forum, I'm expanding my homelab. I'm looking to increase my current RaidZ1, 3x500GB SSD worth of capacity for VM storage. I've gathered (well, scavenged from other systems... ) two additional 500GB SSDs. I'm willing to pay for an additional up to 2 more units. Total: 7 units to play...

www.truenas.com

Is RAID10 / mirrored vdevs still the disk layout to choose at all flahs storage servers for esxi??

Dear Community, I used to learn that a RAID10, or in other words "multiple mirrored vdevs", was the disk layout to choose for a zvol at a ZFS-server for deployments as ESXi datastore. Now, time has to come to shift from hybrid pools to all flash storage pools. While I was in contact with one...

www.truenas.com

I still believe your plan to start out for space efficiency is a good choice, if that works, the better. I wouldn't be surprised either way honestly. If it would tank like no tomorrow, or it just ran fine.
I find it fascinating how performance is buried when using Z2/Z3 Really a lot.

Maybe your drives are stupidly overkill anyways, and you'll have enough free space that this wont matter to a jeopardize your goals. But if it tanks somehow, you know where to check next..

firesyde424 · May 8, 2023

Dice said:
Another thread discussed similar 'pool design ideas' which brought back some additional meat to the "why choose mirrors", which I think is quite nice as an eye opener. The idea of mirrors vs raidz for VM's is not simply a question of spindle count and IOPS.

Somehow I got invested in your success in this build, and wish to share these posts to you here.

CPU performance - Mirrors vs Raidz2 - all SSD

Hello Dear Forum, I'm expanding my homelab. I'm looking to increase my current RaidZ1, 3x500GB SSD worth of capacity for VM storage. I've gathered (well, scavenged from other systems... ) two additional 500GB SSDs. I'm willing to pay for an additional up to 2 more units. Total: 7 units to play...

www.truenas.com

Is RAID10 / mirrored vdevs still the disk layout to choose at all flahs storage servers for esxi??

Dear Community, I used to learn that a RAID10, or in other words "multiple mirrored vdevs", was the disk layout to choose for a zvol at a ZFS-server for deployments as ESXi datastore. Now, time has to come to shift from hybrid pools to all flash storage pools. While I was in contact with one...

www.truenas.com

I still believe your plan to start out for space efficiency is a good choice, if that works, the better. I wouldn't be surprised either way honestly. If it would tank like no tomorrow, or it just ran fine.
I find it fascinating how performance is buried when using Z2/Z3 Really a lot.

Maybe your drives are stupidly overkill anyways, and you'll have enough free space that this wont matter to a jeopardize your goals. But if it tanks somehow, you know where to check next..

Performance matters, but it's more about the sudden size of this database(240TB+), timelines, our lack of 100Gbe networking, and the ridiculous cost of licensing the database server. We're not able to get our hands on 100Gbe switches in a timeframe that makes sense for this project and the storage needs to be separate to bring as much of a return from the investment in the database software licensing as possible. This capital for this hardware is coming from a technology fee in the project and the hardware itself will be put to other uses once this project is over.

HoneyBadger · May 8, 2023

Given the performance/success target being measured in 64K, the space efficiency overhead from writing small records to RAIDZ would be mitigated significantly here if OP is going to write those larger sizes using NFS as was done previously. The default recordsize of 128K will be enough to write them without chopping them up into the smaller default volblocksizes that ZVOLs/iSCSI tend to use.

With this kind of power behind the pool itself, I have a feeling that network parallelism and queueing upstream is going to become a bottleneck - but if I recall (and the "similar threads" window confirms it) there was a build with an R740XD and Micron 9300s in the past. Did you end up adjusting the number of NFS servers on the TrueNAS host at that point, or did you end up bottlenecked elsewhere?

Ericloewe · May 8, 2023

firesyde424 said:
However, the database software requires all cores on the DB server/host to be licensed and, with the licenses retailing for $47.5K PER CORE

Gotta pay for those licensing audits, which are only needed because the product is absurdly and inconceivably expensive, because it needs to pay for licensing audits, because ...

firesyde424 said:
So.. yes, I am fully aware that 128 cores of AMD Epyc may be beyond overkill. However, is it really?

I mean, the cost to upgrade from half the cores would probably not touch the cost of a single one of those SSDs... It's hard to argue against the extra cores when dealing with this level of cost. It'd be slightly different if you didn't need two CPUs, but those PCIe lanes would be missed in a single-socket server.

firesyde424 · May 17, 2023

They are here!!

2 x 64 core AMD Epyc CPUs, 1TB of DDR4 RAM, 2 x 100Gbe NICs, 4 x 25Gbe NICs and 24 x 30TB Micron 9400 Pro NVME drives. I think I might need some alone time....

Ignore the labels on the drive caddies. In a bid to sell their over priced storage, Dell doesn't allow any suppliers such as CDW to resell caddies so we've had to reuse some caddies in addition to third party caddies.

In an odd specification limit, Dell will allow a 15th gen R7525 with a 7004 series Epyc to have a 24 X 2.5" NVME backplane such as the server on the top in this picture. But the newer, 16th gen R7625 with 9004 series Epyc CPUs can only be configured with a max of 16 x 2.5" NVME drives, as seen on the bottom. Thus the reason for using the older server for the storage side of this project.

Ericloewe · May 17, 2023

What the hell, I have had an R6515 and an R7525 on order since late March or early April and neither of them has arrived yet!

firesyde424 said:
so we've had to reuse some caddies in addition to third party caddies.

A common story in any Dell, HPE or Lenovo shop, no doubt. Infinitely frustrating and perpetuated by clueless procurement departments that go along with it.

firesyde424 · May 18, 2023

Ericloewe said:
What the hell, I have had an R6515 and an R7525 on order since late March or early April and neither of them has arrived yet!

When these were first quoted back in March, their original delivery date was early August. It had to do with the 128GB RDIMMS I originally configured as they were in short supply, in addition to a few other items. I was able to work with our CDW rep to make configuration changes to get the delivery dates down to something more normal. I wanted to get our project in before the K-12 orders so we weren't waiting behind all the schools trying to upgrade over the summer.

firesyde424 · May 18, 2023

Ericloewe said:
A common story in any Dell, HPE or Lenovo shop, no doubt. Infinitely frustrating and perpetuated by clueless procurement departments that go along with it.

You aren't kidding! Dell doesn't resell the 30TB drives in this project but they do resell its smaller, older sibling, the 15.36TB Micron 9300 pro. We pay about $1800 for them. Dell retails them for just over $10K. We'd probably pay $5700 after "negotiation." No thank you.

Ericloewe · May 18, 2023

firesyde424 said:
When these were first quoted back in March, their original delivery date was early August. It had to do with the 128GB RDIMMS I originally configured as they were in short supply, in addition to a few other items. I was able to work with our CDW rep to make configuration changes to get the delivery dates down to something more normal. I wanted to get our project in before the K-12 orders so we weren't waiting behind all the schools trying to upgrade over the summer.

Early August is rough. I've been waiting for a Supermicro NAS since January, and I just want to get the thing in house and get on with things and I feared I might end up with a similar timeline for the Dells.

firesyde424 said:
You aren't kidding! Dell doesn't resell the 30TB drives in this project but they do resell its smaller, older sibling, the 15.36TB Micron 9300 pro. We pay about $1800 for them. Dell retails them for just over $10K. We'd probably pay $5700 after "negotiation." No thank you.

It's so absurd how they quote one (obscene) price in the configurator, you add it and the delta is smaller than the price of the thing, and then when you get a quote from them the server is priced competitively - except for storage, which stays obscene after all the assorted discounts.
Since I really didn't want to deal with RAID controller nonsense on the R7525, I specced the NVMe-only 16-bay option, so I had to order a pair of 480ish GB SSDs for what must have been at least 200 bucks each.

(Why not just buy Supermicro? Big company procurement stuff. I can force Supermicro through when Dell doesn't have a similar product. I can avoid HPE because they actually quote us stuff without discounts, so they end up at twice the price, and I can avoid Lenovo because they sold us lemons a few years back).

Important Announcement for the TrueNAS Community.

High End NVME CPU requirements?

Contributor

Wizard

Contributor

Wizard

Contributor

Wizard

Contributor

Wizard

MVP

Contributor

MVP

Wizard

Contributor

actually does care

Server Wrangler

Contributor

Server Wrangler

Contributor

Contributor

Server Wrangler

Similar threads