Tell me why my NVMe build sucks (or doesn't)

geruta

Cadet
Joined
May 15, 2023
Messages
5
I will look at SLOG in the future. The MB has a lot of PCIe slots, so I could always buy a PCIe-to-M.2 adapter and plug in one or two more NVMe drives.

I took Etorix's advice and swapped the two SATA DOMs for 1 cheap 500 GB NVMe for booting.

I also decided to YOLO and upgrade from 6 to 8 SSDs, plus a better processor (Xeon Gold 6128).

I just completed the purchase of all the gear, just shy of $2100 which isn't bad. Fingers crossed it's smooth sailing from here :)

Final re-cap of HW:

8x SAMSUNG Electronics 870 EVO 2TB 2.5 Inch SATA
1x Crucial 500GB NVMe (boot disk)
Supermicro X11SPL-F Intel C621
Intel® Xeon® Gold 6128 Processor
4x SAMSUNG 32GB 2Rx4 DDR4-2666 (128GB total)
1x Intel X710-DA2 dual-port 10GbE SFP+ NIC
1x 10Gb DAC
Less important stuff: Case, PSU, case fans, CPU fan, SSD brackets, SATA cables
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
…provided one needs a SLOG.
I believe it is always sane to discuss SLOGs in the context of VM hosting, particularly when synchronous writes are on the table.

But with an SSD pool, there's a question of whether the pool could actually be faster than a single SLOG device.
The SLOG will for sure have a challenge keeping up.
We're still in SATA territory, which makes quite the difference compared to an NVMe-based setup, where it's lights out quickly.

I'd love to see some benchmarks where a fast SLOG is put up against increasingly faster pools, to see at what point the non-SLOG pool outperforms the slower SLOG pool with sync=always.
As always, it's difficult to set up synthetic benchmarks that mimic actual use, and in the case of a low-intensity real workload, benchmarks are not much more than an academic endeavor. I like such endeavors. Might consider giving it a shot myself.
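If I get around to it, something like this fio recipe would probably be the starting point (a rough sketch only; the dataset name and sizes are placeholders, and sync=always is forced so both the SLOG pool and the non-SLOG pool take the ZIL path):

Code:
# force the ZIL path on the test dataset (placeholder name)
zfs set sync=always tank/fio-test

# small-block sync write test; run once against the pool with a SLOG
# and once against the pool without one, then compare IOPS and latency
fio --name=syncwrite --directory=/mnt/tank/fio-test \
    --rw=randwrite --bs=16k --size=8G --runtime=120 --time_based \
    --ioengine=posixaio --iodepth=4 --numjobs=4 --sync=1 --group_reporting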
 

QonoS

Explorer
Joined
Apr 1, 2021
Messages
87
Just get cheap (Price/TB) SATA SSDs with PLP,
No need for SLOG at all. Anything besides Optane would bring no real benefit and Optane is just too expensive.

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Just get cheap (Price/TB) SATA SSDs with PLP,
No need for SLOG at all.
This is not correct in the case of ZFS.

There is an additional timing window between "SSD confirms data has been committed" and "actually committed", i.e. the pre-caches on the SSD drive itself. That window would be handled nicely by a PLP SSD in probably most other filesystems.

However, in ZFS you'd also need to compensate for the ZIL (intent log).
Here is an excellent explanation:


It is not enough to rely on PLP drives in the case of ZFS.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
No need for SLOG at all. Anything besides Optane would bring no real benefit and Optane is just too expensive.

That's just crazy talk. What happens to the data in the write cache? In ZFS, the write cache is in main system memory, and a transaction group is virtually guaranteed to have data in it unless no writing is occurring. If you just lose transaction groups due to a panic or power loss, your VM disks may become inconsistent because of the missing data never being committed. This is exactly what SLOG is supposed to prevent.

Even a basic SATA SSD with proper PLP is sufficient to take care of this problem, and is certainly not "too expensive".
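Attaching one is a one-liner from the CLI, too (a sketch only; the pool and device names are placeholders, and on TrueNAS you would normally do this through the GUI):

Code:
# add a PLP SATA SSD as a dedicated log (SLOG) vdev to an existing pool
zpool add tank log /dev/ada6

# verify it shows up under the "logs" section
zpool status tank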
 

QonoS

Explorer
Joined
Apr 1, 2021
Messages
87
This is not correct in the case of ZFS.

There is an additional timing window between "SSD confirms data has been committed" and "actually committed", i.e. the pre-caches on the SSD drive itself. That window would be handled nicely by a PLP SSD in probably most other filesystems.

However, in ZFS you'd also need to compensate for the ZIL (intent log).
Here is an excellent explanation:


It is not enough to rely on PLP drives in the case of ZFS.
As it is written in the post by jgreco you mention: "Without a dedicated SLOG device, the ZIL still exists, it is just stored on the ZFS pool vdevs."
So when the ZIL is stored on a ZFS pool's vdevs consisting of PLP SSDs, at which point is it "not enough to rely on PLP drives", exactly?

[...] What happens to the data in the write cache? In ZFS, the write cache is in main system memory, and a transaction group is virtually guaranteed to have data in it unless no writing is occurring. If you just lose transaction groups due to a panic or power loss, your VM disks may become inconsistent because of the missing data never being committed. This is exactly what SLOG is supposed to prevent.

Even a basic SATA SSD with proper PLP is sufficient to take care of this problem, and is certainly not "too expensive".
That is perfectly true and when zpool vdevs are PLP SSDs no SLOG is needed. For this reason I wrote "Just get cheap (Price/TB) SATA SSDs with PLP". Please understand my post as whole.
 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So when the ZIL is stored on a ZFS pool's vdevs consisting of PLP SSDs, at which point is it "not enough to rely on PLP drives", exactly?

At the point where you need any sort of performance out of the pool, because the in-pool ZIL region has poor performance compared to a dedicated SLOG. Especially true for RAIDZ.

That is perfectly true and when zpool vdevs are PLP SSDs no SLOG is needed. For this reason I wrote "Just get cheap (Price/TB) SATA SSDs with PLP".

Simply not true. First, performance tanks when using the in-pool ZIL. Second, there are very few cheap SATA SSDs with PLP.
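If you want to see where the sync traffic actually lands, the per-vdev iostat makes it obvious (sketch, pool name is a placeholder): with a SLOG the writes pile up under the "logs" section, without one they get mixed into the data vdevs.

Code:
# per-vdev I/O breakdown, refreshed every second
zpool iostat -v tank 1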

Please understand my post as whole.

Thanks for correcting me. I understood your post as wrong. I see now that I should have understood it as whole. What the hell does that even mean?
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
@QonoS
There might be more to your point than I first realized on autopilot.

Because it sure seems like, if the ZIL contains basically a copy of the sync-write transaction groups, then whether it resides on a power-loss-protected separate device or on a vdev of PLP SSDs, the data is equally protected.

A caveat I could see is latency, where an Optane SLOG might still generate a net benefit in certain zpool configurations, since the in-pool ZIL takes on the same characteristics as the underlying pool (i.e. the same overhead). That is, the worst case (a pool of RAIDZ3 SATA SSDs) might generate enough overhead that at some point even a "PLP-native pool" would benefit from the latency cut of using a high-performing Optane.

I feel like the last year's price drop in SSDs has started to open up new avenues and 'simple formulas for success', particularly in the context of high-speed NVMe drives, which I would think could be gloriously gimped by a SLOG, even one of Optane class. But this quickly turns into a specific discussion about workloads and write characteristics. In some cases it will be <all that matters>, in others not at all, depending on configuration etc.

@geruta
To switch gears to NVMe vs SATA: I'd personally not get involved in SATA at this point. The gains in the NVMe space wildly outperform the static cap the SATA protocol has already reached...

Can't deny the Micron 7400 PRO seems fantastic. PLP included.
Price per GB is really not far off from the Samsung 870 EVO 2TB 2.5".
The Micron reads 4400MB/s, writes 2000MB/s, and is available in U.3 or M.2. It'll be eons faster than the SATA Samsung 870s.
I reckon the sweet spot per GB would be the 3.84TB version. Rather than 4x 2TB, I'd look for 4x 3.84TB and run mirrors with fewer, larger drives.
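Layout-wise that is just a stripe of mirrors, something like this from the CLI (a sketch only; device names are placeholders, and in TrueNAS you'd build it from the GUI):

Code:
# two 2-way mirrors of 3.84TB NVMe drives, striped together into one pool
zpool create tank \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1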

Hah, I'm getting carried away here :D
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Because it sure seems like, if the ZIL contains basically a copy of the sync-write transaction groups,

It doesn't, though. A transaction group is a (usually) large group of individual writes. ZIL entries have to be written synchronously, which means that each sync write ends up being a "transaction group" of one write (and I use the term "transaction group" here VERY loosely, not in the ZFS sense).

then whether it resides on a power-loss-protected separate device or on a vdev of PLP SSDs, the data is equally protected.

If we're merely discussing protectedness, yes, you could argue this.

since the in-pool ZIL takes on the same characteristics as the underlying pool (i.e. the same overhead). That is, the worst case (a pool of RAIDZ3 SATA SSDs) might generate enough overhead

That overhead is significant. Even in a mirror pool, the in-pool ZIL requires that writes be completed to both sides of the mirror. For RAIDZ, you actually involve multiple devices. In both cases, you are ALSO competing with data traffic to the pool, which means you experience additional latency if there are any pool reads or ZFS transaction group flushes being committed to the pool. You might be able to mitigate this with Optane or something like that, I suppose; haven't tried it.

To switch gears to NVMe vs SATA: I'd personally not get involved in SATA at this point. The gains in the NVMe space wildly outperform the static cap the SATA protocol has already reached...

And why's this? If you are looking at large amounts of storage, SATA is still quite attractive. A decent SSD such as the 870 EVO is capable of ~5Gbps, so you don't need many of them to be able to fill a 10Gbps uplink from the NAS. A stuffed 24-bay NAS with three way mirrors is capable of 40Gbps write or 120Gbps read.
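(Back-of-the-envelope, taking ~5Gbps per drive: 24 bays as three-way mirrors is 8 vdevs, so writes land on 8 vdevs at roughly 8 * 5 = 40Gbps, while reads can be served from all 24 drives at roughly 24 * 5 = 120Gbps.)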
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
And why's this?
I could have emphasized the 'personally' a bit more. I meant it less as "how to build the ultimate NAS" than as a recount of my own personal reasons. I like to buy items that can be used in various roles in my little lab. Retrospectively, I can see that I've benefited hugely from avoiding "over-optimizing" each system, given how they change over time, and thus being able to repurpose items freely. From this perspective, I'd much rather buy a newer NVMe drive.

However, this thread has led me to explore and challenge which 'within reach' or 'within comprehension' setups actually require synchronous writes, and thus a SLOG, versus how the next level of performance from NVMe (compared to SATA) will influence the role of SLOGs.

A decent SSD such as the 870 EVO is capable of ~5Gbps, so you don't need many of them to be able to fill a 10Gbps uplink from the NAS. A stuffed 24-bay NAS with three way mirrors is capable of 40Gbps write or 120Gbps read.
The question is - what do you do for SLOG in that scenario?
I'm concerned it will be difficult to judge whether even the fastest Optane drives would be sufficient to unleash the speed of such a box. My gut says no, particularly if any sync writes of sequential character would need to take place. Even the best Optanes would bog the system down at that point.

Even if SATA drives in sufficient quantity (picture a 24-bay) can saturate 40-100Gbps in total, they would be worse in overall latency than NVMe, even if the MB/s were closer, because NVMe is a more efficient protocol.
Add to that several times (or orders of magnitude(!)) higher write/read IOPS compared to a SATA SSD, and NVMe starts gaining significant ground toward negating the increased load caused by the overhead of the in-pool ZIL, i.e. making it a net benefit to run NVMe+PLP over SATA+SLOG.

I really was surprised to find the PLP-backed, DC-class NVMe Micron 7400 PRO at nearly the same price per GB, at double the capacity (i.e. no insurmountable premium), as the Samsung consumer SATA SSDs... That is a significant shifting point, particularly as cutting the number of drives in half while retaining capacity weakens the arguments about PCIe lane starvation in many builds.

This is where I'm off theorizing and thinking ahead about how the technology leap challenges assumptions about sound pool design.
I liked this thread, https://www.truenas.com/community/threads/truenas-scale-nvme-performance-scaling.104641/
and the linked talk https://www.youtube.com/watch?v=v8sl8gj9UnA, which discusses the flash challenges further.

As already mentioned, connectivity (backplanes, PCIe lanes, PCIe controller cards etc.) is still vastly more expensive. To this day, "over-optimizing" an SSD build of 'significant storage capacity' would still favor SATA/SAS. I agree on that point.

To me, the interesting part is how the rise of high-performance flash is challenging the last decade's design considerations.
Sooner rather than later, the design of using a SLOG on a flash pool may be completely obsolete. I'm intrigued by this question; maybe it is already the reality.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The question is - what do you do for SLOG in that scenario?

Persistent memory devices like NVDIMMs.

In almost all scenarios, you want a hypothetical SLOG device that's faster than your pool. Assuming the default zvol block size of 16K, it's significantly better to program those to a single, high-endurance SLOG device and then let ZFS aggregate them into larger (1M HDD/128K SSD) blocks to the pool disks.
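As a concrete sketch (pool, zvol name and size are placeholders; 16K volblocksize is the default mentioned above, just shown explicitly here):

Code:
# a VM zvol at the 16K block size, created explicitly
zfs create -V 200G -o volblocksize=16K tank/vm-disk0

# make sure writes hit the ZIL/SLOG before being acknowledged
zfs set sync=always tank/vm-disk0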
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The question is - what do you do for SLOG in that scenario?

NVDIMM? Optane: The next generation? :smile:

I'm concerned it will be difficult to judge whether even the fastest Optane drives would be sufficient to unleash the speed of such a box. My gut says no, particularly if any sync writes of sequential character would need to take place. Even the best Optanes would bog the system down at that point.

"Particularly if any sync writes of sequential character"? Well, let's not be ridiculous here. If you are doing stuff like that, you have justification for DAS. Network storage is poorly suited to that type of workload. NAS typically requires multiple concurrent workloads in order to perform well, and I think that's the answer for your "sync writes of sequential character" issue. There's just too much latency involved for a single stream writer, as outlined in the ZIL/SLOG article (see the "laaaaatency" section).

if SATA drives in sufficient quantity (picture a 24-bay) can saturate 40-100Gbps in total, they would be worse in overall latency than NVMe,

Yes. So what? You can make the same argument about NVMe. NVMe at the far end of a network connection (i.e. NAS) is lots slower than NVMe at the far end of a PCIe connection (i.e. DAS). If you really need low latency, you don't involve NAS at all. Otherwise, you have to look at the architecture of ZFS and understand that when you're pulling a 128KB block off of SATA-backed storage, you are queueing up requests for 32 blocks, and so if you have a 12-wide RAIDZ2 vdev, you get two or three requests per disk, with the disks being asked for those in parallel, so 12 * 6Gbps = 72Gbps if you've arranged your vdevs optimally, or probably at least 24Gbps even if not. Never underestimate the value of concurrency. Latency is worse, but throughput is pretty awesome. Use the right tool for the right job. Low latency? Sure, eliminating a bunch of network controllers, switch silicon, HBA controllers, and possibly SAS expanders reduces latency. No doubt. But eventually, with NVMe, you do get to a point where you're out of lanes, and then you need PLX switches or whatever comes next to increase the number of devices, and that too introduces new latency.

Sooner rather than later, the design of using a SLOG on a flash pool may be completely obsolete. I'm intrigued by this question; maybe it is already the reality.

I thought progressive ZFS installations had already moved on to using Optane or NVDIMM for SLOG, so I'd say SLOG on a PLP flash pool is so last decade already.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
NVDIMM? Optane:
I'm aware of Optane having been state of the art for a good while now.
Ah, NVDIMMs I had completely forgotten about. I recall a discussion about their viability at work recently.
Something along the lines of an end-of-sales date no more than 1.5 years out, 2025 or thereabouts.
CXL appears to be coming "soon" to fill the void of fast & "cost-effective" NV-flash.

I was less sure how they stack up in stupid-fast flash storage systems, and where/how bottlenecks develop. It's not as obvious as when spinning rust was in the mix...

Well, let's not be ridiculous here. If you are doing stuff like that, you have justification for DAS. Network storage is poorly suited to that type of workload.
Me previously preferring NVMe over SATA is ...

Let's not carry this discussion too far out into the 'general storage world', and keep one foot on the TN forums, where people come for TN advice on how best to adapt TN (albeit with its shortcomings and caveats).
In writing that, I was thinking about high-end use cases of TN, for example a video-editing NAS. I might've been off anyway.

You can make the same argument about NVMe.
I buy into your general line of argument, with one exception here:
But eventually, with NVMe, you do get to a point where you're out of lanes, and then you need PLX switches
I think this is written from the logic of a 'large hive mind controlling all the drives'.
As I see it, we're more likely to go in the direction of clustering nodes when dealing with NVMe, if capacity needs surpass what a single node can offer. Because once a decent chunk of PCIe lanes is exhausted, there is more or less enough performance to saturate the wildest network connections.
If extracting that kind of performance is not required - well, then SATA/SAS + expander backplanes are probably your friend.

Assuming the default zvol block size of 16K, it's significantly better to program those to a single, high-endurance SLOG device and then let ZFS aggregate them into larger (1M HDD/128K SSD) blocks to the pool disks.
This was enlightening. Thanks.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
I figured I'd share some crude timestamps of key takeaways from the talk linked previously that tie into this discussion.

Basically, a RAIDZ3 array would cause roughly DOUBLE the latency + double the throughput (!) even on NVMe, with sync writes (!)
This is the "significant overhead" grinch mentioned. It is a lot worse than I thought, and it really strains the argument that 'fast enough hardware in the vdevs' would solve the problem.

A few other notes on how ZFS's strength, creating great performance out of loads of drives plus a few fast SSDs, turns into an overhead nightmare that doesn't play as nicely with super-fast hardware, i.e. the hardware is not the limiting factor. This is what most of the talk centers on explaining.

There's a short note on how NVMe drives are so fast that the aggregation (I read this as the construction of transaction groups) is not able to fill the NVMe queue and leaves it idling, i.e. slowing the process down rather than 'speeding it up' as it would have on rotating rust.

The aggregation mentioned by @HoneyBadger also receives some unpacking:
it's significantly better to program those to a single, high-endurance SLOG device and then let ZFS aggregate them
https://youtu.be/v8sl8gj9UnA?t=2060 has some useful material on the commit delay built into SLOG writes (to allow more txg's into each flush..?), which causes unnecessary delays. Also cool.
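For anyone wanting to poke at the timing knobs in that part of the talk, they appear to be exposed as OpenZFS module parameters on SCALE (a sketch; I'm assuming a reasonably current OpenZFS, and the defaults are best left alone):

Code:
# how often transaction groups are committed (seconds)
cat /sys/module/zfs/parameters/zfs_txg_timeout

# the ZIL commit delay knob (presumably the delay referred to in the talk)
cat /sys/module/zfs/parameters/zfs_commit_timeout_pct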

Systems with multiple pools, one of HDDs and one of SSDs, would face tuning issues.
Maybe less relevant for most commercial systems(?), where I imagine mixed pool setups are less common practice.

Cheers,
 

chris1979

Cadet
Joined
Nov 19, 2016
Messages
3
Personally, I have never had a problem with booting from NVMe. I can't tell you what they're like for storage, but as a boot drive they're OK. Is it a homebrew system or pre-built?
 

jixam

Dabbler
Joined
May 1, 2015
Messages
47
In almost all scenarios, you want a hypothetical SLOG device that's faster than your pool. Assuming the default zvol block size of 16K, it's significantly better to program those to a single, high-endurance SLOG device and then let ZFS aggregate them into larger (1M HDD/128K SSD) blocks to the pool disks.
Wait, are you suggesting that an in-pool ZIL causes additional fragmentation?

(Edit: as in, suggesting that the SLOG is a prerequisite for the aggregation?)
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Wait, are you suggesting that an in-pool ZIL causes additional fragmentation?

(Edit: as in, suggesting that the SLOG is a prerequisite for the aggregation?)

Only in the case of small synchronous writes, where ZFS will need to commit that data to a stable location (in-pool ZIL) and then later that ZIL will be cleared. Aggregation comes from the transaction group in RAM, so you'll be doing "small writes now" - eg, 4/8/16K - to the SSDs, and then the larger aggregated writes (up to 128K) when the transaction group is committed.

For async, of course none of this applies as those just queue up in RAM.
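For reference, which path a given dataset or zvol takes is governed by its sync property; a quick sketch, the dataset name is just an example:

Code:
# standard = honor application sync requests, always = everything via the ZIL,
# disabled = never wait (fast, but unsafe for things like VM disks)
zfs get sync tank/vm-disk0
zfs set sync=always tank/vm-disk0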

Hope that helps clear things up, and sorry if I made it confusing!
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Wait, are you suggesting that an in-pool ZIL causes additional fragmentation?

(Edit: as in, suggesting that the SLOG is a prerequisite for the aggregation?)
Only in the case of small synchronous writes, where ZFS will need to commit that data to a stable location (in-pool ZIL) and then later that ZIL will be cleared. Aggregation comes from the transaction group in RAM, so you'll be doing "small writes now" - eg, 4/8/16K - to the SSDs, and then the larger aggregated writes (up to 128K) when the transaction group is committed.

For async, of course none of this applies as those just queue up in RAM.

Hope that helps clear things up, and sorry if I made it confusing!
I think, to summarize, the answer is that it certainly can, for a similar reason that RAIDZ can be less storage-efficient than you might expect compared to mirrors, or that you can get fragmentation if the wrong block size is set for the workload in question.

As an example, I have some Optane drives in a mirrored configuration using the default 128K record size (because I forgot to change it and then didn't care), running VMs in my homelab on SCALE.
Code:
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
optane_vm   1.73T   756G  1020G        -         -    33%    42%  1.13x  ONLINE  /mnt
  mirror-0   888G   682G   206G        -         -    64%  76.8%      -  ONLINE


And it's all sortsa fragmented... 64%! But they are SSDs... How much does it really matter? Enough that I should care? Maybe, maybe not.
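Worth noting, as I understand it, the FRAG column is free-space fragmentation rather than how fragmented the stored data is. Either way it's easy to keep an eye on (pool name taken from the output above):

Code:
# pool-wide free-space fragmentation property
zpool get fragmentation optane_vm

# per-vdev view, same as the listing above
zpool list -v optane_vm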
 