So, for the application of a single device as a SLOG, we're likely not going to see much of a difference between partitions and namespaces?
I am hard pressed to imagine a way the two implementations would end up behaving significantly differently. It's mostly a matter of who implements the "space" mapping. For partitions, the host adds an offset to the LBA to reach a particular portion of the SSD's LBA map. For namespaces, the controller does something to quickly transform the LBA, and it's likely just an addition of some sort as well. So it's just a question of who is doing the adding, I suspect.
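To make the "who does the adding" point concrete, here's a toy sketch. The function names are made up for illustration; the point is simply that both paths reduce to the same offset addition, performed on different sides of the PCIe link:

```python
def partition_to_device_lba(partition_start_lba: int, io_lba: int) -> int:
    """Host-side mapping: the kernel adds the partition's start offset
    before the command ever reaches the SSD."""
    return partition_start_lba + io_lba


def namespace_to_media_lba(ns_base_lba: int, io_lba: int) -> int:
    """Controller-side mapping: conceptually the same addition, just
    performed by the SSD's firmware instead of the host. (Assumption:
    real controllers may use a fancier translation, but some cheap
    remap like this is the likely shape of it.)"""
    return ns_base_lba + io_lba


# Either way, block 100 of a region starting at LBA 8,388,608
# (4 GiB into a 512-byte-sector device) lands in the same place:
print(partition_to_device_lba(8_388_608, 100))  # 8388708
print(namespace_to_media_lba(8_388_608, 100))   # 8388708
```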
What may be interesting is potentially adding multiple (3 or 4) "overprovisioned" 4GB namespaces to a single NVMe drive and then striping those "devices" as a SLOG.
I see what you're saying with respect to the increased TPS at NS={2,3,4,...}, but I am skeptical. Plus, that graph shows read traffic, while a SLOG sees exclusively write traffic.
This really gets into questions of how ZFS works with a SLOG device. I knew more about this ten years ago, when I went in to self-solve the issues related to bug #1531 and learned a lot about the whole ZFS write process. So I'm going to toss out some ideas about which I might be right or wrong.
First, striping namespaces on a single NVMe assumes you can somehow make two namespaces go faster than a single namespace. I agree the graph appears to suggest that. But I wonder if this is just a latency effect, the way we obsess about queue depth on SAS. I'd like to understand the mechanics of what's going on in that graph. Because the effect evaporates after NS=5, it's clear that some inefficiency in the system is being exploited and transformed into small extra amounts of performance.
If you had TWO NVMe devices, you might be able to make one namespace on one, striped with one namespace on the other, go faster. But I am skeptical even there. At some point you may hit a synchronization penalty, because striping inherently wants to touch both sides of a block at about the same time. Even if you exploit NVMe's enormous queueing model (up to 64K queues of 64K entries each, far beyond SATA's NCQ), you are not guaranteed to be able to match speeds between the devices to stay in sync.
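The synchronization penalty is easy to model. A toy simulation, under the (loud) assumption that per-write device latency is exponentially distributed: a striped write only completes when both halves finish, so per-op latency is max(latA, latB), which for two devices with identical 10 µs mean latency works out to a 15 µs mean, a 50% penalty from synchronization alone:

```python
import random

random.seed(1)  # deterministic toy run


def stripe_latency_mean(samples: int = 100_000, mean_us: float = 10.0) -> float:
    """Mean completion latency of a 2-way stripe where each half's
    latency is an independent exponential with the given mean."""
    total = 0.0
    for _ in range(samples):
        lat_a = random.expovariate(1 / mean_us)  # device A's half
        lat_b = random.expovariate(1 / mean_us)  # device B's half
        total += max(lat_a, lat_b)  # stripe waits for the slower half
    return total / samples

# Analytically, E[max of two iid exp(mean m)] = 1.5 * m,
# so two 10us-mean devices yield ~15us per striped op.
```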
I would like to understand why Micron claims that handing a single NVMe device two transactions on a single namespace is less efficient than handing it one transaction each on two namespaces. It might not be much less efficient, but even just processing the switch between namespaces likely takes a tiny bit of time. Or are they just cramming the submission queues up to the gills with entries and finding that 128K entries spread across two namespaces is slightly more efficient than 64K to one?
I have a hard time seeing how a single NVMe device handed one transaction each to two namespaces could complete that operation FASTER than two transactions to a single namespace unless there was concurrency available in the controller or some other inefficiency to be squeezed out.
That's the controller-level analysis I have in my head. Correct me anywhere you please.
The next bit is ZFS, and it's a bit more nebulous. I do not recall ZFS being able to handle concurrency in the ZIL. zil_commit is designed to be single-threaded and causes other calls to zil_commit to block.
This probably made sense years ago when the ZIL mechanism was designed; I don't think anyone was thinking of the possibility of parallel log devices.
Now, theoretically, suppose you had two log devices (and please note I am NOT saying "striped" here; just go with a POSIX-satisfying theoretical design). If you had one writer client, you could probably only make use of one logger write thread, because the latency of the operation is dominated by what's happening in the SSD controller and flash; the messaging between the SSD controller and the CPU happens at NVMe speed (hella fast). But if you had two writer clients, a second logger write thread would let you ingest at up to 2x the speed, because you can write to both devices. So that's potentially workable.
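Here's a sketch of the serialization being described. Everything here is hypothetical modeling, not OpenZFS code: one shared lock stands in for the single-threaded zil_commit path, and one lock per log device stands in for the theoretical two-logger design. With the commit time dominated by the device (modeled as a sleep), doubling the loggers roughly halves the wall time when multiple writers are pushing sync writes:

```python
import threading
import time


def run(n_loggers: int, writers: int = 4, commits: int = 5,
        commit_time: float = 0.01) -> float:
    """Wall time for `writers` clients doing `commits` sync commits each,
    serialized across `n_loggers` log-device locks."""
    locks = [threading.Lock() for _ in range(n_loggers)]

    def writer(i: int) -> None:
        lock = locks[i % n_loggers]  # each writer pinned to one log device
        for _ in range(commits):
            with lock:                 # like zil_commit blocking other callers
                time.sleep(commit_time)  # stand-in for the device write latency

    start = time.monotonic()
    threads = [threading.Thread(target=writer, args=(i,)) for i in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

# One shared log serializes every commit; two logs let two commits
# proceed in parallel, roughly halving the wall time.
```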
The problem with striping is that you get into a lot of finicky management trying to keep the write pace even between the two devices, and you lose some potential speed because you have to block when one device runs a bit slow. Better to firehose data down to each device at whatever speed it can manage. They're still sync writes, but you can do the log commits in parallel, and that approach scales beyond two devices.
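The pacing argument above can be put in back-of-envelope form. The numbers are illustrative, not measured: a stripe moves at twice the slower device's rate, while independent log devices move at the sum of their individual rates, so any mismatch between the devices costs the stripe throughput:

```python
def stripe_throughput(rate_a: float, rate_b: float) -> float:
    """2-way stripe: every write needs both halves, so the slower
    device sets the pace for both."""
    return 2 * min(rate_a, rate_b)


def independent_throughput(rate_a: float, rate_b: float) -> float:
    """Independent log devices: each ingests at whatever speed it
    can manage, so rates simply add."""
    return rate_a + rate_b


# Hypothetical devices at 1000 and 800 MB/s:
print(stripe_throughput(1000, 800))       # 1600
print(independent_throughput(1000, 800))  # 1800
```

With perfectly matched devices the two formulas agree; the independent design only pulls ahead when the devices drift apart, which is exactly the "block when one device is a bit slower" cost.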