A lot of questions there, hope this helps.
Special vdevs:
You understand (I hope) that ZFS, like many advanced file systems, keeps an awful lot of data about how stuff is organised internally: pointers, indexes, search trees, tables, internal logs, additional data like ACLs and ownership associated with files, free space, the dedup table if that's in use... stuff like that. Collectively, that's the pool's "metadata" (data ABOUT the data you're storing in the pool). It tends to be small, random, and frequently changed/updated.

You also know (I hope) that ZFS pools are created from redundant sets of disks - vdevs - and you just add more of them if needed. ZFS spreads data (including metadata) across all the vdevs in the pool, and across the disks in each vdev, for efficiency and data protection. That usually means metadata is scattered across those HDDs as well. There is normally no way to control which disks any given data ends up on - you give ZFS a bunch of disks and their vdev structure, and ZFS will use them as it "thinks" best within its programming and tunables.
A couple of exceptions exist. Historically, you could dedicate some disks to read cache (L2ARC) or write log (ZIL/SLOG), and that would be respected. That means you can use SSDs for those high-performance tasks without needing a pure SSD pool.
As of TrueNAS 12, you can also dedicate some disks to metadata, and that'll be respected too, meaning metadata can be given smaller dedicated SSDs as well. Those are called "special vdevs" - perhaps "small data vdevs" would be a better name. Same result in the end: big user files on HDD, and read cache, write log, metadata, DDT and (optionally) tiny files on SSD for speed/efficiency.
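As a rough command-line sketch of those roles (device names are hypothetical, and on TrueNAS you'd normally do this through the web UI rather than the shell):

    # Add a mirrored special (metadata) vdev to an existing pool called "tank"
    zpool add tank special mirror /dev/ada4 /dev/ada5

    # The same pattern covers the other SSD roles mentioned above
    zpool add tank log mirror /dev/nvd0 /dev/nvd1    # SLOG (write log)
    zpool add tank cache /dev/nvd2                   # L2ARC (read cache)

    # Optionally send small file blocks (here, 32K and under) to the special vdev too
    zfs set special_small_blocks=32K tank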
Dedup:
To be fair, FreeNAS/TrueNAS/ZFS commentators have *always* said to avoid dedup, but to be equally fair they've said "unless you have the hardware for it". The problem is, this specific issue just never got noticed or appreciated. So that was the advice. It's not an expert-only feature, though. If you were on 1G you probably wouldn't have the problem. The combination of 10G + dedup hasn't really ever been appreciated - maybe because domestic 10G until very recently was so rare.
Dedup's use case is *extreme* data size reduction. In my case, for example, the maths goes (roughly) like this:
- 40 TB of data now; say 80 TB in a while. ZFS likes to run with quite a lot of free space, so that 80 TB should ideally be only about 60% of the pool. That's about 133 TB of pool capacity. I like to run 3-way mirrors, so that's about 400 TB of raw disk space. Double it because I also run a backup server: 800 TB of raw capacity in theory between them. 0.8 PB for a home server and backup? Ridiculous. Using, say, enterprise 8 TB disks? That's 100 disks at £200 each. Plus connectors/backplanes. Plus power costs.
- Enable dedup? Now it's 13 TB, about 1/3 of the size. But my future writes will also dedup more (they're more likely to already have copies in the existing 40 TB), so my deduped size won't double in 5 years. It might go up by 50%, say to 18 TB deduped. At 60% full and 3-way mirrors, that's 90 TB, or 11 disks. A maximum of 22 disks including a 2nd backup server. (The commands involved are sketched after this list.)
- This is when and why one uses dedup: not just to shave off a few tens of percent, but when storage cost and scale would otherwise be genuinely prohibitive.
- Other use cases are limited disk size, or limited bandwidth (historically one could cut data sent by a huge amount if replicating or backing up/restoring)
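To make that concrete, a minimal sketch of the commands involved (pool and dataset names are hypothetical; note dedup only affects data written after it's enabled):

    # Dedup is a per-dataset property
    zfs set dedup=on tank/backups

    # The DEDUP column of zpool list shows the achieved ratio
    zpool list tank

    # The DDT histogram - this is the table that has to stay fast,
    # which is what the RAM / special vdev discussion is all about
    zpool status -D tank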
The point is, this isn't because ZFS is wasteful. It's inherent in my data size, my choice of 3-way mirrors (tolerating 2 failures per set) as protection, and the fact that disks die/fail, so redundancy is needed across 2 servers in order to never need to worry about it. I could be using anything - ZFS, BTRFS, Windows Server - and I'd still probably need comparable disk counts for that level of safety.
Cost and SSD choices:
Being fair again, I don't think you *need* Optane. Good pro-am SSDs could be enough. I'm just not in the mood to buy stuff only to have to replace it, and the data isn't out there to confirm what I need, to predict my best uses and choices. I know I'm a very heavy user when busy, so I'd rather spend than risk issues. That's me, though.
But yes, good quality storage costs. Fast storage costs. Whatever you do, it can be cheap, fast, or reliable: pick any 2. Cheap and fast won't be reliable, cheap and reliable will skip premium fast components, and fast and reliable will cost. It's not cheap running a file server. I happen to like the benefits, and the peace of mind is huge.
Specific other questions:
Battery-backed RAM - Because SSDs are tricky to make fast (they need a lot of tricks to work efficiently!), some people have made PCIe cards carrying ordinary volatile RAM - but with a battery on board, either to stop the RAM losing its data, or to provide the 30 seconds needed to back it up to separate on-board NVRAM before power fades (it's restored when main power returns). In effect, an SSD with the speed and durability of ordinary motherboard RAM modules. Less common now, and usually expensive too ;-)
4x4 controllers? Look at things like the ASUS Hyper M.2 card - 4 M.2 slots on one PCIe card. But whatever you get, there's a twist you need to be aware of. When you put multiple PCIe devices on one slot, you need the slot to handle them as independent devices, not as one device. If the motherboard can do it, the feature you need in its firmware is called "bifurcation". That means the motherboard can be told in the BIOS to treat an x8 slot as x4 + x4, or an x16 slot as four x4, or x8 + two x4, or whatever. If the motherboard can't, you need a PCIe card that can handle 2 or 4 M.2 devices itself but make them all visible to the BIOS. Usually that means an expensive card with a PLX chip on board. Yes, these exist, but at silly prices sometimes.
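If you want to sanity-check whether bifurcation actually took effect, one rough test on TrueNAS CORE / FreeBSD (the command is real; the expected device count is my assumption for a 4-way card):

    # Each M.2 drive on the carrier card should enumerate as its own controller
    nvmecontrol devlist
    # With working bifurcation you'd expect nvme0..nvme3 to all appear;
    # seeing only one drive (or none) usually means the slot isn't split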
An alternative option is an HBA that can take U.2 NVMe devices. But these kinds of SSDs genuinely want 4 PCIe lanes each, because they can use them. It's not like SATA or SAS, where you can squeeze 10 devices onto 3 lanes or whatever and they'll all get enough bandwidth - NVMe SSDs really *can* saturate all their lanes in some cases. The problem is, however you fiddle it, a CPU/board only has so many lanes, and it's not a big number. So your 4-way U.2 HBA either needs 16 PCIe3 lanes, or, if it's squeezing them into 8 lanes, you won't get full bandwidth out of some good SSDs. So be aware and think about that aspect. HBAs that can handle U.2/M.2/NVMe are newish, but Broadcom's 9400 range is, I think, an example. Ask others for help in this area; I could be wrong.
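Roughly, the lane arithmetic behind that (PCIe 3.0 carries about 1 GB/s per lane, rounded):

    # one NVMe SSD at x4       ~  4 GB/s
    # four NVMe SSDs at x4     ~ 16 GB/s demanded
    # an x8 slot/HBA uplink    ~  8 GB/s available -> roughly half bandwidth under full load

    # On FreeBSD, pciconf -lc lists PCIe capabilities; look for "link x4(x4)"
    # under each nvme entry to see the negotiated width
    pciconf -lc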
Your best bet for the Samsung is probably a cheap single or dual M.2 adapter, depending on what slots your board has and any support for bifurcation.
ZFS pools - the DISKS don't really "nest". Datasets can, but that's the logical structure within the pool's files and data, not the pool's physical design. The disks are in vdevs, and vdevs make up the pool; that's as deep as vdev and disk nesting goes.
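A quick illustration of that difference (names hypothetical):

    # Datasets nest logically, like directories:
    zfs create tank/media
    zfs create tank/media/photos    # a child dataset inside tank/media

    # Disks only go one level deep: disk -> vdev -> pool
    zpool status tank               # shows each vdev as a flat group of disks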
The 12-BETA web UI warns against using mirrors on some vdevs and raidz on others. But SLOG/ZIL, special (metadata) and probably L2ARC should all be non-raidz if present. If the web UI complains, that's wrong advice for these vdev types, hopefully fixed by the final release. Check the box marked "ignore this warning" and use mirrors for those anyway.
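For example, here's a perfectly sensible layout that would trigger that kind of warning (device names hypothetical; at the command line, ZFS may similarly ask for -f because the redundancy types differ):

    # raidz2 for bulk data, mirrors for the support vdevs
    zpool create tank \
        raidz2  /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3 /dev/ada4 /dev/ada5 \
        special mirror /dev/nvd0 /dev/nvd1 \
        log     mirror /dev/nvd2 /dev/nvd3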
I wouldn't say it's "just a setting". You run a very uncommon setup - a 10G+ LAN, with dedup. It's that combination that is the issue. If the problem has arisen commercially, I've never heard of it. Or they solved it other ways. But it's not "just" a setting. You could equally say it's because 10G is a bad choice for networking; switching down to 1G would probably fix it, too. Or not having all-SSDs. The point is, you can fix this numerous ways, because it's an oversight in older ZFS along the lines of "if A and B and C, you can get this kind of pathological behaviour emerging; preventing or disabling any one of A, B, or C will prevent the issue."
I think I prefer to think of myself as driven by necessity. I didn't have the easy option of not looking for it. Glad it helped!