Okay, I'm back for a little bit of additional information.
The original post was written back in the days when FreeNAS 9.1 was the new hotness - since then, we've had a lot of change in how the write throttle performs, and how you can tune it. The previous tunables (write_limit_*) and behavior of the throttle ("nothing, nothing, nothing, ALL WRITES BLOCKED LOL") are quite thankfully gone, and what we've since been blessed with is a much more gradual out-of-the-box setting as well as a much more tunable throttle curve. But the core issue of "why am I getting write throttling" remains -
your pool vdevs are too slow for your network ingest speed.
The comments below assume you've made no changes to your tunables. I'll talk about those later - we're sticking with the defaults for this exercise.
(Be Ye Warned; Here Thar Be Generalizations.)
The maximum amount of "dirty data" - data stored in RAM or SLOG, but not yet committed to the pool - is 10% of your system RAM or 4GB, whichever is smaller. So even a system with, say, 192GB of RAM (oh hi @2nd-in-charge) will still by default have a 4GB cap on how much SLOG it can effectively use.
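If you want to sanity-check that on your own box, the math is trivial - here's a quick Python sketch, assuming the stock OpenZFS defaults (10% of RAM via zfs_dirty_data_max_percent, hard-capped at 4GiB by zfs_dirty_data_max_max; check your build, these are tunables, not constants):

```python
GiB = 1024 ** 3

# Assumed stock OpenZFS defaults:
DIRTY_DATA_MAX_PERCENT = 10      # zfs_dirty_data_max_percent
DIRTY_DATA_MAX_MAX = 4 * GiB     # zfs_dirty_data_max_max

def dirty_data_max(ram_bytes: int) -> int:
    """Effective dirty data ceiling: 10% of RAM or 4GiB, whichever is smaller."""
    return min(ram_bytes * DIRTY_DATA_MAX_PERCENT // 100, DIRTY_DATA_MAX_MAX)

print(dirty_data_max(192 * GiB) / GiB)           # 4.0 - the 192GB system is still capped at 4GiB
print(round(dirty_data_max(16 * GiB) / GiB, 2))  # 1.6 - a 16GB system only gets 10% of RAM
```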
ZFS will start queuing up a transaction group once you have either 64MB of dirty data or 5 seconds have passed.
(In newer OpenZFS builds - so TN12 and later, and all SCALE builds - this "tripwire" value is 20% of your maximum dirty data, which with the default 4GB cap works out to around 820MB. The same 5-second timer still applies; it's whichever is hit first!)
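For the numerically inclined, here's the same tripwire math as a sketch - the parameter names (zfs_dirty_data_sync on older builds, zfs_dirty_data_sync_percent on newer ones, zfs_txg_timeout for the timer) are the OpenZFS tunables as I understand them; defaults may vary by release:

```python
GiB, MiB = 1024 ** 3, 1024 ** 2

dirty_data_max = 4 * GiB                   # the cap from above

old_tripwire = 64 * MiB                    # legacy zfs_dirty_data_sync default
new_tripwire = dirty_data_max * 20 // 100  # zfs_dirty_data_sync_percent = 20

print(round(old_tripwire / MiB, 1))  # 64.0
print(round(new_tripwire / MiB, 1))  # 819.2 - the "around 820MB" figure
# Either way, the 5-second zfs_txg_timeout forces a txg if the byte
# threshold isn't reached first.
```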
The write throttle starts to kick in at 60% of your maximum dirty data - with the default 4GB cap, that's 2.4GB. The curve is exponential, and the midpoint is defaulted to 500µs of delay per write, or about 2000 IOPS - the early stages of the throttle apply mere nanoseconds of delay, whereas getting yourself close to 100% full will add literally tens of milliseconds of artificial slowness. But because it's a curve and not a static On/Off, you'll equalize your latency-vs-throughput numbers at around the natural capabilities of your vdevs.
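If you want to picture the shape of that curve, current OpenZFS implements it roughly as delay = zfs_delay_scale * (dirty - min) / (max - dirty), applied per write once you cross zfs_delay_min_dirty_percent. Here's a rough sketch with the stock defaults (60% start, 500µs scale, 4GB cap) - treat the exact numbers as illustrative, not gospel:

```python
GiB = 1024 ** 3

def write_delay_ns(dirty: int,
                   dirty_max: int = 4 * GiB,      # zfs_dirty_data_max (capped)
                   min_dirty_pct: int = 60,       # zfs_delay_min_dirty_percent
                   delay_scale_ns: int = 500_000  # zfs_delay_scale (500µs)
                   ) -> float:
    """Approximate per-write throttle delay:
    delay = zfs_delay_scale * (dirty - min) / (max - dirty)."""
    min_dirty = dirty_max * min_dirty_pct // 100
    if dirty <= min_dirty:
        return 0.0               # below 60% full: no artificial delay at all
    if dirty >= dirty_max:
        return float("inf")      # at 100% full: writes stall outright
    return delay_scale_ns * (dirty - min_dirty) / (dirty_max - dirty)

for pct in (61, 70, 80, 90, 95, 99):
    delay_us = write_delay_ns(int(4 * GiB * pct / 100)) / 1000
    print(f"{pct}% full -> {delay_us:8.1f} µs per write")
# 61% is ~13µs, 80% (the midpoint) is ~500µs, 99% is ~19,500µs - tens of ms.
```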
Let's say you have a theoretical 8Gbps network link (shout out to my FC users), simply because that works out to a nice round 1GB/s of writes that can come flying into the array. There are a dozen spinning disks set up in two 6-disk RAIDZ2 vdevs, and it's a nice pristine array. Each drive has lots of free space and ZFS will happily serialize all the writes, letting your disks write at a nice steady 100MB/s each. Four data disks in each RAIDZ2 vdev, two vdevs - 800MB/s of total vdev speed.
The first second of writes comes flying in - 1GB of dirty data is now on the system. ZFS has already forced a transaction group due to the 64MB/20% tripwire; let's say it's started writing immediately. But your drives can only drain 800MB of that 1GB in one second. There's 200MB left.
Another second of writes shows up - 1.2GB in the queue. ZFS writes another 800MB to the disks - 400MB left.
See where I'm going here?
After twelve seconds of sustained writes, the amount of outstanding dirty data hits the 60% limit to start throttling, and your network speed drops. Maybe it's 990MB/s at first. But you'll see it slow down, down, down, and then equalize at a number roughly equal to the 800MB/s your disks are capable of.
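Here's that fill-and-drain arithmetic as a toy model (and it really is a toy - real ZFS drains in discrete txgs, not a smooth stream - but the 1GB/s ingest and 800MB/s drain are just the numbers from this example):

```python
def seconds_until_throttle(ingest_mb_s: float, drain_mb_s: float,
                           dirty_max_mb: float = 4096,       # 4GB default cap
                           throttle_start_pct: float = 60) -> float:
    """Toy model: dirty data grows at (ingest - drain) MB/s until it hits
    the throttle threshold (60% of the dirty data cap)."""
    backlog_rate = ingest_mb_s - drain_mb_s
    if backlog_rate <= 0:
        return float("inf")   # vdevs keep up with the network; you never throttle
    return (dirty_max_mb * throttle_start_pct / 100) / backlog_rate

print(seconds_until_throttle(1000, 800))  # ~12.3s - the "twelve seconds" above
```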
That's what happens when your disks are shiny, clean, and pristine. What happens a few months or a year down the road, if you've got free space fragmentation and those drives are having to seek all over? They're not going to deliver 100MB/s each - you'll be lucky to get 25MB/s.
One second of writes now causes 800MB to be "backed up" in your dirty data queue. In only three seconds, you're now throttling; and you're going to throttle harder and faster until you hit the 200MB/s your pool is capable of.
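Same toy model, fragmented-pool numbers:

```python
print(seconds_until_throttle(1000, 200))  # ~3.1s once the disks are down to 25MB/s apiece
```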
So what does all this have to do with SLOG size?
A lot, really. If your workload pattern is very bursty - your SLOG "fills up" to a certain level without ever significantly throttling, and then ZFS gets enough idle time to flush all of that outstanding dirty data to disk - you can have an array that absorbs several GB of writes at line speed without having to buy vdev hardware capable of sustaining that level of performance. If you know your data pattern, you can allow just enough space in that dirty data value to soak it all up quickly into SLOG, and then lazily flush it out to disk. It's magical.
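To put hedged numbers on "know your data pattern": with the defaults you have roughly 2.4GB of dirty data headroom before the throttle even starts, and the time to lazily flush a burst is just its size divided by your vdev speed. Using a made-up 2GB burst against the tired 200MB/s pool from earlier:

```python
GiB, MiB = 1024 ** 3, 1024 ** 2

throttle_free = 0.6 * 4 * GiB      # ~2.4GiB absorbed before any throttling
burst = 2 * GiB                    # hypothetical burst arriving at line speed
drain = 200 * MiB                  # the fragmented pool's sustained write speed

print(burst < throttle_free)       # True - this burst lands at full network speed
print(round(burst / drain, 1))     # 10.2 - seconds of quiet time needed to flush it
```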
On the other hand, if your workload involves sustained writes with hardly a moment's rest, you simply need faster vdevs. A larger dirty data/SLOG only kicks the can down the road; eventually it will fill up and begin throttling your network down to the speed of your vdevs. If your vdevs are faster than your network? Congratulations, you'll never throttle. But now you aren't using your vdevs to their full potential. You should upgrade your network. Which means you need faster vdevs. Repeat as desired/financially practical.
This was a longer post than I intended, but the takeaway is fairly simple: the default write throttle tunables cap your useful SLOG (dirty data) at 4GB. Vdev speed is still the more important factor, but if you know your data and your system well enough, you can cheat a little bit.