Some insights into SLOG/ZIL with ZFS on FreeNAS

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
they (all of a sudden) care more about "speed" (specifically latency) than about the (small) possible data loss
You can work with this.

The lack of power-loss protection (PLP) on the drive is (or rather will be) directly related to poor SLOG performance; most drives are very slow if you force them to write directly to NAND. Drives with PLP can "sort of ignore" the command to flush their cache and continue using RAM to accelerate writes.
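If you want to see how a particular drive handles that forced-flush pattern, FreeBSD's diskinfo has a sync-write test that approximates SLOG traffic. Treat this as a quick sanity check - the device name below is a placeholder, and the -w test is destructive, so only point it at a drive with nothing on it:

Code:
# rough SLOG suitability check (destructive!); replace da5 with your candidate SSD
diskinfo -wS /dev/da5

Drives with PLP tend to post low, flat latencies across the block sizes; consumer drives without it usually fall off a cliff.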

Even a used Intel DC S3700 200G should be around the USD$50 mark and provide significantly more performance.
 

poldas

Contributor
Joined
Sep 18, 2012
Messages
104
Sorry if you've written about this already.
I thought about using two SSDs, one for ZIL and one for L2ARC; maybe I'm wrong and I should mirror or stripe them instead. What is the best practice, and what disk size?
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Sorry if you've written about this already.
I thought about using two SSDs, one for ZIL and one for L2ARC; maybe I'm wrong and I should mirror or stripe them instead. What is the best practice, and what disk size?
L2ARC and SLOG are two very different workloads, so the SSDs you would use for them would have very different characteristics.

L2ARC devices should be reasonably large, and deliver good random read performance with acceptable writes. They don't need power-loss protection circuitry or great steady-state write speeds.

SLOG devices don't need to be large, but often end up being large as a "side effect" since bigger drives tend to have higher write throughput and better endurance (to a point) - they also need to have power-loss protection circuitry, and good sustained write speeds.

If you already have the two SSDs, please post the model so that we can determine if it will be a viable SLOG device or if it's better left to an L2ARC. The workload you intend to put on the pool will also determine which of the two you would benefit from more; generally speaking, it's better to max out your RAM first before adding L2ARC, and if you're doing a lot of sync writes then good SLOG is almost mandatory.
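For context, both of these are just auxiliary vdevs attached to the pool. The FreeNAS GUI handles the partitioning and attachment for you, but at the zpool level it boils down to something like the sketch below - the pool and device names are made up purely for illustration:

Code:
# mirrored SLOG from two small, power-loss-protected SSDs
zpool add tank log mirror da3 da4
# single L2ARC device - no redundancy needed, it's only a read cache
zpool add tank cache da5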
 

poldas

Contributor
Joined
Sep 18, 2012
Messages
104
I don't have any SSDs yet. I thought about buying a new one, but not a write/read-intensive model - just a cheap 120 GB SSD, for example the Crucial CT120BX500SSD1. Maybe it isn't a good idea?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I don't have any SSDs yet. I thought about buying a new one, but not a write/read-intensive model - just a cheap 120 GB SSD, for example the Crucial CT120BX500SSD1. Maybe it isn't a good idea?
Those are definitely not good SLOG devices. They're fine for L2ARC, but you may not need (or be able to use) an L2ARC depending on your workload and system specs.
 

2nd-in-charge

Explorer
Joined
Jan 10, 2017
Messages
94
2) A small pool on a system with a lot of memory, such as one where a designer has included lots of ARC for maximum responsiveness, can counter-intuitively perform very poorly due to the increased default size for transaction groups. In particular, if you have a system with four disks, each of which is capable of writing at 150MB/sec, and the pool can actually sustain 600MB/sec, that still doesn't fit well with a system that has 32GB of RAM, because it allows up to 4GB per txg, which is greater than the 3GB per 5 seconds that the pool can manage.

I'm wondering how it applies to our new system. We have 192GB of RAM (a 24GB transaction group?) and a 12-drive pool configured as 2x 6-drive RAIDZ2 vdevs. The pool would sustain 1200MB/s at best, or 6GB in 5 seconds.
Do I understand it correctly that I need to:
1. Tune the transaction group size down from the default 24GB to something less than 6GB (e.g. 4GB)? A pointer to how to do that would be greatly appreciated.
2. Overprovision the SLOG device to 2x4GB = 8GB?
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm wondering how it applies to our new system.
The original post was written in 2013, and a lot has changed since that time. It could definitely use an update.

Pools with "too much RAM for their vdevs" aren't affected as badly anymore because of the gradual write throttle, but they can still be hurt.

Remind me to loop back on this when I have a full keyboard and not just a phone. Is there a general build thread you have for your system?
 

2nd-in-charge

Explorer
Joined
Jan 10, 2017
Messages
94
Is there a general build thread you have for your system?
Yes, but I've been mostly talking to myself lately in that thread, and didn't ask about SLOG. The full h/w details are now in my signature as well.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Okay, I'm back for a little bit of additional information.

The original post was written back in the days when FreeNAS 9.1 was the new hotness - since then, we've had a lot of change in how the write throttle performs, and how you can tune it. The previous tunables (write_limit_*) and behavior of the throttle ("nothing, nothing, nothing, ALL WRITES BLOCKED LOL") are quite thankfully gone, and what we've since been blessed with is a much more gradual out-of-the-box setting as well as a much more tunable throttle curve. But the core issue of "why am I getting write throttling" remains - your pool vdevs are too slow for your network ingest speed.

The below comments should be considered for users who have made no changes to their tunables. I'll talk about those later - we're sticking with the defaults for this exercise.

(Be Ye Warned; Here Thar Be Generalizations.)

The maximum amount of "dirty data" - data stored in RAM or SLOG, but not committed to the pool - is 10% of your system RAM or 4GB, whichever is smaller. So even a system with, say, 192GB of RAM (oh hi @2nd-in-charge ) will still by default have a 4GB cap on how much SLOG it can effectively use.
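If you want to verify what your own box settled on, these are plain sysctls on FreeNAS/FreeBSD - a read-only sanity check, nothing here changes anything:

Code:
# current dirty data cap, in bytes (defaults to min(10% of RAM, 4GB))
sysctl vfs.zfs.dirty_data_max
# the percentage and hard ceiling that produce that default
sysctl vfs.zfs.dirty_data_max_percent
sysctl vfs.zfs.dirty_data_max_max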

ZFS will start queuing up a transaction when you have either 64MB of dirty data or 5 seconds have passed.
(In newer OpenZFS builds - so TN12 and later, and all SCALE builds - this "tripwire" value is 20% of your maximum dirty data or around 820MB. The same 5-second timer still applies; it's whichever is hit first!)

The write throttle starts to kick in at 60% of your maximum dirty data or 2.4GB. The curve is exponential, and the midpoint is defaulted to 500µs or 2000 IOPS - the early stages of the throttle are applying nanoseconds of delay, whereas getting yourself close to 100% full will add literally tens of milliseconds of artificial slowness. But because it's a curve and not a static On/Off, you'll equalize your latency-vs-throughput numbers at around the natural capabilities of your vdevs.
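All of those knobs are visible as sysctls as well. Treat the names below as the FreeBSD/FreeNAS 11-era spellings - a couple have been renamed in newer OpenZFS - and leave them at defaults unless you know exactly why you're changing them:

Code:
# dirty data level (bytes) that triggers a txg sync; replaced by a
# percentage-based tunable in newer OpenZFS
sysctl vfs.zfs.dirty_data_sync
# percent of dirty_data_max where the throttle starts (default 60)
sysctl vfs.zfs.delay_min_dirty_percent
# midpoint of the delay curve, in nanoseconds (500000 = 500us)
sysctl vfs.zfs.delay_scale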

Let's say you have a theoretical 8Gbps network link (shout out to my FC users) simply because that divides quite nicely into 1GB/s of writes that can come flying into the array. There's a dozen spinning disks set up in two 6-disk RAIDZ2 vdevs, and it's a nice pristine array. Each drive has lots of free space and ZFS will happily serialize all the writes, letting your disks write at a nice steady 100MB/s each. Four disks in each vdev, two vdevs - 800MB/s total vdev speed.

The first second of writes comes flying in - 1GB of dirty data is now on the system. ZFS has already forced a transaction group due to the 64MB/20% tripwire; let's say it's started writing immediately. But your drives can only drain 800MB of that 1GB in one second. There's 200MB left.

Another second of writes shows up - 1.2GB in the queue. ZFS writes another 800MB to the disks - 400MB left.

See where I'm going here?

After twelve seconds of sustained writes, the amount of outstanding dirty data hits the 60% limit to start throttling, and your network speed drops. Maybe it's 990MB/s at first. But you'll see it slow down, down, down, and then equalize at a number roughly equal to the 800MB/s your disks are capable of.
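If you want to double-check that twelve-second figure, the back-of-the-envelope math is simple enough to script; the numbers below come straight from the example above:

Code:
# dirty data backlog: 1000MB/s coming in, 800MB/s draining to the vdevs
ingest=1000; drain=800; backlog=0; throttle=2400   # MB; 2400 = ~60% of 4GB
for sec in $(seq 1 15); do
    backlog=$((backlog + ingest - drain))
    [ "$backlog" -ge "$throttle" ] && echo "throttle starts at ${sec}s (${backlog}MB dirty)" && break
done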

That's what happens when your disks are shiny, clean, and pristine. What happens a few months or a year down the road, if you've got free space fragmentation and those drives are having to seek all over? They're not going to deliver 100MB/s each - you'll be lucky to get 25MB/s.

One second of writes now causes 800MB to be "backed up" in your dirty data queue. In only three seconds, you're now throttling; and you're going to throttle harder and faster until you hit the 200MB/s your pool is capable of.

So what does all this have to do with SLOG size?

A lot, really. If your workload is very bursty - it fills your SLOG/dirty data up to a certain level but never throttles significantly, and then gives ZFS enough time to flush all of that outstanding dirty data to disk - you can have an array that absorbs several GB of writes at line speed without having to buy enough vdev hardware to sustain that level of performance. If you know your data pattern, you can allow just enough space in that dirty data value to soak it all up quickly into SLOG and then lazily flush it out to disk. It's magical.

On the other hand, if your workload involves sustained writes with hardly a moment's rest, you simply need faster vdevs. A larger dirty data/SLOG only kicks the can down the road; eventually it will fill up and begin throttling your network down to the speed of your vdevs. If your vdevs are faster than your network? Congratulations, you'll never throttle. But now you aren't using your vdevs to their full potential. You should upgrade your network. Which means you need faster vdevs. Repeat as desired/financially practical.

This was a longer post than I intended, but the takeaway is fairly simple: the default dirty data cap and write throttle limit your usable SLOG to 4GB. Vdev speed is still the more important factor, but if you know your data and your system well enough, you can cheat a little bit.
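And if you do decide to "cheat" by raising the dirty data cap to match a bigger SLOG, it's a regular sysctl - the value below (8GB, in bytes) is just an example rather than a recommendation, and you'd persist it as a sysctl-type tunable in the FreeNAS GUI instead of typing it at the shell every boot:

Code:
# example only: raise the dirty data cap to 8GB - make sure your vdevs and
# SLOG can actually absorb this before touching it
sysctl vfs.zfs.dirty_data_max=8589934592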
 
Last edited:

2nd-in-charge

Explorer
Joined
Jan 10, 2017
Messages
94
The maximum amount of "dirty data" - data stored in RAM or SLOG, but not committed to the pool - is 10% of your system RAM or 4GB, whichever is larger. So even a system with, say, 192GB of RAM (oh hi @2nd-in-charge ) will still by default have a 4GB cap on how much SLOG it can effectively use.

But 10% of 192GB is 19.2GB. Is my "system RAM" value something other than 192GB?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
But 10% of 192GB is 19.2GB. Is my "system RAM" value something other than 192GB?
Derp. I meant the exact opposite - 10% or 4GB, whichever is smaller. Meant to write "up to 4GB" and somehow got the wires crossed. Blame a number of consecutive 16-hour days.
 

2nd-in-charge

Explorer
Joined
Jan 10, 2017
Messages
94
10% or 4GB, whichever is smaller.
OK, thanks. This makes sense.
Sounds like my plan is pretty simple: overprovision the S3700s to 8GB and don't tune anything.
 

2nd-in-charge

Explorer
Joined
Jan 10, 2017
Messages
94
So I overprovisioned to 8GB and added a two-drive mirrored log to the pool.
"Thank you", said ZFS, and created a 2GB swap partition on each drive.
Apparently, to get an 8GB mirrored SLOG you need to present 10GB to the OS.

4GB cap on how much SLOG it can effectively use.
I believe you, and my vfs.zfs.dirty_data_max is indeed 4GB.
Why does the FreeNAS guide say "ZFS currently uses 16 GiB of space for SLOG."?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
So I overprovisioned to 8GB and added a two-drive mirrored log to the pool.
"Thank you", said ZFS, and created a 2GB swap partition on each drive.
Apparently, to get an 8GB mirrored SLOG you need to present 10GB to the OS.

Technically that's a FreeNAS thing - it creates that 2GB swap/buffer space on every drive. Maybe I'll put in a bug/feature request to have that disabled or significantly reduced on SLOG or small devices, e.g. "if the device is smaller than 16GB, don't make a swap partition".
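If you're curious where the space went, gpart will show what got created - on a GUI-added log device you'll typically see a small freebsd-swap partition followed by the freebsd-zfs partition that becomes the actual log vdev (device name below is just an example):

Code:
# inspect the partition layout FreeNAS created on the log device
gpart show da13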

I believe you, and my vfs.zfs.dirty_data_max is indeed 4GB.
Why does the FreeNAS guide say "ZFS currently uses 16 GiB of space for SLOG."?

I'm going to guess the guide was using inaccurate or outdated information. There's a whole lot of "rules of thumb" and "guesstimates" that get bandied around with ZFS that are founded in misunderstood or outdated information, like the "1GB of RAM per TB of data" thing. The divide between Oracle ZFS/OpenZFS and the updates over time don't help clarify things either. Wonder if there's a process by which I can submit documentation corrections.

What's the original size of those S3700s by the way? If they're a size we haven't seen in the benchmark thread I'd appreciate if you could pull one out of the mirror and run the test against it. (Also, you might want to switch them to 4Kn sectors vs. 512b for improved results - better yet, benchmark before and after you do that switch!)
 

2nd-in-charge

Explorer
Joined
Jan 10, 2017
Messages
94
Technically that's a FreeNAS thing - it creates that 2GB swap/buffer space on every drive. Maybe I'll put in a bug/feature request to have that disabled or significantly reduced on SLOG or small devices, e.g. "if the device is smaller than 16GB, don't make a swap partition".

The FreeNAS GUI (System -> Advanced -> Swap Size in GiB - hover over the question mark) specifically says that setting "Does not affect log or cache devices as they are created without swap." Except when you create a mirrored SLOG...

I managed to remove that swap partition from the SLOG drives by removing the SLOG from the pool, temporarily setting the swap size to 0, adding the SLOG mirror again, and then reverting the swap size back to 2GiB. BTW, I couldn't remove the mirrored SLOG via the GUI, but
Code:
zpool remove 'pool' mirror-2
worked.

Now I'm seriously wondering whether I should do it the other way around. Surely my S3700 SSDs are better swap devices than HDDs. What if I set Swap Size to 16GiB (or even 64GiB, why not), add the mirrored SLOG, then set it back to 0 (for future HDDs), and delete all swap partitions from the existing HDDs? Not that swap is ever going to be used on a system with 192GB of RAM that isn't under much stress, but if it is, it might as well be on SSDs.

What's the original size of those S3700s by the way? If they're a size we haven't seen in the benchmark thread I'd appreciate if you could pull one out of the mirror and run the test against it.

They are 100GB; I'm sure they've been tested in that thread already. When I tested mine, I was getting slightly lower write speeds for large blocks (197-199MB/s), which could be due to the Dell firmware (DL06) or the SAS2208 controller.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The FreeNAS GUI (System -> Advanced -> Swap Size in GiB - hover over the question mark) specifically says that setting "Does not affect log or cache devices as they are created without swap." Except when you create a mirrored SLOG...

Well then that sounds like a bug on its own if it's creating swap where it says it won't.

I managed to remove that swap partition from the SLOG drives by removing the SLOG from the pool, temporarily setting the swap size to 0, adding the SLOG mirror again, and then reverting the swap size back to 2GiB. BTW, I couldn't remove the mirrored SLOG via the GUI, but
Code:
zpool remove 'pool' mirror-2
worked.

It let you create mirrored swap but not remove it?

Now I'm seriously wondering whether I should do it the other way around. Surely my S3700 SSDs are better swap devices than HDDs. What if I set Swap Size to 16GiB (or even 64GiB, why not), add the mirrored SLOG, then set it back to 0 (for future HDDs), and delete all swap partitions from the existing HDDs? Not that swap is ever going to be used on a system with 192GB of RAM that isn't under much stress, but if it is, it might as well be on SSDs.

Swap being used is basically a warning that "you have something amiss in your system", because the tunables should give a small amount of headroom in main RAM. Normally I only see it here when people are running a bunch of jails or plugins and haven't adjusted their arc_max target. Regarding disabling swap on your main data drives, the second function of that partition is to let you replace an existing drive with one that has a slightly different (usually smaller) LBA count, so I'd leave it alone. Do keep an eye on the swap value, though, and if it's ever nonzero you can look into it.
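Checking it from the shell is a one-liner, and the GUI reporting graphs show the same thing:

Code:
# list swap devices and how much of each is currently in use
swapinfo -h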

They are 100GB; I'm sure they've been tested in that thread already. When I tested mine, I was getting slightly lower write speeds for large blocks (197-199MB/s), which could be due to the Dell firmware (DL06) or the SAS2208 controller.

Try the switch to 4Kn sector size I mentioned earlier; it might improve the low end a little bit, since there won't have to be as many I/O requests going over the wire.

And I hope you mean SAS2008 or SAS2308, because the SAS2208 is a RAID chip. Edit: Oh dear. Is taking that M5110 out and shooting it, er, replacing it with an M1015 or other HBA an option?
 

2nd-in-charge

Explorer
Joined
Jan 10, 2017
Messages
94
It let you create mirrored swap but not remove it?
Yep, a single SLOG device could be added and removed. Two drives added at the same time were partitioned (when the Swap Size was at the default 2GiB) and added as a mirror. I could offline the individual drives, but not remove the log vdev.
Just before leaving work today I removed the log vdev again (using zpool remove), but couldn't run the SLOG test diskinfo -Sw ("operation not permitted"). I crudely re-plugged one of the drives, and then it worked on that drive.

Try the switch to 4Kn sector size I mentioned earlier; it might improve the low end a little bit, since there won't have to be as many I/O requests going over the wire.
Thank you. Will try both and if there is anything interesting post the results in that thread.

And I hope you mean SAS2008 or SAS2308, because the SAS2208 is a RAID chip.
No, it's not a typo. The SAS2208 is on the IBM MegaRAID M5110 card that came with the server. It is set to JBOD mode. FreeNAS defaults to using the mrsas driver (rather than the older mfi) and passes all info to camcontrol. The drives show up as da12 (SanDisk SSD Plus boot drive), da13 and da14 (the S3700s). smartctl works properly on all the drives I've plugged in so far (three types of SSDs). Are there any known issues with this configuration?

Edit: Oh dear. Is taking that M5110 out and shooting it, er, replacing it with an M1015 or other HBA an option?
It's an option, although I'm a bit worried that the IBM server will complain if I put a different card in the storage slot (like Dell servers do). And the M5110 is free :)
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Well, there is this page from a guy who claimed to have flashed an onboard SAS2208 into a SAS2308:
http://mywiredhouse.net/blog/flashing-lsi-2208-firmware-use-hba/

If it's passing the drives and SMART status through, and you've validated that it isn't interfering (e.g. trying to put writeback cache or similar in the way)... it's probably okay? A throughput impact might be the extent of what you see.
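A quick way to spot-check that passthrough from the shell (adjust the device names to whatever your system shows):

Code:
# drives should appear as plain da devices behind the controller
camcontrol devlist
# SMART identity and health should come through end-to-end
smartctl -i /dev/da13
smartctl -H /dev/da13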
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
That seems like a very bad idea.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
That seems like a very bad idea.
Worse than using a SAS2208 in JBOD mode? Possibly. Unsupported? Definitely.

I don't know if IBM System x servers have a "locked" storage slot like some Dell R-series units do, but a true HBA is definitely the best solution.
 