How do I reduce the size of the ZIL in RAM, and how do I tune to avoid write stalling?


logan893

Dabbler
Joined
Dec 31, 2015
Messages
44
I'm putting together a FreeNAS server for home use and I'm looking to tune the write cache to minimize stalls on continuous writes, but I'm having trouble finding correct information for the latest release, FreeNAS 9.3.1.

My aim is to allow a steady stream of writes without lockups. I'd rather see a gradual, smooth reduction in write speed than bursts of writes followed by a 4-second interruption.

To achieve this I assume I should tune the ZIL (reduce the in-RAM portion) to hold an estimated 5 seconds (vfs.zfs.txg.timeout) of writes at the maximum expected burst rate. I also plan to add an SLOG (SSD) with a partition size in the range of 2-4 times the size of the ZIL.
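
As a rough sketch of the numbers I'm assuming (a single 1 Gbps link and a 5-second txg window):

    # 1 Gbps is roughly 125 MB/s of incoming writes
    echo $((125 * 5)) MB per transaction group   # ~625 MB of buffered writes per txg
    # an SLOG only ever needs to cover a couple of outstanding transaction groups,
    # so even a few GB of SSD partition would be generous for this workload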

My server will have one (potentially two) 1Gbps interfaces to the physical LAN, serving 2-3 clients, and it will run at least one Windows VM on the same host. FreeNAS will have access to somewhere around 12-20 GB of RAM, and my pool consists of 2x 4TB Seagate NAS HDDs. If performance is decent enough I will consider installing games on the NAS and loading them from one of the other computers.

Everything I read says 1/8 of RAM is devoted to the ZIL by default, which I expect will be much more than I need in my case. I understand that previously you could tune the size of the ZIL in RAM using vfs.zfs.write_limit_override.

Now that vfs.zfs.write_limit_override is deprecated, is there no longer any way in FreeNAS 9.3.1 to reduce the size of the ZIL in RAM?

Are there any official best-practice resources for tuning to avoid write stalling? On the forum I've mostly found more or less random (and often futile) attempts at tuning parameters, with no real explanation of what the parameters do or why they were chosen.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
The newer versions of ZFS do a better job of learning how the pool performs, but I still dump the transaction groups every second. This seems to help quite a bit with both the learning process and also handling unexpected burst traffic. It definitely reduces the peak throughput a little bit, but that's never been as much of an issue to me as a filer that goes catatonic.
 

logan893

Dabbler
Joined
Dec 31, 2015
Messages
44
Seems I'll need to run some real-world tests once my disks are through their stress testing.

I'll start with "vfs.zfs.txg.timeout=1" and take it from there. (I assume this is the parameter you're referring to.)
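
My plan for trying this out from the shell is something like the following (I'm assuming it's a runtime sysctl; to make it persistent in FreeNAS I'd presumably add it as a Tunable of type "sysctl" in the GUI rather than edit files by hand):

    # apply immediately at runtime (reverts on reboot)
    sysctl vfs.zfs.txg.timeout=1
    # verify the current value
    sysctl vfs.zfs.txg.timeout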

I was hoping to also reduce the ZIL to give more space to the ARC, and potentially to free up RAM to further motivate the addition of a small L2ARC (on the order of 40-80 GB).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I'm not sure what you mean by "reduce the ZIL"; the ZIL is always whatever size it needs to be to record the current and previous transaction groups. If you mean reducing the transaction group size, no, you probably don't want to do that; just shrink the window instead.
 

logan893

Dabbler
Joined
Dec 31, 2015
Messages
44
It's not unlikely that my understanding of the ZIL is lacking. I've not yet found any detailed information about the newer parameters that govern the size of the ZIL, or about most other ZFS tunables.

What I've gathered is that FreeNAS limited (still limits?) the ZIL to 1/8 of total RAM. I'm assuming that was via vfs.zfs.write_limit_shift=3, a parameter which is now deprecated.

Is this limit a maximum that is only reached during bursts of data? In other words, if the space isn't needed for the ZIL it's used for ARC, but when it is needed some ARC may be flushed?
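
For what it's worth, I plan to check what my build actually exposes with something like this (my assumption is that the old knobs are simply absent on 9.3.1, so no output would confirm that):

    # list any remaining old-style write limit sysctls (I expect no output on 9.3.1)
    sysctl -a | grep write_limit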
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
First off, nothing has ever limited the size of the ZIL in the manner you're discussing. The size of the ZIL is inherently tied to the size of a transaction group, which has been 1/8th of system RAM in the past. The ZIL itself is an on-disk data structure and has nothing to do with system RAM. But if we follow the inferences, and remember that it covers the current and previous transaction groups, the size of the ZIL might be ballparked by taking 1/4th (one fourth, not one eighth) of system RAM.

But you're worrying about it the wrong way around. ZFS creates a transaction group in memory. Once it's "full", either by hitting the maximum size a txg is allowed to grow to or by hitting the time limit, it moves to a quiescing state momentarily and is then synced out to disk. The limits for these operations used to be fairly crappily defined, and I spent a lot of time fighting them with earlier versions of ZFS; see for example bug 1531. But you do want the transaction group to be as large as possible.

The biggest thing to note these days is that ZFS is much more adaptive. From my perspective, the txg.timeout value actually gives ZFS a goal for responsiveness. txg.timeout=5 is really kind of saying that you don't find it acceptable for I/O to pause for more than five seconds, even if the actual meaning is something else entirely. I don't want my filer ever to become that unresponsive, so I use txg.timeout=1, which means my transaction groups are smaller and I'm potentially flushing stuff to disk that might have been updated in memory instead.

But the size of the transaction group is no longer a big deal. I've been able to catch ZFS unaware... reboot a filer, totally quiescent, with a lot of RAM and a slow pool, then suddenly write as much as you can. You will probably overwhelm the system and get a transaction group that takes substantially longer than five seconds to write out. But it learns, very quickly, and doesn't make that mistake a second time. It's much better than it used to be and there doesn't seem to be a need to worry about it.
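
For the curious, the reworked write throttle exposes its limits as sysctls. The names below are off the top of my head and may vary a bit between releases, so check what your own build reports:

    # how much dirty (not-yet-synced) data ZFS will accumulate before throttling writers
    sysctl vfs.zfs.dirty_data_max vfs.zfs.dirty_data_max_percent
    # the dirty-data level (as a percent of the max) at which ZFS starts delaying writes
    sysctl vfs.zfs.delay_min_dirty_percent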
 

logan893

Dabbler
Joined
Dec 31, 2015
Messages
44
jgreco, thanks a lot for your explanations and guide to further reading. I read your bug report, and I also re-read your SLOG/ZIL sticky to clear up my initial misunderstandings.

I realize I was mistakenly combining two concepts, the transaction group and the ZIL, and I saw them as more interlinked than they really are. Your bug report #1531 details what I initially saw during testing of FreeNAS 9.3 in a VM under ESXi with a slow pool and even a mild write load. That initial test, coupled with some initial reading, is where my concerns grew.

What follows is my highly unscientific and anecdotal experience of writing to and reading from a slow pool.

I was toying with 5 virtual drives in RAIDZ2, all carved from the same VM datastore, which is backed by a single HDD at that. Needless to say, pool performance was sub-par initially.

My use case was transferring files from a Windows PC to a FreeNAS CIFS share. The first 1 GB of data (with 8 GB of RAM allocated) would be sent (and buffered in RAM) quickly, at the speed allowed by my 1GbE LAN, and then came the Catatonic Devil. It was basically stuck like that for a very long time (minutes), so I aborted the test; at that point I was not in a position to test further.

After removing data from this test pool and running the transfers again today, I saw the same initial burst of data and then a freeze. However, it wasn't frozen for as long, and it started churning out more and more data at a slower pace. When the transfer was complete, I fired up a second one; it started out high but didn't reach the speed of the earlier buffer-then-freeze scenario, and it slowly reduced the transfer speed to a level that appears sustainable for the pool without freezing.

Finally I threw three writes and a read at it, and this caused it to once again stutter for a few seconds. Within perhaps 15-20 seconds it was back on track, and again writing while also reading data from disk.

Conclusion: As jgreco states, FreeNAS is now more able to cope and adapt than previously. I will be adjusting the transaction group flush timeout as a first step, and this may be sufficient.
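
If I want to repeat this without the Windows client in the loop, a rough local equivalent from the FreeNAS shell would be something like the following (the pool and dataset names are placeholders, and with lz4 compression enabled zeros compress away, so the numbers will look optimistic):

    # write an 8 GB burst to the pool (adjust the path to your own dataset)
    dd if=/dev/zero of=/mnt/tank/test/ddtest bs=1m count=8192 &
    # in another session, watch per-vdev throughput once per second for stalls
    zpool iostat -v tank 1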

Another reason why this pool's performance was potentially extra horrible at first is that the vmdks were created using thin provisioning, unlike the thick provisioning of the FreeNAS VM itself.

If I decide to use FreeNAS it will be in a VM, at least initially. The server I currently have has a fairly low-power 250W PSU, so I cannot put in more than the two 4TB Seagate NAS drives and reasonably expect it to cope.

I'm running an LSI 9212-4i4e (SAS2008) with IT firmware in passthrough mode for FreeNAS. I have another identical LSI HBA in IR mode which I will use for 2x SSDs in RAID1 as the VM datastore. The likely SSD candidate is the Samsung SM863.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I haven't seen repeat-stalls normally, but if you're dealing with virtual disks, that could very likely be changing the dynamics of it all. Also, I am dealing with bigger systems than what you describe, and a lot of what I'd be testing against would be mirrors these days, though I do have an 11-disk RAIDZ3 system that I like to heap some abuse on.

I think it has some trouble when the underlying pool performance changes dramatically and suddenly, based on some experimentation with artificially stressing the disks outside of ZFS (with dd) and then trying to cause a repeat-stall (successfully). My recollection is that it learned very quickly and aggressively lowered the transaction group size.
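
Roughly speaking, that sort of stress test looks something like this (the device names are just placeholders for your pool's member disks, and the dataset path is made up):

    # read-stress the pool's member disks directly, outside of ZFS
    dd if=/dev/da0 of=/dev/null bs=1m &
    dd if=/dev/da1 of=/dev/null bs=1m &
    # then throw a large write at the pool and see whether it stalls
    dd if=/dev/zero of=/mnt/tank/stalltest bs=1m count=4096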

Do note that we don't recommend virtualization for new users. If you're not an expert with both FreeNAS and ESXi, there are lots of hazards.
 

logan893

Dabbler
Joined
Dec 31, 2015
Messages
44
Thanks again for the help. The virtualization warnings are duly noted; I've read a lot about them over the past months. FreeNAS is new to me, so I'm letting this build take the time it needs. I haven't fully decided on using ZFS and FreeNAS just yet, although it is growing more appealing.

The rewards of using ZFS are enticing and appear great, yet the potential hazards of running ZFS on a setup that isn't fully bulletproof, with double or triple redundancy in every way, are equally sobering. As a result I wouldn't trust FreeNAS with the only copy of any data even remotely valuable, even if I were to run it on a stand-alone machine.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
The potential hazards of ZFS aren't really that much different from other things, but usually if your NTFS-on-a-single-disk craps out, you chalk it up to "hard drives suck" and then you restore it from backup (or suffer the loss). If your poorly designed fileserver loses a configuration and you've made your storage dependent on the 100% correct operation of three of them, you lose all your storage. Etc. Trusting your data to storage on a random PC with a random configuration is a crap shoot.

The difference with FreeNAS is that we EXPECT our storage systems not to lose data in those ways, so we do tend to stick to some well-understood recipes that result in very reliable platforms.

There's little reason not to trust FreeNAS with your data. It generally does as well as or better than other filesystems. It's just that we're paranoid because we've seen stuff happen.
 

logan893

Dabbler
Joined
Dec 31, 2015
Messages
44
I don't mean to say that FreeNAS is untrustworthy. If that was implied, that was a mistake. With any storage solution the robustness often goes hand in hand with your level of hardware investment. At least to a point.

Reading about so many lost pools, regardless of what's causing it, be it user error, negligence, poor hardware, incorrect virtualization, etc, does warrant some concern. Especially for the home user with a relatively low budget. Going from the single drive scenario to multiple drives in various RAID constellations no doubt makes a system more complex. Adding ZFS is yet another layer of complexity. I'm grateful that solutions such as FreeNAS are available and working as well as they are.

I assume FreeNAS with ZFS would compare favorably to many other software RAID setups in terms of the risk of data loss, using the same hardware. However, I don't have the FreeNAS expertise to back that up, so I defer to the forum and the wealth of information and good advice (or bad examples) available here.

One other appealing aspect of FreeNAS is its underlying OS. I've been an avid user of FreeBSD for a decade, and I prefer it greatly over Linux, etc. ESX(i) has also been my go-to base of operation, for almost as long. I have no professional experience with either, and I do concede that my knowledge is limited. I am far from an expert in either area. Being the slightly paranoid type myself however (especially with valuable data), I do plan to ensure I have adequate safeguards for when I store any important data using FreeNAS, or any other NAS solution.

We are getting more and more off topic now. I'll be sure to create separate threads when I have more questions I cannot find the answers to. And perhaps I'll post a write-up of my complete setup plans once they solidify, so you can again discourage me from running virtualized. :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
logan893 said:
I don't mean to say that FreeNAS is untrustworthy. If that was implied, that was a mistake. With any storage solution the robustness often goes hand in hand with your level of hardware investment. At least to a point.

Well, that's more or less it. I've been a big believer in redundancy for ... well, a really long time. Two cheap things tend to be more reliable than one expensive thing. Buying a single SATA disk in an external enclosure isn't good data protection. Buying a super expensive RAID solution doesn't necessarily translate into significantly better protection than buying two separate external SATA disks and putting the same data on both of them. There's someplace in between. I like to think that's ZFS.

logan893 said:
Reading about so many lost pools, regardless of what's causing it, be it user error, negligence, poor hardware, incorrect virtualization, etc, does warrant some concern.

We really don't see a lot of lost pools. Almost all of what we see can be traced back to bad choices that are well outside the guidelines.

logan893 said:
Especially for the home user with a relatively low budget.

Yes. FreeNAS really isn't well-suited for them. Except that these days, the cost of hardware is falling at a really good clip.

logan893 said:
Going from the single drive scenario to multiple drives in various RAID constellations no doubt makes a system more complex. Adding ZFS is yet another layer of complexity. I'm grateful that solutions such as FreeNAS are available and working as well as they are.

You understand the complexity angle. Complexity can be a hazard. It can also provide benefits.

logan893 said:
I assume FreeNAS with ZFS would compare favorably to many other software RAID setups in terms of the risk of data loss, using the same hardware.

Favorably, yes, conditionally. It is definitely favorable on a suitable platform that includes ECC and native SATA or HBA ports, and is otherwise within the hardware guidance. In that case, it offers reliable checksum verification and very tough resiliency. It can monitor the health of each hard drive, and proactively warn as trouble approaches and eventually develops, since many hard drive failures manifest themselves by starting out as small runs of read errors.
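
To make that concrete, those early warning signs show up in the drives' SMART counters, and a quick manual check looks something like this (the device name is a placeholder; FreeNAS can also run these checks for you on a schedule via S.M.A.R.T. tests in the GUI):

    # attributes that most often foreshadow a failing drive
    smartctl -a /dev/ada0 | egrep 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'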

If you put it on a platform that doesn't have ECC, there's an argument to be made that errors can be introduced into the pool that could cause corruption. Very unlikely. Definitely possible. ZFS lacks a "fsck" or "chkdsk" type utility, since errors should never occur, so an introduced error is potentially an unrecoverable "back it all up and then nuke the pool" type event.

If you put it on a platform with a hardware RAID controller, it's possible for the underlying storage to "go bad" without any apparent notification; FreeNAS is not designed to monitor the underlying storage behind a RAID controller. This could lead to a sudden, catastrophic failure that might have been avoided on a platform with the facilities to do such monitoring.

If you put it on a platform with USB-attached drives, you might discover interesting failure modes when someone unplugs a hard drive inadvertently.

There are specific reasons for the hardware designs that are championed here in the forums. We recommend things that are not expected to fail with FreeNAS. It gets harder to talk about ZFS on random hardware platforms because each of them has various strengths and weaknesses which may not be meaningful to ZFS but may be useful with some other filesystem.

 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
jgreco said:
Two cheap things tend to be more reliable than one expensive thing.

Hmm... not with 2 mirrored USB sticks vs. one SSD... :D

logan893 said:
Reading about so many lost pools, regardless of what's causing it, be it user error, negligence, poor hardware, incorrect virtualization, etc, does warrant some concern.

Well, you'll only see threads about problems, and no threads when everything is fine. (BTW, everything is fine with my server, in case you're wondering; it hasn't lost a single bit.) :)
 