Disable cache flush


MichelZ

Dabbler
Joined
Apr 6, 2013
Messages
19
If I have a battery-backed controller with some cache on it, is it safe to disable cache flush?

zfs set zfs:zfs_nocacheflush = 1
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Ahh. You have a RAID controller with on-card RAM. Based on my testing with three different RAID controllers that had on-card RAM, in both benchmarks and real-world tests, here are my recommended settings for ZFS users:

1. Disable your on-card write cache. Believe it or not, this improves write performance significantly. I was very disappointed by this finding, but it seems to be a universal truth. I upgraded one of the cards to 4GB of cache a few months before going to ZFS, and I'm disappointed that I wasted my money. It helped a LOT on the Windows server, but in FreeBSD it's a performance killer. :(
2. If your RAID controller supports read-ahead cache, set it to either "disabled", the most "conservative" (smallest read-ahead), or "normal" (medium read-ahead). I found that "conservative" was better for random reads from lots of users, and "normal" was better for workloads that read a file in order (such as copying a single very large file). If you choose anything larger for the read-ahead size, the latency of your zpool will go way up, because any read by the zpool can be multiplied 100x as the RAID card constantly reads a bunch of sectors before and after the one sector or area actually requested.
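
If you want to poke at those settings from the shell, here's the general idea with LSI's MegaCLI. Treat it as a rough sketch: property names vary by card and firmware, and on LSI gear the read-ahead choices are spelled NORA/ADRA/RA rather than "disabled"/"conservative"/"normal", so double-check against your own card's documentation before running anything.

Code:
# show the current cache policy for all logical drives on all adapters
MegaCli -LDGetProp -Cache -LAll -aAll

# write-through, i.e. turn OFF the on-card write cache
MegaCli -LDSetProp WT -LAll -aAll

# no read-ahead (or ADRA for adaptive read-ahead, the middle-of-the-road setting)
MegaCli -LDSetProp NORA -LAll -aAll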

If you deviate from these settings, ZFS performance tanks BADLY, in every experience I've had (and everything I've read). ZFS likes to make its own decisions about when to write to and read from the disks, and it seems to be very smart about not trying to do both at the same time (seek times are a PITA). Having a second device making those decisions (your RAID card, in particular) is really bad news, because you are almost guaranteed to end up doing both at once. What happens is the RAID card tries to flush its write cache at the same time ZFS tries to do reads. Then performance tanks, and it gets harder and harder to get it back without waiting for the zpool to go completely idle (ZFS and RAID controller caches flushed).

For me, just disabling the write cache more than doubled my zpool performance. Changing the read-ahead more than doubled it again! I went from about 130MB/sec on my personal zpool to over 1GB/sec. I wasn't happy when a 30TB zpool wanted to scrub at 130MB/sec, so I had to figure out what the problem was (a 67-hour scrub during which you can't even stream one movie was unacceptable). Turns out the RAID card settings made all the difference. Who'd have thunk it? After all, everything you read about ZFS says to use "dumb" SATA controllers and never use RAID controllers (which you and I are both using). I actually removed the battery backup from my server and sold it to a friend who still uses Windows. It serves no purpose if you have the write cache disabled.

There is no zfs_nocacheflush sysctl in FreeBSD. The equivalent is vfs.zfs.cache_flush_disable (default "0", as verified by sysctl -a | grep vfs.zfs.cache). Remember in your Googling travels that Solaris' settings are different from FreeBSD's. The theoretical function and the recommended use/non-use cases are usually correct (but not always, and should always be verified). This only adds to the confusion for people who haven't been tweaking FreeBSD for years (raises my hand) and makes it even harder to determine which tweaks work and which don't (after all, if a sysctl doesn't exist in FreeBSD, how do you easily prove it?).
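
Just so the FreeBSD spelling is clear (shown so you can verify it, not so you'll go set it): as far as I know it's a boot-time loader tunable rather than something you can flip at runtime, so it would go in /boot/loader.conf (or in the FreeNAS "Tunables" screen, if I remember right) and take effect after a reboot. Something like:

Code:
# current value and the one-line description of the knob
sysctl vfs.zfs.cache_flush_disable
sysctl -d vfs.zfs.cache_flush_disable

# if (and only if) you were testing it anyway, it gets set at boot time:
echo 'vfs.zfs.cache_flush_disable="1"' >> /boot/loader.conf
# ...then reboot for it to take effect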

Before you go trying that setting, my advice is to read this topic and then not do it for any reason except testing (and have good backups... I think I've said that to you before, though). Even if you try it and it works great (yes, I think it will make you very happy performance-wise), I don't think it'll "teach" you anything except that skipping a latency-intensive task makes the server much faster. (Well, no duh!) I don't even try to tweak my systems, because statistically I have a far higher chance of kissing my data goodbye or actually hurting performance than of helping it. I've yet to see a noobie show up in the forums with tweaks that were actually well thought out (and actually worked). It's not for the faint of heart, and the consequences can be a zpool that is corrupted beyond repair or mounting.

Probably 95-99% of the time when someone complains about poor performance, I have told them to delete their "performance enhancing tweaks" (if they had any), and by following this advice they saw performance increase. The other downside is that if corruption does begin to happen, you may not even know it until your next scrub (assuming a scrub is capable of finding the corruption), so the storage needs of your backups will also go way up. This tweak has the universal result of major performance increases, but at the cost of major data-loss suckage.

You are really fighting a losing battle performance-wise, because you chose a vdev type that requires redundancy calculations. For each sync write, that forces a stripe to be read (if it's not already in the read cache), the change to be made in memory, the redundancy calculations to be performed, and then the data to be written. I think (but don't quote me on this) that using a ZIL will let the write go to the ZIL without the extra latency I just mentioned (no stripe read, no redundancy calculations, etc.). I have no idea how much it will "help" your numbers; it might double them, it might be 10x. I really don't know. The ZIL (and the L2ARC, for that matter) are extremely complex. They aren't your typical dumb write cache like you see on your RAID controller. The ZIL only caches certain writes, and only in certain circumstances (it's up to the advanced server admin to adjust the tweaks, if necessary, to get the most value out of it). Everything else is written directly to the zpool, and there's not a darn thing you can do about it (without even more tweaking).
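
For reference, adding a dedicated log device (the thing most people here mean by "a ZIL") is mechanically simple; the pool and device names below are made up, and this is a sketch of the commands, not a recommendation to go buy one today:

Code:
# add a single SSD (here ada4) as a dedicated log device to a pool named "tank"
zpool add tank log ada4

# better: mirror the log device so losing one SSD doesn't hurt
zpool add tank log mirror ada4 ada5

# then watch how the pool behaves under your sync-write load
zpool iostat -v tank 1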

There is a tradeoff between performance and reliability and you are certainly hitting that limit. This is why iSCSI isn't recommended on ZFS. There is no "magic wand" to fix it either. ZFS pushes the reliability "all the way to 11", which means performance will take a nosedive. That's why so much RAM is used by ZFS to help make performance even remotely manageable.

Personally, if I were in your shoes, I'd abandon trying to tweak your FreeNAS server and look at getting a ZIL (and using the RAID card settings I mentioned above). Normally I'm a big advocate against ZILs, because everyone adds them without doing the homework. I've read everything I could find on ZILs (and L2ARCs) because I find how they work fascinating, but I still don't feel confident I could make a 100%-certain call on when to use them and when not to.

If a ZIL won't solve your problem, the next step I'd try is going to mirrors instead of RAIDZ(x). This may not work well for you, because you will lose a total of 50% of all of your disks to redundancy, but it may be the only way short of doing very, very dumb things. If neither of those helps, you might be forced to go to UFS and give up on ZFS completely. In all honesty, I've had great luck with my Intel SSDs (I don't use any other brand), and I would feel confident that an array of Intel SSDs on a RAID controller would be fairly safe (but definitely keep backups!). The lack of moving parts really improves reliability, but SSDs do have limited write cycles. I've never used UFS on FreeNAS, so I don't know what dirty tricks and/or limitations you may run into in your situation.
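
If you do end up testing the mirror layout, it's just a matter of how the pool is built; with made-up device names, an 8-disk pool of striped mirrors would look something like this (and yes, that's half the raw space gone to redundancy):

Code:
# four 2-way mirrors striped together (hypothetical device names)
zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5 mirror da6 da7
zpool status tank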

I don't know if you've read bug report 1531, but I'd give it a full read (not a skim). You'll see jgreco (he's the iSCSI-on-ZFS problem god), some of the stuff he did, and the lessons he learned. I think the "big ticket" items have already been mentioned to you, but I'd definitely read it before spending countless hours with Google.

Not to be pedantic, but you are probably making a bigger mess by spreading a single problem across multiple threads. I realize this really conflicts with the "one question per thread" philosophy (aka the forum rules), but you've created so many threads in the last few days for this one problem that I can't even remember what hardware you have or what questions you asked. You might be better off sticking to one thread, because you may start getting conflicting answers (which will only confuse you even more) since nobody knows what was said in the other threads. I'm going to see if I can merge all of your applicable threads on this issue back into one, for your own benefit.

I do give you props for your determination in trying to fix your problem!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That config setting (zfs set zfs:zfs_nocacheflush = 1) is from the ZFS Tuning Guide on the FreeBSD Wiki...
If not even that is accurate, then maybe I'd better stay the hell out of all that...

https://wiki.freebsd.org/ZFSTuningGuide

To quote a setting from a WIP document, from a portion that has clearly not even been started...? It seems obvious to me that gcooper stuffed it in there as a reminder to finish that section at some point; it was "ripped off" from one of the Solaris documents, but it is a useful starting point.

But here's the problem you're facing.

You have a RAID controller. I don't know which one, but let us say it is a relatively modern LSI SAS2208, which comes with a dual-core 800MHz PowerPC processor and 512MB of RAM (*). It is pretty good at what it was designed to do, which is to abstract the physical storage and provide a more reliable storage subsystem for whatever you're doing. Every sector that is sent to and from an attached disk is handled by the software on the card, because that's what these things are designed to do. They don't really expect to be used as dumb SATA controllers. So yes, you can configure your read cache and your BBU-backed write cache and all that garbage, and those things will work.

But now, conflict. ZFS is intended to be your storage controller, handling RAID, caching, and all that. It has access to a CPU that is much faster. It has access to a much larger pile of system memory. It is designed with the idea that it has access to all these disk I/O channels, and it isn't shy about making aggressive use of them to get better performance. And you're proposing to put a cheesy little RAID controller powered by a pair of hamsters on a wheel in between your ZFS and your storage. Hey, it'll work, because ZFS is awesomely resilient, but don't expect too much from it.

(*) Assumed to resemble a dual-core version of the 750FX, which charitably seems to have Geekbench results around 400, so let's say a result of 800. An E3-1230 is like 12,000.
 

MichelZ

Dabbler
Joined
Apr 6, 2013
Messages
19
Hi

I am using the LSI 9260-8i (SAS2108) as a dumb SATA controller (JBOD mode, which disables all RAID functionality).
This is what I got from LSI support:

Unfortunately, cache retention and performance mechanisms (writeback) are only supported on drives configured for Integrated RAID which contain valid DDF metadata. If a drive is configured for JBOD the drive cache and system cache is used instead of the RAID controller cache.

So the controller cache is not used in JBOD mode.
So I will not disable the cache flush then, as the disks themselves are not BBU-protected.

Thanks
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Hi

I am using the LSI 9260-8i (SAS2108) as a dumb SATA controller (JBOD mode, which disables all RAID functionality).
This is what I got from LSI support:

Unfortunately, cache retention and performance mechanisms (writeback) are only supported on drives configured for Integrated RAID which contain valid DDF metadata. If a drive is configured for JBOD the drive cache and system cache is used instead of the RAID controller cache.

So the controller cache is not used in JBOD mode.
So I will not disable the cache flush then, as the disks themselves are not BBU-protected.

Thanks

Based on what you said, you do not understand what that tunable does. It has absolutely nothing to do with RAID controllers or BBUs on hard disks.
 

MichelZ

Dabbler
Joined
Apr 6, 2013
Messages
19
The tunable instructs ZFS not to flush the disk/controller cache on every sync write... yes?

Having a BBU-backed controller cache would make cache flushes unnecessary... yes? (since the cache is protected)

We are not speaking about the host's cache; we are speaking about the disk's cache (or the controller's, in this case).

So what exactly is wrong in my understanding?

Thanks
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Based on what you said, you do not understand what that tunable does. It has absolutely nothing to do with RAID controllers or BBUs on hard disks.

!!!!?????

http://milek.blogspot.com/2008/02/2530-array-and-zfs.html

I think it'd be correct to say that what it does doesn't appear to be helpful or useful in the context of what's being discussed in this thread, given the specifics; but for someone using a RAID controller whose write caching was actually doing something, it would be worth examining. I have spent a heck of a lot of time with ZFS, so finding the above reference was not difficult, but for the average user, finding a good description of what it's all about is kind of a pain, I would think.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
That link is for Sun's implementation. From what I've read, FreeBSD's is slightly different. Allegedly, if the controller doesn't support the ZFS caching (hint: apparently there is no support for this "feature", since it's different in FreeBSD), it does absolutely nothing. Unfortunately I can't find the link that accurately describes the function in FreeBSD, and I don't have time to write a book right now.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Look, absent an explanation of how things are different in FreeBSD from someone who understands the code, the differences, and the design tradeoffs, I'm pretty much limited to the resources at hand: the documentation I can find, reading the source code, and decades of experience.

So when I see code that says:

Code:
zil_add_block(zilog_t *zilog, const blkptr_t *bp)
{
        {snip declarations}

        if (zfs_nocacheflush)
                return;


what it seems to me is that the code is deliberately skipping the ZIL's cache-flush bookkeeping (and therefore the cache flushes that would normally follow the on-disk ZIL writes). That seems consistent with this recent description on freebsd-fs of what is going on. Whether or not this is useful, wise, faster, etc. is all a very good question; my impression is that it isn't a good idea in general, unless you have gear that is actively getting its cache arse kicked by flushes, and since we've already established that the OP has a setup that doesn't qualify, there's no point in worrying about it.
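
If anyone wants to check my reading, the relevant bits are easy enough to find on a box with the source tree installed; the path below is the 9.x layout and may differ on other releases:

Code:
# where the zfs_nocacheflush checks live in the FreeBSD ZFS code
grep -n zfs_nocacheflush /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c

# and the sysctl/tunable it's wired up to on FreeBSD
sysctl -d vfs.zfs.cache_flush_disable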
 

MichelZ

Dabbler
Joined
Apr 6, 2013
Messages
19
While my gear does not qualify, I would still very much like to know your thoughts about it if I did have qualifying gear... (a controller with cache and a BBU which actually gets used ;) )
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
While my gear does not qualify, I would still very much like to know your thoughts about it if I did have qualifying gear... (a controller with cache and a BBU which actually gets used ;) )
I don't think you understood jgreco's explanation involving the RAID controller.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
While my gear does not qualify, I would still very much like to know your thoughts about it if I did have qualifying gear... (a controller with cache and a BBU which actually gets used ;) )

Large "enterprise" storage systems come with MASSIVE amounts of both read and write cache, that's how they manage to pump huge numbers of IOPS. The primary point of the optimization originally mentioned seems to be aimed at systems where flushing the cache is sufficiently expensive that you don't want to do it regularly. And what I'm specifically thinking of is that it is quite possible for several gigabytes of write cache to be directed at a small number of disks, resulting in an extended period of 100%-busy for the drives doing absolutely nothing of value, since the data was in nonvolatile cache storage anyways, and it is just the storage system doing "what you asked it to" but not really doing what you needed it to. This comes down to the fact that ZFS really expects to be talking to dumb disks and that it is the smart controller. If you have a second smart controller in the data path, then you get the storage system's smart controller trying to do "intelligent" things when it shouldn't.

My guess is that on the LSI 2108, with only 512MB of write cache across 8 drives, that's more like 64MB cached per drive; with the average drive able to write at 150MB/sec, that's well under a second to drain. And I'm guessing the LSI does nothing special with cache flush requests anyway... so this seems like it wouldn't be an issue even if you were using the write cache, and even if the card did flush it. But I still think the setting really isn't a good idea, for a variety of reasons that have been explained elsewhere.
 

Greg_E

Explorer
Joined
Jun 25, 2013
Messages
76
This has been really helpful in several ways... I'm currently trying to build my first FreeNAS box and may have spent a bunch of money needlessly. I too bought an LSI 9260-8i, and it seems that for ZFS I should have just stayed with the onboard controllers in JBOD. I eventually also want to get into iSCSI. My card currently does not support JBOD mode; I expect there is alternate firmware I need for that to work.

So now I have a decision to make: do I run RAID on the card, use UFS, and eventually work up to iSCSI, or do I take the card out, run the disks off the motherboard controllers as JBOD, and use ZFS? I can use the card in Windows boxes, so it won't be a total waste. The box is a SuperMicro 1017R-MTF, a little 1U chassis with eight 2.5-inch drive bays and only 16GB of RAM for now (6 more RAM socket pairs). It also has a single internal USB A connector, so I'm running FreeNAS on one of the SanDisk micro drives connected internally.

I think after reading most of the linked thread on ZFS performance, I'm going to go with a more "traditional" FS for now and let the controller handle RAID duties... Now all I need are a couple more fans in the chassis to continue. I'll still take the suggestions about the caches into consideration when I get back to setting this up.

[edit] Just got a message back from LSI, my controller will not support JBOD.
 