Database performance and vfs.zfs.cache_flush_disable


EdvBeratung

Dabbler
Joined
Oct 25, 2014
Messages
12
Hi FreeNAS gurus, nerds and heroes, :)

I have been reading through tons of threads and documentation, but I am still not 100% clear on whether it is safe to use vfs.zfs.cache_flush_disable, so I hope someone can help me with a few answers.

Quick story:
I am trying to use FreeNAS with a custom-built database application which I don't trust, so I am using "sync=always" to make sure it stays consistent. Unfortunately the DB stores huge binaries and transacts in 4 kB records, and even with an LSI 9207 (IT firmware) and Samsung 850 Pro SSDs (which are terribly fast) as SLOG I can't get any higher than 5 MB/s in writes. iozone (with the "-r 4k" option) confirms that. With larger record sizes I get 400 MB/s and more, but unfortunately the DB can't be adjusted and it always sends tiny 4 kB records.
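For reference, the difference shows up just by changing the iozone record size on the same dataset (128k here is only an example of a larger record size; I run this from a directory on the pool):

Code:
# tiny 4 kB records, each one a separate sync write: ~5 MB/s here
iozone -i 0 -r 4k -s 2g

# larger records on the same dataset: 400+ MB/s
iozone -i 0 -r 128k -s 2g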

Now I found out that there is an option called vfs.zfs.cache_flush_disable. I set it to "1" in Tunables -> loader, and I'm now getting nearly 50 MB/s write speed, which is about ten times faster. Great.
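For reference, the setting ends up roughly as the following loader.conf entry (the GUI tunable writes the equivalent for you), and the value can be read back after a reboot:

Code:
# equivalent /boot/loader.conf line for the GUI tunable (type "loader")
vfs.zfs.cache_flush_disable="1"

# verify after reboot (0 = flushes still enabled, 1 = disabled)
sysctl vfs.zfs.cache_flush_disable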
Since data consistency is important, I want to make sure I understand what I'm doing, and therefore I have some questions.

1. What exactly does vfs.zfs.cache_flush_disable affect? The LSI 9207 in IT mode doesn't have any cache and basically works as a passthrough. So does it affect the drives' integrated write caches? Or a cache in the system's RAM? Or both? If it affected system RAM, then the data would be no more secure than with "sync=disabled", correct?

2. The system has redundant power supplies and a separate UPS for each, so it should never go down because of power issues.
If the BSD kernel crashes (panic etc.) for whatever reason and there is still something sitting in the SLOG device's write cache, will that still be written out even though the kernel is dead, as long as the drive has power?

3. I will try to simulate crashes anyway, since I'm curious to see whether replay from the SLOG works. Are there any logs anywhere that would show me that the pool has been recovered using information from the SLOG?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
It would be *really* nice if you'd give the forum rules a read and provide the info that is required when creating a thread.

Thanks.
 

EdvBeratung

Dabbler
Joined
Oct 25, 2014
Messages
12
It would be *really* nice if you'd give the forum rules a read and provide the info that is required when creating a thread.

Thanks cyberjock!

This is on a test system using FreeNAS 9.3 M4.

It runs on the following hardware:
Intel Xeon E3-1240 V2, 3.40GHz, 4 Cores, HT, Turbo
24 GB DDR3 ECC SDRAM
4x WD30EFRX (WD Red 3 TB) configured as striped mirrors ("RAID 10")
1x Samsung 850 Pro SSD 256 GB (SLOG)
1x Samsung 850 Pro SSD 256 GB (L2ARC)
All drives connected to an LSI 9207-8i HBA (PCIe x8) with IT firmware (P20) and the latest FreeBSD 9.3 driver (mpslsi, P20)

During the tests I described above the CPU stays below 10%.

I guess this is what you were asking for?! If not please let me know. Thank you!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ok, well, a few comments...

With as little RAM as you have, a ZIL and an L2ARC aren't likely to help and stand a near 100% chance of hurting performance. Both of those take RAM to be used, so unless you already have an overabundance of RAM (I usually mean 64GB+ when I say "overabundance of RAM") you are going to be disappointed.

If you are using FreeBSD 9.3 then I really have no comments to add. FreeNAS has some tunables that differ from FreeBSD, so doing tests with FreeBSD 9.3, then coming here to understand the behavior isn't really a fair comparison. It's like buying a compact car and then trying to argue why it doesn't perform like a full-sized truck. Well, they aren't the same thing.

I can tell you with 100% certainty that you are definitely not on FreeNAS 9.3, though. If you were, you wouldn't be on P20 drivers or P20 firmware, but you claim to be using both.

In any case, if you are stuck with 4KB blocks and random reads/writes, you are now talking about what is effectively the worst-case-scenario for using ZFS. You are going to have to throw significant hardware at the system to get good performance. Remember that ZFS is designed to protect your data at any and all costs, and when I say "any and all costs" I mean "you'll need to spend money on hardware to offset that performance penalty". You may be able to help it somewhat with tuning, but 24GB of RAM is not much to work with, even when talking about "just" 6TB of usable disk space.

But, to answer your question, vfs.zfs.cache_flush_disable affects disk caching. Normally when you write data to a hard drive, the drive stores it in its on-disk cache (typically 32-128MB) and can claim the data is written even though it hasn't reached the platter yet. ZFS deals with this by sending a cache flush command after important writes. Setting vfs.zfs.cache_flush_disable means that flush no longer follows the write to the pool, so data can sit in the on-disk cache while ZFS carries on as if it were safely stored. Unfortunately, if that data is ultra-critical and is lost, you could have problems ranging from losing the last few seconds of writes to a pool that is inconsistent with itself (if you think this sounds bad, it is...). To make matters worse, this can break POSIX compliance: a sync write is supposed to be on non-volatile storage before the call returns, and with flushes disabled that guarantee is gone.
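If you want to see the two layers involved for yourself, something along these lines should show them (device names are just examples; directly attached SATA disks show up as adaX, while disks behind the LSI HBA usually show up as daX):

Code:
# 0 = ZFS still sends a cache flush after log writes and txg commits (the default)
sysctl vfs.zfs.cache_flush_disable

# whether a directly attached SATA drive's own volatile write cache is enabled
camcontrol identify ada0 | grep -i 'write cache'

# for disks behind the HBA, the SCSI caching mode page shows the WCE bit instead
camcontrol modepage da0 -m 0x08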

Just like with sync=disabled, setting cache_flush_disable incurs risk. Here in the forum we take the conservative approach that people shouldn't break this behavior, because far too few people will even understand this explanation well enough to make an informed decision about whether they want to do this and how serious the consequences could be. This is one of those things that might hurt you on day one, or might not hurt you until day 1000. You also may not be aware of the data loss until it is too late. Since most people who choose ZFS are looking to protect their data and are willing to pay for ZFS to behave "properly", this is certainly the antithesis of that philosophy.

I'm not specifically aware of any logs that show that the SLOG has been committed to the pool, but if the SLOG device is missing the pool won't mount, and if the SLOG is present its contents are read and committed to the pool before the pool mounts.
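The closest I'm aware of is just watching the pool itself (pool name is an example):

Code:
# the SLOG appears under "logs" in the pool layout
zpool status tank

# after an unclean shutdown the ZIL is replayed automatically during import;
# there's no dedicated log message for it, but the import fails outright if
# the log device is missing (recent pool versions can force it with -m)
zpool import tank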

As for how a drive behaves in a kernel panic situation, that is going to be specific to the hardware used. I know the Intel S3700s are pretty much "the go-to" for good L2ARCs and ZILs; they have all "the right sauce" for ZFS. Samsung Pro drives are "probably" okay, but as I haven't read all of the available documentation on them I wouldn't be the person to answer that question. If your data is important I'd be going with Intel S3700s, and I'd probably decide not to do this in-house and buy a TrueNAS instead. It's nice to have someone else to blame when things go wrong, and it's nice to have someone else who has done all of the appropriate homework to make sure bits weren't lost because of a poor hardware choice. ;)
 

EdvBeratung

Dabbler
Joined
Oct 25, 2014
Messages
12
If you are using FreeBSD 9.3 then I really have no comments to add. FreeNAS has some tunables that differ from FreeBSD, so doing tests with FreeBSD 9.3, then coming here to understand the behavior isn't really a fair comparison. It's like buying a compact car and then trying to argue why it doesn't perform like a full-sized truck. Well, they aren't the same thing.

Thank you for your answer!
It is helpful, although I can't see where I wrote that I'm using FreeBSD 9.3. I don't. Of course I'm using FreeNAS 9.3 (latest M4 nightly); I probably wouldn't post FreeBSD questions in a FreeNAS forum.
But your answer still looks valid for both FreeBSD and FreeNAS, so thanks again!



I can tell you with 100% certainty that you are definitely not on FreeNAS 9.3, though. If you were, you wouldn't be on P20 drivers or P20 firmware, but you claim to be using both.

Not sure where this comes from. I simply downloaded the P20 firmware and the P20 driver for FreeBSD 9.3 from the LSI website, flashed the P20 firmware, copied the P20 driver (mpslsi.ko) to /boot/kernel and loaded the mpslsi driver in "System" -> "Tunables".
(mpslsi_load="YES" as type "loader")

Reading through the release notes since P16, I felt it was definitely worth upgrading to P20 before starting with this controller.

So you might reduce the certainty from 100% to a lower value... ;)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
No, what you did was add a driver that is almost certainly incompatible with FreeNAS. That's what you REALLY did. You cannot and should not be adding drivers that way. P16 is the driver we're using because it's stable. FreeNAS is an appliance, not an OS. You can't willy-nilly add and remove drivers because you want something newer.

At this point I'm going to bow out of this thread. I'm really not going to sit and argue the finer points of FreeNAS yet again. We've made it very clear all over this forum that FreeNAS is an appliance and not an OS; if you don't treat it as such you'll be sorry. In fact, the new 9.3 has a "verify OS" button, and if someone clicks it and it fails I won't even be answering that individual's questions, because they've customized their OS in a way that is unsupported. The devs will also be ignoring tickets from people that have customized their OS. ;)
 

EdvBeratung

Dabbler
Joined
Oct 25, 2014
Messages
12
Thanks, good to know! I wasn't aware of that.
So I'll remove the original LSI P20 driver from the loader again, flash back to P16 and see if it makes a difference.
 

newlink

Dabbler
Joined
Oct 24, 2014
Messages
11
We have similar problems: I can't go faster than 8 MB/s with 4 kB chunks and a Samsung SSD (50 MB/s is their typical 4 kB speed). Small files do not perform very well; probably a cache problem like the one you found.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, Samsung SSDs are well known around these parts to not be a good choice. :/
 

EdvBeratung

Dabbler
Joined
Oct 25, 2014
Messages
12
I was looking at the Intel S3700 and saw in the data sheets that write performance varies significantly with capacity (as it does for most SSDs), and so does the price.
Will the 100 GB S3700 be enough for a SLOG? Does anyone here use that one as a SLOG? If yes, any chance an "iozone -i 0 -r 4k -s 2g" could be run on a dataset with NO compression, atime enabled and sync=always?
If nobody uses this drive as a SLOG I will probably buy the cheapest 100 GB variant for testing.
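To be clear about the setup I mean, something like this (the dataset name is just an example):

Code:
# test dataset matching the conditions above
zfs create tank/slogtest
zfs set compression=off tank/slogtest
zfs set atime=on tank/slogtest
zfs set sync=always tank/slogtest

cd /mnt/tank/slogtest
iozone -i 0 -r 4k -s 2g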

By the way, I flashed back to P16 and used the FreeNAS built-in P16 driver. The behaviour is roughly the same, but sometimes I see very short glitches in the transfer rate, half a second or so with basically no IO, so overall performance/stability is slightly worse than with the latest P20 firmware and the FreeBSD 9.3 drivers directly from LSI. I went back to P20 and I don't see these glitches anymore. Maybe there is some bug in the P16 driver and/or firmware that LSI has since corrected.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You do realize we've seen people lose their pools suddenly with mismatched driver/firmware? So regardless of what you think might be firmware related, you're better off sticking with the approved stuff.

Frankly, I don't think you have enough information to prove that the problem is even firmware related. The firmware version may be changing a symptom, but you are a far cry from proving it's the cause.

LSI has documented the changes between P16 and P20; you are welcome to give them a read. But having talked to the devs about updating the driver version (I have a bug ticket open with them asking for the upgrade), their answer for why P16 is still used on 9.3 (the same driver used on 9.2.1.x) is that there's nothing of particular consequence in the changelogs to justify upgrading if you are using the hardware properly.

Just like with all software and hardware problems, it's always a tradeoff between what you stand to gain and what you stand to lose, and the tradeoff just wasn't deemed worth it in the eyes of the devs. There's nothing stopping you from rolling out a FreeNAS build with the P20 drivers (which is what I'd do if I really just 'gotta have the latest'). Just don't be too surprised if you are later unhappy because of some bug introduced by those choices.

I can tell you I've worked on dozens of machines with M1015s, on all versions from P13 to P16, and I've never seen the effect you are describing. Of course, those people are almost always using exactly what is recommended, and you aren't. I have never recommended Samsung Pros, and quite a few users have reported major problems with all families of Samsung SSDs.

Not to hound you or start an argument, but it's only somewhat manageable to provide support for known hardware and known software versions and settings (this includes ZFS tuning). People who want to go off and do other things are welcome to, but they do so at their own risk, and the support they get in the forums when they have a problem is often very limited in scope. For this reason (and a few others) I deliberately don't tune ZFS: I want to be sure that if I have a problem, I'm likely not the only one with it. Don't get me wrong, I've done some testing of ZFS tunables and other things, but I haven't left them set that way long-term.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
I'll post my results with my S3700. A scrub has kicked off, so I'll need to wait for it:

Code:
 scan: scrub in progress since Sun Nov  9 00:00:03 2014
  5.75T scanned out of 18.4T at 1.34G/s, 2h42m to go
  0 repaired, 31.19% done
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Here are my results from a 200GB S3700 with sync=always and compression disabled. This is a dataset I serve iSCSI file extents from, so I force everything through the SLOG.

Code:
Record Size 4 KB
  File size set to 2097152 KB
  Command line used: iozone -i 0 -r 4k -s 2g
  Output is in Kbytes/sec
  Time Resolution = 0.000001 seconds.
  Processor cache size set to 1024 Kbytes.
  Processor cache line size set to 32 bytes.
  File stride size set to 17 * record size.
  random  random  bkwd  record  stride
  KB  reclen  write rewrite  read  reread  read  write  read  rewrite  read  fwrite frewrite  fread  freread
  2097152  4  20104  19786


Watching gstat showed the SLOG device at about 55% busy. It was also showing about 40,000 KB/sec with about 5,000 writes/sec, and about 10,000 total ops/sec.
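(For anyone who wants to watch the same thing, this is roughly the invocation; your SLOG's device name will differ:)

Code:
# -f filters gstat output to device names matching the pattern
gstat -f da2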

I really have no idea whether these are good numbers or not. I know I'm not usually writing 4k blocks to the SLOG, so I get a lot more than 20MB/sec over iSCSI.
 

EdvBeratung

Dabbler
Joined
Oct 25, 2014
Messages
12
titan_rw,
Thank you very much for your help! I really appreciate the time you spent testing this for me!
Just in case that this is a test box where you can do a reboot - could you also add this tunable (type loader) that I described in this thread ("vfs.zfs.cache_flush_disable" with a value of "1")? The reason I'm asking is that if the numbers stay about the same, the S3700 might be ignoring cache flush requests because of the cache protection it has, and that's what I'm curious about. The performance you are getting is about four times higher than what I see on my test device.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Again, there's not enough solid information to prove it *was* the firmware. The integration of the moving pieces cannot be simplified down to "if it's broken and I update the firmware and it starts working, it's a firmware bug".

Unfortunately, for that OP, he broke the standard convention for driver/firmware version and the devs may or may not be able to reproduce the problem (thereby actually determining the cause and providing a usable solution).

BUT, considering how many people are on P16 firmware AND drivers, I tend to think this is a one-off thing that is somehow unique to his configuration. He does have a boatload of drives and that may play a role in it.

In any case, I don't think anyone can argue that some drives have issues with P16. If that were the case there would be rampant threads all over the forums, since virtually every drive model ever made has been used by many, many users just by sheer chance. But that's not what we are seeing.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
titan_rw,
Thank you very much for your help! I really appreciate the time you spent testing this for me!
Just in case that this is a test box where you can do a reboot - could you also add this tunable (type loader) that I described in this thread ("vfs.zfs.cache_flush_disable" with a value of "1")?

I can, but it probably won't be until tomorrow. Waiting on a backup to finish which can't be restarted if it breaks.

I ran the above test again and got 17,000 KB/sec instead of the 20,000 above. Not sure why the difference; the pool was otherwise 'unbusy' in both cases.

Also interesting: the test resulted in exactly 8 GB (8192 MB) written to the SSD, calculated from SMART attribute 241 (host writes) before and after.
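(Roughly how I got that number; the device name is just an example, and whether attribute 241 counts LBAs or 32 MiB units depends on the drive, so check the attribute name before multiplying. Depending on how the disk is attached you may also need "-d sat".)

Code:
# read attribute 241 before and after the test run
smartctl -A /dev/da1 | awk '$1 == 241'

# Intel SSDs usually report attribute 241 as Host_Writes_32MiB,
# so data written = (after - before) * 32 MiB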
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Here's the results with vfs.zfs.cache_flush_disable=1:

Code:
Record Size 4 KB
  File size set to 2097152 KB
  Command line used: iozone -i 0 -r 4k -s 2g
  Output is in Kbytes/sec
  Time Resolution = 0.000001 seconds.
  Processor cache size set to 1024 Kbytes.
  Processor cache line size set to 32 bytes.
  File stride size set to 17 * record size.
  random  random  bkwd  record  stride
  KB  reclen  write rewrite  read  reread  read  write  read  rewrite  read  fwrite frewrite  fread  freread
  2097152  4  36701  36682


gstat was showing about 70,000 KB/sec of writes to the SLOG at about 33% busy, so disabling cache flushes definitely did something: the SLOG is writing more, and the device looks like it has more headroom than before.

Even though the S3700 is power-loss protected, I'm still not sure whether it's safe to disable flushing.
 

EdvBeratung

Dabbler
Joined
Oct 25, 2014
Messages
12
Hi titan_rw,
Thanks again for all your help!
Pretty interesting to see the differences. It looks like the Samsung 850 I'm currently using relies heavily on its write cache: with cache flushes disabled I'm getting a bit more than 50,000 KB/s, whereas with normal ZFS cache flushes I'm getting less than 5,000 KB/s.
For the Intel the difference is a lot smaller; disabling cache flushes "only" doubles its write speed, and with flushes disabled the Samsung actually outperforms the Intel.

I will order one S3700 and I will also "play" a bit with simulating failures: "sysctl debug.kdb.panic=1", unplugging power from drives, unplugging the whole SATA/SAS cable, or simply pulling power from the power supplies. The goal is to find out what causes file corruption and what causes pool corruption with vfs.zfs.cache_flush_disable enabled versus disabled, and with the Intel as SLOG compared to the Samsung.
That should keep me busy. ;)
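Roughly the plan for each run (names are just examples, and obviously only on a test box with a throwaway pool):

Code:
# 1. generate a sustained sync-write load (the dataset has sync=always set)
dd if=/dev/zero of=/mnt/tank/db/crashtest bs=4k count=1000000 &

# 2. while it is running, force a kernel panic (or pull cables/power instead)
sysctl debug.kdb.panic=1

# 3. after reboot: did the pool import cleanly, and is the data intact?
zpool status -v tank
zpool scrub tank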
 