sync=always is always slow?

HoneyBadger · Oct 8, 2015

There's an Intel paper on it with some Linux-specific stats here:
http://www.intel.com/content/dam/ww...nce-pcie-nvme-enterprise-ssds-white-paper.pdf

But even just comparing the datasheet for the ZeusRAM to the ARK site for the P3700 it looks like the Intel's got a good shot at winning. HGST claims <23us for write latency on the ZeusRAM, Intel's claiming 20us for the P3700. And I know for sure that an NVMe driver shim won't induce nearly the same latency as SAS.

If someone wants to ship me one of each I'd be glad to test it out ... ;)

toadman · Oct 8, 2015

I just did a quick search. The SSDs are quoting average latency in the 20us range. The SAS HBA is at least an extra 0.5ms. I suspect the system measured response time would be in ms as well, i.e. the system HW latency, but that should be the same in both cases.

Bottom line, I bet having that ZeusRAM behind the HBA will prove to make it perform worse than a 750.

Waco · Oct 8, 2015

You can essentially bypass the performance limitations of sync=always by running applications with high queue depths and large writes.

I have a setup at work that sustains 1.5 GB/s in sync=always mode without much trouble at all. Granted, it has 240 disks (though I was able to achieve very similar performance with only 60 disks), but still, it's possible with the right workload.

Will Dormann · Oct 8, 2015

toadman said:
Bottom line, I bet having that ZeusRAM behind the HBA will prove to make it perform worse than a 750.

Could be. However the current server that is running FreeNAS is out of available PCIe slots. A little creative re-shuffling might open up one, but given the guidance of using a mirrored SLOG (at least for SSDs) makes me think that I'd want two of them to really be safe, which is completely out of the question here.

The wheels are already in motion to procure the ZeusRAM, so I suppose the plan is to stick with that and see how that compares with the virtual ramdisk, benchmark-wise. But also beyond that, to focus on real-world performance as opposed to synthetic benchmarks. :)

I'd be quite interested to see a ZeusRAM vs. P3700 (or other NVMe) SLOG showdown, but obviously it'd take someone with both the equipment and the desire/ability to benchmark them and share the results here!

Waco · Oct 8, 2015

My wife runs a testbed at work that has a bunch of NVMe drives...no ZeusRAM though it should be similar to DRAM, no?

If there's enough interest it might make for an interesting geek night. :) It'd be ZoL, not FreeNAS, but I wouldn't expect to see drastic differences between "real" ZFS and ZFS on Linux.

HoneyBadger · Oct 8, 2015

NVMe is NAND wired to PCIe, so it would be like a seriously fast SSD.

Waco said:
It'd be ZoL, not FreeNAS, but I wouldn't expect to see drastic differences between "real" ZFS and ZFS on Linux.

I've found ZoL to be slower than FreeBSD, which in turn is slower than on Solaris.

Waco · Oct 8, 2015

HoneyBadger said:
I've found ZoL to be slower than FreeBSD, which in turn is slower than on Solaris.

I've never done a 1:1 comparison, but I do run ZoL at work on more than a few machines. No complaints really!

HoneyBadger · Oct 8, 2015

A 1:1 comparison would be fun if you've got a spare box that can run it.

Waco · Oct 8, 2015

HoneyBadger said:
A 1:1 comparison would be fun if you've got a spare box that can run it.

There's zero chance of doing a 1:1 comparison on the beefy hardware at work. It's easy to test during idle time with ZoL (since we run a distro with it installed by default), it's not easy to boot FreeNAS and test native performance (since these are diskless nodes).

Maybe I'll build a test rig to do a 1:1 comparison this winter, but it definitely wouldn't be the same caliber of hardware. :)

jixam · Oct 9, 2015

Excuse me for asking new questions. I came to the forum to start a new thread, but this one does seem appropriate.

We are going to build a server for VM storage, based on 2x Intel P3600 NVMe in a mirror and with sync=always. My question is, will we benefit from a separate ZIL? From this thread, I guess that the answer is that nobody really knows yet?

Of course, we will test things out first. But even if everything seems fine, I am worried that performance will tank in a few years. We have seen that with spinning disks, as fragmentation grows. Maybe someone knows the answer to a more generic question: does fragmentation of the zpool increase with an in-pool ZIL?

Will Dormann · Oct 9, 2015

jixam said:
We are going to build a server for VM storage, based on 2x Intel P3600 NVMe in a mirror and with sync=always. My question is, will we benefit from a separate ZIL? From this thread, I guess that the answer is that nobody really knows yet?

With my setup, I have a 4-drive stripe of SATA (6Gbps) SSD mirrors (RAID10, basically). In sync=always mode I was able to basically double the synthetically-benchmarked performance by adding a SLOG. This was a major surprise to me, as:

SSD SATA drives should be inherently fast.
The SLOG drive I added is actually slightly slower than the SSDs it is supporting.

Now, I have no experience with NVMe, but the conclusion that I drew from my experiments is that regardless of the speed of the underlying zpool, there can be a benefit to a SLOG with sync=always mode. It should be trivial for you to test (and share results here), as SLOG configurations can be changed on the fly in a non-destructive, non-interrupting manner.

cyberjock · Oct 9, 2015

toadman said:
What's interesting is he's got a pool of SSDs. The 11x performance delta is surprising. I mean if he had the same SSDs in a striped SLOG (well, two of them anyway) presumably he'd get the same 60 MB/s write bandwidth?

Totally expected.

toadman said:
I think some people have tested a setup using a ramdisk for an slog. Yes, I know such a thing isn't a production setup but it would give you an idea of your systems best possible performance with sync=always. Then you could compare alternative solutions to that max.

Yes, people have done it. And they are beyond retarded for even considering it. Literally, the slog is a copy of what is in RAM already! So why would you then make an slog in RAM? So you can store two copies of the data and have the performance penalty with writing to the slog device? Just set sync=disabled and kiss the zpool goodbye if things blow up. ;)

Anytime someone says they benchmarked an slog using a RAM device I immediately dismiss everything they say because they clearly do not understand the basics of the slog, so they clearly are going to come to incorrect conclusions about whatever information they present and likely made mistakes in their testing criteria. So with improper testing criteria coupled with a total lack of understandig of slog, why would you even *try* it? So you can have some cool benchmark numbers? Sorry, I'm more pragmatic than to look at some slog numbers that are really high just because I can.

Will Dormann said:
I started out with a SAS array with a dedicated SLOG device, and it was slow as well. To eliminate the variable of the SLOG, I tried an SSD-only array, and I had results that were about the same. I figured that an SSD-only array shouldn't need an SLOG. Is that an incorrect assumption?

It can be. There are definite sutations where an all-SSD zpool with sync writes can benefit greatly with an SLOG.

Waco said:
If there's enough interest it might make for an interesting geek night. :) It'd be ZoL, not FreeNAS, but I wouldn't expect to see drastic differences between "real" ZFS and ZFS on Linux.

That would be an incorrect assessment. They can differ, greatly.

toadman · Oct 9, 2015

cyberjock said:
Yes, people have done it. And they are beyond retarded for even considering it. Literally, the slog is a copy of what is in RAM already! So why would you then make an slog in RAM? So you can store two copies of the data and have the performance penalty with writing to the slog device? Just set sync=disabled and kiss the zpool goodbye if things blow up. ;)

Anytime someone says they benchmarked an slog using a RAM device I immediately dismiss everything they say because they clearly do not understand the basics of the slog, so they clearly are going to come to incorrect conclusions about whatever information they present and likely made mistakes in their testing criteria. So with improper testing criteria coupled with a total lack of understandig of slog, why would you even *try* it? So you can have some cool benchmark numbers? Sorry, I'm more pragmatic than to look at some slog numbers that are really high just because I can.

Putting aside the insensitive use of "retarded" (which I assume is your personal opinion and not that of iX), yep, system ram for an slog would be ill advised in any production system.

I could see implementing it for testing if you wanted to maybe look at latency through the block storage stack, which would be called (I think) if an slog is present and one needed a sync write (vs. say sync=disabled). (Although I think for all practical purposes the code stack latency would be negligible.) Or if you wanted to understand the additional system HW latency you were incurring with some sort of block slog device. Then again given the code stack latency is nil vs. us or ms latency on devices, sync=disabled would be just as good a compare for measuring HW latency. (I think.)

cyberjock · Oct 9, 2015

toadman said:
I could see implementing it for testing if you wanted to maybe look at latency through the block storage stack, which would be called (I think) if an slog is present and one needed a sync write (vs. say sync=disabled). (Although I think for all practical purposes the code stack latency would be negligible.) Or if you wanted to understand the additional system HW latency you were incurring with some sort of block slog device. Then again given the code stack latency is nil vs. us or ms latency on devices, sync=disabled would be just as good a compare for measuring HW latency. (I think.)

Exactly. If you want to compare the block storage stack your solution is to use sync=disabled versus sync=enabled with whatever hardware you want to test. You are right that if your code stack latency is even measurable your performance is so incredibly poor that you aren't looking for an slog device to help, you're looking for a way to stop the bleeding on your wrists because you have no clue what is broken and were hoping a sacrifice would fix the server. ;)

There is literally no test I can conceive of where you'd want to measure something and compare sync=disabled versus sync=enabled with a memory device, aside from seeing big numbers and getting a woody over it. There are people out there that just love seeing big numbers (two of my close friends live for that stuff). I'm more pragmatic than I was a decade ago as the numbers matter to me when I'm trying to find the upper limits for real-world applications. But there is no use, in any real-world scenario, where you'd want to do sync=enabled and then use memory device. You are just flat out better off doing sync=disabled at that point (and I have done that on occasion for testing purposes).

Yeah, 'retarded' is my personal opinion. I won't repeat what others say because some people say things even less friendly than what I said.

I never speak for iXsystems when here as that is not in my job description. I just peruse the forums to:

1. Make sure things are being prioritized properly. If 200 people have a particular problem I make sure that the higher ups know not to wait 3 months to fix it.
2. Help out if I can since I'm pretty knowledgeable in this stuff. I do have my limits though, but if I feel it is important (or I just want to know) I just ask a developer since they're a phone call away for me.
3. Flame people named toadman and jgreco because I get $1 for each flame post. ;)

toadman · Oct 9, 2015

cyberjock said:
aside from seeing big numbers and getting a woody over it.

Ha! Now that is funny! :)

I never speak for iXsystems when here as that is not in my job description. I just peruse the forums to:

1. Make sure things are being prioritized properly. If 200 people have a particular problem I make sure that the higher ups know not to wait 3 months to fix it.
2. Help out if I can since I'm pretty knowledgeable in this stuff. I do have my limits though, but if I feel it is important (or I just want to know) I just ask a developer since they're a phone call away for me.
3. Flame people named toadman and jgreco because I get $1 for each flame post. ;)

Steering people in the right direction is much appreciated. Although you should ask for a raise on #3. Should be worth $2. :)

Will Dormann · Oct 9, 2015

cyberjock said:
But there is no use, in any real-world scenario, where you'd want to do sync=enabled and then use memory device.

Offensive pejoratives aside, of course a ramdisk SLOG doesn't make sense in the real world. As the title of this thread indicates, I was curious why sync=always was very slow for me, even with an SLOG, and on a zpool that consisted of nothing but SSDs. The purpose for the ramdisk SLOG benchmark was simply to show that the SATA SLOG was the bottleneck, without shelling out for a ZeusRAM and waiting for that to arrive. Obviously, the actual ZeusRAM device will be slower than RAM directly accessible from the CPU. How much slower is yet to be determined, and that will have to wait for additional equipment to arrive.

cyberjock · Oct 9, 2015

As someone who has played with the ZeusRAM, I can tell you that they saturate their SATA/SAS connection without any problems at all. They are used in the higher-end systems that iXsystems sells, and they really can do 500MB/sec sustained. They are worth every penny if you need high throughput and don't want to be stuck with memory that will eventually fail due to write limits.

You could have validated that your slog was the bottleneck by setting sync=disabled. That is literally what you are supposed to use for diagnostic purposes as it short circuits the slog. sync=disabled basically turns the slog into an infinitely fast (but limited to RAM size) slog device.

jixam · Oct 10, 2015

Will Dormann said:
Now, I have no experience with NVMe, but the conclusion that I drew from my experiments is that regardless of the speed of the underlying zpool, there can be a benefit to a SLOG with sync=always mode. It should be trivial for you to test (and share results here), as SLOG configurations can be changed on the fly in a non-destructive, non-interrupting manner.

I will do a test with one NVMe configured for SLOG, but benchmarks can be deceiving so some good advice wouldn't hurt :).

The performance will likely be sufficient for us even without a SLOG, but I am worried that an in-pool ZIL will increase fragmentation over time.

These NVMe devices are a bit expensive and too big for SLOG, so I also wonder whether it would make sense to partition the drive, setting aside a few GB for a SLOG but using the remaining space for the zpool.

Or maybe my bad experiences with fragmentation are less of an issue today, with spacemap_histogram and no spinning disks.

HoneyBadger · Oct 10, 2015

jixam said:
I will do a test with one NVMe configured for SLOG, but benchmarks can be deceiving so some good advice wouldn't hurt :).

The performance will likely be sufficient for us even without a SLOG, but I am worried that an in-pool ZIL will increase fragmentation over time.

As the OP of this thread discovered, just because something's all-flash doesn't mean you don't benefit from SLOG. With your vdevs being NVMe, "slow" in that world might mean "only 300MB/s" but why leave performance on the table?

These NVMe devices are a bit expensive and too big for SLOG, so I also wonder whether it would make sense to partition the drive, setting aside a few GB for a SLOG but using the remaining space for the zpool.

Or maybe my bad experiences with fragmentation are less of an issue today, with spacemap_histogram and no spinning disks.

You'll basically be halving the available I/O to the drives as you'll be writing the data to the SLOG partition and then ~5s later the same data will get dumped to the vdev partition, while the SLOG is trying to ingest a new txg. Performance will suffer. Again, "suffer" might mean "only 300MB/s" but why buy a drive capable of 1GB/s and then neuter it?

jixam · Oct 11, 2015

HoneyBadger said:
You'll basically be halving the available I/O to the drives as you'll be writing the data to the SLOG partition and then ~5s later the same data will get dumped to the vdev partition, while the SLOG is trying to ingest a new txg.

Well, everything is a compromise. You are telling me to buy another device. Fine, but not using that device for the pool will also halve the available I/O and will increase fragmentation due to the pool being more full.

I can evaluate the real-world performance of the pool and add a SLOG if it is too low – in that case, I will "need" a SLOG and can easily justify it. But we generate probably just about 10MB/s and maybe the SLOG will not matter.

What worries me is that an in-pool ZIL might permanently fragment the pool so the easy fix of adding a SLOG will not actually fix things down the road. I have a feeling that I shouldn't worry because the ZIL data is very short-lived, but some confirmation would be nice :).

HoneyBadger said:
... why buy a drive capable of 1GB/s and then neuter it?

We are moving to SSD mostly to improve the performance when reading uncached data.

Important Announcement for the TrueNAS Community.

sync=always is always slow?

actually does care

Guru

Explorer

Explorer

Explorer

actually does care

Explorer

actually does care

Explorer

Dabbler

Explorer

Inactive Account

Guru

Inactive Account

Guru

Explorer

Inactive Account

Dabbler

actually does care

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "sync=always is always slow?"

Similar threads