sync=always is always slow?

Status
Not open for further replies.

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
Hi folks,

I've got FreeNAS 9.3 on a system with 10GbE and 256GB of RAM. I'm setting up iSCSI for use with ESXi, and I'm going the zvol route to get the VAAI acceleration that FreeNAS now supports. I've noticed that the "sync=always" ZFS setting takes a huge toll on performance, regardless of the speed of the underlying disks.

In my case, I've got the ZFS equivalent of a 4-drive RAID10 of SSDs. If a VM lives in the SSD-based datastore in vSphere, a simple dd test (dd if=/dev/zero of=zero bs=1M) shows that it can write at just over 60 MB/s when sync=always is set for the zpool (or zvol) in question. If I change to sync=standard, the throughput goes up to close to 650 MB/s.
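For reference, this is roughly what toggling the property and re-running the test looks like (the pool and zvol names here are placeholders for my actual layout):

zfs set sync=always tank/vmstore      # force every write to be handled as a sync write
zfs get sync tank/vmstore             # confirm the current setting
dd if=/dev/zero of=zero bs=1M         # run inside the guest VM; interrupt when done
zfs set sync=standard tank/vmstore    # honor only the sync requests the client actually sends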

Is this expected? If sync=always carries such a large performance penalty, in what cases does it make sense to use it?

Thanks.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
It's useful if you want to force all writes to be sync writes (duh?).

The catch is that most people who set sync=always also have a dedicated slog device, so you don't end up with the crippling performance hit. ;)
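For anyone following along at home, a dedicated log device can be added to (or mirrored on) an existing pool with a single command; device names below are placeholders:

zpool add tank log ada4               # single SLOG device
zpool add tank log mirror ada4 ada5   # or a mirrored SLOG for redundancy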
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
What's interesting is that he's got a pool of SSDs. The roughly 11x performance delta is surprising. I mean, if he had the same SSDs in a striped SLOG (well, two of them anyway), presumably he'd get the same 60 MB/s write bandwidth?

I don't know enough about the differences in how ZFS handles an SLOG vs. the pool itself. I'm guessing the difference is that with sync=standard most of the writes are just going to memory, hence the performance increase.
 

wreedps

Patron
Joined
Jul 22, 2015
Messages
225
What is VAAI acceleration? I am interested!!!
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
It's useful if you want to force all writes to be sync writes (duh?)

I started out with a SAS array with a dedicated SLOG device, and it was slow as well. To eliminate the variable of the SLOG, I tried an SSD-only array, and I had results that were about the same. I figured that an SSD-only array shouldn't need an SLOG. Is that an incorrect assumption?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I started out with a SAS array with a dedicated SLOG device, and it was slow as well. To eliminate the variable of the SLOG, I tried an SSD-only array, and I had results that were about the same. I figured that an SSD-only array shouldn't need an SLOG. Is that an incorrect assumption?

What are the SSDs you've been using, both for the dedicated SLOG and for the all-flash setup?

An SLOG might actually still be of value. When you're writing with sync=always to the in-pool ZIL, all of those transactions have to land on the pool vdevs before ZFS can report back. The added latency of scheduling that alongside the regular pool traffic might be enough to slow things down, as opposed to just having to write to one dedicated vdev (single drive or mirror) for an SLOG.

Just to test, have you tried making a two-drive mirror of SSDs and then attaching another one as the SLOG device?
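Something roughly along these lines, with disk names as placeholders:

zpool create testpool mirror da0 da1   # two-SSD mirror as the only data vdev
zpool add testpool log da2             # a third SSD as the dedicated SLOG
zfs set sync=always testpool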
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
I'm using 120GB Samsung 850 Pros for the SLOG drives, and the all-flash array drives are 1TB 850 Pros. The 1TB drives are inherently faster than the 120GB models (520 MB/s vs. 470 MB/s), and there are two vdevs (each a mirror) in the 1TB-drive zpool, so I couldn't imagine an SLOG helping much.

In the case of the SAS array (with SLOG) and the SSD array (without SLOG), both come in just above 60 MB/s sequential write with sync=always.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
In this case "faster" doesn't really equate to bandwidth. On sync transactions I would think latency is more important since the sending system is basically stalled waiting for the ack.

I think some people have tested a setup using a ramdisk for an SLOG. Yes, I know such a thing isn't a production setup, but it would give you an idea of your system's best possible performance with sync=always. Then you could compare alternative solutions against that maximum.
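On FreeBSD/FreeNAS that experiment can be sketched like this; sizes and device names are illustrative, and obviously this is for benchmarking only, never production:

mdconfig -a -t swap -s 16g    # creates a memory-backed disk, e.g. /dev/md0
zpool add tank log md0        # use the memory disk as the SLOG
# ...run the benchmark...
zpool remove tank md0         # log vdevs can be removed afterwards
mdconfig -d -u 0              # destroy the memory disk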
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
I'm using 120GB Samsung 850 Pros for the SLOG drives, and the all-flash array drives are 1TB 850 Pros. The 1TB drives are inherently faster than the 120GB models (520 MB/s vs. 470 MB/s), and there are two vdevs (each a mirror) in the 1TB-drive zpool, so I couldn't imagine an SLOG helping much.

In the case of the SAS array (with SLOG) and the SSD array (without SLOG), both come in just above 60 MB/s sequential write with sync=always.

I think HoneyBadger was suggesting using an SSD SLOG even on the SSD pool, due to potential latency issues with scheduling writes to the in-pool ZIL even on an all-SSD pool. It would be an interesting data point.
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
In this case "faster" doesn't really equate to bandwidth. On sync transactions I would think latency is more important since the sending system is basically stalled waiting for the ack.

I think some people have tested a setup using a ramdisk for an SLOG. Yes, I know such a thing isn't a production setup, but it would give you an idea of your system's best possible performance with sync=always. Then you could compare alternative solutions against that maximum.

Good to know. I know that the ZeusRAM device has been mentioned as a good SLOG device, but at around $3000 for the single device, it's pretty steep. https://www.hgst.com/products/solid-state-drives/zeusram-sas-ssd
My assumption was that a SLOG device was mostly to remove the contention of having the ZIL on the same spinning platters as the data. And that a good SSD should be quite appropriate. Perhaps both of those assumptions are incorrect?


I think HoneyBadger was suggesting using an SSD SLOG even on the SSD pool, due to potential latency issues with scheduling writes to the in-pool ZIL even on an all-SSD pool. It would be an interesting data point.

I should be able to perform such a test. I should also be able to compare having two (or more) independent SLOG devices against the current configuration of two ZFS-mirrored drives acting as a single SLOG device.
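In case it's useful to anyone reading later, the two layouts being compared would look roughly like this (device names are placeholders):

zpool add tank log mirror ada4 ada5   # one mirrored log vdev (current setup)
zpool add tank log ada4 ada5          # two independent log vdevs; writes get spread across them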
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
My assumption was that a SLOG device was mostly to remove the contention of having the ZIL on the same spinning platters as the data. And that a good SSD should be quite appropriate. Perhaps both of those assumptions are incorrect?
The point of the SLOG is to reduce latency for sync writes. Without an SLOG, a sync write can only be acknowledged once it has been committed to the in-pool ZIL. With an SLOG, the write can be acknowledged as soon as it lands on the SLOG. In most (if not all) cases a proper SLOG will have lower latency than the pool. In an all-SSD pool I'd imagine the delta between the two being smaller, but it depends on the speed of the SLOG versus the speed of the pool.
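One way to watch this in practice is to look at per-vdev I/O statistics while the benchmark runs; the log device shows up under its own "logs" section (pool name is a placeholder):

zpool iostat -v tank 1    # per-vdev throughput, refreshed every second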
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
I know that the ZeusRAM device has been mentioned as a good SLOG device, but at around $3000 for the single device, it's pretty steep. https://www.hgst.com/products/solid-state-drives/zeusram-sas-ssd

My assumption was that a SLOG device was mostly to remove the contention of having the ZIL on the same spinning platters as the data. And that a good SSD should be quite appropriate. Perhaps both of those assumptions are incorrect?

Yep, the Zeus is pretty dang expensive. Thinking about it, I would expect using RAM as an SLOG to yield performance similar to sync=disabled, or in your case 10-11x the bandwidth you are seeing from the pool. So $3000 for a 10x improvement. :)

On the SLOG, yes, it just provides a dedicated device for the ZIL. An SSD is appropriate, but as you are seeing on your system, it's not the holy grail. (It's clearly MUCH faster than sending sync writes to a pool of spinning disks; I can assure you that is painfully slow. I've tried it.) It sounds like latency is still an issue in the system. Not sure what else you can try; it becomes more a question of how the system is architected, i.e. what chipset, where the SSD is attached, whether it's NVMe or not, etc.
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
OK, the benchmark numbers are in! Hopefully this info will help anybody else who stumbles across this thread. The write speeds were measured with dd in a Linux guest OS, as described above. For the slower transfers I didn't let them run to completion, but the numbers should be pretty accurate. I emulated a ZeusRAM-class device by creating a ramdisk via mdconfig. All disks sit in a SAS3 enclosure, and network connectivity is via a 10GbE switch.

Baseline (sync=standard):
14364282880 bytes (14 GB) copied, 22.0575 s, 651 MB/s

SAS zpool (sync=always):

No SLOG:
3196059648 bytes (3.2 GB) copied, 83.7404 s, 38.2 MB/s

Mirrored two-drive SSD SLOG:
6866075648 bytes (6.9 GB) copied, 64.7897 s, 106 MB/s

Single SSD SLOG:
5809111040 bytes (5.8 GB) copied, 49.0787 s, 118 MB/s

Dual SSD SLOG:
4932501504 bytes (4.9 GB) copied, 38.4387 s, 128 MB/s

Quad SSD SLOG:
5913968640 bytes (5.9 GB) copied, 43.5033 s, 136 MB/s

RAM (mdconfig) SLOG:
14367580160 bytes (14 GB) copied, 25.3676 s, 566 MB/s


SSD zpool (sync=always):

No SLOG:
2445279232 bytes (2.4 GB) copied, 37.6283 s, 65.0 MB/s

Single SSD SLOG:
3949985792 bytes (3.9 GB) copied, 33.0147 s, 120 MB/s

RAM (mdconfig) SLOG:
14364041216 bytes (14 GB) copied, 25.4577 s, 564 MB/s

Conclusions
While an SLOG does help sync-write throughput, a commodity SSD is not sufficient if you want to see the full speed of your disks. Interestingly, adding an SSD-based SLOG to an SSD-based zpool still improves performance considerably. Even with a RAM-based SLOG you won't quite reach the speeds of sync=standard, but it'll be pretty close.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thanks for the testing; that's an excellent bit of data. It's what I expected to see: even with all-SSD vdevs, the I/O contention caused by the in-pool ZIL takes a significant toll on throughput.

An important note though is that not all SSDs make suitable SLOG devices. The whole point of a sync write is to guarantee that data is on stable storage, and most consumer SSDs (including that 850 Pro) are not power-failure-safe; so that storage isn't as stable as people think. If power is lost in the middle of a write, data could be at risk.

Popular high-performance models are the Intel DC S3710 for SATA and the Intel DC P3700 for NVMe. Less-expensive options include the Intel DC S3610 for SATA and Intel 750 for NVMe. (You might be noticing a pattern here.)
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
An important note though is that not all SSDs make suitable SLOG devices. The whole point of a sync write is to guarantee that data is on stable storage, and most consumer SSDs (including that 850 Pro) are not power-failure-safe; so that storage isn't as stable as people think. If power is lost in the middle of a write, data could be at risk.

Good to know! Thanks.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Conclusions
While an SLOG does help sync-write throughput, a commodity SSD is not sufficient if you want to see the full speed of your disks. Interestingly, adding an SSD-based SLOG to an SSD-based zpool still improves performance considerably. Even with a RAM-based SLOG you won't quite reach the speeds of sync=standard, but it'll be pretty close.

Excellent. Thanks for posting those results!

Obviously the SATA SSD as SLOG still adds a massive amount of latency. I'd be curious what the improvement would be with an NVMe drive attached directly to the CPU; it should improve substantially, since you'd be cutting both the physical hardware latency and the software-stack latency.
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
An important note though is that not all SSDs make suitable SLOG devices. The whole point of a sync write is to guarantee that data is on stable storage, and most consumer SSDs (including that 850 Pro) are not power-failure-safe; so that storage isn't as stable as people think. If power is lost in the middle of a write, data could be at risk.

I've been noodling on this for a bit, and something isn't entirely clear. Are such non-power-safe devices unsuitable only as SLOG devices, or are they unsuitable anywhere in a zpool, e.g. as main storage in a data vdev? Oracle seems to indicate: "ZFS is designed to work correctly whether or not the disk write caches are enabled. This is achieved through explicit cache flush requests." That seems to imply that an 850-based (or similar) vdev shouldn't be at risk in the event of power loss or a kernel panic. Is this correct?

If that is correct, then why would such a device be inappropriate for an SLOG? If it is not correct (i.e. 850s and similar are not safe to use), then wouldn't that apply to spinning-rust drives as well? Pretty much every drive ships with some sort of write cache, no?

I found a thread asking similar questions, but it seems to be unanswered. Thoughts?

Also, we should be getting a ZeusRAM in at some point, at which time I should be able to update my benchmark numbers with its results. I don't expect it to be as fast as the ramdisk I used for the earlier testing, since it will be talked to over SAS2 rather than sitting directly on the CPU; the big question is how much of a difference that makes.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hardcore ZFS guys, feel free to correct me here:

The reason non-power-safe devices with volatile caches, be they spinning rust or cheap SSDs, are OK in a data vdev is that ZFS will validate and replay any "lost" data from the ZIL if there's a power failure. And the ZIL has a copy of that data because you had sync=always enabled.

That's why it's so critical that the SLOG is capable of flushing its volatile write cache even without external power. The power-safe disks are able to immediately return "operation complete" when they're asked to flush, because they know they have enough internal power to dump their volatile DRAM to stable NAND.

The ultimate answer is "If you don't have sync writes enabled, there is always a risk of data loss."

Edit: I know it sounds counterintuitive, but consider looking at an Intel DC P3700 NVMe SSD first. They're much less costly than a ZeusRAM (~USD$900 for the 400GB model vs USD$3000) and the lack of SAS overhead may make them comparable in latency numbers. It is still NAND that will wear out though, and requires a PCIe slot. Just giving you options.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Edit: I know it sounds counterintuitive, but consider looking at an Intel DC P3700 NVMe SSD first. They're much less costly than a ZeusRAM (~USD$900 for the 400GB model vs USD$3000) and the lack of SAS overhead may make them comparable in latency numbers. It is still NAND that will wear out though, and requires a PCIe slot. Just giving you options.

That would be an interesting comparison. It's not counterintuitive when you consider that the quest is for the lowest latency, and a 750 hanging off the CPU should have less system latency than a Zeus hanging off a SAS HBA (I think). Obviously the Zeus would have far less device latency.

If I get some time I might pencil out some inherent latency numbers and see if a first-order approximation is possible (without knowing all the other system variables).
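As a purely back-of-the-envelope sketch, assuming the sync writes go out as roughly 128 KiB I/Os issued one at a time:

128 KiB / 120 MB/s ≈ 1.1 ms effective round trip per write (roughly what the SATA SSD SLOG delivers)
128 KiB / 650 MB/s ≈ 0.2 ms round trip needed to match the sync=standard baseline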
 