Sync writes with SLOG *slower* than without SLOG???

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
I'm testing a new TrueNAS build for eventual use as an ESXi iSCSI datastore. I was testing the following pool configurations and found some surprising results:
  1. 2 Intel P5510 3.84TB mirrored + no SLOG
  2. 2 Intel P5510 3.84TB mirrored + 1 Intel P5510 3.84TB as SLOG
When testing the ZVOL directly with "fio" on the TrueNAS box:
  • Sequential writes: "No SLOG" config is 20-31% faster
  • Random writes: "No SLOG" config is 6-55% faster
When testing from a Windows 2019 VM that's writing to a datastore on the pool:
  • Sequential writes: "No SLOG" config is 23-53% faster (probably even faster but my 2x10GbE NICs are a bottleneck here)
  • Random writes: "No SLOG" config is about the same
Notes:
  • Both pools were set up with sync=always, compression=lz4, atime=off, recordsize=128K, volblocksize=64K
  • Windows tests used CrystalDiskMark 8.0.2 in "NVMe SSD" mode
  • Fio tests were created that mimic the above CDM tests (a sketch of one is below)
  • I'm not planning on using a P5510 as a SLOG drive in production but that's what I had on hand to test with.
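For reference, a rough sketch of the setup and one of the fio runs (the pool/zvol names "tank" and "tank/vmstore" are placeholders, not my actual ones; on TrueNAS Core you'd swap libaio for posixaio):

  # Pool/zvol properties as listed above
  zfs set sync=always compression=lz4 atime=off recordsize=128K tank
  zfs create -s -V 1T -o volblocksize=64K tank/vmstore

  # Approximation of CDM's "SEQ 1M Q8T1" write test, run directly against the zvol
  fio --name=seq1m_q8t1 --filename=/dev/zvol/tank/vmstore \
    --rw=write --bs=1M --iodepth=8 --numjobs=1 \
    --ioengine=libaio --direct=1 --time_based --runtime=60 --group_reporting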
Does it make sense that the config with a SLOG was slower than the config without a SLOG?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,398
Yes, because for SLOG, you want something significantly faster than your pool, and sync=disabled. An SLOG is NOT a write cache, as @jgreco is fond of saying. It's an indirect write journal, which allows the pool to immediately acknowledge writes as they occur, and then complete the writes in the background. Ideally, your SLOG will also have power loss protection, to prevent loss of journaled writes and possible pool corruption from missing writes in the case of a power failure.
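If you want to keep experimenting, log vdevs can be added and removed without rebuilding the pool; a minimal sketch, assuming a pool named tank and a placeholder device name:

  # Attach a dedicated SLOG (log vdev) to an existing pool
  zpool add tank log nvd2
  # Log vdevs are removable, so this kind of trial is cheap to undo
  zpool remove tank nvd2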
 
Last edited:

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
Yes, because for SLOG, you want something significantly faster than your pool, and sync=disabled. An SLOG is NOT a write cache, as @jgreco is fond of saying. It's an indirect write journal, which allows the pool to immediately acknowledge writes as they occur, and then complete the writes in the background. Ideally, your SLOG will also have power loss protection, to prevent loss of journaled writes and possible pool corruption from missing writes in the case of a power failure.
In theory, I get how a "significantly faster" disk would be needed.

However, in practice:
1) The write latency on these drives is 15 microseconds, so the only way to top that would be to get an Optane P5800X (5 microsecond write latency + higher write IOPS). I'm wondering if the improvement would be significant enough to be worth the $$$. (Anyone got a P5800X they can loan me for a few days? :smile:)
2) I figured there would be some benefit to the data drives not having to handle ZIL responsibilities at the same time they're writing data blocks, but I would never have guessed the "with SLOG" performance would be worse.

(Side note: I did briefly try a 100 GB P4801X Optane drive as the SLOG, thinking that its marginally better IOPS and write latency would help, but it fared significantly worse than my test case where I used the P5510 as the SLOG drive.)

I didn't understand the part about "sync=disabled". At first I thought you were suggesting a way to set only the SLOG drive to "sync=disabled". But as far as I know, sync is a pool-level setting, so I don't know how to have sync=always on the data drives and sync=disabled on the SLOG drive. Can you please clarify?

And yes, whichever drives (whether a dedicated SLOG or the data drives) are handling ZIL responsibilities absolutely need to have power loss protection (the P5510s do). That's a must.

Oh, and @jgreco absolutely rocks! :smile:
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,110
You're in the somewhat pathological situation where the ZIL provided by your pool (2x P5510, striped for ZIL purposes) is faster than your SLOG (a single P5510), so I suppose it does make sense that no SLOG is faster. An Optane P5800X would need to beat a stripe of P5510s to be an improvement; or you'd need to stripe Optane drives as the SLOG.
It's probably easier to just go without a SLOG, especially as your data drives have PLP. Even better: get a fourth P5510 for a stripe of mirrors.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,944
sync=disabled means that TrueNAS reports the data as written immediately, although there is a data risk.
sync=always means that TrueNAS waits until the data is written before it reports that it's written.
With a SLOG, sync=always is still required, but TrueNAS only waits until the data is written to the SLOG. Thus the SLOG must be much faster than the pool to have an effect.
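In property terms, a minimal sketch (the zvol name is a placeholder):

  zfs set sync=disabled tank/vmstore  # ack immediately; in-flight sync writes are lost on power failure
  zfs set sync=always tank/vmstore    # every write hits the ZIL (or SLOG, if present) before the ack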
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
You're in the somewhat pathological situation where the ZIL provided by your pool (2x P5510, striped for ZIL purposes) is faster than your SLOG (a single P5510), so I suppose it does make sense that no SLOG is faster. An Optane P5800X would need to beat a stripe of P5510s to be an improvement; or you'd need to stripe Optane drives as the SLOG.
Interesting... so in the "no SLOG" case, the ZIL is striped across the two data disks? I thought the ZIL would be mirrored.

That said, I *did* do a similar set of "fio" tests with a single P5510 and no SLOG vs. a single P5510 plus a single P5510 SLOG, and the "no SLOG" case won by similar margins.

It's probably easier to just go without a SLOG, especially as your data drives have PLP. Even better: get a fourth P5510 for a stripe of mirrors.
Are you referring to a "RAID 10" kind of config? (I always get confused about "stripe of mirrors" vs. "mirror of stripes".) If so, that's the first thing that crossed my mind when I saw those stats.
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
sync=disabled means that TrueNAS reports the data as written immediately, although there is a data risk.
sync=always means that TrueNAS waits until the data is written before it reports that it's written.
With a SLOG, sync=always is still required, but TrueNAS only waits until the data is written to the SLOG.
Yes... totally understand. For my use case, data risk is not acceptable, so "sync=always" is a must. And in past builds with SATA SSDs, the SLOG definitely helped with sync write performance. So I was a little confused by Samuel Tai's comment about "sync=disabled".
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,110
As far as I understand, the ZIL is always striped, regardless of the pool geometry. The goal is to be as fast as possible (it was devised in the days of spinning rust…); redundancy is less critical, since the ZIL is only ever read in the rare case where the pool is imported in an unclean state.

"Stripe of mirrors" is indeed "RAID 10-like". That may be overkill, but since a SLOG does not help in your case you're already halfway from one mirror to two mirrors for more performance (and an even faster ZIL while keeping "sync=always"). ZFS doesn't do "mirrors of stripes", as redundancy is at vdev level, not at pool level.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,398
Apologies for the confusion. It's always dangerous to rely on memory without referring to an actual system. On my system with an SLOG, I have sync=standard. For an explanation of the options, see http://milek.blogspot.com/2010/05/zfs-synchronous-vs-asynchronous-io.html. With sync=always, you're waiting on sync for both the ZIL write and the pool write.
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
As far as I understand, the ZIL is always striped, regardless of the pool geometry. The goal is to be as fast as possible (it was devised in the days of spinning rust…); redundancy is less critical, since the ZIL is only ever read in the rare case where the pool is imported in an unclean state.
Got it. This seems reasonable to me.

However, I'm still confused why the single data drive / no SLOG case is faster than the single data drive with a SLOG.

"Stripe of mirrors" is indeed "RAID 10-like". That may be overkill, but since a SLOG does not help in your case you're already halfway from one mirror to two mirrors for more performance (and an even faster ZIL while keeping "sync=always"). ZFS doesn't do "mirrors of stripes", as redundancy is at vdev level, not at pool level.
That makes sense and I think it's likely where I'll end up. Thanks!
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
Apologies for the confusion. It's always dangerous to rely on memory without referring to an actual system. On my system with an SLOG, I have sync=standard. For an explanation of the options, see http://milek.blogspot.com/2010/05/zfs-synchronous-vs-asynchronous-io.html. With sync=always, you're waiting on sync for both the ZIL write and the pool write.
No worries. But just to clarify, there's no way to have the data drives use sync=always and the SLOG use a different sync setting, right?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,398
No worries. But just to clarify, there's no way to have the data drives use sync=always and the SLOG use a different sync setting, right?

I don't believe so, although sync=standard tries to do the "right" thing.
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
I don't think my original question got answered, so let me try again with a simpler scenario, with no mirrors:
  • Config 1 = 1 Intel P5510 3.84TB + no SLOG
  • Config 2 = 1 Intel P5510 3.84TB + 1 Intel P5510 3.84TB as SLOG
When testing the ZVOL directly with "fio" on the TrueNAS box:
  • Write SEQ 1M Q8T1: "No SLOG" is 10% faster
  • Write SEQ 128K Q32T1: "No SLOG" is 9% faster
  • Write RND 64K Q32T16: "No SLOG" is 107% faster
  • Write RND 64K Q1T1: "No SLOG" is 12% faster
How is this possible?
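For reference, the worst case above ("RND 64K Q32T16") was approximated along these lines (the zvol path is a placeholder):

  fio --name=rnd64k_q32t16 --filename=/dev/zvol/tank/vmstore \
    --rw=randwrite --bs=64K --iodepth=32 --numjobs=16 \
    --ioengine=libaio --direct=1 --time_based --runtime=60 --group_reporting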
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
But just to clarify, there's no way to have the data drives use sync=always and the SLOG use a different sync setting, right?
The sync setting is per dataset/ZVOL, not (only) pool-wide, and you can't have it not apply to the SLOG if it applies to any dataset/ZVOL in the pool that SLOG is part of, since that setting is exactly and only about how the ZIL (which in this case is your SLOG) is handled.
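In other words, sync lives on the datasets/ZVOLs, and the SLOG just services whatever ZIL traffic those settings generate. A sketch with placeholder names:

  zfs set sync=always tank/vmstore    # the iSCSI zvol: every write goes through the ZIL/SLOG
  zfs set sync=standard tank/scratch  # another dataset: only app-requested syncs do
  zfs get -r sync tank                # the log vdev itself has no sync property to set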

As already recommended, you're probably best off just dropping the idea of a SLOG, as your pool would be sufficiently protected and isn't going to be faster (as you already proved).
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Just in case you were looking for "what's the fastest SLOG device at the mo"... here are a few pages' worth of benchmarked drives.

TL;DR: Optanes generally seem to be top of the pile, with some cheap (and old) RMS-[23]00 PCIe devices also giving solid performance. I now use a really small (8 GiB) RMS-200, and as I don't get many writes larger than that, I can basically keep my 10GbE fully occupied on writes up to that size even with sync=always, even though the pool is only 2x2 4TiB Red Pluses.
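If you want to gauge a candidate device yourself, a single-queue sync-write test approximates ZIL-style traffic reasonably well. A sketch (the device path is a placeholder, and this overwrites whatever is on the device):

  # WARNING: destructive; only point this at a blank test device
  fio --name=slogtest --filename=/dev/nvd4 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 \
    --ioengine=posixaio --sync=1 --time_based --runtime=30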

Also, @Samuel Tai: if you're going to use an SLOG with decent write speed, then anything other than sync=always makes no sense, and invalidates having the separate device in the first place.

Hope this helps,

Kai.
 