SLOG on 2x Optane - mirrored vs striped SLOG peculiarity

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
:cool: Uh oh, am I in the sin bin?
[Edit]
Just to clarify - I didn't mean to draw parallels to Jock's prickly prose, but rather to the invaluable quality of your content. Apologies if this came across the wrong way.
[/Edit]
 
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
OK, let's get serious for a minute here...

I went back to the start and you're talking about SLOG and wanting to speed things up but not being worried about losing stuff...

That's a conflict that's easily resolved by simply changing the approach to match your parameters.

Set the dataset(s) to sync=disabled and forget about SLOG. Writes will go to RAM and not wait for the commit-to-disk notification, so they will run at the fastest possible speed... with the risk of data loss in the scenarios already discussed here.
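If it helps, that's literally a one-liner from the shell. A minimal sketch, assuming a pool called tank and a dataset called scratch (substitute your own names):

Code:
# Disable sync writes for the dataset in question
zfs set sync=disabled tank/scratch
# Confirm it took
zfs get sync tank/scratch
# To go back to the default behaviour later
zfs inherit sync tank/scratch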

I hope that answers you as you wished.
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
While I haven't tested this, are you sure the 5 second interval doesn't affect large async transfers without tuning?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The 5 second interval is about flushing the slog to the pool. I think Pool-resident ZIL isn't required to confirm async writes... maybe somebody knows better.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Uh?

The 5 second interval is about flushing the slog to the pool.

Nope. SLOG writes are always lock-step with the sync write requests. There is never a read of the SLOG, except at pool import time, so "flushing the slog to the pool" is not a thing that happens. You're confusing a few different interactions here, sorry.

I think Pool-resident ZIL isn't required to confirm async writes...

No SLOG or ZIL is required to "confirm" ASYNC writes. Async writes always ONLY go to the in-memory transaction group that is being built out, to be flushed out with the txg.

ANY synchronous write HAS to be committed to the ZIL in order to guarantee compliance with the POSIX requirement to be "committed to stable storage" (I believe is the wording used). This can be the in-pool ZIL space, or a separate SLOG device holding the pool's ZIL.

But those statements may be hard to understand if your concept of ZIL/SLOG is incorrect.

There are basically two things happening in parallel.

Every single write, no matter what, gets written into main memory as part of the currently building transaction group. This will be flushed out to the pool when the transaction group size limit is reached, or after five seconds, whichever comes first. This bit has NOTHING to do with ZIL/SLOG. It is how data is written to the pool.

Additionally, because metadata often MUST be written synchronously, and because sometimes we ALSO want file data written synchronously, ZFS has the ZIL mechanism. This "intent log" will commit a write to disk (whether in-pool ZIL or external SLOG) before returning a success status to the operating system. This data is NEVER read during normal operation, and is ONLY used at pool import time to verify and, if needed, recover the few seconds of activity that was in-flight at the time the pool stopped. Which should only be a concern if the filer crashed or lost power.

Writes not tagged for SYNC are not written to the ZIL, and ONLY go to the transaction group mechanism.

This should dispel any misunderstanding as to the SLOG being some sort of write cache; it isn't. It also isn't needed for most other use cases. If you are trying to manipulate thousands of small files quickly, or doing database work, or running VMs, then there are definitely a lot of factors to consider for both SLOG and ARC/L2ARC sizing.
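If you want to watch the two processes for yourself, here's a quick sketch (pool and device names are made up for illustration): sync writes land on the log vdev immediately, while the txg flush hits the data vdevs a few seconds later.

Code:
# Per-vdev view of writes, refreshed every second; log vdevs are listed separately
zpool iostat -v tank 1

# And since the thread title asks about it, the two SLOG layouts look like this:
zpool add tank log mirror nvd0 nvd1   # mirrored SLOG - survives the loss of one device
zpool add tank log nvd0 nvd1          # two separate (striped) log vdevs - more throughput, less protection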
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Nope. SLOG writes are always lock-step with the sync write requests.
I get your point... The memory resident copy is what's flushed, but the SLOG (and memory) won't be allowed to build up beyond that point if the flush can't happen.

No SLOG or ZIL is required to "confirm" ASYNC writes.
OK, that's what I was trying to say.

Writes not tagged for SYNC are not written to the ZIL, and ONLY go to the transaction group mechanism.
Perfect... that would support the recommendation I made.

In general, thanks for the clarity.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
SLOG (and memory) won't be allowed to build up

I think you get the point, but for the rest of the audience, I'm going to say this is poorly stated, in that no transactions to the SLOG or in-pool ZIL can "build up" in the manner this could be read to imply. The ZIL transactions are always lock-step. You are correct in that transaction groups, the things in memory, are not allowed to build up to arbitrarily large amounts of stuff to write to disk. I find it easiest to see these in my mind as two completely parallel and nearly disconnected processes.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
So question again, is it possible to devote 320GB of 384GB RAM to dirty data to smooth out high burst data dumps please?
In a word, "yes" but you'll also need to adjust zfs_txg_timeout or the 5s default timer will trip you up well before that 320GB threshold is reached.


Bear in mind that you're opening yourself up to a potentially huge world of hurt in terms of data loss, as well as some "unexpected edge case" types of behaviour when ZFS needs to flush 100GB+ to stable vdevs and still occasionally serve reads.
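For anyone wanting to try it, a rough sketch of the knobs involved; the 320GB figure is just the number from the question, not a recommendation, and the exact paths differ between Linux-based and FreeBSD-based builds:

Code:
# Linux (OpenZFS module parameters)
echo 343597383680 > /sys/module/zfs/parameters/zfs_dirty_data_max   # ~320GB of dirty data
echo 300 > /sys/module/zfs/parameters/zfs_txg_timeout               # stretch the 5s txg timer

# FreeBSD-based equivalents
sysctl vfs.zfs.dirty_data_max=343597383680
sysctl vfs.zfs.txg.timeout=300

# Note: dirty_data_max is capped by zfs_dirty_data_max_max / vfs.zfs.dirty_data_max_max,
# which can only be raised at module load / boot (loader) time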
 

bonox

Dabbler
Joined
May 2, 2021
Messages
17
That transaction timeout would explain why it would appear not to work so far. Thanks HoneyBadger. At this stage I think I'm really just having fun playing with lose-able data and trying to maximise performance for my use case. Optane is pointless here anyway. It's a 1RU server with no free PCIe slots, as the three are already taken by the HBAs and NICs. I'll have a further dig around the transaction parameters and have a fiddle. Are there any others you know of that I should be looking at in addition to this one and the throttle params?

Sounds like it could be worth prioritising zfs_vdev_async_write_min_active under the scheduler banner, but since the demand on the pool is pretty much just writes, there may not be any point in changing from the defaults. Maybe zfs_delay_scale as well. I'll probably sort that out with some benchmarking.
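If it helps anyone following along, a quick sketch of peeking at and poking those two; the values here are purely illustrative, so benchmark before and after:

Code:
# Linux paths shown; on FreeBSD-based builds the matching sysctls are
# vfs.zfs.vdev.async_write_min_active and vfs.zfs.delay_scale
cat /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
cat /sys/module/zfs/parameters/zfs_delay_scale

echo 5 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
echo 500000 > /sys/module/zfs/parameters/zfs_delay_scale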
 
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I think we need to go back to what @jgreco was saying... I don't think the txg timeouts/flushing even apply if you're using sync=disabled.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
Uh?

Nope. SLOG writes are always lock-step with the sync write requests. There is never a read of the SLOG, except at pool import time, so "flushing the slog to the pool" is not a thing that happens. You're confusing a few different interactions here, sorry.

No SLOG or ZIL is required to "confirm" ASYNC writes. Async writes always ONLY go to the in-memory transaction group that is being built out, to be flushed out with the txg.

ANY synchronous write HAS to be committed to the ZIL in order to guarantee compliance with the POSIX requirement to be "committed to stable storage" (I believe is the wording used). This can be the in-pool ZIL space, or a separate SLOG device holding the pool's ZIL.

But those statements may be hard to understand if your concept of ZIL/SLOG is incorrect.

There are basically two things happening in parallel.

[...]
"We want pics!"

 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
ZFS sync/async + ZIL/SLOG, explained – JRS Systems: the blog
That looks suspiciously like the Ars Technica article I've quoted a number of times in this forum already... not sure which one came first or where the diagrams are from.

See page 3
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I think we need to go back to what @jgreco was saying... I don't think the txg timeouts/flushing even apply if you're using sync=disabled.

sync=disabled merely disables (mostly) the ZIL/SLOG mechanism. This leaves the transaction group building process as the primary driver of performance.


Things have changed a bit in the nearly ten years since I wrote that, with ZFS getting substantially better at learning about pool performance, and transaction group sizing being augmented with the ZFS write throttle.

ZFS fundamentally should not be expected to write to pools at a speed that exceeds the pool's capabilities.

If we look at the numbers the poster is contemplating, 320GB, and consider what would happen... if the current transaction group is 160GB and the flushing transaction group is 160GB, we have to wonder about how long that would take to write to disk. Your average disk drive probably cannot sustain over 150MBytes/sec, and that only on a mostly-empty pool, so in order to flush out 160GB in a very charitable 60 second-sized txg timeout, you still need to write about 3GBytes/sec, or 20 mirror vdevs. If any significant number of seeks are introduced, your filer's write rate will drop FAR below 150MBytes/sec. Since ZFS seizes up when it is blocked waiting for a txg to flush, this introduces a catastrophic scenario where you could lock up for many minutes or even an hour.

This is why I usually drive the bus in the other direction... I lower the transaction group size, and instead focus on the pool performance. My interest is in NEVER seeing the pool I/O wedge, and accepting somewhat slower I/O by reducing the transaction group size is a reasonable way to do that. This makes you focus on the problem from the more realistic angle of making sure the pool can soak up the I/O loads you're actually sending to ZFS.
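As a rough illustration of that direction on a current FreeBSD-based build (values purely illustrative; the right numbers depend entirely on your pool):

Code:
# Keep each txg small enough that the pool can absorb it quickly
sysctl vfs.zfs.dirty_data_max=1073741824   # e.g. 1GB instead of the default ~10% of RAM
# Optionally flush sooner than the 5 second default as well
sysctl vfs.zfs.txg.timeout=2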

The interactions with the write throttle and other factors in modern ZFS may make the sort of "large buffer subterfuge" you are attempting more difficult.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
That looks suspiciously like the Ars Technica article I've quoted a number of times in this forum already... not sure which one came first or where the diagrams are from. [...]
It's the same author: Jim Salter.
 

bonox

Dabbler
Joined
May 2, 2021
Messages
17
The interactions with the write throttle and other factors in modern ZFS may make the sort of "large buffer subterfuge" you are attempting more difficult.

Acknowledged, and appreciate the detailed thoughts.

It's possibly not as bad as you think it is. While I don't have 20 vdevs, there are 6 vdevs and 30 disks on this pool, and it really and truly is a write-only pool until something goes wrong and requires me to find a previous file, at which point all the writes stop anyway. Throughput is far more important than latency in this case. I could handle a 5 minute transaction timeout, and 'lockups' don't matter since they shouldn't really occur with one write stream and no other activity on the pool. A huge advantage of a one- or two-user base, who both work together on the same problem.

And I'm entertaining myself playing with filesystem performance in quieter moments for what I know is a weird edge case. But what's life without the odd challenge here and there ;) And if the tweaks don't work, I'll try something else.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It isn't "one write stream" that's a factor here. ZFS will prefer to lay down one or a dozen write streams into contiguous free space where possible. The big factor is what happens when fragmentation occurs, and impacts the number of seeks that need to be done as part of a transaction group commit. A sudden uptick in the number of seeks required will cause a "lockup" as it suddenly takes the system a lot longer to write the txg than anticipated, and the system will tend to learn that and reduce throughput in the future, anticipating it. Your only real fix to that is to have lots of free space (think 50-80% of pool free).
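A quick way to keep an eye on both of those, for what it's worth (the pool name here is a placeholder):

Code:
# FRAG = free-space fragmentation, CAP = how full the pool is
zpool list -o name,size,alloc,free,frag,cap tank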
 

bonox

Dabbler
Joined
May 2, 2021
Messages
17
Perhaps I should have said only one file is being sent to the filer at a time. The compute server feeding the filer is sequential and produces one output file at a time. What the filer does under the covers is its own business, I guess, from my perspective.

The pool tends to get used over a week, and then all data is deleted and the next week starts again. So every week goes from 100% free to about 25% free with what I would hope is very little fragmentation (it's rare to do specific file deletes during the week - I'd just delete everything instead). Does that count as an anti-fragmentation methodology, or is the only way to reset it to destroy and recreate the pool?
 
Last edited:

bonox

Dabbler
Joined
May 2, 2021
Messages
17
Well, it works beautifully with a txg timeout of 300 seconds, 320GB max dirty data, and a delay scale of 500000. Copies run at pretty much 10Gb wire speed for five minutes and the sender is back to work again. It takes another 12-ish minutes for the NAS to finish writing. So a perfect result on the intent. I think over time I'll try shortening the txg timeout to as short as I can get away with and keep an eye on the dtrace results to see where the changes shift things.
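For anyone who lands here later, that translates to something like the following on a FreeBSD-based box (a sketch of my settings, not a general recommendation; treat the numbers as specific to this odd workload):

Code:
sysctl vfs.zfs.txg.timeout=300                 # txg timer stretched to 5 minutes
sysctl vfs.zfs.dirty_data_max=343597383680     # ~320GB of dirty data allowed
sysctl vfs.zfs.delay_scale=500000              # write throttle delay scale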

Appreciate the help and comments all - especially that txg timeout value - none of the dirty data values make any real difference without that one on board as well. Not out of the woods yet by any means, but have a shining halo of a goal now with opportunity to wind the outstanding writes back a bit.
 