Sync writes vs SLOG

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
Can somebody explain to me how the performance impact of sync writes works?
We all know that sync writes have a significant impact on write speed. Something like 5MB/s is quite normal with sync writes and without a SLOG.
Now when we add a SLOG, all writes go to the SLOG, which in turn writes the data every 5 sec to the pool. Default setting of the SLOG is around 4GB.

Now let's assume we have a 24-disk SAS pool with an NVMe SLOG and sync writes enabled on the zpool. Does this mean the SLOG will write the data to the SAS pool at 5MB/s, or do sync writes only burden the SLOG (performance impact on the SLOG) while full performance is gained on the SAS pool?

If the performance impact is still on the SAS pool, then the 4GB SLOG dump to the pool would take about 13 minutes at 5MB/s (4GB / 5MB/s ≈ 800 seconds). I cannot imagine this is the case. Hope somebody can clear this up.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Now when we add a SLOG, all writes go to the SLOG, which in turn writes the data every 5 sec to the pool.

This is incorrect. You are mistaking the SLOG for a cache; it is not, it is a log. Your write cache is main system RAM, and the transaction groups built there are flushed from the write cache to the pool every five seconds.
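As an aside, whether a write is treated as sync at all is a per-dataset property, not a pool or SLOG setting. A minimal sketch of checking and forcing it, with placeholder pool/dataset names:

```python
import subprocess

# inspect how sync requests are honoured on a dataset
# ("tank/vms" is a placeholder dataset name)
subprocess.run(["zfs", "get", "sync", "tank/vms"])

# sync=standard honours whatever the client asks for; sync=always pushes every
# write through the ZIL (and thus the SLOG, if one is attached); sync=disabled
# ignores sync requests entirely -- dangerous for VM storage
subprocess.run(["zfs", "set", "sync=always", "tank/vms"])
```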

Please see


Once you do, you will hopefully understand why

If the performance impact is still on the SAS pool, then the 4GB SLOG dump to the pool would take about 13 minutes at 5MB/s (4GB / 5MB/s ≈ 800 seconds). I cannot imagine this is the case. Hope somebody can clear this up.

this is gibberish.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
And why would a 24-disk SAS pool deliver only 5 MB/s sustained writes? 11× 2-disk mirrors with two hot spares will definitely do way more.
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
Uh… no, I'm not saying the SLOG is a cache. Write blocks are not kept in memory only, else you'd have a fun time repairing your pool after a power loss. The SLOG is used as a transaction log device. And every 5 seconds it dumps its data permanently to the pool.
But you probably know more about this, seeing your explanation in the link! Thanks, I'm going to read it.
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
And why would a 24-disk SAS pool deliver only 5 MB/s sustained writes? 11× 2-disk mirrors with two hot spares will definitely do way more.
You don't know my requirements. So please try a RAIDZ2 without a SLOG and sync writes enabled.

Sorry, but I'm not talking about what performance impact you get from sync writes. Of course it will depend on which vdev config and which disks you use. I was just wondering how it all relates. But I understand now how a SLOG device benefits sync writes.
 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Uh… no, I'm not saying the SLOG is a cache. […] And every 5 seconds it dumps its data permanently to the pool.

And I'm saying you're wrong. It's a log. It is never read from during normal operations, only at boot time during pool import. A cache would generally have 50% reads and 50% writes on some sort of FIFO or other similar basis. A SLOG has 99.999+% writes. No reads during normal operations.

Understanding what is really going on here is key to understanding how system performance is impacted. Once you get that the write cache flushes from main memory, the only question really becomes how much sync write operations hurt you. The pool writes are already happening at the fastest rate allowed by the pool. They always flush out from the committing transaction group to the pool at max speed.
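You can watch this for yourself: sample per-vdev I/O and note that the log device's read column stays at zero while writes flow. A sketch; "tank" is a placeholder pool name:

```python
import subprocess

# sample per-vdev I/O once per second, five times; during normal operation the
# log vdev should show essentially zero reads next to a steady stream of writes
subprocess.run(["zpool", "iostat", "-v", "tank", "1", "5"])
```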
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
You don't know my requirements. So please try a RAIDZ2 without a SLOG and sync writes enabled.
A 24-disk-wide RAIDZ2 is more than "strongly discouraged". As a rule of thumb you get the sustained write performance of a single disk per vdev.
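To put rough numbers on that rule of thumb (the per-disk figure is an assumption, not a measurement):

```python
# instantiating the rule of thumb above; the per-disk figure is an assumption
disk_mb_s = 150  # assumed sustained sequential write speed of one SAS disk

print("24-wide RAIDZ2 (1 vdev)      :", 1 * disk_mb_s, "MB/s")
print("11x 2-disk mirrors (11 vdevs):", 11 * disk_mb_s, "MB/s")
```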
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
This is incorrect. You are mistaking the SLOG for a cache; it is not, it is a log. Your write cache is main system RAM, and the transaction groups built there are flushed from the write cache to the pool every five seconds.

Please see


Once you do, you will hopefully understand why



this is gibberish.
But I'm not sure about "Your write cache is main system RAM". That is true, but I also understood that what sits in RAM is a copy of what is written to the SLOG. And that with sync writes enabled, the ACK is only sent when the transaction has completed to the SLOG, not merely to memory.

We will be using two Radian RMS-200 cards for SLOG and 3× Samsung 983 DCT 1.9TB NVMe for L2ARC. So the RMS-200 in particular should perform very nicely with sync writes enabled.
 
Last edited:

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
A 24-disk-wide RAIDZ2 is more than "strongly discouraged". As a rule of thumb you get the sustained write performance of a single disk per vdev.
Where have you read this? Performance-wise you are right. But when it comes to vital data, RAIDZ2 is the way to go. Ever had a double or triple disk failure? We have had that with FreeNAS 9.3! We are planning to configure the 24 disks as 3× 8-disk RAIDZ2 vdevs and add two hot spares. (Yes, that is 26 disks.)
 
Last edited:

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
But I'm not sure about "Your write cache is main system RAM". That is true, but I also understood that what sits in RAM is a copy of what is written to the SLOG. And that with sync writes enabled, the ACK is only sent when the transaction has completed to the SLOG, not merely to memory.
Async writes are cached in memory if possible. Sync writes don't return from their respective system call unless they are written to stable storage. Which in our case is your pool.

The data to be written is in memory. The write operation is acknowledged once the memory has been flushed to disk. If it was async it would be ack'ed immediately without being written.
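You can see that difference from the application's side with something like this (a rough sketch; the test path is a placeholder and should live on the pool you are testing):

```python
import os, time

PATH = "/mnt/tank/synctest.bin"  # placeholder: a file on the pool under test
BLOCK = b"\0" * 4096             # 4 KiB writes, the block sizes that hurt most
COUNT = 500

def run(flags, label):
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | flags, 0o644)
    t0 = time.monotonic()
    for _ in range(COUNT):
        os.write(fd, BLOCK)
    os.close(fd)
    mb = COUNT * len(BLOCK) / 1e6
    print(f"{label}: {mb / (time.monotonic() - t0):.1f} MB/s")

# async: write() returns as soon as the data is in the in-memory write cache
run(0, "async")
# sync: O_SYNC makes every write() block until the data is on stable storage
# (the ZIL -- which lives on the SLOG when one is attached)
run(os.O_SYNC, "sync")
```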

Now if you add an SLOG device and we assume you pick a proper SSD which is significantly faster, has got high write endurance, power loss protection, etc ... the data in memory will be written to the SLOG first. And the write operation will be ack'ed after that has been accomplished. Which is the main mechanism of the expected speedup.

But still, the data that has been written to the SLOG will not be dropped from memory. It will still be flushed from memory to stable storage in your pool. Once that is complete, the copy on your SLOG is simply irrelevant. The perceived performance increase is due to the faster commit to some stable storage (your SLOG), not to any speedup in your pool.

So as @jgreco already put it the SLOG is 99% written to. And never read. Because the data goes to your pool at the speed your pool can deliver. Only ZFS can tell your application "I got this" faster if it writes the data to your SLOG first. Written is written. Unless the hardware lies to you, but that is a different story.

Only if your system crashes or experiences a power outage and some transaction was written (and ack'ed) to your SLOG but not yet to your pool will the SLOG be read to copy that transaction to main pool storage.

Hope that explained it,
Patrick
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
And I'm saying you're wrong. It's a log. It is never read from during normal operations, only at boot time during pool import. A cache would generally have 50% reads and 50% writes on some sort of FIFO or other similar basis. A SLOG has 99.999+% writes. No reads during normal operations.

Understanding what is really going on here is key to understanding how system performance is impacted. Once you get that the write cache flushes from main memory, the only question really becomes how much sync write operations hurt you. The pool writes are already happening at the fastest rate allowed by the pool. They always flush out from the committing transaction group to the pool at max speed.
Yes, I know that. I never talked about a read cache. Where did you get that idea? I only talked about write performance and whether this also impacts the SLOG when it dumps its data to the pool.
You also state that FreeNAS uses 1/8 of memory for storing transaction groups. But there is a FreeNAS setting that caps the maximum size of a transaction group at 4GB. So when you have 512GB of memory, like we do, FreeNAS will reserve 64GB for storing transaction groups, which can hold 16 transaction groups? That is kind of a lot, don't you think? Does the SLOG also have to be 64GB?
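(For what it's worth, the actual write-cache cap can be inspected directly rather than guessed from the 1/8 figure; a sketch, with the FreeBSD sysctl name assumed:)

```python
import subprocess

# the in-RAM write cache ("dirty data") is capped by a tunable rather than a
# fixed fraction of RAM; vfs.zfs.dirty_data_max is the FreeBSD sysctl name
# (on Linux the equivalent module parameter is zfs_dirty_data_max)
subprocess.run(["sysctl", "vfs.zfs.dirty_data_max"])
```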
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
Async writes are cached in memory if possible. Sync writes don't return from their respective system call unless they are written to stable storage. Which in our case is your pool.

The data to be written is in memory. The write operation is acknowledged once the memory has been flushed to disk. If it was async it would be ack'ed immediately without being written.

Now if you add an SLOG device and we assume you pick a proper SSD which is significantly faster, has got high write endurance, power loss protection, etc ... the data in memory will be written to the SLOG first. And the write operation will be ack'ed after that has been accomplished. Which is the main mechanism of the expected speedup.

But still, the data that has been written to the SLOG will not be dropped from memory. It will still be flushed from memory to stable storage in your pool. Once that is complete, the copy on your SLOG is simply irrelevant. The perceived performance increase is due to the faster commit to some stable storage (your SLOG), not to any speedup in your pool.

So as @jgreco already put it the SLOG is 99% written to. And never read. Because the data goes to your pool at the speed your pool can deliver. Only ZFS can tell your application "I got this" faster if it writes the data to your SLOG first. Written is written. Unless the hardware lies to you, but that is a different story.

Only if your system crashes or experiences a power outage and some transaction was written (and ack'ed) to your SLOG but not yet to your pool will the SLOG be read to copy that transaction to main pool storage.

Hope that explained it,
Patrick
Ah, that explains it. So every transaction is put into transaction groups in memory and on the SLOG (then the ACK is sent for sync writes), but it is written to the pool only from memory. This clears so many things up! Thanks Patrick. The main question is whether our RMS-200 8GB is big enough. It should hold 2 transaction groups (4GB each). We have 30-40 virtual servers in a mixed environment. It should be, but we will know for sure when we migrate the data to this storage. Transactions will go faster, so more data will be written.

<edit>
Read @jgreco's explanation again: “Every time the transaction group fills, or whenever the transaction group time limit expires, the transaction group begins flushing out to disk, and another transaction group begins building up. If the new transaction group fills before the previous one finishes writing, ZFS pauses I/O to allow the pool to catch up.”
So the RMS-200 8GB should be enough…
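Back-of-the-envelope, with every input an assumption:

```python
# back-of-the-envelope SLOG sizing; every number here is an assumption
link_gbps = 10        # one ESXi host's 10Gbps interface as the ingest bound
txg_timeout_s = 5     # default transaction group interval
txgs_in_flight = 2    # one group filling while the previous one commits

ingest_mb_s = link_gbps / 8 * 1000                                # 1250 MB/s
needed_gb = ingest_mb_s * txg_timeout_s * txgs_in_flight / 1000   # ~12.5 GB
print(f"worst-case SLOG footprint: ~{needed_gb:.1f} GB")
```

So an 8GB RMS-200 would only fall short if a host pushed nothing but sync writes at full 10Gbps line rate for the entire window, which our mixed VM workload shouldn't do.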
 
Last edited:

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
PS: this is the performance we get with two RAIDZ2 vdevs (7 disks each), without a SLOG and with sync writes enabled. We mostly look at performance in the 4k-32k block range. And it is not fast without a SLOG device.
 

Attachments

  • DC0EC899-11FE-4456-9846-0F43C36B3517.png (573 KB)

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
The nice thing is - you can add and remove SLOG devices in live operation. You can even add a single disk, check how it performs, if it pays off add a second one as a mirror device. You did not write much about the environment and use case but the mere size of your pool to me implies it's not a hobbyist's system but something mission critical. So there has to be some budget.

If you add a single-disk SLOG, the worst thing that can happen is a transaction group committed to the SLOG but not yet to your pool proper, the SLOG device failing at that precise moment, and your system crashing/rebooting at the same time - so you lose a couple of writes. Which in my opinion is just the regular case with any write caching system.

If that single SLOG pays off and you observe an increased perceived performance you can add a second (supposedly expensive) SSD as a mirror device any time.
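The commands involved are simple enough; a sketch with placeholder pool and device names:

```python
import subprocess

# add a single NVMe device as a log vdev ("tank", "nvd0", "nvd1" are placeholders)
subprocess.run(["zpool", "add", "tank", "log", "nvd0"])

# later, mirror the log by attaching a second device to the first
subprocess.run(["zpool", "attach", "tank", "nvd0", "nvd1"])

# or back the experiment out entirely -- log vdevs can be removed live
subprocess.run(["zpool", "remove", "tank", "nvd0"])
```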

Without knowing your precise business case it's difficult to come up with proper advice. In any "enterprise" environment the hardware cost for that particular SLOG is negligible in my experience.

HTH,
Patrick
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
Hi Patrick,

Yes it is a production storage system and we are building 2 of them.
2x Xeon 2650
512GB - upgradable to 768GB
26× 1.8TB Seagate ST1800MM0129 hybrid SAS (16GB flash cache)
2× Radian RMS-200 8GB unlimited-endurance PCIe NVMe cards for a mirrored SLOG
3× 1.9TB Samsung 983 DCT NVMe for L2ARC (using 1.5TB as cache)
uplinks 2x 40Gbps

Standard VMware environment of 5 ESXi hosts connected with 10Gbps interfaces

We still have to decide between iSCSI and NFS. We currently like the iSCSI multipath performance, but with 40Gbps that is not such a big deal. iSCSI is still faster, though, but we don't like losing 50% of the disk space. (Otherwise you experience performance degradation.)

But what you are saying is right. Mirrors are much faster. But we had a triple disk failure once and lost the complete pool, probably because of the intense rebuild process. Luckily we had a second storage system receiving snapshot replication, so we only lost a couple of hours. Never again! Haha.
 
Last edited:

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
But the performance of the current SAS pool, without a SLOG, is not great. That's why I needed to understand how sync writes work and how they impact the whole process. I'm not sure why this is, but I will try to set up a mirror and test the performance without a SLOG. We need to get adequate performance out of the pool first, before adding any SLOG or L2ARC.

I also read in another thread that TrueNAS CORE builds ZFS differently than FreeNAS 11 and older. He also noticed performance degradation with a clean install of TrueNAS CORE compared to 11. FreeNAS 11 was faster, and even an upgrade from 11 to CORE seemed faster than a clean install. So definitely going to try that too.
 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
We still have to decide between iSCSI and NFS. We currently like the iSCSI multipath performance, but with 40Gbps that is not such a big deal. iSCSI is still faster, though, but we don't like losing 50% of the disk space.

There's nothing special about iSCSI or NFS; both are subject to the same general problems when used for block storage. If you are thinking that iSCSI makes you lose 50% of your disk space but NFS doesn't, please note that this is wrong. Any type of block storage needs to maintain a large free-space pool in order to be performant. See any of the materials I've written over the years.
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
I'm not thinking that. According to FreeNAS documentation you can use up to 50% for iSCSI and 80% for NFS without any performance degradation. And you say FreeNAS/iXsystems is wrong… OK, I will have to read your materials then.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The performance of the current SAS pool, without a SLOG, is not great. That's why I needed to understand how sync writes work and how they impact the whole process. I'm not sure why this is, but I will try to set up a mirror and test the performance without a SLOG. We need to get adequate performance out of the pool first, before adding any SLOG or L2ARC.
You're right to suspect that you'll end up with an overall performance hit if your pool can't swallow the sustained IOPS/throughput that the SLOG and RAM are temporarily handling.

There are 3 things you can do about that:

Make sure your pool is as performant as possible to match (or come as close as you can to matching) your workload (particularly the sustained workload)

Mess with settings and hope you don't break anything in the process (there are a few threads around that talk ... perhaps erroneously ... about adjusting the number of seconds that will be held in RAM for a transaction group, potentially expanding the cushion provided to your pool; see the sketch after this list)

Adjust your expectations to understand that while SLOG/RAM can help to mitigate spikes and/or short bursts of high I/O, sustained performance will not exceed that of your pool disks.
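For reference, the knob those threads discuss is the transaction group timeout. A quick way to inspect it (FreeBSD sysctl name assumed; on Linux it's the zfs_txg_timeout module parameter):

```python
import subprocess

# the knob in question: how many seconds a transaction group may build up
# before it is committed; vfs.zfs.txg.timeout is the FreeBSD sysctl name
# (on Linux the equivalent module parameter is zfs_txg_timeout)
subprocess.run(["sysctl", "vfs.zfs.txg.timeout"])
```

Raising it trades more at-risk data in RAM and bigger flush bursts for a longer cushion, which is why those threads deserve the "perhaps erroneously" caveat.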

Ultimately, you need to accept that your pool's native performance matters and if you care about that, you'll want to reconsider your RAIDZ2 approach and perhaps think about either 3-way mirrors or 2-way mirrors with a few spares in the pool to mitigate risk of pool loss.

See the link below to the ZFS project's statement (which I note now says 16 disks is the max. for RAIDZ).


I also read in another thread that TrueNAS CORE builds ZFS differently than FreeNAS 11 and older. He also noticed performance degradation with a clean install of TrueNAS CORE compared to 11. FreeNAS 11 was faster, and even an upgrade from 11 to CORE seemed faster than a clean install. So definitely going to try that too.
OpenZFS 2.0 was introduced with TrueNAS 12 (if I recall correctly), so that statement may not be false, but I don't see how any of the rest of it can have any kind of impact (and maybe improved performance somehow carries some risk or flaw with it).

It all sounds awfully anecdotal to me.
 

remonv76

Dabbler
Joined
Dec 27, 2014
Messages
49
That was it, OpenZFS 2.0. I still have to read up on it to understand the differences.

More testing is needed on FreeNAS itself to get the most out of the pool. Testing through a VM does not give me representative results, because of sync writes and write caching. We need to do this on the unit itself. If we can get close to 70% of the workload performance, we would be happy. But that would be mainly in the 4k-32k range, so it's going to be tough with SAS disks.

We thought about going all-flash (datacenter SSDs, 1.2 or 1.8TB), but the price for 2 units would exceed our budget.
 