Using SLOG as a delayed write cache that only writes to the array at a specific time (or when full)

jbssm

Dabbler
Joined
May 16, 2021
Messages
12
I reckon this might seem a very ignorant question coming from someone who doesn't know anything about ZFS.

I am trying Unraid, and despite some shortcomings, one very cool thing is that I can set up an SSD as a cache device that greatly speeds up my write operations and keeps my system silent by holding my data until the evening, when it writes it out to the array.


I was trying to find an alternative on TrueNAS and came across the notion of SLOG, but from what I see, it writes what's in the SLOG (SSD) out to the array after a few seconds.

Is it possible to delay the SLOG flushing until a specific time, or until it gets full?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Is it possible to delay the SLOG flushing until a specific time, or until it gets full?
No. SLOG isn't a write cache--ZFS does that in RAM anyway. SLOG is used only for sync writes, which you wouldn't be doing much unless you're running things like databases or VMs on your TrueNAS box.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
A "sync write" means the OS is guarantying all data writes have been committed to the physical hardware before returning success. Some applications & network services/protocols have this written in to their specifications. It's more of a performance issue for databases, NFS servers, SMTP email servers, etc... It's really about leaving a paper trail so the data doesn't get lost in a crash. Ala: "I'm about to update record A with this data X. I have updated record A. Record A now correctly reflects that I updated it to X." The Unix/Linux file sharing protocol NFS specifically mandates per RFC (the design document) that all writes be performed with the O_SYNC flag set, as do the RFC's covering SMTP email transactions. This is an option flag for the Unix/Linux open(2) system call. When O_SYNC is set, a write call doesn't return to the calling program until the data is committed to the device and confirmed by the underlying hardware.

The SLOG device allows the OS to commit a copy of that data to a very fast SSD or battery-backed RAM device as an intent log, and return a "success" result to the calling program much faster than the physical spinning disks would be capable of. That's all it does. It doesn't cache and accumulate multiple changes, or schedule writes for serialization to avoid fragmentation, etc. If the system crashes, ZFS can replay the log's list of data and ensure consistency.

If it helps, figure out how many CPU instructions your computer executes while a 7200 RPM disk platter makes a single revolution. (At 7200 RPM one revolution takes about 8.3 ms, which is on the order of 25 million clock cycles for a 3 GHz core.)

Finally... if you're using Windows file sharing ("SMB"), a SLOG won't help you at all. The protocol doesn't require sync writes.
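If you want a quick way to check whether your own workloads are even issuing sync writes, something like this from the shell will tell you (pool and dataset names are just placeholders):

Code:
# Show the sync policy on a dataset: "standard" honours whatever the client
# asks for, "always" forces every write to be sync, "disabled" ignores sync requests
zfs get sync tank/share

# Watch per-vdev activity every 5 seconds; a SLOG shows up under "logs".
# Little or no traffic there means your workload is mostly async writes.
zpool iostat -v tank 5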
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Please do go and read

https://www.truenas.com/community/threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

which is an introduction for new users as to what the SLOG is. It is absolutely no sort of write cache; that's why "ZIL" stands for "ZFS Intent Log" and not "ZFS Write Cache," as explained by the posters above.

ZFS already does the things you're talking about natively. Writes are done to the write cache (called a "transaction group") in main system memory, and are flushed out to disk periodically or when the transaction group fills. However, the transaction group timeout is five seconds, not the hours you would need for your type of operation.

The argument people sometimes make is that they'd like the transaction group timeout to be longer, which you can try, but ZFS puts a bunch of effort into figuring out a transaction group size that is well matched to the pool's I/O capacity. If you try to force this, you can get into a situation where ZFS blocks I/O because the pool isn't keeping up, and people generally hate that. Otherwise, ZFS does have the ability to handle transaction groups that are gigabytes in size.
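For the curious, the knobs involved are visible as sysctls on TrueNAS CORE / FreeBSD (SCALE exposes the equivalent OpenZFS module parameters). The values below are only to illustrate the mechanism; experiment at your own risk:

Code:
# Current transaction group timeout, in seconds (default is 5)
sysctl vfs.zfs.txg.timeout

# How much dirty (not yet flushed) write data ZFS will hold in RAM
sysctl vfs.zfs.dirty_data_max

# Stretching the timeout is possible, but see the caveats above
sysctl vfs.zfs.txg.timeout=30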

What you really want is some form of tiering, and ZFS doesn't support that in the way that some other things do.
 

jbssm

Dabbler
Joined
May 16, 2021
Messages
12
Thank you all for your detailed answers about ZIL and SLOG. I get that what I am looking for (not really performance but, let's say, "silence during the day") is just not possible within the framework of ZFS and TrueNAS.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Thank you all for your detailed answers about ZIL and SLOG. I get that what I am looking for (not really performance but, let's say, "silence during the day") is just not possible within the framework of ZFS and TrueNAS.

"silence during the day" ... so I guess the thing here is, what do you even mean by this? Spinning down the drives? That's a bad idea for all the reasons that have been outlined over the years. But also, under what conditions would you be ONLY writing to the pool without also reading? Unless you have a buttload of L2ARC, reads are likely to hit the pool at some point, even if just to get metadata on where the free space was in order to fulfill write requests.

Ignoring for the sake of discussion the fact that it's a bad idea to spin hard drives down, and that most writes would be accompanied by reads of metadata, etc., that would spin the pool up anyway, you could just set your drives to spin down after ten minutes of idle and get a large amount of the way there.
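For what it's worth, TrueNAS has a per-disk "HDD Standby" timer in the web UI for exactly that, and on CORE something like the following should do it from the shell for an ATA disk (device name is just an example; SCALE would use hdparm instead):

Code:
# Put ada0 into standby and set its automatic standby timer to 600 seconds
camcontrol standby ada0 -t 600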

Also, if you really did have a situation where you were just committing writes, you could potentially cobble together an SSD pool to store the writes, and then rsync the files into the main pool at night. This is messy but not out of the realm of possibility, depending on what it is you are trying to do.
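A rough sketch of what that could look like, with pool names and paths invented purely for illustration:

Code:
# Day-to-day writes land on a small SSD pool, e.g. in a dataset like this
zfs create fast/incoming

# A nightly cron job (or a TrueNAS "Cron Job" task) then sweeps the files
# over to the main pool and frees the SSD space.
# /etc/crontab entry, runs at 03:00 as root:
0 3 * * * root rsync -a --remove-source-files /mnt/fast/incoming/ /mnt/tank/archive/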
 

jbssm

Dabbler
Joined
May 16, 2021
Messages
12
"silence during the day" ... so I guess the thing here is, what do you even mean by this? Spinning down the drives? That's a bad idea for all the reasons that have been outlined over the years. But also, under what conditions would you be ONLY writing to the pool without also reading?

Actually it's more about not writing that much to the HDDs during the day, but to cache SSDs that would move the data over at night. Case in point: torrent downloading. Downloads would complete on the SSDs and then, once the whole download is done, be moved to the HDD ZFS array.

Basically, something very similar to the way Unraid does these things, if you are familiar with it: an automated version of your rsync suggestion that also updates all the links for all the shares to point to the correct device, without me having to worry about it. That's the problem I see with rsync: sure, I can move things over with rsync during the night, but what about accessing a single folder to get/put my data at any time of the day?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I wonder if an application could be written to make TrueNAS support a write-cache style of tiered storage?

This comes up often enough: people have a need for fast writes and mistakenly think a SLOG is the way to achieve it.

Something along the lines of:
  • Main pool, (hard drives?)
  • Fast write pool, (SATA SSD or NVMe?)
  • Application to have access to both, but sharing out a "union" of them.
  • The "union" device would only show space numbers, (free / used), from the main pool
  • All writes go to the fast pool to start and, after file close or a settable timeout, are flushed to the main pool
  • On loss of the fast write pool, all writes would go to the main pool
Someone can design their fast write pool for their desired level of redundancy. For example, the main pool can be RAID-Z2 but the fast write pool can be a simple single device. Loss of the fast write pool would only lose files in transfer, which may be acceptable to some people. Or they can move their "fast write pool" to mirrored devices, or use "copies=2".
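To make those redundancy options concrete, the pools themselves are trivial to create today; it is only the "union" layer on top that is missing. Device names below are placeholders:

Code:
# Main pool: six disks in RAID-Z2
zpool create tank raidz2 da0 da1 da2 da3 da4 da5

# Fast write pool as a single NVMe device (cheapest, least redundant)...
zpool create fast nvd0

# ...or as a mirrored pair instead
zpool create fast mirror nvd0 nvd1

# ...or keep the single device but store two copies of every block
zfs set copies=2 fast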

This need for a "write cache device", (aka NOT SLOG), reminds me of the old ASM/QFS tiered file system that StorageTek used to use internally, (and I think Sun Microsystems sold).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
BSD does support the unionfs mount, but it does some very dodgy stuff under the sheets, and there is a scary warning about actually using it in the manpage. It does, however, demonstrate that this could be possible if someone wanted to write it.
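For illustration only, and keeping that scary manpage warning firmly in mind, the invocation is roughly this (mount points invented):

Code:
# Layer a directory on the SSD pool over a directory on the HDD pool.
# New writes land in the upper (SSD) layer; reads fall through to the lower one.
mount -t unionfs /mnt/fast/share /mnt/tank/share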

We seem to have moved beyond the days when these things were generated as grad student projects or stuff like that. :-(
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
We seem to have moved beyond the days when these things were generated as grad student projects or stuff like that. :-(

The Grad Students have for the most part all moved on to various cloud IOT/device musings. File systems are hard! But more importantly... Grad school has become absurdly expensive. :eek:
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
BSD does support the unionfs mount, but it does some very dodgy stuff under the sheets, and there is a scary warning about actually using it in the manpage. It does, however, demonstrate that this could be possible if someone wanted to write it.

We seem to have moved beyond the days when these things were generated as grad student projects or stuff like that. :-(
Perhaps the BSD "unionfs" could be used as a basis for the new tiered storage?

Of course, my heavy programming days were last century :smile: so the most I do today is write scripts, mostly in Bash.

If the "unionfs" does some of what is needed, a new "tieredfs" might be able to be created. Then perhaps no application is needed.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Code:
# man mount_unionfs
[...]
BUGS
     THIS FILE SYSTEM TYPE IS NOT YET FULLY SUPPORTED (READ: IT DOESN'T WORK)
     AND USING IT MAY, IN FACT, DESTROY DATA ON YOUR SYSTEM.  USE AT YOUR OWN
     RISK.  BEWARE OF DOG.  SLIPPERY WHEN WET.  BATTERIES NOT INCLUDED.

     This code also needs an owner in order to be less dangerous - serious
     hackers can apply by sending mail to <freebsd-fs@FreeBSD.org> and
     announcing their intent to take it over.

     Without whiteout support from the file system backing the upper layer,
     there is no way that delete and rename operations on lower layer objects
     can be done.  EOPNOTSUPP is returned for this kind of operations as
     generated by VOP_WHITEOUT() along with any others which would make
     modifications to the lower layer, such as chmod(1).

     Running find(1) over a union tree has the side-effect of creating a tree
     of shadow directories in the upper layer.

     The current implementation does not support copying extended attributes
     for acl(9), mac(9), or so on to the upper layer.  Note that this may be a
     security issue.

     A shadow directory, which is one automatically created in the upper layer
     when it exists in the lower layer and does not exist in the upper layer,
     is always created with the superuser privilege.  However, a file copied
     from the lower layer in the same way is created by the user who accessed
     it.  Because of this, if the user is not the superuser, even in
     transparent mode the access mode bits in the copied file in the upper
     layer will not always be the same as ones in the lower layer.  This
     behavior should be fixed.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
An SSD front end certainly could make a huge difference for bursty use cases, i.e. pools that are mostly dormant and then get hit hard for a short time with a lot of data to write. There, the use case for an SSD intermediary cache that eventually gets flushed to the HDD pool seems pretty evident. The user would experience writes to the hybrid pool as if it were made of just SSDs.

But once the HDD part of the pool starts to be busy all the time, I wonder to what extent an SSD write cache on the front end can help. Eventually, the cache *has* to be flushed and then the pool will be busy with that, at HDD speeds, until it's done.

This so reminds me of SMR HDDs that, under light use, behave just like a CMR drive. But, past a certain use threshold, performance craters.

The impact would not be anywhere near as bad with an SSD cache for an HDD pool, but the analogy likely holds. That is, the pool will feel like an SSD pool for writes until the cache has to be flushed.

Not sure about the read side, though arguably a large L2ARC / sVDEV can already fulfill this function to some extent.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I'd probably approach it from the angle of abusing the snapshot mechanism in combination with some kind of overlay filesystem like they use in IoT devices, à la Yocto Linux... The goal would be to present a filesystem as a fusion of "now" and snapshots, where "now" and snapshot "now-1" reside on the flash, and "now-2" resides on the spinning rust. It's one of those cases where most of the pieces are already there, but they don't work that way and they're not quite Legos...
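The snapshot and replication half of that already exists, of course; something like the following (names made up) would shuttle the older data down to the rust. It's the overlay presentation layer that is the missing Lego:

Code:
# Take the new "now" snapshot on the flash pool
zfs snapshot fast/data@now

# Incrementally replicate everything since the previously shipped snapshot
# over to the spinning-rust pool
zfs send -i fast/data@prev fast/data@now | zfs recv -F tank/data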
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
But once the HDD part of the pool starts to be busy all the time, I wonder to what extent an SSD write cache on the front end can help. Eventually, the cache *has* to be flushed and then the pool will be busy with that, at HDD speeds, until it's done.

This so reminds me of SMR HDDs that, under light use, behave just like a CMR drive. But, past a certain use threshold, performance craters.

The impact would not be anywhere near as bad with an SSD cache for an HDD pool, but the analogy likely holds. That is, the pool will feel like an SSD pool for writes until the cache has to be flushed.

The key here is to have a gradual flushing/throttle process that ramps up, not unlike the current ZFS write throttle, with tunable values.

I'm going to call it "zcache" here for the examples, not to be confused with the vanilla "cache/L2ARC" or the "bcache/bcachefs" filesystems. These tunables don't actually exist; this is all hypothetical.

Eg:
Cache drive is <30% full = don't throttle anything, don't start flushing.
Cache drive is 30%-60% full = begin flushing at a rate determined by scaling between zfs_zcache_flush_min and zfs_zcache_flush_max
Cache drive is >60% full = continue flushing at max speed, engage the ZFS write throttle to make sure cache isn't overwhelmed
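Sketching that scaling in shell terms, purely to show the shape of the curve (the zcache tunables are, again, entirely made up):

Code:
# Hypothetical: pick a flush rate based on how full the cache pool is
FILL=$(zpool list -H -o capacity fast | tr -d '%')    # cache pool % used
MIN=100    # stand-in for zfs_zcache_flush_min, in MB/s
MAX=600    # stand-in for zfs_zcache_flush_max, in MB/s
if [ "$FILL" -lt 30 ]; then
    RATE=0                                            # <30%: don't flush yet
elif [ "$FILL" -lt 60 ]; then
    RATE=$(( MIN + (MAX - MIN) * (FILL - 30) / 30 ))  # 30-60%: scale min -> max
else
    RATE=$MAX                                         # >60%: flush flat out, throttle writers
fi
echo "flush at ${RATE} MB/s"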

This also lets ZFS potentially leverage this "zcache" vdev as an MFW/MRW (Most Frequently/Recently Written, not >my face when >my reaction when) cache device - if something is in the write cache pending a flush to slower disk, it could potentially get a hit here. Given the speed of devices needed, it would probably go ARC > zcache > L2ARC > pool.

Obviously any device used for zcache would need to have SLOG-level endurance and ideally full end-to-end PLP.
 