ZFS durability question wrt onboard caches

shanemikel

Dabbler
Joined
Feb 8, 2022
Messages
49
How does ZFS provide durability given the multi-layer caches involved in storage systems?

First, I'm using HDDs with onboard caches. Unlike the SSDs I see mentioned around here, which must have power loss protection, HDDs AFAIK don't have that circuitry. Does ZFS somehow tell disks not to use their onboard cache? If so, I may as well save money and buy drives with a smaller cache size, all else being equal.

Second, HW RAID controller caches pose a threat to durability, which is why they typically come with an integrated backup battery. OK, we're using ZFS, so an IT-mode HBA instead of a RAID controller. I haven't read anything here about integrated batteries, so I'm assuming HBAs don't have caches and RAID card caches are disabled when flashed into IT mode?

Assuming I follow the recommendations about disk types & sizes, RAIDZ config, etc., can I run this thing confidently without a UPS, and how so?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The way ZFS deals with HDDs is that it either disables the write cache on the drive, (really old ones had to have this done), or it uses something called write barriers. A write barrier forces the hard drive to flush its internal cache of previously written data to the platters. Then, when this is done, new writes can start being written.
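For example, on a Linux-based system (TrueNAS SCALE, for instance) you can usually see whether a SATA drive's volatile write cache is enabled with hdparm; the device name below is only a placeholder:
Code:
  # report whether the drive's volatile write cache is currently enabled
  hdparm -W /dev/sda

  # turn the drive's write cache off (ZFS normally does NOT need this on modern drives)
  hdparm -W0 /dev/sda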

Next, ZFS is copy on write. Say a file is updated, both contents and metadata, (its permissions): ZFS will bundle up the writes into a transaction group, starting with the data, then the metadata, and last the uberblocks that now reference the updates. These updates go into previously free space, so no existing data is over-written.

If the hard drive loses power before the uberblocks are updated, then the file update never existed. If the uberblock update is completed, then the file update is present. In either case, the on-disk structures of ZFS remain both consistent and "perfect".
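If you're curious, you can peek at a pool's active uberblock with zdb and watch the transaction group (txg) number advance; the pool name here is just an example:
Code:
  # print the active uberblock, including the txg it references
  zdb -u tank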

It is probable that SATA/SAS SSDs and NVMe drives don't have the same "write barrier" concept.


In the context of ZFS, the read cache, (more properly named L2ARC), is exactly that, read. The real data is still in the pool. And failure of an L2ARC device may impact performance, but not data integrity.
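That is also why an L2ARC device can be added and removed at any time without risk to the pool; something like the following, with placeholder pool and device names:
Code:
  # add an NVMe device as L2ARC (read cache)
  zpool add tank cache nvme0n1

  # remove it again later, pool data is untouched
  zpool remove tank nvme0n1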

The tricky part comes when using a SLOG, (aka Separate intent LOG). This is a synchronous write intent log. Since we must be certain the data is on stable storage before the write is acknowledged, we end up needing power loss protection on the SSDs. Some SSDs write directly to stable media, like Optane, or don't have a volatile write cache and won't consider the write complete until it is on stable storage. Those don't need super capacitors to keep power up for a few seconds, allowing the SSD to write out its write cache.

Further, in the case of SLOG, if you have an unexpected shutdown / power loss, AND you have a SLOG failure on boot, the last few writes / ZFS transactions may be lost. Thus, Enterprise users may mirror their SLOG to prevent loss of data.
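Adding a mirrored SLOG looks roughly like this, again with placeholder device names:
Code:
  # attach a mirrored pair of power-loss-protected SSDs as the SLOG
  zpool add tank log mirror ada1 ada2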


ZFS was specifically designed to handle a massive number of power losses, WITHOUT DATA LOSS, even during active writes. So, a UPS is nice, but not necessary for data integrity on disk. That said, a UPS can prevent loss of data in flight, as with any file system.

This on-disk data integrity is not a guarantee of BTRFS. It has a few places where it can lose data. Those conditions can't happen with ZFS because it was designed without those flaws.


On the subject of disk controllers that can have either RAID or IT firmware, yes, the IT firmware ignores any write cache attached to the disk controller.
 

shanemikel

Dabbler
Joined
Feb 8, 2022
Messages
49
Thanks, that makes sense.

I apologize in advance if this is a bit weird, but this is how I think.. don’t feel expected to answer or respond to all/any of these theories/questions/conclusions.

Anybody who comes across this: please read everything below as if it ends with a question mark. I really don’t know what I’m talking about.

So these are some implications if I understand correctly:
  1. Write barriers are an HDD firmware feature; unsupported (old) drives fall back to having their cache disabled by ZFS, because of its particular commitment to guarantees, whereas another fs might in that case take the risk of using the cache and, on a sync request, wait an arbitrary time assumed to be long enough to flush the cache.
  2. On such a system, “sync: always” or “sync: standard” requires disabling the drive cache, because if we could ever be asked to do a sync and we can’t wait until we’re sure the cache is persisted, there had better never be anything in the cache to wait on. But since there is no such guarantee with “sync: never,” we don’t need to disable the cache in that case.
  3. In such a system, “sync: standard” gives the performance characteristics of using “sync: always” (but also the guaranteed durability of every write).
On a system with barriers:
  1. Using barriers allows ZFS to confidently make sync guarantees, thereby gaining the performance of using the HDD cache and providing durability upon request like any other filesystem with sync support… however, with the advantage of transactional updates protecting filesystem metadata integrity and obviating the need for an fs journal (as in ext4).
  2. Sounds like “sync: always” uses a write barrier after every write, unless there is a SLOG, which ZFS assumes to be perfectly durable… If I don’t need to use barriers always, well, I don’t really need to use barriers anymore, ever.
  3. A SLOG device without PLP or a mirror is still using transactional writes and therefore doesn’t threaten metadata corruption. Worst-case data loss is equivalent to having a power failure on a system with no SLOG, “synchronous: never,” and really, really huge HDD caches.

In simpler terms:
  1. sync takes away our HDD cache
  2. barriers give it back
  3. SLOG gives us a big cache (I know, don’t call it a cache) and gives our HDD cache back, without the need to wait for barriers
Aside: I’m breaking one of the rules by calling SLOG a cache, because it’s basically a write-only, write-back cache. Caches can hold important data too, not every cache is write-through, and any cache that is not write-through sometimes contains the only copy of new data... Am I wrong?

Which leads me to think again about the barrier unsupported systems:
  1. SLOG could be a necessary optimization on such a system, unless ZFS is configured with “sync: never”.
  2. SLOG should even make it possible to re-enable the disabled cache on HDDs that don’t support barriers, giving an extra little boost.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I am not sure I can answer all your questions. But here goes for some:

On such a system, “sync: always” or “sync: standard” requires disabling the drive cache, because if we could ever be asked to do a sync and we can’t wait until we’re sure the cache is persisted, there had better never be anything in the cache to wait on. But since there is no such guarantee with “sync: never,” we don’t need to disable the cache in that case.
ZFS' "sync" feature refers to synchronous writes, like iSCSI should use, databases should use, or NFS uses. The normal method of asynchronous writes puts data in RAM, and tells the writer process it's complete. ZFS collects a bunch of other changes, like metadata to point to the new file data, and then in a few seconds, burst writes the "transaction group".

If you use synchronous writes, they go into an in-pool ZIL, (ZFS Intent Log), then the writer process is told the write is complete. The rest is similar to asynchronous writes, as the data is still in RAM. The ZIL is only used if a crash / power loss occurs after the ZIL write is complete and before the RAM copy can be used.

To be clear, both ZIL and SLOG keep copies in RAM, which are then flushed to the pool as normal. Only on crash / power loss does the information in the ZIL or SLOG come into play.

As far as I know, ZFS' "sync" feature has nothing to do with hard drive write caches, nor write barriers.

Again, as far as I know, ALL modern hard drives from the last 10 or so years support write barriers properly. So there is no need to disable their write cache.

Get over the idea that SLOG is a write cache. It's not, system RAM is. ZIL & SLOG simply guarantee synchronous writes are saved to stable media before returning to the writer process. Thus, any use of synchronous writes is slower than asynchronous writes, even with a high-speed SLOG.

There are options for "sync" in ZFS; read the manual page.


It looks like you are over-thinking the process.

What is your goal?
To have data always written to stable media before writer process returns?

Then use "sync=always", with or without a SLOG.
 

shanemikel

Dabbler
Joined
Feb 8, 2022
Messages
49
Oh yeah, I forgot that the ZIL goes to the data pool when there’s no SLOG.

My goal is a deep enough understanding of ZFS to use it in production systems, on hardware and cloud abstractions, to know when not to use it and to compare it to alternatives such as ReFS.

I was referring to the Unix/Linux sync system call/command. https://en.wikipedia.org/wiki/Sync_(Unix)


I assumed that “sync: standard” obeyed the sync call normally, “sync: never” ignored it, treating all writes as async, and “sync: always” ignored it, but treated all writes as if they were followed immediately by sync.
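For instance, this is roughly how I picture an application asking for synchronous behavior (GNU dd flags, hypothetical paths):
Code:
  # every write is synchronized to stable storage before the next one starts
  dd if=/dev/zero of=/mnt/tank/test.bin bs=128k count=1000 oflag=sync

  # asynchronous writes, with a single fsync at the end
  dd if=/dev/zero of=/mnt/tank/test.bin bs=128k count=1000 conv=fsync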

You’re correct, it’s probably time to read the docs.
 

shanemikel

Dabbler
Joined
Feb 8, 2022
Messages
49
One can learn/infer so much about how a system works just knowing how it performs and falls down, and you guys have a lot of ZFS tuning/ops wisdom. So forgive me putting on my dev hat. That’s why I started a thread instead of dumping that stream of consciousness into somebody else’s convo.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Here is the snippet from the OpenZFS 2.1.5 manual page, "zfsprops", on the "sync" property:
Code:
     sync=standard|always|disabled
       Controls the behavior of synchronous requests (e.g. fsync, O_DSYNC).  standard is the POSIX-specified be-
       havior of ensuring all synchronous requests are written to stable storage and all devices are flushed to
       ensure data is not cached by device controllers (this is the default).  always causes every file system
       transaction to be written and flushed before its system call returns.  This has a large performance pen-
       alty.  disabled disables synchronous requests.  File system transactions are only committed to stable
       storage periodically.  This option will give the highest performance.  However, it is very dangerous as
       ZFS would be ignoring the synchronous transaction demands of applications such as databases or NFS.  Ad-
       ministrators should only use this option when the risks are understood.
 
Joined
Oct 22, 2019
Messages
3,641
Honestly, it's best to leave it at "sync=standard" to let each application decide. Otherwise, you're going to suffer a performance hit with SMB shares if it's set to "sync=always".
 