Data Integrity after Disabling ZIL


RichC

Cadet
Joined
Oct 29, 2013
Messages
4
I have read a lot about disabling the ZIL (or setting sync=disabled) on a particular dataset, and from what I understand, when this is done data is written into RAM, and if a power failure or system crash were to occur, that data would be lost.

My question is: if we were taking automatic snapshots every x minutes and this situation arose, could we simply revert to a previous snapshot and avoid any possible data corruption?
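
Just to make the workflow concrete (pool and dataset names here are made up, not our real ones), I am thinking of something along the lines of:

zfs snapshot tank/nfs-vm@auto-20131029-2200
... crash happens a few minutes later ...
zfs rollback tank/nfs-vm@auto-20131029-2200

i.e. roll the dataset back to the last automatic snapshot and accept losing whatever happened after it.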

We would be running NFS on top of ZFS, servicing ESXi hosts.

We use a ZeusRAM as our log device at present in most cases, but there are certain systems where losing 5 minutes of data would not be an issue; we just would not want corruption.

All of our kit is on UPS and generators, but crashes could still be an issue!

I have read lots, but can't seem to get to the bottom of whether actual corruption would occur.

Is it true to say data is written into RAM and then written to disk at leisure? Or does it happen fairly quickly? In other words, if a power loss or crash occurs, is the worst case losing 5 minutes of data, or can it be worse than that?

I guessed that maybe the data gets written from memory to disk based upon a tunable?

If so, do snapshots force this commit from memory to the spinning disks when the ZIL is disabled?

Thanks in advance.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Eliminating a SLOG device (ZeusRAM, etc) normally forces ZFS to use the in-pool ZIL. This has the expected performance issues.

Telling ZFS not to use the ZIL is a bit more complicated. It helps to understand that there's a normal write process in ZFS, which involves building a transaction group and then having that flushed out to disk (either based on time or demand). ZFS defaults to allowing up to 1/8th of system RAM for transaction group writes, so on a larger system (32GB+) you can easily have gigabytes of data queued to flush to disk. In ZFS v28+, the period defaults to 5 seconds. So there's a stunningly large amount of data that can be cached in RAM and not yet committed to the pool.
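
If you want to poke at the relevant knob on a FreeBSD/FreeNAS box (the exact set of tunables has shifted between ZFS versions, so treat this as illustrative rather than gospel):

sysctl vfs.zfs.txg.timeout

which will normally report 5, the number of seconds between txg commits. The rough worst-case arithmetic is then just "up to 1/8th of RAM" - about 4GB on a 32GB box - sitting in memory ahead of the pool.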

The ZIL (which can be moved to a separate log, or SLOG, device) provides a facility for ZFS to more rapidly acknowledge writes as having been committed to stable storage (as required by POSIX) while still allowing ZFS to gather up large transaction groups (and potentially aggregate larger blocks of data, which is good for compression/dedup/etc.). I go into more detail over here: http://forums.freenas.org/threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/
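
For completeness, attaching and inspecting a separate log device looks something like this (the device name is just a placeholder for whatever your ZeusRAM shows up as):

zpool add tank log da6
zpool status tank

The log device shows up under a separate "logs" heading in the status output, and "zpool remove tank da6" detaches it again, dropping you back onto the in-pool ZIL.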

In theory, ZFS should be able to successfully use sync=disabled without risk. The on-disk image is supposed to be consistent. There's been some debate as to whether or not this is actually safe, though, and the conservative wisdom is to assume that it is dangerous.
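
The setting itself is per-dataset, so an experiment can be confined to one dataset rather than the whole pool (names made up):

zfs set sync=disabled tank/nfs-vm
zfs get sync tank/nfs-vm

and "zfs set sync=standard tank/nfs-vm" restores the normal POSIX-compliant behaviour.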
 

RichC

Cadet
Joined
Oct 29, 2013
Messages
4
Hi,

Thanks so much for the quick response.

I'm starting to understand a little more, I think. I did already read your post tonight, which gave me a lot of clues, but I missed the part about the transaction groups! The part about how to size memory against the MB/sec throughput of the disks is really interesting.

Regarding the default flush of 5 seconds, you say "So there's a stunningly large amount of data that can be cached in RAM and not yet committed to the pool." If the default is 5 seconds, and let's say we are writing at 125MB/sec, do you mean there could be 125MB/sec * flush-to-disk time (5 seconds in ZFS v28+) outstanding, or is there more because of what you were saying about 1/8th of RAM on large-RAM systems?

Sorry, just trying to understand how dangerous "dangerous" is :)

We have had the advantage of running a ZeusRAM in everything we have used to date, so we never really questioned it.

Also, when you say the on-disk image is supposed to be consistent, do you mean consistent but maybe 5 seconds out of date, if we crashed with sync=disabled and no separate log device?

Thanks Again.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
There could be at least a little more, timewise. One txg can be flushing out while another is building up. If you can actually count on being limited to a peak write rate of 125MB/sec (limited by a single gigE link, in other words), then that math roughly holds ... but can you? Can you be sure no one ever does stuff at the console?

Anyway, expect it to be POSSIBLE for there to be two txgs' worth of stuff queued. You know, a disk shelf fails, ZFS builds up one txg and sends it off to commit (and it hangs), a second txg builds up, and then you get a ZFS hang waiting for txg completion of the first.
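
As a rough worked example using the numbers in this thread (illustrative only): at 125MB/sec with a 5-second txg interval, one txg is about 125 x 5 = 625MB, so two txgs in flight is on the order of 1.25GB of acknowledged-but-uncommitted data. The 1/8th-of-RAM figure is the ceiling on how bad it can get, not the typical case.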

What's stored on-disk is supposed to be consistent from a valid-ZFS-pool point of view. From a storage administrator's PoV, avoiding pool loss is important. There are some FreeBSD changes which apparently make this more doubtful/risky, and really I think I'd set up a FreeNAS VM with sync=disabled, some aggressive I/O, and an ESXi script to reset it every five minutes for a week before I'd start to trust it. This hasn't been sufficiently addressed by anyone with actual knowledge of the issues, alas.
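
A minimal sketch of the sort of thing I mean, run from the ESXi shell, assuming the FreeNAS guest's VM ID is 42 (look it up with "vim-cmd vmsvc/getallvms"):

# hard-reset the test guest every five minutes
while true; do
  sleep 300
  vim-cmd vmsvc/power.reset 42
done

with some client hammering the NFS export the entire time, and a "zpool scrub tank" followed by "zpool status -v tank" on the FreeNAS side after each reset to see whether the pool itself ever takes damage.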

Data loss of some sort without a ZIL is pretty much guaranteed unless perhaps the filer is quiescent.
 

RichC

Cadet
Joined
Oct 29, 2013
Messages
4
Hi,

Thanks so much for the clarifications, really appreciated, and sorry for the delay in reply!

We may try the test case you have suggested, and see how it works!

Regards

Richard
 

RichC

Cadet
Joined
Oct 29, 2013
Messages
4
Hi,

Sorry for the long delay! Just to follow on: we were planning to set up a test case for this and set sync=disabled on the pool. We were expecting very fast performance, as we have 98GB of RAM, but in the test system we only have 12 disks in mirrored pairs.

Writing a 10GB file locally, without any network involved, we get around 85MB/sec; we were expecting it to be written to RAM and therefore be very fast.
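
In case it matters, by "writing a 10GB file locally" I mean a simple sequential write with compression off, roughly along the lines of the following (the dataset path is just an example):

dd if=/dev/zero of=/mnt/tank/testfile bs=1M count=10240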

Is this situation normal?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That doesn't sound right.

I haven't actually looked to see what's different in 9.2, and I know it changed, but in older ZFS the transaction group sizing was 1/8th system RAM, or about 12GB for yours. In theory a file of several gigs should appear to write instantaneously, regardless of your pool speed, and then ZFS will flush it to disk at the speed your pool is capable of.

85MB/sec might be an okay pool speed for a pair of oldish disks in a mirror, but seems way off for what I am taking to be a bunch of mirrored vdevs in a single pool, which ZFS would normally distribute the load amongst.
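
One easy way to see what is actually going on would be to watch the pool while the test runs (pool name is a placeholder):

zpool iostat -v tank 1

If the writes really are landing in RAM first, the process generating the file should finish almost immediately and the pool should keep flushing for several seconds afterwards; if the writer and the pool finish together, something is throttling the txg pipeline.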
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I did 12 mirrored disks in a pool about 2 weeks ago for a customer; we hit almost 1GB/sec.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yes. That was sustained over NFS from inside a handful of VMs.
 