ZIL Usage

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
I'm finally in a position to upgrade my FreeNAS system to TrueNAS Core, and one of the things I am trying to decide is whether I need a SLOG or not. I currently do not have one (or an L2ARC, for that matter), but since I am using 7.2K spinning rust I want to determine whether I should add one.
What I'm looking for is something similar to arcstat but that reports ZIL utilization. I know I can see what is being written into the ZIL using zilstat, but that doesn't tell me whether the ZIL is reaching its max, or even what the current max is. Similar to the hit/miss/size stats we have for ARC, I'm hoping to see whether the ZIL is able to dump to disk fast enough, or whether I need a SLOG to ensure I don't have data loss.

My current rig is eating about 52GB of ARC at a >98% hit rate. It's got (12) 2TB 7.2k SAS (4x 3-wide RaidZ) and I'm serving a mix of NFS, iSCSI (ESXi datastores) and internal BHYVE VMs.

For reference, here are the stats of the system I'm converting to:

Dual Intel Xeon E5-2603 (Quad Core, No HT)
128GB RAM
(12) 2TB 7.2k SAS
(2) 100GB SATAIII SSD (ZFS Mirrored Boot)

I will be moving the workload from the current rig to this new one, going from FreeNAS-11.2-U8 to TrueNAS Core. I have two M2400 systems, old EMC Avamar DataDomain Storage Nodes that have twelve 3.5" drive bays with the backplane and SAS expander already in place.
I am not married to the zpool configs, to be honest, as I can provide NFS from a Linux VM and use the array purely as iSCSI storage. If I go that route, I will probably go with a 6x 2-wide mirror zpool. That will be slightly faster than the 4x 3-wide RAIDZ, but not by a lot. Factor in that I'm also on a 1Gb network backend that may someday go to a 10GbE backend; I'm kind of planning so I don't have to do this again. The previous system, once everything's transported, will be rebuilt and located offsite as a replication partner for this new system, with some limited local ESXi running on it as well.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
L2ARC - you are at a >98% hit rate, so I suspect an L2ARC will do very little.
SLOG (you mean this rather than ZIL) - this only helps sync writes, which basically means NFS and iSCSI (both of which you mention). I have no idea about the VMs, but as you have the first two, that's a tick in the box.

In a steady state the SLOG is only ever written to, and never read - thus you need something that can take a heavy write load. Most SSDs will be rapidly destroyed by SLOG use.

It also needs to be much faster than the disks it backs onto - you have spinning rust - so an SSD is good for a SLOG.

However, the SLOG comes into play when you have a crash of some description - in which case, when the NAS comes back up, the ZIL (on the SLOG device) is read in case there are uncommitted writes. So you need PLP - i.e. if you lose power, the data in the SLOG must be secure. Ideally (though not 100% required) it would be mirrored as well. If the SLOG fails during normal use, the pool will go back to using the ZIL on the data disks - so not a major issue. It's only if the NAS and SLOG fail at the same time that you have potential data loss.
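If you do go that route later, adding a mirrored SLOG is a one-liner - a rough sketch only, with "tank" and the device names standing in for whatever your pool and SSDs are actually called:

Code:
  # attach a mirrored log vdev (SLOG) to an existing pool
  zpool add tank log mirror /dev/nvd0 /dev/nvd1

  # the "logs" section of zpool status will then show it
  zpool status tank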

I note that you are running RAIDZn. Not good for IOPS - you would do better with mirrors (i.e. 6*2), as you suggest, giving six vdevs' worth of IOPS rather than four.

You can test the effect of a decent SLOG by setting sync=disabled on the pool/dataset - however, that is not datasafe in the event of an unplanned reboot or similar. It is as fast as a pool can go. A SLOG will be slower than that, but not as slow as going without one.
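The test itself is just a property flip - the dataset name below is only an example, and remember to put it back afterwards:

Code:
  # see what the dataset is doing now
  zfs get sync tank/vmstore

  # temporarily disable sync writes (NOT datasafe - test only)
  zfs set sync=disabled tank/vmstore

  # restore the default behaviour when done
  zfs set sync=standard tank/vmstore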

What a SLOG does is move the ZIL onto the SLOG device. When data is written to the ZIL, the OS acknowledges to the source system that the data is on safe storage (which is why a fast SLOG is required). ZFS can then get on with writing the data to the slower permanent storage as normal.

At a reasonable price the best SLOG is probably the Optane 900P (or 905P), or you can try to find a Radian RMS (200 or 300). The Optanes are way too big, as you only need about 5 seconds of maximum data transfer:
5 s * 1 Gb/s = 0.625 GB
5 s * 10 Gb/s = 6.25 GB

Some SSDs (SATA or M.2) have PLP and so would work (most do not), but the performance of the RMS or the Optanes will far outstrip normal SSDs. There is a long-running thread on SLOG performance somewhere on this forum.
 

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
Not too worried about power loss; I have a UPS dedicated to the box and connected via USB, so it will be able to shut down cleanly if needed. I will write a small Python script to initiate shutdown via VMware Tools, so I'm not too worried about that.
My main concern, since my rust is 7.2K, is that I can't clear the ZIL out of memory fast enough and run out of RAM. Subtracting the 52GB being used by ARC, that's 60GB+ of RAM available. I don't know if the ZIL has the same 32GB limit in RAM as it does on a SLOG, but I would love to see what my current ZIL size is in RAM. Probably not going to sweat a SLOG; it doesn't sound like it would give me any benefit, and it may actually hurt performance since I have so much RAM available.
I think I'm definitely going to go with 6x 2-wide mirror vdevs for the zpool as it gives more IOPS. The only concern I have is that, with snapshots and such, I've got 9.14TB allocated, and I'm not sure how to see raw data vs. snapshot usage. With the twelve 2TB disks I'm looking at just shy of 12TB usable, so I'm already at 76% utilized as is. I could probably shave a TB or two, but probably not much more without knowing what the snapshots are consuming.
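(Edit: it looks like zfs list can break that down - the space columns split live data from snapshot usage. The pool name below is just from my layout, so adjust accordingly:)

Code:
  # per-dataset breakdown: USEDDS = live data, USEDSNAP = held by snapshots
  zfs list -r -o space tank

  # or list the snapshots themselves, biggest first
  zfs list -r -t snapshot -o name,used -S used tank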
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
What if the UPS fails? Or the power supply? Or the power cable connection is flaky? Whether any of these scenarios would be a problem, only you can say. But there is a reason enterprise SSDs have PLP, even though the servers they sit in are usually connected to redundant power supplies and redundant UPSes.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
1. A SLOG will definitely help with sync writes, unless you are using sync=disabled (your risk).

2. As @ChrisRJ says, PLP is there in case something unexpected goes wrong - a kernel crash, the UPS blowing up, etc. It's the unexpected rather than the planned-for. A power outage is expected; a fire isn't.

3. If you write too much data (too quickly) then things will slow down until it all catches up. The absolute fastest you can go is (in this case) sync=disabled and mirrors, as that is limited only by the raw speed of the disks (and, I suppose, the network). Six mirrors should easily be able to flood a 1Gb network, so the limit will probably be the NIC.

4. For zvol usage, <50% pool utilization is recommended. I would suggest either some more mirrors (i.e. more disks) or bigger disks.
 

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
For perspective, this is my home lab rig, not a production rig.

The server has two power supplies, one on the UPS and one on raw power. If the UPS fails, I only lose a single PSU, and I can quickly move that connection to raw power while the UPS gets replaced (which could take a few weeks). The chances of a dual PSU failure are pretty slim, to be honest; in 25+ years in IT I have never experienced one, and I've been around tens of thousands of servers.

If there's a fire, well, my concern will be my family's living situation, not the data. Additionally, the current system is going to be rebuilt with TrueNAS and located at my second home, about 120 miles away, and the data will replicate there. I run pfSense at both sites and have IPsec tunnels between the homes, so no worries there; the data will be snapped every 5 minutes or so to keep the deltas small.
If this were a production rig, I would purchase one of the prebuilt systems and the enterprise support.

As for the zvol usage, I don't have larger disks yet, and my system only holds 12 disks, so I have twelve 2TB disks to play with. I may be able to trim some fat by moving my ZoneMinder to dedicated hardware with internal drives (it's a resource hog anyway), so I may get under the 50%. But I see the majority of folks, including Oracle, state anywhere from 70% to 80% utilization as the max, not <50%. I will just have to see what the real-world results look like.

As for the original question: how do I see current ZIL utilization - not just what is being written into it, but how large it is?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I don't actually know
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Note that the ZIL is 100% in the pool, on the data disks - not in RAM. A SLOG changes that to a separate device, though still attached to the pool. As for how much space a ZIL takes off the data disks: not very much, a few gigabytes or so. But it's not permanent storage.

Writes to the ZIL are initially built in RAM (of course), as part of a ZFS transaction. However, until that transaction group is fully written, the synchronous write does not return as complete (unless "sync=disabled"). The RAM entries stay in place until the data can be written to its regular location in the pool. When that is done, the RAM is freed and the ZIL entry for that transaction group is freed as well. Hence the point that a ZIL (and SLOG) is write-only except after an ungraceful shutdown.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
As for the original question: how do I see current ZIL utilization - not just what is being written into it, but how large it is?

zilstat is your friend.
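It takes interval and count arguments like the other *stat tools (the exact option flags vary a bit between versions, so this is just the basic form):

Code:
  # sample ZIL activity once per second, ten samples
  zilstat 1 10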
 

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
So I may be completely misunderstanding ZIL then. My understanding was that ZIL was the "cache" in RAM that async writes went to before being committed to disk, giving the confirmation to the client before data is written into persistent storage. That's based on verbiage similar to the following:

"ZIL stands for ZFS Intent Log. The purpose of the ZIL in ZFS is to log synchronous operations to disk before it is written to your array. That synchronous part essentially is how you can be sure that an operation is completed and the write is safe on persistent storage instead of cached in volatile memory. The ZIL in ZFS acts as a write cache prior to the spa_sync() operation that actually writes data to an array. Since spa_sync() can take considerable time on a disk-based storage system, ZFS has the ZIL which is designed to quickly and safely handle synchronous operations before spa_sync() writes data to disk"

So right now, my ZIL is written to the same pool that the data will ultimately be written to? That makes very little sense to me, to be honest, but I can see how an SSD-based SLOG can speed up the ZIL.

So the ZIL is not the write cache that sits in RAM - that's what I get from this, though it's what I initially understood it to be. By default, ZFS will do the synchronous writes, so when my client writes data it goes into RAM, then into the ZIL as a transaction group. At that point the client gets the write confirmation, and at some other point the ZIL dumps to the pool. I honestly don't see the advantage of the ZIL in this case, as it's using the same rust as the pool it's backing.

Setting sync=disabled just makes everything become an asynchronous write, which goes from RAM to disk bypassing the ZIL - or does the ZIL still get used, with just the layer of confirmation changing?

Since the drives I have have a decent cache that is not backed by any protection mechanism, anything in the individual disk caches that hasn't been committed to the platters would still be lost in a sudden power loss. So in this case, having an SSD (or two) without PLP is not the end of the world, because I can already face some level of data loss in the event of a sudden power loss.

With that said, I am not having much luck finding a small, reasonably priced SSD with PLP. I would prefer a SATA interface, but if it has to be an M.2 on a PCIe adapter card, so be it. Since from what I'm seeing the ZIL can only consume 32GB on a SLOG, I sure wouldn't want to get anything larger than a 120GB drive for it, especially since I wouldn't put anything else on it.

For reference, here's my current system's zilstat. From what I'm seeing, I probably don't need to worry about PLP because at most I might lose 1MB of data??

Code:
     txg    N-Bytes  N-Bytes/s N-Max-Rate    B-Bytes  B-Bytes/s B-Max-Rate    ops  <=4kB 4-32kB >=32kB
  25469430    1255368     418456     526088    8593408    2864469    3031040    170     17     13    140
  25469431      12856      12856          0      36864      36864          0      1      0      0      1
  25469432    4215440     843088    2458808   17264640    3452928    6082560    271     54      0    217
  25469433      13232      13232          0      73728      73728          0      2      0      0      2
  25469434    2535480     845160     880896    7577600    2525866    3293184    160     58      0    102
  25469435       1008       1008          0       8192       8192          0      2      2      0      0
  25469436    1516776     505592     612120    6742016    2247338    2662400    178     52      0    126
  25469437      26096      26096          0      73728      73728          0      2      0      0      2
  25469438    1661512     830756    1302600    4153344    2076672    3059712     63     16      0     47
  25469439      13048      13048          0      36864      36864          0      1      0      0      1
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
ZIL is not a cache, and it's not in RAM. It's an indirect write journal whose purpose is to allow write replays when power is lost. That's why a SLOG has to be both fast and have power protection.

As for inexpensive, a quick Google search locates this Dell-rebranded 200 GB Intel DC S3710 (which has PLP) for $93.

 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Yeah - but which 1MB of data?
Optane or RMS is your answer. You could try one of the little M10 drives, but you won't get as much improvement.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Please have a look at "ZIL and SLOG" from the recommended readings in my signature.
 

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
Definitely not dropping $$$ on Optane for this.

Very interesting take from the article on something I originally thought of but was told "No, never a RAID controller with ZFS, ever!" - yet iXsystems seems to see the same point I did: "An interesting but unorthodox alternative for SLOG is to use a RAID controller with battery backed write cache, along with conventional hard disks. Normally RAID controllers are frowned upon with ZFS, but here is an opportunity to take advantage of the capabilities".

After reading the article and understanding the ZIL a little more, I actually see the argument that, in my case, a SLOG device without PLP is acceptable: it's not any different than a power loss before the transaction group is dumped from RAM into the ZIL - pretty much the same result. And since my writes into the ZIL are reasonably small, I think it's well worth the risk to use a consumer SSD and get the ZIL onto a SLOG and off my spinning rust, knowing I could have up to a MB of lost data. Nothing I'm dealing with is so mission-critical that this level of loss is unacceptable; again, this is a home rig used for lab work and running some home automation.

I also learned from the article kind of what I was after: that the transaction group can use up to 1/8 of system RAM, which is 16GB in my case, and that the transaction group will write to the ZIL either when it reaches this limit or when 5 seconds have elapsed, whichever comes first. Looking at zilstat, that's like a MB or two, so really paltry at best.

I've also learned that with that little ZIL traffic, it's probably not really any hit to the pool to leave it as is and not set up a SLOG. Maybe down the line, if I put in 10G networking between my hosts and this rig, that can be revisited, but out of the gate I'm planning on four 1Gb links in a LAG (same switch) to cover me if I need to swap cables or anything. I understand a LAG doesn't truly combine the bandwidth of the ports, but it does allow spill-over, so technically at full throttle I could get up to 460MBps of transfer if everything's perfect (which we know it won't be).

I think for now I will forgo the SLOG and revisit later if I see major performance issues. I will say I have learned a lot about ZIL that I thought I knew but was corrected on, so that alone is awesome, thank you all!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Oh dear, not the 2013-era thread again. Please see the later post I made in there after the switch to the OpenZFS-derived "improved write throttle", as a whole lot has changed, especially with regard to the minimum/maximum thresholds. (The default max is the smaller of 10% of RAM or 4GB, and it starts flushing to disk at 64MB or 5s, whichever is hit first.)
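If you want to check what your own box is using, those throttle thresholds are visible as sysctls on Core - a quick sketch only, and note the exact tunable names shift a little between OpenZFS versions:

Code:
  # dirty data cap in bytes, plus the percent-of-RAM and absolute ceilings behind it
  sysctl vfs.zfs.dirty_data_max vfs.zfs.dirty_data_max_percent vfs.zfs.dirty_data_max_max

  # transaction group timeout in seconds (the "5s" part)
  sysctl vfs.zfs.txg.timeout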


Your one line from zilstat shows a lot more than 1MB in play:

Code:
     txg    N-Bytes  N-Bytes/s N-Max-Rate    B-Bytes  B-Bytes/s B-Max-Rate    ops  <=4kB 4-32kB >=32kB
 25469432    4215440     843088    2458808   17264640    3452928    6082560    271     54      0    217


That's potentially (4.2M + 17.3M) = 21.5M of data hanging in the breeze without sync writes. Now, if this was pure file data that could be recopied that's potentially okay, and even the locally running bhyve VMs are less likely to be impacted as they'll probably have crashed with the server - but remote VMFS from vSphere specifically responds very poorly to unexpectedly truncated data, to the point of potentially making an entire datastore unmountable. It might still be a homelab, but the question now is "the value of your time" and "do you really want to re-rip your media from the original DVDs that you of course own and/or redownload all of those 'Linux ISOs' again".

The UPS/power safety argument also doesn't take into account the potential of an HBA failure. I've had genuine LSI HBAs let the Magic Blue Smoke out, and a SLOG saved my bacon on that one. Again though, the risk assessment is yours to make; we can only offer advice here.

"An interesting but unorthodox alternative for SLOG is to use a RAID controller with battery backed write cache, along with conventional hard disks. Normally RAID controllers are frowned upon with ZFS, but here is an opportunity to take advantage of the capabilities".

That was edge-case advice back then; nowadays any decent enterprise SSD with PLP will crush the RAID controller's performance, because SLOGs are highly sensitive to latency. Check the link in my signature for some SLOG benchmarks:


I'm planning on 4 1GB links in a LAG (same switch) to cover if I need to swap cables or anything.

LAGG and iSCSI don't play nicely together - if you're going with four links and want to keep both file and block going, I'd suggest making a 2-port LAGG and a 2-port MPIO setup.

(Sorry to wall-of-text you here.)
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
@jlw52761 - You missed one of the most important points that HoneyBadger just made; see below. When using zvols (via iSCSI) or datasets (via NFS) to back another file system (e.g. NTFS for MS Windows, or ext2/3/4 for Linux), those file systems can respond very poorly to unexpected write failures. Top that off with another layer of indirection through VMware VMFS, and it's madness time. So much so that those entire file systems can be destroyed. Remember, they were not designed like ZFS.

...
but remote VMFS from vSphere specifically responds very poorly to unexpectedly truncated data, to the point of potentially making an entire datastore unmountable. It might still be a homelab, but the question now is "the value of your time" and "do you really want to re-rip your media from the original DVDs that you of course own and/or redownload all of those 'Linux ISOs' again".
...

So, use "sync=always" for zVOLs, which WILL protect your data.

Lastly, ZIL writes are faster than writing the transaction group to its final location. On pool import, ZFS checks the ZIL to see whether any parts of it need to be applied and, if so, applies them.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Isn't sync=always the default behavior?
No, sync=standard is the default, which is "use a sync write if the client requests it."

VMware explicitly requests sync writes to NFS exports, but doesn't for iSCSI LUNs - because it expects the SAN hardware to provide a non-volatile write cache. That's why you have to manually set sync=always on zvols.
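Setting it is a quick one-liner per zvol - the dataset path here is just an example:

Code:
  # force sync semantics on an iSCSI-backed zvol
  zfs set sync=always tank/iscsi/esxi-datastore

  # confirm it took
  zfs get sync tank/iscsi/esxi-datastore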
 