Slideshow explaining VDev, zpool, ZIL and L2ARC for noobs!

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Good writeup! Nice to see someone that likes to chat about deep level stuff.

There is something here I'm not sure I understand. zpools can be created with and without log devices. (It could be a single device, but mirrored is far wiser in the context of FreeNAS.) Without log devices, the ZIL info committed to stable storage is written to blocks in the main pool. I'm trying to imagine doing a sync write of 32K of data. A ZIL record is created that includes the 32K of data. Because it is a sync write, that record gets committed to stable storage. If there are no log devices, that 32K now resides in the main pool; then the actual write needs to occur, and that same 32K is rewritten to the pool? Interesting. I'm assuming the committed ZIL info that includes the 32K of data will be slightly larger than 32K. Say for the sake of argument it is 36K total. That still sets up a scenario for fragmentation, as that 36K area would be large enough to hold a future 32K data write, leaving a 4K fragment which is good for not much.

So here's what I know for certain:

1. You always have a ZIL. So even if you have no ZIL device, the pool has its own ZIL. I'm very fuzzy on what those writes actually are. My guess is that writes fall into 2 categories:

Sync writes: sync writes must be honored, so I'm guessing a tx would be opened and then closed and the data written to the pool. (This is a performance nightmare.)

Non-sync writes: My guess is that there is some degree of caching, but to what extent I'm not sure. But you have to open and close the tx/txg as well as write the data. From what I've seen in iostat, when ZFS "breathes" it opens the tx/txg, writes the data, and closes the tx/txg fairly rapidly. Like a second or less for the whole thing. The exception is files being sent that take longer than a single "breath".
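
As a rough sketch of how to poke at this from the shell (the dataset name "tank/vmstore" below is made up for illustration), the per-dataset sync property controls how ZFS treats those two categories of writes:

  # Hypothetical dataset name; substitute your own.
  # standard = honor the application's sync requests,
  # always   = treat every write as a sync write,
  # disabled = acknowledge writes before they reach stable storage.
  zfs get sync tank/vmstore

  # Force every write on this dataset through the ZIL (sync) path.
  zfs set sync=always tank/vmstore

  # Return to the default behavior.
  zfs set sync=standard tank/vmstore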
 

maurertodd

Dabbler
Joined
Nov 3, 2011
Messages
47
Agreed that each zpool will have its own ZIL. The Intent Log is explained in one place in terms similar to a database journal file: play back the log and you'll end up in the same place. Mostly it seems that the records in the ZIL only ever exist in memory before they are destroyed. I'm not sure, but I think what happens is that a sync write is going to cause a commit, and that a commit is going to force any ZIL records still in memory to be flushed to stable storage. It is possible that only the record for the write gets written, I suppose, but I'm thinking not, as the others might need to be played back too in order to guarantee that the sync write is successful.

From the good news side, complete the flush of the ZIL to either the log device(s) or the main pool, and the write can be signaled as complete to the application. If the log devices are a pair of SSDs, that would be much faster than writing to spinny disks. On the down side, regardless of where the sync write flushes the ZIL to, there is follow-up writing overhead to accomplish the actual write.

I'm surprised the engineers didn't find some clever way that, when the ZIL write of the data goes to the main pool, it would double as the movement of the actual data to the main pool, with only pointers to fix up after the fact instead of rewriting the data. In fact I don't know that they didn't, but the way I'm reading things now it seems that in my example the 32KB of actual data will move from system memory to the disks twice. This is the part that is bothering me most about my current understanding.
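
In case it helps anyone following along, attaching a mirrored SSD log device for exactly that purpose is a one-liner (the pool name "tank" and device names "ada4"/"ada5" below are made up):

  # Hypothetical pool and device names; substitute your own.
  # With a separate log (SLOG) attached, sync-write ZIL records land on
  # the SSD mirror instead of on blocks allocated from the main pool.
  zpool add tank log mirror ada4 ada5

  # The pool should now list a "logs" section containing the mirror.
  zpool status tank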
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You need to keep something in mind: the ZIL, while being "assigned" to a pool, isn't actually part of the pool itself. This is why any write to a separate ZIL device will also later be followed up by a write to the pool. The pool should normally be able to stand on its own.

I'm a little confused by what you are saying in the last paragraph. Any write to the ZIL, if it is not on the actual pool disks, will have to be followed up with a write to the pool. But if a sync write is made to your pool with no external ZIL, then the write is made immediately, and only one time. I'm confused about where you are getting the idea that there are two writes.
 

maurertodd

Dabbler
Joined
Nov 3, 2011
Messages
47
Again it may be me that is confused. :)

In this example, there is no log device (no device(s) dedicated for the ZIL), so when the ZIL must be flushed to stable storage it goes to blocks on the main pool. If mirroring or RAID is in play, I'm not counting those additional IOs below; I'm only counting writes to the pool itself.

  • Application issues a sync write of 32KB of data.
  • The change including the 32KB of data is written to the ZIL in memory.
  • The ZIL is committed, forcing a flush of the ZIL including the 32KB of data to stable storage.
    • Since there are no log device(s), the ZIL including the 32KB of data is written to blocks of the main pool.
    • For the sake of argument let's say the 32KB of data is written starting at block 1024 of the main pool.
    • If the 32KB starts at 1024, presumably the other data for the ZIL either precedes it (before 1024), follows it (after 1055), or both.
  • The ZIL flush completes, the application is signaled that the write is complete.
My question is "What happens next?"
  • Note that while the ZIL has been flushed, I expect that the contents also still reside cached in memory.
  • Does ZFS now have to write the 32KB of data elsewhere in main pool to make it part of the filesystem?
    • If so then the space for the ZIL including the 32KB at 1024 is deallocated.
  • Or is ZFS simply mapping the data that already resides at block 1024 into the file system?
    • If so then the ZIL "transactional" information around 1024-1055 will be deallocated, but 1024-1055 will remain allocated.
  • Or something else I haven't thought of?
The first option requires that the 32KB of data is written to the pool twice.
Why do I care? In one sense it doesn't really matter. The net result is logically equivalent, data is protected, if there are two writes to stable storage the app is free to continue on its merry way after only one and life is good. But my thing is performance. I need to understand what I see in the performance metrics. Understanding that the 32KB is moving once or moving twice to stable storage will make a difference in the performance data.
It also matters in the resulting state of the free space of the main pool. Leaving the 32KB at 1024 would mean the ZIL space deallocated would be quite small, perhaps even only 1KB. This is what Oracle support suggested would be the net result of not having a log device, though the explanation given wasn't quite correct.
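
One rough way to approach the "once or twice" question empirically (the pool name "tank" below is made up) is to watch per-vdev write volume while pushing a known amount of sync-write traffic at the pool:

  # Hypothetical pool name; substitute your own.
  # Reports per-vdev bandwidth every second; comparing the bytes written
  # against the amount of data the application submitted gives a rough
  # sense of whether the payload is hitting the disks once or twice.
  zpool iostat -v tank 1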
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The ZIL doesn't have its own special location for writes. Any write to the actual pool disks will not be rewritten again later. There's no point in dealing with double writes when there's no distinction between any two places in the pool aside from "available to use".

The first option is totally inaccurate, as it doesn't write the same 32kB to the pool twice. There's no reason to. The ZIL, when part of the pool itself, isn't some special carved-out piece of the pool. It just writes the transaction recording that the place where it wrote the data previously is its permanent location. No re-write is necessary. So in your example it will do the write at 1024 and that's it. Nothing at 1055 will happen within the confines of your example.

I cannot vouch for what Oracle's design does as nobody has access to Oracle's code(at least, not legally).

To be honest, I find it very hard to believe that one write versus two writes is going to be noticeable in any performance metric you are running unless the data you are collecting is also telling you where the data is being written. The reason is that ZFS has several ways it does what it does, including variable block sizes, location of existing data, fragmentation of the metaslabs, and the transactions themselves. There are so many variables that you aren't going to be able to really *know* what is going on without checking out the source code for yourself. Unless you are about to pick all of those apart, I can't see how you are going to identify a write that is 34kB versus 68kB (adding extra for the transactions) without actually seeing where and how much data is being written to a given disk.

And to be honest, if your performance is so abnormal that you are trying to figure out if you are making a single 32kB write or two 32kB writes when a sync write is made, you're probably suffering from analysis paralysis. The fact that I don't know with 100% certainty what goes on with the sync writes, the order of the writes, etc. should be an indicator that you don't really need to either. ;) Now, curiosity killed the cat, and if it's strictly curiosity that's one thing. But I don't think that knowledge will really help you in the end.
 

Code

Cadet
Joined
May 14, 2014
Messages
8
@cyberjock
And to everyone & anyone wanting it...

The link in the presentation to Hard Disk Failures is throwing a 404 Not Found.
So I thought I'd share the link here for anyone that needs it, and for cyberjock when he has time to add the new link to his edited version, if and when that happens.

Not sure whether it's the same as you had in there, cyberjock; I'll leave it up to you to decide.

https://docs.google.com/file/d/15y7kJ7BbyKNuVJR2lpbwsVXUnuJNFPJ36cWNRhFz15LGphcK_h-_MIGaeKEj/edit

I've also attached the PDF document below for anyone who wants to take a gander at it.



Virus Total Scan
MD5 = a0d32b0aa6f36f2bff4321a517560cac
https://www.virustotal.com/uk/file/...32174963cfc73b5d0aa882839776152c01b/analysis/

As I mentioned, I'm not sure if it's the same, but it does explain things about HDD failures.
May be of interest.
 

Attachments

  • disk_failures.pdf
    241.7 KB · Views: 372

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Thanks. I plan to do a pdf update soon so I'll fix that link.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
While you're updating the presentation: I noticed you still recommend a minimum of 2GB RAM for UFS systems. I'm not even sure you can install FreeNAS 9.2.1.x with such a small amount of RAM and get it to run before setting up your drives; certainly we have seen some systems that couldn't do it. I'll leave it up to you to determine if you change it and what value you feel is safe now. You have definitely been much more involved in these forums than I have or will ever be.
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
Having seen your message this morning, I did a test install of 9.2.1.5 x64 in a VM with 2GB RAM this afternoon. Installed it on a 4GB drive (VMDK) and created a 20GB UFS volume (VMDK). After installation and first bootup, it looked like it was stuck with messages about "mDNSResponder". After several minutes [I thought it was stuck], it finally proceeded to the Console setup menu. Subsequent reboots were normal.

... I noticed you still recommend a minimum of 2GB RAM for UFS systems. I'm not even sure you can install FreeNAS 9.2.1.x with such a small amount of RAM and get it to run, ...
 

avpullano

Dabbler
Joined
Dec 30, 2012
Messages
42
Excellent presentation Cyberjock. I'm disappointed in myself for not finding this sooner as it cleared up quite a few things for me.

You say that failure of one VDev results in loss of the entire ZPool. Where can I get more information on why this happens? I googled as hard as I could and did not find a useful source. Also, the only way to add new disks via the FreeNAS GUI seems to be by adding a new volume (which is a new ZPool, right?), so why is having multiple VDevs on a single ZPool even an issue? Wouldn't I have to go out of my way to add VDevs to an existing volume? Is there a reason that people are trying to do this?

Thanks so much for the huge contribution. I've already learned a ton from your work. Keep it up!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The answer is in the source code. Most stuff isn't well documented in ZFS, so you learn by doing and by years of reading. I read about ZFS as often as I can, and after 2 years I'm still learning stuff and thinking "oh my".

As for adding disks, go back and read my presentation again. Adding a new volume is not the only way.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
avpullano, the reason that failure of a vdev causes the pool to fail is that vdevs are striped together. In a RAID0 array, if you lose a single drive, all your data is gone. Think of this the same way--each vdev is similar to a drive in a RAID0.
 

avpullano

Dabbler
Joined
Dec 30, 2012
Messages
42
avpullano, the reason that failure of a vdev causes the pool to fail is that vdevs are striped together. In a RAID0 array, if you lose a single drive, all your data is gone. Think of this the same way--each vdev is similar to a drive in a RAID0.

Ah, that actually makes a ton of sense. I was thinking of each VDev as a separate storage location altogether (which I see now would actually be separate zpools). So, for example, if I add multiple VDevs to a single zpool, I cannot associate each VDev as a CIFS share, I can only share the entire zpool as a whole. Got it. Thanks a ton!
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I'm afraid you don't understand quite as well as you think. You can create a CIFS share pointing to any directory or dataset--they don't have to (and usually don't) point to the pool as a whole. You can't share a particular vdev as such, though--they're transparent to the application layer.
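
For example (the pool name "tank" and dataset name "media" below are made up), the usual pattern looks something like this:

  # Hypothetical names; substitute your own.
  # Create a dataset; vdevs never appear at this layer.
  zfs create tank/media
  # In FreeNAS you would then point the CIFS share at /mnt/tank/media
  # (or any directory beneath it), never at a vdev.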

It is possible to create multiple zpools, though I don't think ZFS was intended to be used that way. This fragments your storage, but isolates the danger of vdev failure. For example, let's suppose that you have six drives, and you want to set them up as two, three-disk RAIDZ1 vdevs (ignoring temporarily that RAIDZ1 is no longer recommended). You could create those as two separate zpools, in which case they would be two distinct volumes, with distinct amounts of capacity, and no way to combine them. If two drives in a single vdev failed, you'd lose everything in that zpool, but the other zpool wouldn't be affected.

Alternatively, you could create a single zpool consisting of the two RAIDZ vdevs. This would give you a single volume with the combined free space of all the drives (less parity). However, if you were to lose two drives in the same vdev, you'd lose the entire pool.
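
To make the two layouts concrete (the pool and disk names below, da0 through da5, are made up), the difference is simply how the vdevs are grouped at creation time:

  # Hypothetical pool and device names; substitute your own.
  # Option 1: two separate pools. Two distinct volumes whose capacity
  # cannot be combined, but a two-disk failure only takes out one pool.
  zpool create pool1 raidz da0 da1 da2
  zpool create pool2 raidz da3 da4 da5

  # Option 2: one pool striped across two RAIDZ1 vdevs. One combined
  # volume, but losing either vdev loses the whole pool.
  zpool create tank raidz da0 da1 da2 raidz da3 da4 da5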
 

avpullano

Dabbler
Joined
Dec 30, 2012
Messages
42
I was using the CIFS share as a way to associate (in my head) what is usable vs what is transparent. I forgot that you can share individual datasets as directories.

I guess what was confusing me was that having multiple VDevs in a single zpool just adds another potential point of failure. I don't entirely understand why anyone would even consider doing that, but it sounds like it's just one way to grow a pool after you already have one VDev established.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
VDevs help handle problems that are inherent with very very large servers. Even with RAID6 most hardware RAID controllers recommend against more than 10 disks in the same array. So if you had 30 disks and you followed the recommendations of hardware RAID you'd have 3 arrays. But on ZFS you could still have 1 large pool. VDevs actually free you from many limitations and are not a limitation in and of themselves.
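
As a sketch of what that looks like in practice (the pool name "tank" and disk names da10 through da15 below are made up), growing a single pool is just a matter of striping in another vdev:

  # Hypothetical pool and device names; substitute your own.
  # Adds another RAIDZ2 vdev to the existing pool; the new capacity shows
  # up in the same pool rather than as a separate array.
  zpool add tank raidz2 da10 da11 da12 da13 da14 da15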
 

avpullano

Dabbler
Joined
Dec 30, 2012
Messages
42
Hmm, that seems like an important point. I keep thinking of these concepts in terms of home servers. I always forget that they are developed in the corporate world where needs are much higher than mine, haha.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Hmm, that seems like an important point. I keep thinking of these concepts in terms of home servers. I always forget that they are developed in the corporate world where needs are much higher than mine, haha.

Yeah. It's important to remember that ZFS is designed for absolutely ginormous servers. Home use was never really a consideration when ZFS was engineered. The fact that we are using it for home use means you need to consider 2 very important points:

1. Most "home users" are not ZFS geniuses and are *much* more susceptible to the "noob mistake" that kills your pool. FreeNAS somewhat helps as the WebGUI prevents you from doing stupid things, but only if you use it(and you should).

2. Most "home users" are not interested in buying ultra-high end hardware that ZFS was designed to run on. We("we" being the community) get some major benefits from server-grade parts which are generally much better quality than consumer parts, but we still have tons of users that will not spend the extra $50 or can't procure server-grade where they are located on the planet. In those cases they are taking risks that they will have to decide if it is appropriate for them.

ZFS isn't for small fries. You go with ZFS and you are definitely going to spend some cash, and you are certainly in it for the long haul. ;)
 