Slideshow explaining VDev, zpool, ZIL and L2ARC for noobs!

Joined
Oct 9, 2012
Messages
5
Here is an *.odp version of the 'FreeNAS Guide 9.2.0' PowerPoint for OpenOffice users. (Slightly edited, but just the formatting.) PS: Only for the current release of the document (updated as of January 8, 2014, 9.2.0).
 

Attachments

  • FreeNAS.Guide.9.2.0.odp.zip
    109.2 KB · Views: 328

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Here is an *.odp version of the 'FreeNAS Guide 9.2.0' PowerPoint for OpenOffice users. (Slightly edited, but just the formatting.) PS: Only for the current release of the document (updated as of January 8, 2014, 9.2.0).

Thanks for that.. do you know of any way I can easily convert my guide to OpenOffice in the future while keeping the animations? My problem before was that when I converted to OO it got AFU'd.
 
Joined
Oct 9, 2012
Messages
5
Can't give a good answer on that, sorry...
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
Should page 55 mention that the WD Reds now come with the idle timer?

* The only exception is converting a Vdev from a single disk to a mirror.

Consult the FREENAS manual for more information.

Just had another read.. So a single disk in vdev1 can be turned into a mirror? So disk 1 is in vdev 1, and I add disk #2 to vdev 1? Or does disk #2 get its own vdev (vdev 2) and just mirror vdev 1 (which seems improper)? I'm planning on migrating to a RAIDZ2 of 6x2TB soon.. Was just curious if I understand what "converting a vdev from single disk to mirror" actually means..
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
A single-disk vdev can be converted to a mirrored vdev. So disk1 is in vdev1, you want to mirror that vdev, so you attach the second disk as a mirror and you end up with disk1 and disk2 both in vdev1.
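For anyone curious what that looks like under the hood, here's a minimal command-line sketch (pool name 'tank' and disk names ada0/ada1 are just placeholders; in FreeNAS you'd normally do this through the GUI, which uses gptid labels rather than raw device names):

Code:
  # tank currently has a single-disk vdev on ada0
  zpool status tank

  # Attach ada1 to ada0; the single-disk vdev becomes a two-way mirror
  # and ZFS resilvers the new disk automatically.
  zpool attach tank ada0 ada1

  # Once the resilver finishes, 'zpool status tank' shows mirror-0
  # containing ada0 and ada1.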
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
OK, good to know.. So this is the *only* situation where you can add another disk to an existing vdev.. I can't wait to run a RAIDZ2 pool..
 

ellupu

Dabbler
Joined
Mar 26, 2014
Messages
13
Hi cyberjock,
first of all, I'm amazed by what you've done for us noobs! *bow* I'm reading your presentation at the moment, currently at slide 21, and I noticed that you've written "exlample 3" twice in the headline (also on the slide before). I found that bug, so it's mine! *haha* :D
But that's not the reason I'm writing.
This example describes partial redundancy, and if one of vdev2-7 fails, it would mean a complete loss of data in the zpool.. I didn't get that.. why? Let's look at this from a physical point of view.. If vdev2 (with that hard drive) fails, then normally I've lost the files on that hard drive, but the rest should be safe.. I do understand that if it fails, all data is inaccessible, but not lost.. creating a new zpool without that drive should work, or not? I would never ever make such a setup like in this example, seriously, but I would like to understand.


[Attached screenshot: J8vqcee.png]

And before I die of stupidity ... a vdev can be detached.. isn't that the same as removing it? I'm asking this because I tried to create a vdev named volume1 before and it was damaged or corrupted. After detaching it, the vdev wasn't in my list anymore. I've added a picture, and the current vdev should be HDD01, correct? Otherwise I'm talking about something different.
 

Fox

Explorer
Joined
Mar 22, 2014
Messages
66
I'm a little lost with all the docs.. Specifically, I have seen a lot of usage of the term vdev in the forum messages, but when I read the docs, I don't see much mention of it, at least in the terminology section. For example:
http://doc.freenas.org/index.php/Hardware_Recommendations#ZFS_Overview

The above link uses the terms pool, dataset, and zvol.. Where does the vdev fit in? And even more confusing is that I see vdev frequently mentioned in other areas of the documentation, such as:
http://doc.freenas.org/index.php/Volumes#Encryption

Also, I have seen mention that we only have one pool, yet, in the documentation, I see:
http://doc.freenas.org/index.php/Volumes#Encryption
The encryption key is per ZFS volume (pool). If you create multiple pools, each pool has its own encryption key.

In cyberjock's guide, things are very well explained, but when I start to read the docs, I get a little confused.. I would assume that as soon as I start using the system, things would become more obvious, but I'm wondering at this point if there are different words for the same concepts?

Thanks

PS: I don't have a system yet.. I do have a VM to test things out, which I probably need to boot back up and play around with a bit more before reading the documentation again..
 

ellupu

Dabbler
Joined
Mar 26, 2014
Messages
13
I feel for you, Fox... I think I need a big red arrow pointing at the specific answer to enlighten my brain.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
This example describes partial redundancy, and if one of vdev2-7 fails, it would mean a complete loss of data in the zpool.. I didn't get that.. why? Let's look at this from a physical point of view.. If vdev2 (with that hard drive) fails, then normally I've lost the files on that hard drive, but the rest should be safe.. I do understand that if it fails, all data is inaccessible, but not lost.. creating a new zpool without that drive should work, or not? I would never ever make such a setup like in this example, seriously, but I would like to understand.

I have no clue where you are getting the idea of inaccessible meaning "not lost". If your pool is inaccessible, the data is not available for you to use. If any vdev is unavailable because there are insufficient replicas, your entire pool goes offline at that instant. And the entire pool will not be able to be mounted again until you restore enough missing disks to make the vdevs at least degraded (degraded meaning disks are missing, but there is enough redundancy in the vdev to support a full copy of your data.. e.g. 1 or 2 disks missing in a RAIDZ2 vdev).

And before I die of stupidity ... a vdev can be detached.. isn't that the same as removing it? I'm asking this because I tried to create a vdev named volume1 before and it was damaged or corrupted. After detaching it, the vdev wasn't in my list anymore. I've added a picture, and the current vdev should be HDD01, correct? Otherwise I'm talking about something different.

And I have no clue what you are talking about, as vdevs have no "detach" button. Disks themselves have a detach button. And if you don't have enough redundancy for the vdev to keep operating with that disk removed, it will return an error saying you cannot remove it from the pool.

Not sure what you are talking about, but a vdev is not a disk unless you are using single-disk vdevs. And in that case a detach button will exist, but if you try to use it, it won't work.
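If it helps to see those states from the command line, here's a hedged sketch (pool name 'tank' and disk names are placeholders, and the exact wording of the status output varies by version):

Code:
  # Quick health check across all pools
  zpool status -x

  # Full status for one pool; each vdev shows ONLINE, DEGRADED or UNAVAIL
  zpool status tank

  # Replacing a dead disk in a redundant vdev brings it back from DEGRADED
  # once the resilver completes (old disk ada2, new disk ada5 here).
  zpool replace tank ada2 ada5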
 

ellupu

Dabbler
Joined
Mar 26, 2014
Messages
13
Thanks for your reply, cyberjock, and sorry if I confused you with my post,
currently I have a single drive in my system for testing purposes, until I switch to a 5-6 drive RAIDZ2 setup. Referring to your example:

The zpool can withstand up to 2 hard disk failures in VDev 1, but a failure of any hard disk on VDev 2,3,4,5,6 or 7 will result in a loss of all data.

This confused me. Do you mean a "loss of all data" in the vdev that failed, or in the zpool, which would also mean a loss of all data in vdev1? That was my interpretation of the quote, and the point I didn't completely get.

Before we continue, let me see if I understood the basics right. Based upon the screenshot I posted before:

zpool = volume
vdev = /mnt/HDD01 (which in my case is just a single drive)

Am I right?
 

ser_rhaegar

Patron
Joined
Feb 2, 2014
Messages
358
There are disks
There are vdevs made of disks
There are zpools made of vdevs

If more drives fail in a vdev than there are parity drives, the vdev fails.
If any single vdev fails, the entire pool is lost.
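In zpool-command terms, that hierarchy looks roughly like this (pool name and disk names are made up for illustration; FreeNAS builds the same layout from the GUI):

Code:
  # One pool ('tank') made of two RAIDZ2 vdevs of six disks each.
  # Each vdev can lose up to two disks; lose a third disk in either
  # vdev and the whole pool is gone.
  zpool create tank \
      raidz2 da0 da1 da2 da3 da4 da5 \
      raidz2 da6 da7 da8 da9 da10 da11

  # Adding another vdev later grows the pool (and cannot be undone).
  zpool add tank raidz2 da12 da13 da14 da15 da16 da17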
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
ellupu:

From slide 12:

  • A zpool is one or more VDevs allocated together.
  • You can add more VDevs to a zpool after it is created.
  • If any VDev in a zpool fails, then all data in the zpool is unavailable.
  • Zpools are often referred to as volumes.
  • You can think of it simply as:
      • Hard drive(s) go inside VDevs.
      • VDevs go inside zpools.
      • Zpools store your data.
  • Disk failure isn’t the concern with ZFS. Vdev failure is! Keep the VDevs healthy and your data is safe.

/mnt/HDD01, since you have a single disk and a single vdev, is both your pool and your vdev. For anyone who is concerned with their data and uses redundancy or multiple vdevs, HDD01 would normally just be the pool.
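To make the single-disk case concrete, a quick sketch (the disk name is hypothetical; FreeNAS would create the pool for you from the GUI):

Code:
  # A single-disk pool: the disk, the vdev and the pool all stand
  # or fall together.
  zpool create HDD01 ada1

  # 'zpool list -v' shows the vdev layout under each pool;
  # here it is just the one disk.
  zpool list -v HDD01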

This confused me. Do you mean a "loss of all data" in the vdev that failed, or in the zpool, which would also mean a loss of all data in vdev1? That was my interpretation of the quote, and the point I didn't completely get.

Yes. I totally mean a "loss of all data". ALL of your data, ALL of it, is gone.
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
Thanks for that.. do you know of any way I can easily convert my guide to OpenOffice in the future while keeping the animations? My problem before was that when I converted to OO it got AFU'd.
Great presentation!! I had no problems opening it with the latest version of LibreOffice. OpenOffice and LibreOffice are similar, but they have different "issues." LibreOffice handles PowerPoint files much better (it also handles Excel files better). You might also try saving the file in the older PowerPoint format, .ppt, as older versions of OpenOffice handle that format more reliably.

I would suggest a different background. The very light gray to white in the upper left makes the white font difficult to read.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
pschatz100: I'll see about a darker background in the next release. Maybe a black marble or something.

Edit: And welcome to the forums!
 

ellupu

Dabbler
Joined
Mar 26, 2014
Messages
13
@ser_rhaegar, cyberjock: Thanks for your answers!
My questions may have looked a bit ridiculous given cyberjock's guide, which is a pretty good start on the topic. I asked because I was trying to figure out why a failed vdev (with a single disk) corrupts the whole pool. Personally, I thought a vdev ran in something like a sandbox, but in fact it reminds me of the days when accessing an optical drive without a disc inside caused a blue screen on Windows 98. Anyway, thanks for sharing your knowledge, it helped me a lot!
 

maurertodd

Dabbler
Joined
Nov 3, 2011
Messages
47
cyberjock, there is a ton of good data in this slide deck. I wish I'd had it when I was ramping up my ZFS learning curve.

The only page I am thrown by is slide 12 explaining the ZIL. Lots of good comments on that page. However, I learned that the purpose of the ZIL isn't so much a write cache, but to make clear the intentions of an impending write. It is ZFS's way of making sure that the zpool stays consistent. ZFS writes are copy-on-write, so the locations of portions of a file are changing, and a given "write" is usually spread across multiple physical disks (or, in the case of a SAN, multiple LUNs that appear to ZFS as physical disks). I thought the purpose of the ZIL was to record the changes to the files that were about to happen, so that in the event of something "bad" (think power failure, OS crash, etc.) the zpool would remain consistent. It would either a) be consistent as if the write never occurred or b) as if it had, but never in a part-way state.

When I hear the ZIL being used as a "write cache" I think about it writing the data to the ZIL, then later writing it...again to the new location. I don't believe that is what is happening.

What also happens is that the data being written to the ZIL is much smaller than the typical data write. Think of writing 8K database pages to a zpool: write 1K of ZIL info, then write the 8K data page. Once the write has successfully completed, the ZIL info is deallocated. Since ZFS is copy-on-write, both of those writes must be allocated from free space. As soon as the write completes successfully, the 1K of ZIL info is deleted (actually only marked free). Whether the initial location of the data is deallocated will depend on snapshots and write activity; copy-on-write is the magic that enables snapshots. Alternating small allocation, big allocation, delete small allocation is a great way to fragment the zpool. By having a disk (or, as cyberjock suggests, a mirrored pair) handle the ZIL, all that ever gets allocated and deallocated in the zpool itself is 8K database pages, and free space doesn't tend to get more and more fragmented.

For those who thought ZFS is immune to the troubles of fragmentation: it is possible for fragmentation to sap the performance of your zpools. Oh, and there is the problem that there isn't a good way to defrag zpools. In workloads where zpools are subject to ZIL writes, a separate ZIL disk helps to avoid fragmentation. Also avoid filling your zpool to its limit. Rules of thumb on how full to fill your zpool run the gamut; I've heard everything from "only allocate up to 80%" to "only allocate up to 95%". Mileage will vary based on write patterns.

Making the ZIL an SSD pair, as is suggested in the slide deck, further improves write performance as the writes happen quickly.

NOTE that all of this is in the context of FreeNAS, where there is an assumption that we are dealing with local disks. When using ZFS with a SAN, some things change. There, using an SSD for the ZIL isn't as important, as the write performance is determined by the array's write cache rather than the physical disk. Mirroring isn't as important if the LUN is already redundant.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
When I hear the ZIL being used as a "write cache" I think about it writing the data to the ZIL, then later writing it...again to the new location. I don't believe that is what is happening.

Actually, that's exactly what is happening.

So normally writes are stored in RAM, then committed to the pool at regular intervals. The ZIL is literally a copy of everything in RAM that needs to be committed to the pool and that was promised via sync writes. The purpose of the ZIL is to be non-volatile storage for writes that the OS is claiming have been written to non-volatile storage. A sync write, by definition, MUST be protected from kernel panics, power loss, etc. from the moment the server reports that the sync write was performed. This is POSIX compliance, and a lot of businesses and markets would immediately dismiss ZFS as a reliable file system if this wasn't the case. This is also why ZILs should be mirrored. Back in v15 there was the potential to end up with an inaccessible pool that won't mount if your ZIL is lost. With newer versions that problem is fixed, BUT you still need mirrors. Without the mirrors, only one copy of your data exists if an inopportune crash/power loss occurs. Some people will say "eh, I don't consider a crash or loss of power to be likely". Well, then just set sync=disabled, because that's literally the same thing if you are REALLY 100% committed to that comment. You can't argue your server is reliable, then do things that are unreliable while claiming the former. You only get to go one way.
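For reference, a hedged command-line sketch of both options mentioned above (pool name 'tank' and device names are placeholders; in FreeNAS you'd normally add the log device through the GUI):

Code:
  # Add a mirrored pair of SSDs as a dedicated log device (SLOG)
  zpool add tank log mirror ada4 ada5

  # The "I accept the risk" alternative: stop honoring sync writes.
  # Only sensible if you genuinely don't care about losing the last
  # few seconds of writes after a crash or power loss.
  zfs set sync=disabled tank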

Now, things are more complex than you think. And this is why I don't try to get complex with answers because I really don't want to school every Tom, Dick, and Harry that shows up on the forum.

As for your comment that data being written to the ZIL is smaller than the typical write, you're somewhat correct. There are many factors that go into what goes to the ZIL and what goes to the pool itself. But any write that is >64kB is actually made straight to the pool. The ZIL is meant to cut down on I/O, not to increase bandwidth. The ZIL would be less efficient if it had to store one 64kB write instead of four 16kB writes. Remember, only sync writes end up in the ZIL. Non-sync writes are simply kept in RAM until flushed using internal thresholds that I won't try to explain here.

If you do your research on how full to fill a pool, there are times when you don't want to exceed 50%. Yes, 50%. Now, 95% is a magic percentage because that's when ZFS goes from a "write for fastest speed" mode to a "write to fill the drive" mode. The metaslabs are likely to have only small amounts of empty space, so ZFS goes into a mode where it tries to use every last bit of available disk space without regard for optimizing writes. The second mode can be much, much slower, and obviously creates fragmentation that can only be undone by deleting the actual data.
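If you want to keep an eye on that yourself, a small sketch (pool name is a placeholder):

Code:
  # Show how full the pool is; keep CAP comfortably below the point
  # where performance falls off for your workload.
  zpool list -o name,size,allocated,free,capacity tank

  # Newer ZFS versions also expose fragmentation as a pool property
  # (not available on every release).
  zpool list -o name,fragmentation tank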

Your comment that "Making the ZIL an SSD pair, as is suggested in the slide deck, further improves write performance as the writes happen quickly" is both right and wrong. A ZIL that is on an SSD pair should be a mirror. PERIOD. And mirrors will NOT increase performance, since both disks must complete their writes. Technically, because you must write to both disks and both must complete, your ZIL will be slightly slower. Sucky, but true. The actual penalty may not necessarily be very noticeable if your SSDs are sufficiently fast.

ZFS is NOT a good fit for storage on a SAN. PERIOD. Doing it ranges from silly to just plain dangerous. ZFS is supposed to have direct disk access. Anything less than that can compromise recently written data and, in the worst case, your pool. This is no different from the reasons why we tell people never to run ZFS on a hardware RAID controller. I'm not even going to discuss putting the ZIL on SSDs in front of an array's write cache, because the idea is that preposterous.

I will grant you that a ZIL really isn't a write cache in the same way that "write cache" is used in almost every other situation. But it's the best choice of words I can come up with to explain it to a noobie. If you want to get detailed, hey, the ZFS code is open source. Have a ball with it. My presentation is NOT intended as a replacement for good ol' homework. It's a primer. That's it. Enough for you to have a very basic understanding of the acronyms and what they do, for the most part.
 

maurertodd

Dabbler
Joined
Nov 3, 2011
Messages
47
Not only has ZFS been done on a SAN, but it was worse than that. (Not my design.) SAN LUNs were presented to a Solaris control domain (CDOM). The CDOM put the LUNs in a zpool. 100% of that zpool was presented to the LDOM as a virtual disk, and then the LDOM put the virtual disk into its own zpool with a filesystem built on top of it. So even if the application write was async, the write to the CDOM zpool translated into a sync write. While the LDOM zpool had plenty of free space, over time it would use each of the blocks of its zpool. Meanwhile, the CDOM would have no idea what was allocated or deallocated, only that space had been used. Since the LDOM zpool was the same size as the zpool "hosting" it, CDOM zpool fragmentation was inevitable. One application write would translate into 400-1000 writes at the CDOM level.

Re: what is actually written to the pool... I'm stuck between two credible experts, so some research is in order. As is the case in CPM (Capacity and Performance Management), the answer seems to be "it depends." https://blogs.oracle.com/realneel/entry/the_zfs_intent_log seems to be a credible explanation of the topic. It reads:
ZIL and write(2)
The zil behaves differently for different size of writes that happens. For small writes, the data is stored as a part of the log record. For writes greater than zfs_immediate_write_sz (64KB), the ZIL does not store a copy of the write, but rather syncs the write to disk and only a pointer to the sync-ed data is stored in the log record. We can examine the write(2) system call on ZFS using dtrace.
I'd think that from a practical standpoint, sync writes larger than 64KB would be an unusual workload. As for small writes... it would appear that there are other transactions that create a ZIL record and might be committed to stable storage with a sync write. Interesting to hear that the smallest ZIL block to stable storage seems to be 4KB, whereas my other expert implied 1KB was not unusual. Still, alternating 4KB and 8KB writes would work to fragment the zpool, especially if it were too full to allow free space to coalesce naturally before space needs to be reused.

As cyberjock says, this gets deep, but ZIL records are created for every activity on the zpool. Those records are destroyed when the activity completes successfully. However, the vast majority of this is held in memory and is never committed to stable storage.

There is something here I'm not sure I understand. zpools can be created with and without log devices. (It could be a single device, but mirrored is far wiser in the context of FreeNAS.) Without log devices, the ZIL info committed to stable storage is written to blocks in the main pool. I'm trying to imagine doing a sync write of 32K of data. A ZIL record is created that includes the 32K of data. Because it is a sync write, that record gets committed to stable storage. If there are no log devices, that 32K now resides in the main pool; then the actual write needs to occur, and that same 32K is rewritten to the pool? Interesting. I'm assuming the committed ZIL info that includes the 32K of data will be slightly larger than 32K. Say, for the sake of argument, it is 36K total. That still sets up a scenario for fragmentation, as that 36K area would be large enough to hold a future 32K data write, leaving a 4K fragment which is good for not much.

Re: performance of SSD mirror pairs... my original comment was poorly worded. Using an SSD will usually be faster than a rotating disk, and therefore a pair of SSDs will usually be faster than a pair of rotating disks. Yes, I understand the write penalty of mirrored (a.k.a. shadowed) disks; I've been doing OS-based mirrored disks for nearly three decades. While cyberjock probably knows this, noobs are reading too. Since the writes to the mirror members are issued asynchronously, they hopefully overlap. Still, there is more overhead to issuing two writes instead of one, and since each I/O service time is variable (a best-case constant plus a little more random amount), doing two and waiting for both to complete means you have twice the chance of one taking longer. So for best performance, all other things being equal, provide separate data paths to the various members of a mirror set.

The point we both are trying to make is that, in the vast majority of cases, the writes to the log devices (there's a command-line sketch for watching this after the list):
  • Add overhead to the process of synchronous writes by adding an additional set of writes to each sync write.
  • Are only read back in the context of recovery; the ZIL data is usually written to stable storage and then deallocated a short time later.
  • Need only a small amount of space compared to the size of the zpool.
  • And if the log device fills up, things don't die; ZFS just writes the ZIL to blocks in the main pool.
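A hedged way to actually watch that log-device traffic (pool name is a placeholder):

Code:
  # Per-vdev I/O statistics, refreshed every 5 seconds; the 'logs'
  # section shows how much of the write load is hitting the log device.
  zpool iostat -v tank 5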

Re: not doing ZFS on a SAN... that's the first time I've heard that. I'm not sure I would equate the best practice of "don't use RAID controllers" with a SAN array. There are many similarities, but also huge differences. In fact, Oracle's Solaris ZFS docs talk very briefly about using SAN storage and whether or not to use RAIDZ or mirroring. All that conversation says (paraphrasing here) is: use RAIDZ or mirroring unless you REALLY REALLY trust the array's redundancy.

That said, the sysadmin's choice of ZFS was an interesting one. The CDOM zpool was given "redundant" LUNs, but the redundant LUNs were both on the same array and on the exact same SAN path. So the write penalty of mirrored drives was made even worse by the fact that using the same SAN path serialized the writes to the LUNs, with little or no real gain in reliability. Aside from the use of ARC (and other file systems, including UFS, have caches), I couldn't see any other feature of ZFS being leveraged. The zpools were also created with log devices, making matters worse.
 