RAIDZ expansion, it's happening ... someday!

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
It’s out of alpha, and there’s a talk about it at the FreeBSD dev summit today!


It's not out of alpha; code that isn't merged is not considered stable or feature-complete. ;-)
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Out of alpha != stable.

Let’s not debate release stage semantics. You know what I meant.
 

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
Out of alpha != stable.

Let’s not debate release stage semantics. You know what I meant.
It's not even done yet; it's a new PR, and this time it's ready for code review instead of design review.
It might take between 2 and 12 months before it's merged.

It's not just "not stable"; it's still in development.
Alpha implies something has been released in some form, and nothing has been released yet - not even in the ZFS nightly builds.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
It will be an interesting feature when it is released, particularly for users whose use case has shifted ever so slightly after an initial deployment.

Larger changes in pool needs are likely better addressed by adding VDEVs rather than by adding more drives to existing VDEVs.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Absolutely. It’s meant for single-vdev deployments, where someone started with 4/5/6x and now wants a bit more.

I see it mainly as a solution for hobbyists. How often is it pointed out to people that no, they cannot start their raidz2 5-wide and later add one or two more drives, at least not without destroying and recreating the pool?

It won’t switch from raidz-1 to raidz-2, it won’t remove drives, it won’t switch from mirrors to raidz. Send/receive remains the right solution for those changes.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Thanks for the link! Inclusion in OpenZFS 2.2 around August 2022 sounds about right. And then of course that needs to be adopted by operating systems.
 
Joined
Jul 2, 2019
Messages
648
Thanks for the link! Inclusion in OpenZFS 2.2 around August 2022 sounds about right. And then of course that needs to be adopted by operating systems.
Let the office pool start on the date. From the end of the article:
TrueNAS may very well put it into production sooner than that, since ixSystems tends to pull ZFS features from master before they officially hit release status.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Some thoughts:
  1. The article points out that the storage efficiency of the "old part" of the pool will initially remain at the lower level associated with the previous VDEV disk/parity ratio. The effective storage capacity only increases as that data is moved and re-written, much like with sVDEVs. Thus, the benefit of adding a disk is somewhat tempered unless your pool has enough space to ZFS send datasets elsewhere, or its data is re-written a lot. To be sure that the effective storage efficiency goes up, the pool has to be nuked and rebuilt (negating the benefit of expansion).
  2. This feature brings some parity with unRAID, which can add drives to a pool one at a time. As commentators pointed out, business users usually add "full" VDEVs, not incremental drives to existing VDEVs.
  3. The impact on padding for a 6-drive vs. a 7-drive array was interesting as well. I can't say that I paid a lot of attention to padding when I built my pool. I presume my 8-drive Z3 pool likely has padding issues, but many posts here have suggested that padding is not important as long as the VDEV layout stays constant. However, what impact will an expanded VDEV experience?
So, I understand why the ZFS team took the pragmatic approach of not fully reconstituting / optimizing pools when sVDEVs are added or when individual drives are added to a VDEV during expansion. It's the conservative thing to do. But net-net, it likely makes more sense to make multiple backups (ideally to another ZFS system), rebuild the pool from the ground up, and move the data back.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The impact on padding for a 6-drive vs. a 7-drive array was interesting as well. I can't say that I paid a lot of attention to padding when I built my pool. I presume my 8-drive Z3 pool likely has padding issues, but many posts here have suggested that padding is not important as long as the VDEV layout stays constant. However, what impact will an expanded VDEV experience?
Little to no change, if you're using compression. Compressed blocks don't tend to neatly line up anyway.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Interesting. I have kept compression off since most of my pool is either already space-efficient (JPEGs) or already compressed (sparsebundles, disk images, etc.). That was for the benefit of the sVDEV, to minimize metadata.

Some of my system image backups had over a million small files in them. That has a significant impact on the performance of rsync and similar backup tools, as well as on metadata space requirements. Bundle it all up in a mountable disk image format and the metadata footprint shrinks considerably (from 2.5% to less than 1% of pool capacity, IIRC).

Anyhow, it’s a cool feature and when it comes, I have no doubt some home users will use it on occasion.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
I’m not so sure the efficiency thing is a big deal. The data don’t take up more space post expansion, they’re just not stored quite as efficiently as they could be. I think some real world examples showed a potential 3% space gain by rewriting. That’s a lot of effort for not a lot of gain.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Efficiency calculator spreadsheet: ZFS RAIDZ VDEV Expansion efficiency calculator - Google Sheets. Make a copy to use it! And do check my math - I think it's as simple as I have it, but if I've had a brain fart somewhere, please point it out.

As you'd expect, if you come from "4-wide raidz2", the potential efficiency gains are far larger than if you come from 6-wide. It's everyone's own call whether a rewrite of all existing data is worth it.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
louwrentius over on GitHub writes:

"I'm not sure that math is right. The scenario is that you add one drive at a time. If we take the original example and go from 4 drive RAIDZ2 to 10 drive RAIDZ2, what you basically will observe is that there will be data stored at various efficiency levels at each step when you add a drive. This is not taken into account in your calculation and that makes things seem worse than they are, I think, but maybe I'm wrong."

Correct. I am not calculating "get to X%, add a drive, get to X%, add a drive"; I am calculating "get to X%, add enough drives to reach final width" - one and done. There is room for improvement. You can also fiddle with "when do I rewrite data" - at each step? When reaching final width? It can get a bit messy, and I'd need to think more about what kind of "one by one" calculation is actually useful. In the end, all data are rewritten at the final width.

Fmstrat writes:

"I don't get it. ;) Shouldn't the capacity gain be greater in a "from 6 to 8" situation? Is "capacity gain" meaning something odd in this context? I.E. total capacity loss?"

By "capacity gain" I mean the potential capacity gain by rewriting existing data. From 6 to 8, I gain less by rewriting than from 4 to 6: Because the initial efficiency is already better. 4-wide is 50% parity, 6-wide is 33% parity, for raidz2.

There is no "capacity loss" really, the data don't take up more space after expansion. They're just not written as efficiently as they could be.
 

Louwrentius

Cadet
Joined
Jan 20, 2015
Messages
9
I am Louwrentius.

As you keep adding drives one by one, new data is written with a more efficient parity-to-data ratio, every step of the way.

This is what I tried to model in my spreadsheet.

I think the overhead of this process - growing a VDEV one disk at a time, on an as-needed basis - is not something to dismiss, but for many people it is an acceptable price to pay for ZFS. It is certainly much more acceptable than the price of having to expand with entire VDEVs at a time.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
So, you are expecting to rewrite all data at each step? So the efficiency gain would then be "original width gain compared to final" + "plus one width gain compared to final" + "plus two width gain compared to final", and so on?

I am saying "compared to final" because while in your model data are rewritten multiple times, that gets a bit wild. So for 4-to-10, it'd be percentage of 4 disks, 4-wide, compared to 10-wide, plus percentage of 1 disk, 5-wide, compared to 10-wide, plus percentage of 1 disk, 6-wide, compared to 10-wide, and so on.

The question I have is how useful that degree of granularity is. Simply because writing data after expansion is part of the expansion, and while it's not optimal during use, it becomes optimal at the end. I guess you are comparing "don't rewrite at all and expand one by one while continuing to fill", whereas I am comparing "don't rewrite at all and expand to final width before writing anything more". I can see where your model could come in handy, as well, for users that need to decide on whether to rewrite, and that will expand their vdev very slowly.
 

Louwrentius

Cadet
Joined
Jan 20, 2015
Messages
9
So, you are expecting to rewrite all data at each step?

No, that is not my assumption; quite the contrary. With every added drive, ZFS will redistribute the data across all drives, but it won't rewrite it, and I'm not proposing to do that at every step.

It is smart to just accept the overhead until it becomes large enough to be worth the time of rewriting the data. Or just accept the overhead and do nothing.

The question I have is how useful that degree of granularity is. Simply because writing data after expansion is part of the expansion, and while it's not optimal during use, it becomes optimal at the end. I guess you are comparing "don't rewrite at all and expand one by one while continuing to fill", whereas I am comparing "don't rewrite at all and expand to final width before writing anything more". I can see where your model could come in handy, as well, for users that need to decide on whether to rewrite, and that will expand their vdev very slowly.

I think I agree :smile:
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Okay I've updated the sheet. It now shows two calculations. Better language to make clear what those are would be welcome.

Calc 1: Expand the vdev when it's full to the threshold, going straight to the final width, then rewrite the data: how much space did we gain?

Calc 2: Expand the vdev by just one drive when it's full to the threshold; expand by another drive when it's full to the threshold again; keep doing this until the final width is reached, then rewrite the data: how much space did we gain?

The space gain with calc 2 is larger because more of the data was written at less than the final parity-to-disk ratio and therefore benefits from being rewritten.

This now uses a "spaceLoop" function. I don't know whether this comes across when making a copy. In case it does not:

Code:
// Potential capacity gained by rewriting all data once the vdev is at its
// final width, for a vdev expanded one drive at a time (calc 2 above).
// initial, final: vdev width in drives before / after expansion
// size: capacity per drive; parity: number of parity drives (2 for raidz2)
// threshold: fill fraction at which each expansion happens (e.g. 0.8)
function spaceLoop(initial, final, size, parity, threshold) {
  // Data written at the initial width, before the first expansion
  var spacesaved = initial * size * threshold * ((parity / initial) - (parity / final));
  // Data written at each intermediate width after a single-drive expansion
  for (var drives = initial + 1; drives < final; drives++) {
    spacesaved += size * threshold * ((parity / drives) - (parity / final));
  }

  return spacesaved;
}
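For anyone who copies the function rather than the whole sheet, a usage sketch. The argument meanings here are my reading of the code above (initial and final vdev width in drives, capacity per drive, number of parity drives, fill threshold as a fraction); in the sheet itself it would presumably be entered in a cell as a custom function, e.g. =spaceLoop(4, 8, 10, 2, 0.8). From the script editor:

Code:
// Hypothetical example: 4-wide raidz2 with 10TB drives, grown one drive at a
// time to 8-wide, expanding each time the vdev reaches 80% full (calc 2).
function testSpaceLoop() {
  var gainTB = spaceLoop(4, 8, 10, 2, 0.8); // roughly 10.15 for these inputs
  Logger.log(gainTB + " TB of capacity potentially gained by rewriting at the final width");
}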
 

Louwrentius

Cadet
Joined
Jan 20, 2015
Messages
9
Okay I've updated the sheet. It now shows two calculations. Better language to make clear what those are would be welcome.

Thank you for this effort. It took me a long time to wrap my mind around this, and your examples helped a lot. I'm planning to write a blog post about this whole topic; I'll make a copy of your sheet and credit you by linking to your account on TrueNAS. Let me know if there's any problem with that.
 

DayBlur

Cadet
Joined
Oct 25, 2014
Messages
2
Here's a simple example you can use to compare your understanding and calculations for these types of scenarios:
  1. Start with 4-wide RAID-Z2, 80% full. Assume 1TB disks for simplicity: 1.6TB data, 1.6TB parity. This data will not change, just get redistributed.
  2. Add 6 new disks, expand/redistribute to 10-wide RAID-Z2, and then fill to 80%. This adds 4.8TB at the new 80%/20% data/parity ratio, and so adds 3.84TB data, 0.96TB parity.
  3. Total data/parity on the array is now 5.44TB data, 2.56TB parity (sanity check: 8TB total, 10-wide filled to 80%).
  4. Compare the above final state to a fresh 10-wide RAID-Z2 at 80% full, which will have 6.4TB data, 1.6TB parity (again, 8TB on-disk). You will see that 0.96TB is lost compared to the case where all data was written at the final array width (i.e., 96% of one disk's capacity).
Further analysis:
  1. It is not a coincidence that the 0.96TB of data lost in the above example is the exact amount of parity used in filling the expanded array to the same percentage capacity as the initial array. This is because at that fill percentage, the two-disk parity was already allocated in the initial array (1.6TB in this example) and so ANY additional parity needed up to that same fill percentage is lost data space (an optimal RAID-Z2 array of ANY width should only have two disks of parity with the same percentage use as the entire array). Any usage in the final array beyond that original array's fill percentage is utilized with the space efficiency of the new array width and incurs no relative storage loss. In other words, only the array fill percentage at the time of expansion is important.
  2. To generalize this, the amount of data lost in expansion is the amount of new parity required to fill to the same fractional capacity, or:
    space lost (in terms of individual disk space) = f*P*(M-N)/M
    Where f is the fill ratio at the time of expansion (i.e., 0.8 for 80% full), P is the number of parity disks (P=2 for RAID-Z2), and N and M are the initial and final array widths (# of disks), respectively. For the above example, this equation gives 0.96, the same 96% of a single disk's capacity (see the sketch after this list).
  3. Note that I am defining filled ratio as the total bytes used (data+parity) in the array divided by the combined storage capacity of the array's disks since it is difficult to refer to only non-parity data usage and capacity in a RAIDZ expansion scenario.
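A minimal sketch of that relation in JavaScript, to match the snippet earlier in the thread (the function name is mine), checked against the worked example above:

Code:
// Data capacity lost to extra parity, in units of a single disk's capacity.
// f: fill ratio at expansion time, p: number of parity disks,
// n: initial array width, m: final array width (both in disks).
function spaceLost(f, p, n, m) {
  return f * p * (m - n) / m;
}

console.log(spaceLost(0.8, 2, 4, 10)); // 0.96 -> 96% of one disk, as in the worked example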
Additional examples:
  1. Using the above relation, you can see that if you instead waited until 100% usage to expand the original 4-wide RAID-Z2 array to 10-wide, you would lose 1.0*2*(10-4)/10 = 1.2, or 120% of a single disk capacity. This compares to the 96% disk space lost when expanded at 80% full.
  2. For a 6-wide RAID-Z2 expanded at 80% capacity to 10-wide, the numbers are 0.8*2*(10-6)/10 = 0.64, or 64% of a single disk capacity.
  3. For a 6-wide RAID-Z2 expanded at 30% capacity to 10-wide, the numbers are 0.3*2*(10-6)/10 = 0.24, or 24% of a single disk capacity.
  4. For a 4-wide RAID-Z3 expanded at 80% capacity to 10-wide, the numbers are 0.8*3*(10-4)/10 = 1.44, or 144% of a single disk capacity.
  5. For the more complicated incremental expansion (one disk at a time) every time 80% of the array capacity is used, the loss is calculated incrementally at each expansion using the same formula above (with the common 80% and RAID-Z2 factors brought out here for brevity):
    0.8*2*(1/5+1/6+1/7+1/8+1/9+1/10) = 1.353, or ~135%.
  6. Finally, here's an example you can use to show that the equation holds for arbitrary actions (correctness not proven here). Start with a 4-disk Z2 array, expand when it's 50% full to 5-disks, then expand when that is 80% full to 8 disks, then expand when that is 60% full to 10 disks.
    space lost = 0.5*2*(5-4)/5 + 0.8*2*(8-5)/8 + 0.6*2*(10-8)/10 = 1.04, or 104% of a single disk.
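Applying the same relation step by step reproduces examples 5 and 6; again just a sketch, reusing the spaceLost function from the snippet above:

Code:
function spaceLost(f, p, n, m) {
  return f * p * (m - n) / m;
}

// Example 5: 4-wide RAID-Z2 grown one disk at a time to 10-wide,
// expanding every time the array reaches 80% full.
var incremental = 0;
for (var width = 4; width < 10; width++) {
  incremental += spaceLost(0.8, 2, width, width + 1);
}
console.log(incremental); // ~1.353 disks

// Example 6: 4 -> 5 at 50% full, then 5 -> 8 at 80%, then 8 -> 10 at 60%.
console.log(spaceLost(0.5, 2, 4, 5) + spaceLost(0.8, 2, 5, 8) + spaceLost(0.6, 2, 8, 10)); // 1.04 disks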
Hopefully the above clarifies and formalizes my understanding of what's going on and will help with any spreadsheets/calculators. Let me know if you find I've missed something or made an error.
 