Rewrite pool in situ?

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Hello,

I'm planning to add two special vdevs to my main pool.
To my understanding, in order to make use of them, I'd have to 'rewrite in situ' my pool's data to populate the special vdevs.

What would be the most efficient way to do this?
A large chunk of my data is "too large to be doubled during a send/recv", but in turn is 'user friendly' enough that it would not suffer from an rsync -r.
Other parts are more of a 'block nature', that is, larger blobs owned and operated by a ProxmoxBackupServer, for example. For these, I'm hesitant whether even an rsync --archive is sufficiently safe.
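For the 'user friendly' chunk, I'm thinking something along these lines (untested sketch; paths are placeholders):

Code:
# copy within the same pool, preserving permissions, ACLs, xattrs and hardlinks
rsync -aHAX --info=progress2 /mnt/tank/media/ /mnt/tank/media_new/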

The other "potential" options would be ZFS send|recv, which would be nice for the 'owned blobs', but cannot be done on the overall pool due to space constraints.
Is there any "gotcha" to keep in mind when doing a zfs send/zfs recv on the same pool?

Here's a suggestion, would this be sufficient?

Code:
zfs snapshot tank/dataset@snapshot
zfs send -R -v tank/dataset@snapshot | pv -Wbrft | zfs recv -s tank/newdataset

Noting that zfs recv -nv is a 'verbose dry run' that does not actually receive the dataset.
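For reference, the dry run I have in mind would look roughly like this (a sketch, same placeholder names as above):

Code:
# verbose dry run: parses the stream and shows what would be received, without writing anything
zfs send -R -v tank/dataset@snapshot | zfs recv -nv tank/newdataset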


Any suggestions or input?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
There's a script for that...


 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Would you look at that... Nice!

I'll give it a run.
 

Jorsher

Explorer
Joined
Jul 8, 2018
Messages
88
Thanks. I was unaware of this script and would like to use it.

I started with 8 x 8TB, then replaced with 8 x 14TB, then added the 8 x 8TB, then added 8 x 16TB, then replaced the 8 x 8TB with 8 x 16TB. Upgrades/replacements done when I was practically out of storage space. I'm sure I could benefit from a rebalance, although it'll take a very long time with the amount of data...
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I wrote my own script to rebalance a specific dataset.
Warning - do not do this on an encrypted dataset, and you may want to run the final two commands manually. It has absolutely no error checking, but it does indicate the process:

Code:
zfs snap BigPool/SMB/Archive@migrate
zfs send -R BigPool/SMB/Archive@migrate | zfs recv -F BigPool/SMB/Archive_New
zfs snap BigPool/SMB/Archive@migrate2
zfs send -i @migrate BigPool/SMB/Archive@migrate2 | zfs recv -F BigPool/SMB/Archive_New
#zfs destroy -rf BigPool/SMB/Archive
#zfs rename -f BigPool/SMB/Archive_New BigPool/SMB/Archive
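Before the commented-out destroy/rename, a sanity check along these lines wouldn't hurt (just a sketch, using the same dataset names as above):

Code:
# compare space usage and snapshot lists before destroying the original
zfs list -o name,used,referenced BigPool/SMB/Archive BigPool/SMB/Archive_New
zfs list -t snapshot -r BigPool/SMB/Archive BigPool/SMB/Archive_New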
 
Joined
Oct 22, 2019
Messages
3,641
Not to be a negative Mr. Murphy downer, but please please please please make sure you not only have a recent backup that is independent of this system, but also one that you confirm you can access.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Not to be a negative Mr. Murphy downer, but please please please please make sure you not only have a recent backup that is independent of this system, but also one that you confirm you can access.
In particular related to the use of special vdevs, or copy/rewriting files?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Both I think
 
Joined
Oct 22, 2019
Messages
3,641
Both I think

Definitely.

Imagine a hiccup in the "intra-pool" migration process or a mishap with the special vdevs. There is no "undo" button if something unexpected goes wrong.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Definitely.

Imagine a hiccup in the "intra-pool" migration process or a mishap with the special vdevs. There is no "undo" button if something unexpected goes wrong.
Yeah, well of course. You're not wrong in any way.
I might've overestimated how mature the special vdev is; my take is that it is slightly under-utilized/discussed on these forums compared to elsewhere. (edit: somehow I did not do proper forum searching, but relied on 'eye-browsing memory' of what has been floating past recently - most thread activity seemed slightly older)

I opened up the good old browser and had a peek. Here's what I've gathered:
I found one radical event, seemingly not easily transferable to other setups for testing, that occurred when attempting to re-import a pool which had been exported. Also, this was on Ubuntu, and 2 years ago.

A maybe-related issue on failure to re-import a pool (last post), also on Ubuntu.

There are a couple of other upstream issues to maybe keep track of too (I had a look through the open issues and did some searches):
https://github.com/openzfs/zfs/issues/10903 (seems to be an issue of attempting to release the special vdev from a pool of mirrors, and later suffering an unexpected shutdown that causes <le issue>). Also an important comment: "zpool import is less robust about detecting special vdevs when they are located in non-standard locations (outside /dev/)".
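If an import ever gets picky about where the special vdev devices live, pointing it at an explicit device directory is presumably the workaround - something like this (untested on my side; pool name is a placeholder):

Code:
# tell zpool import to search a specific device directory
zpool import -d /dev/disk/by-id tank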

https://github.com/openzfs/zfs/issues/12501 (autoexpand on special vdev - doubt it is particularly important)

On top of these, I had a scroll through the release notes https://github.com/openzfs/zfs/releases?page=1
and did not find anything in the last approximately 2 years that indicated direct fixes to the issues reported above.

I'd think there would've been more accounts and havoc, and patches, if these special vdevs were actually quite dangerous to use.

Hm. I might need to do some more sanity checking of my own before I proceed, just to check for a few gremlins,
even if I found far less than I anticipated going into this search.
 
Last edited:

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Definitely.

Imagine a hiccup in the "intra-pool" migration process or a mishap with the special vdevs. There is no "undo" button if something unexpected goes wrong.
If you run my entire script against an encrypted dataset, the result is a deleted dataset - don't ask me how I found out (it's kinda obvious). Still, that's what backups are for.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
I've done some playing around.

I set up a special vdev on my SSD pool, which is "live" but has good backups, to test and play with.
It consists of 2x mirrors of 500GB SSDs - now also with a special vdev.
I tried a couple of things first, such as adding a single special vdev drive, 36GB in size, then adding a larger 500GB drive as a mirror and removing the smaller one to see if it expands properly (as indicated in one of the findings above, that had been somewhat of a problem in particular cases).
As my pool consists of mirrors, I could also remove the special vdev without problem via the GUI.
I tried exporting and importing the pool at various stages of testing too.
No hiccups, nothing unexpected with regards to the special vdev.
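For reference, the CLI equivalents of what I was exercising through the GUI would look roughly like this (a sketch only; pool name, gptids and the vdev label are placeholders):

Code:
# add a mirrored special vdev (placeholder device names)
zpool add ssdpool special mirror gptid/aaaa-1111 gptid/bbbb-2222
# remove it again - possible here since the pool has no raidz top-level vdevs
# (use the vdev name exactly as zpool status shows it)
zpool remove ssdpool mirror-2
# watch the evacuation/removal progress
zpool status -v ssdpool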

I had one gotcha though: the GUI loses track of snapshots on the pool after exporting/importing, which I had not anticipated.

So onwards to more testing and attempting to make slightly more informed decisions on the SSD pool, before even attempting to do something on my rotating-rust RAIDZ2 (where I cannot easily revoke mistakes).

I found this post at the L1Techs forums particularly useful.
The one-liner below provides a histogram of file sizes and the number of files in each size bucket.
It is useful for hinting at a sensible record size for a dataset, and subsequently the size of 'small files' to be captured by the special vdev. In case the special vdev becomes "full", it's not the end of the world: records that would've ended up in the "special class" are instead demoted to the "normal class" and end up on the pool as usual. I think that is a point of relief when sizing the special vdev. I'm still going to try to at least reduce the guesswork, as you'll read about shortly.

As Wendell points out, the size of small files needs to be smaller than the record size for this to make sense.
Code:
find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'

I've run this on my SSD pool: all SSDs, about 1TB raw, 500G used. Default recordsize @ 128KiB.
In total, there are boatloads of small files and pretty much nothing larger.
There are a few datasets that have very different characteristics.
Most notably - one where qcow2 files are stored, and one where photoprism has its backend store for auxiliary files.
This is the pool:
[attached screenshot: file-size histogram for the whole pool]


Here's another perspective of the SSD pool. The idea here is to estimate the accumulated GiB of small-block file sizes to better size the special vdev.
In this case, if I ran a 1MiB record size, the small block size would be capped at 512KiB, which would bring about 79GiB of small-block data onto the special vdev.
Luckily the small block size, as well as the record size, can be adjusted per dataset, which brings an additional level of granularity.

[attached screenshot: accumulated small-block data sizes for the pool]


Diving deeper, looking into specific datasets (providing different paths to the script), I found that almost all of the file-count data came from the same dataset: the NFS backend for photoprism, which generates humongous amounts of small files. Here's what it looked like:

[attached screenshot: file-size histogram for the photoprism dataset]


I did some playing around with setting the record size lower and rewriting the pool (still in progress, as it takes an estimated 2 days to complete). I set recordsize to 4KiB and small block size to 2KiB. (I'm a bit torn, I maybe should've gone for 16KiB instead. Oh well.)

The qcow2 dataset had only a handful of files smaller than 8KiB. I set the record size to 1M and the small block size to 512KiB.
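In terms of commands, what I applied boils down to roughly this (dataset names are shortened/illustrative, not my exact layout):

Code:
# small-file dataset (photoprism backend)
zfs set recordsize=4K ssdpool/photoprism
zfs set special_small_blocks=2K ssdpool/photoprism
# large-blob dataset (qcow2 files)
zfs set recordsize=1M ssdpool/qcow2
zfs set special_small_blocks=512K ssdpool/qcow2
# verify
zfs get -r recordsize,special_small_blocks ssdpool

Worth remembering that these properties only affect newly written records, which is exactly why the data has to be rewritten afterwards.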

I now set off to use the recommended script, zfs-inplace-rebalancing, which works perfectly. It takes a lot of time; I'm looking at some 2-3 days for my SSD pool's myriad of files. Nonetheless, it is chugging along.

I keep an eye on the fill rate of the special vdev:
Code:
zpool list -v 30 | grep -i "gptid/a9aab6ef-fc67-11ec-8130-00259051e3f2"
gptid/a9aab6ef-fc67-11ec-8130-00259051e3f2     464G  5.65G   458G        -         -     4%  1.21%      -    ONLINE

currently 5.65G used.

This is where I'm at now: about 40% progress of the zfs-inplace-rebalancing script.
Once this completes, I'm at least one step closer to attacking my rust-pool.

Cheers!
 
Last edited:

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Hello,

I'm here for a revisit.
Some additional testing led to a quirk that took a while to emerge.
This data is not sufficiently 'collected' to merit attempting a proper bug report.

Here's what I experienced:
I added (once again) a special vdev to my mirror pool.
Put in data, ran the zfs-inplace-rebalancing script, and found the special metadata device got populated.
- The script is really not feasible for larger amounts of small files, in particular when you're forced to "restart" it. It keeps a nice log (which grew to somewhere around 50MB), and the awk operations that check whether a file has already been handled by the script become super slow. I estimated it would take a couple of WEEKS(!!) before it had even chugged through the logfile; only then would it continue to move files.

At some point I got tired of the experiment and decided to remove the metadata vdev.
I did it like before, in the GUI... At this point the metadata drive was occupied with some 15G.
During the removal, I quickly got the typical "error in python line X" message from the GUI (which I didn't save at the time).
Instantly the metadata device was removed from view in the GUI.
I checked via the CLI - it was still there for a couple of seconds, then it disappeared. (Which could make sense - it'll probably take some time to move the data off.)

I came to realize I had performed this operation with live VMs on the pool.
A scrub later, my pool was degraded: both drives in one mirror vdev (the other mirror was not affected) were showing checksum errors.

To me, that looked like the special vdev data didn't get evacuated properly during the removal.
Upon further investigation, it was revealed that only a few large VM "virtual drives" had been affected - none of the small files/metadata etc. that I was afraid had been lost in the removal.
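For anyone else ending up here: the list of affected files comes from the permanent errors section of (pool name is a placeholder):

Code:
# per-device checksum error counts plus the list of files with permanent errors
zpool status -v ssdpool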

Investigation continued, and I found another problem: the 2nd LSI 9211-8i that had recently moved in. I had not taken note that it was not running the P20 firmware, and it was indeed in IR mode.
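For anyone wanting to check their own cards, the LSI sas2flash utility should show the firmware version and whether the card runs IT or IR firmware - roughly like this, if I remember the syntax right:

Code:
# list all LSI SAS2 HBAs with firmware version (IT/IR should show in the firmware product id)
sas2flash -listall
# details for the first controller
sas2flash -list -c 0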

In retrospect I think this may have been a compounding factor to the issues I experienced.
 