Can someone on Cobia try something for me?

Joined
Oct 22, 2019
Messages
3,641
Prerequisites:
  • You have TrueNAS SCALE Cobia
  • Your ZFS version is 2.2.0 (-rc4 or whatever, doesn't matter)
  • You are comfortable creating a temporary dataset for this test (you will delete the dataset when you're done)


The Test:
  1. Create a new temporary dataset with the default options
  2. Inside its path (e.g., /mnt/tank/temporary/), create a large 4 GB file
    • (You can either use dd with /dev/urandom or copy a large video file)
  3. Create a snapshot for the dataset
  4. Delete the large 4 GB file from the live filesystem
  5. Notice that the dataset still consumes 4 GB because of the snapshot?
  6. Now copy (using "cp") the 4 GB file from the hidden snapshot "directory" back to your live filesystem
    • (What happens if you invoke --reflink=always in your cp command?)
  7. How much space does your dataset now consume? 4 GB or 8 GB?
  8. Bonus: What happens if you destroy the snapshot?
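
For reference, here's roughly what the test looks like as shell commands (a sketch only; the pool and dataset names are examples, so adjust them to your system):
Code:
# Temporary dataset with default options ("tank" is an example pool name)
zfs create tank/temporary

# Write ~4 GB of incompressible data
dd if=/dev/urandom of=/mnt/tank/temporary/bigfile bs=1M count=4096

zfs snapshot tank/temporary@test
rm /mnt/tank/temporary/bigfile
zfs list -o space tank/temporary

# "Restore" the file from the hidden snapshot directory
cp /mnt/tank/temporary/.zfs/snapshot/test/bigfile /mnt/tank/temporary/
zfs list -o space tank/temporary

# Bonus: destroy the snapshot and check again
zfs destroy tank/temporary@test
zfs list -o space tank/temporary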


The reason for this test:

I'm trying to gauge whether block-cloning in OpenZFS 2.2 works with ZFS snapshots, i.e., whether copying a file out of a snapshot will simply reference the existing blocks.

This already works on a dataset's live filesystem.

But what if it also works across a snapshot? That would mean you could restore individual files without consuming additional space. (Currently, when someone restores a single file, there are duplicate records that are referenced separately: the blocks pointed at by the file in the snapshot PLUS the blocks pointed at by the "restored" file in the live filesystem. This is more noticeable for large files.) :frown:

Here is an earlier discussion before block-cloning was available for ZFS:

Can you see why this would be a stellar feature? :cool:



WARNING: Please double and triple check your commands. There's no need to accidentally destroy important data just for this test. If you don't feel comfortable, don't even bother. (I could try this on a VirtualBox VM, but I figured it's a bit excessive just to literally test out a single command.)
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
How much space does your dataset now consume? 4 GB or 8 GB?
9.39 GiB (but the file I used was about 4.7 GiB). And when I try doing cp --reflink=always, I get an error saying cp: failed to clone '<destfile>' from '<srcfile>': Invalid cross-device link.
Bonus: What happens if you destroy the snapshot?
Used space returns to 4.69 GiB.
 
Joined
Oct 22, 2019
Messages
3,641
9.39 GiB. And when I try doing cp --reflink=always, I get an error saying cp: failed to clone '<destfile>' from '<srcfile>': Invalid cross-device link.
That's what I feared. :frown:

Looks like it's not possible. As nice as block-cloning is (and I think it's a fantastic feature that was long-needed in ZFS), it doesn't fully leverage the fact that the same exact blocks are used in a snapshot.

I can't help but think this should be feasible to develop for ZFS. Most of the groundwork is already laid down:
  • Copy-on-write filesystem? Check!
  • Block-cloning (i.e., "reflinking on copy")? Check!
  • Snapshots (which reference the same blocks)? Check!
And yet there's still a hurdle in the way.

@jgreco hinted in the earlier thread that these "snapshot directories" are being treated as separate devices. So while they're convenient for browsing, they don't leverage ZFS's underlying features. Perhaps it is for this reason that we cannot take advantage of block-cloning ("reflinking") from a snapshot "directory"?
 
Joined
Oct 22, 2019
Messages
3,641
Oh, and before anyone says "Yeah but deduplication..."

This is for those of us who don't use (nor want to use) deduplication. :wink:
 
Joined
Oct 22, 2019
Messages
3,641
UPDATE: So it appears this is known and understood, as seen in this comment from a month ago on the OpenZFS GitHub:
Rob N said:
Filesystems and snapshots are different datasets. Or, put another way, they're different mounts, and so different superblocks, and so Linux rejects the request.

I understand that this is frustrating, but no amount of pointing it out is going to make a quick fix happen. I'm aware of four possible solutions (or shapes of solutions):
  • Linux lifting the restriction
  • Adding OpenZFS-specific calls that Linux doesn't know about (and so won't intercept), and then adding support for those to common tools (like cp)
  • Adding zfs clonefile or similar command to do clones directly inside OpenZFS
  • Significantly modify OpenZFS to use the same superblock for all mounts
I've been quietly exploring all of these options for a few weeks now. They are all difficult and/or complicated, for different reasons, and I also have very little time available to look at it. If you've got some other idea, I'm happy to hear it.

What I find interesting is that they keep citing the Linux kernel as the obstacle, yet from what I understand, this shouldn't be exclusive to Linux. If on my TrueNAS Core system ("FreeBSD") the logic is "a snapshot is a 'different' filesystem than its dataset", then it shouldn't matter what the underlying operating system is. (I could be misinterpreting his comment.)

What looks promising is that, regardless of the OS, there's the possibility of developing a ZFS-specific tool to use instead of "cp": one that understands ZFS and has lower-level access to it.

The example given was a theoretical tool called "zfs clonefile". So in theory, you would run the command like so:
Code:
zfs clonefile /mnt/tank/archives/.zfs/snapshot/auto-2022-01-01/MyProject3.tar /mnt/tank/archives/


This would essentially do what I was hoping for: single file restoration without duplicating the blocks and used space. So rather than using "cp" or "cp --reflink=always", you would use a ZFS-specific command of "zfs clonefile".
 
Joined
Oct 22, 2019
Messages
3,641
@Volts, I remember you said you have TrueNAS Core 13.1 nightly, which already has ZFS 2.2.0.

To scratch an itch and rule something out, can you try the above steps on your Core 13.1 system?

(With special emphasis on the "cp --reflink=always" step.)

I'm curious to see if the underlying OS (FreeBSD vs Linux) really matters in this regard.
 
Joined
Oct 22, 2019
Messages
3,641
That's (--reflink) not in the nightly or FreeBSD 13.2 base. I don't have a sense if it has a kernel dependency. I can look later.
Oh boy.

I wonder if it's because ZFS 2.2.0 will only ship by default with FreeBSD 14, and thus we won't have the cp reflink/clone feature in 13.2?

Which means that in order to leverage block-cloning, we need ZFS 2.2 and FreeBSD 14?

After all, it's because ZFS 2.2 is being backported by iXsystems that Core users will get it before the base OS is upgraded to FreeBSD 14.

I'm struggling to find a definitive answer on this, however.


EDIT: @Volts, setting aside the fact that you can't use "--reflink=always" on FreeBSD 13.2, what were the results of the test in general? (Skip the "--reflink=always" step and just use plain old "cp".)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
@jgreco hinted in the earlier thread that these "snapshot directories" are being treated as separate devices.

That's probably not quite correct. My point was that the link count of 1 and the referenced inode being the same in both the live and snapshot copy implies a certain kind of UNIX-ish thinking was involved in the implementation; there was a desire to keep a UNIX-like appearance to the directory tree, even for snapshots. However, as in the topic at hand, this has other opportunities to be ... inconsistent ... and available design choices constrain it.

It may not be a meaningful distinction, but the filesystem designers tackled this by referring to the underlying devices as "logical devices"; see, for example, rename(2):

Code:
     [EXDEV]            The link named by to and the file named by from are on
                        different logical devices (file systems).  Note that
                        this error code will not be returned if the implemen-
                        tation permits cross-device links.


which means that someone else somewhere also understood the nature of this potential problem. If we consider the use case for snapshots, it makes the most sense for a snap to appear as an independent filesystem, especially since it could also be a clone. But it would also be nice if you could at least make a hardlink ON a live filesystem that REFERRED to a snapshot file. That'd be very useful for recovery purposes, sorta like a file-level clone functionality.

Am I making any sort of sense?
 
Joined
Oct 22, 2019
Messages
3,641
Indeed.

That's also why I got (re-)excited upon the news of block-cloning in OpenZFS 2.2. I had assumed there was a chance that copying a file from a snapshot "directory" would automatically invoke some low-level ZFS magic: it would know to use block-cloning to create a "new" file from the very same blocks that already exist (the ones pointed to by the file in the snapshot).

We would no longer need to use hardlinks (which don't work across snapshot -> live anyway), now that OpenZFS fully supports block-cloning (which is basically the same concept as reflinks in XFS and Btrfs).



If we consider the use case for snapshots, it makes the most sense for a snap to appear as an independent filesystem, especially since it could also be a clone. But it would also be nice if you could at least make a hardlink ON a live filesystem that REFERRED to a snapshot file. That'd be very useful for recovery purposes, sorta like a file-level clone functionality.
Regardless of whether the snapshot is presented to the user in a convenient "UNIX-like" manner that makes it appear as an independent filesystem, reading a file in a snapshot "directory" still reads the blocks that comprise it. So surely the ability to do cost-free "file clone recoveries" can't be that far away? After all, a tool simply needs to "speak ZFS" to say "Yeah, we're making a 'new' file, and it's going to be made from these blocks." (With "these blocks" being the blocks that a file in the snapshot already points to.)

From what was commented by Rob N. (not really sure who he is exactly), this could be implemented with a new theoretical command called "zfs clonefile".


As it stands now with OpenZFS 2.2, you can supposedly do this:
  1. "Copy" a very large 4 GB file on the dataset's live filesystem to the same live filesystem, which invokes block-cloning.
  2. Both copies are pointing to the same blocks. No additional space is consumed from copying a 4 GB file.
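
(For the live-filesystem case, that's just an ordinary copy within the same dataset; a minimal sketch with a hypothetical path, assuming a cp with reflink support such as coreutils 9.x:)
Code:
cp --reflink=auto /mnt/tank/media/big.mkv /mnt/tank/media/big-copy.mkv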

What would make ZFS even more awesome is this:
  1. "Copy" a very large 4 GB file from a snapshot to the dataset's live filesystem, which invokes block-cloning.
  2. Both copies are pointing to the same blocks. No additional space is consumed from copying a 4 GB file.
  3. Even after destroying the snapshot, the file would still exist as it is, since the same blocks are being pointed to in the live filesystem.

You know how if you have no snapshots, but then create a snapshot, you don't suddenly duplicate the amount of space consumed? Because it just saves the file's pointers to the blocks on the disk?

Well, think of it like that, but in reverse for a single file (i.e., going from snapshot -> live).
 
Joined
Oct 22, 2019
Messages
3,641
Since FreeBSD 14's release announcement is expected today, I'm going to soon try a test with a live USB of a vanilla FreeBSD 14 + OpenZFS 2.2.0 system.

The reason is that "Linux" is singled out as the culprit that prevents such an action. I cannot find any mention of FreeBSD in this context. (You'll notice OpenZFS has a heavy bias towards Linux in its discussions.)
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
You'll notice OpenZFS has a heavy bias towards Linux in its discussions.
Since that's the primary platform on which it's being developed, that's to be expected, I'd think.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Since FreeBSD 14's release announcement is expected today, I'm going to soon try a test with a live USB of a vanilla FreeBSD 14 + OpenZFS 2.2.0 system.
Code:
root@pi8:~ # freebsd-version
14.0-RC4
root@pi8:~ # zfs create zfs/test
root@pi8:~ # zfs set compression=off zfs/test
root@pi8:~ # dd if=/dev/zero of=/zfs/test/4G bs=1m count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 170.879427 secs (25134490 bytes/sec)
root@pi8:~ # zfs snap zfs/test@now
root@pi8:~ # rm /zfs/test/4G
root@pi8:~ # zfs list zfs/test
NAME       USED  AVAIL  REFER  MOUNTPOINT
zfs/test  4.00G   887G    96K  /zfs/test
root@pi8:~ # cp /zfs/test/.zfs/snapshot/now/4G /zfs/test/
root@pi8:~ # zfs list zfs/test
NAME       USED  AVAIL  REFER  MOUNTPOINT
zfs/test  8.00G   883G  4.00G  /zfs/test
root@pi8:~ # rm /zfs/test/4G
root@pi8:~ # zfs list zfs/test
NAME       USED  AVAIL  REFER  MOUNTPOINT
zfs/test  4.00G   887G    96K  /zfs/test
root@pi8:~ # ln /zfs/test/.zfs/snapshot/now/4G /zfs/test/
ln: /zfs/test//4G: Cross-device link
 
Joined
Oct 22, 2019
Messages
3,641
:frown:

Sigh. So it appears that block-cloning in OpenZFS 2.2 is only useful for files on a dataset's live filesystem. (Which is still great.)

I guess this means that regardless of Linux or FreeBSD, we'll need that "theoretical" ZFS command after all, if we want to "clone copy" a file from a snapshot (i.e., cost-free single-file recovery).

In such a world we would have:
zfs clonefile /zfs/test/.zfs/snapshot/now/4G /zfs/test/
 

nickspacemonkey

Dabbler
Joined
Jan 13, 2022
Messages
22
Block cloning turned out to be a bit of a disappointment to me also. I was hoping I could make a copy of my music library and edit the metadata tags while keeping the originals intact. It appears that ZFS is making a new copy on write instead of just writing the new metadata.
 
Joined
Oct 22, 2019
Messages
3,641
Holy cannoli! We might be in luck.

I made three crucial mistakes based on three flawed assumptions:
  1. Apparently, this does not work "across datasets" with ZFS encryption (not yet, but maybe in the future). Unfortunately, this means that cost-free file recovery from a snapshot doesn't work with encrypted datasets. :frown: (This is a huge issue for me, since I use native ZFS encryption.) It's apparently "in the works" to support this when the same master key is used (which is the case for snapshots of the dataset itself).
  2. For Linux systems, you're meant to use "cp --reflink=auto". Do not use "--reflink=always" or "--reflink=never". With coreutils 9.0+, it works seamlessly with ZFS. (In fact, "auto" is the default behavior, which works with ZFS filesystems.)
  3. We were reading the wrong metrics! (See below.)


So I tried this out on an Arch Linux system with Linux Kernel 6.1 and OpenZFS 2.2.0. Guess what? It works! You can actually recover files from a snapshot without any added space! :cool:

The "copy" is instant, and it saves space in the same way that deduplication does: no extra data blocks are written. From. A. Snapshot!

For bullet-point #3 above, we were meant to look at the pool properties bcloneused, bclonesaved, and bcloneratio.

Due to the way hardlinks and reflinks work on XFS, I was under the assumption that block-cloning was limited to each dataset (i.e., each separate "filesystem").



What made me scramble to test this out further and realize my mistakes was a clue dropped by @HoneyBadger in another thread.

What he said only made sense if this was a pool-wide property, rather than my mistaken assumption that it's a "per-dataset" property. (I was still in the mindset of traditional cross-filesystem limitations.)



So now I can confidently say that cost-free file recovery is possible, right here and now, starting with OpenZFS 2.2.0. (On Linux-based systems, at least.) Of course, with the MAJOR caveat: this doesn't work with native ZFS encryption (yet). :confused:



To try this yourself, make sure to invoke "cp" without any flags or with "--reflink=auto". Make sure not to use "--reflink=never" or "--reflink=always".

"Copy" a large file from a snapshot to the live filesystem.

Do not check any dataset or filesystem properties. Instead, check the pool properties: bcloneused, bclonesaved, and bcloneratio.
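
Put together, it's roughly this (reusing the hypothetical dataset and snapshot names from the sketch near the top of the thread):
Code:
cp --reflink=auto /mnt/tank/temporary/.zfs/snapshot/test/bigfile /mnt/tank/temporary/
zpool get bcloneused,bclonesaved,bcloneratio tank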


Here is a discussion that demonstrates how it can be really confusing to understand how much space your datasets and snapshots actually consume if you're using block-cloning:

So based on the above, it's theoretically possible that a dataset can report "used" space that exceeds the pool's capacity. :tongue:



EDIT: @Patrick M. Hausen, this means your test with FreeBSD 14.0-RC4 may in fact have been a success! Your pool's bcloneused, bclonesaved, and bcloneratio properties might reveal as much?

EDIT 2: @HoneyBadger, had you jumped into this thread, you might have spared me from looking like a fool at the very start. :wink: Don't worry, I'll keep the memery to a minimum in here.

EDIT 3: (Because two isn't enough.) My brain is too fried to figure out the implications for ZFS replications if you're using cloned blocks across datasets, especially if the destination pool doesn't support block-cloning. (Among other confusions.)
 
Joined
Oct 22, 2019
Messages
3,641
I was hoping I could make a copy of my music library and edit the metadata tags while keeping the originals intact. It appears that ZFS is making a new copy on write instead of just writing the new metadata.
Two questions in regards to this:
  1. Are you basing this on the above "results"? If so, I made a critical mistake by looking at the wrong ZFS properties. (See my post directly above this reply.)
  2. This could be a case of the software you're using, not ZFS itself. It's possible that the metadata-editing software does its own "copy-on-write" when modifying a file. Archiving/compression software behaves in much the same way, unaware of the ZFS layer below. This is also true of rsync, which by default writes a temporary copy of the modified file and then renames it under the hood to make it "appear" like an in-place operation. (At least with rsync, they provide an option called "--inplace" which forces an in-place modification, which is friendly for ZFS and other CoW filesystems. See the sketch right after this list.)
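
For reference, a CoW-friendly rsync invocation might look like the following (the source and destination paths are hypothetical):
Code:
# --inplace rewrites changed blocks within the existing destination file
# instead of writing a whole new temporary copy and renaming it over the original
rsync -a --inplace --no-whole-file /home/user/Mail/ /mnt/tank/backups/Mail/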

As an example of software "not playing nice" with ZFS:
If I have a .zip file and I simply modify its metadata "comment" in-place, it actually creates a copy of the file under the hood with the updated metadata "comment". So if the old version of this .zip file exists in a snapshot, it does not share any blocks with the "modified" .zip file on the live filesystem. :frown:

So it's possible that your music metadata-editor is making extraneous copies of your "cloned" files, which defeats the purpose of leveraging block-cloning.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
It's possible that the metadata-editing software does its own "copy-on-write" when modifying a file.

This is pretty likely. It's the safest option and default for *many* editors. Write out a complete file, then move it into place.

Some taggers do allow in-place file edits *if possible*. But media files and metadata can be structured in a bunch of ways. Sometimes tags are stored at the end of the file, so only the last blocks need to be rewritten. Sometimes tags can be updated in-place if they are the same length or if padding is an option. But sometimes changing a tag requires rewriting the whole file.

(In a world where block deduplication & block cloning were very popular, it would be an interesting thought experiment to structure saved files to maximize the chance of duplicate blocks.)

At least with rsync, they provide an option called "--inplace" which forces an in-place modification, which is friendly for ZFS and other CoW filesystems.)

That doesn't bypass ZFS to force a non-CoW overwrite of the affected blocks though. That just changes rsync from "make a whole new file and safely/atomically swap it in place" behavior to riskier/faster "overwrite some blocks in the file" behavior. ZFS still does CoW for the rewritten blocks.

rsync devs have been talking about & working on reflink support, which would be *amazing*:
 
Joined
Oct 22, 2019
Messages
3,641
riskier/faster "overwrite some blocks in the file" behavior. ZFS still does CoW for the rewritten blocks.
I'd rather consume an additional few MiB than the 1 GiB+ of a mail inbox each time a snapshot is created. :wink:

As far as "riskier". It is in theory, but I put this through the test. Suddenly aborting. Suddenly yanking out the ethernet cable. Suddenly killing the power to the client PC. No matter what I tried, the next run of rsync ran smoothly, and the resulting file's SHA256 checksum matched the source file. (Even though the file was corrupt from the previous aborted run.)

Hence, I always use "--inplace" for ZFS destinations.


rsync devs have been talking about & working on reflink support, which would be *amazing*:
My brain hurts enough as it is. I'm still trying to wrap my head around this dark magic of block-cloning. o_O

(But that would be a slick feature for rsync.)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
My brain hurts enough as it is. I'm still trying to wrap my head around this dark magic of block-cloning. o_O
Think of it like "snapshots at a file-level granularity" - all of the blocks are there, and you just let ZFS copy-on-write the changes to new records.
 