Can someone on Cobia try something for me?

Joined
Oct 22, 2019
Messages
3,641
Prerequisites:
  • You have TrueNAS SCALE Cobia
  • Your ZFS version is 2.2.0 (-rc4 or whatever, doesn't matter)
  • You are comfortable creating a temporary dataset for this test (you will delete the dataset when you're done)


The Test:
  1. Create a new temporary dataset with the default options
  2. Inside its path (e.g., /mnt/tank/temporary/), create a large 4 GB file
    • (You can either use dd with /dev/urandom or copy a large video file)
  3. Create a snapshot for the dataset
  4. Delete the large 4 GB file from the live filesystem
  5. Notice that the dataset still consumes 4 GB because of the snapshot?
  6. Now copy (using "cp") the 4 GB file from the hidden snapshot "directory" back to your live filesystem
    • (What happens if you invoke --reflink=always in your cp command?)
  7. How much space does your dataset now consume? 4 GB or 8 GB?
  8. Bonus: What happens if you destroy the snapshot?
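
For reference, here's roughly what the test looks like as shell commands (a sketch only; the pool and dataset names are examples, so adjust them to your system):
Code:
# Temporary dataset with default options ("tank" is an example pool name)
zfs create tank/temporary

# Write ~4 GB of incompressible data
dd if=/dev/urandom of=/mnt/tank/temporary/bigfile bs=1M count=4096

zfs snapshot tank/temporary@test
rm /mnt/tank/temporary/bigfile
zfs list -o space tank/temporary

# "Restore" the file from the hidden snapshot directory
cp /mnt/tank/temporary/.zfs/snapshot/test/bigfile /mnt/tank/temporary/
zfs list -o space tank/temporary

# Bonus: destroy the snapshot and check again
zfs destroy tank/temporary@test
zfs list -o space tank/temporary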


The reason for this test:

I'm trying to gauge whether block-cloning in OpenZFS 2.2 works with ZFS snapshots, i.e., whether copying a file out of a snapshot will simply reference the existing blocks.

This already works on a dataset's live filesystem.

But what if it also works across a snapshot? That would mean you could restore individual files without consuming additional space. (Currently, when someone restores a single file, there are duplicate records that are referenced separately: the blocks pointed at by the file in the snapshot PLUS the blocks pointed at by the "restored" file in the live filesystem. This is more noticeable for large files.) :frown:

Here is an earlier discussion before block-cloning was available for ZFS:

Can you see why this would be a stellar feature? :cool:



WARNING: Please double and triple check your commands. There's no need to accidentally destroy important data just for this test. If you don't feel comfortable, don't even bother. (I could try this on a VirtualBox VM, but I figured it's a bit excessive just to literally test out a single command.)
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
How much space does your dataset now consume? 4 GB or 8 GB?
9.39 GiB (but the file I used was about 4.7 GiB). And when I try doing cp --reflink=always, I get an error saying cp: failed to clone '<destfile>' from '<srcfile>': Invalid cross-device link.
Bonus: What happens if you destroy the snapshot?
Used space returns to 4.69 GiB.
 
Joined
Oct 22, 2019
Messages
3,641
9.39 GiB. And when I try doing cp --reflink=always, I get an error saying cp: failed to clone '<destfile>' from '<srcfile>': Invalid cross-device link.
That's what I feared. :frown:

Looks like it's not possible. As nice as block-cloning is (and I think it's a fantastic feature that was long-needed in ZFS), it doesn't fully leverage the fact that the same exact blocks are used in a snapshot.

I can't help but think this should be feasible to develop for ZFS. Most of the groundwork is already laid down:
  • Copy-on-write filesystem? Check!
  • Block-cloning (i.e., "reflinking on copy")? Check!
  • Snapshots (which reference the same blocks)? Check!
And yet there's still a hurdle in the way.

@jgreco hinted in the earlier thread that these "snapshot directories" are being treated as separate devices. So while they're convenient for browsing, they don't leverage ZFS's underlying features. Perhaps it is for this reason that we cannot take advantage of block-cloning ("reflinking") from a snapshot "directory"?
 
Joined
Oct 22, 2019
Messages
3,641
Oh, and before anyone says "Yeah but deduplication..."

This is for those of us who don't use (nor want to use) deduplication. :wink:
 
Joined
Oct 22, 2019
Messages
3,641
UPDATE: So it appears this is known and understood, as seen in this comment from a month ago on the OpenZFS GitHub:
Rob N said:
Filesystems and snapshots are different datasets. Or, put another way, they're different mounts, and so different superblocks, and so Linux rejects the request.

I understand that this is frustrating, but no amount of pointing it out is going to make a quick fix happen. I'm aware of four possible solutions (or shapes of solutions):
  • Linux lifting the restriction
  • Adding OpenZFS-specific calls that Linux doesn't know about (and so won't intercept), and then adding support for those to common tools (like cp)
  • Adding zfs clonefile or similar command to do clones directly inside OpenZFS
  • Significantly modify OpenZFS to use the same superblock for all mounts
I've been quietly exploring all of these options for a few weeks now. They are all difficult and/or complicated, for different reasons, and I also have very little time available to look at it. If you've got some other idea, I'm happy to hear it.

What I find interesting is that they keep citing the Linux kernel as the obstacle, yet from what I understand, this shouldn't be exclusive to Linux. If on my TrueNAS Core system ("FreeBSD") the logic is "a snapshot is a 'different' filesystem than its dataset", then it shouldn't matter what the underlying operating system is. (I could be misinterpreting his comment.)

What looks promising is that, regardless of the OS, there's the possibility of developing a ZFS-specific tool to use instead of "cp": one that understands ZFS and has lower-level access to it.

The example given was a theoretical tool called "zfs clonefile". So in theory, you would run the command like so:
Code:
zfs clonefile /mnt/tank/archives/.zfs/snapshot/auto-2022-01-01/MyProject3.tar /mnt/tank/archives/


This would essentially do what I was hoping for: single file restoration without duplicating the blocks and used space. So rather than using "cp" or "cp --reflink=always", you would use a ZFS-specific command of "zfs clonefile".
 
Joined
Oct 22, 2019
Messages
3,641
@Volts, I remember you said you have TrueNAS Core 13.1 nightly, which already has ZFS 2.2.0.

To scratch an itch and rule something out, can you try the above steps on your Core 13.1 system?

(With special emphasis on the "cp --reflink=always" step.)

I'm curious to see if the underlying OS (FreeBSD vs Linux) really matters in this regard.
 
Joined
Oct 22, 2019
Messages
3,641
That's (--reflink) not in the nightly or FreeBSD 13.2 base. I don't have a sense if it has a kernel dependency. I can look later.
Oh boy.

I wonder if it's because ZFS 2.2.0 will only ship by default with FreeBSD 14, and thus we won't have the cp reflink/clone feature in 13.2?

Which means that in order to leverage block-cloning, we need ZFS 2.2 and FreeBSD 14?

After all, it's because ZFS 2.2 is being backported by iXsystems that Core users will get it before the base OS is upgraded to FreeBSD 14.

I'm struggling to find a definitive answer on this, however.


EDIT: @Volts, setting aside the fact that you can't use "--reflink=always" on FreeBSD 13.2, what were the results of the test in general? (Skip the "--reflink=always" step and just use plain old "cp".)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
@jgreco hinted in the earlier thread that these "snapshot directories" are being treated as separate devices.

That's probably not quite correct. My point was that the link count of 1 and the referenced inode being the same in both the live and snapshot copy implies a certain kind of UNIX-ish thinking was involved in the implementation; there was a desire to keep a UNIX-like appearance to the directory tree, even for snapshots. However, as in the topic at hand, this has other opportunities to be ... inconsistent ... and available design choices constrain it.

It may not be a meaningful distinction, but the filesystem designers tackled this by referring to the underlying devices as "logical devices"; see, for example, rename(2):

Code:
     [EXDEV]            The link named by to and the file named by from are on
                        different logical devices (file systems).  Note that
                        this error code will not be returned if the implemen-
                        tation permits cross-device links.


which means that someone else somewhere also understood the nature of this potential problem. If we consider the use case for snapshots, it makes the most sense for a snap to appear as an independent filesystem, especially since it could also be a clone. But it would also be nice if you could at least make a hardlink ON a live filesystem that REFERRED to a snapshot file. That'd be very useful for recovery purposes, sorta like a file-level clone functionality.

Am I making any sort of sense?
 
Joined
Oct 22, 2019
Messages
3,641
Indeed.

That's also why I got (re-)excited upon the news of block-cloning in OpenZFS 2.2. I had assumed there was a chance that copying a file from a snapshot "directory" would automatically invoke some low-level ZFS magic: it would know to use block-cloning to create a "new" file from the very same blocks that already exist (the ones pointed to by the file in the snapshot).

We would no longer need to use hardlinks (which don't work across snapshot -> live anyway), now that OpenZFS fully supports block-cloning (which is basically the same concept as reflinks in XFS and Btrfs).



If we consider the use case for snapshots, it makes the most sense for a snap to appear as an independent filesystem, especially since it could also be a clone. But it would also be nice if you could at least make a hardlink ON a live filesystem that REFERRED to a snapshot file. That'd be very useful for recovery purposes, sorta like a file-level clone functionality.
Regardless of whether the snapshot is presented to the user in a convenient "UNIX-like" manner that makes it appear as an independent filesystem, reading a file in a snapshot "directory" still reads the blocks that comprise it. So surely the ability to do cost-free "file clone recoveries" can't be that far away? After all, a tool simply needs to "speak ZFS" to say "Yeah, we're making a 'new' file, and it's going to be made from these blocks." (With "these blocks" being the blocks that a file in the snapshot already points to.)

From what was commented by Rob N. (not really sure who he is exactly), this could be implemented with a new theoretical command called "zfs clonefile".


As it stands now with OpenZFS 2.2, you can supposedly do this:
  1. "Copy" a very large 4 GB file on the dataset's live filesystem to the same live filesystem, which invokes block-cloning.
  2. Both copies are pointing to the same blocks. No additional space is consumed from copying a 4 GB file.
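
(For the live-filesystem case, that's just an ordinary copy within the same dataset; a minimal sketch with a hypothetical path, assuming a cp with reflink support such as coreutils 9.x:)
Code:
cp --reflink=auto /mnt/tank/media/big.mkv /mnt/tank/media/big-copy.mkv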

What would make ZFS even more awesome is this:
  1. "Copy" a very large 4 GB file from a snapshot to the dataset's live filesystem, which invokes block-cloning.
  2. Both copies are pointing to the same blocks. No additional space is consumed from copying a 4 GB file.
  3. Even after destroying the snapshot, the file would still exist as it is, since the same blocks are being pointed to in the live filesystem.

You know how if you have no snapshots, but then create a snapshot, you don't suddenly duplicate the amount of space consumed? Because it just saves the file's pointers to the blocks on the disk?

Well, think of it like that, but in reverse for a single file (i.e., going from snapshot -> live).
 
Joined
Oct 22, 2019
Messages
3,641
Since FreeBSD 14's release announcement is expected today, I'm going to soon try a test with a live USB of a vanilla FreeBSD 14 + OpenZFS 2.2.0 system.

The reason is that "Linux" is singled out as the culprit that prevents such an action. I cannot find any mention of FreeBSD in this context. (You'll notice OpenZFS has a heavy bias towards Linux in its discussions.)
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
You'll notice OpenZFS has a heavy bias towards Linux in its discussions.
Since that's the primary platform on which it's being developed, that's to be expected, I'd think.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Since FreeBSD 14's release announcement is expected today, I'm going to soon try a test with a live USB of a vanilla FreeBSD 14 + OpenZFS 2.2.0 system.
Code:
root@pi8:~ # freebsd-version
14.0-RC4
root@pi8:~ # zfs create zfs/test
root@pi8:~ # zfs set compression=off zfs/test
root@pi8:~ # dd if=/dev/zero of=/zfs/test/4G bs=1m count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 170.879427 secs (25134490 bytes/sec)
root@pi8:~ # zfs snap zfs/test@now
root@pi8:~ # rm /zfs/test/4G
root@pi8:~ # zfs list zfs/test
NAME       USED  AVAIL  REFER  MOUNTPOINT
zfs/test  4.00G   887G    96K  /zfs/test
root@pi8:~ # cp /zfs/test/.zfs/snapshot/now/4G /zfs/test/
root@pi8:~ # zfs list zfs/test
NAME       USED  AVAIL  REFER  MOUNTPOINT
zfs/test  8.00G   883G  4.00G  /zfs/test
root@pi8:~ # rm /zfs/test/4G
root@pi8:~ # zfs list zfs/test
NAME       USED  AVAIL  REFER  MOUNTPOINT
zfs/test  4.00G   887G    96K  /zfs/test
root@pi8:~ # ln /zfs/test/.zfs/snapshot/now/4G /zfs/test/
ln: /zfs/test//4G: Cross-device link
 
Joined
Oct 22, 2019
Messages
3,641
:frown:

Sigh. So it appears that block-cloning in OpenZFS 2.2 is only useful for files on a dataset's live filesystem. (Which is still great.)

I guess this means that regardless of Linux or FreeBSD, we'll need that "theoretical" ZFS command after all, if we want to "clone copy" a file from a snapshot (i.e., cost-free single-file recovery).

In such a world we would have:
zfs clonefile /zfs/test/.zfs/snapshot/now/4G /zfs/test/
 

nickspacemonkey

Dabbler
Joined
Jan 13, 2022
Messages
22
Block cloning turned out to be a bit of a disappointment to me also. I was hoping I could make a copy of my music library and edit the metadata tags while keeping the originals intact. It appears that ZFS is making a new copy on write instead of just writing the new metadata.
 
Joined
Oct 22, 2019
Messages
3,641
Holy cannoli! We might be in luck.

I made three crucial mistakes based on three flawed assumptions:
  1. Apparently, this does not work "across datasets" with ZFS encryption (not yet, but maybe in the future). Unfortunately, this means that cost-free file recovery from a snapshot doesn't work with encrypted datasets. :frown: (This is a huge issue for me, since I use native ZFS encryption.) It's apparently "in the works" to support this when the same master key is used (which is the case for snapshots of the dataset itself).
  2. For Linux systems, you're meant to use "cp --reflink=auto". Do not use "--reflink=always" or "--reflink=never". With coreutils 9.0+, it works seamlessly with ZFS. (In fact, "auto" is the default behavior, which works with ZFS filesystems.)
  3. We were reading the wrong metrics! (See below.)


So I tried this out on an Arch Linux system with Linux Kernel 6.1 and OpenZFS 2.2.0. Guess what? It works! You can actually recover files from a snapshot without any added space! :cool:

The "copy" is instant, and it saves space in the same way that deduplication does: no extra data blocks are written. From. A. Snapshot!

For bullet-point #3 above, we were meant to look at the pool properties bcloneused, bclonesaved, and bcloneratio.

Due to the way hardlinks and reflinks work on XFS, I was under the assumption that block-cloning was limited to each dataset (i.e., each separate "filesystem").



What made me scramble to test this out further and realize my mistakes was a clue dropped by @HoneyBadger in another thread.

What he said only made sense if this was a pool-wide property, rather than my mistaken assumption that it's a "per-dataset" property. (I was still in the mindset of traditional cross-filesystem limitations.)



So now I can confidently say that cost-free file recovery is possible, right here and now, starting with OpenZFS 2.2.0. (On Linux-based systems, at least.) Of course, with the MAJOR caveat: this doesn't work with native ZFS encryption (yet). :confused:



To try this yourself, make sure to invoke "cp" without any flags or with "--reflink=auto". Make sure not to use "--reflink=never" or "--reflink=always".

"Copy" a large file from a snapshot to the live filesystem.

Do not check any dataset or filesystem properties. Instead, check the pool properties: bcloneused, bclonesaved, and bcloneratio.
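
Put together, it's roughly this (reusing the hypothetical dataset and snapshot names from the sketch near the top of the thread):
Code:
cp --reflink=auto /mnt/tank/temporary/.zfs/snapshot/test/bigfile /mnt/tank/temporary/
zpool get bcloneused,bclonesaved,bcloneratio tank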


Here is a discussion that demonstrates how it can be really confusing to understand how much space your datasets and snapshots actually consume if you're using block-cloning:

So based on the above, it's theoretically possible that a dataset can report "used" space that exceeds the pool's capacity. :tongue:



EDIT: @Patrick M. Hausen, this means your test with FreeBSD 14.0-RC4 may in fact have been a success! Your pool's bcloneused, bclonesaved, and bcloneratio properties might reveal as much?

EDIT 2: @HoneyBadger, had you jumped into this thread, you might have spared me from looking like a fool at the very start. :wink: Don't worry, I'll keep the memery to a minimum in here.

EDIT 3: (Because two isn't enough.) My brain is too fried to figure out the implications for ZFS replications if you're using cloned blocks across datasets, especially if the destination pool doesn't support block-cloning. (Among other confusions.)
 
Joined
Oct 22, 2019
Messages
3,641
I was hoping I could make a copy of my music library and edit the metadata tags while keeping the originals intact. It appears that ZFS is making a new copy on write instead of just writing the new metadata.
Two questions in regards to this:
  1. Are you basing this on the above "results"? If so, I made a critical mistake by looking at the wrong ZFS properties. (See my post directly above this reply.)
  2. This could be a case of the software you're using, not ZFS itself. It's possible that the metadata-editing software does its own "copy-on-write" when modifying a file. Archiving/compression software behaves in much the same way, unaware of the ZFS layer below. This is also true of rsync, which by default writes a temporary copy of the modified file and then renames it under the hood to make it "appear" like an in-place operation. (At least with rsync, they provide an option called "--inplace" which forces an in-place modification, which is friendly for ZFS and other CoW filesystems. See the sketch right after this list.)
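
For reference, a CoW-friendly rsync invocation might look like the following (the source and destination paths are hypothetical):
Code:
# --inplace rewrites changed blocks within the existing destination file
# instead of writing a whole new temporary copy and renaming it over the original
rsync -a --inplace --no-whole-file /home/user/Mail/ /mnt/tank/backups/Mail/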

As an example of software "not playing nice" with ZFS:
If I have a .zip file and I simply modify its metadata "comment" in-place, it actually creates a copy of the file under the hood with the updated metadata "comment". So if the old version of this .zip file exists in a snapshot, it does not share any blocks with the "modified" .zip file on the live filesystem. :frown:

So it's possible that your music metadata-editor is making extraneous copies of your "cloned" files, which defeats the purpose of leveraging block-cloning.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
It's possible that the metadata-editing software does its own "copy-on-write" when modifying a file.

This is pretty likely. It's the safest option and default for *many* editors. Write out a complete file, then move it into place.

Some taggers do allow in-place file edits *if possible*. But media files and metadata can be structured in a bunch of ways. Sometimes tags are stored at the end of the file, so only the last blocks need to be rewritten. Sometimes tags can be updated in-place if they are the same length or if padding is an option. But sometimes changing a tag requires rewriting the whole file.

(In a world where block deduplication & block cloning were very popular, it would be an interesting thought experiment to structure saved files to maximize the chance of duplicate blocks.)

At least with rsync, they provide an option called "--inplace" which forces an in-place modification, which is friendly for ZFS and other CoW filesystems.)

That doesn't bypass ZFS to force a non-CoW overwrite of the affected blocks though. That just changes rsync from "make a whole new file and safely/atomically swap it in place" behavior to riskier/faster "overwrite some blocks in the file" behavior. ZFS still does CoW for the rewritten blocks.

rsync devs have been talking about & working on reflink support, which would be *amazing*:
 
Joined
Oct 22, 2019
Messages
3,641
riskier/faster "overwrite some blocks in the file" behavior. ZFS still does CoW for the rewritten blocks.
I'd rather consume an additional few MiB than the 1 GiB+ of a mail inbox each time a snapshot is created. :wink:

As far as "riskier". It is in theory, but I put this through the test. Suddenly aborting. Suddenly yanking out the ethernet cable. Suddenly killing the power to the client PC. No matter what I tried, the next run of rsync ran smoothly, and the resulting file's SHA256 checksum matched the source file. (Even though the file was corrupt from the previous aborted run.)

Hence, I always use "--inplace" for ZFS destinations.


rsync devs have been talking about & working on reflink support, which would be *amazing*:
My brain hurts enough as it is. I'm still trying to wrap my head around this dark magic of block-cloning. o_O

(But that would be a slick feature for rsync.)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
My brain hurts enough as it is. I'm still trying to wrap my head around this dark magic of block-cloning. o_O
Think of it like "snapshots at a file-level granularity" - all of the blocks are there, and you just let ZFS copy-on-write the changes to new records.
 