NFS does *not* support server-side copy with ZFS? :(

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
For some reason "cp" (from coreutils 9.1) inspects the filetype before the actual copy operation.

For certain filetypes it will use the slow round-trip over the network with read/write (I could reproduce this in 100% of my tests with .mp4 videos.)
Say what?

That is so bizarre I had to look into the source, but there's no smoking gun. What the hell?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,398
It's in the source, but it's not obvious to those who aren't conversant with the low-level details of file systems. In particular, look at copy.c:

Code:
/* Copy the regular file open on SRC_FD/SRC_NAME to DST_FD/DST_NAME,
   honoring the MAKE_HOLES setting and using the BUF_SIZE-byte buffer
   *ABUF for temporary storage, allocating it lazily if *ABUF is null.
   Copy no more than MAX_N_READ bytes.
   Return true upon successful completion;
   print a diagnostic and return false upon error.
   Note that for best results, BUF should be "well"-aligned.
   Set *LAST_WRITE_MADE_HOLE to true if the final operation on
   DEST_FD introduced a hole.  Set *TOTAL_N_READ to the number of
   bytes read.  */
static bool
sparse_copy (int src_fd, int dest_fd, char **abuf, size_t buf_size,
             size_t hole_size, bool punch_holes, bool allow_reflink,
             char const *src_name, char const *dst_name,
             uintmax_t max_n_read, off_t *total_n_read,
             bool *last_write_made_hole)
{
  *last_write_made_hole = false;
  *total_n_read = 0;

  /* If not looking for holes, use copy_file_range if functional,
     but don't use if reflink disallowed as that may be implicit.  */
  if (!hole_size && allow_reflink)
    while (max_n_read)
      {
        /* Copy at most COPY_MAX bytes at a time; this is min
           (SSIZE_MAX, SIZE_MAX) truncated to a value that is
           surely aligned well.  */
        ssize_t copy_max = MIN (SSIZE_MAX, SIZE_MAX) >> 30 << 30;
        ssize_t n_copied = copy_file_range (src_fd, NULL, dest_fd, NULL,
                                            MIN (max_n_read, copy_max), 0);
        if (n_copied == 0)
          {
            /* copy_file_range incorrectly returns 0 when reading from
               the proc file system on the Linux kernel through at
               least 5.6.19 (2020), so fall back on 'read' if the
               input file seems empty.  */
            if (*total_n_read == 0)
              break;
            return true;
          }
        if (n_copied < 0)
          {
            if (errno == ENOSYS || is_ENOTSUP (errno)
                || errno == EINVAL || errno == EBADF
                || errno == EXDEV || errno == ETXTBSY)
              break;

            /* copy_file_range might not be enabled in seccomp filters,
               so retry with a standard copy.  EPERM can also occur
               for immutable files, but that would only be in the edge case
               where the file is made immutable after creating/truncating,
               in which case the (more accurate) error is still shown.  */
            if (errno == EPERM && *total_n_read == 0)
              break;

            if (errno == EINTR)
              n_copied = 0;
            else
              {
                error (0, errno, _("error copying %s to %s"),
                       quoteaf_n (0, src_name), quoteaf_n (1, dst_name));
                return false;
              }
          }
        max_n_read -= n_copied;
        *total_n_read += n_copied;
      }

  bool make_hole = false;
  off_t psize = 0;

  while (max_n_read)
    {
      if (!*abuf)
        *abuf = xalignalloc (getpagesize (), buf_size);
      char *buf = *abuf;
      ssize_t n_read = read (src_fd, buf, MIN (max_n_read, buf_size));
      if (n_read < 0)
        {
          if (errno == EINTR)
            continue;
          error (0, errno, _("error reading %s"), quoteaf (src_name));
          return false;
        }
      if (n_read == 0)
        break;
      max_n_read -= n_read;
      *total_n_read += n_read;

      /* Loop over the input buffer in chunks of hole_size.  */
      size_t csize = hole_size ? hole_size : buf_size;
      char *cbuf = buf;
      char *pbuf = buf;

      while (n_read)
        {
          bool prev_hole = make_hole;
          csize = MIN (csize, n_read);

          if (hole_size && csize)
            make_hole = is_nul (cbuf, csize);

          bool transition = (make_hole != prev_hole) && psize;
          bool last_chunk = (n_read == csize && ! make_hole) || ! csize;

          if (transition || last_chunk)
            {
              if (! transition)
                psize += csize;

              if (! prev_hole)
                {
                  if (full_write (dest_fd, pbuf, psize) != psize)
                    {
                      error (0, errno, _("error writing %s"),
                             quoteaf (dst_name));
                      return false;
                    }
                }
              else
                {
                  if (! create_hole (dest_fd, dst_name, punch_holes, psize))
                    return false;
                }

              pbuf = cbuf;
              psize = csize;

              if (last_chunk)
                {
                  if (! csize)
                    n_read = 0; /* Finished processing buffer.  */

                  if (transition)
                    csize = 0;  /* Loop again to deal with last chunk.  */
                  else
                    psize = 0;  /* Reset for next read loop.  */
                }
            }
          else  /* Coalesce writes/seeks.  */
            {
              if (INT_ADD_WRAPV (psize, csize, &psize))
                {
                  error (0, 0, _("overflow reading %s"), quoteaf (src_name));
                  return false;
                }
            }

          n_read -= csize;
          cbuf += csize;
        }

      *last_write_made_hole = make_hole;

      /* It's tempting to break early here upon a short read from
         a regular file.  That would save the final read syscall
         for each file.  Unfortunately that doesn't work for
         certain files in /proc or /sys with linux kernels.  */
    }

  /* Ensure a trailing hole is created, so that subsequent
     calls of sparse_copy() start at the correct offset.  */
  if (make_hole && ! create_hole (dest_fd, dst_name, punch_holes, psize))
    return false;
  else
    return true;
}


If the file being copied doesn't look like a sparse file, i.e. no apparent holes in the on-disk format, then it's eligible for copy_file_range and fast server-side copies. If cp decides the file looks sparse, it falls back to the slow read/write round-trip through the client, so it can detect runs of zeroes and recreate the holes itself.

This is probably to preserve sparseness in the destination, since copy_file_range can expand holes. As coreutils 9.x is the first release to use copy_file_range, this check may well become less stringent over time.
 

Joined
Oct 22, 2019
Messages
3,584
Say what?

That is so bizarre I had to look into the source, but there's no smoking gun. What the hell?
I'm equally confused. :frown: Always reproducible on my end.

Here are two examples on a TrueNAS NFS share mounted with default options, using NFS version 4.2.


This first example is an actual tarball (archive.tar.gz). Not only is the file extension correct, but it's also a legitimate compressed tar file.

Notice it uses copy_file_range? It's also VERY FAST.


This second example is an h264 MP4 video file which I renamed to "video.tar.gz", so that I could rule out the file's extension coming into play.

Notice it uses read/write? (The actual logfile is FLOODED with a billion read/write lines.) It's VERY SLOW.


How else is this possible except for "cp" (or a related call) determining the file type? (It appears to happen consistently with any MP4 video file I test this with. It's not limited to a specific video file.)


I'm completely open to "doing something wrong", but my poor little, non-developer, non-engineer brain cannot comprehend this.



UPDATE: Just read @Samuel Tai's reply above. So does this hint that it has something to do with how the file was initially created and stored on the disk, and not the file itself, per se? This is going way beyond my wheelhouse...

I'll try with --reflink again, but last time it blurted out an error that reflinks are "not supported" (possibly because ZFS does not support them.)
 
Last edited:
Joined
Oct 22, 2019
Messages
3,584
@winnielinnie, did you try cp --reflink in your traces?

"Operation not supported"

From what I read, this is because ZFS does not support reflinks (nor should it, due to implementing its own CoW)

Code:
write(2, "cp: ", 4)                     = 4
write(2, "failed to clone '2.pdf' from '1."..., 36) = 36
write(2, ": Operation not supported", 25) = 25
write(2, "\n", 1)                       = 1
close(4)                                = 0
close(3)                                = 0
lseek(0, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
close(0)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++
 
Joined
Oct 22, 2019
Messages
3,584
If the file being copied doesn't look like a sparse file, i.e. no apparent holes in the on-disk format, then it's eligible for copy_file_range and fast server-side copies. If cp decides the file looks sparse, it falls back to the slow read/write round-trip through the client, so it can detect runs of zeroes and recreate the holes itself.
Why should it matter whether it's a sparse file when dealing with server-side copy over NFS? Apparently sparse or not, the end result is a byte-for-byte exact copy of the file, preferably done on the server's end (not over the network).

This is where I feel like I'm missing something as an end-user.

So if I took a large sequential file that was getting the fast server-side copy, and punched a 4-KiB hole in it for no reason, I'd lose the speed and efficiency of a server-side copy because "cp" now determines it to be "sparse" and ineligible for copy_file_range? o_O
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,398
Ok, so this looks like an unfortunate interaction between OpenZFS 2.x and coreutils 9.x, and how OpenZFS reports sparse files to coreutils. This is also way beyond my expertise, and is worth a bug report. Maybe @mav@ can figure out what's going on, deep in the bowels of the file system.
 
Joined
Oct 22, 2019
Messages
3,584
Ok, so this looks like an unfortunate interaction between OpenZFS 2.x and coreutils 9.x, and how OpenZFS reports sparse files to coreutils. This is also way beyond my expertise, and is worth a bug report. Maybe @mav@ can figure out what's going on, deep in the bowels of the file system.

Surely there's something in the file's header from which "cp" determines which method to use (whether it's a "sparse" file or not)? I'm still not sure how it determines this (or why it should even matter for server-side copies.)

See my tests below:


----------


Here's a simple test I did.

I took a 1 GiB video file, which is a compressed video using the h264 codec. Let's call it "original.mp4".

I loaded this file into Avidemux and re-saved without any compression (just a direct copy), using the MP4 container. Let's call this "copy.mp4".

I repeated the same thing with the "original.mp4" file, once again without any compression (just a direct copy), but this time saved it with the MKV container. Let's call this "copy.mkv".


----------


So I have three files:
  1. original.mp4
  2. copy.mp4
  3. copy.mkv
They all contain the same video/audio stream. They are all the same filesize. They were all saved with the same software.


----------


Well, guess what? Here are the results when copying the file using "cp" while in the NFS mount on the client:
  1. original.mp4 - Very slow, read/write, round-trip over the network
  2. copy.mp4 - Very slow, read/write, round-trip over the network
  3. copy.mkv - Super fast, copy_file_range, server-side copy

----------


What's even more interesting is that it does not matter whether I do the above steps (re-saving copies with Avidemux) on my client's local drive first and then transfer the test files over to the NFS mount, or do them directly on the NFS mount itself.

The results are the same. The MKV files are copied within a couple seconds, while the MP4 files always take a round-trip over the network.
 
Last edited:
Joined
Oct 22, 2019
Messages
3,584
Well, guess what again?

No issues with the same "original.mp4" file on a Linux NFS server resting on top of an XFS filesystem.

It always uses copy_file_range, and thus server-side copy, and thus is ultra fast. (Yes, the exact same files that did not use copy_file_range in my tests with TrueNAS 13.0-U1.1, i.e. FreeBSD 13.1.)


----------


So this issue might be exclusive to ZFS or (FreeBSD + ZFS)?

I can try to repeat the above tests with a Linux NFS server + ZFS (instead of XFS).


----------


UPDATE:

Looks like this might be exclusive to ZFS, regardless of whether the server operating system is Linux or FreeBSD.

Same issue with a Linux NFS server when ZFS is the underlying filesystem. On the same Linux NFS server, one NFS export points to a path on an XFS filesystem, and the other export points to a path on a ZFS filesystem.

Only the latter (ZFS) suffers the issue of not using server-side copy ("copy_file_range").

What's peculiar about this is that we know ZFS on the server's end is fully capable of using server-side copy ("copy_file_range") when NFS is involved. This whole situation of which files are deemed "eligible" for server-side copy is just really stupid. :frown:

Another way to phrase this: If I were only dealing with tarball archives and MKV videos (but no MP4 videos), then I would be a happy camper and have no idea that this issue even exists. :tongue: Come on, that's just silly...
 
Last edited:
Joined
Jan 18, 2017
Messages
524

Samuel Tai has a point; has a bug report been created for this?

 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
@winnielinnie What block sizes are used by XFS? ZFS's default block size is 128KB, which means it can have no holes in a file that are smaller than, or not aligned to, that. You cannot punch a 4KB hole in a file on ZFS -- sure, it will be filled with zeroes and the block will be recompressed without it, saving space, but it won't be reported as a hole.
 
Joined
Oct 22, 2019
Messages
3,584
Unless I'm mistaken, this seems to be an issue with coreutils `cp`. The manpage itself states that the heuristic for determining whether a file is sparse is "crude".
Crude indeed, because...

Did you try to specify `--sparse=never`?
...this works perfectly every time on ZFS over an NFS mount.

Good catch, @anodos! :cool:


So there must be something about the beginning of certain filetypes (such as MKV videos and tarball archives) that "passes" this "crude" test when cp is allowed to determine it automatically. Yet, on the other hand, there are filetypes that always "fail" this test, such as MP4 videos.


And yet, this "crude test" always passes if the underlying filesystem is XFS (and perhaps Ext4 as well.)



----------



@winnielinnie What block sizes are used by XFS?
4 KiB (4096 bytes) from formatting with the default mkfs.xfs options on an SSD.


ZFS's default block size is 128KB, which means it can have no holes in a file that are smaller than, or not aligned to, that.
What you wrote gave me an idea to test. Perhaps the quick "crude" method by which cp determines eligibility for copy_file_range has something to do with the underlying blocksize / recordsize?



----------



So here's what I found out (at least with my tests), SSDs and HDDs (mirror vdevs, ashift=12) on the TrueNAS server (ZFS):
  • ❌ Setting the dataset recordsize of 32K or greater results in the issue discussed in this thread (read/write, network round-trip)
  • ❌ Setting the dataset recordsize of 2K or smaller results in the issue discussed in this thread (read/write, network round-trip)
  • ✅ Setting the dataset recordsize between 4K and 16K results in "cp" always using copy_file_range (server-side copy)
  • ✅ Using "cp --sparse=never" always uses copy_file_range, regardless of recordsize or filetype
However, the first two scenarios still allow certain filetypes (MKVs, tarballs, etc.) to use copy_file_range.



----------



Not sure if that helps any?


Is it looking more like something that needs to be fixed in coreutils rather than in OpenZFS? (Or perhaps in both projects?)

Are there any inherent risks if someone were to always invoke "--sparse=never"?
 
Last edited:

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,398
@winnielinnie, this sounds like you could alias cp in the system profile to always use --sparse=never as a work-around.
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
What you wrote gave me an idea to test. Perhaps the quick "crude" method by which cp determines eligibility for copy_file_range has something to do with the underlying blocksize / recordsize?
I am just saying that if the holes in the file are very small (and I would not expect to see big holes in a video file; it is not a VM image or a database), then a big filesystem block size could miss them, turning them into plain sequences of zeroes. If that is what's happening, then there is nothing to do on the OpenZFS side. I generally don't understand why coreutils would not use copy_file_range on sparse files. At least the FreeBSD implementation of copy_file_range handles holes by itself. Maybe some Linux implementation didn't at some point.
 
Joined
Oct 22, 2019
Messages
3,584
@winnielinnie, this sounds like you could alias cp in the system profile to always use --sparse=never as a work-around.
That can work for now (something I considered during these tests), but it still leaves two things open.


Are there any inherent risks if someone were to always invoke "--sparse=never"?
Not sure if it's safe? I don't see why not, though.


And secondly, it doesn't address the issue of server-side copy not being used with NFS atop ZFS, which at this point looks like something that needs to be fixed in coreutils. :frown:
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,545
And secondly, it doesn't address the issue of server-side copy not being used with NFS atop ZFS, which at this point looks like something that needs to be fixed in coreutils. :frown:

Right, with these sorts of issues it's important to determine whether the problem is client-side or server-side, and once you've determined something is client-side, you should narrow down whether it's the actual kernel NFS client or the particular application.

We don't fork coreutils, and we don't offer support for operating as an NFS, SMB, etc. client, so there's not really anything for us to fix here.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
The Linux man page for copy_file_range says:
NOTES
If fd_in is a sparse file, then copy_file_range() may expand any
holes existing in the requested range. Users may benefit from
calling copy_file_range() in a loop, and using the lseek(2)
SEEK_DATA and SEEK_HOLE operations to find the locations of data
segments.

copy_file_range() gives filesystems an opportunity to implement
"copy acceleration" techniques, such as the use of reflinks
(i.e., two or more inodes that share pointers to the same copy-
on-write disk blocks) or server-side-copy (in the case of NFS).
"May expand", but "you're better off doing this in chunks around the holes" sounds a lot like "will expand holes". FreeBSD has similar language but is a bit clearer on what's going on:
This system call attempts to maintain holes in the output file for the
byte range being copied. However, this does not always work well. It is
recommended that sparse files be copied in a loop using lseek(2) with
SEEK_HOLE, SEEK_DATA arguments and this system call for the data ranges
found.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
How important is preserving sparseness on a filesystem with compression enabled? From a ZFS perspective, not very, right?

cp using a heuristic to determine sparseness and then using different syscalls is a weird layering/boundary violation. It seems like that behavior would ideally be gated behind a flag, rather than being the default. I'm not going to cry to coreutils about it, though.
 