Recursive ZFS replication: "failed to read from stream" for some child dataset(s)

npr

I'm having a problem where ZFS replication from one TrueNAS to another reports 'ERROR' even though the replication seems to succeed. I'm trying to monitor it, and this thing keeps crying wolf even though nothing is wrong; I fear we won't notice when the replication task actually does fail, and given how flaky/unreliable ZFS replication can be, I don't want to take that risk. I'm replicating the dataset "tank/shares" into "tank/acme/nas01_backup/" recursively. It contains some common 'shared folders' like 'data', 'common' and 'setup', plus profile datasets under 'profiles' ('profiles/user1', etc.). Including everything, there are 9 datasets in total.

The error is that the "command returns with a non-zero exit value"; it appears to report a value of '1'. When I try to build a 'manual' version of the command and check "$?", I get either '1' or '127', but I'm not sure I'm getting the bash right (just look at the command below and try to disentangle that mess of quotes and ampersands).
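
As a side note, in case it helps interpret those numbers: in POSIX shells an exit status of 127 normally means the command itself could not be found, whereas 1 is an ordinary failure from a command that did run. A quick sketch (the command name here is made up and has nothing to do with the replication itself):

Code:
sh -c 'definitely-not-a-real-command'; echo "exit status: $?"   # 127: command not found
sh -c 'false'; echo "exit status: $?"                           # 1: command ran and failed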

What I did was modify the file "/usr/lib/python3/dist-packages/zettarepl/transport/async_exec_tee.py" to print the contents of the 'args' variable there, set the replication task to run every hour, and restart middlewared. It now outputs this (IP addresses redacted):

Code:
Command: Pipe((Pipe((Pipe((['sh', '-c', '(zfs send -V -R -w -i tank/shares@auto-2024-01-23_10-00 -L -c tank/shares@auto-2024-01-23_11-00 & PID=$!; echo "zettarepl: zfs send PID is $PID" 1>&2; wait $PID)'], ['lz4c'])), ['throttle', '-B', '5242880'])), ['ssh', '-i', '/tmp/tmp3f_yu9eu', '-o', 'UserKnownHostsFile=/tmp/tmpury0d3xx', '-o', 'StrictHostKeyChecking=yes', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=10', '-p9122', 'sync@1.2.3.4', 'sh -c \'PATH=$PATH:/usr/local/sbin:/usr/sbin:/sbin sh -c \'"\'"\'exec 3>&1; eval $(exec 4>&1 >&3 3>&-; { lz4c -d 4>&-; echo "pipestatus0=$?;" >&4; } | { zfs recv -s -F -x sharesmb -x sharenfs -x mountpoint tank/acme/nas01_backup 4>&-; echo "pipestatus1=$?;" >&4; }); [ $pipestatus0 -ne 0 ] && exit $pipestatus0; [ $pipestatus1 -ne 0 ] && exit $pipestatus1; exit 0\'"\'"\'\''])) (Pipe((Pipe((['sh', '-c', '(zfs send -V -R -w -i tank/shares@auto-2024-01-23_10-00 -L -c tank/shares@auto-2024-01-23_11-00 & PID=$!; echo "zettarepl: zfs send PID is $PID" 1>&2; wait $PID)'], ['lz4c'])), ['throttle', '-B', '5242880'])), ['ssh', '-i', '/tmp/tmp3f_yu9eu', '-o', 'UserKnownHostsFile=/tmp/tmpury0d3xx', '-o', 'StrictHostKeyChecking=yes', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=10', '-p9122', 'sync@1.2.3.4', 'sh -c \'PATH=$PATH:/usr/local/sbin:/usr/sbin:/sbin sh -c \'"\'"\'exec 3>&1; eval $(exec 4>&1 >&3 3>&-; { lz4c -d 4>&-; echo "pipestatus0=$?;" >&4; } | { zfs recv -s -F -x sharesmb -x sharenfs -x mountpoint tank/acme/nas01_backup 4>&-; echo "pipestatus1=$?;" >&4; }); [ $pipestatus0 -ne 0 ] && exit $pipestatus0; [ $pipestatus1 -ne 0 ] && exit $pipestatus1; exit 0\'"\'"\'\''
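
Stripped of all the nested quoting, that boils down to roughly the following pipeline (my reconstruction from the output above; the temporary key and known-hosts files are whatever zettarepl generated for that run):

Code:
# Local side: raw recursive incremental send, lz4-compressed and rate-limited.
(zfs send -V -R -w -i tank/shares@auto-2024-01-23_10-00 -L -c tank/shares@auto-2024-01-23_11-00) \
  | lz4c \
  | throttle -B 5242880 \
  | ssh -i /tmp/tmp3f_yu9eu -o UserKnownHostsFile=/tmp/tmpury0d3xx -o StrictHostKeyChecking=yes \
      -o BatchMode=yes -o ConnectTimeout=10 -p9122 sync@1.2.3.4 \
      'lz4c -d | zfs recv -s -F -x sharesmb -x sharenfs -x mountpoint tank/acme/nas01_backup'
# The remote side additionally records the exit status of 'lz4c -d' and 'zfs recv'
# separately and exits with the first non-zero one.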


The error itself doesn't really report anything useful. Without the extra print, I just get this as the exception message:

Code:
send from @auto-2024-01-23_10-00 to tank/shares/profiles@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/profiles/syncadm@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/profiles/administratie@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/profiles/user1@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/profiles/user2@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/common@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/setup@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/data@auto-2024-01-23_11-00 estimated size is 624B
total estimated size is 5.48K


This is the stack trace (note: the line number reported in async_exec_tee.py will be slightly off because of the debug statements I added; I believe the unmodified file would report line 104 instead of 109).

Code:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 181, in run_replication_tasks
    retry_contains_partially_complete_state(
  File "/usr/lib/python3/dist-packages/zettarepl/replication/partially_complete_state.py", line 16, in retry_contains_partially_complete_state
    return func()
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 182, in <lambda>
    lambda: run_replication_task_part(replication_task, source_dataset, src_context, dst_context,
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 278, in run_replication_task_part
    run_replication_steps(step_templates, observer)
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 666, in run_replication_steps
    replicate_snapshots(step_template, incremental_base, snapshots, include_intermediate, encryption, observer)
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 707, in replicate_snapshots
    run_replication_step(step, observer)
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 787, in run_replication_step
    ReplicationProcessRunner(process, monitor).run()
  File "/usr/lib/python3/dist-packages/zettarepl/replication/process_runner.py", line 33, in run
    raise self.process_exception
  File "/usr/lib/python3/dist-packages/zettarepl/replication/process_runner.py", line 37, in _wait_process
    self.replication_process.wait()
  File "/usr/lib/python3/dist-packages/zettarepl/transport/ssh.py", line 158, in wait
    stdout = self.async_exec.wait()
  File "/usr/lib/python3/dist-packages/zettarepl/transport/async_exec_tee.py", line 109, in wait
    raise ExecException(exit_event.returncode, "Command: %s %s ... Output: %s " % (self.args_bak, self.args_bak.cmds, self.output))


I tried to construct an equivalent bash command and ran it from the command line with a verbose flag added.

Code:
zfs send -V -R -w -i tank/shares@auto-2024-01-23_10-00 -L -c tank/shares@auto-2024-01-23_11-00 | ssh -i ~/.ssh/stor02.key -p 9122 sync@1.2.3.4 /sbin/zfs recv -s -F -x sharesmb -x sharenfs -x mountpoint tank/acme/nas01_backup 
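
To see which side of the pipe actually produces the non-zero status, the same pipeline can be followed by a PIPESTATUS check (a sketch; PIPESTATUS is bash-specific, zsh spells it pipestatus):

Code:
zfs send -V -R -w -i tank/shares@auto-2024-01-23_10-00 -L -c tank/shares@auto-2024-01-23_11-00 \
  | ssh -i ~/.ssh/stor02.key -p 9122 sync@1.2.3.4 \
      /sbin/zfs recv -s -F -x sharesmb -x sharenfs -x mountpoint tank/acme/nas01_backup
# PIPESTATUS holds one exit code per stage: [0] = zfs send, [1] = ssh (i.e. the remote zfs recv).
echo "send=${PIPESTATUS[0]} recv=${PIPESTATUS[1]}"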


Running the manual command outputs the same 'send' lines, but also some more info for the receive side:

Code:
receiving incremental stream of tank/shares@auto-2024-01-23_12-00 into tank/acme/nas01_backup@auto-2024-01-23_12-00
snap tank/acme/nas01_backup@auto-2024-01-23_12-00 already exists; ignoring
received 0B stream in 1 seconds (0B/sec)
receiving incremental stream of tank/shares/profiles/syncadm@auto-2024-01-23_12-00 into tank/acme/nas01_backup/profiles/syncadm@auto-2024-01-23_12-00
snap tank/acme/nas01_backup/profiles/syncadm@auto-2024-01-23_12-00 already exists; ignoring
received 0B stream in 1 seconds (0B/sec)
receiving incremental stream of tank/shares/profiles/user1@auto-2024-01-23_12-00 into tank/acme/nas01_backup/profiles/user1@auto-2024-01-23_12-00
snap tank/acme/nas01_backup/profiles/user1@auto-2024-01-23_12-00 already exists; ignoring
received 0B stream in 1 seconds (0B/sec)
receiving incremental stream of tank/shares/common@auto-2024-01-23_12-00 into tank/acme/nas01_backup/common@auto-2024-01-23_12-00
snap tank/acme/nas01_backup/common@auto-2024-01-23_12-00 already exists; ignoring
received 0B stream in 1 seconds (0B/sec)
receiving incremental stream of tank/shares/data@auto-2024-01-23_12-00 into tank/acme/nas01_backup/data@auto-2024-01-23_12-00
cannot receive: failed to read from stream
snap tank/acme/nas01_backup/data@auto-2024-01-23_12-00 already exists; ignoring
received 0B stream in 1 seconds (0B/sec)


A few things here stand out:
  • "Cannot receive: failed to read from stream" error for one of the datasets 'data'.
  • Some folders seem skipped in this message? I didn't observe a line for 'profiles/user2' dataset, 'setup' dataset, or 'profiles' dataset. 9 datasets are sent but 6 have a message of being received.
  • When I go check on md5 checksums of all the files after a sync, everything is there on the receiving end AFAICT.
Code:
# On the first NAS:

cd /mnt/tank/shares/profiles/user2
find . -type f -print0 | sort -z | xargs -0 md5sum  | tee  ~/data1.log

# On the second NAS:

cd /mnt/tank/acme/nas01_backup/profiles/user2
find . -type f -print0 | sort -z | xargs -0 md5sum  | tee  ~/data2.log

# Then, after copying one of the log files to the other machine:

diff ~/data1.log ~/data2.log
## Gives no output.
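
Besides file checksums, the snapshots themselves can be compared: a received snapshot keeps the guid of the snapshot it was sent from, so matching GUIDs on both sides mean the stream for that dataset really was applied. A sketch using the same dataset names (only snapshots that are supposed to exist on both sides will line up):

Code:
# On the first NAS: list snapshot GUIDs under the source dataset.
zfs list -H -t snapshot -o name,guid -r tank/shares | sort > ~/guids_src.log

# On the second NAS: the same for the backup target.
zfs list -H -t snapshot -o name,guid -r tank/acme/nas01_backup | sort > ~/guids_dst.log

# After copying one of the files over, normalise the dataset prefixes and diff.
sed 's|^tank/shares|DATASET|' ~/guids_src.log > /tmp/src.txt
sed 's|^tank/acme/nas01_backup|DATASET|' ~/guids_dst.log > /tmp/dst.txt
diff /tmp/src.txt /tmp/dst.txt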


I'm clueless as to what's going on exactly at this point.
 