Replication failure: cannot receive incremental stream: destination contains partially-complete state from "zfs receive -s"

RandomPrecision
I have a primary TrueNAS-SCALE-22.12.2 system with three pools. I am doing replication to another TrueNAS-SCALE-22.12.2 system, a backup system. While the primary system has three pools, the backup system has only one huge pool (whose size is greater than the combined pools of the primary system). The goal is to have three top-level datasets on the backup system, one for each pool on the primary system. I was initially struggling with this, as described in this post. Note that originally the backup system was running CORE, but has since been upgraded to SCALE.

I've had two of the three datasets replicating successfully for about a month, including a week with no issues since upgrading the backup system from CORE to SCALE.

I waited to do replication of the third pool because it's the biggest: just shy of 10 TB. (I wanted some track record of comfort and success with the overall replication process before starting on this massive one.) I kicked off the replication task for this 10 TB pool Friday evening (May 26). It's been running all weekend. Just now I checked on its status, and it failed with the error in the subject:
Code:
zettarepl.transport.ssh_netcat.SshNetcatExecException: Active side: cannot receive incremental stream: destination backup-pool/sharerepl contains partially-complete state from "zfs receive -s".


It looks like this is usually caused by some kind of network issue or other interruption. There should not have been any deliberate interruption in my case. This is over a wired gigabit network, I did not make any changes on the backup side, and I periodically spot-checked both the primary and backup GUIs over the weekend to make sure there were no errors.

Actually, I'm not sure if that error corresponds to the initial sync I started Friday evening, or one of the subsequent automatic replication tasks. This is set to run daily at midnight (i.e. corresponding exactly to the snapshot schedule). The initial sync definitely took well over 48 hours, so while it was running, additional snapshots would have been created. I assumed TrueNAS would be smart enough to know it shouldn't run a replication task when the previous one hadn't finished.

Looking at the backup system, it appears most (maybe all) of the data actually made it, so I really don't want to restart this initial replication from scratch. Is it possible to "resume" or otherwise restart the sync without losing what has already been transferred? Should I just try to manually run the sync now? I haven't done anything yet - I'm coming here for the hand-holding to hopefully avoid having to restart this replication from the beginning!
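From what I've read about resumable sends/receives, the manual recovery path at the raw CLI looks roughly like the sketch below. I haven't run any of this yet, and the ssh hostname "backup" (and the <token-from-above> placeholder) are just stand-ins for my setup:
Code:
# On the backup system: see whether the interrupted receive left a resume token
zfs get -H -o value receive_resume_token backup-pool/sharerepl

# If that prints a long token (rather than "-"), the stream can in principle
# be resumed. On the primary system, feed the token to zfs send and pipe it
# into another resumable receive on the backup:
zfs send -t <token-from-above> | ssh backup zfs receive -s backup-pool/sharerepl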

Thanks!
 

RandomPrecision
I did nothing, including leaving the replication schedule alone. That means it tried to run again at midnight, along with the snapshot. That also failed, with a new error message:
Code:
[2023/05/31 00:00:02] ERROR    [replication_task__task_7] [zettarepl.replication.run] For task 'task_7' non-recoverable replication error NoIncrementalBaseReplicationError("No incremental base on dataset 'sharepool/share' and replication from scratch is not allowed")


It seems to be in kind of an "intermediate" state, for lack of a more precise term. The GUI of the backup system shows the dataset as using 9 TB (it's 9.5 TB on the primary system). Total available space on the backup system is very low, which is also consistent with that pool having been replicated. I can also see the snapshots if I do "zfs list -t snapshot" on the backup system. So the data does appear to be there, in some form. But I cannot actually see the files from the command line, even if I look in the special ".zfs" folders. In fact, doing a "du" against that dataset shows virtually no used space.
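One guess (unverified) is that the dataset simply isn't mounted on the backup side, which would explain why "du" sees nothing even though ZFS reports the space as used. Something like this should show it:
Code:
# On the backup system: is the replicated dataset actually mounted?
zfs get -r mounted,mountpoint backup-pool/sharerepl

# And what does ZFS itself think is used vs. referenced?
zfs list -o name,used,refer,mountpoint -r backup-pool/sharerepl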

So again, is there any way to recover and work with what I have? Or am I going to have to just start over and hope it works this time?
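If resuming turns out not to be an option, my understanding (please correct me if I'm wrong) is that the partially received state can be discarded explicitly, after which a replication task can start clean. Any snapshots that fully completed would survive; in my case I suspect none did, since it was the initial full stream that was interrupted:
Code:
# On the backup system: list any snapshots that actually completed
zfs list -t snapshot -o name -s creation -r backup-pool/sharerepl

# Discard the saved partially-received state from the interrupted "zfs receive -s"
zfs receive -A backup-pool/sharerepl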
 

RandomPrecision
Update: I went ahead and destroyed the backup dataset, re-created it, and ran the replication a second time. This time it worked. The only thing I did differently: as soon as I started the replication, I changed the schedule so that the next scheduled run was many days away. In other words, the intent was to make sure the current replication finished before another one was attempted. That's the only thing I can think of that may have caused the first attempt to fail (an automated run kicking off before the initial sync finished). Whatever the reason, it worked. Hopefully the automatic replication continues to succeed.
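For anyone who lands here later, a simple sanity check I plan to use (dataset names are from my setup) is comparing the newest snapshot on each side; if replication is keeping up, they should match:
Code:
# On the primary system: newest snapshot of the source dataset
zfs list -t snapshot -o name -s creation -r sharepool/share | tail -1

# On the backup system: newest snapshot of the replicated copy
zfs list -t snapshot -o name -s creation -r backup-pool/sharerepl | tail -1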
 