SOLVED Replication Failed with no log details

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Strange problem. I have a TN box that serves as a backup to my main TN box. I have been replicating to it for quite some time and all the data is just fine.

Now, I have rebuilt my main TN box which necessitated the destruction of the pool. I am trying to replicate the data back from the backup box to the main box and the replication keeps failing with no log.

Since there is about 17T of data to replicate, each attempt takes a long time: it gets to 95+% complete before throwing an error.

I have tried this replication 3 times now. The first time, the GUI says "FAILED" and I get a blank file when I click "Download Log." The second time, I get the same behavior except that the log gives me a cryptic message (see attached log) saying to "run the following zfs command." I do that and get an error about redirecting STDIN/OUT. I try a third time and get the same blank-log business.
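For anyone else chasing a blank replication log: one way to surface the real error is to run the send/receive by hand so the failure prints on stderr instead of disappearing into the GUI. This is a rough sketch; the pool, dataset, snapshot, and host names are placeholders, and your transport (SSH vs. the middleware's own channel) may differ.

```shell
# Run the replication manually to see the actual error message.
# "backup/tank/data", the snapshot name, and "main-box" are placeholders.
zfs send -v backup/tank/data@manual-test | \
    ssh root@main-box zfs recv -s -F tank/data
```

The `-v` on `zfs send` reports progress, and `-s` on `zfs recv` saves a resumable state token if the stream dies partway through, so a 17T transfer doesn't have to restart from zero.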

I'd rather replicate the data back, since it's so much more efficient than rsync. However, I'm running out of time to get the data moved back to the original TN box, and will have to resort to rsync soon.

Have you any suggestions on where to look for what's REALLY causing the failure?
 

Attachments

  • 1765.log.txt
    4.3 KB

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
OK folks, I finally resolved this issue... well, kinda, after working with @NickF (thanks a million, man!).

I ultimately discovered that I had some disk errors that corrupted the snapshot the replication task creates. The replication would get about 95% through before hitting the bad spot in the stream and failing out. And because I was replicating multiple datasets and the task did not complete, the receiving system aborted the entire job.

So, I transferred my datasets individually. They all worked fine until I got to my largest dataset (of course). That one failed out. After doing a lot more digging, I discovered via zpool status -v that I had a permanent error in the snapshot. Even after deleting that snapshot and scrubbing the pool, the problem persisted.

So, I am transferring those files via rsync, using the --ignore-errors option, since it seems the errors are in the source files themselves. Thankfully, these source files are easily replaced/recreated, so I won't worry about the 4-5 files that are corrupt, and will rsync the rest of the 14T of data.
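The rsync invocation was along these lines (paths and hostname are placeholders; adjust to your own mountpoints):

```shell
# -a  archive mode (permissions, times, ownership, recursion)
# -H  preserve hard links
# --ignore-errors   delete/continue even after I/O errors on corrupt files
# --partial         keep partially transferred files so retries can resume
rsync -aH --ignore-errors --partial --info=progress2 \
    /mnt/tank/data/ root@main-box:/mnt/tank/data/
```

The trailing slashes matter: with them, rsync copies the *contents* of the source directory into the destination rather than nesting a second `data/` directory inside it.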

Since I was already planning to delete and recreate the pool on the server with the errors, I will take an extra week to run badblocks on all the drives first to make sure everything is OK.
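For anyone repeating this: a write-mode badblocks pass is destructive, so only run it on drives whose pool you are about to destroy anyway. A sketch (the device name is a placeholder; run one instance per drive):

```shell
# -w  destructive write-mode test (WIPES the drive!)
# -s  show progress
# -v  verbose output
# -b 4096  use a 4K block size to match modern drive sectors
badblocks -wsv -b 4096 /dev/sdX
```

A full write-mode pass over a large drive takes a day or more per disk, which is where that "extra week" goes.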
 