SOLVED Replication Failed with no log details

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Strange problem. I have a TN box that serves as a backup to my main TN box. I have been replicating to it for quite some time and all the data is just fine.

Now, I have rebuilt my main TN box which necessitated the destruction of the pool. I am trying to replicate the data back from the backup box to the main box and the replication keeps failing with no log.

Since there is about 17T of data to replicate, each attempt takes a long time: it gets to 95+% complete before throwing an error.

I have tried this replication 3 times now. The first time, the GUI says "FAILED" and I get a blank file when I click "Download Log." The second time, I get the same behavior except that the log gives me a cryptic message (see attached log) saying to "run the following zfs command." I do that and get an error about redirecting STDIN/OUT. I try a third time and get the same blank-log business.
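For anyone else chasing a blank replication log: one way to surface the real error is to run the send/receive by hand so the failure prints on stderr instead of disappearing into the GUI. This is a rough sketch; the pool, dataset, snapshot, and host names are placeholders, and your transport (SSH vs. the middleware's own channel) may differ.

```shell
# Run the replication manually to see the actual error message.
# "backup/tank/data", the snapshot name, and "main-box" are placeholders.
zfs send -v backup/tank/data@manual-test | \
    ssh root@main-box zfs recv -s -F tank/data
```

The `-v` on `zfs send` reports progress, and `-s` on `zfs recv` saves a resumable state token if the stream dies partway through, so a 17T transfer doesn't have to restart from zero.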

I'd rather replicate the data back, since it's so much more efficient than rsync. However, I'm running out of time to get the data moved back to the original TN box, and will have to resort to rsync soon.

Have you any suggestions on where to look for what's REALLY causing the failure?
 

Attachments

  • 1765.log.txt
    4.3 KB

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
OK folks, I finally resolved this issue... well, kinda, after working with @NickF (thanks a million, man!).

I ultimately discovered that I had some disk errors that corrupted the snapshot the replication task creates. The replication would get about 95% through before hitting the bad spot in the stream and failing out. And because I was replicating multiple datasets and the task did not complete, the receiving system aborted the entire job.

So, I transferred my datasets individually. They all worked fine until I got to my largest dataset (of course). That one failed out. After doing a lot more digging, I discovered via zpool status -v that I had a permanent error in the snapshot. Even after deleting that snapshot and scrubbing the pool, the problem persisted.

So, I am transferring those files via rsync, using the --ignore-errors option, since it seems the errors are in the source files themselves. Thankfully, these source files are easily replaced/recreated, so I won't worry about the 4-5 files that are corrupt, and will rsync the rest of the 14T of data.
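The rsync invocation was along these lines (paths and hostname are placeholders; adjust to your own mountpoints):

```shell
# -a  archive mode (permissions, times, ownership, recursion)
# -H  preserve hard links
# --ignore-errors   delete/continue even after I/O errors on corrupt files
# --partial         keep partially transferred files so retries can resume
rsync -aH --ignore-errors --partial --info=progress2 \
    /mnt/tank/data/ root@main-box:/mnt/tank/data/
```

The trailing slashes matter: with them, rsync copies the *contents* of the source directory into the destination rather than nesting a second `data/` directory inside it.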

Since I was already planning to delete and recreate the pool on the server with the errors, I will take an extra week to run badblocks on all the drives first to make sure everything is OK.
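For anyone repeating this: a write-mode badblocks pass is destructive, so only run it on drives whose pool you are about to destroy anyway. A sketch (the device name is a placeholder; run one instance per drive):

```shell
# -w  destructive write-mode test (WIPES the drive!)
# -s  show progress
# -v  verbose output
# -b 4096  use a 4K block size to match modern drive sectors
badblocks -wsv -b 4096 /dev/sdX
```

A full write-mode pass over a large drive takes a day or more per disk, which is where that "extra week" goes.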
 