Frequent replication failures

taupehat

We're seeing frequent but intermittent replication failures between two arrays at distant sites. After the initial failure, the retry always works. The errors include (but may not be limited to):
  • Failed: ssh: connect to host {ipaddr} port 22: Operation timed out.
  • Connection to {ipaddr} closed by remote host. warning: cannot send 'pool/dataset@auto-2021-04-16_00-00': signal received..
  • The replication failed for the local ZFS pool/dataset/dataset1 while attempting to apply incremental send of snapshot auto-20210415.0000-4w -> auto-20210416.0000-4w to {ipaddr}
Ordinarily I would look at this and suspect a bad network path, but the link between these two sites is well monitored by the NOC and doesn't show the kind of flapping that would explain the problem.

Are there any general debugging tips people are willing to offer to help resolve this?
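
For context, here is a rough sketch of a probe that could run on the sending array to check whether plain TCP connections to port 22 on the remote side also time out around the moments replication fails. It is only an illustration: the function names (`probe`, `format_result`, `watch`), the 30-second interval and the 5-second timeout are assumptions of mine, not anything taken from the replication tooling, and it assumes a Python 3 interpreter is available on the box.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host, port=22, timeout=5.0):
    """Attempt one TCP connection; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refused and unreachable hosts
        return False, time.monotonic() - start, str(exc)


def format_result(host, port, ok, elapsed, error):
    """Build one timestamped (UTC) log line for a probe result."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    status = "ok" if ok else "FAIL"
    return f"{stamp} {host}:{port} {status} {elapsed:.3f}s {error}".rstrip()


def watch(host, port=22, interval=30.0, timeout=5.0, count=None):
    """Probe repeatedly and print one line per attempt.

    count=None loops until interrupted; interval and timeout are
    arbitrary illustrative defaults, not recommended values.
    """
    done = 0
    while count is None or done < count:
        ok, elapsed, error = probe(host, port, timeout)
        print(format_result(host, port, ok, elapsed, error), flush=True)
        done += 1
        if count is None or done < count:
            time.sleep(interval)
```

Calling `watch("{ipaddr}")` with the remote array's address substituted in, and redirecting the output to a file, would give timestamps to line up against the replication failure times. If the probe never fails while replication does, that would point away from basic reachability and toward something on the ssh or ZFS send side.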
 