Replication task fails after server shut down; switching back to rsync

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
I have a primary server and a backup server, both with SCALE installed. A few months ago, I switched from rsync to replication to back up the primary server to the backup server every hour. I'd prefer to use replication because there is a top-level dataset with 8 child datasets, and replication can handle it in one task whereas rsync requires 8 tasks, and configuring and managing 8 tasks in the SCALE UI is doable but cumbersome.

The problem is that recently I had to shut down the primary server for two days because the CPU fan failed and I had to order and install a replacement. During that time, all the snapshots used for replication expired, and when I got the primary server going again, the replication task failed since there were no longer any snapshots on the backup server. No matter what I tried, including allowing it to replicate from scratch, there was no way to get the replication task to run. The only thing I could have done was destroy the top-level dataset on the backup server and recreate it; then I presumably could have run the replication task as if it were running for the first time, with a full data transfer of several terabytes.

But this points to a major flaw with replication. It appears that if the source server for the replication is down for a while and all the snapshots used for replication expire, there is no way to run the replication task once the source server is back up.

Consequently, I have stopped using replication to back up the primary server to the backup server and have returned to using 8 rsync tasks, one for each child dataset, as cumbersome as that is. At least this way I won't run into any problems if I have more downtime in the future. I use rsync over ssh, not with modules, so it's not going away in Cobia as I understand it.

The bottom line, however, is this: does anyone know how to set up a replication task that's robust enough to be usable if the source server is shut down temporarily and the snapshots expire?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Keep the snapshots for longer? You have a choice of how long to keep source and destination snapshots - and they can be different.
 

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
Keep the snapshots for longer? You have a choice of how long to keep source and destination snapshots - and they can be different.

Yes, that's an option. However, the primary dataset is on a larger capacity pool than the backup dataset. So, ideally I don't want to keep the snapshots very long in the backup dataset since I don't want the pool on the backup server to get near its capacity. That's why I originally set the snapshots to expire after one day on the backup server.

Still, it's an option, and I'll consider it.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
That suggests to me that neither of the pools is big enough. A backup server (as a general rule of thumb) should have 1.5 to 2X the storage available of the primary. And the primary should keep the data for at least a week or more, with substantial free space (50%), so that it runs smoothly and quickly and can absorb large data changes if necessary.

How long the backup keeps the data is a more nuanced and interesting question. I tend to keep snapshots for 4 weeks and use a file-by-file backup to another box for longer-term backups. The TrueNAS snapshot UI is not intuitive when it comes to keeping snapshots of the same data for different time periods. I think IX could do with reviewing & improving this - but I also think there are more important things to do. And, I suspect, it's not simple either.
 

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
That suggests to me that neither of the pools is big enough. A backup server (as a general rule of thumb) should have 1.5 to 2X the storage available of the primary. And the primary should keep the data for at least a week or more, with substantial free space (50%), so that it runs smoothly and quickly and can absorb large data changes if necessary.

How long the backup keeps the data is a more nuanced and interesting question. I tend to keep snapshots for 4 weeks and use a file-by-file backup to another box for longer-term backups. The TrueNAS snapshot UI is not intuitive when it comes to keeping snapshots of the same data for different time periods. I think IX could do with reviewing & improving this - but I also think there are more important things to do. And, I suspect, it's not simple either.

Not really. The primary pool has 12 TB of capacity, of which 4.5 TB is used for data to be backed up, and 1.5 TB is used for zvols that hold VMs, which are mostly empty space and are imaged periodically with Clonezilla. So, 6 TB total used, and it's at around 50% capacity. That's fine with ZFS, and there are plenty of snapshots keeping the data, especially the most important and dynamic datasets, for weeks or months.

The backup pool has 8 TB of capacity, and it's only used to back up the 4.5 TB dataset. That means that without snapshots, it's at around 56% capacity, which isn't bad. I don't know why the backup pool would need to have so much more storage than the primary pool as you suggest. The backup server is only there for emergencies if the primary goes down, which actually happened recently when the primary was down for a couple of days due to a failed fan.

In any event, people can have different views about sizing pools, but it does seem that one flaw with replication is that if snapshots expire, it's unusable. I'm not sure if most people realize that about replication. I sure didn't. I suppose that's just the nature of how ZFS does replication.

In any case, I'm still evaluating, but unless there's some other solution, it seems that an rsync task for each dataset may be the best approach in my case, unless I invest in bigger disks for the backup server and store a larger set of snapshots on it, which I may eventually do.
 
Joined
Oct 22, 2019
Messages
3,641
I think IX could do with reviewing & improving this
No kidding. Having your regular backups break like this is not uncommon. Just browse these forums and you'll see it's an issue that pops up for others. :confused:


it does seem that one flaw with using replication is that if snapshots expire, it's unusable. I'm not sure if most people realize that about replication. I sure didn't.
TrueNAS's Replication Tasks appear to be designed for a very particular and narrow scope. They can't really be configured for a wider range of use cases: you have to use Periodic Snapshots and Replication Tasks in a simple, specific manner. Even though there's an option to prevent breakage from expired snapshots, it doesn't work that well. (As you and others can attest.)

The developers could leverage the "hold" and "bookmark" features of ZFS, integrating them into the middleware and GUI (and even as an option in the Replication Tasks), but I doubt it would be a priority. "Our enterprise customers don't face this issue."

You can create a script that "holds" and "releases" your snapshots, in such a way that it ensures at least one snapshot is being held at all times. This would prevent destruction of crucial snapshots in the case of an emergency, loss of power, offline maintenance, etc. A rough sketch of the idea is below.
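
Something like this untested Python sketch shows the idea. The dataset name and hold tag are placeholders, not anything from my setup; it keeps a hold on the newest snapshot of the dataset and releases the tag from older ones so retention can still clean those up.

```python
#!/usr/bin/env python3
# Rough sketch (untested) of a hold/release script. It places a ZFS "hold" on
# the newest snapshot of a dataset and releases the tag from older snapshots,
# so at least one snapshot always survives retention. DATASET and HOLD_TAG are
# placeholders -- adjust them to your own pool layout.
import subprocess

DATASET = "tank/mydata"      # hypothetical source dataset
HOLD_TAG = "keep-for-repl"   # hypothetical hold tag

def zfs(*args):
    """Run a zfs subcommand and return its stdout."""
    return subprocess.run(["zfs", *args], check=True,
                          capture_output=True, text=True).stdout

def has_hold(snapshot):
    """True if HOLD_TAG is already present on the snapshot."""
    out = zfs("holds", "-H", snapshot)
    return any(line.split("\t")[1] == HOLD_TAG for line in out.splitlines())

# Snapshots of the dataset itself (depth 1), oldest first.
snaps = zfs("list", "-H", "-t", "snapshot", "-o", "name",
            "-s", "creation", "-d", "1", DATASET).splitlines()
if not snaps:
    raise SystemExit(f"no snapshots found on {DATASET}")

newest = snaps[-1]

# Hold the newest snapshot first, then release the tag from older ones so the
# periodic-snapshot task is free to destroy them.
if not has_hold(newest):
    zfs("hold", HOLD_TAG, newest)

for snap in snaps[:-1]:
    if has_hold(snap):
        zfs("release", HOLD_TAG, snap)
```

Run it from a cron job (or a TrueNAS Cron Task) shortly after each snapshot task; a held snapshot can't be destroyed until every tag on it is released with zfs release.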


For the record, I don't use the GUI for my replications. I use a (messy) script. The Replication Tasks in the GUI gave me too many scares and headaches.
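
For what it's worth, a bare-bones script-driven replication can look something like the sketch below. This is not my actual script; the hostnames and dataset names are placeholders, and it assumes the two sides still share a common snapshot.

```python
#!/usr/bin/env python3
# Minimal sketch of a script-driven incremental replication over SSH. It finds
# the newest snapshot on each side, then streams everything in between with
# "zfs send -I | ssh ... zfs recv". SRC, DST, and REMOTE are placeholders, and
# the destination's newest snapshot must still exist on the source.
import subprocess

SRC = "tank/mydata"            # hypothetical source dataset
DST = "backup/mydata"          # hypothetical destination dataset
REMOTE = "root@backup-server"  # hypothetical backup host

def newest_snapshot(dataset, remote=None):
    """Return the snapshot name (the part after the @) of the newest snapshot."""
    cmd = ["zfs", "list", "-H", "-t", "snapshot", "-o", "name",
           "-s", "creation", "-d", "1", dataset]
    if remote:
        cmd = ["ssh", remote] + cmd
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    names = out.splitlines()
    if not names:
        raise SystemExit(f"no snapshots on {dataset}")
    return names[-1].split("@", 1)[1]

common = newest_snapshot(DST, REMOTE)   # last snapshot the backup already has
latest = newest_snapshot(SRC)           # newest snapshot on the source

if common == latest:
    raise SystemExit("destination is already up to date")

# Send every snapshot between the common one and the latest, piped over SSH
# into zfs recv on the backup box (-F rolls the destination back if needed).
send = subprocess.Popen(["zfs", "send", "-I", f"@{common}", f"{SRC}@{latest}"],
                        stdout=subprocess.PIPE)
subprocess.run(["ssh", REMOTE, "zfs", "recv", "-F", DST],
               stdin=send.stdout, check=True)
send.stdout.close()
if send.wait() != 0:
    raise SystemExit("zfs send failed")
```

If that common snapshot has already expired on the source, the send fails for the same reason the GUI task does, which is exactly why holding at least one snapshot (as sketched above) matters.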
 