First replication took three solid days. Subsequent replications have taken about as long.
What in detail would you like to know?
I mainly followed the how-to video by Lawrence Systems, and also consulted the FreeNAS manual. I started by creating a periodic snapshot task for the dataset, and configured Replication Tasks to replicate to the secondary FreeNAS on weekends. It seems pretty simple and hard to screw up.
HHD? Do you mean HDD? As stated in my original post, I'm running enterprise-grade Seagate Exos 12TB SAS drives.
Hard to screw up is a matter of perspective.
With ZFS, a lot of simple things will hit the fan if you miss a step or make an assumption without understanding the underlying nature and behaviour of ZFS.
As per your info, the 3-day snapshot lifespan is your issue.
As pointed out by @Johnny Fartpants, the snapshots don't live long enough on the source server to be found and compared against on the remote server.
I can think of two scenarios; as I am not familiar with the GUI-based replication, I can't really confirm either:
With a one-week replication task, you are telling the replication task to start every 7 days.
If your snapshots have a 3-day lifespan, and since this is automated snapshot generation, I doubt a hold is placed on the snapshots. That would be nice to have for such a scenario, but it is not without issues.
So if you have a 5-minute interval for snapshot creation, the way snapshots work is as follows:
You need to remember each dataset has its own snapshots.
From the very first day the snapshots are taken, the number of snapshots per dataset will increase by 1 every 5 minutes. Over 1 hour you will have about 12 snapshots.
Over 3 hours, you will have 36 snapshots. If you were to specify a lifespan covering a year's worth of snapshots, the count would keep growing at a rate of 1 snapshot every 5 minutes and you could end up with thousands of them.
But because you have only a 3-hour lifespan, snapshots older than 3 hours will be destroyed by ZFS on the source server.
So with automatic snapshots, the maximum number of snapshots per dataset on your system shouldn't exceed 36.
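The arithmetic above can be sketched quickly. The 3-hour/5-minute figures are the ones from the example, and the `zfs list` line at the end uses a made-up dataset name, so adjust it for your pool:

```shell
# Steady-state snapshot count = lifespan / interval
# (3-hour lifespan, 5-minute interval, per the example above).
lifespan_minutes=$((3 * 60))
interval_minutes=5
max_snapshots=$((lifespan_minutes / interval_minutes))
echo "expected max snapshots per dataset: $max_snapshots"   # 36

# To count what is actually on a pool (hypothetical dataset name):
# zfs list -H -t snapshot -o name -r tank/mydata | wc -l
```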
On the first replication, the source only saw 36 snapshots and started sending them to the destination server. Lots of data at first, but this is fine. As replication takes place, those snapshots get replicated.
When replication takes place, I believe ZFS places a temporary hold on the snapshot being replicated. This means that snapshot, and the snapshots after it, won't be deleted even once they have outlived their expected lifespan. This is a safety net from ZFS. But as soon as the replication is complete, the temporary hold is removed and the old snapshots are destroyed.
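If you want to poke at this yourself, holds can be inspected and managed from the CLI. The dataset and snapshot names below are made up for illustration only:

```shell
# List any holds on a snapshot (a replication in progress may show one):
zfs holds tank/mydata@auto-20210101.0000-3h

# Place a manual hold so the snapshot survives past its lifespan:
zfs hold keepme tank/mydata@auto-20210101.0000-3h

# Release the hold so it can be destroyed again:
zfs release keepme tank/mydata@auto-20210101.0000-3h
```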
1) If you have "Delete Stale Snapshots...." enabled, the replication process will force snapshots that have outlived their expected lifespan to be deleted, but the replication should still start properly, because ZFS, when looking to do an incremental send, needs to find the youngest common snapshot present on both source and destination.
In your case, possibly all the snapshots that exist on the destination no longer exist on the source, and as a result I suspect the replication task is forcing a full replication and destroying the content of the replicated data in order to start over.
2) Depending on the number of snapshots, it is possible the replication task is scanning the source and destination servers looking for common snapshots, and this may be a slow process because it can't reconcile the snapshots easily.
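Either way, the key question is whether the two sides still share a snapshot. Here is a rough way to check by hand, using made-up snapshot names; on a real system you would collect each list with something like `zfs list -H -t snapshot -o name -r <dataset>` on each server:

```shell
# Simulated snapshot name lists, sorted (auto snapshot names sort by time):
printf 'auto-0900\nauto-0905\nauto-0910\n' | sort > src_snaps.txt
printf 'auto-0850\nauto-0855\nauto-0900\n' | sort > dst_snaps.txt

# comm -12 prints only the lines common to both sorted files;
# the last one is the youngest common snapshot.
common=$(comm -12 src_snaps.txt dst_snaps.txt | tail -n 1)
echo "youngest common snapshot: $common"   # auto-0900

rm -f src_snaps.txt dst_snaps.txt
```

If that list comes back empty, an incremental send is impossible and the destination has to be re-seeded from scratch, which matches the wipe-and-start-over behaviour I suspect here.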
The reason I asked about the HDD (not HHD, as you pointed out) was to look for SMR drives. Those can add significant delays to replication when snapshots are being destroyed, which could have accounted for longer-than-normal operations. It seems I missed that part of your original post.
I suspect case 1) is the reason for your problem.
I think the replication task is looking for a snapshot it can't find and, as a result, wipes the data on the destination in order to start fresh.
The answer would be to increase the lifespan of the snapshots, but I would not make it 2 weeks either; I would make it a few weeks or months to be safe.
Or, if you don't want that, you can add another automatic snapshot task with a longer lifespan but a longer interval, such as every 12 hours with a lifespan of 6 months.
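To get a feel for the cost of that second schedule, the steady-state count is again just lifespan divided by interval (assumed numbers: 12-hour interval, 6-month lifespan counted as 180 days):

```shell
# Fine-grained task: 5-minute interval, 3-hour lifespan (as discussed above)
fine_count=$((3 * 60 / 5))

# Coarse task: 12-hour interval, ~6-month (180-day) lifespan
coarse_count=$((180 * 24 / 12))

echo "fine-grained snapshots kept: $fine_count"    # 36
echo "coarse long-lived snapshots: $coarse_count"  # 360
```

So the long-lived task only adds a few hundred snapshots per dataset, while making sure a common snapshot still exists on both sides when the weekly replication runs.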