How to set up ZFS Replication to only send the latest snapshot

CheeryFlame

Contributor
Joined
Nov 21, 2022
Messages
184
I got a smaller off-site server to back up my main server every day at midnight.

I can't afford a lot of space on the backup server. Currently, if I choose "Same as Source", it's too much data.

[Screenshot: retention = 1 month, which I don't have enough space for]

I'm not sure what to do so that TrueNAS keeps only the latest snapshot on the backup server. If I set it to 1 day, does that mean that once a new snapshot has been successfully replicated, snapshots older than 1 day will be deleted? In that case it would delete only one snapshot. Or will the data in the snapshot be deleted after one day whether or not there's a new replication?

[Screenshot: what I think would be optimal]

Since I was afraid of the latter, I've set the snapshot retention to 1 week, because I have 48 TB to transfer at this time.

[Screenshot: what's currently set up for the initial transfer]

Now, after the transfer, can I set the retention to 1 day so that on the next replication, snapshots older than 1 day will be removed?


Unnecessary backstory of how I ended up in this situation:
  1. The backup server filled up to 100% in one night
  2. I deleted all the snapshots, but the datasets appeared empty even though they showed as mounted in the GUI
  3. I manually mounted all datasets in CLI and my data was back
  4. Pool usage dropped from 56 TB (full) to 48 TB
  5. I started the replication task again on the main server
  6. Everything got deleted, and now the whole 48 TB is transferring all over again
 

CheeryFlame

Contributor
Joined
Nov 21, 2022
Messages
184
I would really appreciate a heads-up on this question.

Thank you!
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Why not try to set it to 2 or 3 days and then count the snapshots on the target? I honestly don't know what "1 day" with a daily schedule means.
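
For reference, counting the snapshots on the target only takes one command. Here is a minimal sketch using Python's subprocess module, assuming a hypothetical destination dataset called backup/tank; run it on the backup box, or prefix the zfs command with ssh.

Code:
import subprocess

# List the snapshots of the replicated dataset on the destination and count them.
# "backup/tank" is a hypothetical dataset name - adjust to your layout.
out = subprocess.run(
    ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-r", "backup/tank"],
    capture_output=True, text=True, check=True,
).stdout
snapshots = [line for line in out.splitlines() if line]
print(f"{len(snapshots)} snapshots on the destination:")
for name in snapshots:
    print(" ", name)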
 

CheeryFlame

Contributor
Joined
Nov 21, 2022
Messages
184
Why not try to set it to 2 or 3 days and then count the snapshots on the target? I honestly don't know what "1 day" with a daily schedule means.
My understanding of the snapshot lifetime option is that once the replication task succeeds, it will check for all snapshots older than (in my case) 1 day and delete them.

I wrote this post to get confirmation of this ^

If the behaviour is that snapshots get automatically deleted after 1 day on their own, simply because that lifetime was set in the replication task, then it's problematic and data loss will occur.

If the ZFS replication task deletes the snapshots, then no data will be lost, because it will wait for the current replication to finish before deleting old snapshots.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Ah ... wait a minute. It's complex and I obviously did not fully grasp your setup.

So ... there is:

Snapshot lifetime. On the source, set it to as long as you can afford. Older snapshots are deleted on the source. This is in the snapshot task, independent of any replication.

Retention policy.
On the destination. In your replication task look at the right side and change that from "Same as source" to "Custom" and you can set an arbitrary fitting duration just for the destination.
 

CheeryFlame

Contributor
Joined
Nov 21, 2022
Messages
184
Ah ... wait a minute. It's complex and I obviously did not fully grasp your setup.

So ... there is:

Snapshot lifetime. On the source, set it to as long as you can afford. Older snapshots are deleted on the source. This is in the snapshot task, independent of any replication.

Retention policy.
On the destination. In your replication task look at the right side and change that from "Same as source" to "Custom" and you can set an arbitrary fitting duration just for the destination.
Still, this doesn't answer my question of who is deleting my snapshots on the destination system.

a. The replication task, the next time it runs?
b. The destination server itself, after the retention policy has been met?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
The replication task. The destination server is a passive SSH and ZFS target. Need not even be TrueNAS. Or to put it differently: the destination server has absolutely zero knowledge that a replication is even going on. It's just a bag of datasets or volumes and their snapshots.
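
Here is a rough sketch of that idea (hypothetical host and dataset names, not the exact commands the TrueNAS middleware runs): everything is driven from the source side, which pipes a zfs send stream over SSH and later prunes old snapshots on the destination over SSH as well.

Code:
import subprocess

# Incremental replication, driven entirely from the source. The destination
# just receives a stream over SSH; it never "knows" a replication task exists.
send = subprocess.Popen(
    ["zfs", "send", "-i", "tank/data@auto-2023-11-02", "tank/data@auto-2023-11-03"],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["ssh", "backup-host", "zfs", "recv", "backup/data"],
    stdin=send.stdout, check=True,
)
send.stdout.close()
send.wait()

# Retention on the destination is enforced the same way, from the source:
subprocess.run(
    ["ssh", "backup-host", "zfs", "destroy", "backup/data@auto-2023-11-02"],
    check=True,
)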
 

CheeryFlame

Contributor
Joined
Nov 21, 2022
Messages
184
The replication task. The destination server is a passive SSH and ZFS target. Need not even be TrueNAS. Or to put it differently: the destination server has absolutely zero knowledge that a replication is even going on. It's just a bag of datasets or volumes and their snapshots.
Thank you for confirming this. I'll set my retention policy to 1 day, so that only the latest version of the data is kept on the destination.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
That's why I recommended setting it to two days, running one replication, then counting the snapshots on the destination. With a borderline interpretation of "1 day", there is a small chance it means "delete both of the latest snapshots when they are exactly 24 hours apart". Unless you test, probably nobody can tell for sure.
 

CheeryFlame

Contributor
Joined
Nov 21, 2022
Messages
184
Yesterday the replication task finally finished after several painful days of patience.

I woke up this morning, and all my data on the destination had been deleted and a new replication is going on.

[Screenshot of the replication task option in question]


I always had that box checked in the past and it never happened.

How is it possible that sending the exact same data deletes the data that is an exact copy from a day ago?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Snapshots point to each other. So you need 100% of your data at time 0: that is snapshot 0, and it takes 100% of your actual volume.
Next snapshot (snapshot 1) contains only what changed since 0. (making it of size A)
Next snapshot (snapshot 2) contains only what changed since 1. (making it of size B)

Let's presume that whatever changed in snapshot 1 remained the same in snapshot 2, and that whatever changed in snapshot 2 is elsewhere.

You delete snapshot 1.

To avoid ending up with "holes" in the data, whatever was referenced in snapshot 1 is "moved" into snapshot 2.

That makes it size A+B.

As such, you did not save a single bit of data by deleting snapshot 1.

You will save space only for what changed in both snapshots 1 and 2. In that case, you lose the "version" in snapshot 1 and keep only the version from snapshot 2.

Still, that will not save you that much data.

So I recommend you design a solution for which the snapshots will be the same on both sides.
Reduce your snapshots at the source or increase the capacity at the destination.

If neither of these is an option, you may be better off looking at another synchronisation tool like rsync, to sync only the latest version of the files from A to B and propagate deletions as well.
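
For completeness, a minimal sketch of that rsync approach (hypothetical paths and hostname): -a preserves attributes and --delete propagates deletions, so the destination only ever holds the latest version of each file.

Code:
import subprocess

# Mirror the live dataset to the backup host, propagating deletions.
# Paths and hostname are placeholders.
subprocess.run(
    ["rsync", "-a", "--delete",
     "/mnt/tank/data/",                     # trailing slash: sync the contents
     "backup-host:/mnt/backup/data/"],
    check=True,
)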
 

CheeryFlame

Contributor
Joined
Nov 21, 2022
Messages
184
Thank you for replying. I'll try to translate it into a concrete situation to make sure I understand.

Let's say my dataset has a daily snapshot task with a 1-month lifetime on the source, and my destination snapshot retention policy is set to 3 days.



Day #1 - I have 100GiB of data

Source - Snapshot 0 = 100GiB
-

Destination - Snapshot 0 = 100GiB

100% of the data is sent to the destination as snapshot 0. The snapshot won't be visible in the snapshot list on either system.



Day #2 - I'm deleting 10 GiB of data

Source - Snapshot 0 = 90GiB
Source - Snapshot 1 = 10GiB
-

Destination - Snapshot 0 = 90GiB
Destination - Snapshot 1 = 10GiB


10 GiB has been deleted from snapshot 0 and is now in snapshot 1. 90 GiB of data is now referenced in snapshot 1 but stays in snapshot 0.



Day #3 - I'm adding 5GiB of data

Source - Snapshot 0 = 95GiB
Source - Snapshot 1 = 10GiB
Source - Snapshot 2 = 0
-

Destination - Snapshot 0 = 95GiB
Destination - Snapshot 1 = 10GiB
Destination - Snapshot 2 = 0


I added 5 GiB of data, which is stored in snapshot 0. At midnight the replication task will send the new 5 GiB of data to the destination system's snapshot 0.



Day #4 - Destination is hit by the 3-day retention policy

Source - Snapshot 0 = 95GiB
Source - Snapshot 1 = 10GiB
Source - Snapshot 2 = 0

Source - Snapshot 3 = 0
-
Destination - Snapshot 0 = 95GiB
Destination - Snapshot 2 = 10GiB
Destination - Snapshot 3 = 0

Snapshot #1 is deleted on the destination. On the destination, the 10 GiB of data from snapshot #1 is moved into snapshot #2.



Day #5 - Snapshots are now different on source and destination systems

Source - Snapshot 0 = 95GiB
Source - Snapshot 1 = 10GiB
Source - Snapshot 2 = 0
Source - Snapshot 3 = 0

Source - Snapshot 4 = 0
-

Destination - Snapshot 0 = 95GiB
Destination - Snapshot 2 = 10GiB
Destination - Snapshot 3 = 0




Is it from that point that the replication task will delete everything on the destination system and copy all the data from source to destination again?

If that's the case, that option should be greyed out when you input a retention policy that is shorter than the source snapshot task's lifetime.

I can't see why someone would send data just to have to send it all over again every time the retention policy is hit.

So that leaves the option useful only if you're sending a one-time backup to the destination, or if the retention policy is longer on the destination than on the source system.

Correct me if I'm wrong; I've spent quite a lot of time reading and trying to understand how this works, and I'm still feeling unsure.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I cannot give you a final answer on your retention problem off the top of my head, but maybe we can get there by collectively thinking about the way snapshots are implemented. Your snapshot space calculations are not quite correct, and I think I know where you went wrong.

Snapshots are not copies.

So you have 100 G in your dataset and you take snapshot 0.

Snapshot 0 takes up 0 G.
The dataset takes up 100 G.

There is no difference between the two. The snapshot only records the state of the dataset at a certain point in time - it is not a copy of those 100 G. Otherwise snapshots could not be instantaneous. It will take up a couple of K or M for bookkeeping but for the sake of this discussion let's view that as 0 G - ok?

You delete 10 G

The dataset takes up 90 G.
Snapshot 0 takes up 10 G. Namely the blocks of the data you deleted that are not referenced in the live dataset anymore.

You take snapshot 1. It takes up 0 G.

You add 10 G.

Snapshot 0 takes up 10 G of the invisible deleted data.
The dataset takes up 100 G.

110 G total because that's all data you created since you started taking snapshots.

Snapshot 1 - still 0 G.

You take snapshot 2 - 0 G.

You delete another 10 G.

Snapshot 0 references the dataset at the point it was created so it now takes up 20 G.

Snapshot 1 - would take up 10 G but these are also in snapshot 0, so it's just some bookkeeping.
Snapshot 2 - same as 1.

Now you delete snapshot 0.

So the initially deleted 10 G are gone for good. The data blocks are freed.

The second deleted 10 G are still referenced in snapshot 1. So now snapshot 1 takes up those 10 G.

Snapshot 2 - 0 G, because the 10 G are already accounted for in snapshot 1.

Dataset: 90 G again. Total 100.


Relevant is what changes between snapshots. If nothing changes no space is used.
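
Here is a toy model of that bookkeeping, under the simplifying assumptions above: the dataset is a set of 1 G "blocks", and deleted data is charged to the oldest snapshot that still references it. (ZFS's own used/referenced accounting differs in detail, but the totals come out the same as in the walkthrough.)

Code:
# Toy model: blocks are integers, 1 "block" = 1 G.
def charged(snapshots, live):
    """Space charged to each snapshot: blocks it references that are neither
    in the live dataset nor in any older snapshot."""
    seen = set(live)
    sizes = {}
    for name, blocks in snapshots:          # oldest first
        sizes[name] = len(blocks - seen)
        seen |= blocks
    return sizes

live = set(range(100))                      # 100 G in the dataset
snap0 = set(live)                           # take snapshot 0 -> ~0 G extra
live -= set(range(10))                      # delete 10 G
snap1 = set(live)                           # take snapshot 1 -> ~0 G extra
live |= set(range(100, 110))                # add 10 G
snap2 = set(live)                           # take snapshot 2 -> ~0 G extra
live -= set(range(10, 20))                  # delete another 10 G

print(charged([("snap0", snap0), ("snap1", snap1), ("snap2", snap2)], live))
# {'snap0': 20, 'snap1': 0, 'snap2': 0}   -- dataset 90 G, 110 G on disk

# Destroy snapshot 0: its first deleted 10 G are freed for good, and the second
# deleted 10 G are now charged to snapshot 1.
print(charged([("snap1", snap1), ("snap2", snap2)], live))
# {'snap1': 10, 'snap2': 0}               -- dataset 90 G, 100 G on disk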


Now for replication. You do not replicate a snapshot as if it were some separate entity. You replicate the dataset at the point in time when the snapshot was taken. So:

Replicate snapshot 0 - 100 G
Replicate snapshot 1 and snapshot 0, because "retention" - 110 G.
Delete snapshot 0 at the destination - 100 G.

And for the differential replication to work you need at least one common older snapshot on both sides. Then you can say "replicate the dataset at the point in time of snapshot X, but only send the differential from snapshot Y", Y being the common older one.
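
This is also the likely reason for the full 48 TB re-send described earlier in the thread: once the snapshots the two sides had in common are gone, an incremental stream is no longer possible and only a full send remains. A sketch of that decision, with hypothetical snapshot names:

Code:
# Incremental send needs a snapshot present on BOTH sides.
source_snaps = {"auto-2023-11-01", "auto-2023-11-02", "auto-2023-11-03"}
dest_snaps = {"auto-2023-10-01"}            # retention already pruned the rest

common = sorted(source_snaps & dest_snaps)
if common:
    base = common[-1]                       # most recent common snapshot
    cmd = ["zfs", "send", "-i", f"tank/data@{base}", "tank/data@auto-2023-11-03"]
else:
    # No common ancestor: the whole dataset has to cross the wire again.
    cmd = ["zfs", "send", "tank/data@auto-2023-11-03"]
print(" ".join(cmd))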


That's how I understand the theory of ZFS snapshots. Now, what the TrueNAS replication framework makes out of this is a different matter altogether, and best observed by trial on a live system with some test data. One irritating feature, for example, is the fact that the replication tasks rely on the naming of the snapshots for finding a common ancestor as well as for retention.


HTH,
Patrick
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
@gravelfreeman, it seems that you do not grasp the concept of snapshots. Snapshots are not data but metadata: references to data. The referenced data is immutable: once a snapshot is created, it will always refer to the same data.

Day #1 (should really be numbered as "day 0"…)
Source and destination have a (very visible) snapshot which refers to 100 GB (or GiB if you want). The snapshot itself is some MB of internal ZFS data holding the references.

Day #2 delete 10 GB from dataset
Source and destination have 100 GB of data and two snapshots: S0, referring to the full 100 GB, and S1, referring to 90 GB within it (S1 itself is a few MB of metadata describing the difference with its parent S0).

Day #3 add 5 GB to dataset
Source and destination have 105 GB of data, and three snapshots, referring to 100, 90 and 95 GB sets within these 105 GB.

Day #4 S0 gets the axe on destination! (What made you think it would persist?)
Source has the same 105 GB as yesterday, and four snapshots, the last of which occupies just the required space to store "I'm the same as S2".
Destination holds 95 GB of data and three snapshots, S1 (which has taken over all the references it needed from S0 to keep referring to the same data), S2, and S3. Compared to yesterday 10 GB of data are no longer referred to, and their space is now usable as free space.

Day #5 onwards
Destination holds 95 GB of data and three identical snapshots referring to it.
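
You can see this distinction on a live system: for a snapshot, the REFER column is the amount of data it points at, while USED is only the space that would be freed by destroying that one snapshot. A minimal sketch, with a hypothetical destination dataset name:

Code:
import subprocess

# For each snapshot under the (hypothetical) destination dataset, show how much
# data it refers to (REFER) versus how much space it alone pins (USED).
print(subprocess.run(
    ["zfs", "list", "-t", "snapshot", "-o", "name,used,referenced", "-r", "backup/data"],
    capture_output=True, text=True, check=True,
).stdout)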

Edit. Damn, I've been ninja'd…
 