Snapshot/Replication Question

onigiri

Dabbler
Joined
Jul 17, 2018
Messages
32
Hello iX Community!

I searched for a while and couldn't find an exact answer to this question.

Devices:
I have two (2) identical 500 TB NASs (built by 45Drives). One is on our local network (source) and one is in a remote office in another state that we replicate our snapshots to (destination). (1 Gb/s on both ends.)

Problem:
Last week, due to... reasons... I changed the snapshot retention policy on the source from 1 week to 2 weeks. Everything was fine until the last xxxxxxxxx-1w snapshot cycled out; then, when it began to replicate the xxxxxxxx-2w snapshots, it appeared to start replicating the entirety of each dataset again. This is extremely bad, because that is 100+ TB going at ~15-20 Mbps. We specifically seeded the initial datasets more than a year ago, when both NASs were in the same building and could transfer the data over our 10G network.

The destination pools still appear to contain all of the (visible) data.

Is the replication task writing all of the data from each dataset again?

Let me know if you need more data.



Screenshots (company-specific names blurred):

Source: pool overview plus snapshot and replication task settings (2020-02-19_8-52-41.png through 2020-02-19_8-55-16.png)

Destination: pool and received snapshots (2020-02-19_9-00-24.png, 2020-02-19_9-01-39.png)
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
I changed the retention policy on snapshots on the source from 1 week to 2 weeks.

Hi,

Are you sure that is what you did? I think what you actually did was start a brand-new snapshot task with 2-week retention rather than modify the existing one. Because that task is brand new, its first snapshot indeed refers to the entire set of data: the 3.35 TB indicated in your last screenshot.
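
If you want to confirm it, compare the names and creation dates of the snapshots on the source; a new task normally shows up as a new naming scheme. A quick check (the dataset name tank/data is just a placeholder):

    # List snapshots oldest-first with creation time and space usage
    zfs list -t snapshot -o name,creation,used,referenced -s creation tank/data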

Messing with replication between systems that far apart is something I would not have risked... From what I see in your screenshots, I doubt there is any recovery short of bringing that server back for a fresh local sync...

Let's see if others here know better...
 

onigiri

Dabbler
Joined
Jul 17, 2018
Messages
32
I 100% edited the snapshot task and changed retention from 1W to 2W. No new tasks were created at all.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
OK... So snapshots point to each other. When one is removed, its content is either removed as well or becomes associated with the next snapshot that refers to it.

Question: Do you still have a VERY old snapshot, probably a manual one, that was taken before your 1w sets?

Here, I have a manual "Master" snapshot 0 that I created at the beginning. After that, I started my regular scheduled snapshots. That Master does not expire; only the scheduled ones do, and even then, some expire only after 4 years...

Replication transfers the snapshots, not the actual live data. It is only once data are frozen in a snapshot that they are transferred.

What I see here is that my oldest snapshots get bigger and bigger as the smaller ones between them expire. Since you no longer had a snapshot older than the last 1w one, I wonder if some content ended up no longer referenced anywhere and was then required again by the new snapshots...
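
One thing worth checking is whether the two systems still share a common snapshot, because an incremental send needs the same base snapshot on both ends. A rough sketch (host and dataset names are placeholders):

    # On the source:
    zfs list -t snapshot -o name,creation -s creation tank/data

    # On the destination (e.g. over SSH):
    ssh destination zfs list -t snapshot -o name,creation -s creation tank/data

    # An incremental send only works from a snapshot present on BOTH sides:
    zfs send -i tank/data@common tank/data@new | ssh destination zfs receive tank/data

If no snapshot exists on both sides anymore, replication has no choice but to start over with a full send.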

I will keep thinking about it, but I am afraid this will be more about understanding what happened and why than about finding a way to fix it.

Again, maybe others know better about this...
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Question: Was there any gap between the two series of snapshots? When was the last 1w snapshot taken, and when did it expire?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi again Onigiri,

In any case, because this looks to be time-related, one thing you can do is clone the oldest snapshot of each dataset. A clone does not take any extra space, and it will not expire by default. Should there be something to preserve as of right now, a clone should do it.
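
Something like this for each dataset (snapshot and clone names are placeholders):

    # A clone consumes no extra space and has no expiry;
    # the snapshot it is based on cannot be destroyed while the clone exists
    zfs clone tank/data@auto-20200212.0000-2w tank/data-keep

That dependency is also what protects the snapshot from the retention policy.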

I keep trying to understand exactly what happened...

What makes no sense is that if the new 2w snapshots were not linked properly to the previous 1w ones, the symptom should have started with the first 2w snapshot. So why did the linking fail only once there were no more 1w snapshots in the queue...
 

onigiri

Dabbler
Joined
Jul 17, 2018
Messages
32
@Heracles

Thanks for taking the time to answer my questions. I have just retrieved the remote NAS and brought it onsite. Question: can I use Robocopy or some other robust copy method to sync the data and then turn the snapshot replication back on? I ask because replicating 100 TB via the built-in replication task (zfs send/receive) is going to take much longer than my multithreaded machines pushing that data over.

Thoughts on the best method?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi again,

If you do your replication with Robocopy onsite, you will have to stick with Robocopy once the server is remote again, because ZFS will not know in which state Robocopy is or was.

Robocopy handles files, while ZFS handles blocks. ZFS can replicate zvols; Robocopy cannot. Anything block-level will also be managed properly by ZFS, but not by a file-level tool like Robocopy.

So I would really recommend you do the ZFS replication onsite and stick with it once the server is remote again.
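
Roughly, with both servers on the same LAN (host, dataset, and snapshot names are placeholders):

    # Take a fresh recursive snapshot on the source
    zfs snapshot -r tank/data@seed

    # Full replication over the local 10G link; -R preserves the
    # snapshot tree, -F rolls the destination back to a clean state
    zfs send -R tank/data@seed | ssh second-server zfs receive -F tank/data

    # Once the server is remote again, only increments travel the WAN:
    # zfs send -i tank/data@seed tank/data@next | ssh ... zfs receive tank/data

The built-in replication task does the same thing under the hood, so you can also just let it run the initial sync while the servers are side by side.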

Also, to protect yourself against such an incident: once your sync is done, take an extra manual snapshot, one that will not expire. Then every 3 months or so, create a new manual snapshot, and once it has replicated, delete the previous manual one. With these permanent snapshots, you will never again drop to the same low point.
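
A sketch of that rotation (snapshot labels are placeholders); the hold is optional extra insurance against an accidental destroy:

    # Quarterly manual snapshot that no retention policy will touch
    zfs snapshot -r tank/data@manual-2020Q1
    zfs hold -r keep tank/data@manual-2020Q1

    # Next quarter, after @manual-2020Q2 has replicated:
    zfs release -r keep tank/data@manual-2020Q1
    zfs destroy -r tank/data@manual-2020Q1

This way the source and destination always share at least one common snapshot to send increments from.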

I still do not understand what happened, but given the symptom, this procedure should protect you.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Replication only works on top of previously replicated snapshots. rsync won't help.
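
Incremental zfs send matches snapshots by GUID, not by file contents, so files copied over by rsync or Robocopy can never act as a base for replication. You can see the mismatch yourself (dataset and snapshot names are placeholders):

    # The GUID must be identical on both sides for an incremental send
    zfs get -H -o value guid tank/data@base
    ssh destination zfs get -H -o value guid tank/data@base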
 

onigiri

Dabbler
Joined
Jul 17, 2018
Messages
32
OK, so deleting all snapshots from both units, then copying the data to the 2nd unit, and then starting snapshots again will not work?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
If you delete the snapshots from the original server, you will lose them; you do not need to delete them from the source. Now that you have the second server onsite, the sync will do whatever it needs to do to bring that server up to date. Once the sync is done, check whether any old stuff survived on the second server. If some did, delete it, wait for the primary server to take a new snapshot, and sync it. Once everything is in sync and no extra stuff is present on the second server, take the manual snapshot I recommended, and then you are ready to ship that server back.

Good luck with all of that,
 