ZFS replication: starting 35 TiB from scratch for the fifth time

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
I'm so #&?&#% pissed. The family has lost Plex for over a month now, and it's impacting my professional life as well.

Now that I've let it out, please explain to me why in the world there's no way to replicate this dataset without all my data getting deleted when the replication restarts. All my other replication tasks are working like they always have, but this one is stubborn. Every time I make it to 20-25 TiB, something happens and the replication stops. That's expected behaviour for a transfer that takes days to complete, and usually it would resume where it left off. That's no longer the case, and I can't take it anymore.

Please could someone review these settings and tell me why it wouldn't work?

[screenshot: replication task settings]


This is the related snapshot task, now set to a 7-day lifetime, which is more than the number of days the transfer takes to complete, so it should be okay.

[screenshot: periodic snapshot task settings]


Snapshots that are replicated.

[screenshot: list of replicated snapshots]


The reason I'm replicating only 7 days of snapshots is that I don't have enough space on the destination to keep more than that. The destination has 55 TiB total, and the source could fill up way more than that. I had everything working with 1 month of retention, but something happened that destroyed 10 TiB of hardlinks. It took a week to fix with jdupes, and that filled my destination to 100% and broke the system.

I can't believe there's no way to do this? I'm gonna try setting it to 14 days, but ideally it would have been 7 days.

Thanks for helping; it's been the worst homelabbing month of my life, and I constantly have this image in my mind of burning the rack down. But that would be stupid, since I'm so close to a perfectly fine setup. I guess sometimes everything just wants to break at the same time and drive you crazy.
 

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
I'm now trying to increase this

[screenshot]


I've now unchecked Recursive, since there's no child dataset in there.

[screenshot: Recursive option unchecked]


I checked this box (Prevent source system snapshots that have failed replication from being automatically removed by the Snapshot Retention Policy.)

[screenshot: retention policy checkbox enabled]


I've also changed the snapshot task to 2 weeks, and I've checked this box.

[screenshot: snapshot task set to 2 weeks]


If anyone knows a way to replicate only 1 week of snapshots, please let me know.
 
Joined: Oct 22, 2019 · Messages: 3,641
Every time I make it to 20-25 TiB, something happens and the replication stops.
Something other than the replication causes it to stop? (Power loss? Network down? Etc?)

Or is it the replication task itself that spits an error and quits?

Regardless, I thought using the GUI for replications supported resuming from where they were aborted?

It's possible to run a full replication from the command line over SSH in a tmux session, and force the transfer to generate a "resume token". At least that way you can get started.
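Very roughly, the manual version looks like this; `tank/media`, `backup/media`, and `backuphost` are placeholders for your own dataset names and destination host:

Code:
# Run inside tmux so the transfer survives an SSH disconnect.
tmux new -s replication

# Snapshot the source to serve as the base of the full send.
zfs snapshot tank/media@base

# 'zfs receive -s' saves a resume token on the destination
# if the stream gets interrupted partway through.
zfs send -v tank/media@base | ssh backuphost zfs receive -s -F backup/media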



I can't believe there's no way to do this? I'm gonna try setting it to 14 days, but ideally it would have been 7 days.
If anyone knows a way to replicate only 1 week of snapshots, please let me know.
How much and how often are things being deleted/modified on the dataset? Is it in such a way that you're constantly deleting very large files, whose space your snapshots then keep holding on to?

If you're only (or nearly exclusively) adding files, then your snapshots take up essentially no additional space. (In another sense, snapshots don't "grow"; rather, they "stubbornly grab and refuse to let go", which is what affects the dataset's used space.)
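You can verify this yourself: `zfs list -o space` breaks down where the used space goes, and the USEDSNAP column is what the snapshots alone are pinning (dataset name is a placeholder):

Code:
# USEDSNAP = space that would be freed by destroying all snapshots.
zfs list -o space -r tank/media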

But if the problem is that within a span of 7 to 14 days you're on the brink of completely filling your backup pool, you'll have to address this underlying problem.
 

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
Something other than the replication causes it to stop? (Power loss? Network down? Etc?)

Or is it the replication task itself that spits an error and quits?
It was configuration errors at first. See my other topic for the full explanation, but to narrow it down: unchecking a dataset will change the destination path to the root of the pool if there's only one dataset to be replicated on the source. Since I didn't have enough space on the destination, I had to delete everything and restart from scratch.

Otherwise, it was an error with the resume token after two reboots of the server. Sometimes I crashed the server and had to reboot. Usually it uses the resume token, but that time I got a message saying the resume token was corrupted and the replication couldn't continue.

Regardless, I thought using the GUI for replications supported resuming from where they were aborted?
Yes it does, and it has worked many times. I don't understand why this just happened.

How much and how often are things being deleted/modified on the dataset? Is it in such a way that you're constantly deleting very large files, whose space your snapshots then keep holding on to?
Not often, but yes. The big issue was the hardlinks incident that I still can't explain. It still happens sometimes, like deleting terabytes of the full set of game console XYZ after I've chosen the games I want to keep. I'm a data hoarder, and sometimes I get rid of huge chunks of data.
 

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
The other thing I can't explain is that I have 7 snapshots, as you can see.

[screenshot: snapshot list showing 7 snapshots]


Yet it's saying there are now 315 snapshots to be sent. The number of snapshots being sent never matches the number of snapshots that exist on my dataset.

[screenshot: replication progress showing 315 snapshots to send]


And it's still going up as we speak. There's something I can't explain here!
 
Joined: Oct 22, 2019 · Messages: 3,641
but that time I got a message saying the resume token was corrupted and the replication couldn't continue.
I didn't know resume tokens could "corrupt". The only issue I'm familiar with is if they refer to a snapshot that no longer exists on the source dataset.
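If it ever happens again, the partially received state can at least be inspected and, as a last resort, thrown away on the destination so a clean incremental can start (dataset name is a placeholder):

Code:
# Show the saved resume token, if any, on the destination dataset.
zfs get -H -o value receive_resume_token backup/media

# Abort the interrupted receive and discard its partial state;
# the next send then restarts from the last common snapshot.
zfs receive -A backup/media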



It still happens sometimes, like deleting terabytes of the full set of game console XYZ after I've chosen the games I want to keep. I'm a data hoarder, and sometimes I get rid of huge chunks of data.
You'll have to figure out an alternative approach, because that isn't feasible with ongoing snapshots and replications to a destination pool that can easily be filled to capacity.

If anything, such activity should be done on a separate dataset that does not replicate to the backup pool. Beyond that, such a dataset should probably only use a few manual snapshots that you prune yourself. Saving and then deleting terabytes of data at a time (data that exists on other media anyway) is an inefficient approach, and mixing it in with your backups only makes the problem worse. :confused:



Yet it's saying there are now 315 snapshots to be sent. The number of snapshots being sent never matches the number of snapshots that exist on my dataset.
And it's still going up as we speak. There's something I can't explain here!
Well that looks like a bug if I've ever seen one.
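If you want to sanity-check the GUI's count against what ZFS itself reports, counting the snapshots on the source (dataset name taken from your log) is a one-liner:

Code:
# List the dataset's snapshots without the header line and count them.
zfs list -H -t snapshot -o name -s creation eden/media | wc -l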



One avenue to try, if you're up for it, is to just do the initial replication over SSH + tmux + resume token. At least you'll have total control over it, and you won't be dealing with the middleware or GUI for the first 35 TiB. (You'll have to disable the replication task to make sure it doesn't try to run automatically.)
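And if that manual transfer does get interrupted, the saved token lets you resume mid-stream instead of starting over (host and dataset names are placeholders):

Code:
# Fetch the resume token left behind by the interrupted receive.
token=$(ssh backuphost zfs get -H -o value receive_resume_token backup/media)

# Resume the send from exactly where it stopped; '-s' on the
# receive keeps it resumable if it gets cut off again.
zfs send -v -t "$token" | ssh backuphost zfs receive -s backup/media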
 

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
I didn't know resume tokens could "corrupt". The only issue I'm familiar with is if they refer to a snapshot that no longer exists on the source dataset.
Here's the log showing it since it just happened!

Code:
[2023/11/14 21:55:30] WARNING  [replication_task__task_13] [zettarepl.replication.partially_complete_state] Specified receive_resume_token, but received an error: contains partially-complete state. Allowing ZFS to catch up
[2023/11/14 21:56:31] INFO     [replication_task__task_13] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2023/11/14 21:56:31] INFO     [replication_task__task_13] [zettarepl.replication.run] Resuming replication for destination dataset 'sloche/media'
[2023/11/14 21:56:31] INFO     [replication_task__task_13] [zettarepl.replication.run] For replication task 'task_13': doing push from 'eden/media' to 'sloche/media' of snapshot=None incremental_base=None include_intermediate=None receive_resume_token='1-1120ef79f1-f0-789c636064000310a500c4ec50360710e72765a526973030a44541d460c8a7a515a7968064ec9a61f26c48f2499525a9c5405ae07342e14c2cfa4bf2d34b335318180a0b4d228f94de71744092e704cbe725e6a63230a4a6a4e6e9e7a6a664263a249696e4eb1a191819eb1a1aea1a58c41b18e81a18303020dcc7cd80f04f727e6e41516a71717e36031c00000c132295' encryption=False
[2023/11/14 21:56:31] INFO     [replication_task__task_13] [zettarepl.transport.ssh_netcat] Automatically chose connect address '10.0.10.13'
[2023/11/14 21:56:31] WARNING  [replication_task__task_13] [zettarepl.replication.partially_complete_state] Specified receive_resume_token, but received an error: contains partially-complete state. Allowing ZFS to catch up
[2023/11/14 21:56:31] ERROR    [replication_task__task_13] [zettarepl.replication.run] For task 'task_13' non-recoverable replication error ContainsPartiallyCompleteState()


You'll have to figure out an alternative approach, because that isn't feasible with ongoing snapshots and replications to a destination pool that can easily be filled to capacity.
I'm waiting for the Black Friday deals, or for the $/TB to drop a bit, before buying 4 more 20 TB drives. I just can't afford them at the current price.

If anything, such activity should be done on a separate dataset that does not replicate to the backup pool. Beyond that, such a dataset should probably only use a few manual snapshots that you prune yourself. Saving and then deleting terabytes of data at a time (data that exists on other media anyway) is an inefficient approach, and mixing it in with your backups only makes the problem worse. :confused:
I never considered that before. If I take a manual snapshot, that means I'd have all the time I want to replicate my pool.

Then, once the first replication is complete, I could set up the daily snapshot task plus a replication with a 7-day lifetime.

I wonder what would happen to the first snapshot if I delete it after a few days.

The idea would be to keep going forward with the daily snapshots set to a 7-day lifetime.
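If I understand the idea right, the rough shape would be something like this (dataset name taken from my log; the snapshot name is just made up):

Code:
# One manual snapshot that no retention policy will prune.
zfs snapshot eden/media@manual-base

# Replicate it at my own pace with the GUI task disabled, then turn on
# the daily snapshot task + 7-day replication once it has landed.
# Deleting @manual-base later should be safe as long as a newer
# snapshot exists on both sides to act as the incremental base.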

Well that looks like a bug if I've ever seen one.
It's the fourth bug I've found in TrueNAS in a week. I can't wait to go back to my normal life, haha!

One avenue to try, if you're up for it, is to just do the initial replication over SSH + tmux + resume token. At least you'll have total control over it, and you won't be dealing with the middleware or GUI for the first 35 TiB. (You'll have to disable the replication task to make sure it doesn't try to run automatically.)
I'm afraid that if I go that route, the whole process will restart once I set up the replication task in the GUI afterwards.

At this point I'm just getting PTSD.
 

Kris Moore
SVP of Engineering · Administrator · Moderator · iXsystems
Joined: Nov 12, 2015 · Messages: 1,471
This is pretty bizarre; I don't recall any major issues open against the replication engine right now. Can you please open a bug ticket with detailed information + debugs on this?
 

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
This is pretty bizarre; I don't recall any major issues open against the replication engine right now. Can you please open a bug ticket with detailed information + debugs on this?
Before doing this, I just want to confirm whether my ticket will be thrown out because I enabled apt to install jdupes to fix 10 TiB of hardlinks. Anyone with a sane mind would have fixed 10 TiB worth of hardlinks. I didn't like it last time when they killed my ticket, telling me I was using a heavily modified system. I mean...

EDIT: Or I could reinstall a fresh Cobia before filing the bug report, since it's a fairly quick process, but I don't have much time atm. I just want my data backed up ASAP!
 

Kris Moore
SVP of Engineering · Administrator · Moderator · iXsystems
Joined: Nov 12, 2015 · Messages: 1,471
Before doing this, I just want to confirm whether my ticket will be thrown out because I enabled apt to install jdupes to fix 10 TiB of hardlinks. Anyone with a sane mind would have fixed 10 TiB worth of hardlinks. I didn't like it last time when they killed my ticket, telling me I was using a heavily modified system. I mean...

EDIT: Or I could reinstall a fresh Cobia before filing the bug report, since it's a fairly quick process, but I don't have much time atm. I just want my data backed up ASAP!

Yes, a fresh install to reproduce would be ideal. You've got to look at it from our perspective: TrueNAS is an appliance. When we get issues on a system that has been modified from our base image, we're no longer troubleshooting the same thing we already sent through a bunch of QA cycles to shake out bugs. There's nothing more frustrating than digging into an issue and finding out it was caused by somebody's local modifications breaking what was working perfectly before.
 

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
Let's all hope this time is the good one!

[screenshot: replication task running]


Still weird that it says 40 snapshots when I see

[screenshot: snapshot list]


Fingers crossed! I might buy 4 more 20 TB drives if there are good Black Friday deals, and then I'll have to redo the whole process again!
 