ZFS replication: starting 35 TiB from scratch for the fifth time

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
I'm so #&?&#% pissed. The family has lost Plex for over a month now, and it's impacting my professional life as well.

Now that I've let it out, please explain to me why in the world there's no way to replicate this dataset without all my data getting deleted when the replication restarts. All my other replication tasks are working like they always have, but this one is stubborn. Every time I make it to 20-25 TiB, something happens and the replication stops. That's expected behaviour for a transfer that takes days to complete, and usually it would resume where it left off. That's no longer the case, and I can't take it anymore.

Please could someone review these settings and tell me why it wouldn't work?

[screenshot: replication task settings]


This is the related snapshot task, now set to a 7-day lifetime, which is more than the number of days the transfer takes to complete, so it should be okay.

[screenshot: periodic snapshot task settings]


Snapshots that are replicated.

[screenshot: list of replicated snapshots]


The reason I'm replicating only 7 days of snapshots is that I don't have enough space on the destination to keep more than that. The destination has 55 TiB total, and the source could fill up way more than that. I had everything working with 1 month of retention, but something happened that destroyed 10 TiB of hardlinks. It took a week to fix with jdupes, and that filled my destination to 100% and broke the system.

I can't believe there's no way to do this? I'm gonna try setting it to 14 days, but ideally it would have been 7 days.

Thanks for helping; it's been the worst homelabbing month of my life, and I constantly have this image in my mind of burning the rack down. But that would be stupid, since I'm so close to a perfectly fine setup. I guess sometimes everything just wants to break at the same time and drive you crazy.
 

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
I'm now trying to increase this

[screenshot]


I've now unchecked Recursive, since there's no child dataset in there.

[screenshot: Recursive option unchecked]


I checked this box (Prevent source system snapshots that have failed replication from being automatically removed by the Snapshot Retention Policy.)

[screenshot: retention policy checkbox enabled]


I've also changed the snapshot task to 2 weeks, and I've checked this box.

[screenshot: snapshot task set to 2 weeks]


If anyone knows a way to replicate only 1 week of snapshots, please let me know.
 
Joined: Oct 22, 2019 · Messages: 3,641
Every time I make it to 20-25 TiB, something happens and the replication stops.
Something other than the replication causes it to stop? (Power loss? Network down? Etc?)

Or is it the replication task itself that spits an error and quits?

Regardless, I thought using the GUI for replications supported resuming from where they were aborted?

It's possible to run a full replication from the command line over SSH in a tmux session, and force the transfer to generate a "resume token". At least that way you can get started.
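Very roughly, the manual version looks like this; `tank/media`, `backup/media`, and `backuphost` are placeholders for your own dataset names and destination host:

Code:
# Run inside tmux so the transfer survives an SSH disconnect.
tmux new -s replication

# Snapshot the source to serve as the base of the full send.
zfs snapshot tank/media@base

# 'zfs receive -s' saves a resume token on the destination
# if the stream gets interrupted partway through.
zfs send -v tank/media@base | ssh backuphost zfs receive -s -F backup/media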



I can't believe there's no way to do this? I'm gonna try setting it to 14 days, but ideally it would have been 7 days.
If anyone knows a way to replicate only 1 week of snapshots, please let me know.
How much and how often are things being deleted/modified on the dataset? Is it in such a way that you're constantly deleting very large files, whose space your snapshots then keep holding on to?

If you're only (or nearly exclusively) adding files, then your snapshots take up essentially no additional space. (In another sense, snapshots don't "grow"; rather, they "stubbornly grab and refuse to let go", which is what affects the dataset's used space.)
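You can verify this yourself: `zfs list -o space` breaks down where the used space goes, and the USEDSNAP column is what the snapshots alone are pinning (dataset name is a placeholder):

Code:
# USEDSNAP = space that would be freed by destroying all snapshots.
zfs list -o space -r tank/media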

But if the problem is that within a span of 7 to 14 days you're on the brink of completely filling your backup pool, you'll have to address this underlying problem.
 

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
Something other than the replication causes it to stop? (Power loss? Network down? Etc?)

Or is it the replication task itself that spits an error and quits?
It was configuration errors at first. See my other topic for the full explanation, but to narrow it down: unchecking a dataset will change the destination path to the root of the pool if there's only one dataset to be replicated on the source. Since I didn't have enough space on the destination, I had to delete everything and restart from scratch.

Otherwise, it was an error with the resume token after two reboots of the server. Sometimes I crashed the server and had to reboot. Usually it uses the resume token, but that time I got a message saying the resume token was corrupted and the replication couldn't continue.

Regardless, I thought using the GUI for replications supported resuming from where they were aborted?
Yes it does, and it has worked many times. I don't understand why this just happened.

How much and how often are things being deleted/modified on the dataset? Is it in such a way that you're constantly deleting very large files, whose space your snapshots then keep holding on to?
Not often, but yes. The big issue was the hardlinks incident that I still can't explain. It still happens sometimes, like deleting terabytes of the full set of game console XYZ after I've chosen the games I want to keep. I'm a data hoarder, and sometimes I get rid of huge chunks of data.
 

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
The other thing I can't explain is that I have 7 snapshots, as you can see.

[screenshot: snapshot list showing 7 snapshots]


Yet it's saying there are now 315 snapshots to be sent. The number of snapshots being sent never matches the number of snapshots that exist on my dataset.

[screenshot: replication progress showing 315 snapshots to send]


And it's still going up as we speak. There's something I can't explain here!
 
Joined: Oct 22, 2019 · Messages: 3,641
but that time I got a message saying the resume token was corrupted and the replication couldn't continue.
I didn't know resume tokens could "corrupt". The only issue I'm familiar with is if they refer to a snapshot that no longer exists on the source dataset.
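If it ever happens again, the partially received state can at least be inspected and, as a last resort, thrown away on the destination so a clean incremental can start (dataset name is a placeholder):

Code:
# Show the saved resume token, if any, on the destination dataset.
zfs get -H -o value receive_resume_token backup/media

# Abort the interrupted receive and discard its partial state;
# the next send then restarts from the last common snapshot.
zfs receive -A backup/media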



It still happens sometimes, like deleting terabytes of the full set of game console XYZ after I've chosen the games I want to keep. I'm a data hoarder, and sometimes I get rid of huge chunks of data.
You'll have to figure out an alternative approach, because that isn't feasible with ongoing snapshots and replications to a destination pool that can easily be filled to capacity.

If anything, such activity should be done on a separate dataset that does not replicate to the backup pool. Beyond that, such a dataset should probably only use a few manual snapshots that you prune yourself. Saving and then deleting terabytes of data at a time (data that exists on other media anyway) is an inefficient approach, and mixing it in with your backups only makes the problem worse. :confused:



Yet it's saying there are now 315 snapshots to be sent. The number of snapshots being sent never matches the number of snapshots that exist on my dataset.
And it's still going up as we speak. There's something I can't explain here!
Well that looks like a bug if I've ever seen one.
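If you want to sanity-check the GUI's count against what ZFS itself reports, counting the snapshots on the source (dataset name taken from your log) is a one-liner:

Code:
# List the dataset's snapshots without the header line and count them.
zfs list -H -t snapshot -o name -s creation eden/media | wc -l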



One avenue to try, if you're up for it, is to just do the initial replication over SSH + tmux + resume token. At least you'll have total control over it, and you won't be dealing with the middleware or GUI for the first 35 TiB. (You'll have to disable the replication task to make sure it doesn't try to run automatically.)
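And if that manual transfer does get interrupted, the saved token lets you resume mid-stream instead of starting over (host and dataset names are placeholders):

Code:
# Fetch the resume token left behind by the interrupted receive.
token=$(ssh backuphost zfs get -H -o value receive_resume_token backup/media)

# Resume the send from exactly where it stopped; '-s' on the
# receive keeps it resumable if it gets cut off again.
zfs send -v -t "$token" | ssh backuphost zfs receive -s backup/media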
 

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
I didn't know resume tokens could "corrupt". The only issue I'm familiar with is if they refer to a snapshot that no longer exists on the source dataset.
Here's the log showing it since it just happened!

Code:
[2023/11/14 21:55:30] WARNING  [replication_task__task_13] [zettarepl.replication.partially_complete_state] Specified receive_resume_token, but received an error: contains partially-complete state. Allowing ZFS to catch up
[2023/11/14 21:56:31] INFO     [replication_task__task_13] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2023/11/14 21:56:31] INFO     [replication_task__task_13] [zettarepl.replication.run] Resuming replication for destination dataset 'sloche/media'
[2023/11/14 21:56:31] INFO     [replication_task__task_13] [zettarepl.replication.run] For replication task 'task_13': doing push from 'eden/media' to 'sloche/media' of snapshot=None incremental_base=None include_intermediate=None receive_resume_token='1-1120ef79f1-f0-789c636064000310a500c4ec50360710e72765a526973030a44541d460c8a7a515a7968064ec9a61f26c48f2499525a9c5405ae07342e14c2cfa4bf2d34b335318180a0b4d228f94de71744092e704cbe725e6a63230a4a6a4e6e9e7a6a664263a249696e4eb1a191819eb1a1aea1a58c41b18e81a18303020dcc7cd80f04f727e6e41516a71717e36031c00000c132295' encryption=False
[2023/11/14 21:56:31] INFO     [replication_task__task_13] [zettarepl.transport.ssh_netcat] Automatically chose connect address '10.0.10.13'
[2023/11/14 21:56:31] WARNING  [replication_task__task_13] [zettarepl.replication.partially_complete_state] Specified receive_resume_token, but received an error: contains partially-complete state. Allowing ZFS to catch up
[2023/11/14 21:56:31] ERROR    [replication_task__task_13] [zettarepl.replication.run] For task 'task_13' non-recoverable replication error ContainsPartiallyCompleteState()


You'll have to figure out an alternative approach, because that isn't feasible with ongoing snapshots and replications to a destination pool that can easily be filled to capacity.
I'm waiting for the Black Friday deals, or for the $/TB to drop a bit, before buying 4 more 20 TB drives. I just can't afford them at the current price.

If anything, such activity should be done on a separate dataset that does not replicate to the backup pool. Beyond that, such a dataset should probably only use a few manual snapshots that you prune yourself. Saving and then deleting terabytes of data at a time (data that exists on other media anyway) is an inefficient approach, and mixing it in with your backups only makes the problem worse. :confused:
I never considered that before. If I take a manual snapshot, that means I'd have all the time I want to replicate my pool.

Then, once the first replication is complete, I could set up the daily snapshot task plus a replication with a 7-day lifetime.

I wonder what would happen to the first snapshot if I delete it after a few days.

The idea would be to keep going forward with the daily snapshots set to a 7-day lifetime.
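If I understand the idea right, the rough shape would be something like this (dataset name taken from my log; the snapshot name is just made up):

Code:
# One manual snapshot that no retention policy will prune.
zfs snapshot eden/media@manual-base

# Replicate it at my own pace with the GUI task disabled, then turn on
# the daily snapshot task + 7-day replication once it has landed.
# Deleting @manual-base later should be safe as long as a newer
# snapshot exists on both sides to act as the incremental base.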

Well that looks like a bug if I've ever seen one.
It's the fourth bug I've found in TrueNAS in a week. I can't wait to go back to my normal life, haha!

One avenue to try, if you're up for it, is to just do the initial replication over SSH + tmux + resume token. At least you'll have total control over it, and you won't be dealing with the middleware or GUI for the first 35 TiB. (You'll have to disable the replication task to make sure it doesn't try to run automatically.)
I'm afraid that if I go that route, the whole process will restart once I set up the replication task in the GUI afterwards.

At this point I'm just getting PTSD.
 

Kris Moore
SVP of Engineering · Administrator · Moderator · iXsystems
Joined: Nov 12, 2015 · Messages: 1,471
This is pretty bizarre; I don't recall any major issues open against the replication engine right now. Can you please open a bug ticket with detailed information + debugs on this?
 

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
This is pretty bizarre; I don't recall any major issues open against the replication engine right now. Can you please open a bug ticket with detailed information + debugs on this?
Before doing this, I just want to confirm whether my ticket will be thrown out because I enabled apt to install jdupes to fix 10 TiB of hardlinks. Anyone with a sane mind would have fixed 10 TiB worth of hardlinks. I didn't like it last time when they killed my ticket, telling me I was using a heavily modified system. I mean...

EDIT: Or I could reinstall a fresh Cobia before filing the bug report, since it's a fairly quick process, but I don't have much time atm. I just want my data backed up ASAP!
 

Kris Moore
SVP of Engineering · Administrator · Moderator · iXsystems
Joined: Nov 12, 2015 · Messages: 1,471
Before doing this, I just want to confirm whether my ticket will be thrown out because I enabled apt to install jdupes to fix 10 TiB of hardlinks. Anyone with a sane mind would have fixed 10 TiB worth of hardlinks. I didn't like it last time when they killed my ticket, telling me I was using a heavily modified system. I mean...

EDIT: Or I could reinstall a fresh Cobia before filing the bug report, since it's a fairly quick process, but I don't have much time atm. I just want my data backed up ASAP!

Yes, a fresh install to reproduce would be ideal. You've got to look at it from our perspective: TrueNAS is an appliance. When we get issues on a system that has been modified from our base image, we're no longer troubleshooting the same thing we already sent through a bunch of QA cycles to shake out bugs. There's nothing more frustrating than digging into an issue and finding out it was caused by somebody's local modifications breaking what was working perfectly before.
 

CheeryFlame
Contributor
Joined: Nov 21, 2022 · Messages: 184
Let's all hope this time is the good one!

[screenshot: replication task running]


Still weird that it says 40 snapshots when I see

[screenshot: snapshot list]


Fingers crossed! I might buy 4 more 20 TB drives if there are good Black Friday deals, and then I'll have to redo the whole process again!
 