Replication Task Skipping all Snapshots

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Then let's start looking at the pool. That might explain why the replication complains about the resume token.
In your log, there was mention of an error:

So a portion of the snapshot has been transmitted and saved, but it needs to be resumed, which zettarepl.py can't seem to work out.

Can you post details of your unhealthy pool? Something like:
@Apollo Yes, I got the error from the primary machine; it says a file has some error. But it's an important one too.

pool: ServerPool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub in progress since Fri Mar 15 13:26:24 2024
1.10T scanned at 487M/s, 1.07T issued at 0B/s, 1.17T total
128K repaired, 91.80% done, no estimated completion time
config:

NAME                                            STATE     READ WRITE CKSUM
ServerPool                                      ONLINE       0     0     0
  mirror-0                                      ONLINE       0     0     0
    gptid/fa647bce-15ca-11ee-a7ba-000c292b25b2  ONLINE       0     0     0
    gptid/fa494fd4-15ca-11ee-a7ba-000c292b25b2  ONLINE       0     0     0


errors: Permanent errors have been detected in the following files:
ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00:/AN-105-LIBERTY/026_BERLIN_STEEL/017_3125_DOMINION_SQUARE/Tekla_Models/3125_DOMINION_SQUARE/3125_DOMINION_SQUARE1.db1
<0x2a628>:<0x3c8980>
<0x2ad67>:<0x3c8980>
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Scrub is running, so it's best to wait until it completes on its own. It shouldn't take too long.
Wondering if the corruption is related to a jail. Not sure if .db1 is your database or a backup.
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Scrub is running, so it's best to wait until it completes on its own. It shouldn't take too long.
Wondering if the corruption is related to a jail. Not sure if .db1 is your database or a backup.
.db1 is my database file. Can I delete it and try everything once?
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
.db1 is my database file. Can I delete it and try everything once?
I would try something else first.
1) See if you can edit your replication task to not use "Resume Token".
2) I would try to delete the partially received snapshot which holds the resume token. It's not clear to me how to do it, but (if 1 doesn't take care of it) it should normally be possible to list it with
zfs list -r pool
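As a rough sketch, querying the receive_resume_token property recursively (using the pool name from the zpool status output above) should also show which dataset is holding a partial receive; a value of "-" means none:
Code:
zfs get -r receive_resume_token ServerPool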
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
I would try something else first.
1) See if you can edit your replication task to not use "Resume Token".
2) I would try to delete the partially received snapshot which holds the resume token. It's not clear to me how to do it, but (if 1 doesn't take care of it) it should normally be possible to list it with
I tried to do it, but as the snip of the replication settings below shows, there is no option to uncheck "Resume Token".

[screenshot of the replication task settings]

This is the output of zfs list -r ServerPool

ServerPool                                                   1.04T  2.47T    96K  /mnt/ServerPool
ServerPool/.system                                           63.7M  2.47T   136K  legacy
ServerPool/.system/configs-7a6a616639514a76ab8da3e8aa6a9cc9  19.2M  2.47T  19.2M  legacy
ServerPool/.system/cores                                       96K  1024M    96K  legacy
ServerPool/.system/rrd-7a6a616639514a76ab8da3e8aa6a9cc9      37.5M  2.47T  37.5M  legacy
ServerPool/.system/samba4                                     656K  2.47T   656K  legacy
ServerPool/.system/services                                    96K  2.47T    96K  legacy
ServerPool/.system/syslog-7a6a616639514a76ab8da3e8aa6a9cc9   6.02M  2.47T  6.02M  legacy

Also, the partial snapshot is not listed under the snapshot list, so I'm unsure how to delete it.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
Also, the partial snapshot is not listed under the snapshot list, so I'm unsure how to delete it.

You can remove the resume token:
Code:
zfs recv -A mypool/dataset


To verify it's gone:
Code:
zfs get receive_resume_token mypool/dataset


I'm still not sure what is going on, as I'd need to read this thread from the beginning. It's hard to tell if the resume token is the culprit, or the replication fails due to the presence of corrupted blocks.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
.db1 is my database file. Can I delete it and try everything once?
You can't simply "delete" the file.

And you can't simply remove the most recent snapshot.

The only way to rid yourself of the corrupted file is to delete it from the live filesystem and delete all snapshots that it exists in. Then you can take another (new) snapshot, which you can replicate to the backup server.

I also noticed there's metadata corruption in your output. This could be a sign of a failing drive, or HBA controller, or RAM.

EDIT: Are you able to read/load the supposedly "corrupt" .db1 file? Do you get errors if you attempt it?
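One quick way to check (a sketch only; it assumes the dataset is mounted under the default /mnt path on the primary machine) is to force a full read of the file and see whether an I/O error comes back:
Code:
# read the whole file and throw the data away; a ZFS checksum failure should surface as a read error
dd if="/mnt/ServerPool/MasterDataset/Projects/AN-105-LIBERTY/026_BERLIN_STEEL/017_3125_DOMINION_SQUARE/Tekla_Models/3125_DOMINION_SQUARE/3125_DOMINION_SQUARE1.db1" of=/dev/null bs=1048576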

A completed scrub, followed by a "zpool clear" might get rid of those errors in the pool's status.
 
Last edited:

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
I have two machines with the same dataset structure, and the snapshot replication is stopping after a specific date

question: are you trying to replicate to the root of the 2nd pool? if so, that's likely to cause issues (eg overwrite .system).
instead, make yourself a backup dataset in the backup pool and replicate to that.
tank1 > tank2
vs
tank1 > tank2/tank1
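e.g. on the backup box, a minimal sketch (using the placeholder pool names above) before pointing the replication task at it would be:
Code:
zfs create tank2/tank1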

IIRC there should be a zpool or zfs command to clear out a resumable partial replication. even deleting the dataset does not do this in my experience, as the partial is stored elsewhere. might be in the ZIL? not sure.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
question: are you trying to replicate to the root of the 2nd pool? if so, that's likely to cause issues (eg overwrite .system).
instead, make yourself a backup dataset in the backup pool and replicate to that.
tank1 > tank2
vs
tank1 > tank2/tank1
If there were such an issue, the replication would never have started in the first place. So no.

IIRC there should be a zpool or zfs command to clear out a resumable partial replication. even deleting the dataset does not do this in my experience, as the partial is stored elsewhere. might be in the ZIL? not sure.
I don't think ZIL is involved in this case.
I believe it should be listed somewhere under the following, but I am not sure:
zfs list -t snapshot -r ServerPool
In recent weeks, I believe I have seen the partial replication on the destination as a snapshot or dataset, and I have dealt with a similar situation a few years ago. I just don't recall the specifics.
I know that I never used the following command to get rid of it:
zfs recv -A mypool/dataset

I think I used the following command structure instead:

zfs destroy something/some receiv/something
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
If there were such an issue, the replication would never have started in the first place. So no.


I don't think ZIL is involved in this case.
I believe it should be listed somewhere under the following, but I am not sure:

In recent weeks, I believe I have seen the partial replication on the destination as a snapshot or dataset, and I have dealt with a similar situation a few years ago. I just don't recall the specifics.
I know that I never used the following command to get rid of it:


I think I used the following command structure instead:
@Apollo It looks like the problem is the primary machine itself. Now I'm trying to somehow retrieve all the snapshots and then format the primary machine.
Still not sure how to skip the snapshots that are stopping replication
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
Still not sure how to skip the snapshots that are stopping replication
Corrupted data will prevent a replication from completing successfully.

You may in fact have a corrupted file/block that exists on all your snapshots and your live filesystem.


See here:
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Corrupted data will prevent a replication from completing successfully.

You may in fact have a corrupted file/block that exists on all your snapshots and your live filesystem.


See here:
@winnielinnie yes, I too believe this is the scenario. But the file (.db1) in question is reading properly in my application.
How do I fix this? I can delete all the preceding snapshots from the day of corruption, but I need to solve this issue.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
That doesn't mean it's not "corrupt" in the eyes of ZFS.

To rule it out:
  1. Run zpool clear mypool
  2. Run another full scrub on the pool
  3. Check if the corruption still exists with zpool status -v mypool
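
Using the pool name from the earlier zpool status output, those three steps would look roughly like:
Code:
zpool clear ServerPool
zpool scrub ServerPool
# after the scrub finishes:
zpool status -v ServerPool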

Otherwise, you're stuck with this "tainted" pool.

You'll have to delete the file on the live filesystem (you can make a copy of it elsewhere in the meantime), destroy all snapshots that it exists in (which is a bummer, because you're losing the deleted/modified data that the snapshots preserved), then create a new snapshot and try to replicate again.

However, you also have metadata corruption on the pool. So it's not just this one file...
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
The error being reported points specifically to this snapshot (as far as I understand):
ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00
As for the metadata errors, I have seen these in the past originating from a jail corruption (I don't know what kind of corruption). I could be incorrect, though:
<0x2a628>:<0x3c8980>
<0x2ad67>:<0x3c8980>
So, I believe the corruption occurred only in a block saved between ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00 and the previous snapshot right before it, which could be ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00 (needs to be confirmed).

So I would think the worst-case scenario would be to do a rollback to ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00, which also means that the state of the database file and every file within this dataset would revert to its content as of 2024-03-01_00-00.

You could do the rollback, then copy the corrupted files from your PC (which you know are up to date) back to the locations where they reside in the dataset.
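As a sketch of that rollback (note that -r destroys any snapshots newer than the target, so weigh that first):
Code:
zfs rollback -r ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00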

Another thought: before doing anything that drastic, I would even look at the possibility of simply destroying the snapshot
ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00
The idea is that if it is a database file which is being written/updated a few times a day, then I am hoping that the reference to the corrupted block/blocks could be freed when the snapshot is deleted. Hopefully, the snapshots after
ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00
could still point to blocks which aren't corrupted, because those would have been created by a more recent write/update of the file.
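A dry run (just a sketch) would show what would be freed without actually destroying anything:
Code:
zfs destroy -nv ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00
# rerun without -n to actually destroy the snapshot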

Not sure if this will tell us much, but you could run:
zfs diff ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00 ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00
and repeat with the other snapshots. That might tell us something interesting.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
The error being reported points specifically to this snapshot (as far as I understand):
It can be misleading. A scrub will mark the first "hit" of a corrupted file/block, working backwards from the most recent snapshots. So if he deletes the snapshot in question, then runs another scrub, guess what? It will inform him that a different snapshot contains a corrupted file. (If he keeps destroying the snapshots, working backwards, the scrub will just inform him of the "next in line" containing the corruption.) The reality is, there's a block of data that doesn't match its checksum. In ZFS, the same block is pointed to by several snapshots. (It's still the same block of physical data, corrupted or otherwise.)

The only "exception" to the above is if you modified a file, and it just so happens that only that modified block was corrupted. Then you can in fact find the latest snapshot before the modification occurred, where you'll find a 100% intact, non-corrupted file.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
It can be misleading. A scrub will mark the first "hit" of a corrupted file/block, working backwards from the most recent snapshots. So if he deletes the snapshot in question, then runs another scrub, guess what? It will inform him that a different snapshot contains a corrupted file. (If he keeps destroying the snapshots, working backwards, the scrub will just inform him of the "next in line" containing the corruption.) The reality is, there's a block of data that doesn't match its checksum. In ZFS, the same block is pointed to by several snapshots. (It's still the same block of physical data, corrupted or otherwise.)

The only "exception" to the above is if you modified a file, and it just so happens that only that modified block was corrupted. Then you can in fact find the latest snapshot before the modification occurred, where you'll find a 100% intact, non-corrupted file.
I completely agree. What I hope has happened is that the server/jail servicing the database file would have introduced a block corruption, or would have required ZFS to allocate new blocks if the file has been updated. And I am hoping that the corrupted block could be freed. If the block/blocks (128K in size) exist early on, and several or all snapshots after that exist, then the block will still be referenced by any snapshots that reference that block. But my understanding is that the error occurs with
ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00
or
ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00
which means snapshots prior to
ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00
shouldn't have corrupted blocks, otherwise the replication would have failed then, unless
ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00
is the first snapshot for the dataset.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
I completely agree. What I hope has happened is that the server/jail servicing the database file would have introduced a block corruption, or would have required ZFS to allocate new blocks if the file has been updated. And I am hoping that the corrupted block could be freed. If the block/blocks (128K in size) exist early on, and several or all snapshots after that exist, then the block will still be referenced by any snapshots that reference that block. But my understanding is that the error occurs with

or

which means snapshots prior to

shouldn't have corrupted blocks, otherwise the replication would have failed then, unless

is the first snapshot for the dataset.
This is why doing
zfs diff ....
could help figure out the state of the database file and give us some clues.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
He'd have to use "grep DOMINION_SQUARE1.db1" in conjunction with the "zfs diff" command.

Start with the earliest snapshot and compare it upwards, until he finds the one where the file was modified (coded with an "M").
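
Roughly, something like this for each consecutive pair of snapshots:
Code:
zfs diff ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00 ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00 | grep DOMINION_SQUARE1.db1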
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
@Apollo and @winnielinnie, you have both guessed it correctly.

I deleted the snapshot 2024-03-01_00-00 and ran a scrub; now it's showing an error on 2024-03-02_00-00. The problem is that the data spans multiple files and I will be losing all the snapshots. Is there a way to delete specific items in the snapshot?

I have fixed the main file DOMINION_SQUARE1.db1 using the native application, but I need to figure out how to use grep on snapshots.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
Is there a way to delete specific items in the snapshot?
Snapshots are immutable. This is impossible.

That's why I wrote this:
You'll have to delete the file on the live filesystem (you can make a copy of it elsewhere in the meantime), destroy all snapshots that it exists in (which is a bummer, because you're losing the deleted/modified data that the snapshots preserved), then create a new snapshot and try to replicate again.


I have fixed the main file DOMINION_SQUARE1.db1 using the native application, but I need to figure out how to use grep on snapshots.
If you don't need the previous snapshots, then you can safely destroy all snapshots, and then create a new one from "this point forward". Keep in mind that if you destroy the snapshots, they're gone for good, and this includes any previously deleted files that you might have second thoughts about.
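
A minimal sketch of that (double-check the list before piping it into destroy, since this is irreversible, and the @manual-... snapshot name is only an example):
Code:
# list every snapshot of the affected dataset, then destroy them one by one
zfs list -H -o name -t snapshot -r ServerPool/MasterDataset/Projects | xargs -n 1 zfs destroy
# take a fresh snapshot to replicate from
zfs snapshot ServerPool/MasterDataset/Projects@manual-$(date +%Y-%m-%d)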

The stuff about "zfs diff" and "grep" is moot at this point...
 