Replication Task Skipping all Snapshots

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Then let's start looking at the pool. That might explain why the replication complains about the resume token.
In your log, there was mention of an error:

So a portion of the snapshot has been transmitted and saved, but it needs to be resumed, which zettarepl.py can't seem to work out.

Can you post details of your unhealthy pool? Something like:
@Apollo Yes, I got the error from the primary machine; it says a file has some error. But it's an important one too.

pool: ServerPool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub in progress since Fri Mar 15 13:26:24 2024
1.10T scanned at 487M/s, 1.07T issued at 0B/s, 1.17T total
128K repaired, 91.80% done, no estimated completion time
config:

NAME                                            STATE     READ WRITE CKSUM
ServerPool                                      ONLINE       0     0     0
  mirror-0                                      ONLINE       0     0     0
    gptid/fa647bce-15ca-11ee-a7ba-000c292b25b2  ONLINE       0     0     0
    gptid/fa494fd4-15ca-11ee-a7ba-000c292b25b2  ONLINE       0     0     0


errors: Permanent errors have been detected in the following files:
ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00:/AN-105-LIBERTY/026_BERLIN_STEEL/017_3125_DOMINION_SQUARE/Tekla_Models/3125_DOMINION_SQUARE/3125_DOMINION_SQUARE1.db1
<0x2a628>:<0x3c8980>
<0x2ad67>:<0x3c8980>
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Scrub is running, so it's best to wait until it completes on its own. It shouldn't take too long.
Wondering if the corruption is related to a jail. Not sure if .db1 is your database or a backup.
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Scrub is running, so it's best to wait until it completes on its own. It shouldn't take too long.
Wondering if the corruption is related to a jail. Not sure if .db1 is your database or a backup.
.db1 is my database file. Can I delete it and try everything once?
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
.db1 is my database file. Can I delete it and try everything once?
I would try something else first.
1) See if you can edit your replication task to not use "Resume Token".
2) I would try to delete the partially received snapshot which holds the resume token. It's not clear to me how to do it, but (if 1 doesn't take care of it) it should normally be possible to list it with
zfs list -r pool
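As a rough sketch, querying the receive_resume_token property recursively (using the pool name from the zpool status output above) should also show which dataset is holding a partial receive; a value of "-" means none:
Code:
zfs get -r receive_resume_token ServerPool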
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
I would try something else first.
1) See if you can edit your replication task to not use "Resume Token".
2) I would try to delete the partially received snapshot which holds the resume token. It's not clear to me how to do it, but (if 1 doesn't take care of it) it should normally be possible to list it with
I tried to do it, but as the snip of the replication settings below shows, there is no option to uncheck "Resume Token".

[screenshot of the replication task settings]

This is the output of zfs list -r ServerPool

ServerPool                                                   1.04T  2.47T    96K  /mnt/ServerPool
ServerPool/.system                                           63.7M  2.47T   136K  legacy
ServerPool/.system/configs-7a6a616639514a76ab8da3e8aa6a9cc9  19.2M  2.47T  19.2M  legacy
ServerPool/.system/cores                                       96K  1024M    96K  legacy
ServerPool/.system/rrd-7a6a616639514a76ab8da3e8aa6a9cc9      37.5M  2.47T  37.5M  legacy
ServerPool/.system/samba4                                     656K  2.47T   656K  legacy
ServerPool/.system/services                                    96K  2.47T    96K  legacy
ServerPool/.system/syslog-7a6a616639514a76ab8da3e8aa6a9cc9   6.02M  2.47T  6.02M  legacy

Also, the partial snapshot is not listed under the snapshot list, so I'm unsure how to delete it.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
Also, the partial snapshot is not listed under the snapshot list, so I'm unsure how to delete it.

You can remove the resume token:
Code:
zfs recv -A mypool/dataset


To verify it's gone:
Code:
zfs get receive_resume_token mypool/dataset


I'm still not sure what is going on, as I'd need to read this thread from the beginning. It's hard to tell if the resume token is the culprit, or the replication fails due to the presence of corrupted blocks.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
.db1 is my database file. Can I delete it and try everything once?
You can't simply "delete" the file.

And you can't simply remove the most recent snapshot.

The only way to rid yourself of the corrupted file is to delete it from the live filesystem and delete all snapshots that it exists in. Then you can take another (new) snapshot, which you can replicate to the backup server.

I also noticed there's metadata corruption in your output. This could be a sign of a failing drive, or HBA controller, or RAM.

EDIT: Are you able to read/load the supposedly "corrupt" .db1 file? Do you get errors if you attempt it?
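One quick way to check (a sketch only; it assumes the dataset is mounted under the default /mnt path on the primary machine) is to force a full read of the file and see whether an I/O error comes back:
Code:
# read the whole file and throw the data away; a ZFS checksum failure should surface as a read error
dd if="/mnt/ServerPool/MasterDataset/Projects/AN-105-LIBERTY/026_BERLIN_STEEL/017_3125_DOMINION_SQUARE/Tekla_Models/3125_DOMINION_SQUARE/3125_DOMINION_SQUARE1.db1" of=/dev/null bs=1048576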

A completed scrub, followed by a "zpool clear" might get rid of those errors in the pool's status.
 
Last edited:

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
I have two machines with the same dataset structure, and the snapshot replication is stopping after a specific date

question: are you trying to replicate to the root of the 2nd pool? if so, that's likely to cause issues (eg overwrite .system).
instead, make yourself a backup dataset in the backup pool and replicate to that.
tank1 > tank2
vs
tank1 > tank2/tank1
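e.g. on the backup box, a minimal sketch (using the placeholder pool names above) before pointing the replication task at it would be:
Code:
zfs create tank2/tank1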

IIRC there should be a zpool or zfs command to clear out a resumable partial replication. even deleting the dataset does not do this in my experience, as the partial is stored elsewhere. might be in the ZIL? not sure.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
question: are you trying to replicate to the root of the 2nd pool? if so, that's likely to cause issues (eg overwrite .system).
instead, make yourself a backup dataset in the backup pool and replicate to that.
tank1 > tank2
vs
tank1 > tank2/tank1
If there were such an issue, the replication would never have started in the first place. So no.

IIRC there should be a zpool or zfs command to clear out a resumable partial replication. even deleting the dataset does not do this in my experience, as the partial is stored elsewhere. might be in the ZIL? not sure.
I don't think ZIL is involved in this case.
I believe it should be listed somewhere under the following, but I am not sure:
zfs list -t snapshot -r ServerPool
In recent weeks, I believe I have seen the partial replication on the destination as a snapshot or dataset, and I have dealt with a similar situation a few years ago. I just don't recall the specifics.
I know that I never used the following command to get rid of it:
zfs recv -A mypool/dataset

I think I used the following command structure instead:

zfs destroy something/some receiv/something
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
If there were such an issue, the replication would never have started in the first place. So no.


I don't think ZIL is involved in this case.
I believe it should be listed somewhere under the following, but I am not sure:

In recent weeks, I believe I have seen the partial replication on the destination as a snapshot or dataset, and I have dealt with a similar situation a few years ago. I just don't recall the specifics.
I know that I never used the following command to get rid of it:


I think I used the following command structure instead:
@Apollo It looks like the problem is the primary machine itself. Now I'm trying to somehow retrieve all the snapshots and then format the primary machine.
Still not sure how to skip the snapshots that are stopping replication
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
Still not sure how to skip the snapshots that are stopping replication
Corrupted data will prevent a replication from completing successfully.

You may in fact have a corrupted file/block that exists on all your snapshots and your live filesystem.


See here:
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
Corrupted data will prevent a replication from completing successfully.

You may in fact have a corrupted file/block that exists on all your snapshots and your live filesystem.


See here:
@winnielinnie yes, I too believe this is the scenario. But the file (.db1) in question is reading properly in my application.
How do I fix this? I can delete all the preceding snapshots from the day of corruption, but I need to solve this issue.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
That doesn't mean it's not "corrupt" in the eyes of ZFS.

To rule it out:
  1. Run zpool clear mypool
  2. Run another full scrub on the pool
  3. Check if the corruption still exists with zpool status -v mypool
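
Using the pool name from the earlier zpool status output, those three steps would look roughly like:
Code:
zpool clear ServerPool
zpool scrub ServerPool
# after the scrub finishes:
zpool status -v ServerPool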

Otherwise, you're stuck with this "tainted" pool.

You'll have to delete the file on the live filesystem (you can make a copy of it elsewhere in the meantime), destroy all snapshots that it exists in (which is a bummer, because you're losing the deleted/modified data that the snapshots preserved), then create a new snapshot and try to replicate again.

However, you also have metadata corruption on the pool. So it's not just this one file...
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
The error being reported points specifically to this snapshot (as far as I understand):
ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00
As for the metadata errors, I have seen these in the past originating from a jail corruption (I don't know what kind of corruption). I could be incorrect, though:
<0x2a628>:<0x3c8980>
<0x2ad67>:<0x3c8980>
So, I believe the corruption occurred only in a block saved between ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00 and the previous snapshot right before it, which could be ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00 (needs to be confirmed).

So I would think the worst-case scenario would be to do a rollback to ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00, which also means that the state of the database file and every file within this dataset would revert to its content as of 2024-03-01_00-00.

You could do the rollback, then copy the corrupted files from your PC (which you know are up to date) back to the locations where they reside in the dataset.
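As a sketch of that rollback (note that -r destroys any snapshots newer than the target, so weigh that first):
Code:
zfs rollback -r ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00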

Another thought: before doing anything that drastic, I would even look at the possibility of simply destroying the snapshot
ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00
The idea is that if it is a database file which is being written/updated a few times a day, then I am hoping that the reference to the corrupted block/blocks could be freed when the snapshot is deleted. Hopefully, the snapshots after
ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00
could still point to blocks which aren't corrupted, because those would have been created by a more recent write/update of the file.
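A dry run (just a sketch) would show what would be freed without actually destroying anything:
Code:
zfs destroy -nv ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00
# rerun without -n to actually destroy the snapshot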

Not sure if this will tell us much, but you could run:
zfs diff ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00 ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00
and repeat with the other snapshots. That might tell us something interesting.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
The error being reported points specifically to this snapshot (as far as I understand):
It can be misleading. A scrub will mark the first "hit" of a corrupted file/block, working backwards from the most recent snapshots. So if he deletes the snapshot in question, then runs another scrub, guess what? It will inform him that a different snapshot contains a corrupted file. (If he keeps destroying the snapshots, working backwards, the scrub will just inform him of the "next in line" containing the corruption.) The reality is, there's a block of data that doesn't match its checksum. In ZFS, the same block is pointed to by several snapshots. (It's still the same block of physical data, corrupted or otherwise.)

The only "exception" to the above is if you modified a file, and it just so happens that only that modified block was corrupted. Then you can in fact find the latest snapshot before the modification occurred, where you'll find a 100% intact, non-corrupted file.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
It can be misleading. A scrub will mark the first "hit" of a corrupted file/block, working backwards from the most recent snapshots. So if he deletes the snapshot in question, then runs another scrub, guess what? It will inform him that a different snapshot contains a corrupted file. (If he keeps destroying the snapshots, working backwards, the scrub will just inform him of the "next in line" containing the corruption.) The reality is, there's a block of data that doesn't match its checksum. In ZFS, the same block is pointed to by several snapshots. (It's still the same block of physical data, corrupted or otherwise.)

The only "exception" to the above is if you modified a file, and it just so happens that only that modified block was corrupted. Then you can in fact find the latest snapshot before the modification occurred, where you'll find a 100% intact, non-corrupted file.
I completely agree. What I hope has happened is that the server/jail servicing the database file would have introduced a block corruption, or would have required ZFS to allocate new blocks if the file has been updated. And I am hoping that the corrupted block could be freed. If the block/blocks (128K in size) exist early on, and several or all snapshots after that exist, then the block will still be referenced by any snapshots that reference that block. But my understanding is that the error occurs with
ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00
or
ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00
which means snapshots prior to
ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00
shouldn't have corrupted blocks, otherwise the replication would have failed then, unless
ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00
is the first snapshot for the dataset.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
I completely agree. What I hope has happened is that the server/jail servicing the database file would have introduced a block corruption, or would have required ZFS to allocate new blocks if the file has been updated. And I am hoping that the corrupted block could be freed. If the block/blocks (128K in size) exist early on, and several or all snapshots after that exist, then the block will still be referenced by any snapshots that reference that block. But my understanding is that the error occurs with

or

which means snapshots prior to

shouldn't have corrupted blocks, otherwise the replication would have failed then, unless

is the first snapshot for the dataset.
This is why doing
zfs diff ....
could help figure out the state of the database file and give us some clues.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
He'd have to use "grep DOMINION_SQUARE1.db1" in conjunction with the "zfs diff" command.

Start with the earliest snapshot and compare it upwards, until he finds the one where the file was modified (coded with an "M").
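
Roughly, something like this for each consecutive pair of snapshots:
Code:
zfs diff ServerPool/MasterDataset/Projects@auto-2024-03-01_00-00 ServerPool/MasterDataset/Projects@auto-2024-03-02_00-00 | grep DOMINION_SQUARE1.db1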
 

urfrndsandy

Dabbler
Joined
May 30, 2023
Messages
32
@Apollo and @winnielinnie, you have both guessed it correctly.

I deleted the snapshot 2024-03-01_00-00 and ran a scrub; now it's showing an error on 2024-03-02_00-00. The problem is that the data spans multiple files and I will be losing all the snapshots. Is there a way to delete specific items in the snapshot?

I have fixed the main file DOMINION_SQUARE1.db1 using the native application, but I need to figure out how to use grep on snapshots.
 
winnielinnie

Joined
Oct 22, 2019
Messages
3,641
Is there a way to delete specific items in the snapshot?
Snapshots are immutable. This is impossible.

That's why I wrote this:
You'll have to delete the file on the live filesystem (you can make a copy of it elsewhere in the meantime), destroy all snapshots that it exists in (which is a bummer, because you're losing the deleted/modified data that the snapshots preserved), then create a new snapshot and try to replicate again.


I have fixed the main file DOMINION_SQUARE1.db1 using the native application, but I need to figure out how to use grep on snapshots.
If you don't need the previous snapshots, then you can safely destroy all snapshots, and then create a new one from "this point forward". Keep in mind that if you destroy the snapshots, they're gone for good, and this includes any previously deleted files that you might have second thoughts about.
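
A minimal sketch of that (double-check the list before piping it into destroy, since this is irreversible, and the @manual-... snapshot name is only an example):
Code:
# list every snapshot of the affected dataset, then destroy them one by one
zfs list -H -o name -t snapshot -r ServerPool/MasterDataset/Projects | xargs -n 1 zfs destroy
# take a fresh snapshot to replicate from
zfs snapshot ServerPool/MasterDataset/Projects@manual-$(date +%Y-%m-%d)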

The stuff about "zfs diff" and "grep" is moot at this point...
 