Corrupted files

Blaster6

Cadet
Joined
May 16, 2020
Messages
3
I apparently have some corrupted files that I do not recognize. This leads me to believe this is bad.


FreeBSD 12.2-RELEASE-p9 2ee62d665f0(HEAD) TRUENAS

TrueNAS (c) 2009-2021, iXsystems, Inc.
All rights reserved.
TrueNAS code is released under the modified BSD license with some
files copyrighted by (c) iXsystems, Inc.

For more information, documentation, help or support, go here:
http://truenas.com
Welcome to FreeNAS

Warning: settings changed through the CLI are not written to
the configuration database and will be reset on reboot.

Code:
root@freenas[~]# zpool status -xv
  pool: pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 2.78T in 05:10:05 with 3 errors on Wed Sep 8 12:12:58 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/44dd29c6-1043-11ec-ad24-2cf05d07cc05  ONLINE       0     0     0
            gptid/3b3787c5-0ab8-11ec-bdf6-2cf05d07cc05  ONLINE       0     0     0
            gptid/0429ad3f-09e3-11ec-99ac-2cf05d07cc05  ONLINE       0     0     0
            gptid/4f234214-1094-11ec-8d7f-2cf05d07cc05  ONLINE       0     0     0
        cache
          gptid/5a7dcf62-8a4f-11ea-b7d4-2cf05d07cc05    ONLINE       0     0     0
          gptid/5c4a7c95-8a4f-11ea-b7d4-2cf05d07cc05    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xd7>:<0x9b5b>
        <0xd7>:<0x10084>
        <0xdb>:<0x45c22>
root@freenas[~]#


Some additional info:
I have had this error for a couple of months and have not found anything corrupted or missing. I started seeing errors when my storage space was almost full. I recently replaced all 4 drives with larger ones and had some difficulty due to "no valid replicas" errors, but everything was backed up, so I just pulled the drives out one by one and replaced them without taking them offline. Everything seems to be fine with the data I access. Are these some sort of system files? Is there a way to repair them?
 
Joined
Oct 22, 2019
Messages
3,641
Run a full scrub again, then check the status output after it finishes. (Do this overnight, and preferably when there will be no major disk I/O.)

I had one "error", but it was due to some underlying bug with syncoid + native encryption. A full scrub + a clear got me back to a healthy pool.

For the sake of safety, do you have a backup of this data?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Those are metadata objects that got corrupted. The pool is toast; back up what you can, because you will be rebuilding this.
 

Blaster6

Cadet
Joined
May 16, 2020
Messages
3
These errors have survived several scrubs. I have a complete backup. How do I rebuild? Do I delete the pool and start over?
I don't mind reloading the data, but do I have to lose all the configuration?
 
Joined
Oct 22, 2019
Messages
3,641
As a last-ditch attempt, since you're going to destroy and rebuild, by chance do you have any residual directories that remained from previously destroyed datasets?

For example,
  1. You create a dataset named pool/sandbox
  2. You do stuff with it
  3. You later decide to destroy it
  4. Little do you realize, there's a phantom folder that remains at /mnt/pool/sandbox

This is what happened when I played around with syncoid and got "permanent" metadata errors that coincided with the phantom folders. (I think I manually removed the phantom folders after destroying the datasets? Either way, the pool started to complain about permanent errors that supposedly cannot be fixed.)
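
If you want to check for those leftovers, something like this works (a rough sketch, assuming the pool is named "pool" and mounted at /mnt/pool):

Code:
# Every real dataset in the pool
zfs list -r -o name pool

# Every directory actually sitting at the pool's mountpoint
ls /mnt/pool

# Any directory in the second listing with no matching dataset in
# the first is a leftover "phantom" folder from a destroyed dataset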

Running the scrub correctly fixed this, and everything went back to normal: the pool is back to a HEALTHY state with zero errors, permanent or otherwise, and all four drives passed the extended SMART tests.
 
Joined
Oct 22, 2019
Messages
3,641
You might find this article useful, as it shares similarities with your issue, especially if your drives are fine.

Expand the quote and take note of the part to stop the scrub "within a minute".

The full article goes into more detail, which seems to reflect the same issue I faced with "phantom" folders and associated "errors".

IceSquare said:
https://icesquare.com/wordpress/zfs...rs-have-been-detected-in-the-following-files/

First, make sure that you have no checksum error and the pool is healthy, i.e., all hard drives are online, and all counts are zero.

Next, try to scrub the pool again:

Code:
sudo zpool scrub mypool

Within a minute, try to stop the process:

Code:
sudo zpool scrub -s mypool

Check the status again. The error should be gone:

Code:
sudo zpool status -v

  pool: mypool
 state: ONLINE
  scan: scrub canceled on Sun Feb  3 12:18:06 2019
errors: No known data errors


If the error is still present, you may need to scrub the pool again.
 

Blaster6

Cadet
Joined
May 16, 2020
Messages
3
I don't know what may have been there before, because I did not build this system. I do know that for the last couple of years there were no errors, until the pool started to get full. With the new drives I am at 30% capacity.

I have now run a long SMART test on all drives and got no errors. I still see errors on the same 3 files.
I am now running a scrub on the pool. It looks like it will take a while.
You might find this article useful, as it shares similarities to your issue, especially if your drives are fine.

Expand the quote and take note of the part to stop the scrub "within a minute".

The full article goes into more detail, which seems to reflect the same issue I faced with "phantom" folders and associated "errors".
The article looks exactly like my problem.
How do I make sure the checksums are all 0? What if they aren't?
[Screenshot: zpool status output showing nonzero checksum counts on three of the drives]


The article implies this won't work if the counts are not all 0, but it does not explain how to get them there.

I don't know if it makes a difference, but all files are added and deleted through a Windows SMB share.
 
Joined
Oct 22, 2019
Messages
3,641
The article looks exactly like my problem.
How do I make sure the checksums are all 0? What if they aren't?
Your first output from your original post shows 0 checksum errors.

However, your second screenshot shows 4 errors (across 3 different drives).

You may in fact be facing a different issue then.
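
As for getting the counters back to zero: zpool clear resets the tallies, though it only clears the counts; it does not fix whatever caused the errors in the first place. A sketch, using the pool name from your original post:

Code:
# Reset the read/write/checksum error counters for all devices
zpool clear pool

# Confirm the counters read zero again
zpool status -v pool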

I am now running a scrub on the pool. It looks like it will take a while.

Did you do what the article said and cancel the scrub within a minute? If not, you can try canceling a new scrub within a minute of starting it, once this full scrub completes.

One minute after starting the scrub, issue this command to cancel it:

Code:
zpool scrub -s pool
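
Putting it together, the sequence would look something like this once the current scrub completes (a rough sketch, again assuming the pool is named "pool"):

Code:
# Kick off a fresh scrub
zpool scrub pool

# Give it about a minute, then cancel it
sleep 60
zpool scrub -s pool

# Check whether the permanent errors are gone
zpool status -v pool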


But since you do have checksum errors, it's probably more than just phantom files/folders that don't exist anymore.

Are you using native encryption by any chance?

You might have to resort to starting from a backup if the "cancel within 1 minute trick" doesn't clear it.

I have now run a long SMART test on all drives and got no errors. I still see errors on the same 3 files.
That's one reassuring thing. Not as thorough as badblocks, but at least you can rule out read errors for now.
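
If you ever do want the more thorough check, badblocks can do a read-only pass or a destructive write test. A sketch, with assumptions: badblocks is installed (on FreeBSD it ships with the e2fsprogs package), and /dev/ada0 stands in for whichever drive you test:

Code:
# Non-destructive read-only scan, with progress and verbose output
badblocks -b 4096 -sv /dev/ada0

# Destructive write-mode test -- this wipes the drive, so only run
# it on a disk that holds nothing you need
badblocks -b 4096 -wsv /dev/ada0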
 