Resilver FINISHED; 1 Error

ppmax · Jan 31, 2020

Hello--

I recently replaced a disk that had some bad blocks.

Using the FreeNAS GUI (11.2 U7) I offlined the disk, shut down, connected the new disk, then restarted FreeNAS. Again using the GUI, I followed the FreeNAS documentation to replace a failed disk.

The Resilvering process took about a day, and when it completed I ran zpool status -v [name of my volume] and I noticed an error that said a file was corrupted. I then rm'd that single file.

I then logged into FreeNAS and now see this in the Storage/Pools/Pool Status page:

So then I ran zpool status -v again and see this output:

Code:

  pool: volume1
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 710G in 0 days 11:36:15 with 1 errors on Fri Jan 31 06:08:29 2020
config:

    NAME                                              STATE     READ WRITE CKSUM
    volume1                                           DEGRADED     0     0     1
      raidz2-0                                        DEGRADED     0     0     2
        gptid/5dd66899-aabe-11e1-90d1-6805ca067062    ONLINE       0     0     1
        replacing-1                                   DEGRADED     0     0     0
          12211041995604376651                        UNAVAIL      0     0     0  was /dev/gptid/5e3d98a1-aabe-11e1-90d1-6805ca067062
          gptid/62dc881e-43c9-11ea-8b8b-6805ca067062  ONLINE       0     0     0
        gptid/5ea4a2db-aabe-11e1-90d1-6805ca067062    ONLINE       0     0     0
        gptid/5f14865b-aabe-11e1-90d1-6805ca067062    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        volume1/backups:<0x983ea>

Can anyone offer any advice for what I should do here?

Should I ONLINE the new/replacement disk via the GUI?
Should I delete the volume1/backups dir?
Should I OFFLINE the disk, then REPLACE the disk again and go through the resilver process again?

Thanks again for any tips you may have to offer...this is my first zpool disk replacement experience ;)

PP

Apollo · Jan 31, 2020

It seems you offlined the disk first so that you could install the replacement and then resilver? That seems to be the case. Why would you do that?
I think you should have kept the failed disk in place, inserted the replacement disk, and then resilvered. Only when resilvering has completed should you remove the failed disk.
I would try putting the old disk back in and letting it resilver; maybe that will fix things.

ppmax · Jan 31, 2020

Thanks for your reply--much appreciated.

I followed the documentation here at the link below. It's totally possible I misunderstood, but it seems to be saying to offline the disk, power down, plug in new disk, boot, and then click Replace:
9.5.1. Replacing a Failed Disk

I can't find any documentation about what to do if there are errors during resilvering....

SweetAndLow · Jan 31, 2020

Do you have SMART data for all your drives? It's strange that you have corruption in a RAIDZ2. That would mean 3 disks have problems. Your first disk also has a checksum error.
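
A minimal Python sketch of how the non-zero READ/WRITE/CKSUM counters could be pulled out of the config section of zpool status output. The function name parse_error_counters is hypothetical, the sample text is copied from the status output posted above, and the sketch assumes plain integer counters (abbreviated values such as "1.2K" are simply skipped).

```python
def parse_error_counters(status_text):
    """Return {device: (read, write, cksum)} for config rows with a non-zero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":        # header row of the config table
            in_config = True
            continue
        if fields[0] == "errors:":     # config table is over
            break
        if in_config and len(fields) >= 5:
            name, _state, read, write, cksum = fields[:5]
            try:
                counts = (int(read), int(write), int(cksum))
            except ValueError:
                continue               # assumption: skip abbreviated counters
            if any(counts):
                flagged[name] = counts
    return flagged


# Sample copied from the first status output in this thread.
SAMPLE = """
config:

    NAME                                              STATE     READ WRITE CKSUM
    volume1                                           DEGRADED     0     0     1
      raidz2-0                                        DEGRADED     0     0     2
        gptid/5dd66899-aabe-11e1-90d1-6805ca067062    ONLINE       0     0     1
        replacing-1                                   DEGRADED     0     0     0
          12211041995604376651                        UNAVAIL      0     0     0  was /dev/gptid/5e3d98a1-aabe-11e1-90d1-6805ca067062
          gptid/62dc881e-43c9-11ea-8b8b-6805ca067062  ONLINE       0     0     0
        gptid/5ea4a2db-aabe-11e1-90d1-6805ca067062    ONLINE       0     0     0
        gptid/5f14865b-aabe-11e1-90d1-6805ca067062    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
"""

for device, counts in parse_error_counters(SAMPLE).items():
    print(device, counts)
```

On this sample the only leaf device flagged is the first disk; the other two rows are the pool and the raidz2 vdev.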

Apollo · Jan 31, 2020

ppmax said:
Thanks for your reply--much appreciated.

I followed the documentation here at the link below. It's totally possible I misunderstood, but it seems to be saying to offline the disk, power down, plug in new disk, boot, and then click Replace:
9.5.1. Replacing a Failed Disk

I can't find any documentation about what to do if there are errors during resilvering....

I see what you mean. Then again there is a note below.

Note
A disk that is failing but has not completely failed can be replaced in place, without first removing it. Whether this is a good idea depends on the overall condition of the failing disk. A disk with a few newly-bad blocks that is otherwise functional can be left in place during the replacement to provide data redundancy. A drive that is experiencing continuous errors can actually slow down the replacement. In extreme cases, a disk with serious problems might spend so much time retrying failures that it could prevent the replacement resilvering from completing before another drive fails.

I think the documentation should be clearer about the distinction between a failed drive (entirely screwed up) and an almost perfectly good drive that caused a few errors, which put the pool in a Degraded state.

Doing the replacement while the failing drive is still working increases the chance that the resilver completes successfully. The failing drive may die entirely, or act up to the point where the pool can no longer be resilvered efficiently and reliably, and only then should you think about offlining it.

If you leave the failed disk in, in the online state, and replace it, you will still retain RAIDZ2 redundancy except for the blocks on the disk that failed. Offlining it automatically takes your redundancy one level down, so you end up with RAIDZ1 redundancy, hoping no other disk starts to show signs of trouble.
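
The redundancy argument can be made concrete with a small Python sketch. It assumes the usual parity counts (RAIDZ1 = 1, RAIDZ2 = 2, RAIDZ3 = 3); the function name is hypothetical and the sketch only counts whole disks, ignoring the individual bad blocks on a failing disk.

```python
# Parity disks per vdev type (standard RAIDZ levels).
PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def remaining_fault_tolerance(vdev_type, unavailable_disks):
    """How many more whole-disk failures the vdev can absorb.

    A negative result means more disks are missing than parity can cover.
    """
    return PARITY[vdev_type] - unavailable_disks


# Old disk left online during the replace: full RAIDZ2 protection.
print(remaining_fault_tolerance("raidz2", 0))  # 2
# Old disk offlined first: effectively RAIDZ1 for the duration of the resilver.
print(remaining_fault_tolerance("raidz2", 1))  # 1
```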

I hope it makes sense?

Apollo · Jan 31, 2020

SweetAndLow said:
Do you have SMART data for all your drives? It's strange that you have corruption in a RAIDZ2. That would mean 3 disks have problems. Your first disk also has a checksum error.

I don't think I would be too concerned about this issue. I recall, though I could be wrong, that this kind of error can be triggered by an iocage jail present in the iocage mount point of the volume.
If this really were file corruption, then the file name and folder location would have been provided.

There was a post some months ago with what I believe to be a similar issue.
I also seem to remember encountering this issue at some point:

Permanent errors on volume with hex codes

I have some permanent errors on one of my volumes, but these are shown as hex codes. For example: ... pool: oracle02 state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if...

www.ixsystems.com

SweetAndLow · Jan 31, 2020

Apollo said:
I don't think I would be too concerned about this issue. I recall, though I could be wrong, that this kind of error can be triggered by an iocage jail present in the iocage mount point of the volume.
If this really were file corruption, then the file name and folder location would have been provided.

There was a post some months ago with what I believe to be a similar issue.
I also seem to remember encountering this issue at some point:

Permanent errors on volume with hex codes

I have some permanent errors on one of my volumes, but these are shown as hex codes. For example: ... pool: oracle02 state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if...

www.ixsystems.com

It's usually metadata corruption, unless there is a new understanding.

ppmax · Feb 1, 2020

Thank you everyone for your comments and replies. I appreciate all the help!

Last night I powered down and rebooted which triggered another resilvering attempt. zpool status -v volume1:

Code:

  pool: volume1
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 710G in 0 days 06:51:46 with 2 errors on Sat Feb  1 04:05:41 2020
config:

    NAME                                              STATE     READ WRITE CKSUM
    volume1                                           DEGRADED     0     0     2
      raidz2-0                                        DEGRADED     0     0     4
        gptid/5dd66899-aabe-11e1-90d1-6805ca067062    ONLINE       0     0     1
        replacing-1                                   DEGRADED     0     0     0
          12211041995604376651                        UNAVAIL      0     0     0  was /dev/gptid/5e3d98a1-aabe-11e1-90d1-6805ca067062
          gptid/62dc881e-43c9-11ea-8b8b-6805ca067062  ONLINE       0     0     0
        gptid/5ea4a2db-aabe-11e1-90d1-6805ca067062    ONLINE       0     0     0
        gptid/5f14865b-aabe-11e1-90d1-6805ca067062    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/volume1/backups/Photos/2015/2015-02-01_14-21-22_31.cr2

Any opinion on which path I should take:

Delete the cr2 file, then reboot and resilver?
Power down, put the old drive back in, and scrub?
Power down, take the new drive out, and scrub?
Something else?

Upthread someone asked about SMART data for the other disks; I did smartctl -a for each of the 3 original disks in the pool and don't see any bad blocks...they all seem good.

As mentioned above, this is a 4 disk raid z2 pool. I think my mobo only has 4 SATA ports so I can't plug the original disk back in and resilver the new disk...Maybe I need to go poke around to see if there is any chance to add a 5th drive somehow?

[EDIT: Just checked...my mobo only has 4 SATA ports :( ]

Thanks again for the help--
PP

Apollo · Feb 2, 2020

If you have snapshots, you could run zfs diff from the first snapshot of the dataset that contains your corrupted file and see when the file was last added.
You can use the grep command to isolate the file name or the entire path including the name.
This should give a listing with "+", "-", or "M" characters at the beginning of each line. They indicate which operation was performed.
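
A rough Python sketch of that filtering step, for anyone who would rather script it than pipe through grep. It assumes the usual zfs diff line layout (a change character, whitespace, then the path); the function name changes_for and the sample lines are made up for illustration and are not real output from this pool.

```python
# Change characters printed by zfs diff at the start of each line.
CHANGE_TYPES = {"+": "added", "-": "removed", "M": "modified", "R": "renamed"}


def changes_for(diff_output, needle):
    """Return (change, path) pairs for diff lines whose path contains needle."""
    matches = []
    for line in diff_output.splitlines():
        parts = line.split(None, 1)   # change character, then the rest of the line
        if len(parts) != 2:
            continue
        marker, path = parts
        if marker in CHANGE_TYPES and needle in path:
            matches.append((CHANGE_TYPES[marker], path))
    return matches


# Hypothetical diff output; only the .cr2 path comes from this thread.
SAMPLE_DIFF = (
    "M\t/mnt/volume1/backups/Photos/2015\n"
    "+\t/mnt/volume1/backups/Photos/2015/2015-02-01_14-21-22_31.cr2\n"
    "-\t/mnt/volume1/backups/tmp/old.txt\n"
)

for change, path in changes_for(SAMPLE_DIFF, "2015-02-01_14-21-22_31.cr2"):
    print(change, path)
```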

If you can find the details about this file, then you can clone the corresponding snapshot (could be tedious) and copy the file to a new location, or to the location where it is meant to be, as shown in the status report.
I could be wrong and maybe the file cannot be recovered that way. If you have replicated snapshots somewhere, then you can see if you can recover the file from there.

However, until you resolve the "Unavailable" status, your pool will always be seen as degraded.
Adding the old disk back might fix the corrupted file but I don't have the experience of this particular issue.

ppmax · Feb 2, 2020

Thanks for the reply and for the input Apollo--much appreciated

This device contains backups of other device data. I don't have a backup of this device or snapshots. In addition to backups this pool also contains some original data that I'd like to try and preserve...but ultimately can be nuked and paved if necessary.

What I'd really like to do is figure out why the resilver process is producing errors...and fix those issues. I've run smartctl -t long on the 3 "original" drives (as well as on the new/replaced drive) and don't see any issues. The drive I replaced is the only drive that was reporting issues.

The resilver process reported there was a "permanent error" on a specific file; if I delete that file, power down, put the old drive back in, then do a scrub...would that potentially remedy this situation?

Thanks again--
PP

ppmax · Feb 4, 2020

Solved

95856316 · Sep 17, 2020

Could you share more information about how you solved the problem? I recently had the same problem. Thank you!
