Resilver Completed with 2 errors -- how can I delete/restore only the affected file?

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
The scrub finished with a different error -- i.e. no individual file highlighted as corrupted, yet the checksum error is on the new drive that replaced the failed one, as opposed to on all drives.

It seems that what went wrong was the resilvering process? Thoughts?

  pool: OVERWATCH1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 24K in 21:34:32 with 0 errors on Wed May 11 06:10:07 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH1                                      ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/b23acad3-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b412da76-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/5917f150-ca22-11ec-be73-000743649900  ONLINE       0     0     1
            gptid/b67f8bdd-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b69817bb-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b770f6e2-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b834db66-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b8610e24-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b9540f0b-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0

errors: No known data errors
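As far as I know, the way to see which individual files were hit is the verbose status output:

    zpool status -v OVERWATCH1

which lists the paths of files with permanent errors, if there are any. In this case it only reports "No known data errors", so there doesn't seem to be any per-file list to delete or restore from.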
 

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
Not sure what to do...

Step 1: added 3 more disks physically to the machine
Step 2: system rebooted as I moved the chassis back into place (I may have accidentally touched the reset button)
Step 3: system comes up, and the checksum error is now in a different pool, one driven off the mobo rather than the LSI controllers

  pool: OVERWATCH
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:02:46 with 0 errors on Sun Apr 10 00:02:46 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH                                       ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/a4beb54d-2df6-11ec-bd85-6805cac304ca  ONLINE       0     0     1
            gptid/a4ca6f2a-2df6-11ec-bd85-6805cac304ca  ONLINE       0     0     1

errors: No known data errors

  pool: OVERWATCH1
 state: ONLINE
  scan: scrub repaired 24K in 21:34:32 with 0 errors on Wed May 11 06:10:07 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH1                                      ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/b23acad3-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b412da76-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/5917f150-ca22-11ec-be73-000743649900  ONLINE       0     0     0
            gptid/b67f8bdd-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b69817bb-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b770f6e2-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b834db66-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b8610e24-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b9540f0b-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:26 with 0 errors on Thu May 5 03:45:26 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0
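If this turns out to be a transient glitch from the reboot, the ZFS-8000-9P action text above suggests clearing the counters and re-checking with a scrub, i.e. something like:

    zpool clear OVERWATCH
    zpool scrub OVERWATCH

but I'm holding off until I understand what is actually going on.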
 

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
Rebooted it again, now it is clean...

Thinking of Marathon Man... IS IT SAFE?

  pool: OVERWATCH
 state: ONLINE
  scan: scrub repaired 0B in 00:02:46 with 0 errors on Sun Apr 10 00:02:46 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH                                       ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/a4beb54d-2df6-11ec-bd85-6805cac304ca  ONLINE       0     0     0
            gptid/a4ca6f2a-2df6-11ec-bd85-6805cac304ca  ONLINE       0     0     0

errors: No known data errors

  pool: OVERWATCH1
 state: ONLINE
  scan: scrub repaired 24K in 21:34:32 with 0 errors on Wed May 11 06:10:07 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH1                                      ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/b23acad3-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b412da76-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/5917f150-ca22-11ec-be73-000743649900  ONLINE       0     0     0
            gptid/b67f8bdd-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b69817bb-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b770f6e2-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b834db66-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b8610e24-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b9540f0b-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:26 with 0 errors on Thu May 5 03:45:26 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0

errors: No known data errors
 

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
Okay, so I did a new install on the boot drive. I tried to use TrueNAS 13, but the install hung, so I went with 12.8.

Now running the HDD check, which should take a day.

smartctl -t long /dev/da0 . . 11
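In practice, something like this per disk (assuming the data drives on this box really are da0 through da11 -- adjust to whatever `smartctl --scan` lists):

    # kick off a long SMART self-test on each data disk
    for n in 0 1 2 3 4 5 6 7 8 9 10 11; do smartctl -t long /dev/da$n; done
    # once a test has finished, check that disk's self-test log
    smartctl -l selftest /dev/da0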

Then the plan is to rebuild the pools and replicate back from the 2nd server.
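Roughly a snapshot send/receive from the 2nd server once the pools are rebuilt (the dataset and host names below are placeholders, not my real ones):

    # run on the 2nd server: push the latest snapshot of the backup dataset
    zfs send -R backuppool/data@latest | ssh truenas1 zfs receive -F OVERWATCH1/data

or the equivalent replication task in the TrueNAS GUI.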
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Rebooted it again, now it is clean...
Did you resilver after the reboot? Error counts are not persistent.

The scrub finished with a different error -- i.e. no individual file highlighted as corrupted, yet the checksum error is on the new drive that replaced the failed one, as opposed to on all drives.
At this point, I suggest scrubbing again. If it comes back clean we can be fairly certain that something happened before/during the first resilver. That will leave the question of whether the files were stored correctly in the first place, replicated correctly, etc.
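Something along the lines of:

    zpool scrub OVERWATCH1

and then look at `zpool status -v` once it completes.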
 

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
Did you resilver after the reboot? Error counts are not persistent.

Oh. That explains it. I did not resilver or rerun a scrub after the reboot.

At this point, I suggest scrubbing again. If it comes back clean we can be fairly certain that something happened before/during the first resilver. That will leave the question of whether the files were stored correctly in the first place, replicated correctly, etc.

Here is the kicker... that 1-bit difference seems to have been fixed by the last scrub -- the first manual scrub, which you suggested, run after the resilvering. Another factor is that I did a clean install of TrueNAS 12.8 (the 13.0 install hung), so if there were any bad OS files, those would be fixed now.

I will run a scrub on the drives once the SMART long tests are done.

I added more disks. In an hour or two, besides the copy of the data in the problem pool and the replicated copy, I'll have a third copy "restored" from the replication. The plan was to rebuild the pool while keeping two "good" copies of the data at all times, but I could also use the third copy for a diff over the entire dataset.
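For the diff I was thinking of a checksum-based dry run along these lines (paths are only illustrative, not my actual mount points):

    # -r recurse, -c compare full checksums, -n dry run, -i itemize what differs
    rsync -rcni /mnt/OVERWATCH1/data/ /mnt/RESTORE/data/

though reading every byte on both sides is presumably also why these compares take forever on multi-TiB datasets.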
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, contact your local exorcist or voodoo practitioner for assistance with a cleansing ritual.
 

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
Well, the saga comes to an end... hopefully.
  • zpool status comes back clean
  • smartctl long tests came back clean
Actions performed:
  • Resilvering after replacing failed disk
  • Scrub (24K repaired)
  • Fresh OS/TrueNAS install <-- EDIT, forgot to mention this
  • Scrub (clean)
  • Short and long smartctl tests (clean)
Could not find a way to compare the datasets' contents; even smallish 6 TiB compares did not finish after a few days. It would have been helpful if the scrub told us which files were affected by the 24K "repair" -- since there should have been none. We could have, perhaps, compared only those files instead of losing confidence in the entire dataset.

I'm recreating the dataset from a replicated copy. Network utilization indicates that Z3 is 1/3 of the speed of a 3-disk stripe with no redundancy (which I replicated last week to have two "live" copies of the data at all times). I suppose this is telling us that the bottleneck is disk writing. It is going to take a looong time to complete.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Network utilization indicates that Z3 is 1/3 of the speed of a 3-disk stripe with no redundancy
Pretty much as expected for an IOPS-bound scenario: a raidz vdev delivers roughly the IOPS of a single member disk, while a 3-disk stripe gets roughly three disks' worth, so a ~1/3 ratio lines up.
 