Resilver Completed with 2 errors -- how can I delete/restore only the affected file?

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
The scrub finished with a different error -- i.e. no individual file highlighted as corrupted, yet the checksum error is on the new drive that replaced the failed one, as opposed to on all drives.

It seems that what went wrong was the resilvering process? Thoughts?

  pool: OVERWATCH1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 24K in 21:34:32 with 0 errors on Wed May 11 06:10:07 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH1                                      ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/b23acad3-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b412da76-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/5917f150-ca22-11ec-be73-000743649900  ONLINE       0     0     1
            gptid/b67f8bdd-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b69817bb-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b770f6e2-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b834db66-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b8610e24-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b9540f0b-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0

errors: No known data errors
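As far as I know, the way to see which individual files were hit is the verbose status output:

    zpool status -v OVERWATCH1

which lists the paths of files with permanent errors, if there are any. In this case it only reports "No known data errors", so there doesn't seem to be any per-file list to delete or restore from.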
 

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
Not sure what to do...

Step 1: added 3 more disks physically to the machine
Step 2: system rebooted as I moved the chassis back into place (I may have accidentally touched the reset button)
Step 3: system comes up, and the checksum error is now in a different pool, one driven off the mobo rather than the LSI controllers

  pool: OVERWATCH
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:02:46 with 0 errors on Sun Apr 10 00:02:46 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH                                       ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/a4beb54d-2df6-11ec-bd85-6805cac304ca  ONLINE       0     0     1
            gptid/a4ca6f2a-2df6-11ec-bd85-6805cac304ca  ONLINE       0     0     1

errors: No known data errors

  pool: OVERWATCH1
 state: ONLINE
  scan: scrub repaired 24K in 21:34:32 with 0 errors on Wed May 11 06:10:07 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH1                                      ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/b23acad3-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b412da76-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/5917f150-ca22-11ec-be73-000743649900  ONLINE       0     0     0
            gptid/b67f8bdd-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b69817bb-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b770f6e2-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b834db66-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b8610e24-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b9540f0b-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:26 with 0 errors on Thu May 5 03:45:26 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0
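If this turns out to be a transient glitch from the reboot, the ZFS-8000-9P action text above suggests clearing the counters and re-checking with a scrub, i.e. something like:

    zpool clear OVERWATCH
    zpool scrub OVERWATCH

but I'm holding off until I understand what is actually going on.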
 

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
Rebooted it again, now it is clean...

Thinking of Marathon Man... IS IT SAFE?

  pool: OVERWATCH
 state: ONLINE
  scan: scrub repaired 0B in 00:02:46 with 0 errors on Sun Apr 10 00:02:46 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH                                       ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/a4beb54d-2df6-11ec-bd85-6805cac304ca  ONLINE       0     0     0
            gptid/a4ca6f2a-2df6-11ec-bd85-6805cac304ca  ONLINE       0     0     0

errors: No known data errors

  pool: OVERWATCH1
 state: ONLINE
  scan: scrub repaired 24K in 21:34:32 with 0 errors on Wed May 11 06:10:07 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH1                                      ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/b23acad3-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b412da76-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/5917f150-ca22-11ec-be73-000743649900  ONLINE       0     0     0
            gptid/b67f8bdd-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b69817bb-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b770f6e2-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b834db66-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b8610e24-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b9540f0b-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:26 with 0 errors on Thu May 5 03:45:26 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0

errors: No known data errors
 

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
Okay, so I did a new install on the boot drive. I tried to use TrueNAS 13, but the install hung, so I went with 12.8.

Now running the HDD check, which should take a day.

smartctl -t long /dev/da0 . . 11
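In practice, something like this per disk (assuming the data drives on this box really are da0 through da11 -- adjust to whatever `smartctl --scan` lists):

    # kick off a long SMART self-test on each data disk
    for n in 0 1 2 3 4 5 6 7 8 9 10 11; do smartctl -t long /dev/da$n; done
    # once a test has finished, check that disk's self-test log
    smartctl -l selftest /dev/da0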

Then the plan is to rebuild the pools and replicate back from the 2nd server.
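Roughly a snapshot send/receive from the 2nd server once the pools are rebuilt (the dataset and host names below are placeholders, not my real ones):

    # run on the 2nd server: push the latest snapshot of the backup dataset
    zfs send -R backuppool/data@latest | ssh truenas1 zfs receive -F OVERWATCH1/data

or the equivalent replication task in the TrueNAS GUI.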
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Rebooted it again, now it is clean...
Did you resilver after the reboot? Error counts are not persistent.

The scrub finished with a different error -- i.e. no individual file highlighted as corrupted, yet the checksum error is on the new drive that replaced the failed one, as opposed to on all drives.
At this point, I suggest scrubbing again. If it comes back clean we can be fairly certain that something happened before/during the first resilver. That will leave the question of whether the files were stored correctly in the first place, replicated correctly, etc.
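Something along the lines of:

    zpool scrub OVERWATCH1

and then look at `zpool status -v` once it completes.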
 

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
Did you resilver after the reboot? Error counts are not persistent.

Oh. That explains it. I did not resilver or rerun a scrub after the reboot.

At this point, I suggest scrubbing again. If it comes back clean we can be fairly certain that something happened before/during the first resilver. That will leave the question of whether the files were stored correctly in the first place, replicated correctly, etc.

Here is the kicker... that 1-bit difference seems to have been fixed by the last scrub -- the first manual scrub, which you suggested, run after the resilvering. Another factor is that I did a clean install of TrueNAS 12.8 (the 13.0 install hung), so if there were any bad OS files, those would be fixed now.

I will run a scrub on the drives once the SMART long tests are done.

I added more disks. In an hour or two, besides the copy of the data in the problem pool and the replicated copy, I'll have a third copy "restored" from the replication. The plan was to rebuild the pool while keeping two "good" copies of the data at all times, but I could also use the third copy for a diff over the entire dataset.
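For the diff I was thinking of a checksum-based dry run along these lines (paths are only illustrative, not my actual mount points):

    # -r recurse, -c compare full checksums, -n dry run, -i itemize what differs
    rsync -rcni /mnt/OVERWATCH1/data/ /mnt/RESTORE/data/

though reading every byte on both sides is presumably also why these compares take forever on multi-TiB datasets.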
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, contact your local exorcist or voodoo practitioner for assistance with a cleansing ritual.
 

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
Well, the saga comes to an end... hopefully.
  • zpool status comes back clean
  • smartctl long tests came back clean
Actions performed:
  • Resilvering after replacing failed disk
  • Scrub (24K repaired)
  • Fresh OS/TrueNAS install <-- EDIT, forgot to mention this
  • Scrub (clean)
  • Short and long smartctl tests (clean)
Could not find a way to compare the datasets' contents; even smallish 6 TiB compares did not finish after a few days. It would have been helpful if the scrub told us which files were affected by the 24K "repair" -- since there should have been none. We could have, perhaps, compared only those files instead of losing confidence in the entire dataset.

I'm recreating the dataset from a replicated copy. Network utilization indicates that Z3 is 1/3 of the speed of a 3-disk stripe with no redundancy (which I replicated last week to have two "live" copies of the data at all times). I suppose this is telling us that the bottleneck is disk writing. It is going to take a looong time to complete.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Network utilization indicates that Z3 is 1/3 of the speed of a 3-disk stripe with no redundancy
Pretty much as expected for an IOPS-bound scenario: a raidz vdev delivers roughly the IOPS of a single member disk, while a 3-disk stripe gets roughly three disks' worth, so a ~1/3 ratio lines up.
 