Resilver Completed with 2 errors -- how can I delete/restore only the affected file?

Phase

Explorer
Hi everyone!

One disk failed. The resilvering took 3.5 days. After completion "sudo zpool status -v" yields:

Code:
  pool: OVERWATCH1
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption.
        Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 7.80T in 3 days 03:21:23 with 2 errors on Thu May 5 13:37:15 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH1                                      ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/b23acad3-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/b412da76-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/5917f150-ca22-11ec-be73-000743649900  ONLINE       0     0     4
            gptid/b67f8bdd-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/b69817bb-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/b770f6e2-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/b834db66-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/b8610e24-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/b9540f0b-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        OVERWATCH1/<OLD FILE PATH>/<OLD FILE NAME>

I would like to delete the file and make the error go away. The path and name of the file indicate it is from an old snapshot and that it no longer exists in the current live file system. If this affects one of the existing files in some way, I have a replicated copy of the entire pool and could copy the individual file back from there. Though, from the name it is not an important file and I can live without it.

Also, it says "Restore the file in question if possible". Not sure how to go about doing that.

Thoughts? Thanks!!
 

sretalla

Powered by Neutrality
Moderator
errors: Permanent errors have been detected in the following files: OVERWATCH1/<OLD FILE PATH>/<OLD FILE NAME>
Are you sure there's not a snapshot ... pool/dataset@snapshot somewhere in the path?

If the file really weren't there anymore, ZFS couldn't be reporting corruption in it.
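
Something like this (pool name taken from your status output, so adjust as needed) will list every snapshot on the pool along with when it was taken, so you can spot which one is holding it:

Code:
# List all snapshots on the pool and their creation times
zfs list -t snapshot -o name,creation -r OVERWATCH1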
 

Phase

Explorer
Yes! You are correct! The path starts with "OVERWATCH1/Bravo@auto-2021-08-13_18-58:/"

Does that help?

The last time I deleted a snapshot, I think it somehow invalidated the replication target (the copy going to the other machine) and I had to replicate the entire pool from scratch.
 

sretalla

Powered by Neutrality
Moderator
Well, you have a snapshot with corruption in it and you can't really edit it to just remove that one file.

Maybe somebody who has messed around a lot more with replication can guide you through somehow bringing back the old replica of the snapshot from the replication target... I'm not sure if that can really work though.

Otherwise, destroy the snapshot and start the replication again.
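
If you go that route, roughly this (snapshot name from your earlier post, mountpoint guessed at the TrueNAS default, so double-check both before running the destroy):

Code:
# The snapshot is browsable read-only under the hidden .zfs directory, if you want to look first
ls -l '/mnt/OVERWATCH1/Bravo/.zfs/snapshot/auto-2021-08-13_18-58/'

# Dry run of the destroy (-n = don't actually do it, -v = show what would go)
zfs destroy -nv OVERWATCH1/Bravo@auto-2021-08-13_18-58

# Then the real thing
zfs destroy OVERWATCH1/Bravo@auto-2021-08-13_18-58

As far as I understand, as long as the replication task still has a more recent snapshot that exists on both sides, the incrementals should keep working; you only have to start from scratch if no common snapshot is left.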
 

Phase

Explorer
I see. Thanks!

Is there a way to delete the snapshot without breaking the replication?

Or perhaps I can just let the error linger, and in 4 months, after the snapshot auto-deletes (1-year retention), the error will go away?

Why does this happen? Shouldn't the periodic scrub have picked that up, even before the bad disk failed?

Thanks!!!
 

jgreco

Resident Grinch
Though, from the name it is not an important file and I can live without it.

A more pertinent question might be, how did you manage to lose data on a RAIDZ3?

Can you please provide a hardware manifest of your system, specifically detailing chassis, controllers, etc.?
 

Phase

Explorer
Sure,

It's an i7-11700K with an ASUS Prime Z590-A mobo and 32GB of RAM (see pic), running TrueNAS-12.0-U8.

It has 3 pools:
- Pool #1 is the storage itself with 9 HDD and 2 LSI controllers
- - 9 x 14.0TB Toshiba TOSMG07ACA14TE
- - 2 x LSI SAS9211-8i
- Pool #2 is a boot SSD pool on the mobo controller
- Pool #3 is an iocage SSD pool on the mobo controller

It has 2 network cards
- Chelsio T62100-SO-CR: 2-port 40/50/100GbE Server Offload Adapter w/ Chelsio SM40G-SR: 40G Short Reach QSFP+ Optical Module
- Intel EXPI9301CT Gigabit PCIe

Power
- PS is a Corsair RM850x
- UPS is an APC STR1000XLA (this is the APC double-conversion, high-density line for servers and networking; this unit supplies 2 servers of similar size plus the network equipment)

bios.png
 

Phase

Explorer
The HDDs are in shock-absorbent wiggly mounts and are kept very cool by 4 Gentle Typhoon front fans.
The boards have their own fan blowing air on them, mostly because the Chelsio NIC gets hot and is passively cooled.

chassis.png


ups.png
 

jgreco

Resident Grinch

Well damn. Usually it's easy to find a culprit or two. My only criticism would be the lack of ECC, which could potentially cause an occasional problem. Which could be what this is. But it's so much nicer to have a specific correction to suggest.

Your HBAs have vented blanks, and while the airflow may not be optimal, they do have clear space, so overheating is not likely. Like the Chelsio, the HBAs are server cards and expect airflow to be forced over them. The fan should be sufficient.

@Ericloewe any thoughts?
 

Ericloewe

Server Wrangler
Moderator
Yeah, this is a bit of a bizarre one. It would be helpful to have a better idea of when the corruption happened, but my general thoughts are:
  • Overheating cards are not my first suspect here because the damage seems minimal and very specific. 2-4 bad blocks on every single disk, seemingly affecting a single file? I can't imagine that happening inside the HBA.
  • So, random bit flip?
    • It would account for the limited nature of the problem...
    • But when did it happen? When the file was originally written? It would have to be deep in the bowels of ZFS, otherwise the file would just be silently corrupted at the POSIX layer of abstraction. Even then, how would that happen with RAIDZ3? I do have a scenario in mind, but it's contrived and relies on unreleased code:
      • The RAIDZ rebuild code for pools that underwent RAIDZ expansion has a special case where it could conceivably try an absurd number of combinations of the data on the available disks (on the order of thousands) until one matches the checksum, so that was made to just error out with a TLER-esque justification.
    • So... The mythical Scrub of Death? In this case it would be a Scrub of Moderate Annoyance, since it's limited to a single file that's no longer relevant in an old snapshot. But here's the thing: if ZFS did not manage to reconstruct the offending blocks, it shouldn't have written anything... And if the on-disk data is actually good, a second scrub would find nothing wrong and clear the error. @Phase, I'd suggest running another scrub to see what happens.
  • What else could it have been? It would have to be time-limited and affect all 9 disks in a way that they wouldn't know any better.
    • A mechanical shock that messed with the disks' heads just enough? I can't say I've heard of this happening on a modern disk, but visibility into what goes on in there is just as bad as with SSDs, which is saying something. Not very likely in my mind.
    • Bad power? This one feels deeply unsatisfying in the best of times due to how little visibility one has into this. In this case? With a quality PSU, sitting behind an APC UPS, one that's doing double-conversion, localized entirely within the OP's kitchen? Let's move on from this one.
    • Of course, RAIDZ introduces a highly-predictable correlation of block offsets to on-disk sectors, so this could be something ultra-bizarre, like a bug affecting only specific LBAs - since we haven't heard of any such thing in LSI's mps stack, it would likely have to be in the HDDs. Maybe some remnants of debug code?
    • It could also be something mundane and stupid, like all disks having a bad area in precisely the same location. How? A slight misalignment in a manufacturing step and careless QC by Toshiba. I said mundane and stupid, not likely. But at least it's a testable hypothesis, unlike most of my rambling.
 

sretalla

Powered by Neutrality
Moderator
The errors are all in the checksum columns... wouldn't that point at cabling also? (although that would mean at least 2 cables in this case)
 

jgreco

Resident Grinch
  • Overheating cards are not my first suspect here because the damage seems minimal and very specific. 2-4 bad blocks on every single disk, seemingly affecting a single file? I can't imagine that happening inside the HBA.

Well, that is unlikely to be overheating, I think we agree. Overheating does indeed tend to cause evenly spewed garbage into a pool, but this is so specific...

However, is the correct firmware installed? This damage resembles what we often see with a RAID controller that has dropped some data. If stuff is in-flight and there's a controller reset, the usual culprit is that some of the in-flight data in writeback is lost, or something like that. A 9211 doesn't have writeback, but we do know that it's important for the correct firmware to be used. This is actually where I was expecting to go with my request for a hardware manifest. A 9240 in MegaRAID (MFI) mode, I could see this happening with.
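
If you want to confirm what's flashed, something along these lines should report it (sas2flash is the LSI utility; I'm going from memory on the exact flags, so check its help output):

Code:
# Show firmware/BIOS versions for all attached SAS2 controllers
sas2flash -listall
# More detail per controller (controllers 0 and 1 here)
sas2flash -list -c 0
sas2flash -list -c 1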

And if the on-disk data is actually good, a second scrub would find nothing wrong and clear the error. @Phase, I'd suggest running another scrub to see what happens.

I think this is a good idea, to help clarify what the reality is.
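
Roughly this, for the record (the clear at the end only if the second scrub comes back clean and the error entry is gone):

Code:
zpool scrub OVERWATCH1
# Check on it while it runs / after it finishes
zpool status -v OVERWATCH1
# Only once it finishes clean: reset the error counters
zpool clear OVERWATCH1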
 

Phase

Explorer
Hi, Thanks for looking into this!

Scrub: I started the scrub. 5% done, it has 19 hours to go.

Firmware: The firmware for the LSI cards is 20.00.07.00-IT

Cabling: Using 3 cables, each with 4 SATA whips. One controller drives 8 drives, the second controller drives only one drive. The drive that failed and was successfully replaced/resilvered was attached to the first controller.

Drive Origin: 3 came from Newegg Business and 6 came from OWC (macsales). Orders were placed over the course of 2 months.

When did the issue happen: I have not been able to find a way to look at previous scrub results or at previous alerts. So I do not know if the issue was previously flagged or was only flagged as part of the resilvering process.

"grep -i scrub * | grep OVERWATCH1" on /var/log yields
Code:
middlewared.log:[2021/10/31 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4183 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 10, 17, 0, 0, 7)
middlewared.log:[2021/11/07 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4183 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 10, 17, 0, 0, 7)
middlewared.log:[2021/11/14 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4183 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 10, 17, 0, 0, 7)
middlewared.log:[2021/11/28 00:00:02] (DEBUG) middlewared.plugins.pool.__run():4183 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 11, 21, 0, 0, 7)
middlewared.log:[2021/12/05 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4183 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 11, 21, 0, 0, 7)
middlewared.log:[2021/12/12 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 11, 21, 0, 0, 7)
middlewared.log:[2021/12/19 00:00:02] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 11, 21, 0, 0, 7)
middlewared.log:[2022/01/02 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 12, 26, 0, 0, 5)
middlewared.log:[2022/01/09 00:00:02] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 12, 26, 0, 0, 5)
middlewared.log:[2022/01/16 00:00:00] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 12, 26, 0, 0, 5)
middlewared.log:[2022/01/23 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 12, 26, 0, 0, 5)
middlewared.log:[2022/02/06 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 1, 30, 0, 0, 6)
middlewared.log:[2022/02/13 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 1, 30, 0, 0, 6)
middlewared.log:[2022/02/20 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 1, 30, 0, 0, 6)
middlewared.log:[2022/02/27 00:00:02] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 1, 30, 0, 0, 6)
middlewared.log:[2022/03/13 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 3, 6, 0, 0, 7)
middlewared.log:[2022/03/20 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 3, 6, 0, 0, 7)
middlewared.log:[2022/03/27 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 3, 6, 0, 0, 7)
middlewared.log:[2022/04/03 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 3, 6, 0, 0, 7)
middlewared.log:[2022/04/17 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 4, 10, 0, 0, 7)
middlewared.log:[2022/04/24 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 4, 10, 0, 0, 7)
middlewared.log:[2022/05/01 00:00:02] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 4, 10, 0, 0, 7)
middlewared.log:[2022/05/08 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 5, 1, 20, 51, 54)


Thanks again!
 

Ericloewe

Server Wrangler
Moderator
However, is the correct firmware installed? This damage resembles what we often see with a RAID controller that has dropped some data. If stuff is in-flight and there's a controller reset, the usual culprit is that some of the in-flight data in writeback is lost, or something like that. A 9211 doesn't have writeback, but we do know that it's important for the correct firmware to be used. This is actually where I was expecting to go with my request for a hardware manifest. A 9240 in MegaRAID (MFI) mode, I could see this happening with.
I thought along those lines too but ended up forgetting to mention the firmware. I don't remember this sort of behavior in old pre-P20.00.07 firmware versions, but the drivers have been updated since then, so who knows what might be happening in there. The damage feels too contained to be a full controller reset (and the OS should pick up the disk write failures and pass the error on to ZFS, which would inevitably fault the pool and likely panic the system). Also, two controllers mean this is insanely unlikely, even with bad power. It would never be this coordinated.
 

jgreco

Resident Grinch
The damage feels too contained to be a full controller reset (and the OS should pick up the disk write failures and pass the error on to ZFS, which would inevitably fault the pool and likely panic the system)

Well THAT is probably overly dramatic, it shouldn't fault the pool for a handful of errors.

Also, two controllers mean this is insanely unlikely, even with bad power. It would never be this coordinated.

But with one disk reporting 4 errors and the others reporting 2, some sort of driver boggle that affected the single drive slightly differently is not totally inconceivable.

Unlikely, yes. Grasping at straws. I am not so much of an LSI fanboi (as you well know) as to think that it's impossible for it to go awry.
 

Phase

Explorer
Weird... 20% done and 16 hours to go, it shows no checksum errors so far, but that same snapshot file is already mentioned

Code:
  pool: OVERWATCH1
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption.
        Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Tue May 10 08:35:35 2022
        17.7T scanned at 1.25G/s, 15.3T issued at 1.09G/s, 76.1T total
        0B repaired, 20.09% done, 15:56:37 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH1                                      ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/b23acad3-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b412da76-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/5917f150-ca22-11ec-be73-000743649900  ONLINE       0     0     0
            gptid/b67f8bdd-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b69817bb-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b770f6e2-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b834db66-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b8610e24-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b9540f0b-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        OVERWATCH1/Bravo@auto-2021-08-13_18-58:/<PATH>/<FILENAME>
 

sretalla

Powered by Neutrality
Moderator
When did the issue happen: I have not been able to find a way to look at previous scrub results or at previous alerts. So I do not know if the issue was previously flagged or was only flagged as part of the resilvering process.

"grep -i scrub * | grep OVERWATCH1" on /var/log
Good idea, but zpool history OVERWATCH1 | grep scrub will give you the answer you were after... well, at least the actual scrub run dates and times.

I don't have a pool with scrub errors handy to check whether an error would actually show up in that output... zpool history has a lot of information, but it's more about the commands that were run than their outcomes.
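
If you want to dig a bit deeper, I believe the -i flag includes the internally logged events (scan start/finish and such), though I can't say how much it records about the errors themselves:

Code:
zpool history -i OVERWATCH1 | grep -i scan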
 

Phase

Explorer
Thanks!

Yet... the plot thickens...

I brought current copies of 5 files (38GB) that would have been "adjacent" to that snapshot entry. By adjacent I mean 4 files in that same folder and 1 file in the parent folder.

There is a 1-bit difference in the file in the parent folder between the resilvered pool and the previously replicated pool:

Code:
fc /b replicated_copy resilvered_file
Comparing files replicated_copy and resilvered_file
2F522400: 9F 9E

UPDATE: For better or worse, the replication has been running every night, even after the resilver:

Code:
zettarepl.log:[2022/05/08 00:00:00] INFO [Thread-36] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.4-hpn14v15)
zettarepl.log:[2022/05/08 00:00:00] INFO [Thread-36] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) successful!
zettarepl.log:[2022/05/08 00:00:01] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing push from 'OVERWATCH1/Bravo' to 'OVERVOLT1/Juliet' of snapshot='auto-2022-05-08_00-00' incremental_base='auto-2022-05-07_00-00' receive_resume_token=None encryption=False
zettarepl.log:[2022/05/09 00:00:00] INFO [Thread-41] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.4-hpn14v15)
zettarepl.log:[2022/05/09 00:00:00] INFO [Thread-41] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) successful!
zettarepl.log:[2022/05/09 00:00:05] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing push from 'OVERWATCH1/Bravo' to 'OVERVOLT1/Juliet' of snapshot='auto-2022-05-09_00-00' incremental_base='auto-2022-05-08_00-00' receive_resume_token=None encryption=False
zettarepl.log:[2022/05/10 00:00:00] INFO [Thread-6] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.4-hpn14v15)
zettarepl.log:[2022/05/10 00:00:01] INFO [Thread-6] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) successful!
zettarepl.log:[2022/05/10 00:00:05] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing push from 'OVERWATCH1/Bravo' to 'OVERVOLT1/Juliet' of snapshot='auto-2022-05-10_00-00' incremental_base='auto-2022-05-09_00-00' receive_resume_token=None encryption=False
 

Ericloewe

Server Wrangler
Moderator
Well THAT is probably overly dramatic, it shouldn't fault the pool for a handful of errors.
It should if it loses all disks at once. Granted, if the controller recovers quickly enough, the pool shouldn't end up faulted.
Weird... 20% done and 16 hours to go, it shows no checksum errors so far, but that same snapshot file is already mentioned
I've never actually looked into the intricacies of what gets updated when. Could simply be stale info.
Thanks!

Yet... the plot thickens...

I brought current copies of 5 files (38GB) that would have been "adjacent" to that snapshot entry. By adjacent I mean 4 files in that same folder and 1 file in the parent folder.

There is a 1-bit difference in the file in the parent folder between the resilvered pool and the previously replicated pool:

Code:
fc /b replicated_copy resilvered_file
Comparing files replicated_copy and resilvered_file
2F522400: 9F 9E
Interesting... Which one is correct? Do I understand correctly that these files still exist in the current dataset state?
 

Phase

Explorer
Interesting... Which one is correct? Do I understand correctly that these files still exist in the current dataset state?

Yes, they are both in the live/current file system: one on the replication source and the other on the replication target. I cannot tell which is correct just from looking at the files. However...

The replication uses lz4 to transmit data.
The pool in the target machine is also lz4.
The pool in the source machine is not compressed.

That seems to indicate that the issue originated in the source machine... which is the one we have been discussing.
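
In case anyone wants to verify those settings, something like this should show them (dataset names as in the replication log above; the target is reachable over ssh as in the rsync command below):

Code:
# Source side
zfs get compression,compressratio OVERWATCH1/Bravo
# Target side, over ssh
ssh 192.168.4.7 zfs get compression,compressratio OVERVOLT1/Juliet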

I'm attempting to find any other differences with rsync, but it seems to take forever. I'll try again once the scrub is over.

Code:
rsync -n -arc --no-perms --no-owner --no-group --no-times '/mnt/OVERWATCH1/Bravo/PATH/' '192.168.4.7:/mnt/OVERVOLT1/Juliet/PATH/'
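
I realize that with -c, rsync has to read and checksum every file on both ends anyway, so it's essentially another full pass over that data. An alternative I might try (untested sketch, assuming FreeBSD's sha256 -r "hash filename" output format; the list file names are just placeholders) is to hash each side locally and then diff the lists:

Code:
# Run on each machine from inside the directory being compared,
# then copy one list over and diff the two
find . -type f -exec sha256 -r {} + | sort -k 2 > /tmp/hashes.txt
diff /tmp/hashes_source.txt /tmp/hashes_target.txt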
 