Resilver Completed with 2 errors -- how can I delete/restore only the affected file?

Phase

Explorer
Hi everyone!

One disk failed. The resilvering took 3.5 days. After completion "sudo zpool status -v" yields:

Code:
  pool: OVERWATCH1
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption.
        Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 7.80T in 3 days 03:21:23 with 2 errors on Thu May 5 13:37:15 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH1                                      ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/b23acad3-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/b412da76-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/5917f150-ca22-11ec-be73-000743649900  ONLINE       0     0     4
            gptid/b67f8bdd-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/b69817bb-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/b770f6e2-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/b834db66-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/b8610e24-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2
            gptid/b9540f0b-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        OVERWATCH1/<OLD FILE PATH>/<OLD FILE NAME>

I would like to delete the file and make the error go away. The path and name of the file indicate it is from an old snapshot and that it no longer exists in the current live file system. If this affects one of the existing files in some way, I have a replicated copy of the entire pool and could copy the individual file back from there. Though, from the name it is not an important file and I can live without it.

Also, it says "Restore the file in question if possible". Not sure how to go about doing that.

Thoughts? Thanks!!
 

sretalla

Powered by Neutrality
Moderator
errors: Permanent errors have been detected in the following files: OVERWATCH1/<OLD FILE PATH>/<OLD FILE NAME>
Are you sure there's not a snapshot ... pool/dataset@snapshot somewhere in the path?

If the file really weren't there anymore, ZFS couldn't be reporting corruption in it.
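
Something like this (pool name taken from your status output, so adjust as needed) will list every snapshot on the pool along with when it was taken, so you can spot which one is holding it:

Code:
# List all snapshots on the pool and their creation times
zfs list -t snapshot -o name,creation -r OVERWATCH1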
 

Phase

Explorer
Yes! You are correct! The path starts with "OVERWATCH1/Bravo@auto-2021-08-13_18-58:/"

Does that help?

The last time I deleted a snapshot, I think it somehow invalidated the replication target (the copy going to the other machine) and I had to replicate the entire pool from scratch.
 

sretalla

Powered by Neutrality
Moderator
Well, you have a snapshot with corruption in it and you can't really edit it to just remove that one file.

Maybe somebody who has messed around a lot more with replication can guide you through somehow bringing back the old replica of the snapshot from the replication target... I'm not sure if that can really work though.

Otherwise, destroy the snapshot and start the replication again.
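
If you go that route, roughly this (snapshot name from your earlier post, mountpoint guessed at the TrueNAS default, so double-check both before running the destroy):

Code:
# The snapshot is browsable read-only under the hidden .zfs directory, if you want to look first
ls -l '/mnt/OVERWATCH1/Bravo/.zfs/snapshot/auto-2021-08-13_18-58/'

# Dry run of the destroy (-n = don't actually do it, -v = show what would go)
zfs destroy -nv OVERWATCH1/Bravo@auto-2021-08-13_18-58

# Then the real thing
zfs destroy OVERWATCH1/Bravo@auto-2021-08-13_18-58

As far as I understand, as long as the replication task still has a more recent snapshot that exists on both sides, the incrementals should keep working; you only have to start from scratch if no common snapshot is left.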
 

Phase

Explorer
I see. Thanks!

Is there a way to delete the snapshot without breaking the replication?

Or perhaps I can just let the error linger, and in 4 months, after the snapshot auto-deletes (1-year retention), the error will go away?

Why does this happen? Shouldn't the periodic scrub have picked that up, even before the bad disk failed?

Thanks!!!
 

jgreco

Resident Grinch
Though, from the name it is not an important file and I can live without it.

A more pertinent question might be, how did you manage to lose data on a RAIDZ3?

Can you please provide a hardware manifest of your system, specifically detailing chassis, controllers, etc.?
 

Phase

Explorer
Sure,

It's an i7-11700K with an ASUS Prime Z590-A mobo and 32GB of RAM (see pic), running TrueNAS-12.0-U8.

It has 3 pools:
- Pool #1 is the storage itself with 9 HDD and 2 LSI controllers
- - 9 x 14.0TB Toshiba TOSMG07ACA14TE
- - 2 x LSI SAS9211-8i
- Pool #2 is a boot SSD pool on the mobo controller
- Pool #3 is an iocage SSD pool on the mobo controller

It has 2 network cards
- Chelsio T62100-SO-CR: 2-port 40/50/100GbE Server Offload Adapter w/ Chelsio SM40G-SR: 40G Short Reach QSFP+ Optical Module
- Intel EXPI9301CT Gigabit PCIe

Power
- PS is a Corsair RM850x
- UPS is an APC STR1000XLA (this is the APC double-conversion, high-density line for servers and networking; this unit supplies 2 servers of similar size plus the network equipment)

bios.png
 

Phase

Explorer
The HDDs are in shock-absorbent wiggly mounts and are kept very cool by 4 Gentle Typhoon front fans.
The boards have their own fan blowing air on them, mostly because the Chelsio NIC gets hot and is passively cooled.

chassis.png


ups.png
 

jgreco

Resident Grinch

Well damn. Usually it's easy to find a culprit or two. My only criticism would be the lack of ECC, which could potentially cause an occasional problem. Which could be what this is. But it's so much nicer to have a specific correction to suggest.

Your HBAs have vented blanks, and while the airflow may not be optimal, they do have clear space, so overheating is not likely. Like the Chelsio, the HBAs are server cards and expect airflow to be forced over them. The fan should be sufficient.

@Ericloewe any thoughts?
 

Ericloewe

Server Wrangler
Moderator
Yeah, this is a bit of a bizarre one. It would be helpful to have a better idea of when the corruption happened, but my general thoughts are:
  • Overheating cards are not my first suspect here because the damage seems minimal and very specific. 2-4 bad blocks on every single disk, seemingly affecting a single file? I can't imagine that happening inside the HBA.
  • So, random bit flip?
    • It would account for the limited nature of the problem...
    • But when did it happen? When the file was originally written? It would have to be deep in the bowels of ZFS, otherwise the file would just be silently corrupted at the POSIX layer of abstraction. Even then, how would that happen with RAIDZ3? I do have a scenario in mind, but it's contrived and relies on unreleased code:
      • The RAIDZ rebuild code for pools that underwent RAIDZ expansion has a special case where it could conceivably try an absurd number of combinations of the data on the available disks (on the order of thousands) until one matches the checksum, so that was made to just error out with a TLER-esque justification.
    • So... The mythical Scrub of Death? In this case it would be a Scrub of Moderate Annoyance, since it's limited to a single file that's no longer relevant in an old snapshot. But here's the thing: if ZFS did not manage to reconstruct the offending blocks, it shouldn't have written anything... And if the on-disk data is actually good, a second scrub would find nothing wrong and clear the error. @Phase, I'd suggest running another scrub to see what happens.
  • What else could it have been? It would have to be time-limited and affect all 9 disks in a way that they wouldn't know any better.
    • A mechanical shock that messed with the disks' heads just enough? I can't say I've heard of this happening on a modern disk, but visibility into what goes on in there is just as bad as with SSDs, which is saying something. Not very likely in my mind.
    • Bad power? This one feels deeply unsatisfying in the best of times due to how little visibility one has into this. In this case? With a quality PSU, sitting behind an APC UPS, one that's doing double-conversion, localized entirely within the OP's kitchen? Let's move on from this one.
    • Of course, RAIDZ introduces a highly-predictable correlation of block offsets to on-disk sectors, so this could be something ultra-bizarre, like a bug affecting only specific LBAs - since we haven't heard of any such thing in LSI's mps stack, it would likely have to be in the HDDs. Maybe some remnants of debug code?
    • It could also be something mundane and stupid, like all disks having a bad area in precisely the same location. How? A slight misalignment in a manufacturing step and careless QC by Toshiba. I said mundane and stupid, not likely. But at least it's a testable hypothesis, unlike most of my rambling.
 

sretalla

Powered by Neutrality
Moderator
The errors are all in the checksum columns... wouldn't that point at cabling also? (although that would mean at least 2 cables in this case)
 

jgreco

Resident Grinch
  • Overheating cards are not my first suspect here because the damage seems minimal and very specific. 2-4 bad blocks on every single disk, seemingly affecting a single file? I can't imagine that happening inside the HBA.

Well, that is unlikely to be overheating, I think we agree. Overheating does indeed tend to cause evenly spewed garbage into a pool, but this is so specific...

However, is the correct firmware installed? This damage resembles what we often see with a RAID controller that has dropped some data. If stuff is in-flight and there's a controller reset, the usual culprit is that some of the in-flight data in writeback is lost, or something like that. A 9211 doesn't have writeback, but we do know that it's important for the correct firmware to be used. This is actually where I was expecting to go with my request for a hardware manifest. A 9240 in MegaRAID (MFI) mode, I could see this happening with.
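
If you want to confirm what's flashed, something along these lines should report it (sas2flash is the LSI utility; I'm going from memory on the exact flags, so check its help output):

Code:
# Show firmware/BIOS versions for all attached SAS2 controllers
sas2flash -listall
# More detail per controller (controllers 0 and 1 here)
sas2flash -list -c 0
sas2flash -list -c 1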

And if the on-disk data is actually good, a second scrub would find nothing wrong and clear the error. @Phase, I'd suggest running another scrub to see what happens.

I think this is a good idea, to help clarify what the reality is.
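
Roughly this, for the record (the clear at the end only if the second scrub comes back clean and the error entry is gone):

Code:
zpool scrub OVERWATCH1
# Check on it while it runs / after it finishes
zpool status -v OVERWATCH1
# Only once it finishes clean: reset the error counters
zpool clear OVERWATCH1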
 

Phase

Explorer
Hi, Thanks for looking into this!

Scrub: I started the scrub. 5% done, it has 19 hours to go.

Firmware: The firmware for the LSI cards is 20.00.07.00-IT

Cabling: Using 3 cables, each with 4 SATA whips. One controller drives 8 drives, the second controller drives only one drive. The drive that failed and was successfully replaced/resilvered was attached to the first controller.

Drive Origin: 3 came from Newegg Business and 6 came from OWC (macsales). Orders were placed over the course of 2 months.

When did the issue happen: I have not been able to find a way to look at previous scrub results or at previous alerts. So I do not know if the issue was previously flagged or was only flagged as part of the resilvering process.

"grep -i scrub * | grep OVERWATCH1" on /var/log yields
Code:
middlewared.log:[2021/10/31 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4183 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 10, 17, 0, 0, 7)
middlewared.log:[2021/11/07 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4183 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 10, 17, 0, 0, 7)
middlewared.log:[2021/11/14 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4183 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 10, 17, 0, 0, 7)
middlewared.log:[2021/11/28 00:00:02] (DEBUG) middlewared.plugins.pool.__run():4183 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 11, 21, 0, 0, 7)
middlewared.log:[2021/12/05 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4183 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 11, 21, 0, 0, 7)
middlewared.log:[2021/12/12 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 11, 21, 0, 0, 7)
middlewared.log:[2021/12/19 00:00:02] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 11, 21, 0, 0, 7)
middlewared.log:[2022/01/02 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 12, 26, 0, 0, 5)
middlewared.log:[2022/01/09 00:00:02] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 12, 26, 0, 0, 5)
middlewared.log:[2022/01/16 00:00:00] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 12, 26, 0, 0, 5)
middlewared.log:[2022/01/23 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2021, 12, 26, 0, 0, 5)
middlewared.log:[2022/02/06 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 1, 30, 0, 0, 6)
middlewared.log:[2022/02/13 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 1, 30, 0, 0, 6)
middlewared.log:[2022/02/20 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 1, 30, 0, 0, 6)
middlewared.log:[2022/02/27 00:00:02] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 1, 30, 0, 0, 6)
middlewared.log:[2022/03/13 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 3, 6, 0, 0, 7)
middlewared.log:[2022/03/20 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 3, 6, 0, 0, 7)
middlewared.log:[2022/03/27 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 3, 6, 0, 0, 7)
middlewared.log:[2022/04/03 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 3, 6, 0, 0, 7)
middlewared.log:[2022/04/17 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 4, 10, 0, 0, 7)
middlewared.log:[2022/04/24 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 4, 10, 0, 0, 7)
middlewared.log:[2022/05/01 00:00:02] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 4, 10, 0, 0, 7)
middlewared.log:[2022/05/08 00:00:01] (DEBUG) middlewared.plugins.pool.__run():4195 - Pool 'OVERWATCH1' last scrub datetime.datetime(2022, 5, 1, 20, 51, 54)


Thanks again!
 

Ericloewe

Server Wrangler
Moderator
However, is the correct firmware installed? This damage resembles what we often see with a RAID controller that has dropped some data. If stuff is in-flight and there's a controller reset, the usual culprit is that some of the in-flight data in writeback is lost, or something like that. A 9211 doesn't have writeback, but we do know that it's important for the correct firmware to be used. This is actually where I was expecting to go with my request for a hardware manifest. A 9240 in MegaRAID (MFI) mode, I could see this happening with.
I thought along those lines too but ended up forgetting to mention the firmware. I don't remember this sort of behavior in old pre-P20.00.07 firmware versions, but the drivers have been updated since then, so who knows what might be happening in there. The damage feels too contained to be a full controller reset (and the OS should pick up the disk write failures and pass the error on to ZFS, which would inevitably fault the pool and likely panic the system). Also, two controllers mean this is insanely unlikely, even with bad power. It would never be this coordinated.
 

jgreco

Resident Grinch
The damage feels too contained to be a full controller reset (and the OS should pick up the disk write failures and pass the error on to ZFS, which would inevitably fault the pool and likely panic the system)

Well THAT is probably overly dramatic, it shouldn't fault the pool for a handful of errors.

Also, two controllers mean this is insanely unlikely, even with bad power. It would never be this coordinated.

But with one disk reporting 4 errors and the others reporting 2, some sort of driver boggle that affected the single drive slightly differently is not totally inconceivable.

Unlikely, yes. Grasping at straws. I am not so much of an LSI fanboi (as you well know) as to think that it's impossible for it to go awry.
 

Phase

Explorer
Weird... 20% done and 16 hours to go, it shows no checksum errors so far, but that same snapshot file is already mentioned

Code:
  pool: OVERWATCH1
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption.
        Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Tue May 10 08:35:35 2022
        17.7T scanned at 1.25G/s, 15.3T issued at 1.09G/s, 76.1T total
        0B repaired, 20.09% done, 15:56:37 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        OVERWATCH1                                      ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/b23acad3-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b412da76-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/5917f150-ca22-11ec-be73-000743649900  ONLINE       0     0     0
            gptid/b67f8bdd-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b69817bb-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b770f6e2-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b834db66-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b8610e24-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0
            gptid/b9540f0b-115c-11eb-9f63-20cf3009d192  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        OVERWATCH1/Bravo@auto-2021-08-13_18-58:/<PATH>/<FILENAME>
 

sretalla

Powered by Neutrality
Moderator
When did the issue happen: I have not been able to find a way to look at previous scrub results or at previous alerts. So I do not know if the issue was previously flagged or was only flagged as part of the resilvering process.

"grep -i scrub * | grep OVERWATCH1" on /var/log
Good idea, but zpool history OVERWATCH1 | grep scrub will give you the answer you were after... well, at least the actual scrub run dates and times.

I don't have a pool with scrub errors handy to check whether an error would actually show up in that output... zpool history has a lot of information, but it's more about the commands that were run than their outcomes.
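
If you want to dig a bit deeper, I believe the -i flag includes the internally logged events (scan start/finish and such), though I can't say how much it records about the errors themselves:

Code:
zpool history -i OVERWATCH1 | grep -i scan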
 

Phase

Explorer
Thanks!

Yet... the plot thickens...

I brought current copies of 5 files (38GB) that would have been "adjacent" to that snapshot entry. By adjacent I mean 4 files in that same folder and 1 file in the parent folder.

There is a 1-bit difference in the file in the parent folder between the resilvered pool and the previously replicated pool:

Code:
fc /b replicated_copy resilvered_file
Comparing files replicated_copy and resilvered_file
2F522400: 9F 9E

UPDATE: For better or worse, the replication has been running every night, even after the resilver:

Code:
zettarepl.log:[2022/05/08 00:00:00] INFO [Thread-36] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.4-hpn14v15)
zettarepl.log:[2022/05/08 00:00:00] INFO [Thread-36] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) successful!
zettarepl.log:[2022/05/08 00:00:01] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing push from 'OVERWATCH1/Bravo' to 'OVERVOLT1/Juliet' of snapshot='auto-2022-05-08_00-00' incremental_base='auto-2022-05-07_00-00' receive_resume_token=None encryption=False
zettarepl.log:[2022/05/09 00:00:00] INFO [Thread-41] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.4-hpn14v15)
zettarepl.log:[2022/05/09 00:00:00] INFO [Thread-41] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) successful!
zettarepl.log:[2022/05/09 00:00:05] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing push from 'OVERWATCH1/Bravo' to 'OVERVOLT1/Juliet' of snapshot='auto-2022-05-09_00-00' incremental_base='auto-2022-05-08_00-00' receive_resume_token=None encryption=False
zettarepl.log:[2022/05/10 00:00:00] INFO [Thread-6] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.4-hpn14v15)
zettarepl.log:[2022/05/10 00:00:01] INFO [Thread-6] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) successful!
zettarepl.log:[2022/05/10 00:00:05] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing push from 'OVERWATCH1/Bravo' to 'OVERVOLT1/Juliet' of snapshot='auto-2022-05-10_00-00' incremental_base='auto-2022-05-09_00-00' receive_resume_token=None encryption=False
 

Ericloewe

Server Wrangler
Moderator
Well THAT is probably overly dramatic, it shouldn't fault the pool for a handful of errors.
It should if it loses all disks at once. Granted, if the controller recovers quickly enough, the pool shouldn't end up faulted.
Weird... 20% done and 16 hours to go, it shows no checksum errors so far, but that same snapshot file is already mentioned
I've never actually looked into the intricacies of what gets updated when. Could simply be stale info.
Thanks!

Yet... the plot thickens...

I brought current copies of 5 files (38GB) that would have been "adjacent" to that snapshot entry. By adjacent I mean 4 files in that same folder and 1 file in the parent folder.

There is a 1-bit difference in the file in the parent folder between the resilvered pool and the previously replicated pool:

Code:
fc /b replicated_copy resilvered_file
Comparing files replicated_copy and resilvered_file
2F522400: 9F 9E
Interesting... Which one is correct? Do I understand correctly that these files still exist in the current dataset state?
 

Phase

Explorer
Interesting... Which one is correct? Do I understand correctly that these files still exist in the current dataset state?

Yes, they are both in the live/current file system: one on the replication source and the other on the replication target. I cannot tell which is correct just from looking at the files. However...

The replication uses lz4 to transmit data.
The pool in the target machine is also lz4.
The pool in the source machine is not compressed.

That seems to indicate that the issue originated in the source machine... which is the one we have been discussing.
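
In case anyone wants to verify those settings, something like this should show them (dataset names as in the replication log above; the target is reachable over ssh as in the rsync command below):

Code:
# Source side
zfs get compression,compressratio OVERWATCH1/Bravo
# Target side, over ssh
ssh 192.168.4.7 zfs get compression,compressratio OVERVOLT1/Juliet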

I'm attempting to find any other differences with rsync, but it seems to take forever. I'll try again once the scrub is over.

Code:
rsync -n -arc --no-perms --no-owner --no-group --no-times '/mnt/OVERWATCH1/Bravo/PATH/' '192.168.4.7:/mnt/OVERVOLT1/Juliet/PATH/'
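
I realize that with -c, rsync has to read and checksum every file on both ends anyway, so it's essentially another full pass over that data. An alternative I might try (untested sketch, assuming FreeBSD's sha256 -r "hash filename" output format; the list file names are just placeholders) is to hash each side locally and then diff the lists:

Code:
# Run on each machine from inside the directory being compared,
# then copy one list over and diff the two
find . -type f -exec sha256 -r {} + | sort -k 2 > /tmp/hashes.txt
diff /tmp/hashes_source.txt /tmp/hashes_target.txt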
 