For what it's worth, even though the reproducer scripts demonstrate this corruption bug, I'm hesitant to simply dismiss it with "It's so rare in the wild that maybe no one using ZFS was affected by it."
- The OP of the original bug report did nothing outrageous: they simply installed packages through their package manager (Gentoo's Portage) and noticed odd behavior on their system. This led them to investigate further and confirm that some files were corrupted, which even a ZFS scrub will not detect.
- The reproducer scripts do indeed "push the envelope", but that's the point. No matter how hard you push, you should never have any "silent" corruption. Never. Never. Never. Why do gamers, who overclock their CPUs and GPUs, do stress testing with outrageous tools to purposefully cook their hardware? Because if there's even a tiny, rare problem when they push their system to the limits, this is enough to alarm them to pause, backtrack, or find a solution. They don't dismiss it with "Oh well. It only failed with my synthetic tests. I'm not actually going to use my PC like that in my daily life. It's fine..."
- This is "silent" corruption we're seeing. This means there may in fact have been other conditions and combinations (possibly "rare", I agree) in which files contain corrupted chunks somewhere in the middle; yet you wouldn't immediately know, because ZFS would not report it, nor would a scrub detect it. What makes this unnerving is that the corrupted chunks (the length of the dataset's recordsize) can live anywhere in the middle of the file. The spans of zeroes are not exclusive to the beginning or end of the file. This makes it almost infeasible to scan your entire dataset with a script to search for "possibly" corrupted files.
So I understand the assurance of "You probably weren't affected by this, don't worry." Pardon me if I sound too critical, but that's not the point. I don't care how unlikely or rare this is, or that it theoretically only affects those in certain environments. I believe a 0% chance of silent corruption, under any circumstance, should be the standard. (This is ZFS we're talking about. We use it primarily for data integrity, before any of the "bells-and-whistles" features.)
You wouldn't accept a filesystem that leaves you with silently corrupted files from a rare combination of circumstances, hardware, and actions, would you? What if I told you, "Yeah, but it's like a 1/10,000 chance"?
EDIT: I'm willing, later on, to run a script that will scan every single file and output a "report" of all files that contain a run of X consecutive zeroes anywhere within the file. I'm sure the report will contain false positives, and that's fine. I'll manually inspect them myself. I'll let this thing run overnight (or over a number of days).
I suppose it's really a matter of using "grep" and specifying an expression that matches the pattern "X number of consecutive zeroes".
EDIT 2:
This is the "sky view" series of events, without going into detail:
- Gentoo user files a bug against Gentoo because their compilers don't work after installing packages via the package manager:

  > "After emerging dev-lang/go I'm unable to compile any Go programs as the internal compiler tools have been striped to the point where they are no longer executable programs."
- In that bug report, it's discovered that this is likely not a Gentoo bug, but rather a ZFS bug
- New bug report is filed against OpenZFS
- A way is discovered to reproduce this across different OSes and versions of OpenZFS