Silent corruption with OpenZFS (ongoing discussion and testing)

Glowtape

Dabbler
Joined
Apr 8, 2017
Messages
45
Guess I'll be pushing back the long-planned and necessary Feng Shui until said patches are in, that DMU tunable or not.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
I don't think "moves" can even be theoretically affected, so such actions (even if done under times of heavy I/O or in bulk) should be fine. :smile:
The original reproducer script and the newer zhammer.sh do use cp to trigger the bug… It is a matter of concurrent threads writing and reading at the same time while not checking for the appropriate kind of "dirty".
The saving grace is that it requires a highly parallel workload which is not a typical work pattern—except for compilers.
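For a concrete picture of the shape of that workload, here is a rough sketch in the spirit of those reproducers (not the actual zhammer.sh; names, sizes and counts are illustrative): several workers each write a small file and immediately copy it with a hole-aware cp while the source may still be dirty, then compare the two:
Code:
#!/bin/sh
# Rough sketch only, not the real reproducer script.
for i in $(seq 1 16); do
    (
        for j in $(seq 1 1000); do
            dd if=/dev/urandom of="src.$i.$j" bs=64k count=1 2>/dev/null
            cp "src.$i.$j" "dst.$i.$j"
            cmp -s "src.$i.$j" "dst.$i.$j" || echo "mismatch: file $i.$j"
        done
    ) &
done
wait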
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It is important to keep in mind that the bug is actually in reading data. Data is being written correctly, so any workload that leaves data in place, regardless of whether it read back bad data, is safe for the data at rest. Moving the data to a different filesystem, by virtue of needing to read the data, is vulnerable to the bug.
 

alex992

Explorer
Joined
Jul 6, 2017
Messages
65
It is important to keep in mind that the bug is actually in reading data. Data is being written correctly, so any workload that leaves data in place, regardless of whether it read back bad data, is safe for the data at rest. Moving the data to a different filesystem, by virtue of needing to read the data, is vulnerable to the bug.
Will only reading the data (not moving it) from a Windows machine create corruption?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If you mean to ask whether the bug will corrupt data stored on disk that is being read, the answer is no, the impact is only on what is sent to whoever is reading.
 

Juan Manuel Palacios

Contributor
Joined
May 29, 2017
Messages
146
It is important to keep in mind that the bug is actually in reading data. Data is being written correctly (…)
Hence the lack of checksum failures, because, if I understood correctly, when the bug hits, data is written correctly to some destination, even if that's corrupted data, and then at a later time that destination checksums OK when read once again and/or when scrubbed… correct?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Hence the lack of checksum failures, because, if I understood correctly, when the bug hits, data is written correctly to some destination, even if that's corrupted data, and then at a later time that destination checksums OK when read once again and/or when scrubbed… correct?
Pretty much. Checksums also pass when reading because ZFS is reading all the bits correctly. It's "just" reporting holes where there are none, causing affected userland applications to seek past good data that ZFS would and could gladly provide, had it not just lied to the app.
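As a rough illustration of what that looks like from userland (the file names here are only placeholders): a copy made through a hole-aware tool while the bug triggers ends up with runs of zeros, or an unexpectedly sparse destination, even though the source on disk is intact:
Code:
# Compare source and copy byte for byte; a hit shows up as differing bytes
# (zeros in the copy where the source has real data).
cmp original.bin copy.bin

# If the copying tool punched holes in the destination, the copy will also
# report far fewer allocated blocks than its apparent size suggests.
stat -c '%n: %s bytes, %b blocks of %B' original.bin copy.bin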
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Some better preliminary data from my testing, for those not following the ticket on Github:

Future me here to say that past me should debug his scripts before starting the tests - I was testing everything with 1 MB files. File size does seem to make a difference, at least at first glance. I only have three full tests so far with 64 kB files, but I'm already seeing more errors than with 1 MB files.
I'll let the updated tests run while I figure out how to present the data. In the meantime, here's a list of things that make it more likely to hit this bug, in approximate order of significance:
  • Extreme CPU/DRAM workloads parallel to and independent from the file I/O
  • Smaller files/less time writing and more time handling metadata, in relative terms
  • Slow disk I/O performance

I'm not sure yet if more parallel operations have an impact, I'll have to crunch this data down to something usable first.

My preliminary take is that users doing heavy computational work while simultaneously doing a ton of small file I/O on potato-grade storage that's not blocking the computational part are most at risk. I think most will agree this is a seriously contrived scenario. In less contrived scenarios, a rough figure is 8 errors per million files for affected applications (which look for holes). I expect this to be a bit of a worst-case for realistic workloads.
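For anyone wondering whether a particular tool falls into the "looks for holes" category, one rough check (the paths and tool here are only examples) is to trace its lseek calls and watch for SEEK_HOLE/SEEK_DATA probes:
Code:
# Trace only lseek calls; hole-aware tools probe the source for holes with
# SEEK_HOLE/SEEK_DATA before reading it.
strace -f -e trace=lseek cp /tank/some-file /tmp/copy 2>&1 | grep -E 'SEEK_(HOLE|DATA)'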
 

neofusion

Contributor
Joined
Apr 2, 2022
Messages
159
Are the fixes going to be backported to Bluefin?
Cobia is still very new, and having to choose between a new (and thus less tested) release with the fixes and an otherwise well-tested version that has now been found to possibly corrupt data is not ideal.
 
Joined
Oct 22, 2019
Messages
3,641
You can set the above tunable to "0" in the meantime. :smile:


To apply immediately:
Code:
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync


To survive reboots (I haven't tested this, as I don't use SCALE):
Code:
sudo midclt call system.advanced.update '{"kernel_extra_options": "zfs.zfs_dmu_offset_next_sync=0"}'


To later undo this, clear the extra options:
Code:
sudo midclt call system.advanced.update '{"kernel_extra_options": ""}'


An alternative approach is to create an Init Task that runs at startup (pre-init), which simply executes the command at every boot:
Code:
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
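To verify the mitigation is in place, read the parameter back; it should print 0:
Code:
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync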
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Are the fixes going to be backported to Bluefin?
Cobia is still very new, and having to choose between a new (and thus less tested) release with the fixes and an otherwise well-tested version that has now been found to possibly corrupt data is not ideal.
I would expect so. The fix is being backported to 2.1 upstream, so the impact should be minimal.
 

Juan Manuel Palacios

Contributor
Joined
May 29, 2017
Messages
146
I would expect so. The fix is being backported to 2.1 upstream, so the impact should be minimal.
Have there been any recent updates on some kind of an estimated timeline for the fix to land in OpenZFS, and then in TrueNAS Core?

Thank you!
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
#15566 and #15571 have been merged… but there's now a new #15599.
Relief is on the way, but not the end of this story.
 

Gnome

Explorer
Joined
Aug 18, 2011
Messages
87
The info on the thread is excellent.
The mechanisms to solve it are appreciated.

Any chance a "How to mitigate" guide for both Core and Scale, along with how to verify the mitigation is in place, can be pinned somewhere?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'm waiting for the final set of results from the fast system before sharing the full dataset. For now, I wanted to share the quick overview of results from my "potato" system, which combines a decently fast server with the crappiest, slowest HDDs I have had the misfortune of dealing with: The Toshiba L200 2.5" laptop SMR 2 TB HDD, attached to a Dell HBA330 Mini/LSI SAS3008 HBA.
The server proper is a Dell R6515 with a single Epyc 7543P CPU, 512 GB of DDR4-3200, running Ubuntu 22.04 with Linux 6.2. The tests were all run with cp 9.4.
I believe this is the worst-case scenario I tested, slightly ahead of the fast system bogged down by CPU busywork, so take these numbers with a big spoonful of salt. I do not claim anyone has seen, will see or could see these error rates in a real workload on a realistic machine. This is basically just a rough indication of what scenarios are more susceptible to hitting this bug - to the two people in the room running SMR laptop drives on their servers, my condolences. I especially want to highlight that the results for the fast system will be different - the 64 kB scenario is the worst one there, for instance.

The full data will be available later, but each data point corresponds to 60 000-640 000 files written, depending on the number of workers. The data points are at 12, 24, 32 and 64 workers (out of 32 physical cores).

1701221014802.png
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
By the way, questions on the test methodology and related matters are also welcome in advance.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
…whose discussion spawned #15603 about cloned blocks during ZIL replay.
Talk about a gift which keeps on giving.

And this comment:
ricebrain said:
The problem is complicated and has technically been in the codebase for a very long time, but various things sometimes made it much more obvious and reproducible.

I can reproduce this on 0.6.5. So I would not bet on "my version is too old" to make you feel better.

It's very hard to hit without a very specific workload, though, which is why we didn't notice that "it's fixed" in those cases was actually "it's just much harder to hit".
 