Silent corruption with OpenZFS (ongoing discussion and testing)

Juan Manuel Palacios

Contributor
Joined
May 29, 2017
Messages
146
OpenZFS 2.2.2 & 2.1.14 released with fixes for this data corruption issue! Now let's wait for iX to do its magic and release updates on the already announced mid-December timeline.
 
Joined
Oct 22, 2019
Messages
3,641
Heads up for those anticipating the return of block-cloning with OpenZFS 2.2.2:
Rob N. said:
The "write" part of the file change can be anything - write, clone, fill, etc. What may make a difference is the relative speed of those operations, and obviously a clone is much faster than a write. There may also be a second bug in cloning that contributed to this that we haven't found yet. This is part of the reason that cloning is still disabled in 2.2.2.
(Emphasis added.)

As much as I think block-cloning is a game-changer for ZFS, and as excited as I was for it, I have to agree with this "play it safe for now" approach.
 

HarambeLives

Contributor
Joined
Jul 19, 2021
Messages
153
After applying the fix, is it a good idea to do a scrub ASAP or just stick to my usual schedule?
 

bcat

Explorer
Joined
Oct 20, 2022
Messages
84
Scrubs will not help with this issue since, as far as ZFS is concerned, there's no on-disk corruption. ZFS writes exactly the bytes it's told to by the application (e.g., cp). Instead, the corruption occurs when ZFS tells an application the wrong thing at read time (saying there is a "hole" in a file when, in fact, there is data there).
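To make that concrete, here's a rough Python sketch of what a hole-aware copy (in the spirit of cp's sparse-file handling) does; the function name and paths are placeholders, and real tools obviously handle chunking, short writes, and errors. The copier asks the filesystem where the data is and writes exactly what it was told, so if ZFS misreports a hole over real data, the copy silently gets zeros while the original stays intact and checksums clean:
Code:
import os

def sparse_copy(src_path, dst_path):
    # Hypothetical helper, only to illustrate the read-time failure mode.
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        size = os.fstat(src).st_size
        offset = 0
        while offset < size:
            try:
                # "Where does the next data region start?"
                data_start = os.lseek(src, offset, os.SEEK_DATA)
            except OSError:
                break  # nothing but a hole from here to EOF
            # "... and where does it end (i.e., where is the next hole)?"
            data_end = os.lseek(src, data_start, os.SEEK_HOLE)

            # If ZFS wrongly answered "hole" over real data above, that
            # region is simply never read or written: the destination ends
            # up with zeros there, and both files scrub perfectly clean.
            os.lseek(src, data_start, os.SEEK_SET)
            buf = os.read(src, data_end - data_start)  # real tools read in chunks
            os.lseek(dst, data_start, os.SEEK_SET)
            os.write(dst, buf)
            offset = data_end
        os.ftruncate(dst, size)  # preserve a trailing hole, if any
    finally:
        os.close(src)
        os.close(dst)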
 

HarambeLives

Contributor
Joined
Jul 19, 2021
Messages
153
Thanks
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Put otherwise: The bug does not corrupt data that is already in the pool; it's a "read bug" which returns wrong data, which may result in corrupted copies being stored—originals are safe.
Scrubs are of no use: The original is fine, but the copy was written with its own checksum and is "properly corrupted".
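Which also means the only reliable way to catch a bad copy after the fact is to compare it against its original, assuming the original is still around. A minimal sketch (the paths are just examples), since ZFS itself will never flag either file:
Code:
import hashlib

def file_digest(path, chunk_size=1 << 20):
    # Hash in chunks so large files don't have to fit in RAM.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Illustrative paths only: a mismatch means the copy is bad, even though a
# scrub reports zero checksum errors on both files.
if file_digest("/mnt/tank/data/file.bin") != file_digest("/mnt/tank/backup/file.bin"):
    print("copy differs from original -- recopy it")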
 

Juan Manuel Palacios

Contributor
Joined
May 29, 2017
Messages
146
Hence the lack of checksum failures: if I understood correctly, when the bug hits, the data (even if it's corrupted data) is written correctly to its destination, and at a later time that destination checksums OK when read again and/or scrubbed… correct?
@HarambeLives here's my analysis of how the bug impacts copies.
 

tiberiusQ

Contributor
Joined
Jul 10, 2017
Messages
190
Put otherwise: The bug does not corrupt data that is already in the pool; it's a "read bug" which returns wrong data, which may result in corrupted copies being stored—originals are safe.
Scrubs are of no use: The original is fine, but the copy was written with its own checksum and is "properly corrupted".
What about ZFS replication to, e.g., another TrueNAS?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Not affected by this bug.
 

Juan Manuel Palacios

Contributor
Joined
May 29, 2017
Messages
146
Not affected by this bug.
How come? Because if the bug is about incorrectly reporting holes, wouldn't it stand to reason it could also occur while reading the blocks that make up a snapshot?

After all, supplying the data for a file that a userland tool, say cp(1), wants to copy boils down to traversing that file's blocks, and the bug hits at exactly that point, when a given range of blocks is misread as a hole under a specific set of racy conditions… so why couldn't the same thing happen while traversing a snapshot's blocks for ZFS replication, provided the system is under the exact same racy conditions that trigger the problem in the former case?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
How come? Because if the bug is about incorrectly reporting holes, wouldn't it stand to reason it could also occur while reading the blocks that make up a snapshot?
Replication works at the block level. The bug is about user space programs operating at the file/vnode interface actively using "hole aware" system calls. ZFS replication is completely file agnostic.
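For anyone wondering what a "hole aware" call actually looks like, it's essentially this (the path is just an example); zfs send never issues anything like it, it just streams the snapshot's blocks:
Code:
import os

fd = os.open("somefile.bin", os.O_RDONLY)
try:
    size = os.fstat(fd).st_size
    # Ask the filesystem for the offset of the first hole at or after 0.
    # For a file with no holes this is simply the end of the file.
    first_hole = os.lseek(fd, 0, os.SEEK_HOLE)
    print(f"size={size}, first hole reported at offset {first_hole}")
finally:
    os.close(fd)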
 

Juan Manuel Palacios

Contributor
Joined
May 29, 2017
Messages
146
Replication works at the block level. The bug is about user space programs operating at the file/vnode interface actively using "hole aware" system calls. ZFS replication is completely file agnostic.
Right, I understand that replication works at the block level. But, at the end of the day, when the ZFS layer is asked to supply the data for a file that a userland tool intends to copy, that translates into reading the blocks that make up the file.

So I guess what I'm wondering is at what layer this erroneous hole reporting comes into play, because, after all, and if I'm not sorely misunderstanding something, it is ZFS that's incorrectly reporting those holes.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
My understanding is that it occurs at the file system layer and not in ZFS' operation on blocks. Now that you keep asking, I wonder whether I might be wrong ...
 

Juan Manuel Palacios

Contributor
Joined
May 29, 2017
Messages
146
My understanding is that it occurs at the file system layer and not in ZFS' operation on blocks. Now that you keep asking, I ponder if I might be wrong ...
Well, I don't pretend to be too knowledgeable about ZFS internals, even if I consider myself experienced enough with the filesystem to have pulled myself out of several rabbit… holes over the years (debugging pools a few times with zdb, migrating from GELI to ZFS native encryption via replication, etc.).

But, in any case, the fix for the issue talks about, if I'm understanding it correctly, blocks being in an inconsistent state, presumably while held in RAM (with the original blocks still consistent on the storage media). And that keeps taking me back to my argument: whether ZFS is supplying data to userland tools or to a replication stream, it ultimately boils down to reading storage-media blocks, which would arguably fail with the erroneous hole reporting under the right racy conditions.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
As I understand it, the conclusion was that the affected code was not called at all outside the handler for the lseek system call, and thus the bug could not manifest in any ZFS-internal context.

There's also an important factor to keep in mind: replication operates on a snapshot, not a live dataset. Given the need for a dirty dnode, that alone probably makes the timing impossible, even if the bug were relevant to replication, which just sends the new blocks and has no need to seek around inside files (nor does it really understand what a file is).
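That's also, loosely, why the sane workarounds before the fix amounted to staying off the hole-detection path: a plain sequential copy never asks the filesystem where the holes are, so there is nothing for a misreported hole to mislead, much like a replication stream that simply sends the snapshot's blocks. A quick sketch (names and paths are placeholders, not any particular tool's implementation):
Code:
import os

def plain_copy(src_path, dst_path, chunk_size=1 << 20):
    # Sequential byte-for-byte copy: it never calls lseek(SEEK_HOLE/SEEK_DATA),
    # it just read()s the data as it is. Hypothetical helper for illustration.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while chunk := src.read(chunk_size):
            dst.write(chunk)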
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I was hoping for an earlier update with the current ZFS version ("hotfix"). Unraid, for example, already has ZFS 2.1.14.


It's worth understanding the TrueNAS fix cycle and how we approach these issues.

  1. We work with community to confirm the issue and how relevant it is.
  2. We try to publish any recommendations on mitigating or avoiding the problem. We did that on Nov 29th: https://www.truenas.com/community/threads/old-openzfs-issue-found-and-being-resolved.114556/
  3. We include a fix in the nightlies... for both internal and community testing. (We do not push these untested versions on anyone.)
  4. It takes almost a week to get through a QA cycle. Unlike unRaid, we have a significant lab and enterprise customers.
  5. We release a hot patch if necessary. This was done for CORE on Dec 7: https://www.truenas.com/community/threads/old-openzfs-issue-found-and-being-resolved.114556/
  6. For SCALE, we are releasing the fix with OpenZFS 2.2.2. This (SCALE 23.10.1) needs a full 2 week QA cycle. Unfortunately, we found another unrelated issue and delayed for a week. The plan is to release on Dec 19.

The bottom line is:

If you want a fast response, follow the recommendations. The nightlies should only be used if necessary or if there is no risk (e.g., you have your own QA system).

If you want tested and verified software, wait for the official versions. It's reasonable to argue that some users should wait another few weeks for community feedback on the new software. See the software status page.

We don't plan to issue official versions without some professional level of testing. The unintended consequences of bug fixes can be worse than the original bug. In this case, the original bug lasted 15 years without detection.

We hope this approach meets the vast majority of TrueNAS user needs.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The original bug lasted 15 years without detection
This is a very important point to keep in mind. In fact, nobody seems to have come forward with any "now that you mention it, back in the day I saw this in the wild" stories.
The unfortunate side of this is that this bug will, for a while at least, be used by clueless users* as a scapegoat for everything from dying disks to PEBKACs.

* I mean no disrespect; the cluelessness derives largely from irresponsible guides pushing users into solutions they understand little about, leading them to make mistakes. Couple this with news articles of varying quality, and it's just the right set of conditions for Joe User, whose system is held together by a prayer and firmware, to see errors reported after a scrub and think to himself, "Aha! I hit the infamous bug!"
 