Silent corruption with OpenZFS (ongoing discussion and testing)

Juan Manuel Palacios

Contributor
Joined
May 29, 2017
Messages
146
As I understand it, the conclusion was that the affected code was not called at all outside the handler for the lseek system call, and thus could not happen in any ZFS-internal context.

There's also an important factor to keep in mind: replication is operating on a snapshot, not a live dataset. Given the need to have a dirty dnode, that alone probably makes the timing impossible, even if the bug were relevant for a replication, which is just sending the new blocks and has no need to seek around inside files (nor does it really understand what a file is).
Ah, then those two very solid arguments would definitely explain why this bug impacted userland tools and not ZFS internal operations.

If this problem occurred only in the lseek(2) system call handler, and ZFS internal operations never call lseek(2), then it stands to reason the offending code path would never have been hit by something like 'zfs send'.

Also, even if that code path were hit by 'zfs send', e.g. because in some alternate implementation that action did rely on lseek(2), the conditions under which the bug occurred might (would?) never have materialized. If I'm understanding things correctly, the root cause was a race in which the in-memory data blocks were still dirty, i.e. modified but not yet synced out, at the moment the hole lookup ran; that state can never arise for the data blocks belonging to a snapshot, due to its read-only nature.
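For anyone curious, this is roughly the userland pattern that exercised that path, and which 'zfs send' simply never performs. It's only a sketch with placeholder file names, not the actual community reproducer, which needed many parallel copies and very tight timing to lose the race:

dd if=/dev/urandom of=testfile bs=1M count=4 status=none   # write a file; its dnode may briefly be dirty
cp --sparse=always testfile testfile.copy                  # coreutils 9.x cp probes for holes via lseek(SEEK_DATA)/lseek(SEEK_HOLE)
cmp testfile testfile.copy || echo "copies differ"         # the symptom reported when the race was lost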
 

Nuvi

Cadet
Joined
Jan 16, 2017
Messages
4
Is this fix coming to Bluefin as well? I'm thinking of upgrading from Core to SCALE but need to go through Bluefin before ending up on Cobia.
 

Glowtape

Dabbler
Joined
Apr 8, 2017
Messages
45
Considering how rarely it happened and the very specific conditions required, going to Bluefin and then immediately on to Cobia ought to be safe.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
* I mean no disrespect; the cluelessness derives largely from irresponsible guides pushing users into solutions they understand little about, leading them to make mistakes. Couple this with news articles of varying quality, and it's just the right set of conditions for Joe User, whose system is held together by a prayer and firmware, to see errors reported after a scrub and think to himself, "Ah ha! I hit the infamous bug!"
I think you forgot the professional clueless user: the one who uses duct tape or velcro.
 

Gcon

Explorer
Joined
Aug 1, 2015
Messages
59
I still have the following as a pre-init command for bug mitigation in my newly-upgraded TrueNAS SCALE 23.10.1:
echo 0 >> /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

I have also now updated the ZFS flags on my storage pools, as I upgraded from Bluefin.

Can I now safely delete that pre-init command from my init/shutdown scripts?

EDIT: Anyone back from holidays know the answer to this? It's all so very quiet in this thread now.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Yes, you can get rid of that pre-init command. No, you didn't need to upgrade the ZFS feature flags, but it also doesn't hurt to do so, as long as the pool(s) only ever get imported by a ZFS version that supports those flags.
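If you want to double-check after removing it, you can read the live value back from the same module parameter the pre-init command writes to. On a release with the fix the shipped default should be 1, but verify on your own box:

cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync   # 1 = stock default (fine on fixed ZFS), 0 = the old mitigation is still applied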
 