Silent corruption with OpenZFS (ongoing discussion and testing)

Juan Manuel Palacios

Contributor
Joined
May 29, 2017
Messages
146
As I understand it, the conclusion was that the affected code was not called at all outside the handler for the lseek system call, and thus could not happen in any ZFS-internal context.

There's also an important factor to keep in mind: replication is operating on a snapshot, not a live dataset. Given the need to have a dirty dnode, that alone probably makes the timing impossible, even if the bug were relevant for a replication, which is just sending the new blocks and has no need to seek around inside files (nor does it really understand what a file is).
Ah, then those two very solid arguments would definitely explain why this bug impacted userland tools and not ZFS internal operations.

If this problem occurred only in the lseek(2) system call handler, and ZFS internal operations never call lseek(2), then it stands to reason the offending code path would never have been hit by something like 'zfs send'.

Also, even if that code path were hit by 'zfs send', e.g. because in some alternate implementation that action did rely on lseek(2), the conditions under which the bug occurred might (would?) never have materialized. If I'm understanding things correctly, the root cause was a race in which the in-memory data blocks were still dirty, i.e. modified but not yet synced out, at the moment the hole lookup ran; that state can never arise for the data blocks belonging to a snapshot, due to its read-only nature.
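For anyone curious, this is roughly the userland pattern that exercised that path, and which 'zfs send' simply never performs. It's only a sketch with placeholder file names, not the actual community reproducer, which needed many parallel copies and very tight timing to lose the race:

dd if=/dev/urandom of=testfile bs=1M count=4 status=none   # write a file; its dnode may briefly be dirty
cp --sparse=always testfile testfile.copy                  # coreutils 9.x cp probes for holes via lseek(SEEK_DATA)/lseek(SEEK_HOLE)
cmp testfile testfile.copy || echo "copies differ"         # the symptom reported when the race was lost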
 

Nuvi

Cadet
Joined
Jan 16, 2017
Messages
4
Is this fix coming to Bluefin as well? I'm thinking of upgrading from Core to SCALE but need to go through Bluefin before ending up on Cobia.
 

Glowtape

Dabbler
Joined
Apr 8, 2017
Messages
45
Considering how rarely it happened and the very specific conditions required, going to Bluefin and then immediately on to Cobia ought to be safe.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
* I mean no disrespect; the cluelessness derives largely from irresponsible guides pushing users into solutions they understand little about, leading them to make mistakes. Couple this with news articles of varying quality, and it's just the right set of conditions for Joe User, whose system is held together by a prayer and firmware, to see errors reported after a scrub and think to himself, "Ah ha! I hit the infamous bug!"
I think you forgot the professional clueless user: the one who uses duct tape or velcro.
 

Gcon

Explorer
Joined
Aug 1, 2015
Messages
59
I still have the following as a pre-init command for bug mitigation in my newly-upgraded TrueNAS SCALE 23.10.1:
echo 0 >> /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

I have also now updated the ZFS flags on my storage pools, as I upgraded from Bluefin.

Can I now safely delete that pre-init command from my init/shutdown scripts?

EDIT: Anyone back from holidays know the answer to this? It's all so very quiet in this thread now.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Yes, you can get rid of that pre-init command. No, you didn't need to upgrade the ZFS feature flags, but it also doesn't hurt to do so, as long as the pool(s) only ever get imported by a ZFS version that supports those flags.
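If you want to double-check after removing it, you can read the live value back from the same module parameter the pre-init command writes to. On a release with the fix the shipped default should be 1, but verify on your own box:

cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync   # 1 = stock default (fine on fixed ZFS), 0 = the old mitigation is still applied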
 