Juan Manuel Palacios
Contributor
Ah, then those two very solid arguments would definitely explain why this bug impacted userland tools and not ZFS internal operations. As I understand it, the conclusion was that the affected code was not called at all outside the handler for the lseek(2) system call, and thus could never run in any ZFS-internal context.
There's also an important factor to keep in mind: replication operates on a snapshot, not a live dataset. Given the need for a dirty dnode, that alone probably makes the timing impossible, even if the bug were relevant to replication at all, which just sends the new blocks and has no need to seek around inside files (nor does it really understand what a file is).
If this problem occurred only in the lseek(2) system call handler, and ZFS internal operations never call lseek(2), then it stands to reason the offending code path would never have been hit by something like 'zfs send'.
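To make that concrete, here's a rough sketch (my own illustration, not code from the bug report) of how a userland tool such as GNU cp walks a file's data regions with lseek(2); this kind of SEEK_DATA/SEEK_HOLE probing is the only sort of caller that could reach the affected check:

```c
/* Sketch of userland hole detection via lseek(2). Enumerates the
 * data regions of a file the way sparse-aware copy tools do; it was
 * this sort of probing that could hit the buggy dirty-dnode check. */
#define _GNU_SOURCE /* SEEK_DATA/SEEK_HOLE on Linux */
#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t end = lseek(fd, 0, SEEK_END);
    off_t pos = 0;
    while (pos < end) {
        /* Next offset holding data at or after pos; fails with
         * ENXIO when only a hole remains to EOF. */
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0)
            break;
        /* End of that data region (an implicit hole ends the file). */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        printf("data: [%jd, %jd)\n", (intmax_t)data, (intmax_t)hole);
        pos = hole;
    }
    close(fd);
    return 0;
}
```

A tool that trusts these answers and skips the "holes" is exactly how the corruption surfaced: if ZFS misreported a dirty region as a hole, the copy silently came out with zeros where data belonged.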
Also, even if that code path were hit by 'zfs send', e.g. because some alternate implementation of that zfs action did rely on lseek(2), the conditions under which the bug occurred might have (would have?) never materialized. Because, again, if I'm understanding things correctly, the root cause of the problem revolved around data blocks being in an inconsistent state in RAM, the result of a very specific race in which data got modified unexpectedly mid-check; and that could never happen to the data blocks belonging to a snapshot, due to its read-only nature.
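Just to illustrate the shape of that race (this is my own sketch, not the project's reproducer, and single-threaded like this it will almost certainly not trigger anything; the real trigger needed the probe to land in a narrow window while transaction groups were syncing under load):

```c
/* Shape of the race, for illustration only: a freshly written,
 * still-dirty (unsynced) region is immediately probed with
 * SEEK_DATA. Correct behavior returns offset 0; under the bug the
 * dirty block could be misjudged as a hole. The filename "victim"
 * is arbitrary. */
#define _GNU_SOURCE /* SEEK_DATA on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096] = { 'x' }; /* nonzero byte at offset 0 */
    int fd = open("victim", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Writer side: dirty the dnode with fresh, unsynced data. */
    if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) {
        perror("write");
        return 1;
    }

    /* Reader side: probe before the txg syncs. */
    off_t data = lseek(fd, 0, SEEK_DATA);
    if (data == 0)
        puts("data found where expected");
    else
        puts("data misreported as a hole (the bug's symptom)");

    close(fd);
    return 0;
}
```

And a snapshot, by definition, never has dirty, unsynced blocks, so the "writer side" of this picture simply doesn't exist for 'zfs send'.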