> I don't think "moves" can even be theoretically affected, so such actions (even if done under times of heavy I/O or in bulk) should be fine.

The original reproducer script and the newer zhammer.sh do use `cp` to trigger the bug… It is a matter of concurrent threads writing and reading at the same time while not checking for the appropriate kind of "dirty".

> Will only reading the data (not moving it) from a Windows machine create corruption?

It is important to keep in mind that the bug is actually in reading data. Data is being written correctly, so any workload that leaves the data in place is safe for the data on disk, even if the reader got bad data back. Moving the data to a different filesystem, by virtue of needing to read the data, is vulnerable to the bug. If you mean to ask whether the bug will corrupt data stored on disk that is merely being read, the answer is no; the impact is only on what is sent to whoever is reading.

Thanks.
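To make that concrete, here is a heavily simplified sketch, in Python rather than the project's actual reproducer.sh/zhammer.sh shell scripts, of the write-then-immediately-copy pattern those scripts hammer on. File names, sizes, and iteration counts below are arbitrary.

```python
# Simplified sketch of the reproducer workload (not the real script):
# write a file, copy it right away while the new contents are still
# waiting to be synced out, then compare source and copy.
import filecmp
import os
import subprocess

SRC, DST = "src.bin", "dst.bin"   # arbitrary file names on a ZFS dataset

for i in range(1000):
    with open(SRC, "wb") as f:
        f.write(os.urandom(1 << 20))          # 1 MiB of fresh data
    # coreutils cp walks the file with lseek(SEEK_DATA)/lseek(SEEK_HOLE);
    # a spuriously reported hole shows up as a run of zeros in the copy.
    subprocess.run(["cp", SRC, DST], check=True)
    if not filecmp.cmp(SRC, DST, shallow=False):
        print(f"mismatch after iteration {i}")
        break
```

On an unaffected system the loop never reports a mismatch; on an affected one it may take many iterations, since the race window is narrow.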
> It is important to keep in mind that the bug is actually in reading data. Data is being written correctly (…)

Hence the lack of checksum failures, because, if I understood correctly, when the bug hits, data is written correctly to some destination, even if that's corrupted data, and then at a later time that destination checksums OK when read once again and/or when scrubbed… correct?

> Hence the lack of checksum failures

Correct.

> Hence the lack of checksum failures, because, if I understood correctly, when the bug hits, data is written correctly to some destination, even if that's corrupted data, and then at a later time that destination checksums OK when read once again and/or when scrubbed… correct?

Pretty much. Checksums also pass when reading because ZFS is reading all the bits correctly. It's "just" reporting holes where there are none, causing affected userland applications to seek past good data that ZFS would and could gladly provide, had it not just lied to the app.
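For illustration, this is roughly how a sparse-aware copier walks a file with lseek(SEEK_DATA)/lseek(SEEK_HOLE); the Python below is my own sketch, not coreutils code. If the filesystem falsely reports a hole over real data, the loop simply never reads those bytes, and the destination silently gets zeros there, even though the source blocks are intact and checksum fine.

```python
# Sketch of a SEEK_DATA/SEEK_HOLE based copy (illustrative only, not what
# cp actually runs). Holes are skipped, so a hole reported over real data
# becomes a run of zeros in the destination.
import os

def sparse_copy(src_path: str, dst_path: str) -> None:
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        dst.truncate(size)              # unwritten ranges read back as zeros
        offset = 0
        while offset < size:
            try:
                data_start = os.lseek(src.fileno(), offset, os.SEEK_DATA)
            except OSError:             # ENXIO: no more data before EOF
                break
            hole_start = os.lseek(src.fileno(), data_start, os.SEEK_HOLE)
            src.seek(data_start)
            dst.seek(data_start)
            dst.write(src.read(hole_start - data_start))  # copy one data run
            offset = hole_start
```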
# Mitigation while waiting for the fix: disable zfs_dmu_offset_next_sync on
# the running system (takes effect immediately, does not survive a reboot)
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

# On TrueNAS SCALE, persist the setting across reboots via the middleware
sudo midclt call system.advanced.update '{"kernel_extra_options": "zfs.zfs_dmu_offset_next_sync=0"}'

# And to remove the persisted workaround again once the fix has landed
sudo midclt call system.advanced.update '{"kernel_extra_options": ""}'
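A trivial sketch for double-checking that the runtime change actually took effect; it only reads the module parameter back.

```python
# Read the ZFS module parameter back; it should print 0 after the echo above.
from pathlib import Path

param = Path("/sys/module/zfs/parameters/zfs_dmu_offset_next_sync")
print("zfs_dmu_offset_next_sync =", param.read_text().strip())
```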
> Are the fixes going to be backported to Bluefin?

I would expect so. The fix is being backported to 2.1 upstream, so the impact should be minimal.

Cobia is still very new, and having to choose between a new (and thus less tested) release that has the fixes and an otherwise well-tested version that has now been found to possibly corrupt data is not ideal.

> I would expect so. The fix is being backported to 2.1 upstream, so the impact should be minimal.

Have there been any recent updates on an estimated timeline for the fix to land in OpenZFS, and then in TrueNAS Core?
Relief is on the way, but this is not the end of the story.
…whose discussion spawned #15603 about cloned blocks during ZIL replay.
ricebrain said:

> The problem is complicated and has technically been in the codebase for a very long time, but various things sometimes made it much more obvious and reproducible.
>
> I can reproduce this on 0.6.5. So I would not bet on "my version is too old" to make you feel better.
>
> It's very hard to hit without a very specific workload, though, which is why we didn't notice that "it's fixed" in those cases was actually "it's just much harder to hit".