Silent corruption with OpenZFS (ongoing discussion and testing)

Davvo · Nov 23, 2023

Juan Manuel Palacios said:
Any other downsides we may want to consider before flipping this sysctl on TrueNAS Core?

The whole thing is still under development/observation, I trust that iX will officially speak to us about it after the weekend.

jumpingkoala · Nov 23, 2023

On TrueNAS Core 13.0-U5.3, I could not reproduce this bug with 16 parallel instances on an encrypted RaidZ1 dataset with spinning hard drives.

HoneyBadger · Nov 23, 2023

I'm trying with 32x parallel instances on (unencrypted) 8x RAIDZ2 now under SCALE 23.10.0.1

Ericloewe · Nov 23, 2023

Ericloewe said:
I've asked admnd over on Github to elaborate on their test setup.

For those not following this over on Github, the answer is that the script is equivalent to the one I posted. Still, I have been unable to reproduce the issue thus far.

winnielinnie · Nov 23, 2023

Ericloewe said:
Still, I have been unable to reproduce the issue thus far.

Interestingly, he's able to reproduce it with only 4 parallel instances on TrueNAS Core 13.0-U5.3.

This got me thinking, and I decided to approach from a different angle. (Sort of. Read below.)

TL;DR

TL;DR: I am able to somewhat consistently reproduce this on TrueNAS Core 13.0-U6. No block-cloning involved whatsoever.

The Setup

This is what I did:

Created a new unencrypted dataset (pool/playground)
- This unencrypted dataset has the default options, except for recordsize=1M
Created a new 13.2-RELEASE Basejail called "vanilla"
Made a mountpoint for it that accesses /mnt/pool/playground via its internal /media/playground
In the jail (all proceedings steps are in the jail, by the way), switched to the latest pkg repo
Installed bash, nano, and coreutils (9.1)
Ran the script with only 4 parallel instances while inside the mountpoint's path /media/playground

The Results

Here is a sample of two results:

Sample #1

Code:

for i in {1..4} ; do ./reproducer.sh & done; wait
[1] 19685
[2] 19686
[3] 19687
[4] 19688
writing files
writing files
writing files
writing files
checking files
checking files
checking files
checking files
Binary files reproducer_29316_0 and reproducer_29316_186 differ
Binary files reproducer_29316_0 and reproducer_29316_373 differ
Binary files reproducer_29316_0 and reproducer_29316_374 differ
Binary files reproducer_13948_0 and reproducer_13948_492 differ
Binary files reproducer_29316_0 and reproducer_29316_747 differ
Binary files reproducer_29316_0 and reproducer_29316_748 differ
Binary files reproducer_29316_0 and reproducer_29316_749 differ
Binary files reproducer_29316_0 and reproducer_29316_750 differ
Binary files reproducer_13948_0 and reproducer_13948_985 differ
Binary files reproducer_13948_0 and reproducer_13948_986 differ
[1]   Done                    ./reproducer.sh
[3]-  Done                    ./reproducer.sh
[4]+  Done                    ./reproducer.sh
[2]+  Done                    ./reproducer.sh

Sample #2

Code:

for i in {1..4} ; do ./reproducer.sh & done; wait
[1] 95060
[2] 95061
[3] 95062
[4] 95063
writing files
writing files
writing files
writing files
checking files
checking files
checking files
checking files
Binary files reproducer_29959_0 and reproducer_29959_822 differ
[1]   Done                    ./reproducer.sh
[3]-  Done                    ./reproducer.sh
[4]+  Done                    ./reproducer.sh
[2]+  Done                    ./reproducer.sh

What makes this even more insidious is that sometimes it will result with no corruption. (This makes it tricky to reproduce on-demand.)

Additional Info and Evidence

@HoneyBadger: This is on a pool comprised of two mirror vdevs, spinning HDDs (WD Red Pluses, CMR).

System Info:

TrueNAS Core 13.0-U6
ZFS: 2.1.13
coreutils: 9.1
No block-cloning
32 GiB ECC RAM
Intel Xeon E-2144G, 4 cores, 8 threads
WD Red Plus HDDs
2 x two-way mirrors (a total of 4 spinners)

Because this ran in a jail, I had to modify the script so that it used the correct path for bash (/usr/local/bin/bash) and GNU coreutil's "cp" (/usr/local/bin/gcp).

Here is the modified version of the script I used, which works in a FreeBSD jail:

Code:

#!/usr/local/bin/bash
prefix="reproducer_${BASHPID}_"
dd if=/dev/urandom of=${prefix}0 bs=1M count=1 status=none

echo "writing files"
end=1000
h=0
for i in `seq 1 2 $end` ; do
        let "j=$i+1"
        /usr/local/bin/gcp ${prefix}$h ${prefix}$i
        /usr/local/bin/gcp ${prefix}$i ${prefix}$j
        let "h++"
done

echo "checking files"
for i in `seq 1 $end` ; do
        diff ${prefix}0 ${prefix}$i
done

What confuses me is that I could not reproduce this on the TrueNAS host using the FreeBSD base system's "cp". I'm not sure what that means, or if it's a red herring.

The cherry on top is the proof of data corruption during the copy operation. Behold:

Code:

hexdump reproducer_29316_186
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000

Code:

du -hs reproducer_29316_186
512B    reproducer_29316_186

That is absolutely not the original urandom'd 1 MiB file. It's basically a barrage of zero's.

For reference, this is the original file:

Code:

du -hs reproducer_29316_0
1.0M    reproducer_29316_0

A hexdump also confirms the theory.

So basically, like some others in the GitHub bug report and in here, I just copied a bunch of files that silently corrupted, and I would never know about it had I not been actively trying to reproduce the bug.

Now I'm asking myself, "What's the likelihood this already happened to me?"

The above corruption was with 4 parallel instances of the script. I realize it's A LOT of I/O happening in such a small cluster, and such scenarios are unlikely for home users. But I still have my concerns, as you can tell.

It's possible that this bug has existed for longer than suspected, and might have inadvertently returned with OpenZFS 2.1.5, when a tunable was enabled by default: zfs_dmu_offset_next_sync

So this obviously isn't exclusive to block-cloning or OpenZFS 2.2.0.

Perhaps this "silent corruption" bug indeed has a common source? Maybe block-cloning "enhances" its potential occurrence? An emergency "fix" in the meantime is that they basically disabled block-cloning (by default) with version 2.2.1. But that doesn't address the bug for OpenZFS 2.1.x, and with no block-cloning involved.

bcat · Nov 23, 2023

@winnielinnie Thank you for the detailed testing report! I know it's hard to prove the absence of a bug (whereas only one successful repro demonstrates the bug exists), but does setting zfs_dmu_offset_next_sync=0 in your test environment appear to prevent the issue?

winnielinnie · Nov 23, 2023

bcat said:
@winnielinnie Thank you for the detailed testing report! I know it's hard to prove the absence of a bug (whereas only one successful repro demonstrates the bug exists), but does setting zfs_dmu_offset_next_sync=0 in your test environment appear to prevent the issue?

I can try it right now.

Just keep in mind it's been a hectic day for me (unrelated to TrueNAS), so I'm feeling burned out and tired. So apologies if I suddenly go quite the rest of this evening.

I've read about that module parameter, but I'm still unsure what it actually "does".

(Shall reply in here after running the reproducer script several times in the same environment, but with the tunable disabled.)

winnielinnie · Nov 23, 2023

@bcat

On my TrueNAS 13.0-U6 host:

Code:

sysctl vfs.mu_offset_next_sync
vfs.zfs.dmu_offset_next_sync: 1

sysctl -w .zfs.dmu_offset_next_sync=0
vfs.zfs.dmu_offset_next_sync: 1 -> 0

sysctl vfs.zfs.dmu_offset_next_sync
vfs.zfs.dmu_offset_next_sync: 0

Then I repeated the test several times under the same exact conditions.

Two things:

Ran it several times, never results in corruption
It was WAYYYYYYYYYYYY faster to finish the operations this time. (Previously these tests took a decent amount of time. Now it's like... within seconds.)

My brain hurts. What does this tunable do? Something about improving how to copy sparse files?

EDIT: For good measure, I ran it some more, still no corruption. I'll try with a higher parallel instance. (Maybe 16.)

EDIT 2: Still no corruption, even with a few attempts at 16 parallel instances.

So I guess this means the culprit is zfs_dmu_offset_next_sync? If it's safe to disable, I'll gladly make it a persistent tunable. What's the drawback? Less efficient handling of sparse files?

bcat · Nov 23, 2023

winnielinnie said:
I can try it right now.

Just keep in mind it's been a hectic day for me (unrelated to TrueNAS), so I'm feeling burned out and tired. So apologies if I suddenly go quite the rest of this evening.

Of course, didn't mean to create any sense of obligation! Just a mixture of curious (I've been trying to learn more about ZFS by using TrueNAS at home) and concerned.

winnielinnie said:
I've read about that module parameter, but I'm still unsure what it actually "does".

(Shall reply in here after running the reproducer script several times in the same environment, but with the tunable disabled.)

For what it's worth, the comment in the kernel module says this:

Code:

   78 /*
   79  * Enable/disable forcing txg sync when dirty checking for holes with lseek().
   80  * By default this is enabled to ensure accurate hole reporting, it can result
   81  * in a significant performance penalty for lseek(SEEK_HOLE) heavy workloads.
   82  * Disabling this option will result in holes never being reported in dirty
   83  * files which is always safe.
   84  */
   85 static int zfs_dmu_offset_next_sync = 1;

Which seems to imply it's safe to turn off (at the cost of worse... hole reporting... which I don't really understand the consequences of). Maybe more space gets used in sparse file type situations? I genuinely don't know...

winnielinnie · Nov 23, 2023

bcat said:
Which seems to imply it's safe to turn off (at the cost of worse... hole reporting... which I don't really understand the consequences of). Maybe more space gets used in sparse file type situations? I genuinely don't know...

In my opinion, I'd rather lose efficiency when dealing with sparse "holey" files (of which I don't believe I deal with much, or at all), if it means improved performance and removing the risk of data corruption, even if the risk is "very rare".

bcat · Nov 23, 2023

Totally agreed. To be really clear, I am NOT recommending anyone set that kernel parameter for now, given that I don't fully understand what it does and the situation is still developing.

But, disclaimer aside, it does look like 1) that option was set to 0 until one of the 2.1.x releases, 2) it appears to resolve the root cause, and 3) it sounds like it prevents ZFS from forcing syncs of sparse files, which (I am making an unverified assumption here) may result in less efficient sparse file disk usage (but it sounds like that doesn't hurt correctness, only space efficiency?).

bcat · Nov 23, 2023

Okay, so man 2 lseek goes into more detail on SEEK_HOLE. In short:

These operations allow applications to map holes in a sparsely allocated file. This can be useful for applications such as file backup tools, which can save space when creating backups and preserve holes, if they have a mechanism for discovering holes.

The man page also notes that filesystems aren't required to report holes, so it doesn't seem like making SEEK_HOLE less efficient in edge cases for the time being is a deal breaker.

tannisroot · Nov 23, 2023

I understand that the issue seems to go beyond just block cloning, but seeing how it at least aggravates the likelyhood of data corruption and in 2.2.1it's disabled, is there anything that a user of an upgraded pool can do to prevent (further) data corruption? If block cloning feature can't be disabled after the pool upgrade, is it possible to at least somehow to disable reflinks, maybe with some global paramater, so that block cloning is at least not triggered?

katbyte · Nov 23, 2023

bcat said:
@winnielinnie Thank you for the detailed testing report! I know it's hard to prove the absence of a bug (whereas only one successful repro demonstrates the bug exists), but does setting zfs_dmu_offset_next_sync=0 in your test environment appear to prevent the issue?

no repo on a truenas scale 24.10.0.1 in a proxmox VM w/ HBA passthrough, 32disks in 8xraidz2 and 2 cache NVMEs - block cloning enabled

even when running 64 at once

for i in {1..64}; do ./zfs-bclone-repo.sh & done; wait

now the proxmox host is also using zfs, simple 2disk mirror, block cloning is not enabled as far as i can tell, but i can repo it, even with as low as 8!! echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync fixes it

morganL · Nov 23, 2023

As indicated, we'll investigate for TrueNAS, but have seen no evidence of a problem in the field.

If anyone can reproduce the issue on 13.0-U6 (ZFS 2.1) .. or 23.10.0.1 (ZFS 2.2), please let us know. Its not clear whether the issue depends on the OS as well.

Obviously, when data is accessed via SMB or NFS, there are additional file locking mechanisms... so most users are not copying data with direct ZFS access.

bcat · Nov 23, 2023

morganL said:
If anyone can reproduce the issue on 13.0-U6 (ZFS 2.1) .. or 23.10.0.1 (ZFS 2.2), please let us know. Its not clear whether the issue depends on the OS as well.

I can repro intermittently but fairly frequently (i.e., not on every attempt, but every other try or so) on SCALE 23.10.0.1. See my prior posts in the thread for full details on the script I was running (from the upstream GitHub issue) and more details on my setup.

But TL;DR on the configuration that triggers this bug for me: Single vdev, 5 HDD pool in RAIDZ2. Stock SCALE 23.10.0.1 install on a standard C246 mobo, Xeon E2276G w/ ECC RAM. Pool flags upgraded through SCALE UI to enable block cloning. No non-default ZFS kernel params.

winnielinnie · Nov 24, 2023

morganL said:
If anyone can reproduce the issue on 13.0-U6 (ZFS 2.1) .. or 23.10.0.1 (ZFS 2.2), please let us know. Its not clear whether the issue depends on the OS as well.

It has been reproduced on SCALE and Core.

See posts #11 (SCALE 23.10.0.1) and #25 (Core 13.0-U6), as well as the replies on the GitHub bug report, in which it has been reproduced on Core 13.0-U5.3, Gentoo, and Ubuntu. (It also happened on a Proxmox host for what it's worth, on post #34.)

This bug is OS-agnostic, as it has been reproduced on TrueNAS Core, TrueNAS SCALE, FreeBSD, Gentoo, Arch Linux, Ubuntu, and possibly others. In fact, in the latest bug report the OP notes that he discovered this because he was installing the Go compiler with his distro's (Gentoo) package manager. That's not exactly an unusual task.

This bug exists on OpenZFS 2.1.x (possibly starting with 2.1.5), and 2.2.0 (and possibly even with the temporary "fix" introduced in 2.2.1).

Doesn't seem to matter whether it's on a RAIDZ or mirror vdev.

Because of it's inconsistency, it's difficult to reproduce "on-demand".

It appears to more likely reproduce on slow spinners (HDDs) than SSDs / NVMe.

Block-cloning might simply be an "enhancer" to this bug, since it exists on OpenZFS 2.1.x as well (which doesn't have block-cloning.)

What seems to consistently resolve this bug is disabling the parameter zfs_dmu_offset_next_sync.
(We are unsure if this is safe to disable indefinitely, however, it apparently was disabled by default prior to OpenZFS 2.1.5.)

morganL said:
Obviously, when data is accessed via SMB or NFS, there are additional file locking mechanisms... so most users are not copying data with direct ZFS access.

Of course, and I mainly use Rsync for my transfers, as well as SMB and NFS. The concern is that when we reproduced this using FreeBSD's "cp" and GNU coreutils "cp", there should ideally be 0% chance of silent corruption. Even if it's rare, and even if it's exclusively under high I/O and copy operations within a narrow window, it's still a worrying issue. Perhaps our reproducing steps just make it more obvious? It's unclear, because if this did affect someone in the past, it means it would have happened silently. (ZFS will not detect corruption on such a subsequent read or scrub. Everything will appear "normal".)

* We're talking about ZFS here! You'd expect it to handle being slammed with tons of I/O, and at worst your system slows down or crashes. You don't expect it to silently corrupt your data, yet report "Everything's fine! Your data is intact, trust me". There's no fine print that reads "ZFS is a resilient CoW filesystem that can detect bitrot and prevent corruption due to its CoW nature. Oh, except if you copy a bunch of things or use a script that slams it with I/O. So, just don't don't that, okay?"

When I have time, I might try to see if I can reproduce this using SMB.

But for the meantime, we know this bug exists on Core, SCALE, and other Linux distros, and it affects OpenZFS 2.1.x and 2.2.0. (It possibly also affects 2.2.1, even with block-cloning disabled. See the discussions above and in the bug report.)

There is also a possible "solution" by disabling the module parameter zfs_dmu_offset_next_sync.

Ericloewe · Nov 24, 2023

There's also a theory proposed on the github issue, but I lack the expertise to meaningfully opine on it (much to my frustration), with a proto-patch attached.

Cellobita · Nov 24, 2023

As of this moment, how safe would it be to set zfs_dmu_offset_next_sync to 0, pending a final, effective patch?

winnielinnie · Nov 24, 2023

tannisroot said:
If block cloning feature can't be disabled after the pool upgrade, is it possible to at least somehow to disable reflinks, maybe with some global paramater, so that block cloning is at least not triggered?

For block-cloning specifically, nothing you can do about it on OpenZFS 2.2.0. *However, for 2.2.1, they introduced a parameter that disables by block-cloning by default, even if you already enabled it via a pool upgrade. (It's an emergency bugfix to stop any new block-cloning operations for version 2.2.1+. According to Rob N., this is likely an "indefinite" change. So I take it to mean he sees this issue as critical enough to not even put forth a timeline to reintroduce block-cloning for OpenZFS. I can't imagine it being reversed in 2.2.2 without a serious fix of the underlying problem.)

* Your pool's features will show block-cloning as "enabled", rather than "active". The wording is misleading, but if it says "enabled", then it basically means your pool supports block-cloning, but it will not use it.

Cellobita said:
As of this moment, how safe would it be to set zfs_dmu_offset_next_sync to 0, pending a final, effective patch?

I'm not an expert (far from it!), but personally I'm leaning towards resorting to this myself. It's relatively recently that it was enabled by default (starting with OpenZFS 2.1.5 from what I understand.) So surely it can't be dangerous to set it to 0? (I don't even deal with sparse / "holey" files. If I do, it's quite uncommon for me.)

Important Announcement for the TrueNAS Community.

Silent corruption with OpenZFS (ongoing discussion and testing)

MVP

Cadet

actually does care

Server Wrangler

MVP

Explorer

MVP

MVP

Explorer

MVP

Explorer

Explorer

Dabbler

Dabbler

Captain Morgan

Explorer

MVP

Server Wrangler

Contributor

MVP

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Silent corruption with OpenZFS (ongoing discussion and testing)"

Similar threads