Silent corruption with OpenZFS (ongoing discussion and testing)

winnielinnie · Nov 24, 2023

bcat said:
Edit: lol, my bug was duped against the original bug almost immediately. :) eh, i did ask before filing... :P

Did you forget to enable dedup?

Concurrent bug tracker on the FreeBSD side was started today.

275308 – EN tracking issue for potential ZFS data corruption

bugs.freebsd.org

I'll probably started a new thread after this is comfortably resolved, to determine ways to scan through an entire ZFS dataset to search for files that are potentially corrupted. (My hunch is zero, and I hope I'm right.) It would likely involved grep'ing the header of every file and searching for an extended string of binary 0's? (I don't mind if I hit too many false positives, since I'll end up manually checking each file on the "results list" to verify it's actually fine.) But that's for another day, and for another thread.

winnielinnie · Nov 24, 2023

From my understanding of how this is progressing and what is being discovered, is there a "silver lining" that we may in fact be able to safely use block-cloning going forward, since it apparently isn't the underlying cause of the issue? (Block-cloning is an amazing feature.)

Davvo · Nov 24, 2023

bcat said:
Edit: lol, my bug was duped against the original bug almost immediately. :) eh, i did ask before filing... :P

From my understanding they linked both tickets in order to track the issue while leaving the branches separated.

bcat · Nov 24, 2023

Davvo said:
From my understanding they linked both tickets in order to track the issue while leaving the branches separated.

Ah gotcha. This is probably just me not understanding how JIRA works then. :)

katbyte · Nov 24, 2023

Cellobita said:
As of this moment, how safe would it be to set zfs_dmu_offset_next_sync to 0, pending a final, effective patch?

I set it on multiple proxmox hosts and VMs yesterday and all has been fine since.

winnielinnie said:
Hence, slow systems, and especially VMs, are the least likely to reproduce this bug. This means that a virtualized environment is not a good place to test for this bug.

This might explain why you were able to reproduce it on your Proxmox host, but not on your virtualized TrueNAS SCALE.

that makes sense, the host I repo'd it on is a very new EPYC system with a fast mirror of PCI4 NVME disks while the VM was also on spinners

if I had more time to spare I'd try it on some single disk/slower hosts but at the point I've changed the sync setting and moving on

axhxrx · Nov 24, 2023

I feel like we should be more heavily avocating the already-discussed "disable zfs_dmu_offset_next_sync" mitigation.

This worked for me, and many others in the GitHub bug thread, and I'd think it should be very safe, since it was the default until recently.

For the record, I was able to very easily reproduce this on my physical machine on which I recently installed TrueNAS-SCALE-23.10.0.1. As in, repeatedly running the reproducer.sh script from the GitHub bug thread, I could reproduce the bug within 5 minutes. That link has details in the comments but it reproduced for me on the TrueNAS `boot-pool`, and on my ZFS filesystems of all types: NVMe, SATA SSD, and RAIDZ magnetic hard disks.

After running this command as root, I could no longer reproduce the bug on any of my filesystems, after running the reproducer script thousands of times over several hours.

echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

I actually came to this forum to find out the right way to make that change permanent in TrueNAS SCALE, since just doing it that way, the change won't persist across reboots.

Etorix · Nov 25, 2023

Davvo said:
From my understanding they linked both tickets in order to track the issue while leaving the branches separated.

Anyway we are not going to complain if a CORE bug report takes hierarchical precedence over a SCALE report for the same bug, are we?

Ericloewe · Nov 25, 2023

katbyte said:
the host I repo'd it on is a very new EPYC system with a fast mirror of PCI4 NVME disks while the VM was also on spinners

Under what conditions? With Epyc Gen 3, a fast NVMe pool, OpenZFS 2.1, Linux and Coreutils 8.x I've been getting nothing.
I realize this a harder test case, but I'm particularly interested in understanding what conditions make this race more likely to be lost, and thus data to be corrupted. There's the possibility that this is also tied to specific microarchitectures (in terms of increased likelihood, anyway)...

I still need to test the "fast server with potato storage" case, and after that I'll try to start compiling people's results, in hopes of finding a pattern.

winnielinnie · Nov 25, 2023

Ericloewe said:
Coreutils 8.x I've been getting nothing.

I don’t believe anyone has (or can) reproduce this with coreutils 8.x.

asap2go · Nov 25, 2023

Is there a zfs test module compiled with TSan enabled that one can use?
Or is this just testing on which versions are affected?

scottrus71 · Nov 25, 2023

For TrueNAS Core 13.0-U6 here is how I set the tunable. Must be running as root user.

Set the tunable now

sysctl -w vfs.zfs.dmu_offset_next_sync=0

Persist the tunable after reboot

echo ‘vfs.zfs.dmu_offset_next_sync=0’ >> /etc/sysctl.conf.local

winnielinnie · Nov 25, 2023

scottrus71 said:
Persist the tunable after reboot

Use the GUI instead:

winnielinnie · Nov 25, 2023

For what it's worth, even though the reproducer scripts demonstrate this corruption bug, I'm tentative to simply dismiss it as "It's so rare to happen in the wild that maybe no one using ZFS was affected by it."

The OP of the original bug report did nothing outrageous: They simply installed packages from their package manager (i.e, Gentoo, Portage) and noticed odd behavior with their system. This led them to investigate further and confirm that some files were corrupted, which even a ZFS scrub will not detect.
The reproducer scripts do indeed "push the envelope", but that's the point. No matter how hard you push, you should never have any "silent" corruption. Never. Never. Never. Why do gamers, who overclock their CPUs and GPUs, do stress testing with outrageous tools to purposefully cook their hardware? Because if there's even a tiny, rare problem when they push their system to the limits, this is enough to alarm them to pause, backtrack, or find a solution. They don't dismiss it with "Oh well. It only failed with my synthetic tests. I'm not actually going to use my PC like that in my daily life. It's fine..."
This is "silent" corruption we're seeing. This means that there may in fact have been other conditions and combinations (possibly "rare", I agree) in which files contain corrupted chunks somewhere in the middle; yet you wouldn't immediately know because ZFS would not report it, nor would a scrub detect it. What makes this unnerving is that the corrupted chunks (with the length of the dataset's recordsize) can live anywhere in the middle of the file. The spam of zeroes are not exclusive to the beginning or end of the file. This makes it almost unfeasible to scan your entire dataset with a script to search for "possibly" corrupted files.

So I understand the assurance of "You probably weren't affected by this, don't worry." Pardon me if I sound too critical, but that's not the point. I don't care how unlikely or rare this is, or that it theoretically will only affect those in certain environments. I believe a 0% chance of silent corruption, under any circumstance, should be the standard. (This is ZFS we're talking about. We use it primarily for data integrity before any of the "bells-and-whistles" features.)

You wouldn't accept a filesystem that leaves you with silently corrupted files from a rare combination of circumstances, hardware, and actions, would you? What if I told you "Yeah, but it's like a 1/10,000 chance."

EDIT: I'm willing, later on, to run a script that will scan every single file, then output a "report" of all files that contain a string of "X amount of consecutive zeroes" anywhere within the file. I'm sure any such files on the report will be false positives, and that's fine. I'll manually inspect them myself. I'll let this thing run overnight (or throughout a number of days.)

I suppose it's really a matter of using "grep" and specifying an expression that matches the pattern "X number of consecutive zeroes".

EDIT 2:

This is the "sky view" series of events, without going into detail:

Gentoo user files bug against Gentoo because their compilers don't work after installing packages via the package manager
- "After emerging dev-lang/go I'm unable to compile any Go programs as the internal compiler tools have been striped to the point where they are no longer executable programs."
  
  Click to expand...
In this bug report, it's discovered this is not likely a Gentoo bug, but rather a ZFS bug
New bug report is filed against OpenZFS
A way is discovered to reproduce this across different OSes and versions of OpenZFS

winnielinnie · Nov 25, 2023

axhxrx said:
I actually came to this forum to find out the right way to make that change permanent in TrueNAS SCALE, since just doing it that way, the change won't persist across reboots.

axhxrx said:
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

@axhxrx You need to create an Init script in the GUI that automatically runs that "echo" command upon startup. That will apply it at every reboot. I believe in SCALE it's found under the "Tasks" menu.

cobrakiller58 · Nov 25, 2023

winnielinnie said:
I'm willing, later on, to run a script that will scan every single file, then output a "report" of all files that contain a string of "X amount of consecutive zeroes" anywhere within the file. I'm sure any such files on the report will be false positives, and that's fine. I'll manually inspect them myself. I'll let this thing run overnight (or throughout a number of days.)

I want to also run such a script after this bug is resolved, I have encountered image files that failed to completely load on my storage. Even if this was not the cause I'd rather locate and hopefully replace them if possible.
Edit: to my disbelief I located another corrupted image...... pinako.jpg which was intact when added to the folder

victort · Nov 25, 2023

cobrakiller58 said:
I want to also run such a script after this bug is resolved, I have encountered image files that failed to completely load on my storage. Even if this was not the cause I'd rather locate and hopefully replace them if possible.

I've actually also come across this issue. They weren't important images so I just deleted them and didn't think much of it. But this has been 4-5 times now in the last year or so.

Ericloewe · Nov 25, 2023

winnielinnie said:
So I understand the assurance of "You probably weren't affected by this, don't worry." Pardon me if I sound too critical, but that's not the point. I don't care how unlikely or rare this is, or that it theoretically will only affect those in certain environments. I believe a 0% chance of silent corruption, under any circumstance, should be the standard. (This is ZFS we're talking about. We use it primarily for data integrity before any of the "bells-and-whistles" features.)

You wouldn't accept a filesystem that leaves you with silently corrupted files from a rare combination of circumstances, hardware, and actions, would you? What if I told you "Yeah, but it's like a 1/10,000 chance."

You're not wrong, but it's important to not overstate the impact of this issue. I fully agree that 0% corruption is the target and no effort may be spared in pursuit of that goal. However, it is vanishingly unlikely that every piece of code is always correct, and therefore it is always possible that something will happen and data will be lost.
Ideally, there would be two separate but 100%-compatible implementations of ZFS. In practice, some of the benefits can be had by using both Linux and FreeBSD, but that's far from perfect, as we're seeing.

Ericloewe said:
Under what conditions? With Epyc Gen 3, a fast NVMe pool, OpenZFS 2.1, Linux and Coreutils 8.x I've been getting nothing.
I realize this a harder test case, but I'm particularly interested in understanding what conditions make this race more likely to be lost, and thus data to be corrupted. There's the possibility that this is also tied to specific microarchitectures (in terms of increased likelihood, anyway)...

I still need to test the "fast server with potato storage" case, and after that I'll try to start compiling people's results, in hopes of finding a pattern.

Update to this after absolutely hammering the system with coreutils 8.x: Using coreutils 9.4 on the same system (compiled from source) results in sporadic errors, as per the following output:

Code:

./reproducer.sh 64 5 10000 16K
Using 64 runners.
Using 5 iterations.
Using 10000 runners.
Using bs=16K.
Starting iteration 1
Spawned 64 workers. Waiting...
Binary files tmp/reproducer_2484909_0 and tmp/reproducer_2484909_5474 differ
Non-zero diff: tmp/reproducer_2484909_5474
Iteration complete
Starting iteration 2
Spawned 64 workers. Waiting...
Binary files tmp/reproducer_702546_0 and tmp/reproducer_702546_3472 differ
Non-zero diff: tmp/reproducer_702546_3472
Binary files tmp/reproducer_702546_0 and tmp/reproducer_702546_6945 differ
Non-zero diff: tmp/reproducer_702546_6945
Binary files tmp/reproducer_702546_0 and tmp/reproducer_702546_6946 differ
Non-zero diff: tmp/reproducer_702546_6946
Iteration complete
Starting iteration 3
Spawned 64 workers. Waiting...

(Sidenote: I got really fscking tired of editing the script to edit parameters, so I spent some time polishing it to make my life easier, cut down on the console spam and enable longer-term testing. If there's still interest at this point, I'd be happy to share.)

ConvexSERV · Nov 25, 2023

I'm following this bug here among other places.

Looks like there's a pull request to fix it.
https://github.com/openzfs/zfs/pull/15571

Author of the pull request says that the dmu_offset_next_sync=0 should be a very reliable workaround.

His words, and credit where credit is due. All of this is above my head.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275308#c8

Rob Norris 2023-11-25 09:37:00 UTC
Hi, I'm the author of 15571. Just a couple of notes:

- dmu_offset_next_sync=0 does appear to be a very reliable workaround, just not technically perfect. Only one person has been able to trip it on an extremely contrived test, as opposed to the default =1, which multiple people have been able to reproduce consistently.

- The incorrect dirty check in 12.4 (as in Illumos) is in dmu_object_wait_synced(). I have not explored this at all though; OpenZFS never had this version and a lot has changed since. There may be other reasons why it can't be tripped.

Let me know if there's anything I can assist with.

Rob.

Arwen · Nov 25, 2023

Been following the issue before this thread, scary. Worse, my 4 Non-NAS home systems are Gentoo Linux with ZFS: desktop, miniature media server and 2 laptops, (old and new). All use ZFS OS pools and separate ZFS data pools.

I've postponed further Gentoo updates until the fix is included. And the 2 active computers, desktop and miniature media server, have the work around permanently in place, (aka survives a reboot);
echo 0 >>/sys/module/zfs/parameters/zfs_dmu_offset_next_sync

The really frightening part is that this may have been in ZFS for a very long time. May even still be in Oracle Solaris 11.4 ZFS. Hell, I may just annoy Oracle support with a ticket reporting this bug.

I am not going to try and re-produce it. Just want to bury my head in the sand, and wait for it to get all better

.

HoneyBadger · Nov 25, 2023

winnielinnie said:
@axhxrx You need to create an Init script in the GUI that automatically runs that "echo" command upon startup. That will apply it at every reboot. I believe in SCALE it's found under the "Tasks" menu.

@axhxrx and others, you should use the GUI options below to persist this change across reboots:

CORE: Using System -> Tunables, add

Variable: vfs.zfs.dmu_offset_next_sync
Value: 0
Type: sysctl

SCALE: This step clearly isn't working, but it should - in the interim, please take the steps advised by @winnielinnie or @bcat here:

Silent corruption with OpenZFS (ongoing discussion and testing)

Seems someone is writing ZDB functions related to block cloning; https://github.com/openzfs/zfs/pull/15541 It won't be perfect to find all corrupted files, but it is something to help. Same author as the OpenZFS 2 part fix.

www.truenas.com

~~SCALE~~~~: Using System -> Sysctl, add~~

~~Variable: zfs_dmu_offset_next_sync~~
~~Value: 0~~

Important Announcement for the TrueNAS Community.

Silent corruption with OpenZFS (ongoing discussion and testing)

MVP

MVP

MVP

Explorer

Dabbler

Cadet

Wizard

Server Wrangler

MVP

Patron

Dabbler

MVP

MVP

MVP

Guru

Guru

Server Wrangler

Cadet

MVP

actually does care

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Silent corruption with OpenZFS (ongoing discussion and testing)"

Similar threads