Silent corruption with OpenZFS (ongoing discussion and testing)

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
And here's the data for the fast system (2x Epyc 7643 for 192 threads total, 1 TB DDR4-3200; storage is a mirror of Kioxia CM6-R SSDs). Full data is attached in the spreadsheet. More detailed analyses are welcome, and I will also share a few more structured thoughts later.

ZFS hole corruption.png
 

Attachments

  • ZFS hole corruption.xlsx
    49.1 KB
Joined
Oct 22, 2019
Messages
3,641
Interestingly, both of your tests seem to show a trend of a higher error rate with fewer parallel workers.

That might explain why something as innocuous as compiling code has been seen to produce corrupted output files in the wild.

Perhaps it's because the heavier the overall workload, the more each individual thread gets slowed down? (Hence it's akin to the "potato safety" against the race condition?)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I suspect there's a peak at a lowish percentage of workers per CPU core.

My impression is that the window is very tight on either end: too slow or too fast and you don't hit it. What shocked me was that an extremely heavy, unrelated non-storage workload had such an impact, whereas lighter loads didn't move the needle. I only tested those with 1 MB files, but if there's interest from the wider community, I'm open to adding tests to my present dataset.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
@Ericloewe I suppose that these results are without mitigation/patches, and that more data points for the fast system under full/heavy/moderate load are coming?
As for the analysis… o_O Since the issue is triggered by heavy concurrent access, it seems counterintuitive that more threads result in lower error rates.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
As for the analysis… o_O Since the issue is triggered by heavy concurrent access, it seems counterintuitive that more threads result in lower error rates.
Here's the thing... The error rates on a per-test basis are stable across the tests, but tests with more parallel workers write more files (each worker writes 10k files on SSD and 1k files on SMR HDD). On a per-unit-of-time basis, the error rate would look a lot more stable across worker counts. It's not clear to me how this could depend primarily on wall-clock time rather than on the raw number of operations, but that is the vibe I get. The super-heavy-load test sort of aligns with this, since it took forever and also had a truckload of errors.
I won't pretend to have a coherent theory of how it all fits together, at least at this point.
I suppose that these results are without mitigation/patches
Yes, unpatched OpenZFS 2.1, on Linux 5.19 or 6.2, as one would have had in production as of two weeks ago or so. My goals are to:
  1. Help people estimate their exposure to this, based on their workload
  2. Help design a useful regression test to keep this race condition firmly quashed going forward
1. is looking confusing, frankly. There is some hope for 2. though.
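For reference, the kind of reproducer used here boils down to parallel copy-and-compare loops over freshly written files. A minimal sketch (assuming 1 MB files and 1000 copies per worker; the actual reproducer.sh in circulation may differ in detail):

Code:
#!/bin/sh
# Sketch of one copy-and-compare worker. Run several instances in parallel,
# e.g.:  for i in {1..8}; do ./reproducer.sh & done; wait
prefix="reproducer_$$"
dd if=/dev/urandom of="${prefix}_0" bs=1M count=1 2>/dev/null

echo "writing files"
i=1
while [ "$i" -le 1000 ]; do
    # cp may probe the freshly written file with SEEK_HOLE/SEEK_DATA,
    # which is where the hole-reporting race can bite
    cp "${prefix}_0" "${prefix}_${i}"
    i=$((i + 1))
done

echo "checking files"
i=1
while [ "$i" -le 1000 ]; do
    # a corrupted copy shows up as "Binary files ... differ"
    diff "${prefix}_0" "${prefix}_${i}"
    i=$((i + 1))
done

Any bad copy then turns out to be full of zeros when inspected with hexdump, as in the runs posted further down this thread.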
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
that more data points for the fast system under full/heavy/moderate load are coming?
Yes, but first I want to test the potato system with SSDs, to get a better feel for how directly comparable the results are. It'll take a while; I've been collecting data for several days now.
 
Joined
Oct 22, 2019
Messages
3,641
Also, I want to highlight rincebrain's write-up over on GitHub: https://gist.github.com/rincebrain/e23b4a39aba3fadc04db18574d30dc73
I don't want to veer off-topic, but I feel it's best to ask here from a home-user perspective. His commentary got me thinking:

What is the point (or even benefit) of using "sparse" and "holey" files on a CoW filesystem with fast inline compression? I could never wrap my head around this.

Removing "hole-seeking" out the equation, and leaving everything to inline compression, seems much more streamlined and simple.

If I have a file with a massive chunk of zeroes (think as large as you want!), even using LZ4 inline compression squishes these records down to basically nothing. A 100 MiB chunk of zeroes in the middle of a file will consume only kilobytes of disk space. To "read" or "seek" the contents of this file is trivial.

So I can understand its main benefit outside of CoW and inline compression. But with ZFS, sparse files seem redundant at best.
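As a rough illustration of the zeroes point (dataset path is hypothetical): on a dataset with compression=lz4, writing 100 MiB of literal zeroes consumes essentially no pool space:

Code:
# hypothetical dataset path; compression=lz4 (or any inline compression) enabled
dd if=/dev/zero of=/mnt/tank/test/zeroes bs=1M count=100
ls -lh /mnt/tank/test/zeroes   # apparent size: 100M (what applications see)
du -h  /mnt/tank/test/zeroes   # allocated size: next to nothing on disk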
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What is the point (or even benefit) of using "sparse" and "holey" files on a CoW filesystem with fast inline compression? I could never wrap my head around this.
A hole can be huge, and it would be faster to report it as such. You're right that inline compression mitigates the disk I/O side of things, but applications receive the uncompressed data, and if they get a gigabyte of zeros, that's still a gigabyte they need to process however they process things. If you instead report a hole, the application can just skip over the whole segment.

It is situational, absolutely, but it is useful.
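To make that concrete, a hedged example (GNU userland assumed; paths made up): with a 1 GiB file that is one big hole, a hole-aware reader can skip the empty range via SEEK_HOLE/SEEK_DATA, while a naive reader still has to chew through a gigabyte of zeroes coming out of read(), even though the pool stored essentially nothing.

Code:
# create a 1 GiB file that is entirely a hole
truncate -s 1G /mnt/tank/test/holey

# hole-aware copy: GNU cp probes with SEEK_HOLE/SEEK_DATA and skips the hole
cp --sparse=always /mnt/tank/test/holey /mnt/tank/test/holey_copy
du -h /mnt/tank/test/holey /mnt/tank/test/holey_copy   # both ~0 on disk

# naive read: still pulls 1 GiB of zeroes through read()
dd if=/mnt/tank/test/holey of=/dev/null bs=1M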
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
You can set the above tunable to "0" in the meantime. :smile:
According to this comment, though, it's not an effective workaround. Or am I interpreting it wrong?
This is also why you can still reproduce it with that tunable set to 0 - because the tunable only matters if dnode_is_dirty returns true, and if it doesn't, then you are sad.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It's effective in that it substantially reduces the likelihood of hitting this. It's not perfect because it does not fix the underlying race condition.
 

FrankWard

Explorer
Joined
Feb 13, 2023
Messages
71
I've read a lot of threads this morning as I discovered this issue. I just built a SCALE box and have an existing CORE box. From what I can tell from my cursory research, there are no fixes yet, just some band-aids, which are different for CORE and SCALE. Is there a clear instructional post giving us the best mitigation approach for both systems? I've seen several ideas, but I can't pin down exactly what I need to do on both of my systems to mitigate this as much as possible until a true fix is released. Thank you!
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
The fixes are on their way in OpenZFS 2.1.14 and 2.2.2.
Mitigation is essentially the same on CORE and SCALE: Set zfs_dmu_offset_next_sync to 0.
In SCALE, you do that with echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync as an init task, and run it immediately if you don't reboot.
In CORE, the tunable is set directly from the Tunables screen in the GUI, but it has a prefix: vfs.zfs.dmu_offset_next_sync
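To verify the setting took effect (a quick check; the CORE sysctl is the same one the GUI tunable sets):

Code:
# SCALE (Linux): set it now and confirm
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync    # should print 0

# CORE (FreeBSD): confirm the sysctl the GUI tunable sets
sysctl vfs.zfs.dmu_offset_next_sync                        # should report 0 once applied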

Edit: corrections
 
Last edited:

FrankWard

Explorer
Joined
Feb 13, 2023
Messages
71
The fixes are on their way in OpenZFS 2.1.14 and 2.2.1.
Mitigation is essentially the same on CORE and SCALE: Set zfs_dmu_offset_next_sync to 0.
In SCALE, you do that with echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync as in init task, and immediately if you don't reboot.
In CORE, the tunable is set directly from the Tunables screen in the GUI, but it has a prefix: vfs.zfs_dmu_offset_next_sync
Thank you @Etorix. You sir, are a gentleman and a scholar.
 

das1996

Dabbler
Joined
May 26, 2020
Messages
25
In CORE, the tunable is set directly from the Tunables screen in the GUI, but it has a prefix: vfs.zfs_dmu_offset_next_sync

I think you mean "vfs.zfs.dmu_offset_next_sync". Note the period between zfs and dmu rather than underscore.
Code:
root@nas1[~]# sysctl -a | grep vfs.zfs_dmu_offset_next_sync
root@nas1[~]#


vs

Code:
root@nas1[~]# sysctl -a | grep vfs.zfs.dmu_offset_next_sync
vfs.zfs.dmu_offset_next_sync: 1

This is on TrueNAS-13.0-U5.1
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Just came across this thread.

I thought I would test my NAS to see if it was vulnerable.

It's TrueNAS CORE 13.0-U5.3.

It's got a Xeon E5-1650 v2 (6C/12T) and 128 GB of ECC memory on a Supermicro board.

Storage is a total of 24 SATA "NAS" drives. Some are WD Reds, some are Seagate Exos. The 24 drives are in 4 vdevs of 6 drives each in RAIDZ2.

The reproducer script was run with the built-in FreeBSD cp, not GNU cp, set to 1 MB files and 1000 copies.

The first run on this NAS failed:

Code:
root@nas testdirectory # for i in {1..8} ; do ~/reproducer.sh & done; wait
[1] 37367
[2] 37368
[3] 37369
[4] 37370
[5] 37371
[6] 37372
[7] 37373
[8] 37376
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
Binary files reproducer_37370_0 and reproducer_37370_142 differ
Binary files reproducer_37370_0 and reproducer_37370_285 differ
Binary files reproducer_37370_0 and reproducer_37370_286 differ
Binary files reproducer_37367_0 and reproducer_37367_292 differ
Binary files reproducer_37370_0 and reproducer_37370_571 differ
Binary files reproducer_37370_0 and reproducer_37370_572 differ
Binary files reproducer_37370_0 and reproducer_37370_573 differ
Binary files reproducer_37370_0 and reproducer_37370_574 differ
Binary files reproducer_37367_0 and reproducer_37367_585 differ
Binary files reproducer_37367_0 and reproducer_37367_586 differ
[1]   Done                    ~/reproducer.sh
[2]   Done                    ~/reproducer.sh
[3]   Done                    ~/reproducer.sh
[5]   Done                    ~/reproducer.sh
[7]-  Done                    ~/reproducer.sh
[8]+  Done                    ~/reproducer.sh
[4]-  Done                    ~/reproducer.sh
[6]+  Done                    ~/reproducer.sh


As expected, all the non-identical files are full of nulls:

Code:
root@nas testdirectory # hexdump reproducer_37370_142
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000
root@nas testdirectory # hexdump reproducer_37370_285
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000
root@nas testdirectory # hexdump reproducer_37370_286
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000
root@nas testdirectory # hexdump reproducer_37370_572
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000
root@nas testdirectory # hexdump reproducer_37367_585
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000



I then tried running it at work on our TrueNAS X10, and it also failed on the first run:

Code:
Binary files reproducer_98467_0 and reproducer_98467_386 differ
Binary files reproducer_98467_0 and reproducer_98467_773 differ
Binary files reproducer_98467_0 and reproducer_98467_774 differ


I know this is unlikely to be an issue with network access, but it's still very worrying.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I know this is unlikely to be an issue with network access, but it's still very worrying.
In order to mitigate the issue you may set the vfs.zfs.dmu_offset_next_sync variable in the tunables (System>Tunables) as a SYSCTL with value 0, as written in the official statement.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
In order to mitigate the issue you may set the vfs.zfs.dmu_offset_next_sync variable in the tunables (System>Tunables) as a SYSCTL with value 0, as written in the official statement.

I'm aware of the workaround.

Neither of these NASes does any local file manipulation (no jails, etc.); everything is handled over the network. So I doubt it'll be a problem anyway.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
The bug that keeps on giving: #15615
robn said:
#15571 is a reliable fix for #15526, but it wasn't clear why it was necessary. This PR explains it, and offers the correct fix.

For avoidance of doubt: #15571 fixes the problem. There's no new problem that this fixes. There's no hurry at all to ship this (assuming its even right).
ZFS code is undergoing a thorough scrub. Thanks again to all involved!
 