Silent corruption with OpenZFS (ongoing discussion and testing)

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
And here's the data for the fast system (2x Epyc 7643 for 192 threads total, 1 TB DDR4-3200; storage is a mirror of Kioxia CM6-R SSDs). Full data is attached in the spreadsheet. More detailed analyses are welcome, and I will also share a few more structured thoughts later.

ZFS hole corruption.png
 

Attachments

  • ZFS hole corruption.xlsx
    49.1 KB
Joined
Oct 22, 2019
Messages
3,641
Interestingly, both of your tests seem to show a trend of a higher error rate with fewer parallel workers.

That might explain why something as innocuous as compiling code has been seen to produce corrupted output files in the wild.

Perhaps it's because the heavier the overall workload, the more each individual thread gets slowed down? (Hence it's akin to the "potato safety" against the race condition?)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I suspect there's a peak at a lowish percentage of workers per CPU core.

My impression is that the window is very tight on either end: too slow or too fast and you don't hit it. What shocked me was that an extremely heavy, unrelated non-storage workload had such an impact, whereas lighter loads didn't move the needle. I only tested those with 1 MB files, but if there's interest from the wider community, I'm open to adding tests to my present dataset.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
@Ericloewe I suppose that these results are without mitigation/patches, and that more data points for the fast system under full/heavy/moderate load are coming?
As for the analysis… o_O Since the issue is triggered by heavy concurrent access, it seems counterintuitive that more threads result in lower error rates.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
As for the analysis… o_O Since the issue is triggered by heavy concurrent access, it seems counterintuitive that more threads result in lower error rates.
Here's the thing... The error rates on a per-test basis are stable across the tests, but tests with more parallel workers write more files (each worker writes 10k files on SSD and 1k files on SMR HDD). On a per-unit-of-time basis, the error rate would look a lot more stable across worker counts. It's not clear to me how this could depend primarily on wall-clock time rather than on the raw number of operations, but that is the vibe I get. The super-heavy-load test sort of aligns with this, since it took forever and also had a truckload of errors.
I won't pretend to have a coherent theory of how it all fits together, at least at this point.
I suppose that these results are without mitigation/patches
Yes, unpatched OpenZFS 2.1, on Linux 5.19 or 6.2, as one would have had in production as of two weeks ago or so. My goals are to:
  1. Help people estimate their exposure to this, based on their workload
  2. Help design a useful regression test to keep this race condition firmly quashed going forward
1. is looking confusing, frankly. There is some hope for 2. though.
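For reference, the kind of reproducer used here boils down to parallel copy-and-compare loops over freshly written files. A minimal sketch (assuming 1 MB files and 1000 copies per worker; the actual reproducer.sh in circulation may differ in detail):

Code:
#!/bin/sh
# Sketch of one copy-and-compare worker. Run several instances in parallel,
# e.g.:  for i in {1..8}; do ./reproducer.sh & done; wait
prefix="reproducer_$$"
dd if=/dev/urandom of="${prefix}_0" bs=1M count=1 2>/dev/null

echo "writing files"
i=1
while [ "$i" -le 1000 ]; do
    # cp may probe the freshly written file with SEEK_HOLE/SEEK_DATA,
    # which is where the hole-reporting race can bite
    cp "${prefix}_0" "${prefix}_${i}"
    i=$((i + 1))
done

echo "checking files"
i=1
while [ "$i" -le 1000 ]; do
    # a corrupted copy shows up as "Binary files ... differ"
    diff "${prefix}_0" "${prefix}_${i}"
    i=$((i + 1))
done

Any bad copy then turns out to be full of zeros when inspected with hexdump, as in the runs posted further down this thread.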
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
that more data points for the fast system under full/heavy/moderate load are coming?
Yes, but first I want to test the potato system with SSDs, to get a better feel for how directly comparable the results are. It'll take a while; I've been collecting data for several days now.
 
Joined
Oct 22, 2019
Messages
3,641
Also, I want to highlight rincebrain's write-up over on GitHub: https://gist.github.com/rincebrain/e23b4a39aba3fadc04db18574d30dc73
I don't want to veer off-topic, but I feel it's best to ask here from a home-user perspective. His commentary got me thinking:

What is the point (or even benefit) of using "sparse" and "holey" files on a CoW filesystem with fast inline compression? I could never wrap my head around this.

Removing "hole-seeking" out the equation, and leaving everything to inline compression, seems much more streamlined and simple.

If I have a file with a massive chunk of zeroes (think as large as you want!), even using LZ4 inline compression squishes these records down to basically nothing. A 100 MiB chunk of zeroes in the middle of a file will consume only kilobytes of disk space. To "read" or "seek" the contents of this file is trivial.

So I can understand its main benefit outside of CoW and inline compression. But with ZFS, sparse files seem redundant at best.
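As a rough illustration of the zeroes point (dataset path is hypothetical): on a dataset with compression=lz4, writing 100 MiB of literal zeroes consumes essentially no pool space:

Code:
# hypothetical dataset path; compression=lz4 (or any inline compression) enabled
dd if=/dev/zero of=/mnt/tank/test/zeroes bs=1M count=100
ls -lh /mnt/tank/test/zeroes   # apparent size: 100M (what applications see)
du -h  /mnt/tank/test/zeroes   # allocated size: next to nothing on disk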
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What is the point (or even benefit) of using "sparse" and "holey" files on a CoW filesystem with fast inline compression? I could never wrap my head around this.
A hole can be huge, and it would be faster to report it as such. You're right that inline compression mitigates the disk I/O side of things, but applications receive the uncompressed data, and if they get a gigabyte of zeros, that's still a gigabyte they need to process however they process things. If you instead report a hole, the application can just skip over the whole segment.

It is situational, absolutely, but it is useful.
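To make that concrete, a hedged example (GNU userland assumed; paths made up): with a 1 GiB file that is one big hole, a hole-aware reader can skip the empty range via SEEK_HOLE/SEEK_DATA, while a naive reader still has to chew through a gigabyte of zeroes coming out of read(), even though the pool stored essentially nothing.

Code:
# create a 1 GiB file that is entirely a hole
truncate -s 1G /mnt/tank/test/holey

# hole-aware copy: GNU cp probes with SEEK_HOLE/SEEK_DATA and skips the hole
cp --sparse=always /mnt/tank/test/holey /mnt/tank/test/holey_copy
du -h /mnt/tank/test/holey /mnt/tank/test/holey_copy   # both ~0 on disk

# naive read: still pulls 1 GiB of zeroes through read()
dd if=/mnt/tank/test/holey of=/dev/null bs=1M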
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
You can set the above tunable to "0" in the meantime. :smile:
According to this comment, though, it's not an effective workaround. Or am I interpreting it wrong?
This is also why you can still reproduce it with that tunable set to 0 - because the tunable only matters if dnode_is_dirty returns true, and if it doesn't, then you are sad.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It's effective in that it substantially reduces the likelihood of hitting this. It's not perfect because it does not fix the underlying race condition.
 

FrankWard

Explorer
Joined
Feb 13, 2023
Messages
71
I've read a lot of threads this morning as I discovered this issue. I just built a SCALE box and have an existing CORE box. From what I can tell from my cursory research, there are no fixes yet, just some band-aids, which are different for CORE and SCALE. Is there a clear instructional post giving us the best mitigation approach for both systems? I've seen several ideas, but I can't pin down exactly what I need to do on both of my systems to mitigate this as much as possible until a true fix is released. Thank you!
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
The fixes are on their way in OpenZFS 2.1.14 and 2.2.2.
Mitigation is essentially the same on CORE and SCALE: Set zfs_dmu_offset_next_sync to 0.
In SCALE, you do that with echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync as an init task, and run it immediately if you don't reboot.
In CORE, the tunable is set directly from the Tunables screen in the GUI, but it has a prefix: vfs.zfs.dmu_offset_next_sync
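To verify the setting took effect (a quick check; the CORE sysctl is the same one the GUI tunable sets):

Code:
# SCALE (Linux): set it now and confirm
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync    # should print 0

# CORE (FreeBSD): confirm the sysctl the GUI tunable sets
sysctl vfs.zfs.dmu_offset_next_sync                        # should report 0 once applied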

Edit: corrections
 
Last edited:

FrankWard

Explorer
Joined
Feb 13, 2023
Messages
71
The fixes are on their way in OpenZFS 2.1.14 and 2.2.1.
Mitigation is essentially the same on CORE and SCALE: Set zfs_dmu_offset_next_sync to 0.
In SCALE, you do that with echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync as in init task, and immediately if you don't reboot.
In CORE, the tunable is set directly from the Tunables screen in the GUI, but it has a prefix: vfs.zfs_dmu_offset_next_sync
Thank you @Etorix. You sir, are a gentleman and a scholar.
 

das1996

Dabbler
Joined
May 26, 2020
Messages
25
In CORE, the tunable is set directly from the Tunables screen in the GUI, but it has a prefix: vfs.zfs_dmu_offset_next_sync

I think you mean "vfs.zfs.dmu_offset_next_sync". Note the period between zfs and dmu rather than underscore.
Code:
root@nas1[~]# sysctl -a | grep vfs.zfs_dmu_offset_next_sync
root@nas1[~]#


vs

Code:
root@nas1[~]# sysctl -a | grep vfs.zfs.dmu_offset_next_sync
vfs.zfs.dmu_offset_next_sync: 1

This is on TrueNAS-13.0-U5.1
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Just came across this thread.

I thought I would test my NAS to see if it was vulnerable.

It's TrueNAS CORE 13.0-U5.3.

It's got a Xeon E5-1650 v2 (6C/12T) and 128 GB of ECC memory on a Supermicro board.

Storage is a total of 24 SATA "NAS" drives. Some are WD Reds, some are Seagate Exos. The 24 drives are in 4 vdevs of 6 drives each in RAIDZ2.

The reproducer script was run with the built-in FreeBSD cp, not GNU cp, set to 1 MB files and 1000 copies.

The first run on this NAS failed:

Code:
root@nas testdirectory # for i in {1..8} ; do ~/reproducer.sh & done; wait
[1] 37367
[2] 37368
[3] 37369
[4] 37370
[5] 37371
[6] 37372
[7] 37373
[8] 37376
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
Binary files reproducer_37370_0 and reproducer_37370_142 differ
Binary files reproducer_37370_0 and reproducer_37370_285 differ
Binary files reproducer_37370_0 and reproducer_37370_286 differ
Binary files reproducer_37367_0 and reproducer_37367_292 differ
Binary files reproducer_37370_0 and reproducer_37370_571 differ
Binary files reproducer_37370_0 and reproducer_37370_572 differ
Binary files reproducer_37370_0 and reproducer_37370_573 differ
Binary files reproducer_37370_0 and reproducer_37370_574 differ
Binary files reproducer_37367_0 and reproducer_37367_585 differ
Binary files reproducer_37367_0 and reproducer_37367_586 differ
[1]   Done                    ~/reproducer.sh
[2]   Done                    ~/reproducer.sh
[3]   Done                    ~/reproducer.sh
[5]   Done                    ~/reproducer.sh
[7]-  Done                    ~/reproducer.sh
[8]+  Done                    ~/reproducer.sh
[4]-  Done                    ~/reproducer.sh
[6]+  Done                    ~/reproducer.sh


As expected, all the non-identical files are full of nulls:

Code:
root@nas testdirectory # hexdump reproducer_37370_142
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000
root@nas testdirectory # hexdump reproducer_37370_285
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000
root@nas testdirectory # hexdump reproducer_37370_286
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000
root@nas testdirectory # hexdump reproducer_37370_572
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000
root@nas testdirectory # hexdump reproducer_37367_585
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000



I then tried running it at work on our TrueNAS X10, and it also failed on the first run:

Code:
Binary files reproducer_98467_0 and reproducer_98467_386 differ
Binary files reproducer_98467_0 and reproducer_98467_773 differ
Binary files reproducer_98467_0 and reproducer_98467_774 differ


I know this is unlikely to be an issue with network access, but it's still very worrying.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I know this is unlikely to be an issue with network access, but it's still very worrying.
In order to mitigate the issue you may set the vfs.zfs.dmu_offset_next_sync variable in the tunables (System>Tunables) as a SYSCTL with value 0, as written in the official statement.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
In order to mitigate the issue you may set the vfs.zfs.dmu_offset_next_sync variable in the tunables (System>Tunables) as a SYSCTL with value 0, as written in the official statement.

I'm aware of the workaround.

Neither of these NASes does any local file manipulation (no jails, etc.); everything is handled over the network. So I doubt it'll be a problem anyway.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
The bug that keeps on giving: #15615
robn said:
#15571 is a reliable fix for #15526, but it wasn't clear why it was necessary. This PR explains it, and offers the correct fix.

For avoidance of doubt: #15571 fixes the problem. There's no new problem that this fixes. There's no hurry at all to ship this (assuming its even right).
ZFS code is undergoing a thorough scrub. Thanks again to all involved!
 