Silent corruption with OpenZFS (ongoing discussion and testing)

elbzone

Cadet
Joined
Mar 30, 2023
Messages
7
Hello,

I am running a TrueNAS-SCALE-23.10.0.1 with block cloning enabled.
But I am not able to set `vfs.zfs.dmu_offset_next_sync=0`:
root@truenas[/mnt/elbzone-4000-4lki/truenas/home/nvb]# sysctl vfs.zfs.dmu_offset_next_sync=0
sysctl: cannot stat /proc/sys/vfs/zfs/dmu_offset_next_sync: No such file or directory
root@truenas[/mnt/elbzone-4000-4lki/truenas/home/nvb]# sysctl -w vfs.zfs.dmu_offset_next_sync=0
sysctl: cannot stat /proc/sys/vfs/zfs/dmu_offset_next_sync: No such file or directory

First question: am I doing something wrong?
Second question: how long does `reproducer.sh` run on your machine?

FTL: I've tried to read these three pages of the forum, plus as much information as I could from the official GitHub issue page and the Phoronix forum. Hopefully my question is not something for /dev/null.

Best regards and thanks in advance

[edit]
So far, with 8 threads, I was not able to reproduce the issue. I will give it a try with 12 threads, since I have a 12-thread system (AMD Ryzen 5 2600 Six-Core Processor, 64 GB ECC memory).
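For anyone wanting to match the worker count to the machine, here is a minimal sketch of launching a given number of reproducers in parallel and waiting for all of them. The `run_parallel` helper is hypothetical and not part of `reproducer.sh`; using `getconf _NPROCESSORS_ONLN` to find the number of hardware threads is an assumption that may not hold on every platform.

```shell
#!/bin/sh
# run_parallel COUNT CMD [ARGS...]
# Hypothetical helper: start COUNT copies of CMD in the background,
# then block until every copy has exited.
run_parallel() {
    workers=$1
    shift
    i=0
    while [ "$i" -lt "$workers" ]; do
        "$@" &
        i=$((i + 1))
    done
    wait
}

# Example (assumes reproducer.sh is in the current directory):
#   run_parallel "$(getconf _NPROCESSORS_ONLN)" ./reproducer.sh
```

This is equivalent to the `for i in {1..8} ; do ./reproducer.sh & done; wait` one-liner seen elsewhere in this thread, only with the count passed in as a parameter.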
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Second question: how long does `reproducer.sh` run on your machine?
A few minutes at most, depending on how many in parallel and how capable the system is.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
First question: am I doing something wrong?
Second question: how long does `reproducer.sh` run on your machine?
1. You're attempting to use a FreeBSD sysctl on a Linux kernel - use:
Code:
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
instead.

2. Runtime varies depending on hardware and parallelism as mentioned by @Ericloewe but you should expect it to sit for several minutes on the "writing files" pass. You can run a second terminal and monitor iostat output to see that your disks are still active.
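A minimal sketch of setting the parameter and confirming that the new value took effect. The `write_param` helper and the `ZFS_PARAM` variable are hypothetical names used for illustration only; the path is the one given above, and the write has to be done as root.

```shell
#!/bin/sh
# Path from the post above; can be overridden through the environment.
ZFS_PARAM=${ZFS_PARAM:-/sys/module/zfs/parameters/zfs_dmu_offset_next_sync}

# write_param FILE VALUE
# Hypothetical helper: overwrite FILE with VALUE, then print what
# the file now contains so the change can be checked by eye.
write_param() {
    if [ ! -w "$1" ]; then
        echo "not writable: $1 (are you root, and is the zfs module loaded?)" >&2
        return 1
    fi
    printf '%s\n' "$2" > "$1"
    cat "$1"
}

# Example (as root):
#   write_param "$ZFS_PARAM" 0
```

If the helper prints `0`, the module has accepted the value. Like any write under `/sys/module`, this lasts only until the next reboot.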
 

tiberiusQ

Contributor
Joined
Jul 10, 2017
Messages
190
@winnielinnie @Ericloewe @Etorix update: was able to reproduce on CORE U6, 1M files.

Code:
root@truenas[/mnt/alpha/seeker]# for i in {1..8} ; do ./reproducer.sh & done; wait
[2] 82307
[3] 82308
[4] 82309
[5] 82310
[6] 82313
[7] 82316
[8] 82317
[9] 82318
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
checking files
checking files
checking files
checking files
checking files
checking files
Binary files reproducer_82317_0 and reproducer_82317_24 differ
Binary files reproducer_82317_0 and reproducer_82317_49 differ
Binary files reproducer_82317_0 and reproducer_82317_50 differ
Binary files reproducer_82317_0 and reproducer_82317_99 differ
checking files
Binary files reproducer_82317_0 and reproducer_82317_100 differ
Binary files reproducer_82317_0 and reproducer_82317_101 differ
Binary files reproducer_82317_0 and reproducer_82317_102 differ
checking files
Binary files reproducer_82317_0 and reproducer_82317_199 differ
Binary files reproducer_82317_0 and reproducer_82317_200 differ
Binary files reproducer_82317_0 and reproducer_82317_201 differ
Binary files reproducer_82317_0 and reproducer_82317_202 differ
Binary files reproducer_82317_0 and reproducer_82317_203 differ
Binary files reproducer_82317_0 and reproducer_82317_204 differ
Binary files reproducer_82317_0 and reproducer_82317_205 differ
Binary files reproducer_82317_0 and reproducer_82317_206 differ
Binary files reproducer_82317_0 and reproducer_82317_399 differ
Binary files reproducer_82317_0 and reproducer_82317_400 differ
Binary files reproducer_82317_0 and reproducer_82317_401 differ
Binary files reproducer_82317_0 and reproducer_82317_402 differ
Binary files reproducer_82317_0 and reproducer_82317_403 differ
Binary files reproducer_82317_0 and reproducer_82317_404 differ
Binary files reproducer_82317_0 and reproducer_82317_405 differ
Binary files reproducer_82317_0 and reproducer_82317_406 differ
Binary files reproducer_82317_0 and reproducer_82317_407 differ
Binary files reproducer_82317_0 and reproducer_82317_408 differ
Binary files reproducer_82317_0 and reproducer_82317_409 differ
Binary files reproducer_82317_0 and reproducer_82317_410 differ
Binary files reproducer_82317_0 and reproducer_82317_411 differ
Binary files reproducer_82317_0 and reproducer_82317_412 differ
Binary files reproducer_82317_0 and reproducer_82317_413 differ
Binary files reproducer_82317_0 and reproducer_82317_414 differ
[5]    done       ./reproducer.sh
[2]    done       ./reproducer.sh
Binary files reproducer_82317_0 and reproducer_82317_799 differ
Binary files reproducer_82317_0 and reproducer_82317_800 differ
Binary files reproducer_82317_0 and reproducer_82317_801 differ
Binary files reproducer_82317_0 and reproducer_82317_802 differ
Binary files reproducer_82317_0 and reproducer_82317_803 differ
Binary files reproducer_82317_0 and reproducer_82317_804 differ
Binary files reproducer_82317_0 and reproducer_82317_805 differ
Binary files reproducer_82317_0 and reproducer_82317_806 differ
Binary files reproducer_82317_0 and reproducer_82317_807 differ
Binary files reproducer_82317_0 and reproducer_82317_808 differ
Binary files reproducer_82317_0 and reproducer_82317_809 differ
Binary files reproducer_82317_0 and reproducer_82317_810 differ
Binary files reproducer_82317_0 and reproducer_82317_811 differ
Binary files reproducer_82317_0 and reproducer_82317_812 differ
Binary files reproducer_82317_0 and reproducer_82317_813 differ
Binary files reproducer_82317_0 and reproducer_82317_814 differ
Binary files reproducer_82317_0 and reproducer_82317_815 differ
Binary files reproducer_82317_0 and reproducer_82317_816 differ
Binary files reproducer_82317_0 and reproducer_82317_817 differ
Binary files reproducer_82317_0 and reproducer_82317_818 differ
Binary files reproducer_82317_0 and reproducer_82317_819 differ
Binary files reproducer_82317_0 and reproducer_82317_820 differ
Binary files reproducer_82317_0 and reproducer_82317_821 differ
Binary files reproducer_82317_0 and reproducer_82317_822 differ
Binary files reproducer_82317_0 and reproducer_82317_823 differ
Binary files reproducer_82317_0 and reproducer_82317_824 differ
Binary files reproducer_82317_0 and reproducer_82317_825 differ
Binary files reproducer_82317_0 and reproducer_82317_826 differ
Binary files reproducer_82317_0 and reproducer_82317_827 differ
Binary files reproducer_82317_0 and reproducer_82317_828 differ
Binary files reproducer_82317_0 and reproducer_82317_829 differ
Binary files reproducer_82317_0 and reproducer_82317_830 differ
[9]  + done       ./reproducer.sh
[7]  - done       ./reproducer.sh
[3]    done       ./reproducer.sh
[4]    done       ./reproducer.sh
[8]  + done       ./reproducer.sh
[6]  + done       ./reproducer.sh
Interesting because I was unable to reproduce with one of my Core 13.0-U6 systems....and the zfs guys wrote:
Users running the 2.1.x branch or older are unaffected, as block cloning is a 2.2.x-only feature.
Ref. https://github.com/openzfs/zfs/releases/tag/zfs-2.2.1

So Core 13.0.x systems should be safe because they use the 2.1.x branch!?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I realize this is moving quickly (and @winnielinnie, I'll take the liberty of adding a small update to your first post to help people out), but it pays to read the first post carefully: This has been reproduced on ZFS 2.1, as block cloning lifted the carpet hiding this mess but did not create the mess.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Sidenote: Because reproducing this with block cloning enabled and in use is not interesting at this point, I've been trying to reproduce this on OpenZFS 2.1.x, with zero success so far on either FreeBSD or Linux. At this point, I'm literally spamming a 96-core machine - bare metal - with a brand-new mirror of Kioxia CM6-R PCIe 4.0 SSDs with hundreds of runs with all sorts of numbers of workers from 4 to 384.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Sidenote: Because reproducing this with block cloning enabled and in use is not interesting at this point, I've been trying to reproduce this on OpenZFS 2.1.x, with zero success so far on either FreeBSD or Linux. At this point, I'm literally spamming a 96-core machine - bare metal - with a brand-new mirror of Kioxia CM6-R PCIe 4.0 SSDs with hundreds of runs with all sorts of numbers of workers from 4 to 384.
Since it seems to be a race condition, possibly it's a matter of a sufficiently fast machine combined with a slower I/O subsystem? In your case everything is top notch. Just speculating, of course.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Since it seems to be a race condition, possibly it's a matter of a sufficiently fast machine combined with a slower I/O subsystem? In your case everything is top notch. Just speculating, of course.
The thought has crossed my mind. I'm going through a long run (16 workers, 128 loops, 1000 16k files); afterwards I'll try to force some fancy things like seeking to holes (not that there are any). Then I'll see what I can do about other, weirder combinations. I can't test SATA on this machine, since it's only wired for NVMe, but I have a more pedestrian machine that's also Epyc Milan, albeit a mere 32 cores and 512 GB of memory. Just need to call my colleagues on site to put in a disk or four. That's going to be an interesting call.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
@winnielinnie @Ericloewe @Etorix update: was able to reproduce on CORE U6, 1M files.

Thanks. Can you submit a TrueNAS bug report and provide your script/setup? We can then track it with a NAS ticket ID.

BTW.. did you have sync = always or sync = never? I would guess the speed and behaviour could be quite different.
 
Joined
Jan 18, 2017
Messages
525
I'll just add another data point......

Motherboard: SuperMicro X8DTU-F
CPU: 2 x Intel Xeon X5690
Storage Drive: 6 x 2TB Seagate ES.2 in mirrored pairs
Version: TrueNAS-13.0-U4


Code:
writing files
checking files
checking files
Binary files reproducer_67222_0 and reproducer_67222_318 differ
checking files
checking files
checking files
checking files
checking files
checking files
Binary files reproducer_67222_0 and reproducer_67222_637 differ
Binary files reproducer_67222_0 and reproducer_67222_638 differ
[10]   Done                          ./reproducer.sh
[9]    Done                          ./reproducer.sh
[8]    Done                          ./reproducer.sh
[7]    Done                          ./reproducer.sh
[6]    Done                          ./reproducer.sh
[5]    Done                          ./reproducer.sh
[4]    Done                          ./reproducer.sh
[3]    Done                          ./reproducer.sh
% zfs version
zfs-2.1.9-1
zfs-kmod-v2023012500-zfs_9ef0b67f8
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I'll just add another data point......


13.0-U6 is the check we need. It seems unexpected to me, given that it predates ZFS 2.2 and "block cloning".

Also interested in SCALE 23.10.0.1 reproductions; this would be with ZFS 2.2 and "block cloning".
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Thanks. Can you submit a TrueNAS bug report and provide your script/setup? We can then track it with a NAS ticket ID.
Done. NAS-125356

BTW.. did you have sync = always or sync = never? I would guess the speed and behaviour could be quite different.
Value was set to standard. Additional information, as written in the bug report: the test was executed on the unencrypted HDD pool.
 

bcat

Explorer
Joined
Oct 20, 2022
Messages
84
13.0-U6 is the check we need. It seems unexpected to me, given that it predates ZFS 2.2 and "block cloning".
Unfortunately, the bug still occurs even with block cloning disabled and with ZFS 2.1.x versions. Per investigation on the upstream issue, it's a longstanding race condition that block cloning makes somewhat easier to trigger, but disabling block cloning is not sufficient to fix it.

What does appear to work around the issue is setting the zfs_dmu_offset_next_sync=0 tunable (on FreeBSD) or kernel module parameter (on Linux). This seems to disable the section of code that triggers that race condition, at the cost of less accurate (but still compliant) SEEK_HOLE reporting in lseek (and so potentially more bytes used on disk to copy newly written files with holes).
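As a rough sketch of the FreeBSD/Linux split described above: the helper below maps an OS name (as printed by `uname -s`) to the place where the setting lives. The function name `tunable_for_os` is hypothetical; the sysctl name and the `/sys/module` path are the ones that appear earlier in this thread.

```shell
#!/bin/sh
# tunable_for_os OSNAME
# Hypothetical helper: print where zfs_dmu_offset_next_sync is exposed
# on the given OS. OSNAME is expected to be the output of `uname -s`.
tunable_for_os() {
    case "$1" in
        Linux)
            # Kernel module parameter (TrueNAS SCALE)
            echo "/sys/module/zfs/parameters/zfs_dmu_offset_next_sync"
            ;;
        FreeBSD)
            # sysctl tunable (TrueNAS Core)
            echo "vfs.zfs.dmu_offset_next_sync"
            ;;
        *)
            echo "unsupported OS: $1" >&2
            return 1
            ;;
    esac
}

# Example:
#   tunable_for_os "$(uname -s)"
```

On Linux the result is a file to write to; on FreeBSD it is a name to pass to `sysctl`.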
Also interested in SCALE 23.10.0.1 reproductions; this would be with ZFS 2.2 and "block cloning".
I posted just such a case. Do you need more details? I'm happy to provide them if so, just let me know which.

Edit: I put the SCALE 23.10.0.1 repro case on the TrueNAS bug for posterity. If you'd rather I open a separate issue for SCALE, I can as well, but it seems to be caused by the same upstream OpenZFS bug in any case.

Edit 2: Added link to more descriptive repro instructions for ZFS 2.1.13.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
I posted just such a case. Do you need more details? I'm happy to provide them if so, just let me know which.
I suppose that @morganL would like a bug report in JIRA for SCALE 23.10.0.1, just like @Davvo did for Core 13.0-U6, even though the tickets might be merged when developers are satisfied that the issue lies upstream in OpenZFS-2.1.x onwards.
 

bcat

Explorer
Joined
Oct 20, 2022
Messages
84
I suppose that @morganL would like a bug report in JIRA for SCALE 23.10.0.1, just like @Davvo did for Core 13.0-U6, even though the tickets might be merged when developers are satisfied that the issue lies upstream in OpenZFS-2.1.x onwards.
Makes sense. I'll wait for Morgan to confirm (just so I don't spam them with unwanted issues if not), but I'm happy to file a separate ticket for the SCALE repro if so.

Also, for what it's worth, setting zfs_dmu_offset_next_sync=0 may have mitigated the issue for me. At the very least, I wasn't able to repro in 10 or so tries (which would have been enough to trigger the corruption multiple times without that parameter override).
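A minimal sketch of how repeated tries like these could be counted, assuming a run counts as "failed" when its output contains a line ending in `differ` (the format seen in the outputs posted in this thread). The `count_failing_runs` helper is hypothetical and not part of `reproducer.sh`.

```shell
#!/bin/sh
# count_failing_runs TRIES CMD [ARGS...]
# Hypothetical helper: run CMD the given number of times and print how
# many runs produced at least one line ending in "differ".
count_failing_runs() {
    tries=$1
    shift
    failures=0
    n=0
    while [ "$n" -lt "$tries" ]; do
        # grep reads the whole output, so CMD is never cut short.
        if "$@" 2>&1 | grep 'differ$' > /dev/null; then
            failures=$((failures + 1))
        fi
        n=$((n + 1))
    done
    echo "$failures"
}

# Example (assumes reproducer.sh is in the current directory):
#   count_failing_runs 10 ./reproducer.sh
```

A result of `0` after many tries suggests the mitigation is helping but does not prove it, since the underlying bug is a race condition.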
 

Juan Manuel Palacios

Contributor
Joined
May 29, 2017
Messages
146
On Core? You can immediately set the parameter with sysctl, like I did here


Code:
sysctl -w vfs.zfs.dmu_offset_next_sync=0

It has to be run as "root", of course.


EDIT: Just saw your edit. You need to use the "-w" flag to set a value. Otherwise, without the "-w" flag, it will only read the current value.
Doing only that will not make it permanent, and the setting will only be active during the current boot. To make it survive across reboots, you have to set it as a tunable in the TrueNAS Core GUI.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm happy to file a separate ticket for the SCALE repro if so.
Please do, and include your SCALE debug as well. I'll take any flak that gets generated for this. :tongue:

One thing I noticed in your sample output is that only one of your 16 threads choked up (PID 198130) - the others worked. It's not exclusive but that also tracks with my local-repro attempts.
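To check how many workers were affected in a given run, something like the following sketch could tally the mismatches per PID from a saved copy of the output. The `count_differ_by_pid` helper is hypothetical, and it assumes the `Binary files reproducer_<PID>_0 and ... differ` line format shown in the outputs posted in this thread.

```shell
#!/bin/sh
# count_differ_by_pid LOGFILE
# Hypothetical helper: print "PID COUNT" for each reproducer PID that
# has at least one "differ" line in LOGFILE.
count_differ_by_pid() {
    sed -n 's/^Binary files reproducer_\([0-9][0-9]*\)_0 and .* differ$/\1/p' "$1" |
        sort |
        uniq -c |
        awk '{ print $2, $1 }'
}

# Example (run.log is an assumed file name holding the captured output):
#   count_differ_by_pid run.log
```

A single output line means only one worker hit the corruption, which matches the pattern described above.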
 

bcat

Explorer
Joined
Oct 20, 2022
Messages
84
Please do, and include your SCALE debug as well. I'll take any flak that gets generated for this. :tongue:

One thing I noticed in your sample output is that only one of your 16 threads choked up (PID 198130) - the others worked. It's not exclusive but that also tracks with my local-repro attempts.
Interesting. I saw the same thing when I generated a fresh repro case to report the bug just now.

Filed NAS-125358 for the SCALE 23.10.0.1 report.

Edit: lol, my bug was duped against the original bug almost immediately. :) eh, i did ask before filing... :P
 