Silent corruption with OpenZFS (ongoing discussion and testing)

Cellobita

Contributor
Joined
Jul 15, 2011
Messages
107
I'm leaning towards resorting to this myself. It was only relatively recently enabled by default (starting with OpenZFS 2.1.5, from what I understand), so surely it can't be dangerous to set it to 0? (I don't even deal with sparse / "holey" files; if I do, it's quite uncommon for me.)

I'm thinking along the same lines myself.

It can be set via GUI (Tunables) - will it require a reboot to activate?

P.S.: Never mind, I can also set it via the shell with sysctl vfs.zfs.dmu_offset_next_sync=0, thus avoiding a reboot to activate it
 
Joined
Oct 22, 2019
Messages
3,641
It can be set via GUI (Tunables) - will it require a reboot to activate?

On Core? You can immediately set the parameter with sysctl, like I did here


Code:
sysctl -w vfs.zfs.dmu_offset_next_sync=0

It has to be run as "root", of course.


EDIT: Just saw your edit. You need to use the "-w" flag to set a value. Otherwise, without the "-w" flag, it will only read the current value.
 
Last edited:

kspare

Guru
Joined
Feb 19, 2015
Messages
508
How does this affect people using core for iscsi with a sparse zvol?
 
Joined
Oct 22, 2019
Messages
3,641
How does this affect people using core for iscsi with a sparse zvol?
I don't use zvols or iSCSI, so I am not sure. That's why I don't believe disabling the parameter is a "universal" workaround for everyone.

Someone with more knowledge of ZFS might be able to provide more clarity.
 

Cellobita

Contributor
Joined
Jul 15, 2011
Messages
107
On Core? You can immediately set the parameter with sysctl, like I did here


Code:
sysctl -w vfs.zfs.dmu_offset_next_sync=0

It has to be run as "root", of course.


EDIT: Just saw your edit. You need to use the "-w" flag to set a value. Otherwise, without the "-w" flag, it will only read the current value.

I have tried without the -w, and it seems to work:

Code:
root@servidor[~]# sysctl vfs.zfs.dmu_offset_next_sync=1
vfs.zfs.dmu_offset_next_sync: 0 -> 1
root@servidor[~]# sysctl vfs.zfs.dmu_offset_next_sync 
vfs.zfs.dmu_offset_next_sync: 1
root@servidor[~]# sysctl vfs.zfs.dmu_offset_next_sync=0
vfs.zfs.dmu_offset_next_sync: 1 -> 0
root@servidor[~]# sysctl vfs.zfs.dmu_offset_next_sync 
vfs.zfs.dmu_offset_next_sync: 0
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
How does this affect people using core for iscsi with a sparse zvol?
iSCSI doesn't hook into the block cloning feature, so that won't be a factor.

dmu_offset_next_sync is a different challenge, though. The code appears to be triggered primarily by file-level copies, so the block-level actions of iSCSI are less likely to trip over it. Things like SCSI XCOPY commands against a zvol that contains a lot of sparse space may still hit it, as to ZFS both files and zvol blocks are made up of records.
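For anyone unfamiliar with what "holey" means above, a minimal shell sketch (filename is illustrative) of a sparse file, where the apparent size exceeds the space actually allocated on disk:

```shell
# Create a sparse ("holey") file: 1 MiB of apparent size, no data written.
truncate -s 1M holey.bin
apparent=$(wc -c < holey.bin)             # apparent size in bytes
allocated=$(du -k holey.bin | cut -f1)    # KiB actually allocated on disk
echo "apparent=${apparent} bytes, allocated=${allocated} KiB"
```

The lseek(SEEK_HOLE/SEEK_DATA) machinery that tools use to detect such holes is exactly what the dmu_offset_next_sync tunable influences.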
 
Joined
Oct 22, 2019
Messages
3,641
No repro on a TrueNAS SCALE 24.10.0.1 in a Proxmox VM w/ HBA passthrough, 32 disks in 8x RAIDZ2 and 2 NVMe cache devices - block cloning enabled

even when running 64 at once:

Code:
for i in {1..64}; do ./zfs-bclone-repo.sh & done; wait

Now, the Proxmox host is also using ZFS (a simple 2-disk mirror; block cloning is not enabled as far as I can tell), but there I can repro it, even with as few as 8!! echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync fixes it

Upon reading the latest comments in the GitHub bug report (from the past couple hours), this is likely because virtualization has enough of a "delay" to prevent this race condition.

Hence, slow systems, and especially VMs, are the least likely to reproduce this bug. This means that a virtualized environment is not a good place to test for this bug.

This might explain why you were able to reproduce it on your Proxmox host, but not on your virtualized TrueNAS SCALE.
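For reference, the knob toggled in the quoted post is the Linux-side ZFS module parameter; a sketch of reading and disabling it at runtime (requires root; the runtime change does not survive a reboot unless also set via modprobe configuration):

```shell
# Read the current value (1 = enabled by default since OpenZFS 2.1.5).
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

# Disable at runtime. To persist across reboots, add the line
#   options zfs zfs_dmu_offset_next_sync=0
# to a file under /etc/modprobe.d/ (e.g. zfs.conf).
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
```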
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'm not sure anyone has a really good theory for the timing involved in this race condition... I've read people saying it's easier to hit both on slower and on faster systems, with both slower and faster storage.

I'm going to try for ludicrous speed with a VM with mirrored passthrough NVMe SSDs. I can also try for ludicrously slow storage using SMR 2.5" laptop drives, I still have a bunch of them lying around...
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
When you say instances, do you mean virtualized?
Both virtual and bare metal initially, but I moved on to exclusively bare-metal testing last night to eliminate the potential for VM/hypervisor interference.
 
Joined
Oct 22, 2019
Messages
3,641
I can also try for ludicrously slow storage using SMR 2.5" laptop drives, I still have a bunch of them lying around...
Off-topic: A few months ago I actually had a dream where I started a "charity" that distributed "free" TrueNAS boxes made from scrap parts, but the recipients had to sign a contract that would obligate them to become lifelong bug testers. :tongue:

Kind of wish that dream was true...

(I'm not evil. I swear.)
 
Joined
Oct 22, 2019
Messages
3,641
but I moved on to exclusively bare-metal testing last night to eliminate the potential for VM/hypervisor interference.
Did you see the proto-patch created by Rob N., referenced by @Ericloewe? Could it be used in your tests, or iXsystems' tests? (I am in no place to patch any part of the OpenZFS source code, let alone understand what the patch "does" exactly.)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Knock knock.
Race condition.

Who's there?

Did you see the proto-patch created by Rob N., referenced by @Ericloewe? Could it be used in your tests, or iXsystems' tests? (I am in no place to patch any part of the OpenZFS source code, let alone understand what the patch "does" exactly.)
I did see that; see also the actual pull request here: https://github.com/openzfs/zfs/pull/15571

Notably, I just now finally managed to repro this locally by changing the script to fire off 10000x 128K files instead of 1000x 1M - it faulted a single file, which shows the expected "holey" behavior.

Code:
Binary files reproducer_745948_0 and reproducer_745948_6582 differ
root@mini-r[/mnt/cobia-raidz2/repro]# hexdump reproducer_745948_6582
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000
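The hexdump above is the telltale signature: the victim file is nothing but zeros. A hedged sketch of a check for that pattern (the function name is my own invention, not part of the reproducer script):

```shell
# Succeed if the file consists entirely of zero bytes, which is how the
# corrupted files in this bug present (data silently replaced by a hole).
is_all_zero() {
    sz=$(wc -c < "$1")
    # Compare the file byte-for-byte against an equal-length zero stream.
    head -c "$sz" /dev/zero | cmp -s - "$1"
}
```

Usage would be something like `for f in reproducer_*; do is_all_zero "$f" && echo "suspect: $f"; done`. Note that a legitimately all-zero file also matches, so this only flags candidates, not confirmed corruption.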
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
It has been reproduced on SCALE and Core.

See posts [...] #25 (Core 13.0-U6)
To be fair, isn't that only after installing coreutils?

I am not able to reproduce the bug in a standard U6 install.
 
Joined
Oct 22, 2019
Messages
3,641
To be fair, isn't that only after installing coreutils?
Correct. (And because I didn't want to "install" anything in the host system, I used a jail. You should never install stuff on the host system anyway, whether it's Core or SCALE, and whether with "pkg" or "apt".)

However:
  • Data corruption is data corruption either way, whether it happens from within a jail, and whether it uses a GNU tool or a native FreeBSD tool.
  • Another user in the GitHub discussion reproduced this on TrueNAS Core 13.0-U5.3 without GNU coreutils (using FreeBSD's native "cp").
  • I never fully ruled out that this could be done outside of my jail. (Maybe it's just more difficult on my specific system, but as has been said before: "It's hard to prove a bug doesn't exist.")
 
Last edited:

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
As indicated, we'll investigate for TrueNAS, but have seen no evidence of a problem in the field.
There's a growing suspicion that this issue could be a case of it, triggered by rebalancing vdevs:
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
There's a growing suspicion that this issue could be a case of it, triggered by rebalancing vdevs:
I did notice a commit to that repository intended to prevent the use of block cloning on Linux, via cp --reflink=never:

https://github.com/markusressel/zfs...mmit/979edb265cb866922d77d2189399948629a4479f

The script also uses md5sum to compare source and destination files after a copy, which hopefully would catch the errors at a logical level.

This script does make heavy use of local cp/mv/rm commands - for what it's worth, while I've been able to reproduce this with the reproducer script (by using greater numbers of small files) I have not been able to do this over SMB or NFS using the same "large number of small files" logic.
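The verify-after-copy idea mentioned above is simple enough to sketch (filenames here are placeholders; md5sum is the GNU coreutils tool, with FreeBSD's base-system analogue being md5 -q):

```shell
# Copy a file and verify the destination checksum against the source,
# which would catch this corruption at the logical level.
head -c 1048576 /dev/urandom > src.bin   # sample payload
cp src.bin dst.bin
if [ "$(md5sum < src.bin)" = "$(md5sum < dst.bin)" ]; then
    echo "copy verified"
else
    echo "MISMATCH: dst.bin differs from src.bin" >&2
fi
```

The caveat from the thread still applies: this detects corruption only if the read-back after the copy actually returns the bad data, rather than serving it from a still-correct cache.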
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
@winnielinnie @Ericloewe @Etorix update: was able to reproduce on CORE U6, 1M files.

Code:
root@truenas[/mnt/alpha/seeker]# for i in {1..8} ; do ./reproducer.sh & done; wait
[2] 82307
[3] 82308
[4] 82309
[5] 82310
[6] 82313
[7] 82316
[8] 82317
[9] 82318
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
checking files
checking files
checking files
checking files
checking files
checking files
Binary files reproducer_82317_0 and reproducer_82317_24 differ
Binary files reproducer_82317_0 and reproducer_82317_49 differ
Binary files reproducer_82317_0 and reproducer_82317_50 differ
Binary files reproducer_82317_0 and reproducer_82317_99 differ
checking files
Binary files reproducer_82317_0 and reproducer_82317_100 differ
Binary files reproducer_82317_0 and reproducer_82317_101 differ
Binary files reproducer_82317_0 and reproducer_82317_102 differ
checking files
Binary files reproducer_82317_0 and reproducer_82317_199 differ
Binary files reproducer_82317_0 and reproducer_82317_200 differ
Binary files reproducer_82317_0 and reproducer_82317_201 differ
Binary files reproducer_82317_0 and reproducer_82317_202 differ
Binary files reproducer_82317_0 and reproducer_82317_203 differ
Binary files reproducer_82317_0 and reproducer_82317_204 differ
Binary files reproducer_82317_0 and reproducer_82317_205 differ
Binary files reproducer_82317_0 and reproducer_82317_206 differ
Binary files reproducer_82317_0 and reproducer_82317_399 differ
Binary files reproducer_82317_0 and reproducer_82317_400 differ
Binary files reproducer_82317_0 and reproducer_82317_401 differ
Binary files reproducer_82317_0 and reproducer_82317_402 differ
Binary files reproducer_82317_0 and reproducer_82317_403 differ
Binary files reproducer_82317_0 and reproducer_82317_404 differ
Binary files reproducer_82317_0 and reproducer_82317_405 differ
Binary files reproducer_82317_0 and reproducer_82317_406 differ
Binary files reproducer_82317_0 and reproducer_82317_407 differ
Binary files reproducer_82317_0 and reproducer_82317_408 differ
Binary files reproducer_82317_0 and reproducer_82317_409 differ
Binary files reproducer_82317_0 and reproducer_82317_410 differ
Binary files reproducer_82317_0 and reproducer_82317_411 differ
Binary files reproducer_82317_0 and reproducer_82317_412 differ
Binary files reproducer_82317_0 and reproducer_82317_413 differ
Binary files reproducer_82317_0 and reproducer_82317_414 differ
[5]    done       ./reproducer.sh
[2]    done       ./reproducer.sh
Binary files reproducer_82317_0 and reproducer_82317_799 differ
Binary files reproducer_82317_0 and reproducer_82317_800 differ
Binary files reproducer_82317_0 and reproducer_82317_801 differ
Binary files reproducer_82317_0 and reproducer_82317_802 differ
Binary files reproducer_82317_0 and reproducer_82317_803 differ
Binary files reproducer_82317_0 and reproducer_82317_804 differ
Binary files reproducer_82317_0 and reproducer_82317_805 differ
Binary files reproducer_82317_0 and reproducer_82317_806 differ
Binary files reproducer_82317_0 and reproducer_82317_807 differ
Binary files reproducer_82317_0 and reproducer_82317_808 differ
Binary files reproducer_82317_0 and reproducer_82317_809 differ
Binary files reproducer_82317_0 and reproducer_82317_810 differ
Binary files reproducer_82317_0 and reproducer_82317_811 differ
Binary files reproducer_82317_0 and reproducer_82317_812 differ
Binary files reproducer_82317_0 and reproducer_82317_813 differ
Binary files reproducer_82317_0 and reproducer_82317_814 differ
Binary files reproducer_82317_0 and reproducer_82317_815 differ
Binary files reproducer_82317_0 and reproducer_82317_816 differ
Binary files reproducer_82317_0 and reproducer_82317_817 differ
Binary files reproducer_82317_0 and reproducer_82317_818 differ
Binary files reproducer_82317_0 and reproducer_82317_819 differ
Binary files reproducer_82317_0 and reproducer_82317_820 differ
Binary files reproducer_82317_0 and reproducer_82317_821 differ
Binary files reproducer_82317_0 and reproducer_82317_822 differ
Binary files reproducer_82317_0 and reproducer_82317_823 differ
Binary files reproducer_82317_0 and reproducer_82317_824 differ
Binary files reproducer_82317_0 and reproducer_82317_825 differ
Binary files reproducer_82317_0 and reproducer_82317_826 differ
Binary files reproducer_82317_0 and reproducer_82317_827 differ
Binary files reproducer_82317_0 and reproducer_82317_828 differ
Binary files reproducer_82317_0 and reproducer_82317_829 differ
Binary files reproducer_82317_0 and reproducer_82317_830 differ
[9]  + done       ./reproducer.sh
[7]  - done       ./reproducer.sh
[3]    done       ./reproducer.sh
[4]    done       ./reproducer.sh
[8]  + done       ./reproducer.sh
[6]  + done       ./reproducer.sh
 