Silent corruption with OpenZFS (ongoing discussion and testing)

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
It seems someone is writing ZDB functions related to block cloning.
It won't be perfect at finding all corrupted files, but it is something to help. Same author as the 2-part OpenZFS fix.
 

probain

Patron
Joined
Feb 25, 2023
Messages
211
@axhxrx and others, you should use the GUI options below to persist this change across reboots:

CORE: Using System -> Tunables, add
  • Variable: vfs.zfs.dmu_offset_next_sync
  • Value: 0
  • Type: sysctl
SCALE: Using System -> Sysctl, add
  • Variable: zfs_dmu_offset_next_sync
  • Value: 0
Doesn't seem to work.
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
@axhxrx and others, you should use the GUI options below to persist this change across reboots:

CORE: Using System -> Tunables, add
  • Variable: vfs.zfs.dmu_offset_next_sync
  • Value: 0
  • Type: sysctl
SCALE: Using System -> Sysctl, add
  • Variable: zfs_dmu_offset_next_sync
  • Value: 0
I am coming into this knowing it's way over my head (thanks, random LTT WAN Show commenter, for mentioning this), but I'm looking to understand whether I am even affected and whether the above tunables would help mitigate this for me.

I am on 13.0-U5.3, but I can't recall if I actually upgraded my pool. Running zpool get version reports my version as "-"... Useful... super useful. Is there a different command I should be trying to actually report which version of ZFS I am on, so I can see whether I am affected or not?
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
@probain It is vfs.zfs.dmu_offset_next_sync.

@LIGISTX Just zpool version or zfs version.
Setting the tunable to 0 cannot hurt, regardless of whether you upgraded, as it makes it more difficult to trigger the race condition. Remember: The bug is NOT in block cloning; it is merely made worse by block cloning.
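Either command prints both the userland and kernel module versions; for illustration only, on a 2.1.x system it looks something like this (hostname and version strings here are just an example):
Code:
root@truenas[~]# zfs version
zfs-2.1.11-1
zfs-kmod-2.1.11-1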
 

victort

Guru
Joined
Dec 31, 2021
Messages
973
@probain It is vfs.zfs.dmu_offset_next_sync.

@LIGISTX Just zpool version or zfs version.
Setting the tunable to 0 cannot hurt, regardless of whether you upgraded, as it makes it more difficult to trigger the race condition. Remember: The bug is NOT in block cloning; it is merely made worse by block cloning.
@HoneyBadger is this correct? Did you accidentally post the wrong value?
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
@LIGISTX Just zpool version or zfs version.
Setting the tunable to 0 cannot hurt, regardless of whether you upgraded, as it makes it more difficult to trigger the race condition. Remember: The bug is NOT in block cloning; it is merely made worse by block cloning.
Ah, thank you. I am on 2.1.11-1, so I think I am in fact affected?
 
Last edited:

probain

Patron
Joined
Feb 25, 2023
Messages
211

bcat

Explorer
Joined
Oct 20, 2022
Messages
84
Just to be clear, you are on SCALE? If so, it's not a sysctl that you need to set, but rather a kernel module parameter, which, sadly, I think we still can't set through the GUI, but can through the API:
Code:
$ sudo midclt call system.advanced.update '{"kernel_extra_options": "zfs.zfs_dmu_offset_next_sync=0"}'

This will only take effect on future reboots, so unless you intend to reboot right away, you should also override the value in the currently loaded kernel module (not persistent, only affects current boot):
Code:
$ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_dmu_offset_next_sync >/dev/null

Note that there's only one kernel_extra_options string, so if you have existing kernel command line parameters set, the above command will overwrite them, and you'll need something like "kernel_extra_options": "[existing_options_here] zfs.zfs_dmu_offset_next_sync=0" instead.
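If you're not sure whether anything is already set there, something like this should show the current value before you overwrite it (I'm assuming the system.advanced.config call is available on your middleware version; adjust as needed):
Code:
$ sudo midclt call system.advanced.config | python3 -m json.tool | grep kernel_extra_options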
 
Last edited:

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134

probain

Patron
Joined
Feb 25, 2023
Messages
211
Just to be clear, you are on SCALE? If so, it's not a sysctl that you need to set, but rather a kernel module parameter.
I was following the instructions posted by @HoneyBadger. They don't seem to be correct, though.

I have actually applied this in another way; I'm just noting that the instructions as posted don't work.
 

bcat

Explorer
Joined
Oct 20, 2022
Messages
84
I was following the instructions posted by @HoneyBadger. They don't seem to be correct, though.

I have actually applied this in another way; I'm just noting that the instructions as posted don't work.
Right, I think there is no sysctl for this option on Linux; rather, it's a parameter of the ZFS kernel module. So adding this via the sysctl list in the TrueNAS UI won't work on SCALE.

To set this in SCALE in a way that takes effect as soon as the module loads and persists across upgrades/reinstalls, you can set the kernel_extra_options "advanced setting" via the TrueNAS API. (On other Linux systems this could be done through /etc/modprobe.d instead of the kernel command line, but I would assume /etc/modprobe.d is wiped on TrueNAS updates, so going through the API seems safer.)
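For reference, on a plain (non-TrueNAS) Linux system the /etc/modprobe.d route would look something like this (the file name is arbitrary; a sketch only, and not something I'd rely on surviving a TrueNAS update):
Code:
# apply the parameter whenever the zfs module is loaded
echo "options zfs zfs_dmu_offset_next_sync=0" | sudo tee /etc/modprobe.d/zfs-workaround.conf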
 
Last edited:

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
I've actually also come across this issue. The affected files weren't important images, so I just deleted them and didn't think much of it. But this has happened 4-5 times now in the last year or so.

I'm afraid I'm going to have to join this party, guys. As I run a particularly messy filesystem, I've always assumed it was just me doing something wrong: yet another duplicate I failed to copy, or something along those lines.

I am relatively sure that in the past 2 years I've come across a few files which I was darn sure should be fine, and they were in fact not.


So, as a low-skill, long-term user here, how can I help / test / confirm anything (without *any* further risk to data)? What test?

Also, now that the topic is in the air, what tools do we have that will check data for corruption? The tool that looks for runs of 0000's sounds useful, though I'd actually be curious about a general tool that will check images and videos for corruption. I had something not half bad at this years ago, but I'll be damned if I can recall the tool.
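For what it's worth, the kind of zero-check I have in mind would be something like this rough sketch, which flags files whose first 4 KiB is entirely zero bytes (the path is just an example; expect false positives, since some formats legitimately start with zeroes, and it won't catch corruption elsewhere in a file):
Code:
find /mnt/tank/photos -type f -size +4k -print0 | while IFS= read -r -d '' f; do
    if [ -z "$(head -c 4096 -- "$f" | tr -d '\0')" ]; then
        echo "possible zeroed-out start: $f"
    fi
done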
 

grahamperrin

Dabbler
Joined
Feb 18, 2013
Messages
27

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I was following the instructions posted by @HoneyBadger. They don't seem to be correct, though.

I have actually applied this in another way; I'm just noting that the instructions as posted don't work.
Thanks for the heads up. The GUI method definitely should be working as that's where several other tunables live (and do take effect) but perhaps something's amiss. (And yes, it still shows them as "sysctl" despite it being "kernel parameter" on Linux.)



Code:
root@mini-r[~]# cat /sys/module/zfs/parameters/zfs_dirty_data_max_max \
> /sys/module/zfs/parameters/l2arc_noprefetch \
> /sys/module/zfs/parameters/l2arc_write_max \
> /sys/module/zfs/parameters/l2arc_write_boost
12884901888
0
10000000
40000000
 

bcat

Explorer
Joined
Oct 20, 2022
Messages
84
Thanks for the heads up. The GUI method definitely should be working as that's where several other tunables live (and do take effect) but perhaps something's amiss. (And yes, it still shows them as "sysctl" despite it being "kernel parameter" on Linux.)


Code:
root@mini-r[~]# cat /sys/module/zfs/parameters/zfs_dirty_data_max_max \
> /sys/module/zfs/parameters/l2arc_noprefetch \
> /sys/module/zfs/parameters/l2arc_write_max \
> /sys/module/zfs/parameters/l2arc_write_boost
12884901888
0
10000000
40000000
Hmm, that's weird. It's very possible there is in fact a way to set this through the GUI that I just missed, but I couldn't find one.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
If this is already noted, apologies:
  • FreeBSD 14.0-RELEASE has 2.2 with block cloning disabled by default.
Beyond that: no other observation from me. <https://old.reddit.com/r/freebsd/comments/182pgki/-/> is pinned, with a pinned comment; my discovery of the bug was through BSD Cafe.
Even if already noted, it's worth restating: FreeBSD 14.0 has a sysctl for this, vfs.zfs.bclone_enabled, which is indeed disabled (0) by default. However, vfs.zfs.dmu_offset_next_sync still defaults to 1 there.
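On a stock FreeBSD 14.0 box those can be checked, and the mitigation applied to the running system, with plain sysctl; roughly:
Code:
# check the current values
sysctl vfs.zfs.bclone_enabled vfs.zfs.dmu_offset_next_sync
# apply the mitigation for the running system; add it to /etc/sysctl.conf to persist across reboots
sysctl vfs.zfs.dmu_offset_next_sync=0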
Hmm, that's weird. It's very possible there is in fact a way to set this through the GUI that I just missed, but I couldn't find one.
The method I outlined is the way (and I swear I've done it in that manner on SCALE before, even) but I'll have to repro from a clean slate later so I can send a lightweight debug up to Jira.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Makes sense. I'll wait for Morgan to confirm (just so I don't spam them with unwanted issues if not), but I'm happy to file a separate ticket for the SCALE repro if so.

Also, for what it's worth, setting zfs_dmu_offset_next_sync=0 may have mitigated the issue for me. At the very least, I wasn't able to repro in 10 or so tries (which would have been enough to trigger the corruption multiple times without that parameter override).
A bug report with info on SCALE and CORE is probably best. The engineering team can split it if they see a need for different fixes.
 

Gcon

Explorer
Joined
Aug 1, 2015
Messages
59
For what it's worth, I seemed to get a performance boost by setting zfs_dmu_offset_next_sync to 0 on a NAS running SCALE 22.12.4.2 (zfs-2.1.12-1) with a storage pool of four HDD mirror vdevs plus an NVMe SSD log vdev, mounted over NFS to a server used for backing up about a dozen production VMs nightly.

I start off with a "forever forward incremental" job where the backup software typically reports speeds in the range of 750 Mbps to 850 Mbps (via 10 Gbps DAC), but after setting zfs_dmu_offset_next_sync to 0 it clocked 943 Mbps. This is the first time I've seen it get above 900 Mbps (the system has been operational for months), barring perhaps the first-ever backup. A successful completion triggers a second job that copies full VM images to an offsite server. I usually get about 1.5 Gbps for that, but last night it clocked 1.69 Gbps, which I think is another record (the offsite server runs ext3, as that's all it supports).

I am wary of making "post hoc ergo propter hoc" assumptions, but no other changes were made, so it'll be interesting to see whether the speed boost continues. If it does, I'll take it, but it's no biggie if it doesn't, as the main aim by far is to mitigate any silent corruption. With these backup boxes, raw speed is nice, but file integrity is absolutely essential and non-negotiable!
 

axhxrx

Cadet
Joined
Nov 11, 2023
Messages
3
Hmm, that's weird. It's very possible there is in fact a way to set this through the GUI that I just missed, but I couldn't find one.
The way I fixed this persistently was based on this comment upthread.

Navigate to System Settings → Advanced → Init/Shutdown Scripts.

I added a script to run the command echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync on every boot with the "When" value set to "Pre Init".

I've confirmed that this persists across reboots. (I was initially confused because in the UI there seemed to be multiple possible ways... the System Settings → Advanced → Sysctl settings area seemed like it might be the way, but apparently that doesn't currently work in SCALE, per other discussion in this thread.)
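For anyone repeating this, one way to confirm after a reboot is simply to read the live parameter back; it should report 0 once the Pre Init script has run:
Code:
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync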

 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
It was indeed mentioned that setting dmu_offset_next_sync to 0 increases write performance, at the cost of possibly using more space by missing potential holes which could have been detected.

Per comments #7-8 in the FreeBSD discussion, dmu_offset_next_sync=0 is a mitigation but not a definitive solution. A fix with respect to block cloning (OpenZFS 2.2) has been proposed, but that still leaves open the question of why the bug can occur in OpenZFS 2.1.
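As a rough illustration of that space cost (hypothetical file and path, nothing from this thread), comparing a copy's apparent size against its allocated size shows whether its holes were preserved:
Code:
# apparent size vs. space actually allocated; a large gap means the copy is still sparse (holes preserved)
du -h --apparent-size /mnt/tank/vm-image-copy.raw
du -h /mnt/tank/vm-image-copy.raw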
 