winnielinnie
Happy Thanksgiving, everyone! Great timing, eh?
A "silent corruption" bug has been discovered recently in OpenZFS.
Here is the bug report for reference. (It's a long read, and the discussion and investigation are still ongoing):
some copied files are corrupted (chunks replaced by zeros) · Issue #15526 · openzfs/zfs
System information: distribution Gentoo (rolling), kernel 6.5.11, architecture amd64, OpenZFS 2.2.0. Reference: https://bugs.gentoo.org/917224 ...
From what I can grasp (and this might be inaccurate):
- This most notably affects OpenZFS 2.2.0 with block-cloning.
- Block-cloning was an original suspect, but there's the theory that it simply exposes an underlying bug.
- coreutils 9.1 and newer are implicated (however, because this supposedly also affects FreeBSD, that muddies the picture further).
- This has been reproduced on different Linux distributions, FreeBSD, and TrueNAS Core 13.0-U5.3 (and possibly others, as I've reproduced it on Arch Linux).
- This has been reproduced on OpenZFS 2.2.0, OpenZFS 2.1.11, and OpenZFS 2.1.13.
- An upstream "fix" in OpenZFS 2.2.1 is to disable the Block-Reference Table. (This is not a true "fix", only a safeguard in the meantime.)
If you want to test this out yourself and share your results in here:
Tony Hutter's script (designed for OpenZFS 2.2.0 and Linux-based systems): https://gist.github.com/tonyhutter/d69f305508ae3b7ff6e9263b22031a84
@Ericloewe's modified version of this script to work with TrueNAS Core systems: https://www.truenas.com/community/threads/truenas-13-0-u6-is-now-available.114337/page-3
Write-up about what this issue is and is not: https://gist.github.com/rincebrain/e23b4a39aba3fadc04db18574d30dc73
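For anyone who wants to inspect a suspect copy by hand, here is a minimal Python sketch of the kind of check involved. It is not Tony Hutter's or @Ericloewe's script. The function name `find_zeroed_chunks` and the 4096-byte default chunk size are my own choices for illustration. It assumes the symptom matches the bug report's title: chunks of the copy replaced by zeros where the original holds real data.

```python
def find_zeroed_chunks(original_path, copy_path, chunk_size=4096):
    """Return byte offsets of chunks that are all zeros in the copy
    but hold different (non-identical) data in the original."""
    bad_offsets = []
    with open(original_path, "rb") as orig, open(copy_path, "rb") as copy:
        offset = 0
        while True:
            a = orig.read(chunk_size)
            b = copy.read(chunk_size)
            if not a and not b:
                break  # both files exhausted
            # Flag the chunk when the copy contains only NUL bytes while
            # the original does not. A copy that is merely shorter than
            # the original is not reported by this check.
            if b and a != b and b.count(0) == len(b):
                bad_offsets.append(offset)
            offset += chunk_size
    return bad_offsets
```

Calling `find_zeroed_chunks("file", "file.copy")` returns an empty list when no zero-filled chunk differs from the original. Legitimately sparse or zero-filled regions that match the original are not flagged.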
Others have reproduced this issue on, at least:
- TrueNAS Core 13.0-U5.3 with standard FreeBSD cp
- TrueNAS Core 13.0-U6 with GNU cp 9.x
- OpenZFS 2.1.x
- OpenZFS 2.2.x, where it is very evident due to the combination of block cloning and GNU cp 9.x's eagerness to take advantage of it
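I have reproduced silent corruption on Arch Linux with OpenZFS 2.2.0. I initially could not reproduce it on TrueNAS Core 13.0-U6 with OpenZFS 2.1.13, but I later did (see post #25).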
* I ran 16 instances of the script in parallel on spinning HDDs. Supposedly, it's more likely to occur on HDDs rather than SSDs.
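To give a rough idea of what "16 instances in parallel" looks like, here is a hedged Python sketch of a write-copy-compare loop. It is not the linked reproducer script, and I can't promise it triggers the bug. The names `stress`, `worker`, and `system_cp`, the file sizes, and the round counts are all made up for illustration. It assumes the system `cp` is the copy path you want to exercise, and that copying immediately after writing is what matters.

```python
import hashlib
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def sha256_of(path):
    """Hash a file in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def system_cp(src, dst):
    """Copy with the system's cp binary (assumed to be on PATH)."""
    subprocess.run(["cp", src, dst], check=True)


def worker(workdir, worker_id, rounds, size, copy):
    """Write random files, copy each one right away, and compare hashes."""
    mismatches = []
    for i in range(rounds):
        src = os.path.join(workdir, f"w{worker_id}_r{i}.src")
        dst = os.path.join(workdir, f"w{worker_id}_r{i}.dst")
        data = os.urandom(size)
        with open(src, "wb") as f:
            f.write(data)
        # Copy immediately after writing, with no fsync in between.
        copy(src, dst)
        if sha256_of(dst) != hashlib.sha256(data).hexdigest():
            mismatches.append(dst)  # keep both files for inspection
        else:
            os.remove(src)
            os.remove(dst)
    return mismatches


def stress(workdir, workers=16, rounds=100, size=1 << 20, copy=system_cp):
    """Run the workers in parallel and return paths of mismatched copies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker, workdir, w, rounds, size, copy)
                   for w in range(workers)]
        return [path for f in futures for path in f.result()]
```

Something like `stress("/mnt/tank/scratch")` (a hypothetical dataset path) returns an empty list when every copy matches. An empty result does not prove a system is unaffected, since the issue is reported to be rare.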
I won't be available to follow up for most of today, but we need a separate thread where users can post their test results and follow the ongoing bug report(s) on the OpenZFS GitHub. The announcement thread for the release of TrueNAS Core 13.0-U6 will become too cluttered, so I started this new thread.
Disclaimer: I am but a simple person who wants 100% assurance that my data will not silently corrupt with ZFS, even under "rare" circumstances with unlikely I/O operations. Some of the stuff being investigated in the bug report by other users and developers is beyond my tiny brain's understanding.
Paging users who were in the "release announcement" thread:
@Ericloewe
@morganL
@Davvo
@Gcon
@Juan Manuel Palacios
@Etorix
EDIT: Information might rapidly change as more is discovered about this. Rather than continually edit this original post, it's better to read any updates below.
EDIT November 27, 2023: Tickets on iXsystems' TrueNAS Jira tracker, posted on page #7 by @Kris Moore:
We've been following the progress over the holiday on the OpenZFS side, looks like a fix is being tested and we'll have a proper OpenZFS fix version here soon(ish). We're going to be reviewing this week and finalizing our update plans, but you can be assured we'll have fixes pushed out as soon as is reasonably safe to do so. If you want to monitor the TrueNAS tickets for reference, here they are:
Tickets for SCALE and CORE (ixsystems.atlassian.net):
- [NAS-125358] - iXsystems TrueNAS Jira
- [NAS-125356] - iXsystems TrueNAS Jira