Updates on 12.0-U2, SCALE 21.02 and Docs Efforts

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
TrueNAS Community,

A quick update today! On the TrueNAS 12.0-U2 front, we're currently looking at a release next week (on or around 2/2). It'll address many issues, including some performance regressions specific to Intel and Chelsio networking. Additionally, over 120 other issues have been identified and resolved in this release, and we greatly appreciate all the help from our community in tracking down these issues for resolution.

On the TrueNAS SCALE front, we're getting close to tagging and pushing our next ALPHA, 21.02. This version includes an update to Linux kernel 5.10, as well as some major work on the new Apps UIs and over 170 other fixes and improvements. Additionally support for GPU passthrough to containers has been enabled and exposed for Nvidia and Intel QuickSync users, which should make a lot of Plex users quite happy. Keep an eye out for this to land later in February.

Lastly, we've made some fantastic progress on porting the remaining content from the old FreeNAS Users Guide into the new Hugo-driven docs site. As of now, all remaining content should be available in the new docs site, but we're not done yet. Over the coming weeks we'll be giving the site a major navigation and organizational overhaul, which will give readers a much more streamlined experience when browsing the documentation. Additionally, support for a single-file / offline document view will be making an appearance as well.

February is shaping up to be a big month for TrueNAS users. As always we appreciate your feedback!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
A double up-vote for the open and clear communication from iX these days. It makes it much easier to help the community when we can refer to iX activities planned and upcoming with timelines (even if still flexible).

Especially looking forward to the SMB fixes in TrueNAS Core 12.0-U2.
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
open and clear communication from iX these days
I feel the exact opposite.
iXsystems introduced a destructive bug into their ZFS implementation by shipping their own patch ahead of upstream (sidestepping OpenZFS validation).
Furthermore, this was only detected on the application side, with the corruption happening deeper at the ZFS level.
Now, does that mean every pool that has been touched by 12.0(-U1) is at risk of undetectable (meta-)data corruption?
I think it would be kind of important to be open and honest about that.
And if you have good news... share that - don't just patch over it.
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
I feel the exact opposite.
iXsystems introduced a destructive bug into their ZFS implementation by shipping their own patch ahead of upstream (sidestepping OpenZFS validation).
Furthermore, this was only detected on the application side, with the corruption happening deeper at the ZFS level.
Now, does that mean every pool that has been touched by 12.0(-U1) is at risk of undetectable (meta-)data corruption?
I think it would be kind of important to be open and honest about that.
And if you have good news... share that - don't just patch over it.

Just to clarify here, the bug introduced had been run through the OpenZFS test suite, and passed. As a result of us discovering this failure mode, we're now also investing effort in beefing up the OpenZFS test suite to better catch this kind of failure. That should be upstreamed at some point too, but like everything it takes time. This failure wasn't easy for us to reproduce, and we very much appreciate the assist from the community members who did see the bug in their environment and allowed us access to confirm the problem and resolve it.

To further clarify for new readers, we haven't seen any instances of this corruption touching normal file-based workloads. It tended to only surface in virtualization environments, and you'd see it when your VM filesystem started throwing errors during a scandisk/fsck or similar.
 
Joined
Jan 26, 2021
Messages
1
Just to clarify here, the bug introduced had been run through the OpenZFS test suite, and passed. As a result of us discovering this failure mode, we're now also investing effort in beefing up the OpenZFS test suite to better catch this kind of failure. That should be upstreamed at some point too, but like everything it takes time. This failure wasn't easy for us to reproduce, and we very much appreciate the assist from the community members who did see the bug in their environment and allowed us access to confirm the problem and resolve it.

To further clarify for new readers, we haven't seen any instances of this corruption touching normal file-based workloads. It tended to only surface in virtualization environments, and you'd see it when your VM filesystem started throwing errors during a scandisk/fsck or similar.

What indy said was correct, though. Whether or not the applied patch passed the test suite, it had not yet been merged into the OpenZFS mainline.

I would just like to add to the voices asking for more transparency and communication about the aftermath of the async issue. Any information you can share about the root cause of the bug and its proximate triggers and effects would be helpful. Just because we've only seen instances of corruption on VMs doesn't mean that other files haven't been affected, as well. You would expect to see VM filesystem corruption, as it manifests quickly and catastrophically.

Is it the case that ZFS itself corrupted the data on disk? Or did it only return corrupted values which were then written back by the hypervisor? Have people who have experienced this corruption been able to roll back with snapshots? Or are their zvols/datasets/pools just permanently destroyed?

We haven't seen so much as a mea culpa, yet, much less a complete post-mortem.
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
Much of this was already discussed in the Jira ticket, but I'm happy to recap here for new readers.

Specifically, this issue was a regression introduced in Async CoW for 12. The fault condition was very rare and was the result of a bad race condition, where a write would be issued, followed by an *immediate* read of the same block. In some environments, the ARC write would still be partially underway when the read would issue and return the partially written block. In our investigation, this is where we saw most client VMs fault, since they would detect the wrong data being returned and report it as local filesystem/disk corruption. The risk for corruption is if the client side took that incorrect read and then wrote it *again* back to disk. The first set of occurrences were extremely rare, even for virtualization; the write-back would be even more so. This is why it slipped by us for so long. Old data / old snapshots would *not* have been impacted. We'd been running this for more than a year internally, including in our own virtualization environments, and it still took us a long while to come up with a way to reproduce it locally.
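
To make that failure mode a bit more concrete, here's a minimal, purely illustrative sketch of the pattern in plain C with pthreads: a reader issued immediately after a write can observe a partially updated block if the two aren't synchronized. This is not the OpenZFS code; the block size, thread structure, and memcpy-based "write" are assumptions made only for illustration.

```c
/* Illustrative sketch only (not ZFS code): an unsynchronized reader
 * racing an in-progress block write can observe a torn block. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

static char block[BLOCK_SIZE];     /* shared "on-disk" block            */
static char new_data[BLOCK_SIZE];  /* data being written                */

static void *writer(void *arg)
{
    (void)arg;
    /* The copy is not atomic: a concurrent reader can see a mix of
     * old and new bytes while this memcpy is still in progress.        */
    memcpy(block, new_data, BLOCK_SIZE);
    return NULL;
}

static void *reader(void *arg)
{
    char snapshot[BLOCK_SIZE];
    (void)arg;
    /* Issued "immediately" after the write: without a lock or a
     * completion barrier this may capture a torn block.  If the caller
     * then writes `snapshot` back out, the torn data becomes persistent
     * -- the write-back risk described above.                          */
    memcpy(snapshot, block, BLOCK_SIZE);
    printf("first byte %d, last byte %d\n",
           snapshot[0], snapshot[BLOCK_SIZE - 1]);
    return NULL;
}

int main(void)
{
    pthread_t w, r;

    memset(block, 0, sizeof(block));
    memset(new_data, 1, sizeof(new_data));

    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);   /* read races the write   */

    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```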

That said, for U1.1 we disabled Async CoW out of an overabundance of caution, and in the meantime we have been carefully reviewing the code and building out some tests to try and detect this kind of race condition. We've since fixed the underlying problem, but we're not going to rush to re-enable it for U2; instead we'll wait to bring it back in U3 or some later update, once we're confident there won't be any similar issues, no matter how rare they may be.
 

seanm

Guru
Joined
Jun 11, 2018
Messages
570
Just to clarify here, the bug introduced had been run through the OpenZFS test suite, and passed. As a result of us discovering this failure mode, we're now also investing effort in beefing up the OpenZFS test suite to better catch this kind of failure.

If you're not already, definitely give thread sanitizer a try with your test suite. It's invaluable for finding race conditions.
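
For anyone curious what that looks like in practice, here's a small, self-contained example (not TrueNAS or OpenZFS code; the counter, loop count, and file name are made up for illustration) that ThreadSanitizer flags when the program is built with -fsanitize=thread:

```c
/* Hypothetical demo: build with
 *   gcc -fsanitize=thread -g -O1 -pthread race_demo.c -o race_demo
 * Running ./race_demo prints a "WARNING: ThreadSanitizer: data race"
 * report identifying both conflicting accesses. */
#include <pthread.h>
#include <stdio.h>

static long counter;   /* shared, intentionally unsynchronized */

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        counter++;     /* data race: no lock, no atomics        */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("counter = %ld\n", counter);
    return 0;
}
```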
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
If you're not already, definitely give thread sanitizer a try with your test suite. It's invaluable for finding race conditions.

Thank you! I'm not 100% sure if our team has looked at thread sanitizer before, so I'll pass that along :)
 

LouisB

Dabbler
Joined
Oct 15, 2020
Messages
28
A good idea would also be to reorganise the community forums: FreeNAS 11, TrueNAS 12, TrueNAS SCALE, ...
 

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
On the TrueNAS SCALE front, we're getting close to tagging and pushing our next ALPHA, 21.02.
Some background info on that one for those interested:

as well as some major work on the new Apps UIs
The Apps UIs are now actually based on the apps themselves instead of being hardcoded. A lot of time has also been put into finding and fixing small bugs that turned up when creating and testing many different apps. The UI also got support for non-official app catalogs, which can be added using the new CLI. To give an idea of the scope of this change: about once every three days a relevant bug/mishap was discovered and fixed, or a feature was created for the deployment of apps, and many are still on the to-do list.

Additionally support for GPU passthrough to containers has been enabled and exposed for Nvidia and Intel QuickSync users
This includes the ability to select GPUs in the actual install GUI of Apps.
Nvidia was a bit of a pain, because NVIDIA fucked up their drivers. The temporary workaround on the forums was installing the cuda-dev drivers, but those happened to be 2 GB+. It really had nothing to do with iX's efforts.

to land later in February.
It might've been sooner, but a few small things (like getting the community-requested packages actually included and the Official Apps fully solid) turned up that definitely needed to be merged before releasing the alpha.
 