FreeNAS (now TrueNAS) is no longer stable

viniciusferrao · Dec 12, 2020

Hello, I'm writing this with deeply concerns about TrueNAS/FreeNAS and the move that seemed a little bit irresponsible regarding quality and testing.

I have three virtualization pools that relied on FreeNAS for years. One specifically is running since 2013, other 2014 and the newer one since 2016.

On the 2014 pool, we've updated from FreeNAS 11.3-U5 to TrueNAS 12.0-RELEASE, 3 weeks ago, precisely on November 20th. Suddenly we started to discover extreme VM corruption within the XFS filesystem, everything was getting corrupted, including the filesystem superblocks, leading to the inability to recover from xfs_repair.

We blamed everyone: we blamed the Hypervisor, in that case oVirt 4.4, we blamed the fabrics and network, since this pool is using a Cisco Catalyst 2960X as a core, which is not ideal, we blamed the XFS filesystem due to issues on writeback mode, we blamed everyone. We didn't even considered blaming TrueNAS. I've even opened a discussion within the hypervisor mailing lists, but nothing conclusive was found: https://lists.ovirt.org/archives/list/users@ovirt.org/message/2DVB4ULURXWJ5VGHX64FDUZW27F7DY3J

So for next days we blamed mainly the network, since there's some packets dropped on the switch. We concluded that the load on the park, for whatever reasons, have increased and the drops could have causing the issue. A guy on the maililng list recommended falling back to NFSv3 as VM storage instead of NFSv4 due to weird things happening under load. We tried, the situation was better but the issue is still happening.

In this monday, we've had maintenance on the pools from 2013 and 2016. So it's upgrade time. We upgraded both pools to TrueNAS 12.0-RELEASE. 12.0-U1 wasn't available yet.

Everything went fine... but on this Thursday the mail server on the pool from 2013 went down with a disconnection on the iSCSI disk due to I/O errors. Well let's see what happened and the result was: the VM was completely trashed. Corruption on the filesystem, on the operational system, on the service and on the databases that held the mailboxes. Other VM's like a webserver are completely trashed too. So it's a disaster scenario.

Regarding the pool from 2016, I've already detected in place XFS corruption in one VM. For safety measures everything was shut down.

So what happened?

All three pools have different equipments and software, but the only common denominator is the storage system, which was ranging from FreeNAS 11.1 to 11.3. The hypervisors are mixed: oVirt 4.3, oVirt 4.4 and XenServer 7.2; two of them uses iSCSI as the storage backend and one is with NFSv3. Hardware is completely different either, so as you can see. TrueNAS is the only piece that's equal.

For now, I've upgraded everything from 12.0-RELEASE to 12.0-U1. In hope that this will fix this issues.

I don't have any artifact to blame FreeNAS/TrueNAS, the only thing that I've is my word of what happened on those pools. I never had any issue with FreeNAS/TrueNAS for almost 8 years running it, but this move to 12.0 may seem rushed by iXsystems. There's no logs generated within TrueNAS, no errors, no health issues on the zpools, nothing. Which leads me to believe that the software is in an silent unstable state.

I don't have any options right now, I can't downgrade back to 11.3-U5/6/7/etc since the zpool was upgraded on the three systems. But there's one things that dropped the ball regarding to trust with iX releasing the proper stable versions of TrueNAS.

After the upgrade I've noted that 12.0-RELEASE was built with RC (Release Candidate) code:

Code:

Last login: Tue Dec  8 17:24:17 2020
FreeBSD 12.2-RC3 7c4ec6ff02c(HEAD) TRUENAS 

    TrueNAS (c) 2009-2020, iXsystems, Inc.
    All rights reserved.
    TrueNAS code is released under the modified BSD license with some
    files copyrighted by (c) iXsystems, Inc.

    For more information, documentation, help or support, go here:
    http://truenas.com

FreeBSD freenas.win.versatushpc.com.br 12.2-RC3 FreeBSD 12.2-RC3 7c4ec6ff02c(HEAD) TRUENAS  amd64

OpenZFS 2.0 wasn't even released yet, leading to confusion. When 12.0-RELEASE was announced I understood that OpenZFS 2.0 was released together, but this seems not to be the case since the announcement was done two day ago, on December 10th: https://www.ixsystems.com/blog/openzfs-2-on-truenas

What we got running on 12.0-RELEASE?

Code:

root@freenas:~ # pkg info | grep -i zfs
beadm-1.4                      Solaris-like utility to manage Boot Environments on ZFS
iohyve-0.7.9                   bhyve manager utilizing ZFS and other FreeBSD tools
openzfs-2020100200             OpenZFS userland for FreeBSD
openzfs-kmod-2020100200        OpenZFS kernel module for FreeBSD
py38-libzfs-1.0.202008212020   Python libzfs bindings
py38-zettarepl-0.1_24          Cross-platform ZFS replication solution

OpenZFS snapshot from October 2nd. This is not STABLE at all...

In 12.0-U1 we got the proper released OpenZFS version, and a non RC FreeBSD 12 system. As we would expect from a RELEASE release.

Code:

FreeBSD freenas.win.versatushpc.com.br 12.2-RELEASE-p2 FreeBSD 12.2-RELEASE-p2 663e6b09467(HEAD) TRUENAS  amd64

root@freenas:~ # pkg info | grep -i zfs
beadm-1.4                      Solaris-like utility to manage Boot Environments on ZFS
iohyve-0.7.9                   bhyve manager utilizing ZFS and other FreeBSD tools
openzfs-2020120100             OpenZFS userland for FreeBSD
openzfs-kmod-2020120100        OpenZFS kernel module for FreeBSD
py38-libzfs-1.0.202011201432   Python libzfs bindings
py38-zettarepl-0.1_27          Cross-platform ZFS replication solution

Yeah, so... given the evidence I cannot conclude anything different from: 12.0-RELEASE is not STABLE. It should not be marketed as STABLE in first place. Even upgrade to 12.0-U1 still marks 12.0-U1 as development branch and should not be used in production: https://jira.ixsystems.com/browse/NAS-108580; yes it may seems to be a cosmetic bug, but for paying customers TrueNAS 12 isn't available yet. So all this TrueNAS Core thing leads to extreme confusion. There is cleary two separate branches from the OpenSource release and the one that iX ships, which is fine, but this should be explained better.

For now I don't even know if 12.0-U1 would solve the reported issues, and if 12.0-U1 will be considered stable. Because it's not.

Regarding the original issue, I'm pretty much confident that the issues were consequence of running 12.0-RELEASE. People can blame me for "upgrading it too early" or "you should have paid support since your environment is critical", or other nonsenses like: "you probably don't know how to build proper ZFS systems". But the reality is that none of them applies to the situation.

I know that iX is not responsible for this, this is FOSS software and delivered "as is"; this is just as an alert to keep running FreeNAS 11.3-U5/6/7/etc until things get really stable on the 12.0 branch. Keep an eye with the paying customers, look when they will receive the updates, I've read somewhere that this release will be on December 22th. We hope this will be stable, so people could have a proper Christmas and a good new year.

Thanks for listening.

PS: If there's any artifact that I can generate to help further investigate I'm totally willing to do it, but I don't know what I could provide to help it out. And now all the three pools were upgraded to 12.0-U1.

Kris Moore · Dec 12, 2020

Thanks for posting to let us know you're experiencing some issues here. So far we've seen 30k+ systems upgrade to 12.0-RELEASE and 12.0-U1, and are not seeing any major influx of reports of data corruption like you are mentioning. We also use it in various places as virtualization backends as well, but without any of these difficulties that you are describing.

That said, there's not really enough evidence here to rule it out or confirm it either way. Obviously we take reports like this *very* seriously, and would appreciate it if you would open a ticket on jira.ixsystems.com and attach debugs from the systems in question. At a minimum we'd like to review your configuration and see if there's anything amiss that we should look deeper into.

Just to check as well - I assume you're using "Sync=Always" on your datasets / zvols which host those VMs?

viniciusferrao · Dec 12, 2020

Kris Moore said:
Thanks for posting to let us know you're experiencing some issues here. So far we've seen 30k+ systems upgrade to 12.0-RELEASE and 12.0-U1, and are not seeing any major influx of reports of data corruption like you are mentioning. We also use it in various places as virtualization backends as well, but without any of these difficulties that you are describing.

That said, there's not really enough evidence here to rule it out or confirm it either way. Obviously we take reports like this *very* seriously, and would appreciate it if you would open a ticket on jira.ixsystems.com and attach debugs from the systems in question. At a minimum we'd like to review your configuration and see if there's anything amiss that we should look deeper into.

Just to check as well - I assume you're using "Sync=Always" on your datasets / zvols which host those VMs?

Hi Kris, I'll open the issue there. You guys need the output of "freenas-debug -A" right?

Regarding sync=always; the 2013 pool is sync=standard, but the pool from 2014 that was the first one impacted have sync=always on the VM datasets only, everything else is sync=standard.

But keep in mind that the pool from 2013 never had this catastrophic issue in seven years of production. This pool is really though to be honest, every single original disk from this machine has been replaced since they have already faulted.

blanchet · Dec 12, 2020

I have also encounter some issues with TrueNAS-12.0, VMware and NFSv3.

My TrueNAS server hosts a NFSv3 datastore which is use only for vSphere HA datastore heartbeating. There is no virtual machine on this datastore, so there is not a lot of I/O.

When I run TrueNAS-12.0 or TrueNAS-12.0u1, VeeamOne sends me regularly emails because it detects datastore disconnection.
If I revert back to FreeNAS-11.3u5, the issue disappears. (fortunately I never upgrade the zpool)

My server is a 4U FreeNAS certified server from iXsystems.

Kris Moore · Dec 12, 2020

viniciusferrao said:
Hi Kris, I'll open the issue there. You guys need the output of "freenas-debug -A" right?

Regarding sync=always; the 2013 pool is sync=standard, but the pool from 2014 that was the first one impacted have sync=always on the VM datasets only, everything else is sync=standard.

But keep in mind that the pool from 2013 never had this catastrophic issue in seven years of production. This pool is really though to be honest, every single original disk from this machine has been replaced since they have already faulted.

Easy way is to do it via the UI, using system -> advanced -> save debug.

As for sync, anytime you're hosting vms we'd recommend sync=always be set. Running in standard mode almost guarantees issues with vm filesystem inconsistency in the event of a hard reboot / failure of any type.

Kris Moore · Dec 12, 2020

blanchet said:
I have also encounter some issues with TrueNAS-12.0, VMware and NFSv3.

My TrueNAS server hosts a NFSv3 datastore which is use only for vSphere HA datastore heartbeating. There is no virtual machine on this datastore, so there is not a lot of I/O.

When I run TrueNAS-12.0 or TrueNAS-12.0u1, VeeamOne sends me regularly emails because it detects datastore disconnection.
If I revert back to FreeNAS-11.3u5, the issue disappears. (fortunately I never upgrade the zpool)

My server is a 4U FreeNAS certified server from iXsystems.

At first glance that doesn't sound related to the original issue here, but let's do the same routine please, make a ticket, attach debug and send one of those veeam emails over for us to investigate.

viniciusferrao · Dec 12, 2020

Kris Moore said:
Easy way is to do it via the UI, using system -> advanced -> save debug.

As for sync, anytime you're hosting vms we'd recommend sync=always be set. Running in standard mode almost guarantees issues with vm filesystem inconsistency in the event of a hard reboot / failure of any type.

Alright, I'll get it from the web UI.

As for sync, yes, I'm aware of the consequences of an unsafely shutdown. It's like a hardware controller without battery and writeback enabled. The DCs in question have proper nobreaks with a shutdown scheme. And I'm aware about the sync=always recommendations that people like cyberjock and jgrego already says since the beginning of this forum.

In that case there was no power outage, nor forced shutdowns. The corruption started three days after the upgrade to 12.0-RELEASE.

Thank you!

viniciusferrao · Dec 12, 2020

Ticket opened at: https://jira.ixsystems.com/projects/NAS/issues/NAS-108627

I've attached the debug info from the 2013 and 2014 systems.

viniciusferrao · Dec 13, 2020

It seems that the issue still persists on 12.0-U1. Again it may not be TrueNAS fault but with the given history it should be.

bed · Dec 13, 2020

I also opened a bug report which was closed due to a similar one NAS-108559. In this case all the effected extents are shared via a chelsio T320 network interface card. If you have something else you can swap out for that, your issues may go away.

viniciusferrao · Dec 13, 2020

bed said:
I also opened a bug report which was closed due to a similar one NAS-108559. In this case all the effected extents are shared via a chelsio T320 network interface card. If you have something else you can swap out for that, your issues may go away.

Thanks bed, but I'm running with Intel I350 Gigabit in the 2013 system and Intel X520 on the other two. Hardware-wise we've been following what FreeNAS/TrueNAS guys mandate. And look, those systems were stable for years! Everything started happening after 12.0-RELEASE.

ECC · Jan 12, 2021

Unfortunately, my thread was closed, so I post my answer here:

To get this right: Since 12.12.2020 it is known that TrueNAS 12.0 & 12.0 U1 is NOT stable, nevertheless it is till available for download & marked as stable. For me this is a complete betrayal of confidence. Maybe free users won't matter to ixsystems, but I hope they care for their paying customers...Wow that is intense.

Patrick M. Hausen · Jan 12, 2021

It is rock solid for me including NFS and iSCSI ... haven't had a single case of data corruption. I don't doubt that you experience the problems you describe but "TrueNAS is NOT stable" is simply not true. There must be something specific to your setup.

Kris Moore · Jan 12, 2021

ECC said:
Unfortunately, my thread was closed, so I post my answer here:

To get this right: Since 12.12.2020 it is known that TrueNAS 12.0 & 12.0 U1 is NOT stable, nevertheless it is till available for download & marked as stable. For me this is a complete betrayal of confidence. Maybe free users won't matter to ixsystems, but I hope they care for their paying customers...Wow that is intense.

To clarify out 40k systems running 12.0, we've only seen this happen about half a dozen times. At the moment it's our highest priority issue, but due to how rare it is, it's making it somewhat difficult to track down and fix. If you have a reproduction case, please update the ticket with your debug file.

ude6 · Jan 12, 2021

I also had problems on ESXI with VMs on NFS IF they had snapshots. Without (ESXI) snapshots I do currently have no issue.

Kris Moore · Jan 12, 2021

ude6 said:
I also had problems on ESXI with VMs on NFS IF they had snapshots. Without (ESXI) snapshots I do currently have no issue.

We think we've narrowed it down to that particular use-case as well. Investigation is ongoing, but we're making some progress. If you would please add to the original ticket with your details / debug, that would be helpful. What's odd is that it doesn't appear to be 100% reproducible, so it may be a combination of other factors related to setup or hardware. The worst kind of bug for tracking down and fixing, of course.

ude6 · Jan 12, 2021

Hi,
I currently have no system available where I can reproduce this. I have deleted all Sanpshots (in ESXI) and the issue went away.
Have to keep the system running, sorry for not beeing more of a help.

Andreas

Kris Moore · Jan 12, 2021

No worries! Even that bit of data is helpful. Appreciate it.

HoneyBadger · Jan 12, 2021

Kris Moore said:
We think we've narrowed it down to that particular use-case as well.

I'm willing to sacrifice some VMs on the altar of stability. Other than "run VMs on snapshots" are there any other suggestions on how to try forcing the issue (eg: prevalence in iSCSI vs NFS, large amounts of delete/UNMAPs, etc)

Kris Moore · Jan 12, 2021

So far it does seem running VMs using XFS as the filesystem seems to help surface the problem. Maybe start there, see if you can surface any sort of corruption thats visible to the VM client side? Jira ticket does have a lot of information you can review as well.

Important Announcement for the TrueNAS Community.

FreeNAS (now TrueNAS) is no longer stable

Contributor

SVP of Engineering

Contributor

Guru

SVP of Engineering

SVP of Engineering

Contributor

Contributor

Contributor

Dabbler

Contributor

Explorer

Hall of Famer

SVP of Engineering

Dabbler

SVP of Engineering

Dabbler

SVP of Engineering

actually does care

SVP of Engineering

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "FreeNAS (now TrueNAS) is no longer stable"

Similar threads