FreeNAS (now TrueNAS) is no longer stable

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
Hello, I'm writing this with deep concern about TrueNAS/FreeNAS and a move that seems a bit irresponsible regarding quality and testing.

I have three virtualization pools that have relied on FreeNAS for years. One has been running since 2013, another since 2014, and the newest since 2016.

On the 2014 pool, we updated from FreeNAS 11.3-U5 to TrueNAS 12.0-RELEASE three weeks ago, precisely on November 20th. Suddenly we started to discover severe VM corruption within the XFS filesystems: everything was getting corrupted, including the filesystem superblocks, leaving xfs_repair unable to recover them.
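For anyone hitting the same thing, the check we were running inside the guests amounts to something like this (the device path is just an example), with the filesystem unmounted or the VM booted from rescue media:

Code:
# inside the affected VM, filesystem unmounted
xfs_repair -n /dev/vdb1    # no-modify mode: report damage, change nothing
xfs_repair /dev/vdb1       # actual repair attempt
xfs_repair -L /dev/vdb1    # last resort: zero the corrupt log first (may lose data)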

We blamed everything: the hypervisor (oVirt 4.4 in this case), the fabric and the network (since this pool uses a Cisco Catalyst 2960X as its core, which is not ideal), and the XFS filesystem (due to issues with writeback mode). We didn't even consider blaming TrueNAS. I even opened a discussion on the hypervisor mailing list, but nothing conclusive was found: https://lists.ovirt.org/archives/list/users@ovirt.org/message/2DVB4ULURXWJ5VGHX64FDUZW27F7DY3J

For the next few days we mainly blamed the network, since the switch was dropping some packets. We concluded that the load on the environment had increased for whatever reason and that the drops could have been causing the issue. Someone on the mailing list recommended falling back to NFSv3 for VM storage instead of NFSv4, due to weird things happening under load. We tried it; the situation improved, but the issue kept happening.

This Monday we had maintenance scheduled on the pools from 2013 and 2016, so it was upgrade time. We upgraded both pools to TrueNAS 12.0-RELEASE; 12.0-U1 wasn't available yet.

Everything went fine... but on Thursday the mail server on the 2013 pool went down, with the iSCSI disk disconnecting due to I/O errors. When we looked into it, the VM was completely trashed: corruption in the filesystem, the operating system, the service, and the databases that held the mailboxes. Other VMs, like a web server, are completely trashed too. It's a disaster scenario.

On the pool from 2016, I've already detected in-place XFS corruption in one VM. As a safety measure everything was shut down.

So what happened?

All three pools run different hardware and software; the only common denominator is the storage system, which previously ranged from FreeNAS 11.1 to 11.3. The hypervisors are mixed: oVirt 4.3, oVirt 4.4 and XenServer 7.2; two of them use iSCSI as the storage backend and one uses NFSv3. The hardware is completely different as well. TrueNAS is the only piece they all share.

For now, I've upgraded everything from 12.0-RELEASE to 12.0-U1, in the hope that this will fix these issues.

I don't have any hard evidence to blame FreeNAS/TrueNAS; the only thing I have is my word about what happened on those pools. I never had any issue with FreeNAS/TrueNAS in almost 8 years of running it, but this move to 12.0 looks rushed by iXsystems. There are no logs generated within TrueNAS, no errors, no health issues on the zpools, nothing. Which leads me to believe the software is in a silently unstable state.

I don't have many options right now; I can't downgrade back to 11.3-U5/6/7/etc. since the zpool was upgraded on all three systems. But there's one thing that really shook my trust in iX releasing properly stable versions of TrueNAS.
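For anyone wondering whether their own pools are still in a downgrade-friendly state, the feature flags tell the story; the pool name below is just an example:

Code:
root@freenas:~ # zpool upgrade
# with no arguments, lists pools that still have disabled feature flags
root@freenas:~ # zpool get all tank | grep feature@
# features shown as "active" here were turned on by "zpool upgrade";
# once new features are active, older releases generally can no longer import the pool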

After the upgrade I noticed that 12.0-RELEASE was built on RC (Release Candidate) code:

Code:
Last login: Tue Dec  8 17:24:17 2020
FreeBSD 12.2-RC3 7c4ec6ff02c(HEAD) TRUENAS 

    TrueNAS (c) 2009-2020, iXsystems, Inc.
    All rights reserved.
    TrueNAS code is released under the modified BSD license with some
    files copyrighted by (c) iXsystems, Inc.

    For more information, documentation, help or support, go here:
    http://truenas.com

FreeBSD freenas.win.versatushpc.com.br 12.2-RC3 FreeBSD 12.2-RC3 7c4ec6ff02c(HEAD) TRUENAS  amd64


OpenZFS 2.0 wasn't even released yet, which added to the confusion. When 12.0-RELEASE was announced I understood that OpenZFS 2.0 had been released alongside it, but that seems not to be the case, since the OpenZFS 2.0 announcement only came two days ago, on December 10th: https://www.ixsystems.com/blog/openzfs-2-on-truenas

What were we actually running on 12.0-RELEASE?
Code:
root@freenas:~ # pkg info | grep -i zfs
beadm-1.4                      Solaris-like utility to manage Boot Environments on ZFS
iohyve-0.7.9                   bhyve manager utilizing ZFS and other FreeBSD tools
openzfs-2020100200             OpenZFS userland for FreeBSD
openzfs-kmod-2020100200        OpenZFS kernel module for FreeBSD
py38-libzfs-1.0.202008212020   Python libzfs bindings
py38-zettarepl-0.1_24          Cross-platform ZFS replication solution


An OpenZFS snapshot from October 2nd. This is not STABLE at all...
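If I'm not mistaken, the running kernel module and userland can also be cross-checked directly, which is quicker than grepping pkg:

Code:
root@freenas:~ # zfs version
# prints the OpenZFS userland version and the loaded zfs-kmod version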

In 12.0-U1 we got the properly released OpenZFS version and a non-RC FreeBSD 12 base, as we would expect from something labeled RELEASE.
Code:
FreeBSD freenas.win.versatushpc.com.br 12.2-RELEASE-p2 FreeBSD 12.2-RELEASE-p2 663e6b09467(HEAD) TRUENAS  amd64

root@freenas:~ # pkg info | grep -i zfs
beadm-1.4                      Solaris-like utility to manage Boot Environments on ZFS
iohyve-0.7.9                   bhyve manager utilizing ZFS and other FreeBSD tools
openzfs-2020120100             OpenZFS userland for FreeBSD
openzfs-kmod-2020120100        OpenZFS kernel module for FreeBSD
py38-libzfs-1.0.202011201432   Python libzfs bindings
py38-zettarepl-0.1_27          Cross-platform ZFS replication solution


Yeah, so... given the evidence I cannot conclude anything other than: 12.0-RELEASE is not STABLE. It should not have been marketed as STABLE in the first place. Even after upgrading, 12.0-U1 is still flagged as a development branch that should not be used in production: https://jira.ixsystems.com/browse/NAS-108580; yes, it may be just a cosmetic bug, but for paying customers TrueNAS 12 isn't even available yet. So this whole TrueNAS CORE thing leads to extreme confusion. There are clearly two separate branches, the open-source release and the one that iX ships to customers, which is fine, but this should be explained better.

For now I don't even know whether 12.0-U1 will solve the reported issues, or whether 12.0-U1 can be considered stable. Because right now it isn't.

Regarding the original issue, I'm fairly confident the problems were a consequence of running 12.0-RELEASE. People can blame me for "upgrading too early" or say "you should have paid for support since your environment is critical", or other nonsense like "you probably don't know how to build proper ZFS systems". But the reality is that none of that applies to this situation.

I know that iX is not liable for this; this is FOSS software delivered "as is". This is just an alert to keep running FreeNAS 11.3-U5/6/7/etc. until things get really stable on the 12.0 branch. Keep an eye on the paying customers and watch when they receive the updates; I've read somewhere that their release will be on December 22nd. We hope it will be stable, so people can have a proper Christmas and a good new year.

Thanks for listening.

PS: If there's any artifact I can generate to help investigate further, I'm totally willing to do it, but I don't know what I could provide. And by now all three pools have been upgraded to 12.0-U1.
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,448
Thanks for posting to let us know you're experiencing some issues here. So far we've seen 30k+ systems upgrade to 12.0-RELEASE and 12.0-U1, and are not seeing any major influx of reports of data corruption like you are mentioning. We also use it in various places as virtualization backends as well, but without any of these difficulties that you are describing.

That said, there's not really enough evidence here to rule it out or confirm it either way. Obviously we take reports like this *very* seriously, and would appreciate it if you would open a ticket on jira.ixsystems.com and attach debugs from the systems in question. At a minimum we'd like to review your configuration and see if there's anything amiss that we should look deeper into.

Just to check as well - I assume you're using "Sync=Always" on your datasets / zvols which host those VMs?
 

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
Thanks for posting to let us know you're experiencing some issues here. So far we've seen 30k+ systems upgrade to 12.0-RELEASE and 12.0-U1, and are not seeing any major influx of reports of data corruption like you are mentioning. We also use it in various places as virtualization backends as well, but without any of these difficulties that you are describing.

That said, there's not really enough evidence here to rule it out or confirm it either way. Obviously we take reports like this *very* seriously, and would appreciate it if you would open a ticket on jira.ixsystems.com and attach debugs from the systems in question. At a minimum we'd like to review your configuration and see if there's anything amiss that we should look deeper into.

Just to check as well - I assume you're using "Sync=Always" on your datasets / zvols which host those VMs?

Hi Kris, I'll open the issue there. You guys need the output of "freenas-debug -A" right?

Regarding sync=always: the 2013 pool is sync=standard, but the 2014 pool, which was the first one impacted, has sync=always on the VM datasets only; everything else is sync=standard.

But keep in mind that the 2013 pool never had a catastrophic issue like this in seven years of production. This pool is really tough, to be honest: every single original disk in this machine has already faulted and been replaced.
 

blanchet

Guru
Joined
Apr 17, 2018
Messages
511
I have also encountered some issues with TrueNAS-12.0, VMware and NFSv3.

My TrueNAS server hosts an NFSv3 datastore which is used only for vSphere HA datastore heartbeating. There are no virtual machines on this datastore, so there is not a lot of I/O.

When I run TrueNAS-12.0 or TrueNAS-12.0-U1, Veeam ONE regularly sends me emails because it detects datastore disconnections.
If I revert back to FreeNAS-11.3-U5, the issue disappears. (Fortunately I never upgraded the zpool.)

My server is a 4U FreeNAS certified server from iXsystems.
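For what it's worth, as long as the pool itself was never upgraded, going back is just a matter of activating the previous boot environment, either from the UI or from the shell; the boot environment name below is just an example:

Code:
root@freenas:~ # beadm list
# shows the available boot environments, e.g. the old 11.3-U5 one
root@freenas:~ # beadm activate 11.3-U5
root@freenas:~ # reboot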
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,448
Hi Kris, I'll open the issue there. You guys need the output of "freenas-debug -A" right?

Regarding sync=always: the 2013 pool is sync=standard, but the 2014 pool, which was the first one impacted, has sync=always on the VM datasets only; everything else is sync=standard.

But keep in mind that the 2013 pool never had a catastrophic issue like this in seven years of production. This pool is really tough, to be honest: every single original disk in this machine has already faulted and been replaced.

The easy way is to do it via the UI, using System -> Advanced -> Save Debug.

As for sync, anytime you're hosting VMs we'd recommend sync=always be set. Running in standard mode almost guarantees issues with VM filesystem inconsistency in the event of a hard reboot / failure of any type.
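If it helps, both can be checked and changed from the shell too; the pool/dataset names below are examples:

Code:
root@freenas:~ # zfs get -r sync tank
# shows the sync setting on every dataset and zvol under the pool "tank"
root@freenas:~ # zfs set sync=always tank/vmstore
# force synchronous writes on the dataset (or zvol) backing the VMs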
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,448
I have also encountered some issues with TrueNAS-12.0, VMware and NFSv3.

My TrueNAS server hosts an NFSv3 datastore which is used only for vSphere HA datastore heartbeating. There are no virtual machines on this datastore, so there is not a lot of I/O.

When I run TrueNAS-12.0 or TrueNAS-12.0-U1, Veeam ONE regularly sends me emails because it detects datastore disconnections.
If I revert back to FreeNAS-11.3-U5, the issue disappears. (Fortunately I never upgraded the zpool.)

My server is a 4U FreeNAS certified server from iXsystems.

At first glance that doesn't sound related to the original issue here, but let's do the same routine please: make a ticket, attach a debug, and send one of those Veeam emails over for us to investigate.
 

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
The easy way is to do it via the UI, using System -> Advanced -> Save Debug.

As for sync, anytime you're hosting VMs we'd recommend sync=always be set. Running in standard mode almost guarantees issues with VM filesystem inconsistency in the event of a hard reboot / failure of any type.

Alright, I'll get it from the web UI.

As for sync, yes, I'm aware of the consequences of an unsafe shutdown. It's like a hardware RAID controller with writeback enabled but no battery. The DCs in question have proper UPS units with a shutdown scheme. And I'm aware of the sync=always recommendations that people like cyberjock and jgreco have been giving since the beginning of this forum.

In this case there was no power outage and no forced shutdown. The corruption started three days after the upgrade to 12.0-RELEASE.

Thank you!
 

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
It seems the issue still persists on 12.0-U1. Again, it may not be TrueNAS's fault, but given the history it's the most likely suspect.
 

bed

Dabbler
Joined
Jun 17, 2011
Messages
38
I also opened a bug report, which was closed as a duplicate of a similar one, NAS-108559. In my case all the affected extents are shared via a Chelsio T320 network interface card. If you have something else you can swap in for it, your issues may go away.
 

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
I also opened a bug report, which was closed as a duplicate of a similar one, NAS-108559. In my case all the affected extents are shared via a Chelsio T320 network interface card. If you have something else you can swap in for it, your issues may go away.

Thanks bed, but I'm running Intel I350 Gigabit NICs in the 2013 system and Intel X520s in the other two. Hardware-wise we've been following what the FreeNAS/TrueNAS guys recommend. And look, those systems were stable for years! Everything started happening after 12.0-RELEASE.
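If anyone wants to compare notes, the NIC models above are easy to confirm from the FreeBSD side; nothing fancy:

Code:
root@freenas:~ # pciconf -lv
# lists PCI devices; the vendor/device strings identify the exact NIC model
root@freenas:~ # ifconfig -a
# igb interfaces are the I350, ix interfaces are the X520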
 

ECC

Explorer
Joined
Nov 8, 2020
Messages
65
Unfortunately, my thread was closed, so I'll post my answer here:

To get this right: since 12.12.2020 it has been known that TrueNAS 12.0 & 12.0-U1 are NOT stable, yet they are still available for download & marked as stable. For me this is a complete breach of trust. Maybe free users don't matter to iXsystems, but I hope they care about their paying customers... Wow, that is intense.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,737
It is rock solid for me, including NFS and iSCSI ... I haven't had a single case of data corruption. I don't doubt that you are experiencing the problems you describe, but "TrueNAS is NOT stable" is simply not true. There must be something specific to your setup.
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,448
Unfortunately, my thread was closed, so I'll post my answer here:

To get this right: since 12.12.2020 it has been known that TrueNAS 12.0 & 12.0-U1 are NOT stable, yet they are still available for download & marked as stable. For me this is a complete breach of trust. Maybe free users don't matter to iXsystems, but I hope they care about their paying customers... Wow, that is intense.

To clarify: out of 40k systems running 12.0, we've only seen this happen about half a dozen times. At the moment it's our highest-priority issue, but its rarity is making it somewhat difficult to track down and fix. If you have a reproduction case, please update the ticket with your debug file.
 

ude6

Dabbler
Joined
Aug 2, 2017
Messages
37
I also had problems on ESXi with VMs on NFS if they had snapshots. Without (ESXi) snapshots I currently have no issues.
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,448
I also had problems on ESXi with VMs on NFS if they had snapshots. Without (ESXi) snapshots I currently have no issues.

We think we've narrowed it down to that particular use-case as well. Investigation is ongoing, but we're making some progress. If you would please add to the original ticket with your details / debug, that would be helpful. What's odd is that it doesn't appear to be 100% reproducible, so it may be a combination of other factors related to setup or hardware. The worst kind of bug for tracking down and fixing, of course.
 

ude6

Dabbler
Joined
Aug 2, 2017
Messages
37
Hi,
I currently have no system available where I can reproduce this. I deleted all snapshots (in ESXi) and the issue went away.
I have to keep the system running, sorry for not being more of a help.
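In case it's useful to someone else, the same cleanup can also be done from the ESXi host shell; the VM ID below is just an example:

Code:
# on the ESXi host
vim-cmd vmsvc/getallvms                 # list VMs with their numeric IDs
vim-cmd vmsvc/snapshot.get 42           # show the snapshot tree for VM ID 42
vim-cmd vmsvc/snapshot.removeall 42     # consolidate and delete all snapshots of that VM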


Andreas
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,448
No worries! Even that bit of data is helpful. Appreciate it.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
We think we've narrowed it down to that particular use-case as well.

I'm willing to sacrifice some VMs on the altar of stability. Other than "run VMs on snapshots", are there any other suggestions on how to try forcing the issue (e.g. prevalence on iSCSI vs. NFS, large amounts of deletes/UNMAPs, etc.)?
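For the delete/UNMAP part, something like this inside a Linux guest on a thin-provisioned disk should generate plenty of churn (paths are examples, and discard/UNMAP has to be passed through by the hypervisor):

Code:
# inside a Linux guest whose virtual disk supports discard
dd if=/dev/urandom of=/var/tmp/churn.bin bs=1M count=10240   # write ~10 GiB of junk
rm /var/tmp/churn.bin                                        # delete it again
fstrim -av                                                   # issue discards on all mounted filesystems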
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,448
So far, running VMs that use XFS as their filesystem seems to help surface the problem. Maybe start there and see if you can surface any corruption that's visible from the VM client side? The Jira ticket has a lot of information you can review as well.
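A simple way to watch for that from the client side (just a sketch, not an official test plan) is to keep re-verifying known checksums inside the guest:

Code:
# inside the XFS guest: create test data and record checksums
mkdir -p /data/canary
for i in $(seq 1 100); do dd if=/dev/urandom of=/data/canary/file$i bs=1M count=64 status=none; done
sha256sum /data/canary/file* > /root/canary.sha256

# later, and after any reboot, re-verify and watch the kernel log for XFS errors
sha256sum -c /root/canary.sha256 | grep -v ': OK$'
dmesg | grep -i xfs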
 