FreeNAS (now TrueNAS) is no longer stable

viniciusferrao · Jan 12, 2021

Patrick M. Hausen said:
It is rock solid for me including NFS and iSCSI ... haven't had a single case of data corruption. I don't doubt that you experience the problems you describe but "TrueNAS is NOT stable" is simply not true. There must be something specific to your setup.

You just ignore the fact that I've three systems with more than recommended hardware running without any issue with the following runtimes: 7 years, 6 years, 4 years. All of them failed with 12.0-RELEASE.

I've noted this behavior is a common thing on our community: just blame everything instead of nailing down the issue. Don't get me wrong, this is not personal, but every time is this. There's the initial scrutiny: "You don't do this, that, whatever" and when there's nothing more to complain: "ok, we may have an issue". I understand that sometimes people with lack of skill may be extremely overwhelming and tiresome, but please read the topic and the bug report first instead of discrediting what I've related.

And as a side node, I've been working with this issue for more than a month now. I do care for my systems in first place and for all those that I've recommended FreeNAS/TrueNAS from more than 10 years of work with storage systems. Second I do care for iXsystems for developing this product and simply giving it back to the community. I'm doing my part trying to help and nail down the issue with the help of iXsystems on the Jira ticket, I'm risking my data for a better objetive: stability on 12.0-RELEASE not only for me, but for everyone: you, the community customers and the paying clients from iXsystems. I could just move my data rollback everything to 11.3-U5 and wait someone to fix this for me, since as today I've managed to replicate the data elsewhere, but I'm still running the three pools in broken state and experimenting with hot fixes made exclusively for one of my systems. Will you waste your time and risk your data for that? It's hard to answer without being in contact with the issue.

adrianwi · Jan 12, 2021

Perhaps if you added "My" to the beginning of your thread title, it would be a little more factual? Both of my systems have been just as stable running 12.0-U1 as they were running 11.3-U5.

Looks to me like it's difficult to reproduce the error, although there is acceptance of a problem and efforts are being made to resolve.

Not sure how your alarmist thread and attitude help this?

Patrick M. Hausen · Jan 12, 2021

We are on the same side. I have been running Alpha and Beta versions of FreeNAS/TrueNAS through all of the last decade, including Corral, all with my personal live data, written bug reports, participated in forum discussions and so on ...

Your situation sucks. No arguing about that. I just challenged the "TrueNAS not stable" claim, because according to iX' statistics for some ten thousands of systems including two that I run, it is. Perfectly stable.

Been there, done that - years ago we bought a rack worth of new Tyan servers and the network driver in the then-current FreeBSD would just freeze after some amount of time/traffic. Without the help of Jack Vogel, Intel employee responsible for the driver, we would have had a ton of very expensive door stops. Luckily it all turned out well and we "only" had a delay of 3 weeks in deployment.

So as a member of the community: thank you for not just rolling back or switching the product altogether and trying to get this debugged and fixed. Most of us went through something like this at some time.

Kind regards,
Patrick

viniciusferrao · Jan 12, 2021

adrianwi said:
Perhaps if you added "My" to the beginning of your thread title, it would be a little more factual? Both of my systems have been just as stable running 12.0-U1 as they were running 11.3-U5.

Looks to me like it's difficult to reproduce the error, although there is acceptance of a problem and efforts are being made to resolve.

Not sure how your alarmist thread and attitude help this?

What a joke having to read this. Have you read the Jira issue? I've mention there at least two bugs that are exactly the same issue. And there's four or more people that replicated the issue after what I've posted. Three systems, all of them with different hardware and different software stacks and I was alarmist.

Not mentioning one more critical bug that breaks LAGG+VLAN on Intel cards that I've found because of this bug rendering *all the systems with this same setup unbootable* due to a race condition and I'm alarmist?

If issuing a warning to the community and carefully telling the story of all my three main systems with details, wasting three weeks of my time checking everything else to carefully not blame anything wrong is being an alarmist, ok. I'm an alarmist.

Patrick M. Hausen · Jan 12, 2021

viniciusferrao said:
Not mentioning one more critical bug that breaks LAGG+VLAN on Intel cards

I am running that. Now. Here. Live. What precisely are you talking about?

viniciusferrao · Jan 12, 2021

Patrick M. Hausen said:
I am running that. Now. Here. Live. What precisely are you talking about?

It's a race condition, that can even be triggered on emulated e1000 NICs.

https://jira.ixsystems.com/projects/NAS/issues/NAS-108810

240609 – iflib: Panic with INVARIANTS: sleeping in an epoch section (12.1-pre-QA) (vlan + lagg involved)

bugs.freebsd.org

Patrick M. Hausen · Jan 12, 2021

viniciusferrao said:
It's a race condition, that can even be triggered on emulated e1000 NICs.

Puzzled why it had not hit me even once ...

HoneyBadger · Jan 12, 2021

@viniciusferrao Since you're here live, quick question - are you noticing a pattern with the corruption eg: is it more prevalent in guests using XFS or with VMs running on top of hypervisor-level snapshots as suggested by Kris Moore?

In the meantime I will see about getting a host running oVirt up so I can test like-for-like.

viniciusferrao · Jan 12, 2021

Patrick M. Hausen said:
Puzzled why it had not hit me even once ...

Try booting the debug kernel, this is mandatory to trigger the issue. I've mentioned on the ticket that I was able to perfectly replicate it on a second system, and there's the info to "de-trigger" the issue, what involves booting back to 11.3-U5 just to put the network interfaces on a state that will successfully boot 12.0-RELEASE and later with the RELEASE kernel.

viniciusferrao · Jan 12, 2021

HoneyBadger said:
@viniciusferrao Since you're here live, quick question - are you noticing a pattern with the corruption eg: is it more prevalent in guests using XFS or with VMs running on top of hypervisor-level snapshots as suggested by Kris Moore?

In the meantime I will see about getting a host running oVirt up so I can test like-for-like.

The first place I've discovered the issue was in RHV 4.4 (Red Hat Virtualization) system using NFSv4.1 as a storage backend. On that system there was just 3 or 4 Windows VM's that I wasn't able to find a thing. But those VM's don't do anything heavy and there was no snapshots in them. The major issue was on RHEL and CentOS VMs, specially the "heavier" ones. We lost our Jira machine that we host our open source HPC software (http://versatushpc.com.br/opencattus/). And we lost other machine that the devs used to run Docker on them. We even blamed Docker, due to it's bugs with XFS and overlayfs. They moved to Podman instead, and in fact Docker wasn't the culprit.

On the other pool, the most affected one, I've lost an entire Exchange Server. It was the mailing server of one University here in Brazil. As an emergency we just moved everything we can to Office365. So, I don't have the extreme workload anymore, but I do have the VM in broken state in a snapshot. I'll eventually boot it to gather corrupted files to feed iXsystem with more data. There was a catastrophic issue, everything was corrupted: the OS, filesystem, mailboxes, databases, everything. Lossy repair was the only solution with mass move to Office365. During the recovery the VM crashed at least 4 times.

On the third pool there was minor corruptions, mainly on XFS filesystems. This one wasn't that affected at all, but was affected.

The pattern here is intensive I/O. That's the pattern. On the first system, even backups were done corrupted. We backup this machine to another FreeNAS system with "zfs and receive". So yes, today I've running, under my control a total of 8 FreeNAS systems, and I'm only reporting the issue on the major ones.

EDIT: One thing that I still don't understand is how superblocks of a filesystem can be entirely corrupted or zeroed if they don't receive writes? And that was what's happening in the XFS filesystems, with data corruption of course. But in the superblocks? Really?

Patrick M. Hausen · Jan 12, 2021

viniciusferrao said:
Try booting the debug kernel

Sorry, but no. Private systems, but still some people other than myself rely on certain services being available. Thanks for the insight.

viniciusferrao · Jan 12, 2021

Patrick M. Hausen said:
Sorry, but no. Private systems, but still some people other than myself rely on certain services being available. Thanks for the insight.

Yeah, I wouldn't mess with this too

ECC · Jan 13, 2021

Kris Moore said:
To clarify out 40k systems running 12.0, we've only seen this happen about half a dozen times. At the moment it's our highest priority issue, but due to how rare it is, it's making it somewhat difficult to track down and fix. If you have a reproduction case, please update the ticket with your debug file.

well, maybe this is a very rare bug. But we are dealing here with data corruption, so at the end it is a little bit like playing with fire and hoping for the best. Who will help me, if I'm the "lucky" 7th person with this bug and lost all my data? Yeah, I'm doing backups, but I don't want to fear for my data integrety every day.

ornias · Jan 13, 2021

ECC said:
well, maybe this is a very rare bug. But we are dealing here with data corruption, so at the end it is a little bit like playing with fire and hoping for the best. Who will help me, if I'm the "lucky" 7th person with this bug and lost all my data? Yeah, I'm doing backups, but I don't want to fear for my data integrety every day.

The things is:
TrueNAS/FreeNAS doesn't do much with your data itself. So if it gets corrupted the chances of it actually being TrueNAS that causes it, are VERY slim at best.

I personally think a report at OpenZFS or the respective NFS/iSCSI projects is more fruitfull

viniciusferrao · Jan 13, 2021

ornias said:
The things is:
TrueNAS/FreeNAS doesn't do much with your data itself. So if it gets corrupted the chances of it actually being TrueNAS that causes it, are VERY slim at best.

I personally think a report at OpenZFS or the respective NFS/iSCSI projects is more fruitfull

Read the Jira ticket first. The culprit may be the ACoW scheme. NFS/iSCSI are only the medium, not the issue.

ornias · Jan 13, 2021

viniciusferrao said:
Read the Jira ticket first. The culprit may be the ACoW scheme. NFS/iSCSI are only the medium, not the issue.

Instead of showing an attitude, you could've actually tried to understand what I just wrote.

I personally think a report at OpenZFS or the respective NFS/iSCSI projects is more fruitfull

What subproject do you think handles the ACoW scheme?

anodos · Jan 13, 2021

ornias said:
Instead of showing an attitude, you could've actually tried to understand what I just wrote.

I personally think a report at OpenZFS or the respective NFS/iSCSI projects is more fruitfull

What subproject do you think handles the ACoW scheme?

NFS/iSCSI in FreeBSD are provided by the kernel. Right now ZFS and kernel developers are engaged in the ticket.

ornias · Jan 13, 2021

anodos said:
NFS/iSCSI in FreeBSD are provided by the kernel. Right now ZFS and kernel developers are engaged in the ticket.

Ahh, bit different from Linux then when it comes to those two things.

viniciusferrao · Jan 13, 2021

ornias said:
Instead of showing an attitude, you could've actually tried to understand what I just wrote.

I personally think a report at OpenZFS or the respective NFS/iSCSI projects is more fruitfull

What subproject do you think handles the ACoW scheme?

Answering your suggestion is showing an attitude? Don’t be contamined by the fallacies of others.

I totally understood what you’ve said, and anodos already answered you. It will not be fruitful at all. TrueNAS is an appliance, you first report issues to the appliance vendor. And them they will see if it’s an upstream issue or not.

It’s just like you buy a car, the tire has an issue and you go after the tire maker instead of the vendor of the car. The tire maker will say that they have nothing to do with it.

Hope this clarifies the misunderstanding.

djjaeger82 · Jan 13, 2021

Not to pile on, but wanted to list my experience. I believe I'm also facing this silent data corruption issue. I run for my home lab an ESXI / Freenas all in one server that has been rock solid for 5 years+....that is until December of 2020. I can't believe I didn't notice / consider Freenas could have been the possible problem, but it appears going back and checking my windows home server 2011 logs the timeline matches up now.

Within the ESXi host, one of my VMs is Windows Home Server 2011 that runs client PC backups in my home. In early/mid December I started to get some client PCs that would fail the backup process. The system would retry the back up the next night and most of the time it would go thru. But as December went on these errors became more and more frequent to the point that I could not even complete a backup on the system with the largest amount of data to back up. I've been spending the last 2 weeks ripping my hair out testing everything I can: scrubbing volumes in freenas, chkdsking everything on the clients and the windows server, memory tests. I even destroyed my client computer backup database and tried starting over assuming it was some bug / problem with WHS2011. Even rebuilding the backup database from scratch resulted in the same problem with multiple different client PCs not able to finish a single backup. I convinced myself it was something with my setup and software/OS configuration, perhaps WHS2011 that is well deprecated at this point and not supported. I even tried new Windows Server 2016 Essentials installs and ran into the exact same problems during client pc backups. Even built a new desktop with a fresh OS install and it also has backup errors.

Upon auditing all of the logs in both Windows Home Server 2011 and Windows Server essentials 2016, the common denominator were file checksum verification failures in the client backup database files that Windows Server uses for these tasks (WHS and WSE split the backup database into multiple 4GB files). Each time the failure seemed to be on a different backup database file, not a common one. I've scrubbed the volume over and over, and never a single error. I've moved the VMDK for the backup folder from a RAIDZ1 pool to a different RAIDZ2 pool and still had the same problem. Tried messing with NFS3 vs NFS4.1, changed record sizes and recopied the VMDK over and over. Forced Sync to Always from Standard, and no improvement. Basically I'm losing my sanity trying to figure this out. Spent days and days thinking I had an ESXi issue potentially, patching, downgrading, upgrading, etc but all failed. I was just about ready to replace the motherboard/CPU and further go down the hardware rabbit hole when I came across this thread...

Needless to say I'm exporting my pool right now and will be reinstalling a fresh FREENAS 11.3 install and reimporting these pools which luckily I did not upgrade. (I don't see 11.xx in my system boot options for some reason any more, I must have purged them after my upgrade, oddly I can see some of my 9.x boots from 5 years ago) I'll report back if my issues disappear immediately once I get things setup again. I'm not sure theres much else I can contribute to help solve this issue but am happy to answer any follow-ups about my setup / configuration specifics. -Dan

System Specifics:

Intel i7-4790 CPU
32GB DDR3 RAM (non-ecc, ya ya i know not ideal but never has been a problem in the past)
2x SAS2308 PCI-e cards for storage passed thru to the FreeNAS VM
2x 480GB Sandisk Ultra II Mirror SSD Volume
4x 8TB WD in RAIDZ1 HDD volume
8x 4TB Seagate in RAIDZ2 HDD volume

ESXi 7.0 Update 1c
FreeNAS 12.0U1
VM where I've seen the issue: WHS2011 SP1, also on WSE2016
NFS mounts for all 3 volumes into ESXi to be able to use the storage in my VMs
No issues observed with my media libraries directly stored within ESXi, but not sure I'd have picked that up yet in any manner...

Important Announcement for the TrueNAS Community.

FreeNAS (now TrueNAS) is no longer stable

Contributor

Guru

Hall of Famer

Contributor

Hall of Famer

Contributor

Hall of Famer

actually does care

Contributor

Contributor

Hall of Famer

Contributor

Explorer

Wizard

Contributor

Wizard

Sambassador

Wizard

Contributor

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "FreeNAS (now TrueNAS) is no longer stable"

Similar threads