TrueNAS 12.0-U1 Quality Update

Kris Moore · Jan 12, 2021

UPDATE: 1/15/2021 - 10:45AM Eastern
----
We have finished validation that this data integrity issue is resolved. A 12.0-U1.1 hotfix release is being pushed *today* which all users will be highly encouraged to upgrade to, even if they are not observing any issues with data corruption.

UPDATE: 1/14/2021 - 10:21AM Eastern
----
We believe we've identified the issue and have a fixed kernel module up for testing on the Jira ticket. Feedback so far is positive that this resolves the issue, and we will be working to issue a 12.0-U1.1 hotfix release soon after a bit more validation time in the field.

--- Original Post ---

TrueNAS Community,

I wanted to post an update today on the general state of TrueNAS 12.0-U1 quality as well as a heads up of some of the issues we are tracking for resolution in TrueNAS 12.0-U2.

So far we’ve seen over 40k+ systems upgrade to 12.0 and the response has been very positive. That has made this one of the fastest adopted TrueNAS (And FreeNAS) releases ever. General quality at the U1 stage is far surpassing what we saw in the 11.3 series and we’re eager for the launch of 12.0-U2 with even more polish and improvements. In the meantime we’ve gotten a few reports about some regressions from 11.3-U5 that we wanted to take a moment to address.

First, we had gotten reports of some performance regressions which were eventually tracked down to some bugs that came in from our FreeBSD 12.2 upstream network drivers for Intel and Chelsio devices specifically. Both of these issues have been investigated and now are resolved in the upcoming 12.0-U2. A special thank-you to everybody who helped us track that down and determine what hardware was impacted.

https://jira.ixsystems.com/browse/NAS-107593

Second, we are also currently tracking a handful of users reporting issues with data integrity in some very specific virtualization environments. TrueNAS ZFS is reporting that there is no corruption on the disks or pool, but inside the VM the local filesystem may report needing to run a filesystem repair (fsck or scandisk), depending upon the guest OS. The issue has been seen on a few hypervisors and with iSCSI and NFS. It may be related to the type of VM & filesystem running as a guest. For more details, please refer to this ticket:

https://jira.ixsystems.com/browse/NAS-108627

As you may already know, we take reports of issues with data integrity very seriously, and you can rest assured that the entire iX engineering team is treating this as an "all hands on deck" situation until we come to a resolution. We have held up publishing the TrueNAS Enterprise 12.0 update train and we will hold up the next update to 12.0-U2 until we can be confident that we have a validated fix in place. That release is scheduled to arrive in early February.

The good news is, this issue seems to be extremely rare from the looks of it. Even more so considering how hard it has been to reproduce it internally. Out of tens of thousands systems running 12.0, we've only seen it on a handful of very specific virtualization workloads and nothing related to traditional SMB/NFS/Plugins scenarios.

The bad news is, due to how rare this issue appears, it makes it that much more difficult to troubleshoot and resolve. If you suspect you've seen something similar, please update the Jira ticket referenced above with your details and attach a debug file (System -> Advanced -> Generate Debug).

If you have seen a similar issue or will be running a production-critical environment, we would recommend staying on or rolling back to 11.3-U5 until we can further diagnose and fix this issue.

I'm going to sticky this post for the time being and update as we have more information to share about the status of the integrity issue.

Thanks for your patience everybody, on behalf of the entire iX team we really do appreciate it.

ddaenen1 · Jan 13, 2021

Well Kris, i have to compliment the team for the seemless transition experience from FreeNAS to TrueNAS. I am only a DIY user and i had some concerns with the upgrade because everything was running well, the way i like it with no hassle, but as many others, "new" tickles my curiousity and it is hard for me to not given into it. In the end, my transition has been excellent and everything is up and running as before. U1 took care of some glitches in the reporting section for me but for the rest, again an uneventful upgrade.

Keep up the good work!

winnielinnie · Jan 15, 2021

Kris Moore said:
We believe we've identified the issue and have a fixed kernel module up for testing on the Jira ticket. Feedback so far is positive that this resolves the issue, and we will be working to issue a 12.0-U1.1 hotfix release soon after a bit more validation time in the field.

Going through the comments in the bug report, I'm trying to understand if this is caused by using virtual machines, or rather that VMs simply expose the underlying data corruption (that a ZFS scrub would miss)?

If it's the latter, does that mean some of us may have been affected and not even know it, even after running a scrub of our entire pool with no errors? We will only find out when we try to read an affected file some time in the future?

Jon Moog · Jan 15, 2021

I was under the impression that this U1.1 update would fix the smbd crashing issues. I just updated and it is definitely not the case. Is this expected for the update? Reference thread

Code:

[2021/01/15 17:32:50.098708,  0] ../../lib/util/fault.c:79(fault_report)
  ===============================================================
[2021/01/15 17:32:50.098764,  0] ../../lib/util/fault.c:80(fault_report)
  INTERNAL ERROR: Signal 10 in pid 8803 (4.12.9)
  If you are running a recent Samba version, and if you think this problem is not yet fixed in the latest versions, please consider reporting this bug, see https://wiki.samba.org/index.php/Bug_Reporting
[2021/01/15 17:32:50.098796,  0] ../../lib/util/fault.c:86(fault_report)
  ===============================================================
[2021/01/15 17:32:50.098834,  0] ../../source3/lib/util.c:830(smb_panic_s3)
  PANIC (pid 8803): internal error
[2021/01/15 17:32:50.100776,  0] ../../lib/util/fault.c:265(log_stack_trace)
  BACKTRACE: 6 stack frames:
   #0 0x801952d17 <log_stack_trace+0x37> at /usr/local/lib/samba4/libsamba-util.so.0
   #1 0x802ccfb86 <smb_panic_s3+0x56> at /usr/local/lib/samba4/libsmbconf.so.0
   #2 0x801952b07 <smb_panic+0x17> at /usr/local/lib/samba4/libsamba-util.so.0
   #3 0x801952eee <log_stack_trace+0x20e> at /usr/local/lib/samba4/libsamba-util.so.0
   #4 0x801952ae9 <fault_setup+0x59> at /usr/local/lib/samba4/libsamba-util.so.0
   #5 0x80fd48c20 <_pthread_sigmask+0x530> at /lib/libthr.so.3
[2021/01/15 17:32:50.100897,  0] ../../source3/lib/dumpcore.c:315(dump_core)

Samuel Tai · Jan 15, 2021

Jon Moog said:
I was under the impression that this U1.1 update would fix the smbd crashing issues. I just updated and it is definitely not the case. Is this expected for the update? Reference thread

According to the U1.1 Release Notes, only the ZFS corruption issue was patched:

TrueNAS 12.0-U1.1 Changelog

NAS-108627 : While the underlying ZFS issue causing the instability is being resolved, there has been a temporary reversion of the ZFS CFA patch. This means that Asynchronous Copy-on-Write is temporarily disabled in TrueNAS 12.0-U1.1, with the goal to re-enable this functionality in a later TrueNAS 12.0 update release after this issue has been fully resolved.

It looks like the Samba fix will be pushed to U2.

Kris Moore · Jan 15, 2021

U2 is only a few weeks out. Didn't want to risk a regression elsewhere, corruption needs to come first.

As for workload, the access pattern is super unlikely to have happened on file workloads. Running another filesystem on top of ZFS is the only thing we've seen do this so far. (even that wasn't 100% gaurenteed to trigger the issue)

Jon Moog · Jan 15, 2021

Kris Moore said:
U2 is only a few weeks out. Didn't want to risk a regression elsewhere, corruption needs to come first.

Understood, thanks for the clarification.

vpv · Jan 16, 2021

I hope this is the right thread to ask. I could also make a new one. Anyway...

I can't install the update from the GUI. It says

Code:

Error: [EFAULT] Unable to mount boot-environment 12.0-U1.1-1

By googling for a minute I found out that I may need to destroy boot environments from the command line. But I feel like the GUI should handle stuff like this. Should I post a Jira ticket?

Mannekino · Jan 16, 2021

I just applied this update and saw the following two errors in my /var/log/messages

Should I worry about this? If so, what kind of checks should I do to find out what's wrong.

Code:

Jan 16 20:14:29 freenas.x syslog-ng[1424]: I/O error occurred while writing; fd='23', error='Input/output error (5)'
Jan 16 20:14:29 freenas.x syslog-ng[1424]: Suspending write operation because of an I/O error; fd='23', time_reopen='60'

jjstecchino · Jan 16, 2021

Just updated and I have syslog-ng crash
pid 2589 (syslog-ng), jid 0, uid 0: exited on signal 6 (core dumped)

Newfoundland.Republic · Jan 16, 2021

jjstecchino said:
Just updated and I have syslog-ng crash
pid 2589 (syslog-ng), jid 0, uid 0: exited on signal 6 (core dumped)

I've had the same issue since 12.0....

morganL · Jan 16, 2021

jjstecchino said:
Just updated and I have syslog-ng crash
pid 2589 (syslog-ng), jid 0, uid 0: exited on signal 6 (core dumped)

Can you confirm which release you updated from?

In general, we only recommend this update (U1.1) to 12.0 users that have seen this specific issue or are running virtualization. Its a a hot fix version and has not been through our normal QA cycle. For those that are on 11.3, stay on 11.3-U5, and wait for 12.0-U2 for an update.

ThreeDee · Jan 16, 2021

wasn't having any issues before and I don't run VM's ..saw the update and installed it without issue

Mannekino · Jan 17, 2021

morganL said:
Can you confirm which release you updated from?

In general, we only recommend this update (U1.1) to 12.0 users that have seen this specific issue or are running virtualization. Its a a hot fix version and has not been through our normal QA cycle. For those that are on 11.3, stay on 11.3-U5, and wait for 12.0-U2 for an update.

I updated from 12.0-U1. I went from FreeNAS to TrueNAS Core about 9 days ago.

I do thing syslog is running and working though.

These are the latest entries in /var/log/messages

Code:

Jan 17 00:00:00 freenas.fritz.box syslog-ng[1414]: Configuration reload request received, reloading configuration;
Jan 17 00:00:00 freenas.fritz.box syslog-ng[1414]: Configuration reload finished;

And ps aux | grep syslog

Code:

root        1413   0.0  0.0  19344   7960  -  I    20:19      0:00.00 /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog.pid
root        1414   0.0  0.0  26264  10636  -  Is   20:19      0:00.29 /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog.pid
root        2443   0.0  0.0  11428   2808  -  SsJ  20:21      0:00.32 /usr/sbin/syslogd -c -ss
root        2935   0.0  0.0  11428   2808  -  SsJ  20:21      0:00.16 /usr/sbin/syslogd -c -ss
root        3445   0.0  0.0  11428   2808  -  SsJ  20:22      0:00.17 /usr/sbin/syslogd -c -ss
root        3940   0.0  0.0  11428   2808  -  IsJ  20:22      0:00.21 /usr/sbin/syslogd -c -ss
root        4655   0.0  0.0  11428   2808  -  IsJ  20:23      0:00.08 /usr/sbin/syslogd -c -ss
root        5127   0.0  0.0  11428   2808  -  SsJ  20:23      0:00.16 /usr/sbin/syslogd -c -ss
root        5595   0.0  0.0  11428   2808  -  SsJ  20:24      0:00.16 /usr/sbin/syslogd -c -ss
root        6065   0.0  0.0  11428   2808  -  SsJ  20:24      0:00.15 /usr/sbin/syslogd -c -ss
root        6543   0.0  0.0  11428   2808  -  IsJ  20:24      0:00.13 /usr/sbin/syslogd -c -ss
root       21436   0.0  0.0  11428   2808  -  IsJ  22:57      0:00.07 /usr/sbin/syslogd -c -ss
root       87508   0.0  0.0  11380   2948  0  R+   10:23      0:00.00 grep syslog

Should I do/check anything else?
What is that I/O error and what writing operations are suspended?
Is this error still currently active or did it happen during the updating process before the automatic reboots?

jjstecchino · Jan 17, 2021

morganL said:
Can you confirm which release you updated from?

In general, we only recommend this update (U1.1) to 12.0 users that have seen this specific issue or are running virtualization. Its a a hot fix version and has not been through our normal QA cycle. For those that are on 11.3, stay on 11.3-U5, and wait for 12.0-U2 for an update.

Updated from 12.0 U1. Just saw the alert saying an update was available, so I did it.

indy · Jan 17, 2021

For anyone interested in the technical details this should be the openzfs pull request associated with the silent data corruption:

add async dmu support by mattmacy · Pull Request #10377 · openzfs/zfs

This supersedes #10303 and #10317 Exposed to user space by way of zvol and vnops. Can be exercised by way of aio on FreeBSD and io_uring on Linux. The only user visible change for existing code is ...

github.com

Selassie · Jan 18, 2021

Kris Moore said:
UPDATE: 1/15/2021 - 10:45AM Eastern
----
We have finished validation that this data integrity issue is resolved. A 12.0-U1.1 hotfix release is being pushed *today* which all users will be highly encouraged to upgrade to, even if they are not observing any issues with data corruption.

UPDATE: 1/14/2021 - 10:21AM Eastern
----
We believe we've identified the issue and have a fixed kernel module up for testing on the Jira ticket. Feedback so far is positive that this resolves the issue, and we will be working to issue a 12.0-U1.1 hotfix release soon after a bit more validation time in the field.

--- Original Post ---

TrueNAS Community,

I wanted to post an update today on the general state of TrueNAS 12.0-U1 quality as well as a heads up of some of the issues we are tracking for resolution in TrueNAS 12.0-U2.

So far we’ve seen over 40k+ systems upgrade to 12.0 and the response has been very positive. That has made this one of the fastest adopted TrueNAS (And FreeNAS) releases ever. General quality at the U1 stage is far surpassing what we saw in the 11.3 series and we’re eager for the launch of 12.0-U2 with even more polish and improvements. In the meantime we’ve gotten a few reports about some regressions from 11.3-U5 that we wanted to take a moment to address.

First, we had gotten reports of some performance regressions which were eventually tracked down to some bugs that came in from our FreeBSD 12.2 upstream network drivers for Intel and Chelsio devices specifically. Both of these issues have been investigated and now are resolved in the upcoming 12.0-U2. A special thank-you to everybody who helped us track that down and determine what hardware was impacted.

https://jira.ixsystems.com/browse/NAS-107593

Second, we are also currently tracking a handful of users reporting issues with data integrity in some very specific virtualization environments. TrueNAS ZFS is reporting that there is no corruption on the disks or pool, but inside the VM the local filesystem may report needing to run a filesystem repair (fsck or scandisk), depending upon the guest OS. The issue has been seen on a few hypervisors and with iSCSI and NFS. It may be related to the type of VM & filesystem running as a guest. For more details, please refer to this ticket:

https://jira.ixsystems.com/browse/NAS-108627

As you may already know, we take reports of issues with data integrity very seriously, and you can rest assured that the entire iX engineering team is treating this as an "all hands on deck" situation until we come to a resolution. We have held up publishing the TrueNAS Enterprise 12.0 update train and we will hold up the next update to 12.0-U2 until we can be confident that we have a validated fix in place. That release is scheduled to arrive in early February.

The good news is, this issue seems to be extremely rare from the looks of it. Even more so considering how hard it has been to reproduce it internally. Out of tens of thousands systems running 12.0, we've only seen it on a handful of very specific virtualization workloads and nothing related to traditional SMB/NFS/Plugins scenarios.

The bad news is, due to how rare this issue appears, it makes it that much more difficult to troubleshoot and resolve. If you suspect you've seen something similar, please update the Jira ticket referenced above with your details and attach a debug file (System -> Advanced -> Generate Debug).

If you have seen a similar issue or will be running a production-critical environment, we would recommend staying on or rolling back to 11.3-U5 until we can further diagnose and fix this issue.

I'm going to sticky this post for the time being and update as we have more information to share about the status of the integrity issue.

Thanks for your patience everybody, on behalf of the entire iX team we really do appreciate it.

I installed the update, the machine then turned itself off and never came back on. the machine starts but no IMPI or NIC IP addresses at all. Turned the machine on and off several times yet both NICS are dead. The motherboard is a Supermicro X11SPH-NCTF Server Motherboard - Intel Chipset - Socket P LGA-3647 with a XEON Silver 4114 processor. Everything was working fine prior to the attempted update.

Not sure where to start to recover everything. I have 100TB of data on the drives which are in two pools. the chassis is a 16 bay unit with a 12 bay unit connected as a JOBD. Any help would be appreciated.

Heracles · Jan 18, 2021

Hey @Selassie,

We would need more info about your setup...

Are you using encryption ? Do you have backups ? (config and data) Any error message you can see in the console ? Can you boot your system with a new and fresh boot device ? What is your hardware and how is it configured ? ...

Selassie · Jan 18, 2021

Heracles said:
Hey @Selassie,

We would need more info about your setup...

Are you using encryption ? Do you have backups ? (config and data) Any error message you can see in the console ? Can you boot your system with a new and fresh boot device ? What is your hardware and how is it configured ? ...

it did a backup of the config files before doing the upgrade, so I have that and the drives were not encrypted.

Selassie · Jan 18, 2021

Heracles said:
Hey @Selassie,

We would need more info about your setup...

Are you using encryption ? Do you have backups ? (config and data) Any error message you can see in the console ? Can you boot your system with a new and fresh boot device ? What is your hardware and how is it configured ? ...

ita a Supermicro motherboard X11SPH-NCTF with a XEON Silver processor, 128GB of Ram with two pools. one pool comprises of 16 hot-swap bays and the other is a 12 bay hot swap unit connected as a JOBD.

Important Announcement for the TrueNAS Community.

TrueNAS 12.0-U1 Quality Update

SVP of Engineering

Patron

MVP

Dabbler

Never underestimate your own stupidity

SVP of Engineering

Dabbler

Dabbler

Patron

Contributor

Guru

Captain Morgan

Guru

Patron

Contributor

Patron

Dabbler

Wizard

Dabbler

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "TrueNAS 12.0-U1 Quality Update"

Similar threads