Unscheduled system reboots only after upgrading to FreeNAS-11.2

ericus · May 14, 2019

Long-time FreeNAS user, first-time poster.

I have a Dell PowerEdge R710 server that has been serving as an SMB, NFS (v2, v3), RSYNC, and FTP server for quite some time. It was running FreeNAS 9 for a couple of years without issue, and the hardware has consistently been rock-solid even under relatively heavy load.

It sits alongside an ESXi server that runs various Linux VMs that access the storage (primarily over NFS and a bit of RSYNC every few hours), and there are a handful of Windows hosts scattered around the network that access the NAS via SMB.

Back in late Feb/early March, we had to pick up and move a huge portion of this hardware to a different lab site. Upon arriving and while going through some reconfiguration (of the VMs, other hardware, etc) to accommodate the different environment and network/IPs/etc, we started hitting a ton of problems with crashes, memory leaks, and more (again, this was just on the VMs; NAS was still rock-solid at this point). However, in troubleshooting, it was decided to also go ahead and update the FreeNAS server to 11.2 just in case.

This completely blew up, with the FreeNAS server unable to boot again. A lot of troubleshooting and recovery attempts ensued, and eventually we were able to do a clean install and re-import the old FreeNAS database ok. We didn't really consider trying to dig up and re-install 9.x, since we didn't think that would make sense if we could get the server up and running on 11.x.

Anyway, long story short, ever since then we have been getting "Unscheduled system reboot" emails from the server (and also hard crashes, of course) at least once a day, and sometimes multiple times per day.

I have gone through every single search result I could find on Google, the iXsystems forums, the FreeBSD forums, etc., and simply cannot get these to stop no matter what I try. In addition, I cannot find any breadcrumbs on the system that could give a hint as to what could be triggering these.

A few points:

Memory problems?
- Hardware was rock-solid for months/years at the original site, and solid for at least a couple of weeks at the new site and before it was upgraded.
- There is certainly enough RAM for the system, and every single chart I have been able to pull of memory consumption over time (FreeNAS gui) or log of "top" shows a very stable memory usage pattern that never even comes close to a significant percentage of the total amount (96GB ECC).
- memtest86 runs clean for 18 hours (longest I could keep the system totally down over a weekend to run this test), before I had to bring the system back online (hey, a crashy NAS is still better than no NAS :))
Are you sure it's actually crashing and these aren't just spurious emails?
- Yes. I know there is a bug in 11.x whereby it will spit out a couple of these emails after upgrade, but that bug states it will do that a couple of times and then stop. Also, we definitely lose connectivity to the NAS, it won't respond to pings, and I know that FreeNAS has truly crashed hard from the iDRAC virtual console, the system logs & uptime, etc.
Temperature, or some other CPU problem?
- Similar to previous point. All charts show stable behavior, never any alerts or problems according to the iDRAC6 reports/logs either.
Disk errors?
- Nothing that I can find, and again all of the storage disks are the same as they were before updating. A few zfs scrubs have happened on the pool since these crashes started, and they always come back clean.
What does /var/log/xx say?
- I've made countless attempts at finding anything useful in any of the log files in this directory (i.e. not limiting myself to /var/log/messages or messages.x.bz2), as well as looking at the last entries before the crashes (in every file here). Typically, there is nothing of note, no critical error, nothing (as far as I can tell) that would help me know where to investigate further.
Any patterns in the crashes?
- None that I've been able to find. They are not always "close" to any particularly large load period, RSYNC access, scrub, snapshot, scheduled task, or anything else I can tell. We don't do replication either.
- I haven't been able to trigger a crash "on demand" yet either; I have tried some worst-case scenario tests, high loading, selectively disabling (some) services (I can't shut off NFS and would reeeally prefer to not shut off SMB but otherwise have shut off stuff one-at-a-time until it crashes again and I know it wasn't that).
What's going on on the network when it crashes?
- I have pored over a wireshark capture leading up to, and including, the most recent crash, captured on a port mirror of the interface between the NAS and the directly-connected switch. There is nothing of note, other than suddenly the server stops responding to any SMB or NFS query, and comes back about 6 minutes later. There is truly nothing special about the traffic immediately before the crash.
What about the physical console- does anything spit out there?
- Just in case the server was spitting something out on its console that wasn't making it into /var/log/messages (or elsewhere), I set up a screen recording of the iDRAC6 virtual console and waited for the next crash (I know I sound absolutely insane at this point, but trust me, I'm truly running out of ideas). There is absolutely nothing spit out to the local console when it crashes; it goes from showing the FreeNAS "Console setup" menu, to just "black", and then to the "Configuring memory...." screen that it shows when it is booting from scratch.
Are you losing power?
- It would almost seem to behave like this is the case, but iDRAC6 reports that the power supplies (redundant) have been online since March 11 (when I was onsite and troubleshooting/moving power). So as far as I can tell, this is not the issue. Also, the MGMT-only iDRAC NIC port on the connected Cisco never goes down, so there's that.
ZFS problems?
- Again, nothing that is readily visible to me. Also, I have already "upgraded the pool" after upgrading to 11.x, as one of the more recent troubleshooting steps and in response to the FreeNAS nags that ZFS feature flags could be upgraded. Just in case some backwards-incompatibility between 11.2 and an old 9.x-created pool was an issue, I did the <zpool upgrade> steps a week or two ago and unfortunately this didn't help the crashes. It would also make it much harder for me to roll back to a 9.x install, even if I was ready to go that route.
Anything else that is mildly unique or interesting about your server?
- As per a manual backup that I have made/updated a couple of times before each of the "major" troubleshooting steps described above, it looks like the server hosts almost 5 million files. It's just a bit over 1TB I think, but yeah, a lot of little files.
- We use a 4-port Link Aggregation (active) to connect to the upstream Cisco switch. Again, this didn't change in any way when the crashes began, but it's just one of the things I haven't yet tried to eliminate.
Why don't you just xx?
- I know I'll get a question or two like:
  - "why don't you just go back to 9.x" - database wouldn't be backwards compatible, so we'd have to reconfig the server from scratch. This would be fine, but now that I've upgraded the pool to latest feature flags, I would also have to rebuild the pool from scratch and re-seed an enormous amount of files, folders, and tiny intricacies/etc. This would be absolutely the last resort...
  - put it on a VM, replace the whole server, move it back to the old location, etc etc etc... yeah, I'm sure a lot of different drastic changes might shotgun this problem away, but it's not something that we can do at this time.

Anyway, what I am hoping for is that someone could tell me if I've missed something important, or if I could at least be pointed in the right direction for finding the root cause. I can consider drastic changes if it is clear that some specific thing is the culprit, but I just haven't found any smoking gun anywhere at this point.

To anyone that is still reading at this point, thanks and I appreciate your time even if you don't respond :)

I apologize for the massive data dump. And again, thanks in advance to anyone that can help.

styno · May 15, 2019

If you create a debug dump (system - advanced - save debug) you can find the files in /var/db/system/ixdiagnose/ixdiagnose/crash
FreeNAS panics can be found in the info file located in that directory. I guess it would be safe to say that if there is no panic string in those files something outside of the OS is causing the reset.
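
The check described above can be scripted. The following is only a rough sketch in Python, under two assumptions that do not come from this thread: that the info files in that directory have names starting with "info", and that a recorded panic shows up as a line containing the word "panic". The function name find_panic_lines is made up for illustration.

```python
from pathlib import Path


def find_panic_lines(crash_dir):
    """Return (file name, line) pairs for every line that mentions a panic.

    Assumption: info files are named info* and a panic is recorded on a
    line containing the word "panic" (matched case-insensitively).
    """
    hits = []
    root = Path(crash_dir)
    if not root.is_dir():
        # Nothing to scan, e.g. no crash directory was ever created.
        return hits
    for path in sorted(root.glob("info*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file has odd bytes.
        text = path.read_text(errors="replace")
        for line in text.splitlines():
            if "panic" in line.lower():
                hits.append((path.name, line.strip()))
    return hits


if __name__ == "__main__":
    # Directory quoted in the post above; adjust it if the debug dump
    # was extracted somewhere else.
    crash_dir = "/var/db/system/ixdiagnose/ixdiagnose/crash"
    hits = find_panic_lines(crash_dir)
    if hits:
        for name, line in hits:
            print(f"{name}: {line}")
    else:
        print("No panic lines found in", crash_dir)
```

An empty result only means that no panic was recorded in those files. It does not identify what caused the reset.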

ericus · May 17, 2019

Thanks styno. Unfortunately I have looked for panics in the past (and just did another debug dump just now to double-check) and I haven't seen anything get generated ever.

I think cyberjock noted in another thread about a similar crash, after looking at someone's debug dump tarball, that "a problem was in x"- even though that crash was also not generating explicit kernel panic files (I think). So I was kind of hoping that there would be some other place to look, or way to identify the general area of the problem, to narrow down further troubleshooting attempts... :(

And I agree with what you are saying, that if there isn't a kernel panic/core dump that the problem is probably external to the OS. However, that's pretty much the only thing that was changed (OS, plus minor config changes) to bring about this problem. And everything I know to check, hardware-wise, checks out...

Apollo · May 17, 2019

You should revert to your former stable FreeNAS version.
As I have mentioned many times, FreeNAS 11.2 is not as stable as people may think.
