TrueNAS Upgrade Constant Rebooting

ayao218

Dabbler
Joined
Jun 2, 2018
Messages
12
I have two identical Dell R710 machines that were previously running FreeNAS 11.1. Recently, I upgraded them from 11.1 -> 11.3 -> TrueNAS 12. They had been running perfectly fine for the 3 years they have been in service. They boot from dual mirrored USB sticks. No jails/VMs run on the machines, only storage, SMB, NFS, and replication.

Since the upgrade, I have noticed (from alert emails and from watching the machines) that TrueNAS keeps rebooting. The reboots follow no consistent schedule; the gap between them has ranged anywhere from 8 hours to 1.5 days. This happens on both upgraded machines, but they do not reboot at the same time or at the same interval.

The problem is that I'm not sure what is causing these mysterious reboots, and I'm not sure whether they are clean reboots or crashes (my guess is crashes, since to my knowledge I have definitely not set up anything that would reboot these machines).

In `/var/log/messages` I do not see anything relevant: there are no entries leading up to the reboot, only the messages from the system booting back up. I have not found any other logs that point toward the cause either.
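One way to approach the clean-reboot-versus-crash question is to look at the last line logged before each boot. The following is only a minimal sketch, not a diagnostic tool. It assumes that each boot starts with the FreeBSD copyright banner in the log and that an orderly shutdown leaves a `syslogd: exiting on signal 15` line just before it; both strings are passed as parameters so they can be changed if a system logs something different. The function name, the host name `nas1`, and all sample log lines are made up for illustration.

```python
def classify_reboots(log_text,
                     boot_marker="Copyright (c) 1992",
                     clean_hint="exiting on signal 15"):
    """For each boot banner, return (line logged just before it, looks_clean).

    boot_marker and clean_hint are assumptions about what the log contains;
    adjust them to match the messages seen on the actual system.
    """
    results = []
    previous = None
    for line in log_text.splitlines():
        if boot_marker in line and previous is not None:
            # The line right before the banner is the last thing the
            # previous boot managed to log.
            results.append((previous, clean_hint in previous))
        if line.strip():
            previous = line
    return results


# Hypothetical log excerpt: the first reboot has no shutdown message
# before the banner, the second one does.
SAMPLE = """\
Feb  1 03:00:00 nas1 kernel: some ordinary message
Feb  1 09:14:22 nas1 kernel: Copyright (c) 1992-2020 The FreeBSD Project.
Feb  1 09:15:00 nas1 kernel: boot continues
Feb  2 01:02:03 nas1 syslogd: exiting on signal 15
Feb  2 01:04:10 nas1 kernel: Copyright (c) 1992-2020 The FreeBSD Project.
"""

if __name__ == "__main__":
    for last_line, looks_clean in classify_reboots(SAMPLE):
        kind = "orderly shutdown" if looks_clean else "no shutdown message"
        print(f"{kind}: {last_line}")
```

To run it against a real log, read the contents of `/var/log/messages` into a string and pass that in place of `SAMPLE`. A missing shutdown message only suggests an abrupt reset; it does not say what caused it.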

Additional:
This would on the surface seem like a hardware issue, but I find it to be very unlikely given that I never had any issues prior to upgrading and after upgrading both machines showed this behavior.

Any help or thoughts would be very much appreciated! Thanks in advance!
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
It does indeed sound like a crash of some type, perhaps a kernel panic, but there are many possible causes, such as a driver bug, a hardware issue, etc.

Suggest tossing a ticket on jira.ixsystems.com with a debug file attached (system -> advanced -> download debug).
 

ayao218

Dabbler
Joined
Jun 2, 2018
Messages
12
Thanks so much for the response! I have created a ticket here. I would be really interested if you are able to determine the root cause of this issue.
 

ayao218

Dabbler
Joined
Jun 2, 2018
Messages
12
@Kris Moore
Is there any chance you would be able to take a look at the ticket? While there, I saw another similar ticket that mentioned IPMI watchdog resets, but I did not see anything in the iDRAC logs at all. Furthermore, one of the machines hasn't restarted in 3+ days now (a new record), but the other machine experienced 4 reboots in the past 24 hours.

Thanks again!
 

ayao218

Dabbler
Joined
Jun 2, 2018
Messages
12
Is there anything else I can do to troubleshoot this? Still experiencing this issue with reboots and it's been extremely frustrating.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Is there anything logged in /var/log/console.log?
 

ayao218

Dabbler
Joined
Jun 2, 2018
Messages
12
Other than startup messages, no. There are messages that appear each time the machine reboots, but prior to the reboot, there are no logs anywhere on the system that I can see. I uploaded the debug file to the linked ticket if you are able to take a look. I went through all the logs but couldn't find anything useful for diagnosing the issue.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Sorry, the members of the forum don't have access to the debugs uploaded. We're typically not iX employees, only volunteers. What you could do is to enable the debug kernel in System->Advanced and reboot. Then you might be able to see more info in the console log.
 

ayao218

Dabbler
Joined
Jun 2, 2018
Messages
12
Thanks! I will enable that and wait until my next inevitable automatic reboot. Do you think this could be related to the flash drive failing? I personally find hardware failing to be quite odd given that it's two different but identically configured machines experiencing the same symptoms which started at the same time.
 

ayao218

Dabbler
Joined
Jun 2, 2018
Messages
12
I'm also still running 12.0-U1.1, afraid to upgrade to U2 after this. Would you recommend upgrading to see if it will fix the reboots?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
It’s possible these reboots are due to the boot pools on both machines failing. You could try running freenas-verify to see if your installation has become corrupted. It wouldn’t make sense to try to upgrade on failing media.
 

ayao218

Dabbler
Joined
Jun 2, 2018
Messages
12
I just tried freenas-verify and it came back with success on both machines. It seems like the boot pool is fine then?
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
ayao218 said: "I just tried freenas-verify and it came back with success on both machines. It seems like the boot pool is fine then?"
Maybe. Problems with USB sticks typically show up after an upgrade - and it can be really tricky to diagnose the problem. I moved my boot device to SSD a long time ago - just to avoid this issue. USB sticks are no longer recommended as boot devices and there is a reason for this.

If you don't find anything obvious, I would pick up a cheap SSD and try it.
 

cneeper

Cadet
Joined
Oct 17, 2017
Messages
4
ayao218, I'm curious if you found the source of your problem.

I don't want to muddy the waters much, but a similar problem drew me to the forums. In my case, I went from 11.3-U5 -> 12.0-U3.1. No spontaneous reboots before the upgrade, now spontaneous reboots as described. FWIW, I'm using dual mirrored SuperMicro SATA DOM SSDs (32GB), not USB. In my case, I can't really rule out the hardware, since I'm abusing an ANCIENT server (late 2006, but with newer storage devices) and may have hints of a failing hard drive in my ~35TiB array. But it was definitely not rebooting prior to the upgrade, and now it's rebooting as frequently as the OP mentioned. This is an important-ish server in that I use it to store daily backups, but not mission critical in the sense that the world ends if the server dies... hence the obscenely old repurposed hardware. (It's almost a personal challenge/point of pride now, having a 15-year-old server still running strong, performing well, and doing useful work. LOL!)

I enabled reporting to my syslog server in an attempt to find a clue as to why, but I don't see anything flashing a red "Look At Me!" in there. I did manage to catch a reboot in progress, and it didn't appear to be a hard reset of the hardware (as if someone hit the reset button). It was more of an actual shutdown and reboot, so the OS didn't totally die. It was still functional enough to terminate services and reboot itself.

Bottom line, OP's experience may not be unique. I may be experiencing a similar issue.
 

ayao218

Dabbler
Joined
Jun 2, 2018
Messages
12
Unfortunately I haven't figured out the issue. I've gone through and even reinstalled and restored from backup, and I am still seeing exactly the same issue. So far I've just left the machines as they are, in this rebooting state.
 
Joined
Oct 22, 2019
Messages
3,641
@ayao218, are you using native ZFS encryption? Did you try upgrading to 12.0-U3.1? It may not be related, but there was a bug that caused crashing in -U2 with native ZFS encryption.
 

cneeper

Cadet
Joined
Oct 17, 2017
Messages
4
I continued digging through the forums... It seems like mine could be the watchdog timer triggering the reboot.
`ipmitool sel list` (shows the contents of the System Event Log)

...shows me multiple instances of the following, but only since my upgrade from 11.x to 12.x:
`Watchdog2 #0x03 | Power cycle | Asserted`
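For anyone who wants to pull these entries out of a long event log, here is a minimal sketch that filters saved `ipmitool sel list` output for watchdog events. It assumes the output is pipe-separated, as in the entry quoted above, and that the sensor field begins with "Watchdog". The function name, record IDs, timestamps, and the non-watchdog entry in the sample are invented for illustration; only the `Watchdog2 #0x03 | Power cycle | Asserted` part comes from the actual log.

```python
def watchdog_events(sel_text):
    """Return the fields of every SEL line that has a field starting with 'Watchdog'."""
    events = []
    for line in sel_text.splitlines():
        # Split on the pipe separator and trim padding around each field.
        fields = [f.strip() for f in line.split("|")]
        if any(f.lower().startswith("watchdog") for f in fields):
            events.append(fields)
    return events


# Hypothetical SEL excerpt; IDs and timestamps are made up.
SAMPLE = """\
1 | 03/01/2021 | 04:12:09 | Watchdog2 #0x03 | Power cycle | Asserted
2 | 03/01/2021 | 09:30:41 | Power Supply #0x64 | Presence detected | Asserted
3 | 03/02/2021 | 17:55:02 | Watchdog2 #0x03 | Power cycle | Asserted
"""

if __name__ == "__main__":
    for fields in watchdog_events(SAMPLE):
        print(" | ".join(fields))
```

Comparing the timestamps of the matching entries with the times of the unexpected reboots is one way to check whether the watchdog is actually involved.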

Seems like quite a few others might be having watchdog timer issues in FreeBSD 12.x. I'm going to try disabling the watchdogd service for now (`service watchdogd stop`) and see how that affects my version of this issue. I'd look into doing it in my server's firmware settings also, but I'm remote to my server right now, so this is the best I can do. YMMV and fair warning: I only have a general conceptual understanding of watchdog timers and the risks of disabling them. So for anyone reading this: I'm nowhere close to being a *nix guru or to understanding exactly what my risks of disabling watchdogd are. Do your own homework and only follow me at your own peril as I jump off the cliff and go splat on my own server that's not overly mission critical!
 