FreeNAS server rebooted on its own -- troubleshooting possible?

Status
Not open for further replies.

SnakeByte

Explorer
Joined
Jul 10, 2015
Messages
53
Yesterday one of our FreeNAS servers unexpectedly rebooted. (FreeNAS-9.10-STABLE-201606072003 (696eba7))
When this happens, is there any trace of why this happened left over for me to analyze?
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Spontaneous reboots are primarily caused by one of two common issues: a hardware failure (most likely)
or a kernel panic. The hardware is where I would look first, as it is the most likely cause.
  • Power supply
  • UPS
  • Overheated component or components
  • Faulty boot device or data/power cable
 

SnakeByte

Explorer
Joined
Jul 10, 2015
Messages
53
It's an iXsystems Mini plugged into a large UPS in an air-conditioned room. No other device in the server room had trouble, so I can at least rule out the UPS. Overheating is unlikely due to the AC.
It has been up for 12 hours since the reboot.
Does the FreeNAS GUI have any hardware monitoring?

If one wanted to troubleshoot the kernel panic, what is the best way to go about it? I assume the last kernel panic info was lost after the reboot? I also assume there's a way to stop the unit from rebooting on the next kernel panic so that I can see the "screen of death" info?
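
On the question of whether panic info survives a reboot: FreeBSD can write a crash dump that is recovered at the next boot, so it may be worth looking for saved dump files before assuming everything was lost. The sketch below only lists files in two candidate directories. Both paths are assumptions: `/var/crash` is the stock FreeBSD location, and `/data/crash` is where FreeNAS 9.x is commonly reported to keep its dumps. Adjust them if your install differs.

```python
import time
from pathlib import Path

# Assumed locations, not confirmed for this particular system.
CRASH_DIRS = ("/data/crash", "/var/crash")


def find_crash_files(dirs=CRASH_DIRS):
    """Return (path, size_bytes, mtime) for each regular file found, newest first."""
    found = []
    for d in dirs:
        p = Path(d)
        try:
            if not p.is_dir():
                continue
            for f in p.iterdir():
                if f.is_file():
                    st = f.stat()
                    found.append((str(f), st.st_size, st.st_mtime))
        except OSError:
            # Directory not readable by this user: skip it.
            continue
    return sorted(found, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    hits = find_crash_files()
    if not hits:
        print("No saved crash files found in", ", ".join(CRASH_DIRS))
    for path, size, mtime in hits:
        stamp = time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime))
        print(stamp, size, path)
```

An empty result does not prove there was no panic; it only means no dump was saved where the script looked.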
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Please don't be offended if this insults your intelligence, but have you checked the logs
by using the secure shell (not the GUI shell) and typing in the command:
# cat /var/log/messages

You can connect over secure shell with a client such as PuTTY or Bitvise.

You should be able to scroll up through the results. However, depending on your settings
in FreeNAS, the log will "roll over" every so often and the previous log file will be deleted.
If this is the case, you will have to wait until the issue recurs.
logscreen.JPG
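
Because rotated logs are compressed (for example `messages.0.bz2`), a plain `cat` only shows the current file. A minimal sketch of scanning the current and rotated logs together follows. The keyword list is an illustrative guess at what a crash or restart might leave behind, not an authoritative set; the `syslog-ng starting up` string is taken from the boot lines that appear in this system's log.

```python
import bz2
import glob
import re

# Illustrative keywords only; extend or trim to taste.
PATTERN = re.compile(r"panic|fatal trap|watchdog|syslog-ng starting up", re.IGNORECASE)


def open_log(path):
    """Open a plain or bz2-compressed log file in text mode."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", errors="replace")
    return open(path, "rt", errors="replace")


def search_logs(paths, pattern=PATTERN):
    """Return (path, line) for every line matching the pattern."""
    hits = []
    for path in paths:
        try:
            with open_log(path) as fh:
                for line in fh:
                    if pattern.search(line):
                        hits.append((path, line.rstrip("\n")))
        except (OSError, EOFError):
            # Unreadable or truncated file: skip it rather than abort the scan.
            continue
    return hits


if __name__ == "__main__":
    for path, line in search_logs(sorted(glob.glob("/var/log/messages*"))):
        print(f"{path}: {line}")
```

Each `syslog-ng starting up` hit marks a boot, so the lines just before it in the same file are the last thing logged before the machine went down.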
 

SnakeByte

Explorer
Joined
Jul 10, 2015
Messages
53
Besides a huge chunk of missing time in the messages.0.bz2 file before the reboot, everything looked peachy:

Code:
Jul  7 09:00:22 freenas autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 freenas-2.intranet.domain.com "zfs destroy -d 'trunk/Remote/MM@auto-20160630.0900-1w'"
Jul  7 09:00:23 freenas autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 freenas-2.intranet.domain.com "zfs destroy -d 'trunk/Remote/Backups@auto-20160630.0900-1w'"
Jul  7 09:00:24 freenas autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 freenas-2.intranet.domain.com "zfs destroy -d 'trunk/Remote/jails@auto-20160630.0900-1w'"
Jul  7 09:00:25 freenas autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 freenas-2.intranet.domain.com "zfs destroy -d 'trunk/Remote@auto-20160630.0900-1w'"
Jul  7 21:20:23 freenas syslog-ng[1875]: syslog-ng starting up; version='3.6.4'
Jul  7 21:20:23 freenas Copyright (c) 1992-2016 The FreeBSD Project.
Jul  7 21:20:23 freenas Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Jul  7 21:20:23 freenas         The Regents of the University of California. All rights reserved.
Jul  7 21:20:23 freenas FreeBSD is a registered trademark of The FreeBSD Foundation.


I'm unclear how waiting for this to happen again will help things -- I'll be right back where I am now unless there's a way to halt the reboot to see the kernel panic info? Or are you saying that normally when there is a kernel panic, the system doesn't reboot?

If it was a power issue, we shouldn't see this time gap though, correct?
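
To judge whether a gap like this is unusual, it helps to measure the silences in the log rather than eyeball them. The sketch below reports every pause between consecutive entries that exceeds a threshold. It assumes the classic `Mon DD HH:MM:SS` syslog prefix seen above; since that prefix carries no year, the year is passed in by hand. The two sample lines are abbreviated from the log excerpt in this post.

```python
from datetime import datetime, timedelta


def parse_ts(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; return None if the line has none."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, threshold=timedelta(hours=1)):
    """Return (previous, current, length) for each silence of at least `threshold`."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue
        if prev is not None and ts - prev >= threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps


# Abbreviated from the excerpt above: last replication line, then first boot line.
sample = [
    "Jul  7 09:00:25 freenas autorepl.py: [common.pipesubr:61] Popen()ing: ...",
    "Jul  7 21:20:23 freenas syslog-ng[1875]: syslog-ng starting up; version='3.6.4'",
]

for start, end, length in find_gaps(sample, 2016):
    print(f"{start} -> {end} ({length})")
# prints: 2016-07-07 09:00:25 -> 2016-07-07 21:20:23 (12:19:58)
```

Running the same function over older rotated logs gives a baseline: if long silences after the morning replication are routine, the gap says little about when the reboot actually happened.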
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
SnakeByte said:
Besides a huge chunk of missing time in the messages.0.bz2 file before the reboot, everything looked peachy:

That is a VERY huge chunk of missing time, all right!
I would recommend getting in touch with iXsystems support to see if they can help you, or hang out
and see if someone else can lend a hand with this.
@cyberjock HELP!

In the meantime, perhaps check the PSU while you wait...

SnakeByte said:
I'm unclear how waiting for this to happen again will help things -- I'll be right back where I am now unless there's a way to halt the reboot to see the kernel panic info?

Waiting would only have been needed if your log had "turned over" as shown in the picture file I posted above.
That does not apply in your case. Your log output seems to show that your
machine was in the middle of a replication run when the reboot occurred. I could be wrong about this.
 

SnakeByte

Explorer
Joined
Jul 10, 2015
Messages
53
I think that gap in time may be a red herring... Looking at older log files shows no activity after replication, up until midnight.
Code:
Jun 24 09:00:44 freenas autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 freenas-2.intranet.domain.com "zfs destroy -d 'trunk/Remote/jails@auto-20160617.0900-1w'"
Jun 24 09:00:45 freenas autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 freenas-2.intranet.domain.com "zfs destroy -d 'trunk/Remote@auto-20160617.0900-1w'"
Jun 25 00:00:00 freenas syslog-ng[6885]: Configuration reload request received, reloading configuration;
Jun 25 03:30:01 freenas cachetool.py: [common.pipesubr:61] Popen()ing: klist
Jun 25 03:30:03 freenas cachetool.py: [common.pipesubr:61] Popen()ing: klist
Jun 25 09:00:03 freenas autosnap.py: [tools.autosnap:61] Popen()ing: /sbin/zfs snapshot -r "trunk@auto-20160625.0900-1y"


I've got a PSU tester. I'll give it a go. I'll boot from a CD to test the hardware overnight too.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
SnakeByte said:
I'm unclear how waiting for this to happen again will help things -- I'll be right back where I am now unless there's a way to halt the reboot to see the kernel panic info? Or are you saying that normally when there is a kernel panic, the system doesn't reboot?

If it was a power issue, we shouldn't see this time gap though, correct?

Everything is normal as of 9 am on the 7th; then, at some unknown point before 21:20,
you lost power, which caused the reboot. AFAIK, if your machine had experienced a kernel panic,
it would have shown in the logs prior to the reboot. It's early in the debug process, but my
guess at this point is the power supply.

Please keep us posted so others can refer back to this thread and learn from your experience,
as you work towards a solution.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Can you generate a debug file and send it to me via a PM in the forums? I will be able to tell you if the system crashed or not, but if there's a hardware failure the logs may not indicate that a failure occurred.
 

SnakeByte

Explorer
Joined
Jul 10, 2015
Messages
53
BigDave said:
It's early in the debug process, but my
guess at this point is the power supply.

Please keep us posted so others can refer back to this thread and learn from your experience,
as you work towards a solution.

The PSU test came up clean using this device: http://www.thermaltake.com/products-model.aspx?id=C_00001777

I tested with both the 24-pin connector and what they call the peripheral connector. This particular PSU does not have a CPU plug.
 

SnakeByte

Explorer
Joined
Jul 10, 2015
Messages
53
The UPS is a CyberPower CP1500PFCLCD (https://www.cyberpowersystems.com/products/ups/pfc-sinewave/cp1500pfclcd). It is one of their "Sinewave" models, which are supposed to provide clean power.
Some notable things:
  • I have a QNAP also connected to this UPS, and that unit's uptime is measured in months.
  • I loaded an update for the FreeNAS 20 days ago, so before the unexpected reboot it had been up for 20 days.
  • The FreeNAS unit has been plugged into this UPS for over a year without any trouble.
  • The FreeNAS unit is connected to the UPS with the USB communication cable.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
SnakeByte said:
The PSU test came up clean using this device: http://www.thermaltake.com/products-model.aspx?id=C_00001777

I tested with both the 24-pin connector and what they call the peripheral connector. This particular PSU does not have a CPU plug.

I have limited knowledge of advanced diagnostic testing of power supplies, but I do know
that the tester you linked to doesn't put a proper load on the PSU and therefore only tests
basic voltage output, which may not be affected until a load is placed on the components.
AFAIK, a load test sometimes has to run for hours before an issue shows up.
Maybe someone else can expound on advanced PSU testing methods.
 

SnakeByte

Explorer
Joined
Jul 10, 2015
Messages
53
cyberjock said:
Can you generate a debug file and send it to me via a PM in the forums? I will be able to tell you if the system crashed or not, but if there's a hardware failure the logs may not indicate that a failure occurred.

Done. In the meantime I've booted up Hiren's Boot CD 15.2 and am running Memtest86 v4.20. I plan on running it over the weekend. If you know of something more comprehensive, I'm game.
 

SnakeByte

Explorer
Joined
Jul 10, 2015
Messages
53
More info:
Logged into IPMI while the memory test runs. Under Server Health, the Event Log only contains entries for "timestamp clock syncs" back in Dec 2015, and nothing for 2016. The System and Audit Log has entries, but they are all informational and all dated today. If there were previous days' worth of data, I think it was lost when I power cycled the unit (after doing the PSU test), so anything about PSU voltage level issues is gone. I'm keeping the IPMI session open with "live" graphs showing ATX+5VSB, CPU temp, MB temp, and VCore. Otherwise, on the Server Health, Sensor Readings page, every category it tracks (LNR, LC, LNC, UNR, UC, UNC, Other, Discrete) has a count of zero.

One odd thing about Memtest86 v4.20: even though the DIMMs in the unit are the ECC type (the BIOS confirms this), Memtest86 v4.20 shows ECC as off. Even when I go into Configuration and set it to on, it remains off. I'm not sure how this affects the tests.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
I have used the PassMark version the last few times I've tested RAM.
http://www.memtest86.com/features.htm
The page lists the features they claim over the older v4 software;
you can see the ECC differences listed there, FWIW.
 

SnakeByte

Explorer
Joined
Jul 10, 2015
Messages
53
BigDave said:
I have limited knowledge of advanced diagnostic testing of power supplies, but do know
that the tester you linked to doesn't put a proper load on the PSU and therefore only tests
basic voltage output which may not be affected until a load is placed on the components.
AFAIK, a load test will sometimes have to remain ongoing for hours until an issue occurs.
Maybe someone else might be able to expound on advanced PSU testing methods.

My Google-fu has failed me. I cannot find a tester that applies a load. I guess I'll just rely on the IPMI interface -- it shows me the voltage numbers and I'll be able to look at that during normal load.
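
When reading voltages off IPMI under load, it helps to know what counts as "off". A minimal sketch follows, assuming the ±5% tolerance that the ATX design guides give for the positive rails. The rail names and the readings are made up for illustration; the labels on your IPMI sensor page (such as ATX+5VSB) will differ, and Vcore is left out because it has no fixed nominal value.

```python
# Nominal voltages for the positive ATX rails; names are illustrative labels.
NOMINAL_RAILS = {"+3.3V": 3.3, "+5V": 5.0, "+5VSB": 5.0, "+12V": 12.0}
TOLERANCE = 0.05  # assumed +/-5% band for the positive rails


def out_of_tolerance(readings, nominal=NOMINAL_RAILS, tol=TOLERANCE):
    """Return {rail: percent deviation} for readings outside the tolerance band."""
    bad = {}
    for rail, volts in readings.items():
        target = nominal.get(rail)
        if target is None:
            continue  # unknown rail: nothing to compare against
        deviation = (volts - target) / target
        if abs(deviation) > tol:
            bad[rail] = round(deviation * 100, 1)
    return bad


# Made-up readings, not taken from this system.
readings = {"+5VSB": 5.02, "+12V": 11.3, "+3.3V": 3.31}
print(out_of_tolerance(readings))
# prints: {'+12V': -5.8}
```

A single in-range reading proves little; the point of watching under load is to catch a rail that sags only when the drives and CPU are busy.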
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
SnakeByte said:
Do you mean the Pro version? I've just purchased it and am running it now. Fancy.

The free version was what I was referring to; I have not even priced the Pro version. I'm green with envy :D

Post us a screenshot!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194