Need assistance diagnosing random shutdown.

Hardware details for my main "X10" server are in my sig.

I've just tried to access an SMB share via Windows and found that it was inaccessible. I went through the usual Windows malarkey to diagnose such problems but when I tried logging into the FreeNAS WebGUI, I found that it too was inaccessible. Sure enough, the FreeNAS server was powered off.

I logged in to the IPMI interface via a web browser and checked the event log but found nothing (although I have a suspicion that the reason I found zilch may have been my fault). I then rebooted the server and logged in via the WebGUI. There were no flashing red icons/alerts, so I then checked /var/log/messages. There wasn't anything that seemed particularly important (but I'll quote the entries below nonetheless), other than giving me an approximate time for when the server went down.

Code:
Oct 17 11:00:12 freenas-x10 autosnap.py: [tools.autosnap:66] Popen()ing: /sbin/zfs snapshot -r "tank/Youtube_Projects@auto-20181017$
Oct 17 11:00:12 freenas-x10 autosnap.py: [tools.autosnap:66] Popen()ing: /sbin/zfs destroy -r -d "tank/Youtube_Projects@auto-201810$
Oct 17 11:34:19 freenas-x10 alert.py: [common.pipesubr:66] Popen()ing: /usr/local/sbin/dmidecode -s system-product-name			 
Oct 17 11:34:19 freenas-x10 alert.py: [common.pipesubr:66] Popen()ing: /usr/local/sbin/dmidecode -s baseboard-product-name		 
Oct 17 12:34:28 freenas-x10 alert.py: [common.pipesubr:66] Popen()ing: /usr/local/sbin/dmidecode -s system-product-name			 
Oct 17 12:34:28 freenas-x10 alert.py: [common.pipesubr:66] Popen()ing: /usr/local/sbin/dmidecode -s baseboard-product-name		 
Oct 17 12:48:12 freenas-x10 autosnap.py: [tools.autosnap:66] Popen()ing: /sbin/zfs snapshot -r "tank@auto-20181017.1248-2m"		 
Oct 17 12:48:13 freenas-x10 autosnap.py: [tools.autosnap:66] Popen()ing: /sbin/zfs destroy -r -d "tank@auto-20180818.1248-2m"	   
Oct 17 13:00:12 freenas-x10 autosnap.py: [tools.autosnap:66] Popen()ing: /sbin/zfs snapshot -r "tank/Youtube_Projects@auto-20181017$
Oct 17 13:00:12 freenas-x10 autosnap.py: [tools.autosnap:66] Popen()ing: /sbin/zfs destroy -r -d "tank/Youtube_Projects@auto-201810$
Oct 17 13:04:10 freenas-x10 devd: notify_clients: dropping unresponsive client: Broken pipe										 
Oct 17 13:04:11 freenas-x10 kernel: epair0a: link state changed to DOWN															 
Oct 17 13:04:11 freenas-x10 kernel: epair0a: link state changed to DOWN															 
Oct 17 13:04:11 freenas-x10 kernel: epair0b: link state changed to DOWN															 
Oct 17 13:04:11 freenas-x10 kernel: epair0b: link state changed to DOWN															 
Oct 17 17:28:53 freenas-x10 syslog-ng[1688]: syslog-ng starting up; version='3.6.4'												 
Oct 17 17:28:53 freenas-x10 Copyright (c) 1992-2016 The FreeBSD Project.															
Oct 17 17:28:53 freenas-x10 Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994								
Oct 17 17:28:53 freenas-x10	 The Regents of the University of California. All rights reserved.								   
Oct 17 17:28:53 freenas-x10 FreeBSD is a registered trademark of The FreeBSD Foundation.											
Oct 17 17:28:53 freenas-x10 FreeBSD 10.3-STABLE #0 r295946+21897e6695f(HEAD): Thu Jun  8 20:02:12 UTC 2017


My next step was to check the Event Log using IPMIView... that's where things got interesting.

Code:

502,System Event,05/01/2018 03:21:53 Tue,Fan,FANA,De-assertion: Lower Critical - going low
503,System Event,05/01/2018 04:18:45 Tue,Fan,FANA,Assertion: Lower Critical - going low
504,System Event,05/01/2018 04:18:48 Tue,Fan,FANA,De-assertion: Lower Critical - going low
505,System Event,05/01/2018 05:05:22 Tue,Fan,FANA,Assertion: Lower Critical - going low
506,System Event,05/01/2018 05:05:25 Tue,Fan,FANA,De-assertion: Lower Critical - going low
507,System Event,05/01/2018 05:28:14 Tue,Fan,FANA,Assertion: Lower Critical - going low
508,System Event,05/01/2018 05:28:17 Tue,Fan,FANA,De-assertion: Lower Critical - going low
509,System Event,05/01/2018 06:12:22 Tue,Fan,FANA,Assertion: Lower Critical - going low
510,System Event,05/01/2018 06:12:25 Tue,Fan,FANA,De-assertion: Lower Critical - going low
511,System Event,10/17/2018 12:05:01 Wed,Watchdog 2,,Assertion: Watchdog 2| Event = Timer interrupt
512,System Event,10/17/2018 12:05:02 Wed,Watchdog 2,,Assertion: Watchdog 2| Event = Hard Reset
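
Incidentally, the same SEL entries can be pulled from the FreeNAS shell without firing up IPMIView, which makes it easier to line the timestamps up against /var/log/messages. A rough sketch, assuming the ipmi(4) driver loads on this board:

Code:
# Load the in-band IPMI driver (it may already be loaded, in which case this complains harmlessly)
kldload ipmi
# Dump the system event log with decoded timestamps
ipmitool sel elist
# Check the BMC's clock, since SEL timestamps come from the BMC, not the OS
ipmitool sel time get

Worth noting that if the BMC's clock is off, the SEL times won't line up with syslog at all.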



Ignore the fan-related events. I'm using the fans that came with my case, which the X10SL7-F doesn't play nicely with, so despite my not being able to actually hear it, the fans apparently ramp up and down semi-regularly... but I must stress that the server is in a converted attic and we're heading into autumn in the UK, so heat is not an issue (and wasn't even at the height of summer, for that matter).
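
As an aside, from what I've read the usual fix for case fans that idle below Supermicro's default thresholds is to lower the fan sensors' lower thresholds via ipmitool, rather than living with the ramping. A sketch with placeholder RPM values; check your fans' actual idle speed first:

Code:
# Show FANA's current reading and its threshold set
ipmitool sensor get FANA
# Lower the non-recoverable / critical / non-critical floors so a slow idle
# speed stops tripping "Lower Critical - going low" events
# (100/200/300 are placeholder values; pick numbers below your fans' true idle RPM)
ipmitool sensor thresh FANA lower 100 200 300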

What caught my eye were the last two entries concerning the watchdog. I did have the motherboard jumper set to "reset" (pins 1 and 2), but I'm 99.9% sure I had it disabled in the BIOS. Either way, it's disabled now, and the jumper has been removed in order to disable the watchdog.
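
For the record, the watchdog state can also be checked and stopped from the OS side, which is a handy cross-check against the BIOS setting. A sketch, assuming ipmitool can reach the BMC in-band:

Code:
# Show whether the BMC watchdog timer is armed and what action it would take
ipmitool mc watchdog get
# Stop the watchdog timer entirely
ipmitool mc watchdog off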

So... where do I go from here? Are there any other logs I should check?

I do have a PSU tester... just a basic thing that tests whether the voltages and the power-good signal are within tolerance. Does anyone think it's worth testing the PSU?
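
On the same theme, the board's own sensors log the rail voltages continuously, which might catch a sag under load that a basic plug-in tester (which only sees the PSU unloaded) would miss. A sketch:

Code:
# Read all voltage sensors (12V, 5VCC, 3.3VCC, VBAT, etc.) along with their thresholds
ipmitool sdr type Voltage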
 
--------------------------

It's just happened again. Cue more googling and more reading of threads from users with similar issues. As several of the threads were on these forums but none were resolved, I'm just gonna name-drop you, @Kris Moore. You responded to several of the other threads but, as I said, they didn't lead anywhere, so if you want another crack at finding the issue, I'm happy to jump through some hoops after the weekend, or whenever you or another dev has time.

After the previous shutdown, I double-checked that the watchdog was disabled in the BIOS and I removed the jumper on the motherboard to disable the watchdog. [EDIT] I've since reinstalled the jumper to bridge pins 1 and 2, so the system resets... I did this to eliminate it as the cause of what I believe is an unrelated hardware issue that I have described in the spoilers below. I might remove it again though.

The course of events was near identical with the latest shutdown, but the logs showed a few differences. The system shut down at 13:02:52 this time, so just under an hour later than the first shutdown. I've checked for any scheduled tasks taking place around that time, and I guess it's possible either a scrub or a S.M.A.R.T. test could have been running; I have them scheduled for 4 am, so perhaps they were still going. I don't think it's related to scrubs or SMART tests though, as whilst a scrub of my boot device did start last night (according to my email reports), no scrub was taking place in the days leading up to or after the last shutdown.
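
Next time it drops, I should be able to confirm or rule out scrub/SMART activity directly rather than working backwards from email reports. A sketch; the pool name is mine and the disk device is a placeholder, so adjust both:

Code:
# Shows whether a scrub is in progress, or when the last one finished
zpool status tank
# Likewise for the boot pool
zpool status freenas-boot
# Show the SMART self-test log (recent runs, progress, results) for one disk
smartctl -l selftest /dev/ada0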

I have four jails running: Plex, Deluge, and UniFi Controller. I have one plugin running: Syncthing.
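
If one of those were somehow implicated, the running jails and what they're doing can be checked from the shell. A trivial sketch:

Code:
# List running jails (JID, IP address, hostname, root path)
jls
# See what's actually consuming CPU, host and jails alike
top -o cpu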

Here are the messages from the IPMI event log. There were also a bunch of entries regarding fans ramping up and down, but I've cut them out as they're basically identical to the ones in my previous post, aside from the date and time.

Code:
511,System Event,10/26/2018 13:02:52 Fri,Watchdog 2,,Assertion: Watchdog 2| Event = Timer interrupt
512,System Event,10/26/2018 13:02:53 Fri,Watchdog 2,,Assertion: Watchdog 2| Event = Hard Reset


Here are the logs from /var/log/messages.

Code:
fs destroy -r -d "tank@auto-20180827.1248-2m"
Oct 26 13:00:12 freenas-x10 autosnap.py: [tools.autosnap:66] Popen()ing: /sbin/zfs snapshot -r "tank/Youtube_Projects@auto-20181026.1300-2w"
Oct 26 13:00:12 freenas-x10 autosnap.py: [tools.autosnap:66] Popen()ing: /sbin/zfs destroy -r -d "tank/Youtube_Projects@auto-20181012.1300-2w"
Oct 26 13:30:41 freenas-x10 alert.py: [common.pipesubr:66] Popen()ing: /usr/local/sbin/dmidecode -s system-product-name
Oct 26 13:30:41 freenas-x10 alert.py: [common.pipesubr:66] Popen()ing: /usr/local/sbin/dmidecode -s baseboard-product-name
Oct 26 14:00:12 freenas-x10 autosnap.py: [tools.autosnap:66] Popen()ing: /sbin/zfs snapshot -r "tank@auto-20181026.1400-2m"
Oct 26 14:00:35 freenas-x10 devd: notify_clients: dropping unresponsive client: Broken pipe
Oct 26 14:00:35 freenas-x10 kernel: epair0a: link state changed to DOWN
Oct 26 14:00:35 freenas-x10 kernel: epair0a: link state changed to DOWN
Oct 26 14:00:35 freenas-x10 kernel: epair0b: link state changed to DOWN
Oct 26 14:00:35 freenas-x10 kernel: epair0b: link state changed to DOWN
Oct 26 18:26:02 freenas-x10 syslog-ng[1697]: syslog-ng starting up; version='3.6.4'



--------------------------

I had some issues with my motherboard that delayed me from being able to name-drop @Kris Moore (see the first spoiler below), but as the issue with the mobo is now (hopefully) fixed, I can go ahead and extend the offer of help troubleshooting the power-loss issues I described above.

In the second spoiler, I will detail what the issue with the motherboard turned out to be in the hopes that it may help others in the same situation.

I'm posting the information below as it'd be frustrating to let the work I put into it go to waste... I think I have a hardware issue.

My system is completely unresponsive to the physical power switches and to power control via IPMIView. I disconnected the 24-pin and 8-pin CPU connectors, and a SATA cable, in order to test my PSU, and after the PSU passed with flying colours, I reconnected everything. The system powered on for a moment, all the fans spun up, the LED next to the jumper that enables SAS turned red for a split second, then green, and then the system lost power. All the LED lights are on as normal and IPMIView says the system is powered on (as it would when it's actually turned on rather than simply receiving power), but it's completely unresponsive. I tried resetting the CMOS but no dice.

Shit.

[EDIT] Okay, that has kind of put a smile on my face. Once I posted this, the word above beginning with 'S' was autocorrected to the c word.

[EDIT 2] I've pulled the power cable and removed the CMOS battery, fingers crossed that time really is the great healer.
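
For anyone hitting the same wall: when the BMC's idea of the power state disagrees with reality like this, a cold reset of the BMC itself is another low-risk thing to try before tearing hardware down. A sketch run from another machine on the LAN; the IP and credentials are placeholders:

Code:
# Ask the BMC what it thinks the power state is
ipmitool -I lanplus -H 192.168.0.10 -U ADMIN -P ADMIN chassis power status
# Cold-reset the BMC itself (this does not power-cycle the host)
ipmitool -I lanplus -H 192.168.0.10 -U ADMIN -P ADMIN mc reset cold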

The symptoms were as follows:

  • The system would power on for a few seconds when the PSU's switch was flipped.
  • The front panel power button did nothing.
  • After the system had powered up briefly and then powered down, IPMIView would still show the system as being powered on. None of IPMIView's power controls worked, either through the dedicated "IPMI Device" tab or via a KVM Console window.
Here's what I did to diagnose the problem:

  • Removed RAM sticks and tried booting with a single DIMM in various slots.
  • Disconnected all storage disks.
  • Disconnected the internal USB cables.
  • Tried reseating the power connectors.
  • Replaced the mobo screws, to make sure the board was grounding properly (the screws that came with the mobo were so small that I do question whether they were actually making contact with the grounding pads surrounding the screw holes).
  • Tried a different PSU.
  • Checked both PSUs with my Dr Power II PSU tester (I would normally do this earlier in the troubleshooting, but coincidentally I had only recently tested the PSUs).
  • Tried "Bread boarding" the motherboard to see if it was shorting in the case.
Huzzah!!!

It fricking powered on as normal! This was the point at which I swapped the screws as I reinstalled the mobo into the case, reconnected all the cables (aside from the disk connections) and then... frick. The issue came back. The board would boot briefly before powering off.

That's when I was reminded of a valuable lesson... make sure you read the manual carefully! On the X10SL7-F, there's a diagnostic LED located by the case's PCI slot covers. It's labelled "LE6" and it indicates the status of the motherboard's power. I had misread the manual, so when LE6 turned red to show a fault with power delivery, I didn't think it was really an issue, as I thought LE6 was actually a different LED.

Anyway, what I knew after eating a bit of the old humble pie was that the motherboard worked whilst outside the case. What was different now that it was in the case?

All that stressing... all the effing and blinding because I was facing spending the hard-saved £220 that was earmarked for a new GPU on something less exciting... all the worrying that I might have fried something or, worse yet, somehow caused data loss. And what was the cause of the issue?

The bloody power button on the case!!! The reset button is in full working order, but the moment I plug in the power switch, the issue comes back. It's not a huge deal: it'd cost £10 at most to replace, and I can still short the pins or use IPMI to power the system on or off. But still... the cause of all that head-scratching and stress was a power button, of all things.
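
Since I mentioned driving power over IPMI: for anyone else who ends up living without a working front-panel button, the basic power controls are one-liners from any machine on the LAN. The IP and credentials below are placeholders:

Code:
# Power the system on
ipmitool -I lanplus -H 192.168.0.10 -U ADMIN -P ADMIN chassis power on
# Request a graceful OS shutdown via ACPI
ipmitool -I lanplus -H 192.168.0.10 -U ADMIN -P ADMIN chassis power soft
# Check the current power state
ipmitool -I lanplus -H 192.168.0.10 -U ADMIN -P ADMIN chassis power status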
 