Occasional errors/reboots - what is the culprit?

Status
Not open for further replies.

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The delay still worries me, to be honest.

Any chance there's an errant contact with the chassis where there should be none?
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
The delay still worries me, to be honest.

Any chance there's an errant contact with the chassis where there should be none?

I am not sure what to attribute it to, but I was never able to reproduce the delay. I did identify one faulty SATA cable during hotbox testing (parity errors); the errors went away after I replaced the cable. (Since I'd replaced the motherboard, I could have just tried re-seating the cable, but while I was in there I simply used a new one.)

I think I've got a resolution - after swapping the replacement M1015 for the one that was originally installed and doing some hotbox testing, I think it's time to conclude that the problems were the following:
  • faulty motherboard
  • faulty replacement M1015 card
I've got another M1015 on order as a further contingency... but so far it seems to be holding up (even with the old CPU). I've tested with all of the drives scrubbing + CPU stress test + reduced ventilation to let it all heat up quite a bit (in a controlled way, of course), with no errors. I've also run memtest without any problems.

Thank you very much everyone for the advice! (Now to the adventure of upgrading FreeNAS and the corresponding driver update on the M1015).
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
Good to hear. Two faults make for a difficult problem to solve!
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
OK, I think I may have jumped the gun on this. This morning I saw that the server had rebooted overnight. However, this time there was no watchdog-related message in the IPMI log; the only thing on the console I'd left open overnight was this:

Code:
Broadcast Message from root@freenas.local                                 
        (no tty) at 2:18 CEST...                                         
                                                                          
Communications with UPS ups lost


Literally nothing out of the ordinary in the logs - the machine just rebooted at 02:24, as I can see the boot messages in the log. Nothing precedes them, though. I am really stumped on how to debug this, as there is nothing in the IPMI logs this time that indicates the same kind of lockup as before.

Is there any other place I can look? My only other idea is to keep another machine recording the IPMI screen until it happens again, to at least see what is on the screen at that moment.

A faulty UPS has crossed my mind (I did move it to the testbed when working on this machine), but again - how to debug this? For this I was thinking about plugging a Raspberry Pi into the UPS as well, to check whether it also reboots, but I am slightly worried that its power supply can buffer a longer dropout than my PC's.

Here is an excerpt from the logs:

Code:
Oct  1 01:34:36 freenas smbd[24885]:   STATUS=daemon 'smbd' finished starting up and ready to serve connectionsmatchname: host name/name mismatch: 192.168.2.5 != (NULL)
Oct  1 01:34:36 freenas smbd[24885]: [2015/10/01 01:34:36.782883,  0] ../source3/lib/util_sock.c:1199(get_remote_hostname)
Oct  1 01:34:36 freenas smbd[24885]:   matchname failed on 192.168.2.5
Oct  1 01:39:38 freenas afpd[26154]: Login by petr (AFP3.4)
Oct  1 01:41:28 freenas afpd[26154]: AFP logout by petr
Oct  1 01:41:28 freenas afpd[26154]: AFP statistics: 25834.58 KB read, 256341.42 KB written
Oct  1 01:41:28 freenas afpd[26154]: done
Oct  1 01:44:35 freenas kernel: arp: 192.168.2.1 moved from 02:d9:47:67:1a:00 to 00:0e:b6:8c:97:50 on epair1b
Oct  1 01:51:30 freenas kernel: arp: 192.168.2.1 moved from 02:d9:47:67:1a:00 to 00:0e:b6:8c:97:50 on epair0b
Oct  1 01:55:15 freenas kernel: arp: 192.168.2.1 moved from 02:d9:47:67:1a:00 to 00:0e:b6:8c:97:50 on epair9b
Oct  1 02:01:55 freenas smbd[32147]:   STATUS=daemon 'smbd' finished starting up and ready to serve connectionsmatchname: host name/name mismatch: 192.168.2.6 != (NULL)
Oct  1 02:01:55 freenas smbd[32147]: [2015/10/01 02:01:55.563654,  0] ../source3/lib/util_sock.c:1199(get_remote_hostname)
Oct  1 02:01:55 freenas smbd[32147]:   matchname failed on 192.168.2.6
Oct  1 02:11:30 freenas kernel: arp: 192.168.2.1 moved from 02:d9:47:67:1a:00 to 00:0e:b6:8c:97:50 on epair0b
Oct  1 02:24:10 freenas syslog-ng[2188]: syslog-ng starting up; version='3.5.6'
Oct  1 02:24:10 freenas Copyright (c) 1992-2014 The FreeBSD Project.
Oct  1 02:24:10 freenas Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Oct  1 02:24:10 freenas     The Regents of the University of California. All rights reserved.
Oct  1 02:24:10 freenas FreeBSD is a registered trademark of The FreeBSD Foundation.
Oct  1 02:24:10 freenas FreeBSD 9.3-RELEASE-p26 #1 r281084+59f7d05: Mon Sep 21 11:47:33 PDT 2015
Oct  1 02:24:10 freenas root@build3.ixsystems.com:/tank/home/jkh/build/FN/objs/os-base/amd64/tank/home/jkh/build/FN/FreeBSD/src/sys/FREENAS.amd64 amd64
Oct  1 02:24:10 freenas gcc version 4.2.1 20070831 patched [FreeBSD]
...
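
One thing that can be read from the log alone is how long it was silent before each boot. Below is a minimal Python sketch of that check, not a finished tool. It assumes the classic 15-character syslog timestamp seen in the excerpt and treats any line containing `syslog-ng starting up` as the start of a boot. The function names and the `year` parameter are made up for illustration (syslog lines carry no year).

```python
from datetime import datetime

# Assumption: a line containing this text marks the start of a boot,
# as it does in the excerpt above.
BOOT_MARKER = "syslog-ng starting up"


def parse_timestamp(line, year):
    """Parse the leading 'Oct  1 02:24:10' stamp; the year must be supplied."""
    # The classic syslog stamp is 15 characters wide ("Mmm dd HH:MM:SS").
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def silence_before_boots(lines, year):
    """Return (last_line_time, boot_time, gap) for every boot marker found."""
    results = []
    previous = None
    for line in lines:
        try:
            stamp = parse_timestamp(line, year)
        except ValueError:
            continue  # skip lines without a syslog timestamp, e.g. "..."
        if BOOT_MARKER in line and previous is not None:
            results.append((previous, stamp, stamp - previous))
        previous = stamp
    return results
```

Fed the last pre-boot line and the `syslog-ng starting up` line from the excerpt with `year=2015`, this reports a silent gap of 12 minutes 40 seconds (02:11:30 to 02:24:10), so the 2:18 console message falls inside the stretch where nothing was logged.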


EDIT:
As the IPMITools recording is utter crap and stops automatically after a few minutes, I've set up screen recording of the actual IPMI window. I've also plugged a Raspberry Pi into the same UPS to see if there are any power cycles/resets.
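
For the Raspberry Pi side, a heartbeat file is one simple way to spot a power cycle after the fact. The following is only a sketch under assumptions: the file name, the 10-second interval, and all function names are hypothetical, and the script would need to be started at boot by whatever means the Pi's OS offers.

```python
import os
import time
from pathlib import Path

# Hypothetical values: choose a path on the Pi's storage and an interval
# short enough to catch brief outages.
HEARTBEAT_FILE = Path("heartbeat.log")
INTERVAL_SECONDS = 10


def find_gaps(timestamps, max_gap):
    """Return (before, after) pairs where consecutive beats are too far apart."""
    return [(a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > max_gap]


def read_beats(path):
    """Read one epoch timestamp per line, skipping any torn or empty line."""
    beats = []
    if path.exists():
        for line in path.read_text().splitlines():
            try:
                beats.append(float(line))
            except ValueError:
                continue
    return beats


def run(path=HEARTBEAT_FILE, interval=INTERVAL_SECONDS):
    """Append the current time forever, syncing so a power cut loses at most one beat."""
    with path.open("a") as handle:
        while True:
            handle.write(f"{time.time():.0f}\n")
            handle.flush()
            os.fsync(handle.fileno())
            time.sleep(interval)
```

Calling `run()` on the Pi and later `find_gaps(read_beats(HEARTBEAT_FILE), max_gap=3 * INTERVAL_SECONDS)` lists the stretches where the Pi was not writing. One caveat: most Raspberry Pi models have no battery-backed clock, so timestamps written right after a power loss can be wrong until the time is synced again.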
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
Does sound as though it could be due to the UPS. Lost communication - shutdown - closedown UPS - UPS restarts after a delay - reboots server.
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
Does sound as though it could be due to the UPS. Lost communication - shutdown - closedown UPS - UPS restarts after a delay - reboots server.

Will keep my eye on it... According to the logs, though, no shutdown was initiated - either the server rebooted suddenly due to an error on its part, or the UPS cut the power. When a shutdown happens properly, I can see quite a few messages in the log. Is that what you meant?
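
The "proper shutdown leaves messages" check can be automated roughly as follows. This is a sketch, not a definitive classifier: it assumes a boot starts at a line containing `syslog-ng starting up` (as in the excerpt) and that an orderly shutdown leaves a `syslog-ng shutting down` line. That second marker is an assumption and should be replaced with whatever this system actually logs on a clean shutdown. The function name and `lookback` parameter are hypothetical.

```python
# Assumption: this text marks the start of a boot, as in the log excerpt.
BOOT_MARKER = "syslog-ng starting up"
# Assumption: text that appears in the log during an orderly shutdown.
# Adjust to match the messages seen on this system.
SHUTDOWN_MARKERS = ("syslog-ng shutting down",)


def classify_boots(lines, lookback=50):
    """Label each boot 'clean' if a shutdown marker precedes it, else 'abrupt'."""
    verdicts = []
    window_start = 0  # never look back past the previous boot
    for index, line in enumerate(lines):
        if BOOT_MARKER not in line:
            continue
        window = lines[max(window_start, index - lookback):index]
        clean = any(
            marker in earlier for earlier in window for marker in SHUTDOWN_MARKERS
        )
        # The first 15 characters are the syslog timestamp of the boot line.
        verdicts.append((line[:15], "clean" if clean else "abrupt"))
        window_start = index + 1
    return verdicts
```

An "abrupt" verdict does not say whether the server reset itself or lost power; it only confirms that no orderly shutdown was logged before the boot.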
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
Will keep my eye on it... According to the logs, though, no shutdown was initiated - either the server rebooted suddenly due to an error on its part, or the UPS cut the power. When a shutdown happens properly, I can see quite a few messages in the log. Is that what you meant?

You're probably right; I can't remember what gets logged. Perhaps it is your original problem back, and the loss of communication with the UPS is a coincidence. But a UPS fault is still possible?
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
You're probably right; I can't remember what gets logged. Perhaps it is your original problem back, and the loss of communication with the UPS is a coincidence. But a UPS fault is still possible?

I do not think it's the same - previously I could see a reboot caused by the motherboard's watchdog being triggered after a system freeze, and it showed up in the IPMI log. This time, the log is blank.

Will keep my eye on it and report back should I find out anything new.
 