My server went down while I was out of town - where to troubleshoot?

Status
Not open for further replies.

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Hi All,

I was out of town for six or seven days, and Plex (running on FreeNAS) was working fine while I was gone, and all of a sudden this morning, it wasn't working... I came back home tonight, and i see that I can't access the FreeNAS gui via the IP address. I went to go look at the box to see if it powered down, and it's still up and running right now. As in the lights are on, etc...

I can't access my server via Plex, I can't back up to Time Machine, and I can't hit the UI with the IP Address. My default response would be to just reboot the server, but I'd like to try and diagnose this before I reboot it. Any help here? Thanks!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Hook up a monitor to it and see what the local display says...
 

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Got it - I actually have a motherboard w/ IPMI so I can get display re-direction. I didn't think of that... Well, I already rebooted, and it works just fine. I also upgraded to the newest version of FreeNAS. I wonder if the issue is having something plugged into one of my Marvell controller SATA ports: https://forums.freebsd.org/viewtopic.php?t=45197

Side question: I have few enough drives that I can switch them all over to the Intel ports. If I power the machine down, move one of the cables from one sata port to another, what is the expected outcome? A confused FreeNAS? Or will it figure it out?

Also, I doubt that I need to worry about not being able to full use 6gbps sata ports... Not all of the Intel ports are 6gbps... but I'm on spinning disks so what's the matter anyways?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
As long as it's a port that is on a controller that is supported by FreeNAS and isn't a damn RAID controller, FreeNAS doesn't care what controller your disks are on. It only cares that the disks work.

So the expected outcome is "everything will just work".

6Gb SATA doesn't matter for spinning disks. In fact, SATA 1.5Gb is faster than most disks anyway...
 

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Cool - Thanks. Tomorrow I'll power the NAS down and move everything to the Intel controller ports.
 

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Update... I went to Fry's today to buy some longer SATA cables. I wanted to make sure all drives were attached to the Intel controllers. They now are. I rebooted the FreeNAS box, and it worked just fine... for a bit... Then I saw that one of my Time Machine backups failed. I tried to go into the web UI, but no go... I loaded up IPMI to see what was on the screen, and I got the attached screenshot.

FYI for specs... I'm running the ASRock C2750 board, 32GB of ECC RAM on the ASRock approved list, 1 SSD for jails, 5x4TB drives in RaidZ2.

Thanks in advance - I'm confused on this one....
 

Attachments

  • Screen Shot 2014-04-30 at 11.15.07 PM.png
    Screen Shot 2014-04-30 at 11.15.07 PM.png
    182.4 KB · Views: 292

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Rebooted and all is well for now... until it likely crashes again.....

checked screen output... below...
 

Attachments

  • Screen Shot 2014-04-30 at 11.30.51 PM.png
    Screen Shot 2014-04-30 at 11.30.51 PM.png
    42.6 KB · Views: 281

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Hmmm it crashed again. Pic from screen output below:
 

Attachments

  • Screen Shot 2014-05-01 at 6.08.55 AM.png
    Screen Shot 2014-05-01 at 6.08.55 AM.png
    42.1 KB · Views: 296

warri

Guru
Joined
Jun 6, 2011
Messages
1,193
It doesn't look just a normal crash, the running processes actually seem to be ended gracefully (SIGTERM, exit code 15).
Also the terminal still seems to accept input (^[[A in the last screenshot).

Can you check the logs to see what happens before this? This information got cleared from the screen due to restricted space, but should be available in the log files at your .system dataset or /var/log
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Are you using the Marvell controller on that board?
 

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Also... Just to give a few more steps in the "how I got here" story. Originally, I had 5 4TB drives plugged into the Marvell controller and 1 ssd plugged into the intel controller. I was running 9.2.1.3 without issue. My uptime would be days / weeks, and my only reboots would be when I forced it for some reason. Then 4 / 5 came out, and I made the upgrade. I upgraded to 9.2.1.4 right before I left town for about ten days. Plex worked fine while I was gone until the last couple of days. I came back and found it crashed. I thought maybe I just needed to move to 9.2.1.5, and I made the move. The crash happened again soon after. I did some reading, and then I tonight maybe I was hitting a bug with the Marvell sata controllers. I went to fry's to get some longer cables, and I moved everything over to the intel ports. My fingers were crossed that this would fix the issue, but the crash happened again soon after. I appreciate the support / help here. Hopefully we can get to the root of this - thanks!!
Are you using the Marvell controller on that board?

Nope... Per my previous post, I went to Fry's to get longer data cables and now every drive is connected to the intel controller.
Can you check the logs to see what happens before this? This information got cleared from the screen due to restricted space, but should be available in the log files at your .system dataset or /var/log

Absolutely. I will do this when I get home. Question, though... How do I check the log? My understanding was that after every reboot, the log gets wiped. What is the best way to capture it? I imagine I will be able to reproduce this pretty quickly.
 

warri

Guru
Joined
Jun 6, 2011
Messages
1,193
If you enable Use system dataset for syslog in the Advanced Settings the log is persisted in a dataset called .system on the pool specified with System dataset pool. Then it'll survive a reboot and you can find the log in /mnt/<pool>/.system/syslog/log/messages

You didn't really describe the exact state the system is in after such a "crash", only that the services shut down. But are you still able to make keyboard inputs at the box when it happens?
 

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
You didn't really describe the exact state the system is in after such a "crash", only that the services shut down. But are you still able to make keyboard inputs at the box when it happens?

Yes! You noticed those random characters at the bottom of a screenshot, and those are because I hit a couple keys. I can also use IPMI to send ctrl+alt+delete to reboot the server.

When I get home this evening, I'll enable the logging on the system dataset pool. Seriously, thank you so much for the help thus far.
 

warri

Guru
Joined
Jun 6, 2011
Messages
1,193
No worries, I'm curious about the cause of this myself ;)

If you use the SSH service and it doesn't turn off you could also login via SSH and retrieve the log this way without having to reboot the server. Let's wait and see what we can find out..
 

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Okay so I have pasted some logs inline... Not sure if I went far enough up, but it should go through at least two reboots. I have also attached the log file.

The log file is here.

Code:
Apr 30 23:28:34 Hanson-NAS notifier: Performing sanity check on Samba configuration: OK
Apr 30 23:28:34 Hanson-NAS notifier: Starting nmbd.
Apr 30 23:28:34 Hanson-NAS notifier: Starting smbd.
Apr 30 23:28:34 Hanson-NAS notifier: Starting winbindd.
Apr 30 23:28:34 Hanson-NAS winbindd[9906]: [2014/04/30 23:28:34.540434,  0] ../source3/winbindd/winbindd_cache.c:3196(initialize_winbindd_cache)
Apr 30 23:28:34 Hanson-NAS winbindd[9906]:  initialize_winbindd_cache: clearing cache and re-creating with version number 2
Apr 30 23:36:40 Hanson-NAS kernel: ifa_del_loopback_route: deletion failed
Apr 30 23:36:40 Hanson-NAS kernel: Freed UMA keg (udp_inpcb) was not empty (80 items).  Lost 8 pages of memory.
Apr 30 23:36:40 Hanson-NAS kernel: Freed UMA keg (udpcb) was not empty (1008 items).  Lost 6 pages of memory.
Apr 30 23:36:40 Hanson-NAS kernel: Freed UMA keg (tcptw) was not empty (300 items).  Lost 6 pages of memory.
Apr 30 23:36:40 Hanson-NAS kernel: Freed UMA keg (tcp_inpcb) was not empty (90 items).  Lost 9 pages of memory.
Apr 30 23:36:40 Hanson-NAS kernel: Freed UMA keg (tcpcb) was not empty (44 items).  Lost 11 pages of memory.
Apr 30 23:36:40 Hanson-NAS kernel: hhook_vnet_uninit: hhook_head type=1, id=1 cleanup required
Apr 30 23:36:40 Hanson-NAS kernel: hhook_vnet_uninit: hhook_head type=1, id=0 cleanup required
Apr 30 23:43:49 Hanson-NAS afpd[10492]: read: Operation timed out
Apr 30 23:43:49 Hanson-NAS afpd[10492]: dsi_stream_read: len:-1, Operation timed out
Apr 30 23:43:49 Hanson-NAS afpd[10492]: dsi_stream_send: Broken pipe
Apr 30 23:43:49 Hanson-NAS afpd[10492]: dsi_wrtreply: Broken pipe
Apr 30 23:43:49 Hanson-NAS afpd[10492]: dsi_disconnect: entering disconnected state
Apr 30 23:43:49 Hanson-NAS afpd[10492]: dsi_disconnect: entering disconnected state
May  1 03:01:27 Hanson-NAS ntpd[2492]: sendto(199.102.46.72) (fd=22): No route to host
May  1 06:13:46 Hanson-NAS afpd[10492]: Disconnected session terminating
May  1 06:13:46 Hanson-NAS mDNSResponder: mDNSResponder (Engineering Build) (Apr 23 2014 17:17:29) stopping
May  1 06:13:46 Hanson-NAS kernel: .
May  1 06:13:46 Hanson-NAS mDNSResponder: mDNS_FinalExit failed to send goodbye for: 0000000800CBFD80 01  35 _afpovertcp._tcp.local. PTR Hanson-NAS._afpovertcp._tcp.local.
May  1 06:13:46 Hanson-NAS mDNSResponder: mDNS_FinalExit failed to send goodbye for: 0000000800CC1580 01  30 _adisk._tcp.local. PTR Hanson-NAS._adisk._tcp.local.
May  1 06:13:46 Hanson-NAS mDNSResponder: mDNS_FinalExit failed to send goodbye for: 0000000800CC2D80 01  36 _device-info._tcp.local. PTR Hanson-NAS._device-info._tcp.local.
May  1 06:13:46 Hanson-NAS mDNSResponder: mDNS_FinalExit failed to send goodbye for: 0000000800CBCD80 01  29 _http._tcp.local. PTR Hanson-NAS._http._tcp.local.
May  1 06:13:46 Hanson-NAS mDNSResponder: mDNS_FinalExit failed to send goodbye for: 0000000800CBE580 01  28 _smb._tcp.local. PTR Hanson-NAS._smb._tcp.local.
May  1 06:13:46 Hanson-NAS smbd[9902]: dnssd_clientstub DNSServiceProcessResult called with DNSServiceRef with no ProcessReply function
May  1 06:13:46 Hanson-NAS kernel: .
May  1 06:13:46 Hanson-NAS kernel: .
May  1 06:13:46 Hanson-NAS winbindd[9906]: [2014/05/01 06:13:46.917492,  0] ../source3/winbindd/winbindd.c:234(winbindd_sig_term_handler)
May  1 06:13:46 Hanson-NAS winbindd[9906]:  Got sig[15] terminate (is_parent=1)
May  1 06:13:46 Hanson-NAS winbindd[9908]: [2014/05/01 06:13:46.922364,  0] ../source3/winbindd/winbindd.c:234(winbindd_sig_term_handler)
May  1 06:13:46 Hanson-NAS winbindd[9908]:  Got sig[15] terminate (is_parent=0)
May  1 06:13:48 Hanson-NAS kernel: , 9902.
May  1 06:13:48 Hanson-NAS ntpd[2492]: ntpd exiting on signal 15
May  1 06:13:48 Hanson-NAS kernel: .
May  1 06:13:48 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 1
May  1 06:13:49 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 2
May  1 06:13:50 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 3
May  1 06:13:51 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect() failed Socket:4 Err:-1 Errno:61 Connection refused
May  1 06:13:53 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 1
May  1 06:13:54 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 2
May  1 06:13:55 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 3
May  1 06:13:56 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect() failed Socket:4 Err:-1 Errno:61 Connection refused
May  1 06:13:58 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 1
May  1 06:13:59 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 2
May  1 06:14:00 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 3
May  1 06:14:01 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect() failed Socket:4 Err:-1 Errno:61 Connection refused
May  1 06:14:03 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 1
May  1 06:14:04 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 2
May  1 06:14:05 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 3
May  1 06:14:06 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect() failed Socket:4 Err:-1 Errno:61 Connection refused
May  1 06:14:08 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 1
May  1 06:14:09 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 2
May  1 06:14:10 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 3
May  1 06:14:11 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect() failed Socket:4 Err:-1 Errno:61 Connection refused
May  1 06:14:13 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 1
May  1 06:14:14 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 2
May  1 06:14:15 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect()-> No of tries: 3
May  1 06:14:16 Hanson-NAS python: dnssd_clientstub ConnectToServer: connect() failed Socket:4 Err:-1 Errno:61 Connection refused
May  1 06:14:17 Hanson-NAS kernel: .
May  1 06:14:17 Hanson-NAS syslogd: exiting on signal 15
May  1 18:10:25 Hanson-NAS newsyslog[2223]: logfile first created
May  1 18:10:25 Hanson-NAS syslogd: kernel boot file is /boot/kernel/kernel
May  1 18:10:25 Hanson-NAS kernel: Copyright (c) 1992-2013 The FreeBSD Project.
May  1 18:10:25 Hanson-NAS kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
 

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Sigh... It happened again. Here is the second set of logs.

I have emptied out the log file now so that when it happens again, hopefully it will just be relevant information.
 

warri

Guru
Joined
Jun 6, 2011
Messages
1,193
I can't see anything out of the ordinary. A couple of things I noticed:

Samba seems to complain about something, but I don't think it's critical (https://bugs.freenas.org/issues/4027)
Apr 30 21:36:09 Hanson-NAS smbd[11669]: matchname: host name/name mismatch: 192.168.1.220 != (NULL)
[..]
Apr 30 21:36:15 Hanson-NAS smbd[11672]: matchname failed on 192.168.1.220
This could be related to this mDNS thing (see https://bugs.freenas.org/issues/4591):
Apr 30 23:20:50 Hanson-NAS mDNSResponder: mDNS_FinalExit failed to send goodbye for: 0000000800CBFD80 01 35 _afpovertcp._tcp.local. PTR Hanson-NAS._afpovertcp._tcp.local.

You occasionally get these messages, but according to a dev they are not important:
Apr 30 21:36:57 Hanson-NAS kernel: Freed UMA keg (udp_inpcb) was not empty (80 items). Lost 8 pages of memory.

AFP throws some warnings? Not sure if important, I don't use AFP.
Apr 30 22:07:19 Hanson-NAS afpd[12087]: dsi_wrtreply: Bad file descriptor
[..]
Apr 30 23:10:03 Hanson-NAS afpd[13607]: dsi_stream_read: len:-1, Operation timed out

Here is what I would do to identify if a service is responsible:
  • To get rid of the matchname and DNS errors, try the workaround suggested in https://bugs.freenas.org/issues/4591 (uncheck "hostname lookups" under CIFS settings). If that helps, feel free to add your voice to the bug report ;)
  • Shut down all services and see if crashes keep happening.
  • If not, start one service (either AFP or CIFS) and see if it keeps happening.. you get the picture ;)
  • Consider ditching AFP altogether, somewhere people said AFP is outdated and shouldn't be used on newer Macs anymore and that you should go for CIFS instead. Again, I'm not a Mac user, so that's your call to decide. I remember reading that time machine only supportsd AFP or something like that.
Do you run any jails?
 

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Thanks for all of the suggestions... I woke up this morning, and surprisingly, it still had not crashed. AFP is crucial for me, as it is what Time Machine uses, and Time Machine for all of my Macs is a large reason for having a NAS, in my case.

I went into the BIOS, and I completely disabled the Marvell controller. I also enabled SMART (surprised it wasn't on by default). My BIOS is completely up to date. It's been up for the whole day, and I'm just going to continue to use it normally and monitor it. Bummer that we weren't able to find the issue!
 

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
And it crashed again... Here's the link to the log: https://dl.dropboxusercontent.com/u/6447223/messages-3

Feeling stupid... Sold my Synology and spent a good amount of extra $ to build this "sweet" FreeNAS box, and I can't get basic stability... Grrr... Hopefully we can find a solution. I'll post a bug on bugs.freenas and see what I get.
 

Attachments

  • Screen Shot 2014-05-04 at 2.21.12 PM.png
    Screen Shot 2014-05-04 at 2.21.12 PM.png
    178.9 KB · Views: 269

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
There's no crash on that screenshot.

But, hhook_vnet_uninit is usually from a jail using VIMAGE.
 
Status
Not open for further replies.
Top