Getting "Unscheduled system reboot" emails and finding our NAS has rebooted, several times a day

Status
Not open for further replies.

daelsutton

Cadet
Joined
Nov 18, 2018
Messages
8
Good afternoon, FN experts.

My FN box has been running pretty stably for a few months, but over the last week or so it has been rebooting overnight, and sometimes during the day, and I don't know why.

Yesterday I upgraded it from 11.1U5 to 11.1U6 (and in the process upgraded the boot USB stick from 2GB to 16GB). Today I have had two reboots.

Are there any logs somewhere I can interrogate that might help me identify what's causing the reboots?

Build FreeNAS-11.1-U6
Platform Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
Memory 8009MB


5x 240GB Intel 530 SATA SSDs, 2x NIC (em0 & re0) in a LAGG.

My gut says it's the machine's hardware, but I'd like to find something to watch for before swapping the hardware out.

D.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Are there any logs somewhere I can interrogate that might help me identify what's causing the reboots?

Check the /data/crash directory for information, as well as the contents of the older /var/log/messages files (messages.1.bz2, etc.).
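For anyone who would rather script that check, here is a minimal sketch in Python. It assumes the paths mentioned above (/data/crash and /var/log/messages*), and the keyword list is only my guess at strings that might show up before a crash, not an official list. The function names are made up for illustration.

```python
import bz2
import glob
import os

# Guessed keywords that may precede a crash or reboot; adjust to taste.
KEYWORDS = ("panic", "fatal", "mca:", "watchdog", "shutdown", "reboot")


def open_log(path):
    """Open a plain or bzip2-compressed log file as text."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def scan_logs(pattern="/var/log/messages*", keywords=KEYWORDS):
    """Return (path, line) pairs for lines containing any keyword (case-insensitive)."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        with open_log(path) as fh:
            for line in fh:
                lowered = line.lower()
                if any(k in lowered for k in keywords):
                    hits.append((path, line.rstrip("\n")))
    return hits


def list_crash_dir(crash_dir="/data/crash"):
    """Return the file names in the crash directory, or [] if it does not exist."""
    if not os.path.isdir(crash_dir):
        return []
    return sorted(os.listdir(crash_dir))


if __name__ == "__main__":
    print("crash files:", list_crash_dir())
    for path, line in scan_logs():
        print(f"{path}: {line}")
```

If both come back empty after a reboot, that in itself is a hint that the logs are not surviving the restart.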

If you have out-of-band management on your motherboard (iLO/IPMI) the event log there should also give you a hint.

Also please add motherboard and PSU make/model to your original post.

Your gut is probably right though. Could be that your Intel NIC is rebelling against being put in a team with Realtek. ;)
 

daelsutton

Cadet
Joined
Nov 18, 2018
Messages
8
OK, it looks like my logs and crash directory aren't persistent. I had a reboot at 8pm last night; /var/log/messages starts at the reboot and there are no .bz2 or other history logs for it. /data/crash is empty. (Which raises the next question: how can I make the logs persistent?)
Hardware-wise, the system board is a Gigabyte H97N-WIFI and the power supply is a CRS 350W, model MPT-350.
The unit is an older desktop used for one purpose: holding Win10 sysprep images for our desktop imaging process to pull from (CIFS) when one of our IT staff reimages a workstation. It's not doing iSCSI or anything. I pull a copy of the share via rsync once a day, but that backup runs between 1700 & 1900 and I haven't seen a reboot during that window yet.
 

daelsutton

Cadet
Joined
Nov 18, 2018
Messages
8
I've just checked and the system dataset was already pointing at my volume. I didn't have syslog turned on in there, so I have enabled it now. I'll keep an eye on my syslog server and see if we get anything sent prior to the next crash.
 

daelsutton

Cadet
Joined
Nov 18, 2018
Messages
8
I got to see it this morning just before it rebooted. Does FN self-reboot if it can't communicate with its syslog host and/or LAN after a set period of time? I could use the console of the machine and even try to ping (nothing happening), so it looked like the LAGG had dropped. If FN attempts to fix network issues by rebooting when the LAGG drops, that might explain this issue. I'll reboot the switch that it's plugged into later today and see if that helps.
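To catch the LAGG state before the next reboot, something like the following rough sketch could append periodic snapshots to a file on the pool so they survive the restart. The interface name lagg0 and the log path /mnt/tank/lagg-watch.log are assumptions for illustration, and the parser only relies on the `status:` and `laggport:` lines that FreeBSD's `ifconfig` prints for a lagg interface.

```python
import re
import subprocess
import time


def parse_lagg(ifconfig_text):
    """Extract the overall status and per-port flags from `ifconfig laggN` text."""
    status = None
    ports = {}
    for raw in ifconfig_text.splitlines():
        line = raw.strip()
        if line.startswith("status:"):
            status = line.split(":", 1)[1].strip()
        m = re.match(r"laggport:\s+(\S+)\s+flags=\w+<([^>]*)>", line)
        if m:
            # An empty flag list (e.g. "flags=0<>") means the port is not active.
            ports[m.group(1)] = [f for f in m.group(2).split(",") if f]
    return status, ports


def lagg_snapshot(iface="lagg0"):
    """Run ifconfig on the (assumed) lagg interface and parse the result."""
    proc = subprocess.run(
        ["ifconfig", iface],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_lagg(proc.stdout)


if __name__ == "__main__":
    # Hypothetical path on the data pool, so the file persists across reboots.
    log_path = "/mnt/tank/lagg-watch.log"
    while True:
        status, ports = lagg_snapshot()
        with open(log_path, "a") as fh:
            fh.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} status={status} ports={ports}\n")
        time.sleep(60)
```

The last few lines of that file after a reboot would show whether a port lost its ACTIVE flag shortly beforehand.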
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
If FN attempts to fix network issues by rebooting when the LAGG drops, that might explain this issue.

Absolutely not, that would be a hilariously bad response to a failure state.

An automatic reboot, rather than crashing to a debug prompt, would seem to implicate the hardware. Can you call a downtime window to do some stability testing (e.g. booting to a memtest86 ISO) on this machine?
 

daelsutton

Cadet
Joined
Nov 18, 2018
Messages
8
Very good call, thank you @HoneyBadger. It died again very recently, so I booted it up and ran memtest 5.01, and the hardware hung 3m 21s into the test. I then blew the dust out of it, reseated the RAM and tried again, and it hung at 3m 21s again. I've now replaced the RAM (1x 8GB DDR3-1600 stick) with spare RAM I had kicking around in an anti-static bag (2x 2GB DDR3-1333 sticks) and reran the memtest, which passed with flying colours. I've now booted FN back up on half the original total RAM and I'll keep a close eye on it.
 

daelsutton

Cadet
Joined
Nov 18, 2018
Messages
8
Update: a couple of days later it hung again and then wouldn't power back on. I shifted the SSDs and boot stick to a new, identical machine and it fired up OK, so it was most likely the PSU all along. For a while I had been able to keep the old one stable by disabling hyperthreading and multi-core on the CPU, and I made the same changes on the new hardware. Late last week I re-enabled the second CPU core in the BIOS, and just now it hung/rebooted. This could be nothing, but I will continue to monitor it.
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
was able to get the old one stable by disabling hyperthreading & multicores on the CPU for a while,
To a noob like me, the above suggests thermal conditions verification...
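A quick way to check that might be a small Python sketch like this one. It assumes the FreeBSD coretemp driver is loaded, so that `sysctl dev.cpu.N.temperature` exists and returns a value such as "45.0C"; the function names are made up for illustration.

```python
import os
import subprocess


def parse_temp(value):
    """Convert a sysctl temperature string such as '45.0C' to degrees Celsius."""
    return float(value.strip().rstrip("C"))


def read_cpu_temps(ncpu=None):
    """Read dev.cpu.N.temperature for each logical CPU; skip CPUs with no reading."""
    if ncpu is None:
        ncpu = os.cpu_count() or 1
    temps = []
    for i in range(ncpu):
        proc = subprocess.run(
            ["sysctl", "-n", f"dev.cpu.{i}.temperature"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
        if proc.stdout.strip():
            temps.append(parse_temp(proc.stdout))
    return temps


if __name__ == "__main__":
    temps = read_cpu_temps()
    if temps:
        print("per-core temperatures (C):", temps, "max:", max(temps))
    else:
        print("no temperature readings (is coretemp loaded?)")
```

Running it under load with all cores enabled, and again with cores disabled, would show whether the stable configuration simply runs cooler.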

 