Finding out the reason for unscheduled system reboots

coolnodje

Explorer
Joined
Jan 29, 2016
Messages
66
I've been notified by email of a very quick succession of unscheduled system reboots.

This was 36 hours ago. Looking in /var/log, I only have messages, messages.0.bz2 and messages.1.bz2, all pretty tiny in size.
So I can only look at the last of the reboots, the one that left my system with an unavailable pool.

Shouldn't I have more messages logs available?

Is there anything equivalent to logrotate on FreeNAS? How do you control this?

Is there anywhere else I could look to understand what happened?
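
For reference, on stock FreeBSD (which FreeNAS is built on) log rotation is handled by newsyslog(8), configured in /etc/newsyslog.conf; whether FreeNAS keeps the stock settings is an assumption here. Below is a minimal sketch for searching the live messages file and its rotated .bz2 copies in one pass. The function names are hypothetical, and the file layout is assumed to match the one listed above.

```python
import bz2
import glob
import os


def iter_log_lines(log_dir="/var/log", base="messages"):
    """Yield (filename, line) for the live log and every rotated copy."""
    # Matches messages, messages.0.bz2, messages.1.bz2, ...
    paths = sorted(glob.glob(os.path.join(log_dir, base + "*")))
    for path in paths:
        # Rotated copies are bzip2-compressed; the live file is plain text.
        opener = bz2.open if path.endswith(".bz2") else open
        with opener(path, "rt", errors="replace") as handle:
            for line in handle:
                yield os.path.basename(path), line.rstrip("\n")


def find_lines(needle, log_dir="/var/log", base="messages"):
    """Return every (filename, line) pair whose line contains `needle`."""
    return [(name, line)
            for name, line in iter_log_lines(log_dir, base)
            if needle in line]
```

Calling `find_lines("Copyright (c)")` would then list candidate boot banners, assuming the kernel boot banner is written to messages on this system; any other marker string can be passed instead.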

Best
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
It’s usually hardware. I’d check IPMI for uncorrected memory errors, or if you built without ECC, run a memory test for a few days.

Check PSU quality, temperature and cooling, particularly around older HBAs.
 

coolnodje

Never mind about /var/log/messages: all the log files were back after a successful reboot.

It's really strange, though, that the contents of /var/log were different before that reboot: only 3 files, with very different content.
That's mind-blowing.

I realize I've had IPMI errors for a long time:

Code:
Oct  3 06:50:04 freenas ipmi0: KCS error: 5f
Oct  3 06:50:04 freenas ipmi0: Failed to set watchdog
Oct  3 06:50:10 freenas ipmi0: KCS: Failed to read address
Oct  3 06:50:10 freenas ipmi0: KCS error: 5f
Oct  3 06:50:16 freenas ipmi0: KCS: Failed to read address
Oct  3 06:50:16 freenas ipmi0: KCS error: 5f
Oct  3 06:50:22 freenas ipmi0: KCS: Failed to read address
Oct  3 06:50:22 freenas ipmi0: KCS error: 5f
Oct  3 06:50:22 freenas ipmi0: Failed to set watchdog
Oct  3 06:50:38 freenas ipmi0: KCS: Failed to read address
Oct  3 06:50:38 freenas ipmi0: KCS error: 5f
Oct  3 06:50:42 freenas ipmi0: KCS: Reply address mismatch
Oct  3 06:50:42 freenas ipmi0: KCS error: 5f


Not sure how big a problem this indicates. It's probably not the cause of the reboots.
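
To see how long these errors have been around and how often each one occurs, the ipmi0 lines can be tallied per day. This is only a sketch: the regular expression assumes the syslog layout shown in the excerpt above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt above:
#   "Oct  3 06:50:04 freenas ipmi0: KCS error: 5f"
IPMI_LINE = re.compile(r"^(\w{3}\s+\d+) \d{2}:\d{2}:\d{2} \S+ ipmi0: (.+)$")


def summarize_ipmi_errors(lines):
    """Count ipmi0 messages per (day, message) pair."""
    counts = Counter()
    for line in lines:
        match = IPMI_LINE.match(line.strip())
        if match:
            day, message = match.groups()
            # Collapse the padding in "Oct  3" to a single space.
            counts[(" ".join(day.split()), message)] += 1
    return counts
```

Feeding it the lines of /var/log/messages gives a count per day and message, which makes it easier to tell whether the errors predate the reboots.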
 

Yorick


coolnodje

It definitely rhymes with my issue.
I started seeing that after a FreeNAS update not so long ago.

But it had never caused an unscheduled system reboot until now.

Note that I don't get this error systematically, only sometimes:
Code:
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory

It still seems to indicate a hardware problem. I'm not too sure what to do about it apart from changing my motherboard.
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
If it started happening after an update, it is possible your boot device has failed. I don't see any description of your system, so it's not possible to comment further....

Are you booting from an SSD or a flash drive? How about posting your system configuration as per forum requirements?
 

coolnodje

I think I've now fixed my signature so that it shows my system configuration.

I did have a failing boot device, which was still a USB dongle a month ago. I've since replaced it with an SSD.
The IPMI issue may have started when the USB dongle was failing, but in any case it's still there.

It feels more like I have to restart the BMC (unplug the power cord and plug it back in) to remedy this, which is problematic and worrying.

Still, it works fine most of the time, and this unexpected series of reboots was the first of its kind.
 

pschatz100

I would not think that a boot device could affect IPMI. I know that there have been issues with ASRock motherboards from time to time. Search the forum for your motherboard; maybe you will find some useful information. Good luck.
 

coolnodje

I still get this infamous message from time to time when running an ipmitool command from a FreeNAS prompt:
Code:
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory

That would seem to indicate a problem, BUT running an ipmitool command from another machine with the relevant parameters just works.
For example:
Code:
ipmitool -I lanplus -H 192.168.0.61 -U admin -P <pwd> <command>
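
The "try locally, then fall back to the LAN interface" pattern described above could be scripted roughly as follows. This is a sketch under assumptions: the function name and the `remote` dictionary are hypothetical, the error text is matched as a substring of whatever ipmitool prints, and the password is read from a file with ipmitool's `-f` option rather than passed with `-P`.

```python
import subprocess

# Start of the error message quoted earlier in this thread.
DEVICE_ERROR = "Could not open device"


def run_ipmitool(args, remote=None, runner=subprocess.run):
    """Run `ipmitool <args>` locally; retry over LAN if the device is missing.

    `remote` is an optional dict with "host", "user" and "password_file".
    `runner` is injectable so the logic can be exercised without ipmitool.
    """
    result = runner(["ipmitool", *args], capture_output=True, text=True)
    output = (result.stdout or "") + (result.stderr or "")
    if remote is None or DEVICE_ERROR not in output:
        return result
    # Local device node is unavailable: go through the BMC's network interface.
    lan_cmd = ["ipmitool", "-I", "lanplus",
               "-H", remote["host"],
               "-U", remote["user"],
               "-f", remote["password_file"],
               *args]
    return runner(lan_cmd, capture_output=True, text=True)
```

A call such as `run_ipmitool(["sel", "elist"], remote={"host": "192.168.0.61", "user": "admin", "password_file": "/root/.ipmi_pw"})` would then work whether or not /dev/ipmi0 is present; the password file path is a placeholder.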


As for when the reboots occurred, the only relevant entries in the IPMI event log with matching times were:
Code:
1308 10/04/2020 05:46:48 BAT Voltage Lower Critical - Going Low - Deasserted
1307 10/04/2020 05:46:48 BAT VoltageLower Non-Recoverable - Going Low - Deasserted
1306 10/04/2020 05:01:44 BAT VoltageLower Non-Recoverable - Going Low - Deasserted

This concerns the motherboard CMOS battery, and it seems to have been a false alarm as the voltage is now correct and stays stable.
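
Matching event-log entries against a reboot time can be done mechanically. The sketch below assumes the "id date time text" layout of the three entries above and a month/day/year date format; the function names and the reboot time used in the usage note are hypothetical.

```python
from datetime import datetime, timedelta


def parse_sel_line(line):
    """Split '1308 10/04/2020 05:46:48 <text>' into (id, timestamp, text)."""
    entry_id, date, time, text = line.split(None, 3)
    # Assumption: dates are month/day/year, as in the entries above.
    stamp = datetime.strptime(date + " " + time, "%m/%d/%Y %H:%M:%S")
    return entry_id, stamp, text


def events_near(lines, moment, window=timedelta(minutes=30)):
    """Return parsed entries whose timestamp is within `window` of `moment`."""
    parsed = [parse_sel_line(line) for line in lines if line.strip()]
    return [entry for entry in parsed if abs(entry[1] - moment) <= window]
```

For instance, `events_near(lines, datetime(2020, 10, 4, 5, 45))` keeps only the entries logged within 30 minutes of that placeholder time.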

I don't think the CMOS battery voltage going low during runtime could cause any problem, or could it?
 

pschatz100

I would not think a low CMOS battery would cause the system to reboot, but if you are getting a warning about it - why not replace it?
 

numbertwo

Dabbler
Joined
Jul 1, 2018
Messages
32
I have had the same crashing problem: my N40L NAS was crashing almost every other day. I replaced the CMOS battery and it is now on its second day of running; I will keep monitoring. Thanks for the clue!
 

coolnodje

It didn't make any difference in my case. Besides, the ASRock BMC module keeps reporting a CMOS battery low-voltage alert, even with a brand new battery.
 

numbertwo

Well, true enough, it didn't solve the problem. :confused: I have upgraded to 12.0-U3.1, but that didn't help either.
 

numbertwo

I just realised from the TrueNAS 12 installation page that TrueNAS no longer recommends USB as the boot drive (sadly, that simplicity is gone). An SSD or HDD is required instead, because TrueNAS writes a lot to the boot disk. So I have now moved the boot drive to an HDD, and it has been running for 8 hours straight instead of rebooting every 2-3 hours. I think that was the root cause in my case. Just sharing.

As I remember it, FreeNAS 11 still worked fine booting from USB. I miss that!
 

hcorEtheOne

Cadet
Joined
Sep 9, 2021
Messages
1
I had this same issue. It turned out I had to disable SMART on the SSD drives (Storage - Disks - Edit). My system has now been stable for 4 days; before that, it couldn't last 5 hours.
I also disabled the power surge protection in the BIOS, but it's a brand new PSU, so I doubt that was the problem.
 