NAS down after "Timer Interrupt"

Status
Not open for further replies.

rosabox

Explorer
Joined
Jun 8, 2016
Messages
77
Yesterday I got two email messages from IPMI:

Code:
SEL_TIME: 2018/05/20 08:34:31
SENSOR_NUMBER: ca
SENSOR_TYPE: Watchdog 2
SENSOR_NAME:
EVENT_DESCRIPTION: Out of Spec Definition (0x08)
EVENT_DIRECTION: Assertion
EVENT SEVERITY:"non-critical"


Code:
SEL_TIME: 2018/05/20 08:34:32
SENSOR_NUMBER: ca
SENSOR_TYPE: Watchdog 2
SENSOR_NAME:
EVENT_DESCRIPTION: Hard Reset
EVENT_DIRECTION: Assertion
EVENT SEVERITY:"non-critical"


The same two events are in the IPMI event log, but nothing more:
Code:
Event ID		 Time Stamp		 Sensor Name		 Sensor Type		 Description 
1	2018/05/20 08:34:31	#0xca	Watchdog 2	Timer Interrupt - Assertion
2	2018/05/20 08:34:32	#0xca	Watchdog 2	Hard Reset - Assertion


Today I came to work and found the NAS down (powered off).
I powered it up remotely using IPMI and all seems to be fine (I got an "Unauthorized system reboot" email, but that was expected).

But:
1) I have no idea what the problem was. I couldn't find any "ca" sensor or any other clue.
2) I assumed that "Hard Reset" means the server reboots, but I found it powered off.
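
One note on the "ca" sensor: the event log shows it as #0xca, so it is a hexadecimal sensor number (decimal 202) rather than a name. A minimal Python sketch of decoding the two event log rows, assuming the columns are tab-separated; the function and field names are made up for illustration:

```python
from datetime import datetime

# The two rows from the IPMI event log, assumed to be tab-separated.
ROWS = (
    "1\t2018/05/20 08:34:31\t#0xca\tWatchdog 2\tTimer Interrupt - Assertion\n"
    "2\t2018/05/20 08:34:32\t#0xca\tWatchdog 2\tHard Reset - Assertion\n"
)


def parse_sel_rows(text):
    """Split each row into its five columns and decode the obvious fields."""
    events = []
    for line in text.strip().splitlines():
        event_id, stamp, sensor, sensor_type, description = line.split("\t")
        events.append({
            "id": int(event_id),
            "time": datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"),
            "sensor_number": int(sensor.lstrip("#"), 16),  # "#0xca" -> 202
            "sensor_type": sensor_type,
            "description": description,
        })
    return events


events = parse_sel_rows(ROWS)
gap = events[1]["time"] - events[0]["time"]
print(events[0]["sensor_number"], gap.total_seconds())
```

The one-second gap between the two rows suggests the hard reset was the action triggered by the timer interrupt, not a separate fault.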

Any clues, hints, explanations?

Here are the specs of the NAS:
FreeNAS-11.1-U4
Mobo - Supermicro X10SDV-2C-TP4F
Chassis - Supermicro CSE-847BE1C-R1K28LPB
RAM - 2x 32GB DDR4 MEM-DR-432L-HL02
HBA - Supermicro AOC-S3008L-L8e
DOM - 2x 16GB Supermicro SSD-DM016-SMCMV
Cabling - 2x Supermicro CBL-SAST-0531
HDD - 18x Toshiba Nearline 8TB 3,5" SATA3 (2x8 vdev + 2 spare)
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Is there anything in the NAS log files?

 

rosabox

Explorer
Joined
Jun 8, 2016
Messages
77
Chris Moore:
I've looked through the logs in /var/db/system/syslog-.../log and found nothing at the time of the alerts and nothing suspicious before that time.

Mirfster:
The NAS was running without a problem for some time - the last reboot was for the U4 update.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
rosabox said:
"The NAS was running without a problem for some time - the last reboot was for the U4 update."
That is a fairly long time.
Do you have any other functionality enabled on the server, such as jails or virtual machines? Is there any scheduled task that was running about that time? Usually, though not always, a fault like this is the result of some task causing the system to hang.
Do you have regular SMART testing of the disks scheduled?
Is FreeNAS reporting as healthy?
 

rosabox

Explorer
Joined
Jun 8, 2016
Messages
77
No Jails or VMs.

A couple of scheduled tasks:
4 rsync tasks, one of them was running at that time
2 SMART tests (one short - every 10 days, one long - every month, not running at that time)
3 reporting scripts, not running at that time (SMART, UPS, ZPool)

The NAS also serves as a backup target over SMB; no backup was running at that time.

Yes, it's green and "The system has no alerts"
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
FWIW, there is an older thread about a similar issue:
IPMI Watchdog 2 Hard Resets

You could also open a bug report and see if there is anything that recently may have caused this in the U4 update.

Sorry, I am still running 9.10 myself.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
rosabox said:
"4 rsync tasks, one of them was running at that time"
I would suspect that something went wrong with the rsync that caused the system to hang.
That is where I would start looking. It may not be something you can do during business hours, but if you can start that rsync again, it might give you a direction to investigate.
 

rosabox

Explorer
Joined
Jun 8, 2016
Messages
77
The rsync job logs its progress to a log file - nothing suspicious there.
The job will run again tonight; tomorrow I'll see how it went.

I noticed there is a new BIOS and IPMI firmware for the mobo, I'll upgrade them tomorrow.
 

rosabox

Explorer
Joined
Jun 8, 2016
Messages
77
I've upgraded the BIOS and IPMI firmware to the latest versions (1.3 and 3.68, respectively).
Then I did some stress testing, all seems to be good so far...

"Watch Dog Function" is set to disabled in the BIOS (and it was before)
Watch Dog jumper on the mobo is in the default position (pins 1-2), here is the description from the mobo manual:

Watch Dog Timer Enable
Watch Dog (JWD1) is a system monitor that can be used to enter LAN bypass default settings, reset the system or enter NMI when the Timer expires. Close pins 1-2 to reset the system if an application hangs. Close pins 2-3 to generate a non-maskable interrupt signal for the application that hangs. Open all pins to enter LAN pair default mode only. See the table on the right for jumper settings. Watch Dog may be enabled in the BIOS Setup. The default timer is around 5 minutes.

But when I run: ipmitool mc watchdog get
I get:
Code:
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      137 sec
Present Countdown:      130 sec

meaning the watchdog is still running ...
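
A check like this could be scripted. Below is a minimal Python sketch, assuming the "Label: value" output format shown above. It parses saved text rather than calling ipmitool, and the function names are hypothetical:

```python
import re

# Output of `ipmitool mc watchdog get`, as shown above.
SAMPLE = """\
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      137 sec
Present Countdown:      130 sec
"""


def parse_watchdog(text):
    """Turn 'Label: value' lines into a dict keyed by label."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


def seconds(value):
    """Extract the leading integer from values such as '137 sec'."""
    match = re.match(r"\d+", value)
    return int(match.group()) if match else None


info = parse_watchdog(SAMPLE)
running = info["Watchdog Timer Is"].startswith("Started")
print("running:", running)
print("action:", info["Watchdog Timer Actions"])
print("seconds left:", seconds(info["Present Countdown"]))
```

To run it against a live system, SAMPLE could be replaced with the stdout of subprocess.run(["ipmitool", "mc", "watchdog", "get"], capture_output=True, text=True), which requires ipmitool to be installed and usually root privileges.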
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
In all honesty, I would think that having that running would be a good thing. It should prevent an application or other fault from just locking up the system.

I'd be curious whether the BIOS/IPMI firmware notes mention any fixes for WatchDog.

Maybe just wait and see if things have been corrected?
 

rosabox

Explorer
Joined
Jun 8, 2016
Messages
77
Sadly no release notes for the BIOS.
Release notes for IPMI:
Code:
New features:
1. Enable AC Power On Event log.
2. Set AC Power On Event log to be reported when BMC executes cold reboot.
3. Added Redfish memory information feature.

Fixes:
1. Fixed failure of SUM 2.0.1 Test.
2. Fixed failure of SUM "Test_case_215 Outband - ChangeBmcCfg and compare".
3. Fixed inability to disable LDAP Authentication immediately after enabling it.
4. Fixed problem of product serial number being incorrect when it is longer than 15 characters and written in IPMI FRU.


Watch Dog would be fine if it actually rebooted the NAS, but just shutting it down doesn't really help.
I can try to enable it in BIOS and see how it works.

Can anybody please explain what this means: "LAN pairs set to default mode without reset or NMI"
I don't get it ...


EDIT: I've enabled Watch Dog in the BIOS; the "ipmitool mc watchdog get" output looks the same.

EDIT2: That was a bad idea. The NAS suddenly rebooted (a couple of minutes after booting, under no load) and I got the same two email messages mentioned in my first post. The only good news is that it actually rebooted instead of shutting down.
I've disabled Watch Dog in the BIOS again.
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
If it rebooted with WatchDog enabled in the BIOS while nothing was really running, that could point to a hardware issue, such as an overheating CPU or bad RAM.

Any recent hardware alerts or changes? System running hot?

I would guess the "LAN Pairs" may pertain to IPMI being able to run over more than one NIC instead of the Dedicated NIC? Not 100% sure though.
 

rosabox

Explorer
Joined
Jun 8, 2016
Messages
77
With Watch Dog enabled, the system ran for about 2 minutes and rebooted, then ran for another 2 minutes and rebooted again. I then went into the BIOS and disabled Watch Dog; it has been running fine since.
I'm thinking about disabling the watchdog in FreeNAS as well, by setting watchdogd_enable=NO as an rc.conf tunable under System/Tunables.

No alerts, no changes, CPU around 40-45C idle, 50-55C under heavy load.

The question about "LAN pairs" is from the mobo manual regarding the Watch Dog switch, it's the description for the "no jumper" option.
(Jumper on 1-2 is reset, on 2-3 is NMI)
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Thing is, FreeNAS should be patting the watch dog to stop the reboots.

That sounds like a problem right there.
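
To illustrate the idea: a watchdog is a countdown that the OS has to keep resetting ("patting"). If the pats stop, the countdown reaches zero and the configured action fires. Here is a toy Python simulation, not FreeNAS or BMC code; the 137-second countdown is taken from the ipmitool output earlier in the thread, and the pat times are invented:

```python
def first_expiry(initial_countdown, pat_times, horizon):
    """Return the second at which the timer expires, or None if it never does.

    initial_countdown: seconds the timer is reloaded with on each pat
    pat_times: seconds at which the OS pats (resets) the timer
    horizon: how many seconds to simulate
    """
    pats = set(pat_times)
    remaining = initial_countdown
    for now in range(1, horizon + 1):
        if now in pats:
            remaining = initial_countdown  # pat: reload the countdown
        else:
            remaining -= 1
        if remaining <= 0:
            return now  # nobody patted in time: the action (e.g. hard reset) fires
    return None


# Healthy OS: pats every 60 s, so the 137 s countdown never runs out.
print(first_expiry(137, range(60, 601, 60), 600))

# Hung OS: the last pat is at 120 s, so the timer expires 137 s later.
print(first_expiry(137, [60, 120], 600))

# No pats at all: the timer expires 137 s after it was started.
print(first_expiry(137, [], 600))
```

The last case would be consistent with the reboots roughly 2 minutes after boot reported above, if nothing was patting the timer once the BIOS had started it. That is only a guess from the numbers in this thread.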
 

rosabox

Explorer
Joined
Jun 8, 2016
Messages
77
I have disabled Watch Dog both in the BIOS and in FreeNAS. I did some stress testing over the weekend and the NAS is stable.
 