freenas.local had an unscheduled system reboot

meeza_me

Cadet
Joined
Aug 12, 2020
Messages
7
Hi, we have a FreeNAS (11.3.U4.1) system that I am testing before it goes into production, but it has unscheduled system reboots a couple of times a week.

System Specs:
  • Supermicro X10DRi-T, with only one processor socket populated at the moment.
  • Processor is Intel Xeon E5-2637 V3 QEYT ES (Engineering sample)
  • 4x 32GB Hynix PC4 DDR4 2Rx4 RDIMM 2400T HMA84GR7MFR4N-UH memory modules.
  • HBA is an Avago SAS3224 (A1) with firmware updated to 16.00.11.00 and BIOS to 08.37
  • Intel X520-DA2
  • HYPER M.2 X16 CARD with one 500GB Samsung NVMe for testing
  • 2x 8TB Seagate Exos 7E8
  • FreeNAS is installed on a 120GB mSATA SSD attached through a USB3-to-mSATA adapter.
Initially, based on my ticket https://jira.ixsystems.com/browse/NAS-106855, we found that the watchdog timer was initiating the reboots, so we disabled watchdogd in FreeNAS. The watchdog timer was already disabled in the BIOS.

With this fix we had no reboot for about 10 days, but on the 10th day FreeNAS froze and became totally unresponsive. IPMI showed no error messages.

The next step was to open (disable) the WATCHDOG JWD1 jumper on the motherboard. But we forgot to stop watchdogd in FreeNAS after the boot, and we had another unscheduled reboot in the morning. This time IPMI logged the following events, but they still contain no useful information.


8  2020/08/12 05:11:34  #0xca  Watchdog 2  Timer Interrupt - Assertion
9  2020/08/12 05:11:35  #0xca  Watchdog 2  Hard Reset - Assertion
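
A minimal sketch of how event-log entries like the two above could be checked for the "timer interrupt, then hard reset" pattern. The field meanings (event number, timestamp, sensor number, sensor name, description), the `SelEvent` type, and the 5-second window are assumptions made for illustration; they are not taken from the IPMI interface itself.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class SelEvent(NamedTuple):
    """One event-log entry (hypothetical structure for this sketch)."""
    event_id: int
    timestamp: datetime
    sensor: str
    description: str


def parse_event(fields):
    """Build a SelEvent from (id, timestamp, sensor number, sensor name, description)."""
    event_id, stamp, _sensor_number, sensor, description = fields
    when = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
    return SelEvent(int(event_id), when, sensor, description)


def watchdog_resets(events, window=timedelta(seconds=5)):
    """Return (interrupt, reset) pairs where a hard reset follows a
    watchdog timer interrupt within `window`."""
    pairs = []
    last_interrupt = None
    for ev in sorted(events, key=lambda e: e.timestamp):
        if not ev.sensor.startswith("Watchdog"):
            continue
        if ev.description.startswith("Timer Interrupt"):
            last_interrupt = ev
        elif ev.description.startswith("Hard Reset") and last_interrupt is not None:
            if ev.timestamp - last_interrupt.timestamp <= window:
                pairs.append((last_interrupt, ev))
            last_interrupt = None
    return pairs


# The two entries from the post, typed in by hand.
RAW = [
    ("8", "2020/08/12 05:11:34", "#0xca", "Watchdog 2", "Timer Interrupt - Assertion"),
    ("9", "2020/08/12 05:11:35", "#0xca", "Watchdog 2", "Hard Reset - Assertion"),
]

events = [parse_event(row) for row in RAW]
pairs = watchdog_resets(events)
for interrupt, reset in pairs:
    delay = reset.timestamp - interrupt.timestamp
    print(f"watchdog fired at {interrupt.timestamp}, hard reset {delay} later")
```

Run over a longer exported log, this would list every reboot that the watchdog triggered, which helps separate watchdog resets from freezes that leave no log entry.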

After that we enabled the watchdog timer in the BIOS, started the watchdogd service in FreeNAS, and changed the WATCHDOG JWD1 jumper to pins 2-3 to output an NMI. With these settings we started to get a reboot after 10 minutes, but we managed to record the error message on the FreeNAS shell.

Here is the video link and timeline:
  • system message below: 0:00 to 0:20
  • FreeNAS boot text: 2:00 to 3:30
  • manual shutdown: 3:45 to 5:30
------------------------------------------------------------------------
Enter an option from 1-11: NMI/cpu5 ... going to debugger
[thread pid 11 tid 100000]
Stopped at acpi_cpu_idle_mwait+0x7c: testq %rsi, %rsi
db:0:kdb.enter.default> write cn_mute 1
cn_mute 0 + 0x1
db:0:kdb.enter.default> reset
cpu_reset: Restarting BSP
cpu_reset: Failed to restart BSP
-------------------------------------------------------------------------
and then the system rebooted
--------------------------------------------------------------------------

The other issue I have had from early on is with SHUTDOWN.

Whenever I shut down, the system always hangs after the following messages (video from 3:45 to 5:30):
  • GEOM_MIRROR: device destroyed
  • Link down
  • usbhub down
After that it just stays there until I power down the server with the power button.


As a next step I will investigate how to use IPMI "power diag" from a remote system when the system freezes, as recommended in the ticket.
Moreover, we started MemTest86+ and it is scheduled to run for 24 hours. At the moment it has been running for 10+ hours with no errors.
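
For the remote "power diag" step, a minimal sketch of how the call could be scripted with ipmitool, whose `chassis power diag` subcommand pulses a diagnostic interrupt (NMI). The BMC address, user name, password file path, and the choice of the `lanplus` interface are placeholders and assumptions, not values from this system. The sketch only builds and prints the command; nothing is sent.

```python
import shlex


def power_diag_command(host, user, password_file):
    """Build the ipmitool argument list for a remote diagnostic interrupt (NMI)."""
    return [
        "ipmitool",
        "-I", "lanplus",        # IPMI v2.0 over LAN (assumed to be enabled on the BMC)
        "-H", host,             # BMC address, not the FreeNAS address
        "-U", user,
        "-f", password_file,    # read the password from a file instead of the command line
        "chassis", "power", "diag",
    ]


# Placeholder values for illustration only.
cmd = power_diag_command("192.0.2.10", "ADMIN", "/root/.ipmi_pass")
print(shlex.join(cmd))  # shlex.join needs Python 3.8+

# To actually send the NMI from the remote machine:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

With the JWD1 jumper and kernel set up to enter the debugger on NMI, sending this while the system is frozen may produce a debugger prompt or crash dump instead of a silent hang.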

It would be great to know if anyone else has had a similar situation, and any feedback is welcome.


meeza_me

MemTest86 ran for 24 hours with no errors.

20-08-13 23-43-17 5966.jpg


Ran Breakin for about 40 minutes with no errors. Processor temperatures were high, but still no errors.

20-08-14 00-33-57 5968.jpg



I saw an old thread, NAS down after "Timer Interrupt", but it has no definite answer on whether the problem was solved.
 

meeza_me

  • Updated the Supermicro X10DRi BIOS to the latest version
    • BIOS Version: 3.2a
    • BIOS Build Time: 05/14/2020
Current settings:
  • Watchdog timer is OFF in the BIOS
  • WATCHDOG JWD1 jumper is set to pins 2-3 to output NMI
  • watchdogd service is running in FreeNAS

A little while ago I got another unscheduled reboot.

This time a crash dump was generated. I am not sure what to look for, but I found the following interesting bits.

1: ixdiagnose\crash\textdump.tar.0\textdump.tar\ddb.txt
pid ppid pgrp uid state wmesg wchan cmd
1174 1 1174 0 Ss nanslp 0xffffffff82167f50 watchdogd

2: ixdiagnose\crash\textdump.tar.0\textdump.tar\ddb.txt
------------------------------------------------------------------------------
Tracing command watchdogd pid 1174 tid 101559 td 0xfffff804c7ae2620
sched_switch() at sched_switch+0x88e/frame 0xfffffe202164f6f0
mi_switch() at mi_switch+0x181/frame 0xfffffe202164f720
sleepq_switch() at sleepq_switch+0x115/frame 0xfffffe202164f760
sleepq_catch_signals() at sleepq_catch_signals+0x3b6/frame 0xfffffe202164f7d0
sleepq_timedwait_sig() at sleepq_timedwait_sig+0x14/frame 0xfffffe202164f810
_sleep() at _sleep+0x33d/frame 0xfffffe202164f8c0
kern_clock_nanosleep() at kern_clock_nanosleep+0x1b6/frame 0xfffffe202164f940
sys_nanosleep() at sys_nanosleep+0x5f/frame 0xfffffe202164f980
amd64_syscall() at amd64_syscall+0x792/frame 0xfffffe202164fab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe202164fab0
--- syscall (240, FreeBSD ELF64, sys_nanosleep), rip = 0x800b1fbca, rsp = 0x7fffffffeb28, rbp = 0x7fffffffeb70 ---
----------------------------------------------------------------------------------

3: ixdiagnose\fndebug\Network\dump.txt
root watchdogd 1174 3 dgram -> /var/run/logpriv

4: ixdiagnose\log\messages
Aug 14 12:30:31 freenas pcib0: _OSC returned error 0x10
Aug 14 12:30:31 freenas pcib1: _OSC returned error 0x10

5: ixdiagnose\log\middlewared.log
[2020/08/14 12:30:17] (ERROR) middlewared.set_sysctl():407 - Failed to set sysctl '<module 'sysctl' from '/usr/local/lib/python3.7/site-packages/sysctl/__init__.py'>' to '': sysctl: unknown oid 'kern.cam.ctl.ha_peer'
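
A minimal sketch of how the process table in ddb.txt (item 1 above) could be searched for a given command, for example to confirm that watchdogd was only sleeping (`nanslp`) at the time of the dump. The column list is taken from the header line quoted above; the helper name `find_processes` is made up for this sketch. It assumes every column is filled in, so rows with an empty wmesg or wchan field would be skipped or misaligned and need extra handling.

```python
def find_processes(ddb_text, command):
    """Return process-table rows (as dicts) whose cmd column equals `command`."""
    columns = ["pid", "ppid", "pgrp", "uid", "state", "wmesg", "wchan", "cmd"]
    matches = []
    for line in ddb_text.splitlines():
        # Split into at most 8 fields so a cmd containing spaces stays intact.
        parts = line.split(None, len(columns) - 1)
        if len(parts) != len(columns) or not parts[0].isdigit():
            continue  # header line, thread lines, or rows with missing fields
        row = dict(zip(columns, parts))
        if row["cmd"].strip() == command:
            matches.append(row)
    return matches


# Sample copied from the post; a real run would read the extracted ddb.txt instead.
SAMPLE = """\
pid ppid pgrp uid state wmesg wchan cmd
1174 1 1174 0 Ss nanslp 0xffffffff82167f50 watchdogd
"""

rows = find_processes(SAMPLE, "watchdogd")
for row in rows:
    print(row["pid"], row["state"], row["wmesg"])
```

A watchdogd process sitting in `nanslp` matches the backtrace in item 2, which ends in sys_nanosleep: the daemon was waiting between timer pats, not stuck in some unusual state.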





I hope there is something useful in this. Otherwise I will stop the watchdogd service in FreeNAS and wait for the freeze.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Processor is Intel Xeon E5-2637 V3 QEYT ES (Engineering sample)
Having an "ES" CPU would make me very uncomfortable (especially considering this is a production system under test) - I'd suggest replacing it with a non-ES version for a trial at least.
 

meeza_me

Temporarily replaced the processor with an Intel Xeon 2680 V3 and did a fresh install of FreeNAS on two mirrored SSDs.

Also removed:
  • Intel X520-DA2
  • HYPER M.2 X16 CARD with one 500GB Samsung NVMe for testing
  • 2x 8TB Seagate Exos 7E8
Settings:
  • Watchdog timer is OFF in the BIOS
  • WATCHDOG JWD1 jumper is set to pins 2-3 to output NMI
  • watchdogd service is running in FreeNAS

Uptime is about 15 hours now.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Having an "ES" CPU would make me very uncomfortable (especially considering this is a production system under test) - I'd suggest replacing it with a non-ES version for a trial at least.
Good catch! Not a good idea to use engineering samples for production.
 

e30boarder

Cadet
Joined
Aug 17, 2020
Messages
2
I have the exact same problem with a very similar setup, including identical errors in the IPMI interface.
SuperMicro X10SRM-F
Xeon E5-2650 v4
32GB ECC RAM
3x WD RED 3TB
3x HGST UltraStar 12TB
USBtoSATA Adapter to Samsung 850 PRO 256GB (FreeNAS boot)

I ran MemTest all night with no errors. I recently replaced the PSU and still got another random reboot last week. I just updated the firmware for the IPMI interface the other day and I'm crossing my fingers. I'm pretty sure it's related to some watchdog issue, because when I turn the watchdog on in the BIOS and on the mobo, the server reboots about every 10 min. I've been trying to get a shot of the error that shows up before the watchdog reboots the server. I'm about to pull one of my WYZE cams and just sit it in front of the screen.
 

meeza_me

The system is up and running after replacing the CPU, now at 2 days and 16 hours without a reboot.

But the watchdog timer is OFF in the BIOS. I will turn it ON again after this phase of testing is complete (no reboot or freeze for 10 days or more).

Try the following settings and see how it goes:
  • Watchdog timer is OFF in the BIOS
  • WATCHDOG JWD1 jumper is set to pins 2-3 to output NMI
  • watchdogd service is running in FreeNAS
But the Supermicro X10DRi has two watchdog timers, one in the BIOS and another in the BMC (IPMI), and I think (not 100% certain) that WATCHDOG JWD1 is the BMC (IPMI) timer.

So you will need to check how this works on the Supermicro X10SRM-F.

I was using a USB3-to-mSATA adapter, but I have now replaced it with mirrored SSDs to reduce the potential failure points for testing.
 

e30boarder

My board's the same way. I have the jumper and BIOS set to OFF but the watchdog service is running in FreeNAS. I could get almost a month between reboots, so you may want to push it for more than 10 days.
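
A rough worked example of why 10 days may be too short. It assumes reboots arrive at random with a constant average rate (a Poisson model), which is a simplification and not something established in this thread. The 3.5-day figure stands in for "a couple of times a week" and the 30-day figure for "almost a month"; both are approximations.

```python
import math


def prob_no_reboot(mean_days_between_reboots, test_days):
    """Chance that a still-faulty system shows no reboot during the test,
    assuming reboots follow a Poisson process."""
    return math.exp(-test_days / mean_days_between_reboots)


def days_needed(mean_days_between_reboots, confidence=0.95):
    """Test length after which a still-faulty system would have rebooted
    at least once with the given probability."""
    return -mean_days_between_reboots * math.log(1.0 - confidence)


p_frequent = prob_no_reboot(3.5, 10)   # fault that hits about twice a week
p_rare = prob_no_reboot(30, 10)        # fault that hits about once a month
long_test = days_needed(30)

print(f"10 clean days by luck, reboot every ~3.5 days: {p_frequent:.1%}")
print(f"10 clean days by luck, reboot every ~30 days:  {p_rare:.1%}")
print(f"days for 95% confidence at ~30 days between reboots: {long_test:.0f}")
```

Under these assumptions, 10 clean days is fairly strong evidence against a fault that used to hit twice a week, but says little about one that hits about once a month.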
 

meeza_me

The system is up and running without an unscheduled reboot :)
Uptime is 10 days, 10:23,
but the following components are removed:
  • Intel X520-DA2
  • HYPER M.2 X16 CARD with one 500GB Samsung NVMe for testing
  • 2x 8TB Seagate Exos 7E8
We are waiting for delivery of an Intel Xeon E5-2623 V3. After that we will put back all the devices and do another test run.
 