Hi, we have a FreeNAS (11.3.U4.1) system that I am testing before it goes into production, but it has unscheduled reboots a couple of times a week.
System Specs:
- Supermicro X10DRi-T, with only one processor socket populated at the moment.
- Processor is Intel Xeon E5-2637 V3 QEYT ES (Engineering sample)
- 4x 32GB Hynix PC4 DDR4 2Rx4 RDIMM 2400T HMA84GR7MFR4N-UH memory module.
- HBA is an Avago SAS3224 (A1), with firmware updated to 16.00.11.00 and BIOS to 08.37.
- Intel X520-DA2
- Hyper M.2 x16 card with one 500 GB Samsung NVMe SSD for testing
- 2x(TB Seagate Exos 7E8
- FreeNAS is installed on a 120 GB mSATA SSD attached through a USB 3 to mSATA adapter.
With this fix we had no reboots for about 10 days; then, on the 10th day, FreeNAS froze and was totally unresponsive. IPMI showed no error messages.
The next step was to open (disable) the watchdog jumper JWD1 on the motherboard. However, we forgot to stop watchdogd on FreeNAS after boot, and we had another unscheduled reboot in the morning. This time IPMI showed the following entries, but they still contain no useful information:
8 | 2020/08/12 05:11:34 | #0xca | Watchdog 2 | Timer Interrupt - Assertion |
9 | 2020/08/12 05:11:35 | #0xca | Watchdog 2 | Hard Reset - Assertion |
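To compare entries like these across several reboots, the two lines above can be parsed with a small script. This is only a minimal Python sketch: it assumes the pipe-separated layout shown in this post (record ID, timestamp, sensor number, sensor type, event), and the names `SelEntry` and `parse_sel_line` are made up for illustration. Other IPMI tools may print the event log in a different layout.

```python
from datetime import datetime
from typing import NamedTuple


class SelEntry(NamedTuple):
    """One event log line, in the layout pasted in this post (assumed)."""
    record_id: int
    timestamp: datetime
    sensor: str
    sensor_type: str
    event: str


def parse_sel_line(line: str) -> SelEntry:
    """Split one pipe-separated log line into its five fields."""
    # Drop the trailing "|" before splitting so no empty field is left over.
    fields = [f.strip() for f in line.strip().strip("|").split("|")]
    if len(fields) != 5:
        raise ValueError(f"unexpected field count in line: {line!r}")
    record_id, ts, sensor, sensor_type, event = fields
    return SelEntry(
        int(record_id),
        datetime.strptime(ts, "%Y/%m/%d %H:%M:%S"),
        sensor,
        sensor_type,
        event,
    )


# The two entries copied from the IPMI event log above.
LOG = """\
8 | 2020/08/12 05:11:34 | #0xca | Watchdog 2 | Timer Interrupt - Assertion |
9 | 2020/08/12 05:11:35 | #0xca | Watchdog 2 | Hard Reset - Assertion |
"""

entries = [parse_sel_line(l) for l in LOG.splitlines() if l.strip()]
watchdog_events = [e for e in entries if e.sensor_type.startswith("Watchdog")]

# Seconds between the timer interrupt and the hard reset that followed it.
gap = (watchdog_events[1].timestamp - watchdog_events[0].timestamp).total_seconds()
print(f"{len(watchdog_events)} watchdog events, {gap:.0f} s apart")
```

For these two lines the gap is one second, so the hard reset followed the watchdog timer interrupt almost immediately.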
After that we enabled the watchdog timer in the BIOS, started the watchdogd service in FreeNAS, and changed the watchdog jumper JWD1 to pins 2-3 to output an NMI. After that change the system started rebooting after 10 minutes, but we managed to record the error message on the FreeNAS shell.
Here is the video link and timeline:
- System message (below): 0:00 to 0:20
- FreeNAS boot text: 2:00 to 3:30
- Manual shutdown: 3:45 to 5:30
Enter an option from 1-11: NMI/cpu5 ... going to debugger
[thread pid 11 tid 100000]
Stopped at acpi_cpu_idle_mwait+0x7c: testq %rsi, %rsi
db:0:kdb.enter.default>write cn_mute 1
cn_mute 0 + 0x1
db:0:kdb.enter.default> reset
cpu_reset: Restarting BSP
cpu_reset: Failed to restart BSP
-------------------------------------------------------------------------
and then the system rebooted
--------------------------------------------------------------------------
The other issue I have had from earlier is with shutdown.
Whenever I shut down the system, it always pauses after the following messages (video from 3:45 to 5:30):
- GEOM_MIRROR: device destroyed
- Link down
- usbhub down
As a next step, I will investigate how to use IPMI "power diag" from a remote system when the system freezes, as recommended in the ticket.
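As far as I understand it, the ipmitool command `chassis power diag` asks the BMC to pulse a diagnostic interrupt (NMI) to the processor, which can be sent from another machine while FreeNAS is frozen. Below is a minimal Python sketch of how that call could be wrapped. The host name `bmc.example.lan`, the user `ADMIN`, and both function names are placeholders I made up, and it assumes ipmitool is installed on the remote machine with the BMC password exported in the `IPMI_PASSWORD` environment variable (that is what the `-E` option reads).

```python
import subprocess
from typing import List


def build_power_diag_cmd(bmc_host: str, user: str) -> List[str]:
    """Build the ipmitool argument list for a remote diagnostic interrupt."""
    return [
        "ipmitool",
        "-I", "lanplus",   # IPMI v2.0 over LAN
        "-H", bmc_host,    # address of the BMC, not of FreeNAS itself
        "-U", user,
        "-E",              # take the password from the IPMI_PASSWORD variable
        "chassis", "power", "diag",
    ]


def send_power_diag(bmc_host: str, user: str) -> int:
    """Run the command from another machine; returns ipmitool's exit code."""
    result = subprocess.run(build_power_diag_cmd(bmc_host, user), check=False)
    return result.returncode


# Placeholder values; only prints the command, nothing is sent to a BMC here.
EXAMPLE_CMD = build_power_diag_cmd("bmc.example.lan", "ADMIN")
print(" ".join(EXAMPLE_CMD))
```

Calling `send_power_diag(...)` during a freeze should, if the kernel debugger is reachable, drop the console into the same debugger prompt as in the message above, so the state can be recorded before resetting.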
Moreover, we started MemTest86+, scheduled to run for 24 hours; at the moment it has been running for 10+ hours with no errors.
It would be great to know if anyone else has had a similar situation, and any feedback is welcome.