Hard crash / unscheduled reboot, SuperMicro "Lower Critical - Going Low - Assertion"

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
Two months ago:

Code:
293    2019/03/02 10:16:15    VDIMM    Voltage    Lower Critical - Going Low - Assertion
294    2019/03/02 10:16:16    VDIMM    Voltage    Lower Non-Recoverable - Going Low - Assertion
295    2019/03/02 10:16:18    #0xff    Processor    IERR - Assertion
296    2019/03/02 10:16:37    PVCCSRAM    Voltage    Lower Critical - Going Low - Assertion
297    2019/03/02 10:16:37    PVCCSRAM    Voltage    Lower Non-Recoverable - Going Low - Assertion
298    2019/03/02 10:18:17    #0xca    Watchdog 2    Timer Interrupt - Assertion
299    2019/03/02 10:18:18    #0xca    Watchdog 2    Hard Reset - Assertion
300    2019/03/02 10:20:09    VDIMM    Voltage    Lower Non-Recoverable - Going Low - De-assertion
301    2019/03/02 10:20:09    VDIMM    Voltage    Lower Critical - Going Low - De-assertion
302    2019/03/02 10:20:09    PVCCSRAM    Voltage    Lower Non-Recoverable - Going Low - De-assertion
303    2019/03/02 10:20:09    PVCCSRAM    Voltage    Lower Critical - Going Low - De-assertion



Last night :(

Code:
304    2019/05/12 01:02:09    Vcpu    Voltage    Lower Critical - Going Low - Assertion
305    2019/05/12 01:02:09    Vcpu    Voltage    Lower Non-Recoverable - Going Low - Assertion
306    2019/05/12 01:02:09    VDIMM    Voltage    Lower Critical - Going Low - Assertion
307    2019/05/12 01:02:09    VDIMM    Voltage    Lower Non-Recoverable - Going Low - Assertion
308    2019/05/12 01:02:09    PVCCSRAM    Voltage    Lower Critical - Going Low - Assertion
309    2019/05/12 01:02:09    PVCCSRAM    Voltage    Lower Non-Recoverable - Going Low - Assertion
310    2019/05/12 01:02:16    #0xff    Processor    IERR - Assertion
311    2019/05/12 01:04:10    #0xca    Watchdog 2    Timer Interrupt - Assertion
312    2019/05/12 01:04:11    #0xca    Watchdog 2    Hard Reset - Assertion
313    2019/05/12 01:06:19    Vcpu    Voltage    Lower Non-Recoverable - Going Low - De-assertion
314    2019/05/12 01:06:20    Vcpu    Voltage    Lower Critical - Going Low - De-assertion
315    2019/05/12 01:06:20    VDIMM    Voltage    Lower Non-Recoverable - Going Low - De-assertion
316    2019/05/12 01:06:20    VDIMM    Voltage    Lower Critical - Going Low - De-assertion
317    2019/05/12 01:06:20    PVCCSRAM    Voltage    Lower Non-Recoverable - Going Low - De-assertion
318    2019/05/12 01:06:20    PVCCSRAM    Voltage    Lower Critical - Going Low - De-assertion


Has anyone seen this before?
If I recall correctly, I ran a pretty thorough memtest on the system when I built it.
What can I do here? I googled and got the idea that a slightly wonky IPMI driver might cause this. Is that possible?
I can't be certain, and this could be entirely hardware, but I'm reluctant to waste the developers' time with a bug report until I know more.

Here are two links I found; the first seems to imply it might be a driver rather than hardware:
https://webcache.googleusercontent.com/search?q=cache:svbkPlIa72wJ:https://forum.opsi.org/viewtopic.php?t=9281+&cd=5&hl=en&ct=clnk&gl=au
https://blog.pcfe.net/hugo/posts/2018-08-23-ipmi-watchdog/

The board is:
A2SDi-8C-HLN4F (2.2 GHz 8-core Atom C3758, Denverton)
Firmware Revision : 03.60
Firmware Build Time : 07/28/2017
BIOS Version: 1.1a
BIOS Build Time: 09/18/2018
Redfish Version : 1.0.1

Any ideas at all? I feel like I need to run memtest86 on it for a week or so :( Not a pleasant idea!
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Hmmmm - why don't you think that's a power (supply) problem?

At first blush it looks like the progress of a voltage dip and recovery to me.
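To make the "dip and recovery" reading concrete, here is a rough Python sketch that takes the second incident exactly as pasted in the first post and works out how long each voltage sensor stayed asserted. The function names (`parse_sel`, `low_voltage_windows`) are made up for illustration, and the parsing assumes the columns are separated by runs of two or more spaces, as they are in the paste.

```python
import re
from datetime import datetime

# Second incident, copied from the SEL dump in the first post.
SEL_TEXT = """\
304    2019/05/12 01:02:09    Vcpu    Voltage    Lower Critical - Going Low - Assertion
305    2019/05/12 01:02:09    Vcpu    Voltage    Lower Non-Recoverable - Going Low - Assertion
306    2019/05/12 01:02:09    VDIMM    Voltage    Lower Critical - Going Low - Assertion
307    2019/05/12 01:02:09    VDIMM    Voltage    Lower Non-Recoverable - Going Low - Assertion
308    2019/05/12 01:02:09    PVCCSRAM    Voltage    Lower Critical - Going Low - Assertion
309    2019/05/12 01:02:09    PVCCSRAM    Voltage    Lower Non-Recoverable - Going Low - Assertion
310    2019/05/12 01:02:16    #0xff    Processor    IERR - Assertion
311    2019/05/12 01:04:10    #0xca    Watchdog 2    Timer Interrupt - Assertion
312    2019/05/12 01:04:11    #0xca    Watchdog 2    Hard Reset - Assertion
313    2019/05/12 01:06:19    Vcpu    Voltage    Lower Non-Recoverable - Going Low - De-assertion
314    2019/05/12 01:06:20    Vcpu    Voltage    Lower Critical - Going Low - De-assertion
315    2019/05/12 01:06:20    VDIMM    Voltage    Lower Non-Recoverable - Going Low - De-assertion
316    2019/05/12 01:06:20    VDIMM    Voltage    Lower Critical - Going Low - De-assertion
317    2019/05/12 01:06:20    PVCCSRAM    Voltage    Lower Non-Recoverable - Going Low - De-assertion
318    2019/05/12 01:06:20    PVCCSRAM    Voltage    Lower Critical - Going Low - De-assertion
"""


def parse_sel(text):
    """Return (timestamp, sensor, sensor_type, event) tuples, one per SEL line."""
    events = []
    for line in text.strip().splitlines():
        # Split on runs of 2+ whitespace characters (or a tab), so the single
        # spaces inside "Watchdog 2" and the timestamp are left alone.
        _id, stamp, sensor, kind, event = re.split(r"\s{2,}|\t", line.strip())
        when = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
        events.append((when, sensor, kind, event))
    return events


def low_voltage_windows(events):
    """Map each voltage sensor to (first assertion, last de-assertion)."""
    windows = {}
    for when, sensor, kind, event in events:
        if kind != "Voltage":
            continue
        start, end = windows.get(sensor, (None, None))
        if event.endswith("De-assertion"):
            end = when if end is None or when > end else end
        elif event.endswith("Assertion"):
            start = when if start is None or when < start else start
        windows[sensor] = (start, end)
    return windows


events = parse_sel(SEL_TEXT)
windows = low_voltage_windows(events)
reset_time = next(when for when, _s, _k, ev in events if ev.startswith("Hard Reset"))

for sensor, (start, end) in windows.items():
    seconds = (end - start).total_seconds()
    inside = start < reset_time < end
    print(f"{sensor}: asserted for {seconds:.0f} s, hard reset inside window: {inside}")
```

On this data all three sensors stay asserted for a little over four minutes, and the watchdog hard reset falls inside that window. That would fit a dip followed by a recovery, though the de-assertions may also just reflect the board coming back up after the reset.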
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
I would expect disks to have an issue (spin up / downs) before anything else really.
The PSU is oversized for this setup: it's a low-power system that pulls 99 W at the wall, on a 450 W Gold supply.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
The voltages that are reported to have apparently dipped and restored are controlled by the chipset AFAIK. I wonder if you have a motherboard issue (not PSU) affecting the voltage control. I wish you good luck in resolving this.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Has the CPU/motherboard been replaced yet? If not, it's probably dying.
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
No, it's brand new. I'll replace it if necessary, regrettably.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Brand new? Then it's not the infamous bug in the CPU. Hopefully.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
@diskdiddler, what's the latest with your unscheduled reboot problem?
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
I have only seen the issue twice now, so I'm unsure what to do. It's now 7-month-old hardware with only 2 reboots like this.
I will try to run a long memtest on the server, perhaps in a month when I go away for a solid week.
If it's the PSU, it's very infrequent. Unfortunately it's also a difficult work area to fit a UPS into to 'clean' the power.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Thanks for the update. Good luck - I hope you can get to the bottom of the issue.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Let's work with just the facts, not how new the items are or how they're rated; hardware fails whether it's new or old. Your motherboard reported low voltages, and it appears they are likely all derived from the 3.3VDC rail. The problem is likely the PSU or the motherboard, but if you have any add-on cards, they can't be ruled out either unless you remove them and continue to test.

My advice:
1) Install a different PSU if you have one; this is the easiest thing to check.
2) If you have a good DMM, connect it up to the PSU and monitor the 3.3VDC line.
3) In the BIOS you may be able to view the PSU voltages, watch that for a while.
4) The failure could be the VRMs (Voltage Regulator Modules) on the motherboard. Maybe a missing heatsink, not enough airflow, or one or more of them failing prematurely. It happens.
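If watching the voltages by hand gets tedious, the BMC's sensor readings can also be checked from a script. The following is only a sketch: it assumes `ipmitool` is installed and can reach the local BMC, and that `ipmitool sensor` prints pipe-separated rows in the order name, reading, unit, status, lower non-recoverable, lower critical, and so on. The helper names and the sample rows are invented for illustration and are not readings from this board.

```python
import subprocess


def parse_sensor_table(text):
    """Parse pipe-separated sensor rows into (name, reading, lower_critical).

    Assumed column order (an assumption about `ipmitool sensor` output):
    name | reading | unit | status | LNR | LCR | ...
    """
    rows = []
    for line in text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 6 or cols[2] != "Volts":
            continue  # skip temperatures, fans, malformed lines
        try:
            reading, lower_crit = float(cols[1]), float(cols[5])
        except ValueError:
            continue  # "na" reading or threshold
        rows.append((cols[0], reading, lower_crit))
    return rows


def below_lower_critical(rows):
    """Names of voltage sensors currently reading under their lower-critical threshold."""
    return [name for name, reading, lower_crit in rows if reading < lower_crit]


def read_bmc_voltages():
    """Query the local BMC. Needs ipmitool and usually root; not called below."""
    result = subprocess.run(
        ["ipmitool", "sensor"], capture_output=True, text=True, check=True
    )
    return parse_sensor_table(result.stdout)


# Hypothetical sample rows with made-up values, only to exercise the parser.
SAMPLE = """\
CPU Temp         | 41.000     | degrees C  | ok    | 0.000     | 0.000     | 0.000     | 93.000    | 98.000    | 98.000
VDIMM            | 1.200      | Volts      | ok    | 0.900     | 1.000     | na        | na        | 1.350     | 1.450
Vcpu             | 0.500      | Volts      | cr    | 0.800     | 0.900     | na        | na        | 1.500     | 1.600
"""

sample_rows = parse_sensor_table(SAMPLE)
print(below_lower_critical(sample_rows))
```

Running `read_bmc_voltages()` from cron or a loop and logging the result with a timestamp would give a record to compare against the next event-log entry, if the problem ever shows up again.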

diskdiddler said: "I would expect disks to have an issue (spin up / downs) before anything else really."
Why? The hard drives use +5VDC and +12VDC which is a different part of the PSU.
 