Reboot cycle

Status
Not open for further replies.

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
Going through a lot of other forum posts about frequent rebooting, and how likely it is to be a hardware problem, I'm seeing that the usual troubleshooting method is to remove and replace components to see what helps or doesn't help. We'll be running a memtest tomorrow night for a while, but until we get the results from that I figured I'd see if anyone had a similar setup with the most recent updates where they had fixed the issue:

Our system:
* Supermicro 6027R-E1R12L (X9DRD-7LN4F mobo, LSI controller in IT mode)
* ~100 GB of ECC RAM
* just flashed the mobo firmware to the latest version - 3.2.
* Chelsio T420-CR
* pulled the jumper for watchdog on the mobo, plus made sure watchdog wasn't enabled in the BIOS
* 12x 2 TB WD SAS drives in RAID10
* Intel P3700 SLOG
* dual 960 W power supplies connected to a switched PDU, and that's going to a UPS.
* mirrored 8 GB USB sticks (Kingston DataTraveler SE9's)
... and it'd been running fine for the last 6 months.

Basically I'm using this system as an iSCSI storage target for our ESXi (5.5) hosts. It had been in a test environment for the last 6 months, we'd put a few "B-list" servers on it, and that was fine. We're sharing the iSCSI as a file extent in a dataset on the only zvol; it's got 1 TB assigned to it, with a capacity limit of 85%.

Anyway, things were going smoothly. We finished uploading the last of our semi-production VMs onto the system late last Thursday night; it's now at 230 GB of 10 TB total storage. Default compression, no dedupe. Friday morning the system started a reboot cycle: it's rebooting over and over again.

We've checked the UPS logs, no weird voltage problems. I switched USB sticks, tried a fresh install and re-applied our backup config, checked for kernel panic logs (there were none), and checked the time and date in the BIOS (it's consistent). It doesn't happen at consistent intervals: sometimes it reboots during the boot process, sometimes the system gets to the console and freezes after 5 minutes, sometimes it runs fine for about an hour. Right now it's back to rebooting over & over, never finishing the boot process.

So I got a replacement PSU just in case, and again - running memtest starting tomorrow night. Anything else I should be looking at?
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Reboot loops without any sort of kernel messages: to me, that smells of watchdogd or bad RAM/timings. You seem to be barking up the right trees. Let us know what happens.
 
Joined
Oct 2, 2014
Messages
925
Any chance it's grounding somewhere? Is it possible to swap the motherboard itself for any spare motherboard?
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
oooh I like that. That's exactly what would happen if you were grounding out to the chassis.
 
Joined
Oct 2, 2014
Messages
925

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
Could check the cables, see if anything's touching metal that shouldn't be. Given that we haven't messed with the internals in the last few months, I wouldn't normally suspect grounding, but it certainly wouldn't hurt to check. A short would also reboot our memtest run, so if we let that run overnight and come back to only a few passes, that'll be a positive sign for the grounding theory!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526

Those go on the screws before you set the motherboard in the chassis? If so, those are a horrible idea. Those screws and pegs are *supposed* to be grounded. That's *by design* and using insulating washers is literally creating more problems.

Reminded once again of how screwed up the world is.. selling stuff that does the exact opposite of what should be done. :(
 

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
Checked, and there are no components or loose bits that are shorting the board. Besides, that would cause a sudden reboot - not a lock-up *then* reboot; at least that's my understanding of how that works.

In any case, memtest is at 45% and has been running for about 28 hours without rebooting. Doesn't sound like a power or shorting issue, else it would have rebooted again in that time frame. So more likely a memory or processor problem. Would a bad NIC cause such problems? I wouldn't think so, but I'm willing to entertain the thought.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
cyberjock said:
"Those go on the screws before you set the motherboard in the chassis? If so, those are a horrible idea. Those screws and pegs are *supposed* to be grounded. That's *by design* and using insulating washers is literally creating more problems.

"Reminded once again of how screwed up the world is.. selling stuff that does the exact opposite of what should be done. :("

It's not too bad, it's still grounded through the screw threads. I got a handful of those washers with the Sharkoon T9. I said to myself "What the fsck are these stupid washers for?" and promptly added them to the "Misc." compartment of my screw box, where other semi-useful crap lies (along with oversized screws).
 
Joined
Oct 2, 2014
Messages
925
cyberjock said:
"Those go on the screws before you set the motherboard in the chassis? If so, those are a horrible idea. Those screws and pegs are *supposed* to be grounded. That's *by design* and using insulating washers is literally creating more problems.

"Reminded once again of how screwed up the world is.. selling stuff that does the exact opposite of what should be done. :("

They've come in handy when there are chassis issues but the chassis is proprietary or something else that's odd.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ericloewe said:
"It's not too bad, it's still grounded through the screw threads."

No, it's not. You put those washers on the top and bottom of the motherboard, so neither the stationary peg that sticks up from the chassis nor the screw itself touches the motherboard. Look at a motherboard's screw holes. You'll notice beads of metal. Those are supposed to come in contact with both the peg on the bottom and the screw on the top. In fact, proper motherboard screws have an underside that is not smooth. The ridged side is supposed to be rough enough to remove any poorly conducting oxidized layer of metal and to dig into those metal beads, thereby ensuring a solid ground from the motherboard's ground, through the beads, through the screw, through the peg, and directly to the chassis.

"They've come in handy when there are chassis issues but the chassis is proprietary or something else that's odd"

I would argue that if you are having to deliberately break specs (which call for the grounds I just explained above for safety reasons) then you'd also be dealing with hardware I'd call "inadequate for me to ever trust with hardware that costs me money". ;)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
cyberjock said:
"No, it's not. You put those washers on the top and bottom of the motherboard, so neither the stationary peg that sticks up from the chassis nor the screw itself touches the motherboard. Look at a motherboard's screw holes. You'll notice beads of metal. Those are supposed to come in contact with both the peg on the bottom and the screw on the top. In fact, proper motherboard screws have an underside that is not smooth. The ridged side is supposed to be rough enough to remove any poorly conducting oxidized layer of metal and to dig into those metal beads, thereby ensuring a solid ground from the motherboard's ground, through the beads, through the screw, through the peg, and directly to the chassis."

Well, putting them on top is a terrible idea. It never even occurred to me to use a washer on top - what's the thought process behind that?
Not that using them with the standoffs (in any way) even occurred to me...
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Ericloewe said:
"Well, putting them on top is a terrible idea. It never even occurred to me to use a washer on top - what's the thought process behind that?
Not that using them with the standoffs (in any way) even occurred to me..."

I knew a guy in college who decided to build his own computer. First build. He didn't know what standoffs were and attached the motherboard directly to the case. I think that ended up being a rather expensive lesson.
 

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
Sure enough, memory issue. Interestingly, 6 passes of Memtest didn't find it; the IPMI logs recorded errors in one of the DIMMs at boot. Hopefully it's a stick problem rather than a slot problem. Got replacement RAM coming in and will check it once installed. Glad it is (or at least seems to be) an easy enough fix!
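For anyone wanting to check the same thing, here is a minimal sketch of pulling memory-related entries out of an IPMI event log dump. It assumes the pipe-delimited text layout that `ipmitool sel elist` typically prints; the sample lines, sensor names, and the `memory_events` and `summarize` helpers are made up for illustration, and a real Supermicro log may be worded differently.

```python
from collections import Counter

# Illustrative sample only: invented lines that resemble the
# pipe-delimited layout `ipmitool sel elist` typically prints.
SAMPLE_SEL = """\
   1 | 03/13/2015 | 08:01:12 | Memory #0xff | Correctable ECC | Asserted
   2 | 03/13/2015 | 08:01:40 | Power Supply PS1 | Presence detected | Asserted
   3 | 03/13/2015 | 08:14:05 | Memory #0xff | Uncorrectable ECC | Asserted
"""


def memory_events(sel_text):
    """Return (date, time, sensor, event) tuples for memory-related entries."""
    events = []
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        _entry_id, date, time, sensor, event = fields[:5]
        # Assumption: memory faults show up as a "Memory" sensor or an ECC event.
        if sensor.lower().startswith("memory") or "ecc" in event.lower():
            events.append((date, time, sensor, event))
    return events


def summarize(events):
    """Count how often each event description occurs."""
    return Counter(event for _, _, _, event in events)


if __name__ == "__main__":
    for event, count in summarize(memory_events(SAMPLE_SEL)).items():
        print(f"{event}: {count}")
```

On a real box the text would come from the BMC's event log rather than the sample string, and the matching rule would need adjusting to whatever wording that board actually uses.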
 
Joined
Oct 2, 2014
Messages
925
Would it be possible to move a single DIMM into the slot the possibly bad stick came from? This would tell you whether the slot is bad, or just the DIMM you RMA'd.
 

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
Alright, so swapping out the memory and adding more in general (now up to 192 GB), plus setting the RPM reporting to 7200 RPM, seems to have done the trick for now. I think the combination of faulty RAM plus the ESXi hosts expecting the responsiveness of SSDs was causing the system to just go bonkers. We're going to run another server burn-in to see if we get similar issues this time around, but I'm thinking we won't.
 