Reboot cycle

paulatmig · May 18, 2015

Going through a lot of other forum posts about frequent rebooting, and how likely it is to be a hardware problem, I'm seeing that the usual troubleshooting method is to remove and replace components to see what helps or doesn't help. We'll be running a memtest tomorrow night for a while, but until we get the results from that I figured I'd see if anyone had a similar setup with the most recent updates where they had fixed the issue:

Our system:
* Supermicro 6027R-E1R12L (X9DRD-7LN4F mobo, LSI controller in IT mode)
* ~100 GB of ECC RAM
* just flashed the mobo firmware to the latest version - 3.2.
* Chelsio T420-CR
* pulled the jumper for watchdog on the mobo, plus made sure watchdog wasn't enabled in the BIOS
* 12x 2 TB WD SAS drives in RAID10
* Intel P3700 SLOG
* dual 960 W power supplies connected to a switched PDU, and that's going to a UPS.
* mirrored 8 GB USB sticks (Kingston DataTraveler SE9s)
... and it'd been running fine for the last 6 months.

Basically I'm using this system as an iSCSI storage target for our ESXi (5.5) hosts. It had been in a test environment for the last 6 months; we'd put a few "B-list" servers on it, and that was fine. We're sharing the iSCSI as a file extent in a dataset on the only zvol; it's got 1 TB assigned to it, with a capacity limit of 85%.

Anyway, things were going smoothly. We finished uploading the last of our semi-production VMs onto the system late last Thursday night - it's now at 230 GB of 10 TB total storage. Default compression, no dedupe. Friday morning the system started a reboot cycle - it's rebooting over and over again.
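As a sanity check on the roughly 10 TB total, here is a minimal sketch of the capacity arithmetic. It assumes "RAID10" means striped two-way mirrors, that the drives are sized in decimal terabytes, and it ignores ZFS overhead; the function name is made up for illustration.

```python
# Rough usable-capacity estimate for a pool of striped two-way mirrors.
# Assumptions (not stated in the thread): "RAID10" means 2-way mirrors,
# drive sizes are decimal (2 TB = 2 * 10**12 bytes), ZFS overhead is ignored.

def mirror_pool_capacity_bytes(drive_count: int, drive_bytes: int,
                               mirror_width: int = 2) -> int:
    """Return raw usable bytes: one drive's worth of space per mirror vdev."""
    if drive_count % mirror_width != 0:
        raise ValueError("drive_count must be a multiple of mirror_width")
    vdevs = drive_count // mirror_width
    return vdevs * drive_bytes

usable = mirror_pool_capacity_bytes(12, 2 * 10**12)
usable_tib = usable / 2**40  # convert bytes to binary terabytes (TiB)
print(f"{usable / 10**12:.1f} TB decimal, {usable_tib:.2f} TiB binary")
```

Six mirrors of 2 TB come out a little under 11 TiB before filesystem overhead, which is in line with a pool that shows up as about 10 TB.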

We've checked the UPS logs, no weird voltage problems. I switched USB sticks, tried a fresh install and re-applied our backup config, checked the kernel panic logs (there were none), and checked the time and date in the BIOS (it's consistent). It doesn't happen at consistent intervals: sometimes it reboots during the boot process, sometimes the system gets to the console and freezes after 5 minutes, sometimes it runs fine for about an hour. Right now it's back to rebooting over & over, never finishing the boot process.

So I got a replacement PSU just in case, and again - running memtest starting tomorrow night. Anything else I should be looking at?

DrKK · May 18, 2015

Reboot loops without any sort of kernel messages---to me, that smells of watchdogd or bad RAM/timings. You seem to be barking up the right trees. Let us know what happens.

Darren Myers · May 18, 2015

Any chance it's grounding somewhere? Any possibility of swapping the motherboard itself for a spare one?

DrKK · May 18, 2015

oooh I like that. That's exactly what would happen if you were grounding out to the chassis.

Darren Myers · May 18, 2015

On all my builds I install these: http://www.amazon.com/dp/B00H96MWHC/?tag=ozlp-20

paulatmig · May 18, 2015

Could check the cables, see if anything's touching metal that shouldn't be. Given that we haven't messed with the internals in the last few months, I wouldn't normally suspect grounding, but it certainly wouldn't hurt to check. That'd also reboot our memtest run, so if we let that run overnight and we come back to only a few passes, that'll be a positive!

cyberjock · May 19, 2015

Darren Myers said:
On all my builds i install these http://www.amazon.com/Cosmos®-Pieces-Motherboard-Insulating-Washers/dp/B00H96MWHC

Those go on the screws before you set the motherboard in the chassis? If so, those are a horrible idea. Those screws and pegs are *supposed* to be grounded. That's *by design* and using insulating washers is literally creating more problems.

Reminded once again of how screwed up the world is.. selling stuff that does the exact opposite of what should be done. :(

paulatmig · May 19, 2015

Checked, and there are no components or loose bits that are shorting the board. Besides, that would cause a sudden reboot - not a lock-up *then* reboot; at least that's my understanding of how that works.

In any case, memtest is at 45% and has been running for about 28 hours without rebooting. That doesn't sound like a power or shorting issue, or else it would have rebooted again in that time frame. So it's more likely a memory or processor problem. Would a bad NIC cause such problems? I wouldn't think so, but I'm willing to entertain the thought.

Ericloewe · May 20, 2015

cyberjock said:
Those go on the screws before you set the motherboard in the chassis? If so, those are a horrible idea. Those screws and pegs are *supposed* to be grounded. That's *by design* and using insulating washers is literally creating more problems.

Reminded once again of how screwed up the world is.. selling stuff that does the exact opposite of what should be done. :(

It's not too bad, it's still grounded through the screw threads. I got a handful of those washers with the Sharkoon T9. I said to myself "What the fsck are these stupid washers for?" and promptly added them to the "Misc." compartment of my screw box, where other semi-useful crap lies (along with oversized screws).

Darren Myers · May 20, 2015

cyberjock said:
Those go on the screws before you set the motherboard in the chassis? If so, those are a horrible idea. Those screws and pegs are *supposed* to be grounded. That's *by design* and using insulating washers is literally creating more problems.

Reminded once again of how screwed up the world is.. selling stuff that does the exact opposite of what should be done. :(

They've come in handy when there are chassis issues but the chassis is proprietary or something else that's odd.

cyberjock · May 20, 2015

Ericloewe said:
It's not too bad, it's still grounded through the screw threads.

No, it's not. You put those washers on the top and bottom of the motherboard, so the stationary peg that sticks up from the chassis doesn't ground to the motherboard, nor does the screw itself touch the motherboard. Look at a motherboard's screw holes. You'll notice beads of metal. Those are supposed to come in contact with both the peg on the bottom and the screw on the top. In fact, proper motherboard screws have an underside that is not smooth. The ridged side is supposed to be rough enough to remove any oxidized layer of metal that doesn't conduct well, as well as dig into those metal beads, thereby ensuring a solid ground from the motherboard's ground, through the beads, through the screw, through the peg, and directly to the chassis.

Darren Myers said:
They've come in handy when there are chassis issues but the chassis is proprietary or something else that's odd.

I would argue that if you are having to deliberately break specs (which call for the grounds I just explained above for safety reasons) then you'd also be dealing with hardware I'd call "inadequate for me to ever trust with hardware that costs me money". ;)

Ericloewe · May 20, 2015

cyberjock said:
No, it's not. You put those washers on the top and bottom of the motherboard, so the stationary peg that sticks up from the chassis doesn't ground to the motherboard, nor does the screw itself touch the motherboard. Look at a motherboard's screw holes. You'll notice beads of metal. Those are supposed to come in contact with both the peg on the bottom and the screw on the top. In fact, proper motherboard screws have an underside that is not smooth. The ridged side is supposed to be rough enough to remove any oxidized layer of metal that doesn't conduct well, as well as dig into those metal beads, thereby ensuring a solid ground from the motherboard's ground, through the beads, through the screw, through the peg, and directly to the chassis.

Well, putting them on top is a terrible idea. It never even occurred to me to use a washer on top - what's the thought process behind that?
Not that using them with the standoffs (in any way) even occurred to me...

anodos · May 20, 2015

Ericloewe said:
Well, putting them on top is a terrible idea. It never even occurred to me to use a washer on top - what's the thought process behind that?
Not that using them with the standoffs (in any way) even occurred to me...

I knew a guy in college who decided to build his own computer. First build. He didn't know what standoffs were and attached the motherboard directly to the case. I think that ended up being a rather expensive lesson.

paulatmig · May 28, 2015

Sure enough, it was a memory issue. Interestingly, 6 passes of Memtest didn't find it - the IPMI logs recorded errors in one of the DIMMs upon booting; hopefully it's a stick problem and not a slot problem. Got replacement RAM coming in and will check it once installed - glad it is (or at least seems to be) an easy enough fix!
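For anyone who wants to look for the same kind of evidence, here is a rough sketch of filtering memory-related entries out of the BMC's system event log. It assumes the pipe-separated layout that `ipmitool sel elist` typically prints; the sample lines are invented for illustration (they are not the actual log from this box), and sensor names and event wording vary by BMC.

```python
# Filter memory-related entries out of `ipmitool sel elist` style output.
# The sample text below is made up for illustration; it is not from this system.

MEMORY_KEYWORDS = ("memory", "ecc", "dimm")

def memory_events(sel_text: str) -> list[dict]:
    """Return SEL entries whose sensor or event text mentions memory/ECC."""
    events = []
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        record_id, date, time, sensor, event = fields[:5]
        haystack = f"{sensor} {event}".lower()
        if any(k in haystack for k in MEMORY_KEYWORDS):
            events.append({"id": record_id, "date": date, "time": time,
                           "sensor": sensor, "event": event})
    return events

# Hypothetical log excerpt in the assumed format.
SAMPLE = """\
   1 | 05/27/2015 | 08:14:02 | Power Supply #0xc8 | Presence detected | Asserted
   2 | 05/27/2015 | 08:15:40 | Memory #0x01 | Correctable ECC | Asserted
   3 | 05/27/2015 | 08:15:41 | Memory #0x01 | Uncorrectable ECC | Asserted
"""

for ev in memory_events(SAMPLE):
    print(ev["id"], ev["sensor"], ev["event"])
```

On a live system the text could come from something like `subprocess.run(["ipmitool", "sel", "elist"], capture_output=True, text=True).stdout`, assuming ipmitool is installed and can reach the BMC. The keyword match is deliberately loose, so treat the results as a starting point rather than a diagnosis.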

Darren Myers · May 28, 2015

Would it be possible to move a single DIMM into the slot the possibly bad stick came from? That would tell you whether the slot is bad, or just the DIMM you RMA'd.

paulatmig · Jun 16, 2015

Alright, so swapping out the memory and adding more in general (now up to 192 GB) - plus setting the RPM reporting to 7200 RPM - seems to have done the trick for now. I think the combination of faulty RAM plus the ESXi hosts expecting the responsiveness of SSDs was causing the system to just go bonkers. We're going to run another server burn-in to see if we get similar issues this time around, but I'm thinking we won't.
