[SOLVED] (kind of) New build constantly rebooting when under any load

Status
Not open for further replies.

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Hi,

Let me introduce myself first, since that seems to be the polite thing to do ;-) I'm Noctris. Long-time reader, first-time poster. I've got several medium to high-end FreeNAS boxes running for work, and had one at home which was getting a bit light for my needs (8 GB DDR2, Intel Duo something), so I started on a new build.

Hardware

I tried to reuse some of the stuff I still had and purchased some extras to meet spec, so currently the new build (nebula) is running on:

Asus P8Z77-M ( https://www.asus.com/Motherboards/P8Z77M/)
Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz
16 GB (2 X 8GB Corsair), reported as 16043 MB (I noticed that this goes up and down in the FreeNAS interface; a previous time I booted, it gave 15538 MB?)
M1015 in IT mode
Intel RES2SV240 6Gbps SATA SAS RAID Expander 24port LSI
Intel PCIX Dual Gb NIC
6 X 3TB WD Red in RAIDZ2
2 X 3TB WD Red in mirror
8 X 1TB Seagate in RAIDZ2 (these are old drives that served as "spare" disks and were never plugged in, so I consider them new)
4 X 2TB Seagate in mirrored stripe (same as above: 3 to 4 years old but straight out of the sealed box)
4 X 2TB WD Green -> for playing and testing
24 Bay Logic Case (http://www.logic-case.com/products/...s-minisas-550m-deep-with-fan-wall-sc-4324s-f/)

8 GB USB for boot (X 2 in mirror). I don't know the brand; these are thumb drives we had printed for our own company. They were tested and considered good (and during the course of my "adventure" the last couple of days, I switched these as well, just to be sure).

Installed the FreeNAS 9.2.1 image, then upgraded to 9.3 Stable (last stable train update).

The 8 X 1TB RAIDZ2 was made on my previous system and contains a bunch of data (about 63% filled).

I think that about sums it up.

The issue:

I started setting up FreeNAS: made my pools, datasets and zvols (ESX iSCSI), installed plugin jails, created the iSCSI zvol, and all was well. In fact, performance exceeded my expectations. However, whenever I put some real load on it, it goes down. No kernel panic, no error messages on drives or anything, just... down. For example: I'm rsyncing the data from the 8 X 1TB RAIDZ2 to the new 6 X 3TB RAIDZ2 internally. Speeds are in the 300 MB/s range, which is more than acceptable to me. It does this for between 5 and 40 minutes and then it reboots (sometimes it is more of a shutdown, as the mobo does not seem to start booting again). This also happens when I copy data from our office servers to here (which goes through a wireless link across the street), so when only using 7 to 8 MB/s.

I've done memtests, changed boot devices, checked logs, looked for core dumps, and pushed my logs to an external syslog, but alas: no joy. There does not seem to be a real dump or panic. Currently I'm running a debug version of the kernel. I'm probably missing something, but I could really use some help on where to look for even the smallest indication of what is happening (or maybe people who are more knowledgeable about hardware recognize this kind of thing?)

At this point, I'm ready to just buy a Supermicro mobo with Xeon and ECC (as I do for my pro builds). On the other hand, while the gear I'm using now might not be top-notch server grade, I've run FreeNAS on (A LOT) less than this and it ran fine. So before spending an additional 500 to 600 euro, I would just like to make sure I dotted the i's and crossed the t's.

Any ideas?
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Currently running a 650 W Antec TruePower PSU. Since I wasn't sure this would be enough, I disconnected a bunch of disks in one of my "test runs" so there were only 8 connected at the time. Even then it just went down :s
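
For a rough sanity check on whether 650 W is enough, a back-of-the-envelope estimate along these lines can help. This is only a sketch: every per-component wattage below, the 80% derating factor, and the `Load` / `headroom` names are assumed placeholders, not measurements from this build, so substitute datasheet figures before trusting the result.

```python
from dataclasses import dataclass


@dataclass
class Load:
    name: str
    count: int
    idle_w: float    # assumed steady-state draw per unit, in watts
    spinup_w: float  # assumed peak draw per unit during spin-up, in watts


def total_watts(loads, peak=False):
    """Sum the draw of all loads, using spin-up figures when peak=True."""
    return sum(l.count * (l.spinup_w if peak else l.idle_w) for l in loads)


def headroom(psu_watts, loads, peak=False, derate=0.8):
    """Watts left over after derating the PSU label rating (assumed 80%)."""
    return psu_watts * derate - total_watts(loads, peak)


# Placeholder numbers for the 8-disk test run; adjust counts and wattages.
loads = [
    Load("3.5in HDD", 8, idle_w=6.0, spinup_w=25.0),
    Load("CPU + board + RAM", 1, idle_w=80.0, spinup_w=80.0),
    Load("HBA + expander + NIC", 1, idle_w=30.0, spinup_w=30.0),
]

print("steady-state headroom (W):", headroom(650, loads))
print("spin-up headroom (W):", headroom(650, loads, peak=True))
```

Changing the disk count to the full complement shows how quickly the spin-up figure eats into the margin. A negative result only says the estimate is tight; it does not prove the PSU is the culprit.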

I did do SMART tests and memtest. I haven't done a CPU stress test recently, but the board and proc come out of a running system I had for a year, and I did one before I put that in "production". The thing is, I have run another OS (Debian) just to see if it was bad HW. It worked perfectly fine... It's strange stuff. Especially the "no logs to be found": no panic, nothing.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'd suggest doing some stress testing anyway, regardless of where the board and processor came from. This sounds like a hardware issue. You could have any number of issues, from a standoff shorting out something on your board to some randomly mis-seated card.
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Cards have been checked... many... many... many times (did I mention many?) during the hours I babysat the box while hoping to catch a glimpse of an error on the console ;-) I actually had the same config running stable with a different mobo, so I'm leaning towards the "changed" components -> the new motherboard/CPU/RAM combo AND the Intel NIC. I have a spare of those so I can swap that. I'll just empty the box and start stress testing, then add part by part (although to be honest, after almost a week of messing around with a box that looks good enough on paper, I'm about ready to push 'order' on the shopping list I assembled... even my wife is on board after not seeing me the last couple of days ;-)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, I'd tend to lean towards figuring out what the issue is before buying new hardware, but that's just me. Computers suck. I'm sitting here this morning battling with ESXi over ridiculous configuration retardedness on a new hypervisor ... and have literally been through a whole multi-hour quagmire of not even being able to boot due to $complexity.

What I'm sayin' is, don't feel bad, hang in there. ;-)
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
I long for the days when you could just fry the crap out of flaky hardware with a power surge and RMA it saying "you don't understand what happened either" :) The problem is: even though computers suck, they are ALWAYS right... even when they are wrong ;-)

Thing is: I actually decided to buy it anyway. This box is going to contain some personal data I attach great importance to. And in all honesty, even with backups, I don't want to worry about my data. The feeling-bad part is not the issue with the hardware itself, but me losing trust in the hardware on which I want to put my pictures, documents, etc.

So, change of plans on the hardware front (I was already planning to replace another, less important machine): buy stuff I know will work for FreeNAS, since this will be the "supporting" layer of all the rest I have planned.

Move this board to a role that is "more forgiving" when down.

Thanks for the input and good luck with ESXi (have you tried kicking it? ;-)
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Quick update. So @jgreco was right... I fiddled around again, checked all connectors and sure enough: one of the pins on the motherboard pci-x power connector was askew... Fixed this, box is humming away just fine...
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
aaaaaaaaannnd... I jinxed it... it went down again less than 10 minutes after me thinking it was "OK"...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Noctris said:
Quick update. So @jgreco was right... I fiddled around again, checked all connectors and sure enough: one of the pins on the motherboard pci-x power connector was askew... Fixed this, box is humming away just fine...

Great to hear!
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
jgreco said:
Great to hear!

Are we talking about the "you being right" or "me fixing it" part ? ;-)

Pretty sure I've got it narrowed down to the PSU. It got me thinking that although I swapped it, I did so with an identical model. I've got a different one somewhere around here. Gonna give that a spin. These TruePower PSUs have a "feature" where they are very quiet but run quite hot. Since it has never been put under heavy load (it used to just run a simple machine, so it had plenty of power left), I may have gotten around this feature before, but now that it is powering a heavier CPU, extra pci-x gear and 20 disks, I might be pushing it. It occurred to me that every time I had a "longer" run, it was when the machine had been off for several hours (aka all cold)... reboots started to become more frequent the longer it was online... this with both PSUs. Since all other temp readouts are good, I think it is just going into thermal protection (which would also explain the lack of core dumps: it just "switches off")...

Anyway... just the ramblings of an angry guy that wants his toy to work instead of playing "Dr House" on hardware ;-)
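
One way to test the "reboots get more frequent as it warms up" hunch is to note each boot time and look at how long every run survived. A minimal sketch, assuming the boot times are written down by hand; the timestamps below are made up for illustration and the function names are hypothetical.

```python
from datetime import datetime


def uptimes_minutes(boot_times):
    """Minutes between consecutive boots, a proxy for how long each run lasted."""
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M") for t in boot_times]
    return [(b - a).total_seconds() / 60 for a, b in zip(parsed, parsed[1:])]


def is_shrinking(intervals):
    """True if every run was no longer than the one before it."""
    return all(later <= earlier for earlier, later in zip(intervals, intervals[1:]))


# Invented example: a cold start followed by progressively shorter runs.
boots = [
    "2015-01-01 09:00",
    "2015-01-01 09:40",
    "2015-01-01 10:05",
    "2015-01-01 10:15",
]

runs = uptimes_minutes(boots)
print("run lengths (min):", runs)
print("getting shorter:", is_shrinking(runs))
```

Consistently shrinking runs after a cold start would fit a thermal shutdown, though a handful of data points is suggestive at best.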
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If it comes down to replacing the supply, I will note that the users around here who are building non-rackmount systems usually really like the Seasonic G-series power supplies.

Typed that twice in two minutes!
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
It's a rackmount though... any good suggestions for that? Might try one I've got lying around; if this keeps the box stable, I'll order one.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Noctris said:
Are we talking about the "you being right" or "me fixing it" part ? ;-)

Yyyyyeeeesssssss.......????

Noctris said:
Pretty sure i got it narrowed down to psu. [....] angry guy that wants it's toy to work instead of playing "Dr House" on hardware ;-)

Didn't I start out suggesting PSU? Heh. (Coming from a guy who's spent way too many hours playing "Dr House" on all things tech.)

Noctris said:
It's a rackmount though... good suggestions on that ? Might try one i got lying around, if this keeps the box stable , i'll order out

It's not really a rackmount, more of a standard PC case laid on its side and given rack ears. Looks like it ought to take a standard ATX supply, which a true rackmount server usually doesn't.

Don't trust supplies that spin down the fan to be "quiet".
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
jgreco said:
Yyyyyeeeesssssss.......????

Didn't I start out suggesting PSU? Heh. (Coming from a guy who's spent way too many hours playing "Dr House" on all things tech.)

It's not really a rackmount, more of a standard PC case laid on its side and given rack ears. Looks like it ought to take a standard ATX supply, which a true rackmount server usually doesn't.

Don't trust supplies that spin down the fan to be "quiet".

Pretty much all ATX PSUs do that these days. Seasonic tends to be conservative and spin up quickly, though. The only PSU that caused problems because of such things was one of the old Corsair RMs (OEM has since been replaced with CWT. Bleh.) under very specific conditions.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Did you check the power supply to see if the fan was actually turning? If you have the case cover off, do you still have the issue? I'm looking at air pressure/air flow: the exhaust fans may be pulling so much air that they are trying to pull it through the PSU as well, reducing air flow within the PSU. Hopefully it's the PSU and it's a quick fix for you.

As for testing your components after the PSU replacement, I doubt your CPU is causing the issue. What is your CPU load while a scrub is in progress or when you are doing one of your data transfers?

Run a very long (3 days at least) stress test on your RAM.

If this only occurs during periods of a lot of I/O, I'd look at the parts which heat up a lot during I/O, most likely one of the SATA interface boards or a chip (NB maybe) on the motherboard. You could also have a bad solder joint somewhere in your system.
 

Bhoot

Patron
Joined
Mar 28, 2015
Messages
241
I would think 16 GB of RAM is a bit too little for so many HDDs with so many TB of storage.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Bhoot said:
I would think 16gb ram is a bit too less for so many hdds with huge TB of storage.

It would depend on what is being done with the FreeNAS system. The only way to know is to see how much Swap space is being used, but that should have no bearing on the problem at hand.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
OK, dumb question. This is not a thermal issue, is it? Perhaps, a mis-seated CPU heatsink/fan? Researching this problem with this board, it seems that most people with this problem had one of the following problems:
  • CPU power connector not attached correctly
  • Thermal issue
  • Memory kit was not fully compatible
  • some kind of BIOS error that was cleared when the CMOS was cleared
 