[SOLVED] (kind of) New build constantly rebooting when under any load

Status
Not open for further replies.

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
OK, dumb question. This is not a thermal issue, is it? Perhaps, a mis-seated CPU heatsink/fan? Researching this problem with this board, it seems that most people with this problem had one of the following problems:
  • CPU power connector not attached correctly
  • Thermal issue
  • Memory kit was not fully compatible
  • some kind of BIOS error that was cleared when the CMOS was cleared
Pretty sure it isn't. CPU temp never goes over 40 Celsius, disks not over 35, and all fans look good both in the monitoring software and physically. Gear inside 'feels' OK (I can touch the HBA and expander even after it has been running a while). Only the PSU is hot. Very hot. As in 'cannot keep hand behind exhaust' hot.

@jgreco ah yes.. indeed. It's a 'poor man's' rackmount indeed :) Gonna swap the PSU later today (if my 2 year old allows it :p)

Concerning the 16 GB: nope. My disks are divided into smaller pools and the system does not even break a sweat when rsyncing a TB from one to another. That makes the issue strange. Also: if lack of memory was the issue, you would see all kinds of information, warnings, kernel panics and performance issues (and I'm pretty familiar with those, having run FreeNAS in a stable condition on less than recommended hardware for years. It just requires some 'tweaks' and you have to take the loss in performance that comes with it). Swap usage is 0 and performance is perfect right up to the moment it dies. Probably because the PSU is turning itself off because it is too hot. It's one of the only things that would explain an unexpected shutdown without the debug kernel giving me ANY info (sometimes the last entry in the log is a FreeNAS script querying a pool's health to update the interface, other times it is SMART being run on a disk not related to the mentioned pool).

From a software point of view, it looks like someone yanking out the power cable..
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
The PS air should not be that hot, not even close. Have you replaced the PS yet?
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Toddler nap time is coming up ;-)
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Ok, when toddlers take their nap, daddy fixes.. euhm.. server crap? Anyways.. pulled out the TruePower PSU and saw the other unit I have is a NeoPower 650, which has a simple but effective fan hauling out as much air as it can. Combine this with the fact that the fans in the front that cool the disk bays are pulling air in, and this should give a nice airflow.

Side note on the TruePower: taking a better look at it, I see there is some interesting design in this thing.. basically, looking inside, the fan is very slow by Antec's choice AND about half of it is covered with a hard plastic part to keep it away from the transformer coil.. effectively blocking the direct path of air and cooling. There is a cutout in it to allow the airflow through, but it is small. It also sucks its air from the bottom (so pulling in the exhaust heat from the CPU fan) and there is no fan on the back to push it out of the case... I never noticed this before, but with this type of design I strongly doubt I would ever be able to put full load on it :s

Now, we play the waiting game..
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Angry guys log, stardate -307568.1527143581

Well, that didn't take long.. When I started to pull stuff over the network, it went down again.. Only this time I was around to see it happen.

So on the console I see a Samba error together with a syslog error about not being able to reach the syslog server.. Also, iSCSI informs me it's dropping connections from ESX since it cannot ping it anymore.

So apparently, networking goes out first.
It then stays like that for about 20 to 30 seconds, fully responsive: I can even log on to a shell on the console while pinging it from a different machine is not working.
Then all lights simply go out. No warning, panic, dump, .. just "poef".. gone. Then the reboot cycle starts.
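
To pin down the order of events described above (network drops first, power goes some 20 to 30 seconds later), it can help to log reachability with timestamps from a second machine and compare them with the last entries in the server's log. A minimal sketch in Python, not a finished tool: the function names are made up for this example, the host name in the usage note is hypothetical, and `ping -c 1` is assumed to be available on the monitoring machine.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to host gets a reply (uses the system ping)."""
    # -c 1 sends one packet. The per-packet timeout flag differs between
    # platforms, so the timeout is enforced from Python instead.
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def watch(host, probe=ping_once, interval_s=1.0, rounds=None,
          log=print, sleep=time.sleep):
    """Log a timestamped line each time the reachability of host changes.

    rounds=None keeps probing until interrupted; the list of observed
    state changes (True = up, False = down) is returned when rounds is set.
    """
    transitions = []
    last = None
    done = 0
    while rounds is None or done < rounds:
        up = probe(host)
        if up != last:
            stamp = datetime.now().isoformat(timespec="seconds")
            log(f"{stamp} {host} {'UP' if up else 'DOWN'}")
            transitions.append(up)
            last = up
        sleep(interval_s)
        done += 1
    return transitions
```

Running something like `watch("freenas.local")` (hypothetical host name) on another machine leaves a DOWN timestamp that can be set against the last syslog line on the server, which gives a rough measure of the gap between the network dropping and the power going.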

So now I pulled out the Intel NIC (which is the only one I was using, the onboard Realtek was not hooked up) and reconfigured it to use the onboard Realtek to see if it does the same with that one..
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Hey, that didn't take long at all.

After you have finished with that test, if you have another failure, not sure what you use as a boot device but what about a fresh install of FreeNAS? Maybe something is corrupt. I'm fishing here because if it's not the software, well, it's obviously the phase of the moon causing your issues.
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Well, I'm on my 4th USB thumb drive :s I wondered if it could be that, so I've done several clean installs the last couple of days.. using a new device every time. Currently some SanDisk and white-label ones (thumb drives we had printed for our business and that proved trustworthy in the past).. Either way, my Supermicro board and other stuff are on their way, but I would still like to figure out what is wrong with this one since I intend to reuse it for something else (just not as crucial as "keeping my data safe").. but I'm starting to suspect lunar influence too
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
about half of it is covered with a hard plastic part
That's meant to channel air away from the rear exhaust. Pretty much everyone did that a while back. Curiously, it seems to have died down.
 
Joined
Oct 2, 2014
Messages
925
Is it possible for you to test the hardware (motherboard, CPU, PSU, single HDD) outside of the case? Power it up with a Windows boot and run some stress tests on it, or leave it in the case now and just unplug your FreeNAS drives so you don't accidentally *touch* them with whatever you stress test with
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Ha.. didn't know that. Still looks strange to me.. the rear "exhaust" has no fan, so in effect this PSU is sucking in "warm" air from inside the case (air coming out of the CPU fan, no less) and not actively pushing it out through a fan in the back.. weird.

Anyway: update time. The machine is running and, aside from the expected "re0: watchdog timeout" Realtek goodness, it seems to stick. After the Realtek driver died under heavy load, I had no more access to the machine but it was still running. So I did a console reboot, loaded some loader tunables that more or less fix the Realtek stuff (well, keep it running until the new gear arrives) and now it seems to keep running. I'm moving data around just to check..

As a bonus, my tweaks stress the CPU a bit more (I disabled all NIC hardware offloading, leaving it to the kernel so the Realtek driver doesn't do anything stupid), so together with the scrub on a degraded volume it gets some extra pushing.

I can pull an Intel NIC out of a test machine later to see if that runs stable (it could still be the slot instead of the NIC itself).

Side note: I'm impressed at how well ZFS handles this whole trial and error reboot fest... After countless reboots, I now have my first degraded volume. It really surprises me that this didn't happen earlier, seeing that I was copying data the whole time on non-ECC RAM and getting hard power-offs all the time.

The other surprise, however, is how a NIC in a PCI-X slot can kill a machine in such a way that the kernel doesn't even have time to panic about it... (unless this was a multiple issue, perfect storm kind of thing)
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Is it possible for you to test the hardware (motherboard, CPU, PSU, single HDD) outside of the case? Power it up with a Windows boot and run some stress tests on it, or leave it in the case now and just unplug your FreeNAS drives so you don't accidentally *touch* them with whatever you stress test with
Sure, i could. I'm not really worried about the drives since this is a new build. Worst that can happen is that i have to copy a couple of TB over a wireless point-to-point to our office ( at 7 Megabytes per sec) so data loss is no worry.

Thing is that a memtest86+ and running "stress" from a live cd didn't kill it. It only happened when putting load on it when using freenas. So the nic would make sense ( since that is not really "tested" when you run mem or stress tests and i didn't run an iperf because i didn't know faulty nic could cause such catastrophic failures )
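
Since memtest86+ and "stress" never touch the network path, a traffic-only load is the missing test. iperf is the proper tool for that; purely as an illustration of the idea, here is a rough Python sketch of a sender and a byte-counting sink. The function names are made up for this example, and the built-in demo runs over loopback only, which checks the code but does not exercise a physical NIC.

```python
import socket
import threading


def sink(server, totals):
    """Accept one connection and count received bytes until the peer closes."""
    conn, _ = server.accept()
    received = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            received += len(chunk)
    totals.append(received)


def send_load(host, port, total_bytes, chunk_size=65536):
    """Push total_bytes of zeros to host:port; return the number of bytes sent."""
    payload = bytes(chunk_size)
    sent = 0
    with socket.create_connection((host, port)) as sock:
        while sent < total_bytes:
            n = min(chunk_size, total_bytes - sent)
            sock.sendall(payload[:n])
            sent += n
    return sent


def loopback_demo(total_bytes=1_000_000):
    """Run sender and sink against each other on 127.0.0.1 as a self-check."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    totals = []
    worker = threading.Thread(target=sink, args=(server, totals))
    worker.start()
    sent = send_load("127.0.0.1", port, total_bytes)
    worker.join()
    server.close()
    return sent, totals[0]
```

To load a real NIC, the sink would have to listen on the server's LAN address and `send_load` would run on a second machine, so that the bytes actually cross the card and slot under suspicion.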
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
After you get your new system parts up and running, that would be a great time to try to isolate the issue, because you could run MemTest for 3 or 4 days to ensure there isn't an issue with the RAM or the MB timing with the RAM. After that, maybe a 4 hour CPU stress test, even though I doubt that will do anything other than pass, because I doubt it's a CPU overheating issue.

The other thing it could be is the PCI slot, or more likely a small capacitor near the base of the PCI slot, so if you have a different slot you could use, give it a try. Of course, it could be the NIC card itself as well. Capacitors are more often than not the root cause of an MB failure when it comes out of the blue and the MB had been working for years, and then you've got cold solder joints, which are typically near components which get hot. This is a generalization and not an absolute.

The RealTek NIC may not work too well. I had my own issues with mine and, as reluctant as I was to admit it, the drivers for the RealTek NIC just plain suck for FreeNAS. I'm not certain whether that translates to FreeBSD as well; I never investigated it. I know RealTek drivers for Windoze are stable.

Did you say that you were building a new system and that it will have ECC RAM this time? Yeah, off topic a bit.

And hope your toddler got in a good nap.
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Definitely will run those tests again (I already did at the beginning of the week) and again on the Supermicro. But indeed, I think I need to go check the PCI slot as well.. I'm gonna put in another Intel NIC later on and verify if this works..

The Realtek stuff is horrible on FreeBSD (at least in the experiences I had). They caused quite a few problems on pfSense boxes I've seen (throughput and general mess).

About the new build: I'm rebuilding my home and test network and had this hardware planned as the FreeNAS box. Since there is a full backup available, I didn't really worry much about the non-ECC RAM (my other machines do have ECC, so with those across the street pulling the backups, no worries).

But since this config was/has been/is so flaky, I simply don't trust it anymore. Even with backups, I want this thing to work as well as possible. I want to use it, not tinker with it ;-)

So, change of plans. The budget I had in mind will go to the FreeNAS box (Supermicro X9 series board with 32 GB ECC RAM and a somewhat heavier Xeon CPU). This way I can run what I had in mind as a jail or VirtualBox on the FreeNAS box (nothing really heavy) and can save a machine. So the additional money I am spending on the FreeNAS hardware will come back in electricity savings ;-) This board, with its issues, can still serve as a test box for other stuff once I'm sure I've figured out the problem.

Toddlers don't nap.. they pass out on the spot.. and he did so as advertised ;-)

Either way, many thanks already for everyone's input. Hardware is not my "home" area.. I know a bit about it but am generally more active with the stuff running on top of it, so all help was very welcome after I crossed off the basics I know ;-)
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
OK, are you saying, the problem was an Intel NIC card you had plugged in to the PCI slot?

Seems very unlikely to be that card. I'm with joeschmuck--something wrong with the board.
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Well...

swapped the nic -> same

Swapped nic and M1015 pci slot -> same

I did memtest for around 48 hours -> 10 something passes, no worries, 4 hours of CPU stress. No worries.

And then i realised i only did one thing consistently the same.. which was stupid in retrospect.

I reinstalled several times on different keys but i always used the same 9.2.x img file i had and then updated to 9.3... ( yes.. i know.. )

So I downloaded the latest ISO, installed clean on one of the thumb drives which I was sure was OK.. and it was running for about 12 hours doing some heavy data copying while running @jgreco's array-test-v2.sh from https://forums.freenas.org/index.php?threads/building-burn-in-and-testing-your-freenas-system.17750/ (I never got any more than 2 hours out of it before, so longest uptime until now)
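
Reusing one image file for every reinstall means a single bad download would carry over to every thumb drive. A quick way to rule that out, assuming the download page lists a SHA-256 checksum for the image, is to hash the local file and compare. A minimal Python sketch (the function names are made up here, and the file name and checksum in the usage note are placeholders, not real values):

```python
import hashlib


def sha256_of(path, block_size=1 << 20):
    """Return the hex SHA-256 digest of the file at path, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so a large image does not have to fit in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Usage would look like `matches("FreeNAS.iso", "<checksum copied from the download page>")`; a `False` result means the image should be downloaded again before it goes onto another drive.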

Somewhere during that heavy load, it died again.. and corrupted a pool that was the source of an rsync to a different pool for test purposes (and of the array test script). There was no real data on it, just a couple of TB of copied data to push the machine, plus the system dataset, but I'm starting to wonder if this is the board or the HBA. By all accounts, everything I throw at the CPU and motherboard seems to be handled well... it's when I start messing with data that stuff goes wrong...

zpool import gives me this:

Code:
 pool: media
     id: 6656127391778025609
  state: UNAVAIL
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
   see: http://illumos.org/msg/ZFS-8000-EY
config:

        media                                           UNAVAIL  insufficient replicas
          raidz2-0                                      UNAVAIL  insufficient replicas
            11761918741126144065                        UNAVAIL  cannot open
            gptid/c2b1d761-0481-11e5-8c26-00270e027a2b  ONLINE
            355178948775450878                          UNAVAIL  cannot open
            18207329256978712961                        UNAVAIL  cannot open
            13065024954840738848                        UNAVAIL  cannot open
            gptid/c63a7d46-0481-11e5-8c26-00270e027a2b  ONLINE
            1535683463100413334                         UNAVAIL  cannot open
            15803035284657015843                        OFFLINE

Meanwhile, in the FreeNAS interface I can see all disks without a problem. Looks like the GPT is corrupted?
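
The "insufficient replicas" line in the `zpool import` listing above follows from simple counting: a raidz2 vdev tolerates two missing members, and here six of the eight listed members are not usable. A small Python sketch that tallies the member states from such a listing; the parsing is an assumption based on the layout shown in this post (members are taken to be the most deeply indented rows), not a general zpool parser, and the function names are made up for this example.

```python
from collections import Counter

STATES = ("ONLINE", "UNAVAIL", "OFFLINE", "DEGRADED", "FAULTED", "REMOVED")
USABLE = ("ONLINE", "DEGRADED")

# Config section copied from the post, with column padding shortened.
SAMPLE = """
        media                                           UNAVAIL  insufficient replicas
          raidz2-0                                      UNAVAIL  insufficient replicas
            11761918741126144065                        UNAVAIL  cannot open
            gptid/c2b1d761-0481-11e5-8c26-00270e027a2b  ONLINE
            355178948775450878                          UNAVAIL  cannot open
            18207329256978712961                        UNAVAIL  cannot open
            13065024954840738848                        UNAVAIL  cannot open
            gptid/c63a7d46-0481-11e5-8c26-00270e027a2b  ONLINE
            1535683463100413334                         UNAVAIL  cannot open
            15803035284657015843                        OFFLINE
"""


def parse_rows(config_text):
    """Return (indent, name, state) for every row that carries a known state."""
    rows = []
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] in STATES:
            indent = len(line) - len(line.lstrip())
            rows.append((indent, parts[0], parts[1]))
    return rows


def member_state_counts(config_text):
    """Count states of the most deeply indented rows (assumed to be the disks)."""
    rows = parse_rows(config_text)
    if not rows:
        return Counter()
    deepest = max(indent for indent, _, _ in rows)
    return Counter(state for indent, _, state in rows if indent == deepest)


def raidz_can_open(counts, parity):
    """A raidz vdev can be opened while at most `parity` members are unusable."""
    missing = sum(n for state, n in counts.items() if state not in USABLE)
    return missing <= parity
```

For the listing in this post, `member_state_counts(SAMPLE)` gives 5 UNAVAIL, 2 ONLINE and 1 OFFLINE, so `raidz_can_open(..., parity=2)` is `False`. The tally only restates what the listing says; it does not explain why the members could not be opened.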

so now i'm going to try and recover and see what i can find out while doing so...
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Huh... some unused disks in the chassis have been in a different FreeNAS box before and used to be in a pool called media too... so FreeNAS kinda flipped on the import of that. Detached the pool while in "unknown" state, imported it again successfully, then did a wipe of the old disks...

This should not have anything to do with the issues, right? Since ZFS uses the numeric ID?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I doubt it has anything to do with your issues but I've seen some weird stuff in my time where one completely unrelated (in my mind) thing caused something to go wrong.

I'm still thinking some sort of I/O path heating up when it's in use. That could be the HBA or MB.

Keep in mind that when you are doing the MemTest and CPU Burn-in, these only test certain parts of the system, they are important parts and typically where a person would expect to see a failure however not everything is tested. You need special equipment to test out a motherboard completely.

If your MB is under warranty then I'd try to RMA it.

Hey, something else you might be able to try for testing purposes only of course... Reinstall your Intel NIC, Remove your HBA, Install as many SATA drives as you can directly to the MB (looks like six SATA connectors) and see how that works. If your system is stable then either your HBA or the PCI port on the MB is the problem. If it crashes again then you know it's not your HBA or the PCI port and you can reasonably say it's the MB.

I'm not sure what you have the RAM clocked for, even though it passes the MemTest, just ensure it's operating at the slower RAM speed it's certified for to see if that makes a difference. A little tweak here or there can make all the difference.

Last question for you... Did you say this system ever worked as built or has it never worked for any heavy data transfers?
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
Yup, that is what I am worried about too (the I/O path), although strangely, I've been pushing it to its limits for 12 hours straight, copying and moving TBs of data between pools (both on the SAS expander and between the expander and port 2 of the M1015). Never blinked once. Started doing 3 TB of CIFS transfers, nothing either. Within 20 minutes of running the array script, it went.

So I rebooted, resumed pounding the disks... all clear for some 5 to 6 hours.. I then installed jails, in particular: sonarr (it was running previously as well).. Within minutes the box went down...

I now stopped the jail and disabled autostart on it. System keeps going like a champ.

All of this while having the Intel NIC in there (actually, I put in the second one I tested permanently, since it is a high-profile card; the previous one was low profile, so that was not optimal either). All traffic is going over this NIC without a hassle...

The SATA idea is good, I'll give that a try... I also have a couple of M1015s coming in (a steal from an Italian seller on eBay; since the prices can sometimes be absurd here in Europe, I keep an eye out) for another build and as a cold spare. This means that if the SATA thing does work, I can replace the current M1015 to eliminate that and can test with 16 disks without the SAS expander.

Concerning this system: it has been running fine for about 2 years but was previously a ridiculously overdimensioned XBMC machine.. so it never was under any real heavy load.. it also ran fine for about 2 days (no heavy load but some copying) before it started to act up... Only software changes during this time (jails, playing around with 9.3 since I was running 9.2.x in production)
 

Noctris

Contributor
Joined
Jul 3, 2013
Messages
163
So I've been having some fun.. It has been copying both internally between pools and from the network for the last 3 hours straight. Load is.. well.. heavy enough ;-) (the low speed on em0 is because of a copy from a USB disk.. just wanted to keep Samba busy). If something was wrong with the I/O path, it should not be surviving this, right?
loading.png
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
So what exactly happens? Do you have console access when this problem occurs?

I had something similar on my freenas2 (Marvell NICs). The NIC would hang after a hefty network transfer. Using the console, I could see that the machine was OK, it was just the network that hung. If I stopped and started the interface, all was good.. After some googling, I applied a couple of tunables and the problem hasn't come back. There might be something similar for the Intel PCI-X NIC.

upload_2015-6-9_17-39-29.png
 