Server Stops Responding - Restores via Hard Reset via IPMI

SerpentMage

Dabbler
Joined
May 16, 2017
Messages
29
Hello,

I've had this server up and running for a couple of years now. The only major issue so far has been the expected ASRock C2550D4I board failure. I've had the board replaced once already, though, and this does not seem to be the same issue, because back then the system would not boot at all. Still, I'm not ruling out the board.

System Details:
  • ASRock Rack C2550D4I
  • Intel Avoton C2550 Processor
  • 16GB ECC
  • 4 x Seagate IronWolf 2TB in RAIDZ2, 4 x WD Red 2TB in RAIDZ2, plus a single 16GB SanDisk USB flash drive as the boot device
  • Onboard hard drive controllers; I believe they're Marvell SE9230 or Marvell SE9220
  • Onboard NICs, chipset is Intel I210.


So far over the past 2 to 3 weeks, my NAS has stopped responding overnight. The first time, I could log in via the WebGUI (SSH was not enabled then), but I could not get the status of my pools or my one jail. I tried a normal shutdown via the GUI and, when that did nothing, via the Shell, which also failed, so I was forced to hard reset via IPMI. Initially I did a reset via IPMI, but the system never fully lost power: a continuous ping never dropped at all. When it came back, most of the network config was missing. At that point I powered the server down, confirmed via IPMI that it was off and that my pings were failing, then powered it back on.

Everything was fine for a couple of weeks (I can't remember exactly, sorry), and I noticed nothing out of the ordinary. Then the night before last, the morning of the 13th, the same thing happened: I woke up to the shares being unresponsive. This time, though, I could not log in via the GUI. I'd get to the prompt, but it would just sit there. I also could not ping it. I logged into the IPMI interface on the motherboard, remembered it had a remote console (it simulates a monitor and keyboard plugged in), and tried to shut down that way, which failed. I shut down via IPMI power control again, waited about 30 seconds, and restarted. There were no issues during boot that I could see, and everything came back normal.

After this I pointed syslog at a remote system and set up a free syslog service on another PC to try to get some info. The same thing happened again last night (morning of the 14th). I had to shut down again, but this time I got the syslog files; admittedly, I'm not 100% sure what I'm looking at. I was unable to log in via the GUI or SSH, and the remote console shutdown failed as before. I currently have the server powered off.

Any suggestions or guidance would be greatly appreciated. I do have some jobs set to run at night, SMART tests and such, I'm wondering if something with them is causing problems? I can upload the syslogs if that would help.

Thanks, and apologies for the long post. Trying to provide as much detail as possible.

-John
 

SerpentMage

Sorry to reply to my own message so quickly, but after a bit of coffee I took another look at this morning's syslog and noticed my hard drive temps are getting really high at the same time as various cron messages (no errors that I can see). I'm wondering if my scheduled SMART tests are hammering the system and the fans aren't able to keep up or something? Could that be causing the issue?

I'm tempted to power the server back up and turn off all my cron jobs and see what happens.

Thanks,
-John
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Based on my experience with my C2750, I suggest that you also look at the CPU temp history in the ASRock IPMI.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
SerpentMage said:
System Details:
  • ASRock Rack C2550D4I
  • Intel Avoton C2550 Processor
  • 16GB ECC
  • 4 x Seagate IronWolf 2TB in RAIDZ2, 4 x WD Red 2TB in RAIDZ2, plus a single 16GB SanDisk USB flash drive as the boot device
  • Onboard hard drive controllers; I believe they're Marvell SE9230 or Marvell SE9220
  • Onboard NICs, chipset is Intel I210.
What kind of chassis?

SerpentMage said:
"after a bit of coffee I took another look at this morning's syslog and noticed my hard drive temps are getting really high"
How high?

SerpentMage said:
"I'm wondering if my scheduled SMART tests are hammering the system and the fans aren't able to keep up or something? Could that be causing the issue?"
A long SMART test or a scrub might make the drives heat up if your fan speeds are set too low. If you can tell us about your case, your fans, and any fan speed adjustments you may have made, it might give us some info to work from.

Oftentimes, when a system stops responding and a reboot fixes it, the cause is a failing boot drive. USB boot drives are very prone to failure. We see one on the forum almost every day.
 

SerpentMage

Redcoat said:
"Based on my experience with my C2750, I suggest that you also look at the CPU temp history in the ASRock IPMI."
So far the CPU temp looks OK. When I had to replace my motherboard before, the CPU temp reading was not responding; everything looks OK currently.

Chris Moore said:
"What kind of chassis?"
It's a standard Thermaltake PC tower with 120mm fans in the front and back. I know it's far from ideal, and I'm working on building a newer, better server. Initially this was mostly an experiment that I threw more and more stuff at.

Chris Moore said:
"How high?"
Worst case, around 100°C, which after a bit of googling is WAY too high, as expected.

Chris Moore said:
"A long SMART test or a scrub might make the drives heat up if your fan speeds are set too low. If you can tell us about your case, your fans, and any fan speed adjustments you may have made, it might give us some info to work from.

"Oftentimes, when a system stops responding and a reboot fixes it, the cause is a failing boot drive. USB boot drives are very prone to failure. We see one on the forum almost every day."
Yeah, I forgot to mention that I ordered 2 new USB drives because of the numerous reports of boot drives causing various issues, and I'm going to set them up as a mirror. When the server gets rebuilt/replaced I'll go to an HDD or SSD instead of USB.

I'm most likely going to boot the server again, and turn off all my SMART tests for now, check fan speeds and such. How often do you suggest running scrubs?

Thanks again,
John
 

Chris Moore

SerpentMage said:
"Worst case, around 100°C, which after a bit of googling is WAY too high, as expected."
WOW. Yes, that needs more airflow. If it is an older case, double check that the fans are not choked with dust, because any dust in the path will reduce airflow and, if there is a fan controller, make it crank the fan speed up. If you were checking Google for the best temp for a hard drive, there is no telling what it told you, but I try to keep mine under 40°C, and my preference is under 35°C.
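
As a rough illustration of the rule of thumb above, here is a minimal Python sketch that sorts drive temperatures into bands. The 35°C and 40°C limits come from this post; the function name, drive names, and readings are hypothetical and only there to make the example runnable.

```python
# Limits taken from the post above: under 35°C preferred, under 40°C acceptable.
PREFERRED_MAX_C = 35
ACCEPTABLE_MAX_C = 40


def classify_drive_temp(temp_c: float) -> str:
    """Place a drive temperature (in °C) into one of three bands."""
    if temp_c < PREFERRED_MAX_C:
        return "preferred"
    if temp_c < ACCEPTABLE_MAX_C:
        return "acceptable"
    return "too hot"


# Hypothetical readings (100 mirrors the worst case quoted above).
readings = {"ada0": 33, "ada1": 38, "ada2": 100}
for drive, temp_c in readings.items():
    print(f"{drive}: {temp_c}°C -> {classify_drive_temp(temp_c)}")
```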
 

Chris Moore

SerpentMage said:
"I'm most likely going to boot the server again, and turn off all my SMART tests for now, check fan speeds and such. How often do you suggest running scrubs?"
I run a short SMART test every day at 6 AM and a long SMART test once a week at 7 AM (just pick a day that is good for you), and I have my scrub schedule set up so that scrubs run once a month. It is supposed to work out that way because I use 'threshold days' and have it set to 28. It is working well enough, but I may tinker with the schedule again. If you have enough airflow through the case, you should be able to do those things without the drives going anywhere near an unsafe temperature.
That said, is your system in a climate-controlled environment? We had someone on here who had put the server in the attic, where there was no air conditioning or insulation, and the ambient temperature was too hot for anything to cool properly, no matter how much airflow there was.
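
To make the 'threshold days' idea concrete, here is a minimal Python sketch. It assumes the scrub task fires on its schedule but is skipped unless at least the threshold number of days has passed since the last scrub; the exact comparison the real scheduler uses may differ, and the weekly Sunday schedule and dates below are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def scrub_should_run(today: date, last_scrub: Optional[date], threshold_days: int) -> bool:
    """Assumed rule: run only if threshold_days have passed since the last scrub."""
    if last_scrub is None:
        return True  # never scrubbed, so there is nothing to wait for
    return (today - last_scrub).days >= threshold_days


# Hypothetical example: the task is scheduled every Sunday with a threshold of 28.
last_scrub = date(2024, 1, 7)
check_day = last_scrub
for _ in range(8):
    check_day += timedelta(days=7)
    if scrub_should_run(check_day, last_scrub, threshold_days=28):
        print(f"{check_day}: scrub runs")
        last_scrub = check_day
    else:
        print(f"{check_day}: skipped")
```

Under these assumptions the weekly task only results in a scrub every fourth Sunday, which is roughly once a month.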
 

SerpentMage

Chris Moore said:
"WOW. Yes, that needs more airflow. If it is an older case, double check that the fans are not choked with dust, because any dust in the path will reduce airflow and, if there is a fan controller, make it crank the fan speed up. If you were checking Google for the best temp for a hard drive, there is no telling what it told you, but I try to keep mine under 40°C, and my preference is under 35°C."
Yeah, I'm going to dust everything out this weekend, double check the fans to make sure they're actually working, and max out their speed.

Chris Moore said:
"I run a short SMART test every day at 6 AM and a long SMART test once a week at 7 AM (just pick a day that is good for you), and I have my scrub schedule set up so that scrubs run once a month. It is supposed to work out that way because I use 'threshold days' and have it set to 28. It is working well enough, but I may tinker with the schedule again. If you have enough airflow through the case, you should be able to do those things without the drives going anywhere near an unsafe temperature.
"That said, is your system in a climate-controlled environment? We had someone on here who had put the server in the attic, where there was no air conditioning or insulation, and the ambient temperature was too hot for anything to cool properly, no matter how much airflow there was."
Yeah, my server is in my office (where I am sitting right now), so it has A/C and heat when needed, and my office has 2 big French doors that are open pretty much all the time. The room itself should be OK for the most part; it's the case that needs more airflow, which is a limitation I should have realized way sooner.
Currently I think my airflow is not good enough for scheduled long SMART tests, so I deleted those. I'm going to set my short tests to run 2 days per week, one pool of drives per day: for example, drives 0-3 on Monday and drives 4-7 on Tuesday. I also changed my scrubs to run on different days for each pool (1st and 15th) with a threshold of 28 (was 10).

Hopefully this will give me some more ammo to talk the better half into letting me spend the money on a new server. ;-)

Thanks for your help!
-John
 

SerpentMage

Sorry to bring this back up after a couple of days. My server hasn't had any further issues, thankfully. However, I've been watching my logs more closely, and I'm pretty certain that the temps being reported in the syslog are incorrect.

If I compare the temps listed in the logs to the temps shown in the Reports section for the disks in the GUI, the logs report a crazy temp but the GUI reports show no spike in temps. I'm also seeing some other odd messages in the syslogs that I don't understand.
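
One way to make that comparison systematic is sketched below in Python. It assumes the temperatures from the syslog and from the GUI reports have already been collected by hand into two dictionaries; the function name, drive names, values, and 5°C tolerance are all hypothetical.

```python
def find_suspect_readings(log_temps, gui_temps, tolerance_c=5):
    """Return drives whose logged temperature disagrees with the GUI report."""
    suspects = {}
    for drive, logged in log_temps.items():
        reported = gui_temps.get(drive)
        # Only compare drives that appear in both sources.
        if reported is not None and abs(logged - reported) > tolerance_c:
            suspects[drive] = (logged, reported)
    return suspects


# Hypothetical readings in °C, keyed by drive name.
log_temps = {"ada0": 100, "ada1": 36}
gui_temps = {"ada0": 37, "ada1": 35}
print(find_suspect_readings(log_temps, gui_temps))
```

A drive that only ever shows up as a suspect in one source points at how that source reports the value rather than at the drive itself.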

I'm going to move to a new boot drive. I've also ordered a better case, a Fractal Design Define R5, and better fans. I know it's still a standard mid-tower, but I can't do a full rackmount server chassis right now. This will let me put two 140 mm fans directly in front of the HDDs and another 140 mm fan in the rear of the case, with options for top-mounted fans too, which should significantly increase my airflow.

Hopefully this will solve the issues, though honestly I'm guessing/hoping it's the boot drive.

Thanks,
John
 

Chris Moore


SerpentMage

Sorry to keep beating this dead horse of a thread. Life kinda got in the way for a bit, but I had one more question before moving everything to the new case.

I have an encrypted pool. Is there anything I need to worry about when moving to the new case? When I swapped the boot disk, I had to Export/Disconnect the encrypted pool and then import it again on the new boot drive, and I wasn't sure whether anything similar is needed when moving to the new case.
 