Occasional errors/reboots - what is the culprit?

Status
Not open for further replies.

petr

Contributor
Joined
Jun 13, 2013
Messages
142
I had a flawless run until recently, when I started having problems. It all started with a watchdog-triggered reboot, after which I started seeing occasional "ATA error count" messages appear across all of the drives and controllers.

I initially thought that my M1015 card was on its last legs, so I replaced it with a spare I had ready. However, that did not help at all; the situation was exactly the same.

The machine lives in a room with around 32C ambient temperature (good airflow with all temps reporting LOW in the IPMI).

I am now experiencing a watchdog-triggered reboot approx. once a month, and the "ATA error count" also occasionally jumps by 1. The ATA errors seem to have subsided after I moved the M1015 one PCI slot up, though this could be anecdotal. Another anecdotal piece of evidence pertains to when the problem started: the first reboot occurred when a thunderstorm was passing through my area (the PC is behind a UPS/surge protector, but I suppose you never really know).
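
One way to keep an eye on those counter jumps is to pull the number out of each drive's SMART error log and record it over time. The sketch below is only an illustration and not something anyone in this thread ran: it assumes the log text comes from smartmontools and contains a line starting with "ATA Error Count:", and ata_error_count is a made-up helper name.

```shell
#!/bin/sh
# ata_error_count: read a SMART error log on stdin and print the error count.
# Assumes a line of the form "ATA Error Count: N ..." (smartmontools layout);
# prints 0 when no such line is present.
ata_error_count() {
    awk '
        /^ATA Error Count:/ { print $4; found = 1; exit }
        END { if (!found) print 0 }
    '
}
```

Usage would be something like "smartctl -l error /dev/ada0 | ata_error_count" (the device name is hypothetical), run per drive so that a jump can be matched against the time of a reboot.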

My question is - does this point to the motherboard/PSU/other part? What tests can I run to determine what is wrong? I suppose the cheapest option would be to start with PSU replacement but I have no idea if this would correspond to the symptoms above.

My setup:
  • X9SCM-F-O paired with Xeon 1230 v2
  • 32GB ECC RAM
  • 10x 3TB WD RED for storage, 1x 120GB SSD for VirtualBox VMs
  • Large Noctua cooler for the CPU, drives in cages pushing air through the whole front of the case. Additional fan for the M1015. I am quite confident that the temps are OK.

As I cannot really afford any downtime, it would be great to know whether it's likely the motherboard before spending 200E on a new one. My course of action, done and planned:

  1. (done) Check all cabling, replace/swap HDD cables
  2. replace PSU (could it be that?)
  3. replace motherboard
  4. (less likely IMO) replace CPU
  5. (less likely IMO) check memory
  6. throw the whole thing out of a window and build a new one
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
petr said:
"ATA error count"
Could be a symptom of the reboots...

If your IPMI log has nothing useful to share, I'd troubleshoot from easiest to hardest. Typically this might be:
  • Reseat the CPU heatsink
  • RAM/PSU, depending on what you have on hand
  • Motherboard

Unfortunately, there's no real way of making this process easier (except having lots of stuff on hand).
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Given the ATA errors, it may be the PSU. What PSU do you use?
 

petr

Ericloewe said:
Could be a symptom of the reboots...
  • Reseat the CPU heatsink
  • RAM/PSU, depending on what you have on hand

How would you test the RAM - would memtest suffice? Also, my CPU core temperatures, according to
Code:
sysctl -a | grep -E "cpu\.[0-9]+\.temp"
never go over 55C and are normally around 45C.
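
To log the hottest core over time rather than eyeballing the output, the same sysctl lines can be reduced to a single number. A minimal sketch, assuming lines of the form "dev.cpu.0.temperature: 45.0C" (that format is an assumption) and using a hypothetical helper name, max_core_temp:

```shell
#!/bin/sh
# max_core_temp: read "name: value" lines on stdin, keep those that look like
# per-core temperature readings, and print the highest one with one decimal.
# Exits non-zero if no temperature line was seen.
max_core_temp() {
    awk -F': ' '
        /cpu\.[0-9]+\.temp/ {
            t = $2
            sub(/C$/, "", t)                 # drop the trailing unit
            if (n == 0 || t + 0 > max) max = t + 0
            n++
        }
        END { if (n > 0) printf "%.1f\n", max; else exit 1 }
    '
}
```

It would be fed with "sysctl -a | max_core_temp", for example from a cron job that appends the result and a timestamp to a file.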

Bidule0hm said:
Given the ATA error it's maybe the PSU. What PSU do you use?

A regular ATX PSU, Cooler Master I think. Getting a spare may be a cheap fix anyway; any tips on a redundant ATX form factor PSU? :)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Personally, sounds like a motherboard problem to me. I have that exact board and CPU and I've never had the problems you are having, nor would I expect that moving slots would fix the issue if it was the motherboard. Of course, if it was the card, moving slots wouldn't fix that either. :P
 

petr

cyberjock said:
Personally, sounds like a motherboard problem to me. I have that exact board and CPU and I've never had the problems you are having, nor would I expect that moving slots would fix the issue if it was the motherboard. Of course, if it was the card, moving slots wouldn't fix that either. :P

I'm afraid so as well - though would you still think so without the anecdotal evidence I've provided? Moving the card also did not fix the problem completely - it just reduced how often it occurs.
 

petr

@cyberjock - Do you think that running for a year in 30C ambient could have damaged it? It should be well within its operating range and cooling was more than adequate (CPU core temps never got over 60C, there are fans in the middle to create a nice wind tunnel through the case, and all IPMI temps were always LOW).
 

cyberjock

petr said:
@cyberjock - Do you think that running for a year in 30C ambient could have damaged it? It should be well within its operating range and cooling was more than adequate (CPU core temps never got over 60C, there are fans in the middle to create a nice wind tunnel through the case, and all IPMI temps were always LOW).


Well, technically the ATX specification dictates input air temp should never exceed 80F (26.6C), so it's at least possible. But to be honest, I wouldn't consider that to be something overly stressful on the system. Could just be dumb luck?
 

petr

cyberjock said:
Well, technically the ATX specification dictates input air temp should never exceed 80F (26.6C), so it's at least possible. But to be honest, I wouldn't consider that to be something overly stressful on the system. Could just be dumb luck?

Yeah, I thought as much... OK, I've taken the plunge and bought a motherboard + PSU... I guess that after I RMA the faulty part I will end up with a new machine to run VMs on, freeing up FreeNAS's resources going forward.
 

petr

OK, I've got an update... After replacing first the motherboard and then the PSU, the error still seems to be there.

What I am doing is the following:

- keep my previous HDDs plugged in (10x3TB) + 2 spares
- load up a clean FreeNAS
- create a mirror array on the 2 spares, disable compression
- stress test:
  - fill the drive by running yes 123132242342 > /mnt/test/xxx
  - CPU loader: dd if=/dev/zero of=/dev/null
After a fairly short time, I start seeing error messages about writes on one of the spare drives. Sometimes it then continues to run like this; sometimes it locks up completely and kernel panics. As the weather was getting cooler, I reduced ventilation for the case to simulate the max. 30C ambient we can see here, but during the testing the CPU did not seem to go over 58C core temp (which I believe is well within spec).

I guess now the only two options are RAM and CPU. Which one do you think is more likely? My bet is on the CPU now, as I've seen similar machines go down, and ECC usually gives you a bit of a warning.
 

petr

Update - when I leave out the dd CPU load part of the testing, it all seems to hold up. I've ordered a new CPU in the meantime.. this is turning out to be quite an expensive exercise :).

@cyberjock - any thoughts? Faulty CPU - how unlikely is that?
 

cyberjock

Honestly, I don't know. This could be a bad hard drive... I suppose. You're definitely playing a game of having to rule out problems one at a time though. Generally if your CPU is having problems you should be getting errors in the IPMI logs. Likewise if you have RAM errors, and both RAM and CPU errors often manifest themselves as logs in the footer (/var/log/messages file) too.

The fact you aren't getting any errors is weird, assuming it is RAM or CPU.
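
Checking the log for such entries can be scripted rather than done by eye. The sketch below counts lines on stdin that mention an mps controller or a machine-check ("MCA") report. Both patterns are assumptions about what the relevant messages contain, and summarize_log is a hypothetical helper name.

```shell
#!/bin/sh
# summarize_log: read log text on stdin and print how many lines mention the
# HBA driver or a machine-check report, plus the total number of lines read.
summarize_log() {
    awk '
        tolower($0) ~ /mps[0-9]+/ { mps++ }   # HBA driver messages (assumed tag)
        /MCA/                     { mca++ }   # machine-check reports (assumed tag)
        END { printf "mps=%d mca=%d total=%d\n", mps, mca, NR }
    '
}
```

It would be run as "summarize_log < /var/log/messages". A zero for both counters alongside ongoing crashes would match the "no errors logged" situation described above.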
 

petr

cyberjock said:
Honestly, I don't know. This could be a bad hard drive... I suppose. You're definitely playing a game of having to rule out problems one at a time though. Generally if your CPU is having problems you should be getting errors in the IPMI logs. Likewise if you have RAM errors, and both RAM and CPU errors often manifest themselves as logs in the footer (/var/log/messages file) too.

The fact you aren't getting any errors is weird, assuming it is RAM or CPU.

I am starting to go slightly crazy, actually - I've replaced the CPU and am doing more testing now. The most common kinds of errors I am seeing are bus resets / device communication / SATA passthrough errors.

As another test, even though I've got multiple M1015 cards to test, I will remove the card altogether to see if I can still reproduce any problems. I now have pretty much every part twice, so it should be a bit easier and more efficient.

Another issue I've started noticing is with one of the brand new 3TB HGST drives I've picked up. If I connect it to the M1015 (and there are other drives connected to it), it seems to drop off frequently (not the case if it's alone). I've got a few more of the same make/model, so I am trying to swap it to see whether I am actually experiencing multiple problems at the same time.

EDIT: added error message examples from previous tests

Screen Shot 2015-09-22 at 12.26.36.png
Screen Shot 2015-09-21 at 23.16.03.png
 

petr

@cyberjock Looking at your signature, you actually have an almost identical system to mine - how do you connect your drives?
 

cyberjock

Mine are all connected via an M1015 and a SAS Expander, with the boot device being a SATA DOM connected to one of the ports on the motherboard.
 

cyberjock

Those errors look like the mps0 controller (M1015 or equivalent) is having major problems. I would definitely replace/remove that controller. And I'd probably do it sooner rather than later. If the controller goes bonkers and writes garbage to enough drives, the zpool could, in theory, be corrupted beyond the ability of ZFS to recover.
 

petr

cyberjock said:
Those errors look like the mps0 controller (M1015 or equivalent) is having major problems. I would definitely replace/remove that controller. And I'd probably do it sooner rather than later. If the controller goes bonkers and writes garbage to enough drives, the zpool could, in theory, be corrupted beyond the ability of ZFS to recover.

That's the problem - I am already on a second controller here... I will try putting back the first one, which I had removed after the first sign of trouble. It seems to get really "confused" when I plug one of the HGST drives into it. Will keep testing... and yes, it's an M1015 flashed to IT mode (no BIOS), and the driver version matches the card firmware.
 

Ericloewe

So, at this point the only common factors are the RAM and HDDs? Have you tried using only one of the DIMMs, for troubleshooting?
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
Were both M1015s from the same source? If so, there may be a more than random chance of them both being defective. Or, indeed, both being counterfeit.
 

petr

rogerh said:
Were both M1015s from the same source? In which case there may be a more than random chance of them both being defective. Or, indeed, both being counterfeit.

No, different sources.

I've got an update on the testing. I've put the original M1015 back (now with the new motherboard and CPU) and repeated the test. Apart from an odd delay in one of the HDDs being recognised after boot, it has held up well so far. There were no errors; the drive just was not recognised by FreeNAS until 10m after boot. This happened only after the first boot, and I could not reproduce it during several reboots.

With the new motherboard, new CPU, new PSU and the old M1015, it seems to be running smoothly (I ran the CPU stress test on FreeNAS while doing a scrub with 2 drives, one on onboard SATA and the other connected via the M1015).

I've ordered another M1015 for good measure, just to be on the safe side. At the moment, I've started the Memtest86+ that's bundled with Ubuntu and will let it run through several passes. Afterwards, I will boot Ubuntu Live and do further stress testing, then move back to booting my original FreeNAS setup with all the drives attached to see how it behaves.
 