Occasional errors/reboots - what is the culprit?

petr · Sep 15, 2015

I had a flawless run of almost, however, recently I've started having problems. It all started with a watchdog-triggered reboot, after which I started seeing the "ATA error count" occasionally increase across all of the drives and controllers.

I initially thought that my M1015 card was on its last legs, so I replaced it with a spare I had ready. However, that did not help at all; the situation was exactly the same.

The machine lives in a room with around 32C ambient temperature (good airflow with all temps reporting LOW in the IPMI).

I am now experiencing a watchdog-triggered reboot approx. once a month, and the "ATA error count" also occasionally jumps by 1. The ATA errors seem to have subsided after I moved the M1015 one PCI slot up - though this could be anecdotal. Another anecdotal piece of evidence pertains to when the problem started - the first reboot occurred when a thunderstorm was passing through my area (the PC is behind a UPS/surge protector but I suppose you never really know).
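One way to see whether the "ATA error count" is still climbing is to record it per drive at regular intervals. A minimal sketch, assuming the counter in question is the `ATA Error Count:` line printed by `smartctl -l error`; the function name `ata_error_count` and the device name in the usage note are hypothetical:

```shell
#!/bin/sh
# ata_error_count: read `smartctl -l error` output on stdin and print the
# ATA error count. Prints 0 when no "ATA Error Count:" line is present
# (assumption: an absent line means nothing has been logged).
ata_error_count() {
    awk '
        /^ATA Error Count:/ { print $4; found = 1; exit }
        END { if (!found) print 0 }
    '
}
```

For example, running `smartctl -l error /dev/da0 | ata_error_count` for each drive from cron and appending the result to a file together with `date` would show whether the jumps line up with the reboots.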

My question is - does this point to the motherboard/PSU/other part? What tests can I run to determine what is wrong? I suppose the cheapest option would be to start with PSU replacement but I have no idea if this would correspond to the symptoms above.

My setup:

X9SCM-F-O paired with Xeon 1230 v2,
32GB ECC RAM
10x3TB WD RED for storage, 1x120GB SSD for VirtualBox VMs
Large Noctua cooler for the CPU, drives in cages pushing air through the whole front of the case. Additional fan for the M1015. I am quite confident that the temps are OK.

As I cannot really afford any downtime, it would be great to know if it's likely a motherboard or not before spending 200E on a new one. What is/was my course of action:

(done) Check all cabling, replace/swap hdd cables
replace PSU (could it be that?)
replace motherboard
(less likely IMO) replace CPU
(less likely IMO) check memory
throw the whole thing out of a window and build a new one

Ericloewe · Sep 15, 2015

petr said:
"ATA error count"

Could be a symptom of the reboots...

If your IPMI log has nothing useful to share, I'd troubleshoot from easiest to hardest. Typically this might be:

Reseat the CPU heatsink
RAM/PSU, depending on what you have on hand
Motherboard

Unfortunately, there's no real way of making this process easier (except having lots of stuff on hand).

Bidule0hm · Sep 15, 2015

Given the ATA error it's maybe the PSU. What PSU do you use?

petr · Sep 15, 2015

Ericloewe said:
Could be a symptom of the reboots...

Reseat the CPU heatsink

RAM/PSU, depending on what you have on hand

How would you test the RAM - would memtest suffice? Also, my CPU core temperatures according to the

Code:

sysctl -a | grep -E "cpu\.[0-9]+\.temp"

never go over 55C, normally around 45C.
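To log the hottest core over time rather than eyeballing the output, the same filter can be wrapped in a small helper. This is only a sketch: it assumes each matching line looks like `dev.cpu.0.temperature: 45.0C`, and the name `max_cpu_temp` is made up for illustration.

```shell
#!/bin/sh
# max_cpu_temp: read `sysctl -a` style output on stdin and print the highest
# per-core temperature found, in degrees C.
# Assumed line format: dev.cpu.0.temperature: 45.0C
max_cpu_temp() {
    grep -E 'cpu\.[0-9]+\.temp' |
    awk -F': ' '
        {
            t = $2
            sub(/C$/, "", t)                      # strip the trailing unit
            if (NR == 1 || t + 0 > max) max = t + 0
        }
        END { if (NR > 0) printf "%.1f\n", max }  # print nothing if no cores matched
    '
}
```

Usage would be `sysctl -a | max_cpu_temp`, for example once a minute during a stress test.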

Bidule0hm said:
Given the ATA error it's maybe the PSU. What PSU do you use?

Regular ATX PSU, CoolerMaster I think. This may be a cheap fix to get a spare anyway, any tips on a redundant ATX-form factor PSU? :)

cyberjock · Sep 15, 2015

Personally, sounds like a motherboard problem to me. I have that exact board and CPU and I've never had the problems you are having, nor would I expect that moving slots would fix the issue if it was the motherboard. Of course, if it was the card, moving slots wouldn't fix that either. :P

petr · Sep 15, 2015

cyberjock said:
Personally, sounds like a motherboard problem to me. I have that exact board and CPU and I've never had the problems you are having, nor would I expect that moving slots would fix the issue if it was the motherboard. Of course, if it was the card, moving slots wouldn't fix that either. :p

I'm afraid so as well - though would you think so even without the anecdotal evidence I've provided? Moving slots also did not fix the problem completely - it just reduced how often it occurs.

petr · Sep 15, 2015

@cyberjock - Do you think that running for a year in 30C ambient could have damaged it? It should be well within its operating range and cooling was more than adequate (CPU core temps never got over 60C and there are fans in the middle to create a nice wind tunnel through the case; all IPMI temps were always LOW).

cyberjock · Sep 15, 2015

petr said:
@cyberjock - Do you think that running for a year in 30C ambient could have damaged it? It should be well within its operating range and cooling was more than adequate (CPU core temps never got over 60C and there are fans in the middle to create a nice wind tunnel through the case; all IPMI temps were always LOW).

Well, technically the ATX specification dictates input air temp should never exceed 80F (26.7C), so it's at least possible. But to be honest, I wouldn't consider that to be something overly stressful on the system. Could just be dumb luck?
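For reference, the Fahrenheit-to-Celsius conversion behind that figure can be checked with a one-liner. A quick sketch; the helper name `f_to_c` is made up for illustration:

```shell
#!/bin/sh
# f_to_c: convert degrees Fahrenheit ($1) to degrees Celsius,
# rounded to one decimal place: C = (F - 32) * 5 / 9
f_to_c() {
    awk -v f="$1" 'BEGIN { printf "%.1f\n", (f - 32) * 5 / 9 }'
}
```

With this, `f_to_c 80` prints 26.7, a few degrees below the 30C ambient mentioned above.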

petr · Sep 15, 2015

cyberjock said:
Well, technically the ATX specification dictates input air temp should never exceed 80F (26.7C), so it's at least possible. But to be honest, I wouldn't consider that to be something overly stressful on the system. Could just be dumb luck?

Yeah, I thought as much... OK, I've taken the plunge and bought a motherboard + PSU. I guess that after I RMA the faulty part I will end up with a second machine to run VMs on, freeing up FreeNAS's resources going forward.

petr · Sep 21, 2015

OK, I've got an update... After replacing the motherboard first, then the PSU, the error still seems to be there.

What I am doing is the following:

- keep my previous HDDs plugged in (10x3TB) + 2 spares
- load up a clean FreeNAS install
- create a mirror array on the 2 spares, disable compression
- stress test:
  - start filling the drive by running yes 123132242342 > /mnt/test/xxx
  - CPU loader: dd if=/dev/zero of=/dev/null
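The two stress-test commands above can be wrapped so that they stop on their own. This is a sketch rather than the exact procedure used here: it bounds the write size and the CPU load duration (the commands in the list run until interrupted or until the pool is full), and the function names `fill_file` and `cpu_load` are hypothetical.

```shell
#!/bin/sh
# fill_file PATH MEGABYTES
# Write MEGABYTES MiB of repetitive text to PATH, using the same `yes`
# pattern as the list above but capped with `head -c`.
fill_file() {
    yes 123132242342 | head -c "$(( $2 * 1048576 ))" > "$1"
}

# cpu_load SECONDS
# Keep one core busy with the dd loop from the list above for SECONDS,
# then stop the background dd.
cpu_load() {
    dd if=/dev/zero of=/dev/null 2>/dev/null &
    load_pid=$!
    sleep "$1"
    kill "$load_pid" 2>/dev/null || true
    wait "$load_pid" 2>/dev/null || true
}
```

For example, `fill_file /mnt/test/xxx 10240` followed by `cpu_load 600` writes 10 GiB to the test pool and then loads one core for ten minutes. Running `cpu_load 600 &` before `fill_file` applies both at once, which is the combination that triggered the errors.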

After a short time, I start seeing error messages regarding writes on one of the spare drives. Sometimes it continues to run like this; sometimes it locks up completely and kernel panics. As the weather was getting cooler, I reduced ventilation for the case to simulate the max. 30C ambient we can see here, but during the testing the CPU did not seem to go over 58C core temp (which I believe is well within spec).

I guess now the only two options are RAM and CPU. Which one do you think is more likely? My bet is on the CPU now, as I've seen similar machines go down and ECC usually gives you a bit of a warning.

petr · Sep 22, 2015

Update - when I leave out the dd CPU load part of the testing, it all seems to be holding up. Ordered a new CPU in the meantime... this is turning out to be quite an expensive exercise :).

@cyberjock - any thoughts? Faulty CPU - how unlikely is that?

cyberjock · Sep 23, 2015

Honestly, I don't know. This could be a bad hard drive... I suppose. You're definitely playing a game of having to rule out problems one at a time though. Generally if your CPU is having problems you should be getting errors in the IPMI logs. Likewise if you have RAM errors, and both RAM and CPU errors often manifest themselves as logs in the footer (/var/log/messages file) too.

The fact you aren't getting any errors is weird, assuming it is RAM or CPU.
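A quick way to check the log side of this is to count suspicious lines in /var/log/messages. A rough sketch only: the search patterns (`MCA:`, `machine check`, `ECC`) are assumptions about what such errors look like in the log, not a confirmed list, and `count_hw_errors` is a hypothetical name.

```shell
#!/bin/sh
# count_hw_errors FILE
# Print the number of lines in FILE that match the assumed CPU/RAM error
# patterns (case-insensitive). Prints 0 when nothing matches.
count_hw_errors() {
    grep -E -i -c 'MCA:|machine check|ECC' "$1" || true
}
```

`count_hw_errors /var/log/messages` returning 0 after a lockup would be consistent with the "no errors" observation. A non-zero count would be worth reading line by line, since a pattern this broad can also match unrelated messages.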

petr · Sep 23, 2015

cyberjock said:
Honestly, I don't know. This could be a bad hard drive... I suppose. You're definitely playing a game of having to rule out problems one at a time though. Generally if your CPU is having problems you should be getting errors in the IPMI logs. Likewise if you have RAM errors, and both RAM and CPU errors often manifest themselves as logs in the footer (/var/log/messages file) too.

The fact you aren't getting any errors is weird, assuming it is RAM or CPU.

I am starting to go slightly crazy actually - I've replaced the CPU and am doing more testing now. The most common kinds of errors I am seeing are bus resets / device communication / SATA passthrough errors.

As another test, even though I've got multiple M1015 cards to test, I will actually remove the card altogether to see if I can reproduce any problems. I now have pretty much every part twice, so it should be a bit easier and more efficient.

Another issue I've started noticing is with one of the brand new 3TB HGST drives I've picked up. If I connect it to the M1015 (and there are other drives connected to it), it seems to drop off frequently (not the case if it's alone). I've got a few more of the same make/model, so I am trying to swap it to see if I am actually experiencing multiple problems at the same time.

EDIT: added error message examples from previous tests

petr · Sep 23, 2015

@cyberjock Looking at your signature, you have an almost identical system to mine actually - how do you connect your drives?

cyberjock · Sep 23, 2015

Mine are all connected via an M1015 and a SAS Expander, with the boot device being a SATA DOM connected to one of the ports on the motherboard.

cyberjock · Sep 23, 2015

Those errors look like the mps0 controller (M1015 or equivalent) is having major problems. I would definitely replace/remove that controller. And I'd probably do it sooner than later. If the controller goes bonkers and writes garbage to enough drives the zpool could, in theory, be corrupted beyond the ability for ZFS to recover.

petr · Sep 23, 2015

cyberjock said:
Those errors look like the mps0 controller (M1015 or equivalent) is having major problems. I would definitely replace/remove that controller. And I'd probably do it sooner than later. If the controller goes bonkers and writes garbage to enough drives the zpool could, in theory, be corrupted beyond the ability for ZFS to recover.

That's the problem - I am already on a second controller here... Will try putting in the first one I had removed after the first sign of trouble. It seems to get really "confused" when I plug one of the HGST drives to it. Will keep testing... and yes, it's M1015 flashed to IT mode (no BIOS), driver version is matching card firmware.

Ericloewe · Sep 23, 2015

So, at this point the only common factors are the RAM and HDDs? Have you tried using only one of the DIMMs, for troubleshooting?

rogerh · Sep 23, 2015

Were both M1015s from the same source? In which case there may be a more than random chance of them both being defective. Or, indeed, both being counterfeit.

petr · Sep 24, 2015

rogerh said:
Were both M1015s from the same source? In which case there may be a more than random chance of them both being defective. Or, indeed, both being counterfeit.

No, different sources.

I've got an update on the testing. I've put the original M1015 back (now with the new motherboard and CPU) and repeated the test. Apart from an odd delay in one of the HDDs being recognised after boot (no errors, it just was not recognised by FreeNAS until 10 minutes after boot; this happened only after the first boot and I could not reproduce it during several reboots), it has held up well so far.

With the new MOBO, new CPU, new PSU and the old M1015, it seems to be running smoothly (ran the CPU stress test on FreeNAS while doing a scrub with 2 drives, one on onboard SATA and another connected via the M1015).

I've ordered another M1015 for good measure, just to be on the safe side. At the moment, I've started the Memtest86+ that's bundled with Ubuntu and will let it run through several passes. Afterwards, I will boot up Ubuntu Live and do further stress testing, then I will move back to booting my original FreeNAS setup with all the drives attached to see how it behaves.
