Random crashing

James Mason

Dabbler
Joined
Jan 21, 2019
Messages
16
Hi. I'm just after some thoughts really on this one. I've only been using FreeNAS for a few months and it would be good to see if anyone has any comments.

Recently I had a bit of a strange situation where I returned home to find that my power at home had tripped and my UPS was beeping (a lot!). It's an APC Smart UPS 1000 and a bit of research showed me that the type of beeping was the current overload warning. Since it's a used item I thought it was likely that something had gone wrong with the UPS, but it was actually one of the power supplies in the PC used for FreeNAS that had violently died.

The FreeNAS system (was) a Dell T310 with dual redundant 400W PSUs. When I tried to start it up without the UPS there was a reasonably loud bang and sparks actually came from one of the PSUs. I concluded that the UPS was fine and as a temporary measure booted up the T310 with the single remaining PSU. It worked fine for a while but I woke a couple of days later to an email saying that one of the HDDs had a bad sector and another email saying that there was an "Unscheduled system reboot".

With sparks flying around inside the system it didn't surprise me that this may have caused some other issues, although it worked perfectly for 2 days. It had also worked flawlessly since around January before the PSU died. I decided to buy a completely new system (Fujitsu TX150 S7) and transfer the 2x USB boot drives and 4x HDDs (running in RAIDZ2 - I know, very inefficient but it's a work in progress). I transferred the USB boot drives, HDDs and RAM into the new system and it was rebooting every few hours. I have narrowed the rebooting down to the watchdog rebooting after a system crash because when I disabled the watchdog it crashed and just remained crashed - no reboot.

Thinking at this point that it was the RAM (2x 4GB Kingston DDR3 1333MHz ECC - I don't have the part number to hand but I can get it if necessary), I ran Memtest86 and it passed a few times with no issues. I also ran the test with the RAM in another machine and again had no errors.

I am waiting on delivery of a set of RAM that is an original part for the Fujitsu (which does complain at boot that the Kingston RAM isn't an authorised part), to see if the RAM is at fault (or a bit incompatible). At this point the only hardware common to the system that was originally running fine and the new system will be the HDDs and USB drives. Could these cause a system to crash? I was running 11.2-U7 and I tried booting into U6 and it still crashed.

Would it be worth taking out one of the HDDs and seeing if it still crashes, and then repeating this with a different HDD removed until all have been tested? Or the same with the USB boot drives? Is this particularly risky? I have backups of the data.
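The one-drive-at-a-time test described above can be written out as a simple plan. This is only a minimal sketch in Python: the drive labels are made up, and it assumes a 4-drive RAIDZ2 pool, which tolerates up to two missing drives and so should stay online (degraded) with one drive removed.

```python
from itertools import combinations

# Hypothetical labels for the four pool drives; substitute the real device names.
DRIVES = ["hdd1", "hdd2", "hdd3", "hdd4"]


def removal_rounds(drives, removed_per_round=1):
    """Yield (removed, installed) lists, one pair per test round."""
    for removed in combinations(drives, removed_per_round):
        installed = [d for d in drives if d not in removed]
        yield list(removed), installed


def suspects(crashed_by_removed_drive):
    """Return the drives whose removal stopped the crashes.

    The argument maps each removed drive to True if the system still
    crashed during that round, or False if it stayed up.
    """
    return [d for d, crashed in crashed_by_removed_drive.items() if not crashed]


for removed, installed in removal_rounds(DRIVES):
    print(f"remove {removed}, run with {installed}")
```

If every round still crashes, `suspects` returns an empty list, which points away from any single drive and towards something the rounds have in common (controller, backplane, cabling, boot devices).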

If anyone can see anything I have missed that is blindingly obvious then that would be great.

For clarity, here is the system as it is currently set up:

Fujitsu TX150 S7
Xeon X3430
8GB Kingston ECC DDR3 1333MHz (running at 800MHz because of the CPU - I have an X3450 to fix that)
4x HGST 3TB SATA HDDs in RAIDZ2 (about 4TB used)
2x SanDisk USB 3.1 16GB boot drives
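
For reference, the "very inefficient" remark about RAIDZ2 on four drives comes down to simple arithmetic: two drives' worth of space goes to parity. A rough sketch in Python, ignoring ZFS metadata and padding overhead (so real usable space will be somewhat lower); the function names are illustrative only.

```python
def raidz_usable_tb(drive_count, drive_tb, parity=2):
    """Approximate usable capacity of one RAIDZ vdev, ignoring ZFS overhead."""
    if drive_count <= parity:
        raise ValueError("a RAIDZ vdev needs more drives than its parity level")
    return (drive_count - parity) * drive_tb


def raidz_efficiency(drive_count, parity=2):
    """Fraction of raw capacity left for data."""
    return (drive_count - parity) / drive_count


# 4x 3TB in RAIDZ2, as listed above.
print(raidz_usable_tb(4, 3), raidz_efficiency(4))
```

With four 3TB drives this gives roughly 6TB usable at 50% efficiency, so the 4TB in use is about two thirds of the pool.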

Finally, how risky is it to have it crash whilst I'm trying to figure this out? I have backups and would assume that data loss is fairly unlikely but I'd prefer not to run into problems from repeated crashing.

Thanks in advance.
 

James Mason

Dabbler
Joined
Jan 21, 2019
Messages
16
As a quick update I've installed the new (to me) fully compatible RAM and the X3450 from the system with the dodgy PSU and it's currently chugging along as normal and in fact finishing a scrub that it crashed in the middle of.

If I get any changes I'll update.
 

James Mason

Dabbler
Joined
Jan 21, 2019
Messages
16
It has just rebooted itself so I'm going to remove the HDD that had the unreadable sector to start with and see what happens. If anyone is reading this and thinks that this is a horrendously bad idea please tell me.
 

ThreeDee

Guru
Joined
Jun 13, 2013
Messages
700
well ... memory is on the light side
USB 3 can be flaky
It's recommended to use SSDs for boot drives
you could try booting a live Linux distro without installing it to check the stability of the hardware

you always risk corrupting data or damaging your hard drives if you crash due to a power failure of some sort
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
As a matter of course, I always suggest that people reconfigure their systems to boot off a small SSD. Flash drives are notoriously unreliable under the best of conditions - and after a power problem such as you described... who knows?

And, if you rebuild your system using an SSD, you don't need mirrored boot drives. Just back up your FreeNAS configuration on a regular basis and you can always do a re-install if necessary.
 

James Mason

Dabbler
Joined
Jan 21, 2019
Messages
16
Thanks for the replies guys and apologies for the quietness up to now. I did say I'd update when I had more info because hopefully this will be useful to others.

I replaced every single component except for the HDDs and it still crashed. I then booted with a single hard drive installed (with a degraded pool, of course) and waited to see whether it would crash. It did, with each hard drive used on its own. Finally I did a clean install of FreeNAS on a known good single 1TB drive, put some data on it, and it still crashed. At this point every single component from the original stable system had been replaced.

I am currently thinking that the following might be the issue. The Fuji TX150 S7 system that I am using has 2 SATA ports on the motherboard and a SAS connector (SFF-8087, I believe, but I have very little experience of this type of connector so I might be wrong). The system originally had a RAID card connected to a 4-drive bay with the 8087 connector. I removed this card and attached the bay to the connector on the motherboard instead, effectively removing the RAID function from the drive bay. I was very pleased at the time that FreeNAS found all the drives this way but I am wondering whether the SAS controller on the motherboard is a bit rubbish and not getting on with either my 3TB drives or the 4-drive bay (or both).

My concern with this theory is that it doesn't explain why it crashed with the 1TB drive (or does it?). I was reading that some older SAS controllers don't like drives larger than 2TB.
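
A limit of about 2TB on older controllers typically comes from 32-bit sector addressing with 512-byte sectors. The arithmetic is sketched below in Python. Whether the onboard controller in the TX150 S7 actually has this limit is an assumption, not something confirmed in this thread, and the sector size and address width are assumed values.

```python
SECTOR_BYTES = 512   # assumed logical sector size
LBA_BITS = 32        # assumed address width of an older controller

# Largest capacity such a controller can address.
MAX_BYTES = (2 ** LBA_BITS) * SECTOR_BYTES


def fully_addressable(drive_tb, limit_bytes=MAX_BYTES):
    """True if a drive of drive_tb decimal terabytes fits under the limit."""
    return drive_tb * 10**12 <= limit_bytes


print(MAX_BYTES)              # 2199023255552 bytes, i.e. 2 TiB (about 2.2TB)
print(fully_addressable(1))   # True
print(fully_addressable(3))   # False
```

On these assumptions a 3TB drive exceeds the limit but a 1TB drive does not, so a 2TB addressing limit on its own would not account for the crash with the 1TB drive.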

I have ordered a Dell H310 controller that has been pre-flashed to IT mode from a reputable seller who has confirmed that this works with FreeNAS. I'm going to try this with the Fuji system first. If that doesn't work I'm going to install it in a Dell T410 system with a 6-drive bay that I have available. I was planning to use the T410 for another project, but if it works then it will be the FreeNAS box and the Fuji systems can be sold on.

I've also ordered some replacement USB drives with USB 2.0 to get rid of any potential USB 3.1 issues.

If anyone has any thoughts I'm very interested, and if not I'll update when I have info.

Thanks again for the help so far.
 

James Mason

Dabbler
Joined
Jan 21, 2019
Messages
16
Just to finish this off, it is fixed! It seems that it was the onboard SATA controller not liking the 3TB drives. I am now successfully running 8x 3TB drives with an uptime of 5 days currently and no issues.

If anyone has similar problems then the Dell H310 with reflashed firmware has behaved perfectly so I would recommend going down this route.

Thanks for all of your suggestions :smile:
 