Hard system locks while working in jails

Status
Not open for further replies.

kaseg

Dabbler
Joined
Feb 23, 2015
Messages
15
Edit 2016-01-11: See my latest posts; I am no longer under the impression that this issue is related to high-load operations. The system has also long since been upgraded to 8GB RAM.

I'm having a problem where CPU- and/or IO-intensive loads are causing system lockups. I'm running the FreeNAS 9.3 train, up to date as of 2015/11/04. The problem has occurred for months.

Scenario:

1) SSH into the FreeNAS machine or one of its jails.
2) Attempt a long-running process such as compressing a 50+GB file, or uploading very large files (1GB+) to S3, which takes hours on my internet connection.

Problem:

Approx 80% of the time, the system becomes completely unresponsive within anywhere from 1 minute to an hour or so. All shares become inaccessible. The web interface will not load. I cannot SSH in. The only remedy is a hard reset via the power button.

We have had 99.9% uptime otherwise over 6+ months. 0 SMART problems. 0 scrub issues. Neither the boot volume nor the storage pool is more than 50% full. The only potential issue I can think of is that we are operating with 4GB ECC RAM, which is less than the recommended amount (although we only have 4TB storage).

Any thoughts? It feels like a hardware problem, but I'm unsure where to start.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
"which is less than the required (not merely recommended) amount"

Here's your problem.

It may be a thermal issue, but I'm 99% sure the RAM is the problem.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Looks like a hardware problem: you seem to be missing some hardware. Look into getting all your RAM installed.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
I concur. 4GB is known to (literally) be insufficient on FreeNAS 9.3. When a FreeNAS box is under-RAMmed, all sorts of counterintuitive and seemingly illogical problems can crop up. I strongly recommend you bring your system to 8GB. No one will want to spend much time diagnosing your problem if your system does not meet the minimum specifications.
 

kaseg

Dabbler
Joined
Feb 23, 2015
Messages
15
Thanks for the input everyone -- I upgraded to 8GB, which was a good idea regardless of this issue. Ran a full memtest86 run, and the RAM is all OK.

I retried something that fairly consistently causes my problem: uploading a large number of files to S3 from a jail. Unfortunately, after 10-20 hours the same issue occurred -- the web interface and SSH became completely inaccessible, and file shares were up but extremely slow. As I had no other way to resolve the issue, I had to manually power cycle the box again.

CPU temperatures seem normal (about 35C according to sysctl), and these uploads don't cause much CPU use anyway.
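
For anyone who wants to log those readings over a multi-hour upload rather than check them once, here is a minimal Python sketch. It assumes the sysctl temperature lines look like `dev.cpu.0.temperature: 35.0C`; the function names and the 70C threshold are purely illustrative and not from this thread.

```python
import re

# Matches lines such as "dev.cpu.0.temperature: 35.0C" (assumed output format).
_TEMP_LINE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*(-?\d+(?:\.\d+)?)C$")


def parse_cpu_temps(sysctl_output):
    """Return {core_index: temperature_in_celsius} parsed from sysctl text."""
    temps = {}
    for line in sysctl_output.splitlines():
        match = _TEMP_LINE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps


def hot_cores(temps, limit_c=70.0):
    """Return the sorted core indexes at or above limit_c (illustrative threshold)."""
    return sorted(core for core, temp in temps.items() if temp >= limit_c)


if __name__ == "__main__":
    sample = "dev.cpu.0.temperature: 35.0C\ndev.cpu.1.temperature: 36.0C\n"
    print(parse_cpu_temps(sample))
```

The text would come from something like `sysctl dev.cpu` run on the host at a fixed interval, with each parsed result appended to a file so the last readings before a lockup are still there after the reset.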

I can't expect the FreeNAS community to support arbitrary tools installed in jails (awscli installed from ports in my case), but it seems a little absurd that a Python script to upload files can completely bring down a server, requiring a hard reset.

Here's what memory usage looks like:

[Image: memory usage graph, hpwLNod.png]

The first gap is where I upgraded the RAM yesterday, the second gap is where the system had become unresponsive and had to be reset today. As you can see, swap is essentially never touched -- most of that RAM usage is the ARC (which was using about 6GB before the system died).

Would love any input. I'm at the point where I am considering throwing something like a Raspberry Pi into the server closet to handle these long uploads. At least if it goes down, it won't take the NAS offline (I hope).
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You should list all your hardware components (motherboard, NICs, CPU, etc.). Are you encrypting? How full is your pool (yeah, you said 50%, just double-checking)? Put in as much info as you can about your system configuration as well; this may help us out. When the system was running fine, what version of FreeNAS were you running? If you go back to that version, is the problem solved?

Realize that this type of problem "could" be the result of the NIC failing.

EDIT: I was just thinking about SAMBA. It could be related to a SAMBA upgrade, although the SSH failure doesn't fit then.
 

kaseg

Dabbler
Joined
Feb 23, 2015
Messages
15
Full system information:

Lenovo ThinkServer TS140
CPU: Intel Core i3 4130
RAM: 2 x 4GB ECC DDR3 1600MHz
Drives: 3 x 2TB WD Red, configured as RAIDZ1.
Storage: No encryption, lz4 compression on, deduplication off
NIC: Intel Gigabit (unsure exact chipset)

Pool is 55% full today.

I have successfully run the sort of tasks that are causing me issues once or twice, but not in the past few months. To clarify, the system is seemingly 100% stable at just basic NAS tasks. No problems sharing files, doing large system image backups, etc. I have never seen a lockup in these cases. A nightly sync to S3 of some important (but small) files has never failed, using the same awscli tool, to my knowledge. Best uptime was 30 days, and I only restarted then for a system update.

The only time I am seeing system lockups is with processes that run for multiple hours, like compressing large files, or uploading large files over a slow internet connection.

Edit: here is a pastebin of pciconf -lvv: http://pastebin.com/ffWfBDfH
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Well, if it's not RAM starved, then the only thing that makes sense to me for why it would take a crap over a long period would be overheating components. Overheating NIC, overheating chipset, something like that. Or, perhaps, it's not the FreeNAS box at all, but one of the pieces of network equipment that starts going off after hours of near-max throughput.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I agree with DrKK, it's likely a component causing the failure. But in order to rule out software, I still suggest you roll back to a point in time where you knew it worked without issue and test again. If it fails, it's not the software. This is actually an important thing you should rule out if it's fairly easy to do.

As for other components wearing out/overheating, it will not be instant gratification tracking down the problem but you will need to do it to solve the issue.

What do the log files say? Maybe you can dump those to the forum (scan them for sensitive info first).
 

kaseg

Dabbler
Joined
Feb 23, 2015
Messages
15
Here's the "messages" syslog containing the boot, lockup and subsequent reset: http://pastebin.com/uypwqa4r. The lockup occurs around line 287. I could post some of the other log files, but I didn't see much in them, and unfortunately they stopped being written to around the time the system became unresponsive (about 30-40 mins before I reset it).
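
Since the logs go silent when the box locks up, one way to pin down when that happened is to look for the longest gap between consecutive timestamps. This is a rough Python sketch, assuming the traditional syslog line prefix `Mon DD HH:MM:SS`; those lines carry no year, so the year is passed in as an assumption, and the function names are made up for illustration.

```python
from datetime import datetime


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line; None if absent."""
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    try:
        return datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None  # continuation line or some other format


def largest_gap(lines, year):
    """Return (gap_seconds, line_number_after_gap) for the longest silence.

    Returns None when fewer than two lines carry a timestamp.
    """
    best = None
    previous = None
    for number, line in enumerate(lines, start=1):
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if previous is not None:
            gap = (stamp - previous).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, number)
        previous = stamp
    return best
```

Run over a saved copy of the messages file, the returned line number is the first entry written after the silence, which for a lockup followed by a reset should be the start of the next boot.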

I don't see much of interest here, except that smbd likes to complain, a lot. I don't see why it would be related.

I will start on the process of ruling different things out later this week. Thanks again to everyone for their help and input.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Well that CIFS/smbd is certainly throwing a lot of errors. The configuration on that needs to be looked at.

May I suggest that you transfer a similar amount of data to what was causing the problem, but over FTP? You see, if there's no problem with that, then the eye is drawn to smbd. I don't know if the other guys concur or not, but my instinct here is to DISABLE the CIFS service entirely, set up the FTP service temporarily, turn that on, upload a similarly huge amount of data, and see what we see.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Maybe SAMBA needs to be rolled back to version 3 or some customization to SAMBA 4 needs to take place? Oops, gotta run, work is calling me.
 

kaseg

Dabbler
Joined
Feb 23, 2015
Messages
15
Well, this problem has hit again, and without any CPU/IO-intensive tasks running. I was just upgrading packages in a jail, and when I came back to the SSH session a couple of minutes later the server was completely hard-locked and unresponsive (no web access, no SSH access, nothing). The system had to be hard reset.

My worry now is that eventually one of these hard resets is going to lead to data loss. It is very difficult to troubleshoot this problem, as months can go by with no issues at all (I have had zero problems with this system since my posts last November). A number of times in the past 2 months I've done things that I thought had a good chance of triggering this problem (multi-hour file compression, multi-day file uploads) without issue, so I'm at a loss as to the underlying cause. So far, the only common factor I can think of is that every time this issue has occurred, I have been working in a jail. I edited the title to reflect this new conclusion.

I still agree with the many posters suggesting this is a hardware problem, but in my many years of doing IT work, I can't say I've ever come across a hardware issue that takes months between occurrences to manifest itself.

Right now the only thing I can think to do is to completely replace the motherboard, if I can even find a suitable replacement -- I can't think what else would be causing this problem unless it is a super subtle software or driver issue that just doesn't hit many people.

Anyway, do you all still think this is a hardware problem? How likely am I to experience data loss if I have to hard reset the system every month or two?
 

SinDeus

Explorer
Joined
Sep 3, 2013
Messages
65
Damn, it sure looks a lot like the issue I had... Even to the Samba message logs.

TL;DR: I had the same symptoms as you. I shut down the newest jail I had created (the problems started after I created it), and so far so good.

I've been told to replace my network card, as my current brand is known to behave badly in FreeBSD under pressure. The new card arrived yesterday; I will install it soon and try starting up this ominous jail. I'll let you know.
 