SOLVED How Do You Research a Samba Crash?

Status
Not open for further replies.
Joined
Jul 27, 2017
Messages
16
I am new to the forum, but I've been using FreeNAS for about a year.

I am running FreeNAS-11.0-U2 (e417d8aa5) on ESXi (with direct disk access, of course), with RAIDZ2 across 7 x 4 TB hard drives, 28 GB of memory, and an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (x8). As of now, we're storing about 13 TiB.

We have a lot going on: filming, photography and editing, document sharing, software development, and video encoding. In addition to this, we have backups being made daily to S3 using s3cmd (which also does checksums on the new data). Most of our clients are Windows 10, and a few are Apple machines, but they use SMB anyway to keep relative path names consistent.

For the past two weeks, the SMB service has randomly been crashing, sometimes only minutes after re-enabling it. When it crashes, I have to manually turn the service back on from the UI.

I've set the logs to debug and see things like "PANIC (pid 5720): invalid lock_order" and "smb_panic(): action returned status 0". I would include the log, but 3 seconds' worth is about 40 MB.
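For anyone stuck with a similarly huge debug log, here is a minimal sketch of how the panic lines could be pulled out and counted so they are small enough to post. It assumes the panics show up as plain text lines like the two quoted above; the function names are made up for illustration, and the log path is whatever your install uses.

```python
import re
from collections import Counter
from typing import Iterable, Iterator

# Markers taken from the two messages quoted in the post; extend as needed.
PANIC_PATTERN = re.compile(r"PANIC \(pid \d+\)|smb_panic\(\)")


def panic_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that mention a Samba panic, without the newline."""
    for line in lines:
        if PANIC_PATTERN.search(line):
            yield line.rstrip("\n")


def summarize(lines: Iterable[str]) -> Counter:
    """Count distinct panic messages, ignoring the pid so repeats group together."""
    counts: Counter = Counter()
    for line in panic_lines(lines):
        key = re.sub(r"pid \d+", "pid N", line.strip())
        counts[key] += 1
    return counts


def print_summary(path: str) -> None:
    """Stream a (possibly very large) log file and print each panic with its count."""
    with open(path, errors="replace") as handle:
        for message, count in summarize(handle).most_common():
            print(count, message)
```

Calling print_summary() with the path to the smbd log reads the file line by line, so a 40 MB log is not loaded into memory at once.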

It seems to happen only when it's under a heavy load and only in the last two weeks. So my question is, where do I even start trying to figure out what is causing this problem? Do I submit a bug or is there a setting that I'm missing?
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,478
Have you gone through and thought of what has changed in the last two weeks? Did you update FreeNAS or client machines?

Do you have the server set up with only one NIC connected? What motherboard are you using?

You said the time between crashes seems to be random? What is the longest it has gone in the last two weeks without crashing that you are aware of?

Do you have any custom parameters set in FreeNAS for the SMB service? (auxiliary parameters)
 
Joined
Jul 27, 2017
Messages
16
Yes actually. About two weeks ago, I ran into some strange problems with permissions being completely cleared (this was on 9.10).

I added the acl_tdb VFS object because it seems to deal with ACLs and the documentation isn't super descriptive.

It started crashing then and I decided to upgrade to version 11 to see if that would help, but no such luck.

I tried removing some and then eventually ALL VFS objects to see if that would help, and now that I'm looking, they're all still on there. Even if I remove them, they stay.

I tried oplocks = no as an auxiliary parameter, with no luck. I also tried disabling DOS features, but it still crashed.

When researching invalid lock_order, I mostly find forum threads from 2014 or earlier, with bug reports long since patched.


EDIT: The motherboard is virtual, there's only one NIC on my server, and as of right now I do not have any custom parameters in the SMB configuration settings.
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,478
> About two weeks ago, I ran into some strange problems with permissions being completely cleared (this was on 9.10).
What do you mean by this? I'm not sure what you mean by "completely cleared"?

How many datasets are you sharing through SMB, and is it happening with all shares or specific ones?

Things can get complicated when you're not running on bare metal but I will assume you know what you are doing with ESXi (I have little experience with that, maybe someone on here can help you eliminate that variable).

Things rarely break for no reason (not impossible though). I would try to trace back the changes that have occurred, and if you have kept records of updates, patches, etc. on both server and client machines, then start there. If you don't have good records or are unsure, could you restore a version of your config file from before the problem started happening?
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
> Yes actually. About two weeks ago, I ran into some strange problems with permissions being completely cleared (this was on 9.10).
>
> I added the acl_tdb VFS object because it seems to deal with ACLs and the documentation isn't super descriptive.
That'd be why you're crashing - https://bugzilla.samba.org/show_bug.cgi?id=11761

TL;DR, acl_tdb appears to be broken. Use ZFS ACLs. If for some reason ZFS ACLs aren't suitable for your workflow, use acl_xattr.
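If the UI can't be trusted to show what is really configured, one way to check which shares still load acl_tdb is to scan the generated smb.conf text directly. Below is a rough sketch, assuming the usual smb.conf layout of [section] headers followed by "name = value" lines; the function names and the share names in the usage note are hypothetical.

```python
from typing import Dict, Iterable, List

# Module reported as broken in the Samba bug linked above.
SUSPECT_MODULES = frozenset({"acl_tdb"})


def vfs_objects_by_share(conf_text: str) -> Dict[str, List[str]]:
    """Map each [section] of smb.conf-style text to its 'vfs objects' list."""
    shares: Dict[str, List[str]] = {}
    section = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # skip blanks and comments
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            shares.setdefault(section, [])
        elif "=" in line and section is not None:
            key, value = line.split("=", 1)
            # smb.conf parameter names ignore case and internal whitespace.
            if "".join(key.lower().split()) == "vfsobjects":
                shares[section] = value.replace(",", " ").split()
    return shares


def shares_using(conf_text: str, modules: Iterable[str] = SUSPECT_MODULES) -> List[str]:
    """Return the sorted names of sections that load any of the given modules."""
    wanted = set(modules)
    found = vfs_objects_by_share(conf_text)
    return sorted(name for name, mods in found.items() if wanted.intersection(mods))
```

Given a config with a [media] share listing "vfs objects = zfsacl acl_tdb" and a [docs] share listing only zfsacl, shares_using() would return just the media share.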

> I tried removing some and then eventually ALL VFS objects to see if that would help, and now that I'm looking, they're all still on there. Even if I remove them, they stay.
That sounds broken. Perhaps back up your config, reinstall, then restore the config.
 
Joined
Jul 27, 2017
Messages
16
> What do you mean by this? I'm not sure what you mean by "completely cleared"?

I had previously set up permissions in Windows from a client machine. I started off with everyone having access to everything and slowly added restrictions on accounts. Most of my team have their own accounts in FreeNAS, but a subset of part-time and project folks share a single user (because they also share computers interchangeably). One day, without any explanation, everyone suddenly had access to everything again. That's what I mean by "completely cleared".

> How many datasets are you sharing through SMB, and is it happening with all shares or specific ones?

I have 12 shares. I don't know if it's specific shares or not, because the whole service crashes. As a stop-gap, I've created a cron job that starts smbd every minute. (If it is already running, the job does nothing.)
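The stop-gap can be sketched roughly like this. It is not the actual cron job: the pgrep check and the samba_server service name are assumptions about the system, so adjust both to match yours.

```python
import subprocess
from typing import Callable


def smbd_is_running() -> bool:
    """Return True if a process named exactly 'smbd' exists (assumes pgrep is available)."""
    result = subprocess.run(["pgrep", "-x", "smbd"], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def start_smbd() -> None:
    """Start the SMB service; the service name here is an assumption."""
    subprocess.run(["service", "samba_server", "start"], check=True)


def ensure_running(
    is_running: Callable[[], bool] = smbd_is_running,
    start: Callable[[], None] = start_smbd,
) -> bool:
    """Start the service only when it is down; return True if a start was attempted."""
    if is_running():
        return False
    start()
    return True
```

Run from cron once a minute, ensure_running() is a no-op while smbd is up. It only masks the crash, so it is useful for keeping users working, not as a fix.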

> That sounds broken. Perhaps back up your config, reinstall, then restore the config.

That's what I'll be doing tonight. Actually, I'll be putting ESXi on another machine to run my firewall, phone system, etc. and then using my current server for FreeNAS only. If I can get it working, I'll try learning more about bhyve and see if it's a viable option or if I just need two machines. I'll update with the results.
 
Joined
Jul 27, 2017
Messages
16
Well, I've finished reinstalling the operating system from scratch. I imported the settings and volumes and all of these issues seem to have been resolved now. I have a few other issues, like block size: 512B configured, 4096B native, but my research indicates this can't be fixed without recreating the volume. I hope this isn't a major problem because it will be a while before I'm even able to do that.
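For context on that block size message: ZFS records a vdev's sector size as a power-of-two exponent (ashift), so 512B corresponds to 9 and 4096B to 12. A small sketch of the arithmetic, with made-up helper names:

```python
def ashift_for(sector_bytes: int) -> int:
    """Return log2 of a power-of-two sector size, e.g. 512 -> 9."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_bytes.bit_length() - 1


def is_undersized(configured_bytes: int, native_bytes: int) -> bool:
    """True when the pool's configured block size is smaller than the disk's native one."""
    return ashift_for(configured_bytes) < ashift_for(native_bytes)
```

With the values from the warning, is_undersized(512, 4096) is true. That matches the point that this is fixed at creation time, since the value is set when the vdev is created.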

Thank you all for your help.
 