FreeNAS-9.10-STABLE-201606072003 SMBD at 100% after update

Status
Not open for further replies.

baggins

Dabbler
Joined
Jun 21, 2016
Messages
22
Hi guys,

This server ran great for months and serves two main groups of users/services:
NFS to a series of Proxmox clusters: everything there is running as well as it was before.
CIFS to an Aimetis farm (video surveillance): the smbd process takes about 15 seconds to hit 100% CPU and all servers slow to a crawl.

I am absolutely losing it, so all help is welcome. Because of all the VMs it's very time consuming to do any reboots, but the system has been down several times today.

Current top:
65 processes: 3 running, 62 sleeping
CPU: 16.0% user, 0.0% nice, 0.1% system, 0.0% interrupt, 83.9% idle
Mem: 173M Active, 508M Inact, 15G Wired, 6940K Cache, 15G Free
ARC: 14G Total, 6908M MFU, 7589M MRU, 3816K Anon, 40M Header, 79M Other
Swap: 8192M Total, 8192M Free

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
31860 root 1 102 0 325M 31480K CPU7 7 0:35 100.00% smbd
31863 root 1 102 0 329M 31732K CPU0 0 0:33 100.00% smbd

The server was last updated about 2-3 weeks ago, again from the 9.10-STABLE train, and ran fine until this morning.

The server is a Dell T320 with 12 cores, 32 GB of RAM and 26 TB of disk on hardware RAID.

I have been working on this all day; so far none of the changes have improved anything.

Current CIFS GUI auxiliary parameters:

kernel change notify = no
ea support = no
store dos attributes = no
map archive = no
map hidden = no
map readonly = no
map system = no

CIFS Config file:
[global]
server max protocol = SMB3
encrypt passwords = yes
dns proxy = no
strict locking = no
oplocks = yes
deadtime = 15
max log size = 51200
max open files = 941546
logging = file
load printers = no
printing = bsd
printcap name = /dev/null
disable spoolss = yes
getwd cache = yes
guest account = nobody
map to guest = Bad User
obey pam restrictions = yes
directory name cache size = 0
kernel change notify = no
panic action = /usr/local/libexec/samba/samba-backtrace
nsupdate command = /usr/local/bin/samba-nsupdate -g
server string = NAS Server
ea support = yes
store dos attributes = yes
lm announce = yes
hostname lookups = yes
unix extensions = no
acl allow execute always = true
dos filemode = yes
multicast dns register = no
domain logons = no
local master = no
idmap config *: backend = tdb
idmap config *: range = 90000001-100000000
server role = standalone
netbios name = NAS2
workgroup = CORP
security = user
pid directory = /var/run/samba
create mask = 0666
directory mask = 0777
client ntlmv2 auth = yes
dos charset = CP437
unix charset = UTF-8
log level = 1
kernel change notify = no
ea support = no
store dos attributes = no
map archive = no
map hidden = no
map readonly = no
map system = no
[xxxxxVolume1]
path = /mnt/xxxxxVolume1
printable = no
veto files = /.snapshot/.windows/.mac/.zfs/
writeable = yes
browseable = yes
vfs objects = zfs_space zfsacl aio_pthread streams_xattr
hide dot files = yes
hosts allow = xxxxx.1.231,xxxxx.232,xxxxx.233,xxxxx.234,xxxxx.10,xxxxx.15.231
guest ok = no
nfs4:mode = special
nfs4:acedup = merge
nfs4:chown = true
zfsacl:acesort = dontcare
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Post a debug tarball by clicking System -> Advanced -> Save Debug.
 

baggins

Dabbler
Joined
Jun 21, 2016
Messages
22
Thanks for the response:

Update:
I just noticed the below on the logs:
[2016/06/21 22:51:07.116656, 0] ../lib/util/debug.c:947(reopen_logs_internal)
Unable to open new log file '/var/log/samba4/log.smbd': Permission denied
[2016/06/21 22:51:07.127216, 0] ../lib/util/debug.c:947(reopen_logs_internal)
Unable to open new log file '/var/log/samba4/log.smbd': Permission denied
[2016/06/21 22:51:07.132428, 0] ../lib/util/debug.c:947(reopen_logs_internal)
Unable to open new log file '/var/log/samba4/log.smbd': Permission denied

I don't believe it was there throughout the day though, so it looks like a recent change; I am debugging that now.

aimetis 38984 100.0 0.1 333828 33908 - R 10:46PM 5:03.95 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
aimetis 41764 100.0 0.1 333828 32252 - R 10:50PM 0:49.87 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
root 30308 0.0 0.1 292508 26748 - Is 8:53PM 0:00.11 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
root 30312 0.0 0.1 288368 24664 - S 8:53PM 0:00.12 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
aimetis 30501 0.0 0.1 333592 33932 - S 8:53PM 3:39.05 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf

Debug package attached.
 

Attachments

  • debug-nas2-20160621225121.tgz
    1.4 MB · Views: 253

baggins

Dabbler
Joined
Jun 21, 2016
Messages
22
We have 4 x 2 TB disks in a hardware RAID 10 and 4 x 8 TB disks in a hardware RAID 5.

The RAID 10 volume is split into two 2 TB volumes, one for VM images and one for video storage.

The RAID 5 disk group contains a single 21 TB logical disk; the video storage volume was expanded to include this volume.

However, all of this was done a while ago, has worked well and plays no role in the issue. The expanded volume is currently at approximately 70% space utilization, and the amount we write to the disks is trivial: it tops out at around 20 MB/s, writes at this moment are around 200 KB/s, and I have two processes pinned at 100%.

After restarting CIFS it takes about 30 seconds before 2 or 3 of the CPU cores are pinned; this varies with the number of video farm members that are writing, and I have most of them turned off at the moment.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
"RAID 10 ... RAID 5" -- are you passing RAID arrays to FreeNAS via a RAID controller? Instead of passing control of the disks directly to FreeNAS?
 

baggins

Dabbler
Joined
Jun 21, 2016
Messages
22
Correct. This way the RAID calculations are handled by a chip specifically designed to handle RAID calculations; FreeNAS sees the logical disks presented by the RAID controllers and does not need to spend any time doing software RAID.

This is not related to the problem though, as I mentioned: this has been working in this configuration for some time, I have been building NAS systems this way for years, and I have even built NAS heads in front of SANs in the same manner.

Unless there is something in the uploaded debug that specifically shows this needs to change, or that it is part of the problem, we are unlikely to change it.

I didn't mention that the specific servers I'm having a problem with are 2012 R2.

I have done some tests from a Windows 7 client and copied a 4 GB file, which pegged a new process/core at 100%, but the file copied at approximately 100 MB/s.

What happens with the 2012 servers is that once one starts and a core hits 100%, all other NAS functions slow to a crawl: directory listings, everything.
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554

1) Spend some time reading stickies here. Hell, read the guide iXsystems put here - http://www.freenas.org/blog/a-compl...are-design-part-i-purpose-and-best-practices/
If you want to run hardware RAID, switch OSes. Don't use ZFS. This is a more fundamental problem than your smbd consuming 100% CPU.

2) You are sharing the same dataset with NFS and Samba with oplocks turned on. This is begging to have your data get corrupted. If this is a requirement for your production environment, switch to using Linux and enable the "kernel oplocks" parameter in Samba (rough sketch below). CentOS and Debian are both good choices.

I sincerely hope this is a personal system and not one that is in production. If this is something you put together for a client, you should fix (1) and (2) ASAP. There have been a few instances of people destroying their client's data by doing this.
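
To illustrate the Linux option, a minimal smb.conf sketch (the share name and path here are placeholders, not anything from this thread; "kernel oplocks" only does something on platforms whose kernel supports it, such as Linux):

Code:
[global]
    # Let the kernel break an SMB oplock when another local process
    # (for example the kernel NFS server) opens the same file.
    kernel oplocks = yes

[camera-data]
    # hypothetical share
    path = /srv/camera-data
    read only = no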
 
Last edited:

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
What @anodos said! I personally think iXsystems should just go ahead and say "Do not use a RAID controller with FreeNAS!", which they don't quite do. Here is the relevant section from the Best Practices guide:
RAID vs. Host Bus Adapters (HBAs)
ZFS wants direct control of the underlying storage that it is putting your data on. Nothing will make ZFS more unstable than something manipulating bits underneath ZFS. Therefore, connecting your drives to an HBA or directly to the ports on the motherboard is preferable to using a RAID controller; fortunately, HBAs are cheaper than RAID controllers to boot! If you must use a RAID controller, disable all write caching on it and disable all consistency checks. If the RAID controller has a passthrough or JBOD mode, use it. RAID controllers will complicate disk replacement and improperly configuring them can jeopardize the integrity of your volume (Using the write cache on a RAID controller is an almost sure-fire way to cause data loss with ZFS, to the tune of losing the entire pool).
In any case, the experts who could help you with your NFS/Samba problems aren't likely to do so in light of your unsupported storage configuration, which is probably the cause of your problems anyway.
 

baggins

Dabbler
Joined
Jun 21, 2016
Messages
22
Hi guys,

I apologize for not putting this in my post. We do not, and would not, actually use NFS and Samba actively on the same volume, and I will delete the NFS configs. I set this up yesterday to run a test to see if we could use NFS and check performance. At no time were hosts ever connected to the volume via both protocols simultaneously, as this would obviously be bad. In fact I never ran the test or connected via NFS, because we found out our apps would not support it anyway.

I'm also not sure about the malpractice statement, as I'm sure we've all got years of storage experience. The data that resides on the volume in question is completely backed up and at this point is useless, so data corruption was not such a concern while trying to find out what's actually wrong.

@anodos: Maybe you're right and FreeNAS is not a good fit here. It's still surprising that this installation, and indeed the 15 to 20 others I run in other areas, never suffered from any issues until yesterday, nor have I seen data corruption. I will delete the NFS configs from the CIFS volume; I should have done it right away when we realized the config/tests were pointless, but it's been a trying 24 hours.

@Spearfoot: You're right, I have been working with FreeNAS for some time and never had any real issues, hence the new user account. But the incompatibility between FreeNAS and hardware RAID would be worth calling out a lot more obviously. By the way, we don't use RAID-based caching, so I guess this may be why we don't see as many issues.

If taking the NFS/Samba-on-the-same-volume concern out of the picture changes further troubleshooting of this issue in any way (my guess, based on what I read above, is that it won't), please let me know, as we need to move quickly to find a solution.

Thanks again for the help everyone.
 

baggins

Dabbler
Joined
Jun 21, 2016
Messages
22
Also, we have done some more troubleshooting. The server farm that connects to the Aimetis volume is 4 servers, and 2 of them don't actually suffer from the problem: when they write, the write clears very quickly and the servers are very responsive. The other 2 servers seem to have the problem; the moment they're brought online they seem to drive the CPUs crazy and suffer horrible performance.

Unfortunately the farm cannot function with only two servers, but this at least seems to point to a problem with those specific servers, not necessarily the NAS. However, this does not take away from what's been mentioned above, and we will keep looking at a better solution overall.
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
A few more thoughts:
  • Try disabling oplocks. The log.smbd indicates that Samba is panicking and that there are repeated errors in operations related to oplocks. I don't believe you should be experiencing oplock breaks from your cameras. This indicates that there is either (1) some buggy code in your security camera system or (2) some cameras trying to overwrite each other.
  • The messages from samba stating
    Code:
    Unable to open new log file '/var/log/samba4/log.smbd': Permission denied
    may indicate problems with your FreeNAS install. It may also indicate that Samba is in the process of crashing horrifically. Perhaps download a backup of your FreeNAS config, install the latest FreeNAS on a fresh USB stick, and apply the config. See if the problem persists.
  • Verify that LACP is properly configured on your switch. Perhaps test without LACP (only one NIC connected per subnet).
  • You can probably improve performance by creating a new dataset with "case sensitivity" set to "insensitive", then creating your Samba share with the parameter (case sensitive = no). This stops Samba from checking whether file names are upper or lower case, which typically reduces CPU use. A rough sketch follows this list.
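
Putting the oplock and case-sensitivity suggestions together, here is a sketch of what that could look like from the shell, assuming a pool named tank and a dataset named cameras (both names are placeholders; on FreeNAS the dataset and share would normally be created through the GUI, with the extra Samba lines entered as auxiliary parameters):

Code:
# case sensitivity can only be chosen when the dataset is created
zfs create -o casesensitivity=insensitive tank/cameras

and the resulting share section would then carry something like:

Code:
[cameras]
    path = /mnt/tank/cameras
    oplocks = no
    case sensitive = no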
 

baggins

Dabbler
Joined
Jun 21, 2016
Messages
22
Thanks Anodos,

I will try what you suggest. I tested LACP already and that did not yield any benefit. I am also updating the camera app code base with a full install, even though we are only a minor revision behind, in case something on those servers got corrupted.

I was just checking to see if disabling oplocks took effect (it did, and the speed at which the CPUs climb has slowed a bit, but they still hit 100%), and I see this in the logs:

[2016/06/22 16:40:18.218485, 0] ../source3/smbd/server_exit.c:162(exit_server_common)
exit_server_common: smbXsrv_session_logoff_all() failed (NT_STATUS_NOT_FOUND) - triggering cleanup
[2016/06/22 16:40:18.218540, 0] ../source3/smbd/smbXsrv_open.c:1047(smbXsrv_open_close)
smbXsrv_open_close(0xe50f59e3): failed to delete global key 'E50F59E3': NT_STATUS_NOT_FOUND
[2016/06/22 16:40:18.218583, 0] ../lib/util/fault.c:78(fault_report)
===============================================================
[2016/06/22 16:40:18.218605, 0] ../lib/util/fault.c:79(fault_report)
INTERNAL ERROR: Signal 11 in pid 97596 (4.3.6-GIT-UNKNOWN)
Please read the Trouble-Shooting section of the Samba HOWTO
[2016/06/22 16:40:18.218634, 0] ../lib/util/fault.c:81(fault_report)
===============================================================
[2016/06/22 16:40:18.218656, 0] ../source3/lib/util.c:789(smb_panic_s3)
PANIC (pid 97596): internal error
[2016/06/22 16:40:18.219065, 0] ../source3/lib/util.c:900(log_stack_trace)
BACKTRACE: 30 stack frames:
#0 0x80303f9b8 <smb_panic_s3+152> at /usr/local/lib/libsmbconf.so.0
#1 0x800a704c5 <smb_panic+53> at /usr/local/lib/libsamba-util.so.0
#2 0x800a70aaa <smb_panic+1562> at /usr/local/lib/libsamba-util.so.0
#3 0x800a70483 <fault_setup+115> at /usr/local/lib/libsamba-util.so.0
#4 0x80084298a <pthread_sigmask+1306> at /lib/libthr.so.3
#5 0x80084206c <pthread_getspecific+3580> at /lib/libthr.so.3
#6 0x7ffffffff193
#7 0x80101c003 <smbXsrv_open_close+2419> at /usr/local/lib/samba/libsmbd-base-samba4.so
#8 0x80101a7af <smbXsrv_open_create+3759> at /usr/local/lib/samba/libsmbd-base-samba4.so
#9 0x804858e7c <talloc_unlink+4220> at /usr/local/lib/libtalloc.so.2
#10 0x804858d33 <talloc_unlink+3891> at /usr/local/lib/libtalloc.so.2
#11 0x804858d33 <talloc_unlink+3891> at /usr/local/lib/libtalloc.so.2
#12 0x80101f3ad <smbd_exit_server+2125> at /usr/local/lib/samba/libsmbd-base-samba4.so
#13 0x80101f56c <smbd_exit_server_cleanly+28> at /usr/local/lib/samba/libsmbd-base-samba4.so
#14 0x8036ddbba <exit_server_cleanly+42> at /usr/local/lib/samba/libsmbd-shim-samba4.so
#15 0x800fb73dd <smbd_setup_sig_term_handler+173> at /usr/local/lib/samba/libsmbd-base-samba4.so
#16 0x804a697b7 <tevent_common_check_signal+231> at /usr/local/lib/libtevent.so.0
#17 0x803063384 <run_events_poll+52> at /usr/local/lib/libsmbconf.so.0
#18 0x803064735 <event_add_idle+2165> at /usr/local/lib/libsmbconf.so.0
#19 0x804a658e2 <_tevent_loop_once+114> at /usr/local/lib/libtevent.so.0
#20 0x804a65b1b <tevent_common_loop_wait+59> at /usr/local/lib/libtevent.so.0
#21 0x800fbd0df <smbd_process+3631> at /usr/local/lib/samba/libsmbd-base-samba4.so
#22 0x40c2a8 <main+17128> at /usr/local/sbin/smbd
#23 0x803063a06 <run_events_poll+1718> at /usr/local/lib/libsmbconf.so.0
#24 0x803064854 <event_add_idle+2452> at /usr/local/lib/libsmbconf.so.0
#25 0x804a658e2 <_tevent_loop_once+114> at /usr/local/lib/libtevent.so.0
#26 0x804a65b1b <tevent_common_loop_wait+59> at /usr/local/lib/libtevent.so.0
#27 0x40ad8f <main+11727> at /usr/local/sbin/smbd
#28 0x409fe4 <main+8228> at /usr/local/sbin/smbd
#29 0x40681f <_start+367> at /usr/local/sbin/smbd
[2016/06/22 16:40:18.219417, 0] ../source3/lib/util.c:801(smb_panic_s3)
smb_panic(): calling panic action [/usr/local/libexec/samba/samba-backtrace]
/usr/local/libexec/samba/samba-backtrace: /usr/local/bin/gdb711: not found
/usr/local/libexec/samba/samba-backtrace: /usr/local/bin/gdb711: not found
/usr/local/libexec/samba/samba-backtrace: /usr/local/bin/gdb711: not found
[2016/06/22 16:40:18.226717, 0] ../source3/lib/util.c:809(smb_panic_s3)
smb_panic(): action returned status 0
[2016/06/22 16:40:18.226767, 0] ../source3/lib/dumpcore.c:318(dump_core)
dumping core in /var/db/system/cores
 

baggins

Dabbler
Joined
Jun 21, 2016
Messages
22
Also, I don't see the error "Unable to open new log file '/var/log/samba4/log.smbd': Permission denied" any more.

The only changes on the FreeNAS server were to delete the NFS share on the CIFS volume and to apply oplocks = no.
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
If the camera server update doesn't fix anything, try:
1) Back up current FreeNAS config through webgui.
2) Install FreeNAS on fresh USB stick
3) Apply backed-up FreeNAS config
 

baggins

Dabbler
Joined
Jun 21, 2016
Messages
22
Just for argument's sake, if this does not work, what would be the best configuration? Should I reset the RAID config to JBOD, given that the server has the memory for ZFS RAID-Z?

We have 4 x 2 TB and 4 x 8 TB disks; what I would need to create is a system that presents the following:

1 x 2 TB volume shared via CIFS for domain users
1 x 2 TB volume shared via NFS for the Proxmox clusters (KVM VMs)
1 x 24 TB volume shared via CIFS for camera data

Can this be done with 2 underlying RAID-Z volumes of 4 disks each? Roughly what I have in mind is sketched below.
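
Purely as a sketch with made-up device and pool names (on FreeNAS this would normally be done through the volume manager in the GUI rather than from the shell):

Code:
# 4 x 2 TB disks in one RAID-Z1 vdev/pool
zpool create tank1 raidz da0 da1 da2 da3
zfs create tank1/users      # ~2 TB, CIFS share for domain users
zfs create tank1/vmimages   # ~2 TB, NFS export for the Proxmox clusters

# 4 x 8 TB disks in a second RAID-Z1 vdev/pool
zpool create tank2 raidz da4 da5 da6 da7
zfs create tank2/cameras    # camera data, CIFS share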
 

baggins

Dabbler
Joined
Jun 21, 2016
Messages
22
I tried removing the aio_pthread VFS object; however, it seems to have made the build-up to 100% faster. I re-added the object and the build-up seems slower? The exact line change is shown below.
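
For reference, the change was only to the share's vfs objects line in smb4.conf (before and after):

Code:
# original line, with aio_pthread
vfs objects = zfs_space zfsacl aio_pthread streams_xattr

# test, with aio_pthread removed
vfs objects = zfs_space zfsacl streams_xattr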
 