Certain plugins hang the whole system

emarj

Dabbler
Joined
Feb 7, 2018
Messages
23
Hello everyone,

I have two different systems (see my signature) in two separate locations, running FreeNAS (11.2-U6/U7) with different specs. Both serve files via Samba shares and run Syncthing just fine. They back each other up remotely via ZFS replication over a ZeroTier tunnel. This base configuration is quite stable: one machine has been running well for almost two years, the other is newer and has been fine for 3 months.

Long story:
At one point I started experimenting with the Nextcloud plugin. I mounted my shares into the jail and used the "External storage" feature in Nextcloud. On the LAN everything worked fine (apart from some minor permission issues). At the same time I installed Emby to share some movies (2 TB) and music (50 GB) locally and, again, everything was OK.

Many things happened. I'll try to put my memories in order and summarize a bit.

One weekend different things happened on System1:
  1. I opened Nextcloud to the internet using a Caddy jail with a reverse proxy
  2. I added a 1TB mp3 music library to Emby
  3. A few days earlier we had changed ISP and a new modem/router was installed
I was able to connect to Nextcloud remotely and everything was working. Emby was scanning the library and I left it to finish. I then had to leave the location where System 1 is. The day after, I tried to connect remotely to Nextcloud on System 1 and got no response.

At first I thought it was a router fault. The machine was not reachable through ZeroTier either.
At one point the local users told me that the SMB shares were not reachable. I kept thinking the issue was the router.
Eventually I managed to connect to the BMC/IPMI interface (I plan to secure it further with a VPN, but at the moment it is just exposed over HTTPS on a high port) and the FreeNAS console itself was not responding (!). At the time it seemed that the issue was an intermittent link, but I was still worried by the weird behavior. I mean, the system should stay up and not hang even if the link is intermittent. I described the problem here.

Since I was 300km away from System 1, to distract myself I started to replicate steps 1 and 2 on my home machine, System 2.
Again, at first everything was working fine, but after a few hours Nextcloud stopped responding, the FreeNAS web UI was slow as hell and I could not reach the jails tab. At one point the system stopped responding altogether. Since I don't have BMC/IPMI there, I connected a monitor to the server, but I was not getting any VGA feed. I ended up restarting the server and quickly opening the shell to stop the Nextcloud jail, since it was set to autostart. The system was fine again. After a few hours the system started to slow down again, even though the Nextcloud jail was not running. The FreeNAS web UI was accessible but extremely slow. I managed to load the "Display system processes" page and nothing was pinning the CPU, yet the system was still slow as hell. Samba shares were barely accessible. It took me 2 minutes to load the jails page and stop Emby. After that, the system went back to normal.

Needless to say, when I went back to where System 1 is, and after replacing the modem/router because you never know, I experienced the same behavior as on System 2. So the router is not the problem, even though for the moment I am sticking with the new one in order to reduce the variables.

So I am left with two different systems manifesting the same weird problem: the Emby and Nextcloud jails are, in certain circumstances, choking the system "silently".

If you have read this far, thank you very much. This whole thing is one of those you need to write down somewhere to feel a bit better!

The problem

On both systems, Nextcloud and Emby jails, under certain circumstances, "silently" choke the whole system.

The symptoms
  1. Samba shares extremely slow/not working
  2. FreeNAS UI extremely slow/not working
  3. FreeNAS Console (!) extremely slow/not working
  4. SSH extremely slow/not working
  5. (No output from VGA port, but this might be an issue with the HPE Microserver)
BUT
  • "Display System processes" not showing anything weird
  • Netdata and FreeNAS UI, when working, are not showing high CPU utilization

The possible causes

With Emby the problem appears to be, as said here, the new 1TB library of mp3 files, which I mounted on both systems. After deleting the mount from the jail, the system seems stable. I want to point out, though, that the system works fine with a 2TB library of movies and a 50GB library of mp3s.
I could not find a plausible explanation of what happened with Nextcloud. It seems ridiculous, but the triggering factor was putting it behind an HTTPS reverse proxy. I didn't experiment much because it used to hang the system completely, forcing me to hard reset it. I wrote "used to" because in the meantime I also updated the system from 11.2-U6 to U7 and I kind of lost track of its status. Now it apparently does not work at all. Of course this doesn't say much, I know. I should try to replicate this in a new jail, but this issue is so frustrating that so far I haven't experimented further.

This is driving me nuts. In both cases some internal problem makes the whole system hang, and this is quite serious. Jails should mitigate this, if not prevent it.


What to do? Where to post debug archives?
Apart from experimenting further with the plugins and opening discussions on the specific (sub)forums for Emby and Nextcloud, I would like to know what causes this from the base system perspective, i.e. FreeNAS/FreeBSD. Again, I don't think it is desirable that plugins/jails can have such a destructive impact on the base system.

After these events I saved the debug archives from the FreeNAS UI and tried to have a look at the logs, but I am not qualified to do that.

Can someone have a look at them? Where could I post them? Do they contain sensitive information that I need to strip out?
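On the question of stripping sensitive information: a minimal sketch of how log text could be scrubbed before posting, assuming that IPv4 and MAC addresses are the main concern. The function name `redact` and the placeholder strings are made up for illustration. Hostnames, usernames, share names, e-mail addresses and IPv6 addresses are not covered, so the result would still need a manual read-through.

```python
import re

# Patterns for values that commonly identify a network. This list is an
# assumption and is NOT exhaustive: IPv6 addresses, hostnames, usernames
# and share names are left untouched.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def redact(text: str) -> str:
    """Replace MAC and IPv4 addresses in `text` with fixed placeholders."""
    text = MAC.sub("xx:xx:xx:xx:xx:xx", text)
    text = IPV4.sub("x.x.x.x", text)
    return text
```

For example, `redact(open("messages").read())` would return a scrubbed copy of an extracted log file, where `messages` stands for whatever file from the archive is about to be posted.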

Thank you so much,

Marco

Related Discussions
https://www.ixsystems.com/community...ink-makes-freenas-console-unresponsive.80120/
https://www.ixsystems.com/community...by-plugin-whole-freenas-system-is-busy.80046/
 

dlavigne

Guest
Anything in /var/log/messages when the slowdowns occur?
8GB RAM is probably a bottleneck on that system and, depending upon workload, 16GB may be as well.
What's the full output of ifconfig (within code tags)?
Any other jails/VMs running on those systems, or just one plugin per system?
 

emarj

First of all thanks a lot for the reply.

It was a bit stupid of me not to look at this carefully before, sorry for that.
I just looked at the 3 different log dumps I have, and in all of them this line is spammed in /log/messages around the time the slowdown occurred:
Code:
collectd[3862]: rrdcached plugin: Failed to reconnect to RRDCacheD at unix:/var/run/rrdcached.sock: Unable to connect to rrdcached: No such file or directory (status=2)

Could this be the cause? I found this.
EDIT: But maybe this is just due to shutting down the machine, since I see it mixed in with shutdown messages. Shutting down is usually what I did after a lock-up. I'm doing some experiments and will post an update soon.
EDIT2: That message is spammed only during the shutdown process, for 5 seconds, so it should not be the cause.
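The check of when a message is spammed can be sketched roughly as below: count the matching lines per minute and see whether they only cluster around the shutdown. This assumes the traditional syslog prefix ("Mmm dd HH:MM:SS") at the start of each line; the function name `matches_per_minute` is made up for illustration.

```python
from collections import Counter
from typing import Iterable


def matches_per_minute(lines: Iterable[str], needle: str) -> Counter:
    """Count lines containing `needle`, bucketed by syslog minute.

    Assumes every line starts with "Mmm dd HH:MM:SS", so the first
    12 characters (e.g. "Dec 12 08:15") identify the minute.
    """
    buckets = Counter()
    for line in lines:
        # Skip lines too short to carry a full timestamp.
        if needle in line and len(line) >= 15:
            buckets[line[:12]] += 1
    return buckets
```

Calling it on the open log file with `"rrdcached plugin"` as the needle and printing the result sorted by key gives a per-minute count, which makes it easy to see whether the message shows up during the slowdown or only at shutdown.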


The only other jail running all the time is Syncthing, plus sometimes a Caddy reverse proxy jail. No VMs.
I used to keep an eye on memory usage and didn't spot any issues. I should monitor it while trying to reproduce the problem...
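A minimal sketch of how memory could be monitored in a way that survives a hang: sample a few FreeBSD sysctls at a fixed interval and append them to a file, syncing after every write so the last samples before a lock-up are still on disk after a hard reset. The sysctl names are the standard FreeBSD ones as far as I know, but whether they are all present on 11.2 is an assumption (check with `sysctl -a`). The function names and the log path in the usage note are hypothetical.

```python
import os
import subprocess
import time

# FreeBSD sysctl names; assumed to exist on this system.
SYSCTLS = (
    "vm.stats.vm.v_free_count",      # free memory, in pages
    "vm.stats.vm.v_wire_count",      # wired memory, in pages
    "kstat.zfs.misc.arcstats.size",  # current ARC size, in bytes
)


def read_sysctl(name):
    """Return the value of one sysctl by running `sysctl -n NAME`."""
    out = subprocess.check_output(["sysctl", "-n", name],
                                  universal_newlines=True)
    return out.strip()


def sample(names=SYSCTLS, reader=read_sysctl):
    """Read every sysctl in `names` and return a name -> value dict."""
    return {name: reader(name) for name in names}


def append_sample(path, values, now=None):
    """Append one timestamped line to `path` and force it to disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    fields = " ".join("{}={}".format(k, v) for k, v in values.items())
    with open(path, "a") as log:
        log.write("{} {}\n".format(stamp, fields))
        log.flush()
        os.fsync(log.fileno())  # make sure the line survives a hard reset


def monitor(path, interval=10.0):
    """Sample forever, one line every `interval` seconds."""
    while True:
        append_sample(path, sample())
        time.sleep(interval)
```

Something like `monitor("/mnt/tank/memlog.txt")` (hypothetical path on the pool) could be left running in a tmux session before starting the jails. The last lines of the file after a reset then show what free memory and ARC size looked like just before the hang.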


ifconfig output:
Code:
bge0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c0099<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,VLAN_HWTSO,LINKSTATE>
        ether 98:f2:b3:e8:75:be
        hwaddr 98:f2:b3:e8:75:be
        inet 192.168.1.100 netmask 0xffffff00 broadcast 192.168.1.255
        inet6 fe80::9af2:b3ff:fee8:75be%bge0 prefixlen 64 scopeid 0x1
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bge1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 98:f2:b3:e8:75:bf
        hwaddr 98:f2:b3:e8:75:bf
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
        inet 127.0.0.1 netmask 0xff000000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        groups: lo
zt851lc1p3r0qr5: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 5000 mtu 2800
        options=80000<LINKSTATE>
        ether 66:89:0f:91:6e:6d
        hwaddr 00:bd:5a:fe:f8:09
        inet 10.0.2.10 netmask 0xffffff00 broadcast 10.0.2.255
        nd6 options=1<PERFORMNUD>
        media: Ethernet autoselect
        status: active
        groups: tap
        Opened by PID 4766
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 02:6e:4d:ff:ac:00
        nd6 options=1<PERFORMNUD>
        groups: bridge
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: vnet0:2 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 7 priority 128 path cost 2000
        member: vnet0:1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 6 priority 128 path cost 2000
        member: bge0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 1 priority 128 path cost 20000
vnet0:1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: associated with jail: caddy as nic: epair0b
        options=8<VLAN_MTU>
        ether 02:ff:60:9a:61:30
        hwaddr 02:4f:50:00:06:0a
        inet6 fe80::ff:60ff:fe9a:6130%vnet0:1 prefixlen 64 scopeid 0x6
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        groups: epair
vnet0:2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: associated with jail: syncthing as nic: epair0b
        options=8<VLAN_MTU>
        ether 02:ff:60:5b:07:f8
        hwaddr 02:4f:50:00:07:0a
        inet6 fe80::ff:60ff:fe5b:7f8%vnet0:2 prefixlen 64 scopeid 0x7
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        groups: epair
 

emarj

I did some experiments last night on the 8GB machine (the only one I have physical access to at the moment).
I tried monitoring memory usage while starting the offending jails. When I start them, the graphs in the dashboard and on the reporting page drop to zero, and they recover when the jails are shut down.

So I used top to monitor instead. As you can see, mono (i.e. Emby), php-fpm and mysql (i.e. Nextcloud) are using some memory, but there is still a lot of free memory. Of course I know that ZFS will use it all as ARC when possible, but still...
Screenshot_15.png

EDIT: Expanding to 16GB is on my todo list, but since this issue happens also on the 16GB machine I'm not sure this is "the" problem. Also, I don't want to complain that much about RAM usage. I knew this when I decided to go with FreeNAS, but just adding RAM without understanding exactly why I should need it is not really ideal.

The system slowed down a lot when I started the jails but was still usable (apart from the graphs at 0). I left the jails running last night, and this morning the system was completely hung. It was not reachable over the network and I could not use the console physically (after typing any number, I got no response).

I proceeded to hard reset it and after it came back I dumped the logs.

Nothing (apparently) odd there, apart from a lot of messages from the FreeNAS middleware failing to contact sentry.ixsystems.com:
Code:
uwsgi: [sentry.errors:680] Sentry responded with an error: HTTP Error 502: Bad Gateway
...
uwsgi: [sentry.errors.uncaught:702] ['timeout: timed out',...

But this is to be expected, since it seems that all network connectivity goes bad. This time there is no trace of the collectd error above.
 

emarj

Bump! I need some input to sort this out
 