Jails losing network connectivity after multiple days

DATAstrm · Dec 29, 2021

Hi All. I'm looking for some troubleshooting tips for the below issue.

My jails are all losing connectivity after multiple days (the most recent run lasted about 10 days). The issue is difficult to troubleshoot because the connectivity loss only happens after many days, with the jails working flawlessly in the interim. I have searched the forum for similar issues and also checked the logs (/var/log) on the host machine as well as in the jails, but I have not found anything relevant.

Some information:
Jails running:

1. Reverse proxy (for #2)
2. Nextcloud
3. Unifi controller
4. WireGuard
5. Cloud backup to Backblaze (runs rclone every 5 min)
6. Cloud backup (downloads a file every day)

Because I have a cloud backup running every 5 min (Jail #5) with verbose logging, I can pin down when the connectivity drops to within ~5 min. The rclone backup simply stalls and never completes. When that happens, I cannot access any of the 6 jails. I cannot ping them either (allow raw sockets is on).

Restarting the jails (iocage restart ALL) does NOT restore connectivity. The jails ONLY come back online after I reboot the entire system. After a reboot, the system, including all the jails, works fine for multiple days. I am then able to ping each jail with no problem, both from inside each jail and from an external system.

Some Questions I have:

1. Is there any other troubleshooting I can perform? I have already scoured the logs for all the jails and the host system. I've also looked at the logs in my router. Because I know when the connection loss happens to within ~5 min, I am able to look at the relevant part of the logs.
2. Is there any way to restart the network stack so I don't have to reboot the entire system? I tried restarting the jails, with no success. I also tried making a nominal change to the network to see if the jails would come up, but they didn't. Only a reboot seems to work. If there's a command that can restart/refresh the network stack, then maybe I can avoid rebooting by using a script.
3. Last resort would be to just get a monitoring script up that pings the jails and lets me know when they go down. Because the failure is silent, this would at least tell me when to reboot the system.
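
For the last-resort monitoring idea, here is a minimal Python sketch of what such a script might look like. It assumes it runs on the TrueNAS host (or any FreeBSD machine that can normally reach the jails) and that the FreeBSD `ping` flags `-c` (packet count) and `-t` (overall timeout in seconds) are available. The jail names come from the ifconfig output below, but the IP addresses are hypothetical placeholders, and the function names are made up for illustration.

```python
import subprocess

# Hypothetical addresses: replace with the real jail IPs.
JAILS = {
    "CloudBackup": "192.168.1.201",
    "Nextcloud": "192.168.1.202",
}


def is_reachable(ip, run=subprocess.run):
    """Send one ICMP echo and report whether ping exited successfully."""
    # FreeBSD ping: -c 1 sends a single packet, -t 2 gives up after 2 seconds.
    result = run(
        ["ping", "-c", "1", "-t", "2", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def down_jails(jails, check=is_reachable):
    """Return the sorted names of jails that did not answer the ping."""
    return sorted(name for name, ip in jails.items() if not check(ip))


def main():
    """Print the unreachable jails, if any, and return a shell exit code."""
    down = down_jails(JAILS)
    if down:
        print("unreachable jails: " + ", ".join(down))
        return 1
    return 0
```

Calling `main()` from a cron job every few minutes would be one way to use it: cron typically mails any output to the job's owner, so a message would only arrive when at least one jail stops answering. This only detects the failure; it does not restore connectivity.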

Any help would be welcome!

Here's my ifconfig. I doubt the issue is there since the jails work for multiple days before failing silently. This indicates to me that the network is configured and working correctly.

Code:

ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: member of lagg0
        options=a100b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,RXCSUM_IPV6>
        ether b8:ca:3a:70:b3:24
        hwaddr b8:ca:3a:70:b3:20
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=9<PERFORMNUD,IFDISABLED>
ix1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: member of lagg0
        options=a100b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,RXCSUM_IPV6>
        ether b8:ca:3a:70:b3:24
        hwaddr b8:ca:3a:70:b3:22
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=9<PERFORMNUD,IFDISABLED>
igb0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: member of lagg0
        options=a100b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,RXCSUM_IPV6>
        ether b8:ca:3a:70:b3:24
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=9<PERFORMNUD,IFDISABLED>
igb1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: Access Vlan 20 - Port 4
        options=a500b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6>
        ether b8:ca:3a:70:b3:25
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=9<PERFORMNUD,IFDISABLED>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x5
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
pflog0: flags=0<> metric 0 mtu 33160
        groups: pflog
lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: Mediaserver 2 Main interface (LAGG)
        options=a100b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,RXCSUM_IPV6>
        ether b8:ca:3a:70:b3:24
        inet 192.168.1.111 netmask 0xffffff00 broadcast 192.168.1.255
        laggproto lacp lagghash l2,l3,l4
        laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        groups: lagg
        media: Ethernet autoselect
        status: active
        nd6 options=9<PERFORMNUD,IFDISABLED>
vlan20: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: Camera Vlan
        options=200001<RXCSUM,RXCSUM_IPV6>
        ether b8:ca:3a:70:b3:25
        groups: vlan
        vlan: 20 vlanpcp: 4 parent interface: igb1
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=9<PERFORMNUD,IFDISABLED>
bridge20: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: vlan 20 bridge
        ether 02:ab:61:57:66:14
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto stp-rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: vlan20 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 8 priority 128 path cost 55
        groups: bridge
        nd6 options=9<PERFORMNUD,IFDISABLED>
bridge111: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: Main Lagg Bridge
        ether 02:ab:61:57:66:6f
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto stp-rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: vnet0.6 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 19 priority 128 path cost 2000
        member: vnet0.4 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 17 priority 128 path cost 2000
        member: vnet0.3 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 16 priority 128 path cost 2000
        member: vnet0.2 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 15 priority 128 path cost 2000
        member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 14 priority 128 path cost 2000
        member: vnet1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 13 priority 128 path cost 2000000
        member: lagg0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 7 priority 128 path cost 2000000
        groups: bridge
        nd6 options=9<PERFORMNUD,IFDISABLED>
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 02:ab:61:57:66:00
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto stp-rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: vnet0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 12 priority 128 path cost 2000000
        member: igb1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 4 priority 128 path cost 20000
        groups: bridge
        nd6 options=1<PERFORMNUD>
vnet0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80000<LINKSTATE>
        ether fe:a0:98:59:68:7e
        hwaddr 58:9c:fc:10:ff:9d
        groups: tap
        media: Ethernet autoselect
        status: active
        nd6 options=1<PERFORMNUD>
        Opened by PID 2384
vnet1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80000<LINKSTATE>
        ether fe:a0:98:03:50:7e
        hwaddr 58:9c:fc:10:2f:7d
        groups: tap
        media: Ethernet autoselect
        status: active
        nd6 options=1<PERFORMNUD>
        Opened by PID 2384
vnet0.1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: associated with jail: CloudBackup as nic: epair0b
        options=8<VLAN_MTU>
        ether ba:ca:3a:f3:e1:77
        hwaddr 02:e8:a0:a9:fb:0a
        groups: epair
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        nd6 options=1<PERFORMNUD>
vnet0.2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: associated with jail: Nextcloud as nic: epair0b
        options=8<VLAN_MTU>
        ether ba:ca:3a:27:9a:bf
        hwaddr 02:21:56:f3:88:0a
        groups: epair
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        nd6 options=1<PERFORMNUD>
vnet0.3: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: associated with jail: Security-Backup as nic: epair0b
        options=8<VLAN_MTU>
        ether ba:ca:3a:bd:c3:3f
        hwaddr 02:21:63:5f:00:0a
        groups: epair
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        nd6 options=1<PERFORMNUD>
vnet0.4: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: associated with jail: Unifi_Controller as nic: epair0b
        options=8<VLAN_MTU>
        ether ba:ca:3a:81:ad:c7
        hwaddr 02:5e:97:7c:89:0a
        groups: epair
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        nd6 options=1<PERFORMNUD>
vnet0.5: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: associated with jail: WireGuardJail as nic: epair0b
        options=8<VLAN_MTU>
        ether ba:ca:3a:cf:07:1b
        hwaddr 02:1e:e4:93:49:0a
        inet 172.16.0.1 netmask 0xfffffffc broadcast 172.16.0.3
        groups: epair
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        nd6 options=1<PERFORMNUD>
vnet0.6: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: associated with jail: reverse-proxy as nic: epair0b
        options=8<VLAN_MTU>
        ether ba:ca:3a:b5:7c:2a
        hwaddr 02:7f:dc:69:1b:0a
        groups: epair
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        nd6 options=1<PERFORMNUD>

Kris Moore · Dec 29, 2021

Nothing interesting in /var/log/messages or 'dmesg' around the time it goes down? That kind of failure I'd expect to at least throw some sort of error somewhere.

DATAstrm · Dec 29, 2021

Nothing at all, which is why this is so frustrating. Silent failure after multiple days. I don't have timestamps for dmesg (is there a way to get them?), but I didn't see anything of interest in there either.

Here's /var/log/messages from the host. The last time the jails went down was 12/22 at ~13:50.

Code:

Dec 20 00:00:00 Mediaserver2 syslog-ng[1786]: Configuration reload finished;
Dec 21 00:00:00 Mediaserver2 syslog-ng[1786]: Configuration reload request received, reloading configuration;
Dec 21 00:00:00 Mediaserver2 syslog-ng[1786]: Configuration reload finished;
Dec 22 00:00:00 Mediaserver2 syslog-ng[1786]: Configuration reload request received, reloading configuration;
Dec 22 00:00:00 Mediaserver2 syslog-ng[1786]: Configuration reload finished;
Dec 23 00:00:00 Mediaserver2 syslog-ng[1786]: Configuration reload request received, reloading configuration;

Kris Moore · Dec 29, 2021

That is really weird for sure. When they die, can you still ping the jails' IPs directly from the host? Is it only the external access that is getting lost? Wondering if the entire lagg is going down, or a specific bridge / vlan device stops passing traffic...

DATAstrm · Dec 29, 2021

I can no longer ping the jail IPs from the host or externally. While in the jail console, I cannot ping the TrueNAS host or any IP. It says something about a socket error (sorry I don't recall the exact language and the machine is now running after a reboot).

The lagg remains up. I can access the TrueNAS host through the lagg. I also have a Blue Iris VM running that I can access through the lagg. Only the jails go down. The lagg is not on a vlan, and its bridge (Bridge 111) continues to allow traffic to both the host machine and the VM.

DATAstrm · Jan 1, 2022

So my jails all lost connectivity again after a few days. Nothing in /var/log/messages.

I did find out some more information. Running netstat -Q shows an HMark (I think this stands for high watermark) for epair of 2100, with a single-digit number under Qdrop. I believe this means that the epair queue limit was reached. This is consistent with only the jails losing network connectivity, as all the jail interfaces are epair interfaces. The tap interfaces remain online, allowing my VMs to function.

I've restarted and upped the epair qlimit (net.link.epair.netisr_maxqlen) to 10240, which I believe is the max value unless net.isr.maxqlimit is increased.

So far, I've seen an HMark of 618 when running netstat -Q. Len (I believe this means the current queue length) usually shows single digits.

I'm not sure if this will solve the problem, but putting it out there in case anyone has any other thoughts on troubleshooting.

A few questions:

1. Is there any way to clear the qdrops with a command to get jail connectivity back without rebooting the entire machine?
2. Any thoughts on what would cause a high queue to build up? I'm transferring large amounts of information, but typically the queue length stays in the single digits. Maybe it's the internet going down temporarily? (In which case a higher queue limit won't solve anything, since a large file will just overflow it anyway.)
3. Does anyone know of any other logging I can perform? Right now, nothing shows up in any of the logs.
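
On the logging question, one option might be to record the epair counters from netstat -Q at regular intervals, so that the moment the drop counter starts climbing can be matched against the rclone log. The Python sketch below is only an illustration under assumptions: it expects the output to contain a "Workstreams:" section whose header row has a "Name" column and a drop column whose name starts with "QDrop", and the function names are made up here. Check the column names against real output from the host before relying on it.

```python
import subprocess


def parse_workstreams(text):
    """Parse the table under 'Workstreams:' into dicts keyed by column name."""
    rows, header, in_section = [], None, False
    for line in text.splitlines():
        fields = line.split()
        if not in_section:
            # Skip everything until the section title is seen.
            in_section = line.strip().startswith("Workstreams")
            continue
        if not fields:
            if header is not None:
                break  # a blank line after the table ends the section
            continue
        if header is None:
            header = fields  # first non-blank line is the column header
        elif len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows


def queue_drops(rows, proto="epair"):
    """Sum the drop counter over all rows belonging to one protocol name."""
    total = 0
    for row in rows:
        if row.get("Name") != proto:
            continue
        for column, value in row.items():
            if column.lower().startswith("qdrop"):
                total += int(value)
    return total


def current_epair_drops():
    """Run netstat -Q on the host and return the total epair drop count."""
    out = subprocess.run(
        ["netstat", "-Q"], capture_output=True, text=True, check=True
    ).stdout
    return queue_drops(parse_workstreams(out))
```

Run from cron every minute and appended to a file with a timestamp, the value returned by `current_epair_drops()` would show whether drops accumulate gradually or jump right when the jails go silent.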

Thanks in advance!

pschatz100 · Jan 1, 2022

Please post the specs of your system (according to forum rules). Without more information, it is not possible to offer an informed opinion.

DATAstrm · Jan 1, 2022

Thanks for the tip. I updated my signature per the rules.
