Sporadic jail network failure post boot

Joined
Dec 4, 2021
Messages
7
I have seen this issue over the years, probably as far back as v11. Often it would work after a few boots and then fail again, a bit like a race condition might. This is the first time it seems consistent, so it is a great opportunity to figure out the root cause, which might be an actual bug.

After updating to TrueNAS-13.0-U5.3, networking stopped in all my jails, which are a combination of VNET and direct/shared. Without changing any config, just booting into TrueNAS-13.0-U5.2 makes jail networking work again, and it breaks when I go back to TrueNAS-13.0-U5.3. So far this has stayed consistent over 5 cycles ...

The issue happens with static IP as well as DHCP. You can ping loopback and the assigned IP, but not across the bridge.

I tested with an explicit default router and with "auto"; it makes no difference, but since I cannot ping over the bridge I am not sure it would matter anyway.

Any suggestions on what I might look at are appreciated. I have tried all I could think of over the years and could never figure this one out.

[Attached image: 1691155877391.png]
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
We may all be wasting our time chasing nothing if you don't specify more about your setup... most importantly, what NIC you're using.

If that NIC is anything but Intel, it's likely to be the problem here.
 
Joined
Dec 4, 2021
Messages
7
Totally get your point. Given the information I posted, it seemed, to me at least, that the NIC would be a less likely cause considering the now 100% repeatable nature of the issue.

Turns out it is an Intel NIC. What other information would be useful? Sorry, bit of a n00b, so I don't even know what you might like to see.

Code:
igb0@pci0:4:0:0:    class=0x020000 rev=0x03 hdr=0x00 vendor=0x8086 device=0x1539 subvendor=0x1849 subdevice=0x1539
    vendor     = 'Intel Corporation'
    device     = 'I211 Gigabit Network Connection'
    class      = network
    subclass   = ethernet


Appreciate the help and support here, thanks.
 

sretalla

"What other information would be useful? Sorry, bit of a n00b, so I don't even know what you might like to see."
The Forum rules (in red at the top of the forum site) cover that.

"Turns out it is an Intel NIC"
OK, so we can rule the NIC out (mostly).

What's your bridge setup? ifconfig would be one way to share that (with jails running). Also you could share Network | Interfaces.
 
Joined
Dec 4, 2021
Messages
7
Thanks, I read the red post and will try to align better (my bad) :smile:

Code:
MB: ASRock AM4/X570M Pro4
CPU: AMD Ryzen 7 2700 Eight-Core Processor
RAM: 64 GiB Crucial Ballistix Sport AT
BOOT: 1x Western Digital Green 240 GB Internal SSD -- Boot
HDD: 4x HGST/Hitachi Ultrastar 7K3000 2TB 64MB 7200RPM -- Storage pool
SSD: 4x SanDisk SSD PLUS 240GB -- Fast pool for jails
NIC: Intel Gigabit LAN
OS: TrueNAS-13.0-U5.2


[Attached image: 1691421540922.png]


2 jails are not running (same as in the original image). They are backups, so I don't think they are needed for this analysis.

Code:
igb0: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=4a520b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,NOMAP>
    ether 70:85:c2:db:2c:84
    inet 10.0.0.5 netmask 0xffffff00 broadcast 10.0.0.255
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
    options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
    inet 127.0.0.1 netmask 0xff000000
    groups: lo
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
pflog0: flags=0<> metric 0 mtu 33160
    groups: pflog
wg0: flags=80c1<UP,RUNNING,NOARP,MULTICAST> metric 0 mtu 1420
    options=80000<LINKSTATE>
    inet 192.168.0.1 netmask 0xffffffff
    groups: wg
    nd6 options=109<PERFORMNUD,IFDISABLED,NO_DAD>
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    ether 58:9c:fc:10:d5:45
    id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
    maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
    root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
    member: vnet0.6 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 10 priority 128 path cost 2000
    member: vnet0.5 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 11 priority 128 path cost 2000
    member: vnet0.3 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 9 priority 128 path cost 2000
    member: vnet0.2 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 8 priority 128 path cost 2000
    member: vnet0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 7 priority 128 path cost 2000000
    member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 6 priority 128 path cost 2000
    member: igb0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 1 priority 128 path cost 20000
    groups: bridge
    nd6 options=9<PERFORMNUD,IFDISABLED>
vnet0.1: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: associated with jail: dns as nic: epair0b
    options=8<VLAN_MTU>
    ether 02:ff:60:fd:33:60
    hwaddr 02:4c:0c:f0:98:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
vnet0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=80000<LINKSTATE>
    ether fe:a0:98:67:b0:3f
    hwaddr 58:9c:fc:10:ff:93
    groups: tap
    media: Ethernet autoselect
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
    Opened by PID 2035
vnet0.2: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: associated with jail: mail as nic: epair0b
    options=8<VLAN_MTU>
    ether 02:ff:60:65:29:a8
    hwaddr 02:a3:60:29:60:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
vnet0.3: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: associated with jail: plex as nic: epair0b
    options=8<VLAN_MTU>
    ether 72:85:c2:14:fa:09
    hwaddr 02:9a:40:3c:15:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
vnet0.5: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: associated with jail: tools as nic: epair0b
    options=8<VLAN_MTU>
    ether 02:ff:60:0f:85:2e
    hwaddr 02:c6:69:75:0b:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
vnet0.6: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: associated with jail: choke as nic: epair0b
    options=8<VLAN_MTU>
    ether 02:ff:60:66:11:d0
    hwaddr 02:d6:8a:1e:54:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
 

sretalla

"BOOT: 1x Western Digital Green 240 GB Internal SSD -- Boot"
Not sure why that would be specifically problematic for one version over another other than the random nature of ZFS corruption, but I'm familiar with those drives not working really well with ZFS due to the way their controller handles TRIM.

It could be that a scrub of your boot pool would show up corruption in some files (which may be all or mostly in your latest OS Environment, but not the previous one)?

That's a bit of a long shot, as I know the TRIM situation has improved over the last few major versions, but worth a try.

Otherwise, it's onward to manually creating the bridge instead of allowing the automatic behavior.
 
Joined
Dec 4, 2021
Messages
7
Interesting, not sure how that would work but ok ... let's try.

FYI, my main data pool has been migrated forward for many years, the fast SSD based one is recent and boot has been redone in the past. Every few years I do a clean install and import pools, recreate jails etc ... maybe this info helps.

Boot pool seems clean based on scrub.

Code:
 state: ONLINE
  scan: scrub repaired 0B in 00:05:08 with 0 errors on Mon Aug  7 16:55:02 2023
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      ada8p2    ONLINE       0     0     0
 

sretalla

You can follow what I suggested over here:
 
Joined
Dec 4, 2021
Messages
7
<<Was busy writing this when you posted, will take a look>>

I think I made a breakthrough inspired by your comments, seems you were right, so hopefully this helps.

While troubleshooting and checking whether I can still consistently reproduce the issue, I noticed a difference.

Booting with TrueNAS-13.0-U5.2 I get 1 bridge, but booting with TrueNAS-13.0-U5.3 I have 2 bridges!!

So with TrueNAS-13.0-U5.2 I get bridge0 with all jails as members, but with TrueNAS-13.0-U5.3 I get both bridge0 and bridge1, with the jails spread across both and some missing even though they are UP.

TrueNAS-13.0-U5.2 Boot
Code:
igb0: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=4a520b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,NOMAP>
    ether 70:85:c2:db:2c:84
    inet 10.0.0.5 netmask 0xffffff00 broadcast 10.0.0.255
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
    options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
    inet 127.0.0.1 netmask 0xff000000
    groups: lo
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
pflog0: flags=0<> metric 0 mtu 33160
    groups: pflog
wg0: flags=80c1<UP,RUNNING,NOARP,MULTICAST> metric 0 mtu 1420
    options=80000<LINKSTATE>
    inet 192.168.0.1 netmask 0xffffffff
    groups: wg
    nd6 options=109<PERFORMNUD,IFDISABLED,NO_DAD>
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    ether 58:9c:fc:10:d5:45
    id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
    maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
    root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
    member: vnet0.5 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 11 priority 128 path cost 2000
    member: vnet0.4 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 10 priority 128 path cost 2000
    member: vnet0.3 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 9 priority 128 path cost 2000
    member: vnet0.2 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 8 priority 128 path cost 2000
    member: vnet0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 7 priority 128 path cost 2000000
    member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 6 priority 128 path cost 2000
    member: igb0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 1 priority 128 path cost 20000
    groups: bridge
    nd6 options=9<PERFORMNUD,IFDISABLED>
vnet0.1: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: associated with jail: dns as nic: epair0b
    options=8<VLAN_MTU>
    ether 02:ff:60:fd:33:60
    hwaddr 02:4c:0c:f0:98:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
vnet0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=80000<LINKSTATE>
    ether fe:a0:98:67:b0:3f
    hwaddr 58:9c:fc:10:ff:93
    groups: tap
    media: Ethernet autoselect
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
    Opened by PID 2024
vnet0.2: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: associated with jail: mail as nic: epair0b
    options=8<VLAN_MTU>
    ether 02:ff:60:65:29:a8
    hwaddr 02:a3:60:29:60:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
vnet0.3: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: associated with jail: plex as nic: epair0b
    options=8<VLAN_MTU>
    ether 72:85:c2:14:fa:09
    hwaddr 02:9a:40:3c:15:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
vnet0.4: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: associated with jail: choke as nic: epair0b
    options=8<VLAN_MTU>
    ether 02:ff:60:66:11:d0
    hwaddr 02:2f:76:a5:e1:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
vnet0.5: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: associated with jail: tools as nic: epair0b
    options=8<VLAN_MTU>
    ether 02:ff:60:0f:85:2e
    hwaddr 02:c6:69:75:0b:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>


TrueNAS-13.0-U5.3 Boot
Code:
igb0: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=4a520b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,NOMAP>
    ether 70:85:c2:db:2c:84
    inet 10.0.0.5 netmask 0xffffff00 broadcast 10.0.0.255
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
    options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
    inet 127.0.0.1 netmask 0xff000000
    groups: lo
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
pflog0: flags=0<> metric 0 mtu 33160
    groups: pflog
wg0: flags=80c1<UP,RUNNING,NOARP,MULTICAST> metric 0 mtu 1420
    options=80000<LINKSTATE>
    inet 192.168.0.1 netmask 0xffffffff
    groups: wg
    nd6 options=109<PERFORMNUD,IFDISABLED,NO_DAD>
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    ether 58:9c:fc:10:d5:45
    id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
    maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
    root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
    member: vnet0.2 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 9 priority 128 path cost 2000
    member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 6 priority 128 path cost 2000
    groups: bridge
    nd6 options=9<PERFORMNUD,IFDISABLED>
vnet0.1: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: associated with jail: dns as nic: epair0b
    options=8<VLAN_MTU>
    ether 02:ff:60:fd:33:60
    hwaddr 02:4c:0c:f0:98:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
bridge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    ether 58:9c:fc:10:eb:76
    id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
    maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
    root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
    member: vnet0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 7 priority 128 path cost 2000000
    member: igb0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 1 priority 128 path cost 20000
    groups: bridge
    nd6 options=9<PERFORMNUD,IFDISABLED>
vnet0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=80000<LINKSTATE>
    ether fe:a0:98:67:b0:3f
    hwaddr 58:9c:fc:10:ff:93
    groups: tap
    media: Ethernet autoselect
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
    Opened by PID 2024
vnet0.2: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: associated with jail: mail as nic: epair0b
    options=8<VLAN_MTU>
    ether 02:ff:60:65:29:a8
    hwaddr 02:9a:40:3c:15:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
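
A quick way to spot this kind of split without reading the whole listing is to pull the bridge membership out of the ifconfig output. Below is a minimal Python sketch of that idea, not something that was run in this thread. The function names `bridge_members` and `split_from_uplink` are made up for illustration, the sample text is trimmed from the U5.3 listing above, and it assumes `igb0` is the uplink NIC.

```python
import re

# Trimmed from the TrueNAS-13.0-U5.3 listing above (only the lines the parser needs).
SAMPLE = """\
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    member: vnet0.2 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
    member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
bridge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    member: vnet0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
    member: igb0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
"""


def bridge_members(ifconfig_text):
    """Map each bridge interface name to the list of its member interfaces."""
    bridges = {}
    current = None
    for line in ifconfig_text.splitlines():
        # Interface headers start in column 0, e.g. "bridge0: flags=..."
        header = re.match(r"^(\S+): flags=", line)
        if header:
            name = header.group(1)
            current = name if name.startswith("bridge") else None
            if current:
                bridges[current] = []
            continue
        # Member lines are indented, e.g. "    member: igb0 flags=..."
        member = re.match(r"^\s+member: (\S+)", line)
        if member and current:
            bridges[current].append(member.group(1))
    return bridges


def split_from_uplink(bridges, uplink):
    """Return members of any bridge that does not also contain the uplink NIC."""
    stranded = []
    for members in bridges.values():
        if uplink not in members:
            stranded.extend(members)
    return sorted(stranded)


if __name__ == "__main__":
    bridges = bridge_members(SAMPLE)
    print(bridges)
    print(split_from_uplink(bridges, "igb0"))
```

For the U5.3 sample this lists vnet0.1 and vnet0.2 as sitting on a bridge that does not include igb0, which would be consistent with the jails not being able to reach anything beyond the bridge. Feeding it the full output of ifconfig from each boot environment should work the same way, since other interface stanzas are skipped.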
 

sretalla

OK, so doing the bridge manually should indeed fix it.
 
Joined
Dec 4, 2021
Messages
7
Will try and let you know.

So we still suspect some kind of corruption then and not a bug of sorts?
 

sretalla

"So we still suspect some kind of corruption then and not a bug of sorts?"
Not sure it's corruption, maybe a configuration issue... not impossible there's a bug there, but I would expect to have a lot more noise about it if it's a really simple one.
 