FreeNAS -> TrueNAS Core Upgrade - broke 'ix' nic?

tackyone

Dabbler
Joined
Jun 8, 2020
Messages
19
Hi,

I've been running FreeNAS 11.x for a while on a Supermicro box with dual 10Gig NICs (Intel PRO/10GbE, aka the 'ix' interface).

This has worked reliably since I installed it. I just upgraded to TrueNAS Core 12.0 - and it appears to have broken the NICs. They still work at first - but a couple of hours after starting the box they just "stop".

'ifconfig', 'netstat -b -n -i' et al. show nothing wrong (the interface is up, running, etc.). If I tcpdump the interface I can't see any traffic, and I can't ping any host on the LAN. Likewise, when remote clients try to ping the TrueNAS Core box, all I can see on them via tcpdump is the classic "ARP - Who has x tell y" [where 'x' is the TrueNAS Core IP address].

'dmesg' output was also unremarkable (nothing listed - only "promiscuous mode enabled/disabled" caused by me running tcpdump).

I tried 'ifconfig ixX down' and 'up' - didn't fix anything. From the switch point of view - the link is up, at 10Gig and everything is fine (no errors counting up or anything).

Anyone seen similar?

The dmesg output from the cards initialising is:

Code:
ix4: <Intel(R) PRO/10GbE PCI-Express Network Driver> port 0xf020-0xf03f mem 0xfbc80000-0xfbcfffff,0xfbd04000-0xfbd07fff irq 50 at device 0.0 numa-domain 1 on pci10
ix4: Using 2048 TX descriptors and 2048 RX descriptors
ix4: Using 8 RX queues 8 TX queues
ix4: Using MSI-X interrupts with 9 vectors
ix4: allocated for 8 queues
ix4: allocated for 8 rx queues
ix4: Ethernet address: 00:1b:21:d7:39:94
ix4: PCI Express Bus: Speed 5.0GT/s Width x8
ix4: link state changed to UP
ix5: <Intel(R) PRO/10GbE PCI-Express Network Driver> port 0xf000-0xf01f mem 0xfbc00000-0xfbc7ffff,0xfbd00000-0xfbd03fff irq 52 at device 0.1 numa-domain 1 on pci10
ix5: Using 2048 TX descriptors and 2048 RX descriptors
ix5: Using 8 RX queues 8 TX queues
ix5: Using MSI-X interrupts with 9 vectors
ix5: allocated for 8 queues
ix5: allocated for 8 rx queues
ix5: Ethernet address: 00:1b:21:d7:39:95
ix5: PCI Express Bus: Speed 5.0GT/s Width x8
ix5: link state changed to UP


The box has 4 other onboard 10Gig NICs - ix4 and ix5 are expansion-slot SFP+ cards (the onboards are all twisted pair).

ix5 is in a LAGG pair with ix0, but ix4 is on its own - and both ix4 and ix5 stopped. ix0 (the other LAGG partner for ix5) is currently disconnected, hence lagg0 always shows as:

Code:
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: lagg0
    options=e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
    ether 0c:c4:7a:62:e5:84
    inet x.x.x.x netmask 0xffffff00 broadcast x.x.x.255
    laggproto failover lagghash l2,l3,l4
    laggport: ix0 flags=1<MASTER>
    laggport: ix5 flags=4<ACTIVE>
    groups: lagg
    media: Ethernet autoselect
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
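
For anyone comparing lagg state before and after a hang, the laggport lines in output like the above can be read mechanically. This is only a minimal Python sketch: it assumes the text is pasted from 'ifconfig lagg0' in the layout shown here, and the function names are made up for illustration.

```python
import re

# Matches lines such as "    laggport: ix5 flags=4<ACTIVE>"
LAGGPORT = re.compile(r"^\s*laggport:\s+(\S+)\s+flags=[0-9a-fA-F]+<([^>]*)>")


def lagg_ports(ifconfig_text):
    """Map each laggport name to the set of flag names ifconfig printed for it."""
    ports = {}
    for line in ifconfig_text.splitlines():
        match = LAGGPORT.match(line)
        if match:
            name, flags = match.groups()
            # An empty "<>" yields an empty set rather than {""}.
            ports[name] = {flag for flag in flags.split(",") if flag}
    return ports


def active_ports(ports):
    """Names of the ports currently flagged ACTIVE, sorted for stable output."""
    return sorted(name for name, flags in ports.items() if "ACTIVE" in flags)


SAMPLE = """\
    laggproto failover lagghash l2,l3,l4
    laggport: ix0 flags=1<MASTER>
    laggport: ix5 flags=4<ACTIVE>
"""

print(active_ports(lagg_ports(SAMPLE)))
```

On the output quoted above this reports ix5 as the only ACTIVE port, which is consistent with ix0 being unplugged: in failover mode, traffic only goes through the active port.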


This was actually "boringly reliable" with FreeNAS 11.3 etc. - so I don't really know where to start troubleshooting it - other than that the update has obviously broken something?

-Tacks
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
I am having a related issue. My system has been totally unstable since moving to TrueNAS 12 Core, with repeated kernel panics and the same network issue you describe above (oddly only on one interface, despite having 2 ix interfaces). The TrueNAS team have been completely non-responsive on the matter; however, the FreeBSD community have thankfully been much more helpful. Having ruled out memory corruption, the strongest contender is a malfunctioning kernel driver, and the 10G 'ix' drivers are at the top of the list.

Unfortunately the devs can't do anything more without a full crash core-dump of the kernel, but TrueNAS either cannot do this, or nobody in the entire community or dev team knows how.

I am 100% regretting moving to TrueNAS 12; it was not ready for production.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
seb101 said:
Unfortunately the devs can't do anything more without a full crash core-dump of the kernel, but TrueNAS either cannot do this, or nobody in the entire community or dev team knows how.
I suspect this is due to the way FreeNAS/TrueNAS constructs the swap device dynamically on boot up based on the pool topology. And I must say, I don't like this feature, either. I'd rather have control over the swap partitions and be able to utilise all of them - which I currently cannot.
 

seb101

Now that I look in more detail... the panics are occurring in relation to link state changes on the ix1 interface...

Code:
<6>ix1: link state changed to UP
panic: Bad tailq NEXT(0xffffffff82135158->tqh_last) != NULL
cpuid = 3
time = 1604305215
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0075d697a0
vpanic() at vpanic+0x17b/frame 0xfffffe0075d697f0
panic() at panic+0x43/frame 0xfffffe0075d69850
callout_process() at callout_process+0x32f/frame 0xfffffe0075d698c0
handleevents() at handleevents+0x185/frame 0xfffffe0075d69900
timercb() at timercb+0x196/frame 0xfffffe0075d69950
lapic_handle_timer() at lapic_handle_timer+0x9b/frame 0xfffffe0075d69980
Xtimerint() at Xtimerint+0xb1/frame 0xfffffe0075d69980
--- interrupt, rip = 0xffffffff811af6e6, rsp = 0xfffffe0075d69a50, rbp = 0xfffffe0075d69a50 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe0075d69a50
acpi_cpu_idle() at acpi_cpu_idle+0x232/frame 0xfffffe0075d69aa0
cpu_idle_acpi() at cpu_idle_acpi+0x3e/frame 0xfffffe0075d69ac0
cpu_idle() at cpu_idle+0x9f/frame 0xfffffe0075d69ae0
sched_idletd() at sched_idletd+0x3f1/frame 0xfffffe0075d69bb0
fork_exit() at fork_exit+0x80/frame 0xfffffe0075d69bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0075d69bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
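
When passing a trace like this on to a bug tracker or mailing list, it can help to boil it down to the panic string and the call chain. The following is a rough Python sketch for the KDB layout shown above; the function names are hypothetical and the regex has only been matched against this one format.

```python
import re

# Matches frames such as "vpanic() at vpanic+0x17b/frame 0xfffffe0075d697f0"
FRAME = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+$")


def parse_panic(text):
    """Return (panic message, list of function names in backtrace order)."""
    message = None
    frames = []
    for raw in text.splitlines():
        line = raw.strip()
        if message is None and line.startswith("panic: "):
            message = line[len("panic: "):]
            continue
        match = FRAME.match(line)
        if match:
            frames.append(match.group(1))
    return message, frames


def raised_in(frames):
    """Name of the function that called panic(), or None if it cannot be found."""
    if "panic" in frames:
        index = frames.index("panic") + 1
        if index < len(frames):
            return frames[index]
    return None


SAMPLE = """\
panic: Bad tailq NEXT(0xffffffff82135158->tqh_last) != NULL
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0075d697a0
vpanic() at vpanic+0x17b/frame 0xfffffe0075d697f0
panic() at panic+0x43/frame 0xfffffe0075d69850
callout_process() at callout_process+0x32f/frame 0xfffffe0075d698c0
handleevents() at handleevents+0x185/frame 0xfffffe0075d69900
"""

message, frames = parse_panic(SAMPLE)
print(message)
print("raised in:", raised_in(frames))
```

For the trace above, the frame that called panic() is callout_process(), reached from the timer interrupt while the CPU was idle. That places the panic in timer callout processing; on its own it does not prove which driver corrupted the queue.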
 

tackyone

seb101 said:
I am having a related issue. My system has been totally unstable since moving to TrueNAS 12 Core, with repeated kernel panics and the same network issue you describe above (oddly only on one interface, despite having 2 ix interfaces). The TrueNAS team have been completely non-responsive on the matter; however, the FreeBSD community have thankfully been much more helpful. Having ruled out memory corruption, the strongest contender is a malfunctioning kernel driver, and the 10G 'ix' drivers are at the top of the list.

Unfortunately the devs can't do anything more without a full crash core-dump of the kernel, but TrueNAS either cannot do this, or nobody in the entire community or dev team knows how.

I am 100% regretting moving to TrueNAS 12; it was not ready for production.

I've not seen any panics - but I'm on my third reboot now to keep the ix interfaces running.

You say, "without a full crash core-dump of the kernel, but TrueNAS can either not do this" - so TrueNAS can't do a core dump on panic? - Is that usual?

Looking at the system here, swap is built out of a bunch of mirror/.eli devices. I suppose that, as a one-off, you could turn that off manually (with swapoff), set up a separate swap device on a spare drive / standard partition, 'swapon' it, and get a core dump onto that (if the problem is that it either can't dump to the .eli devices or can't savecore from them).
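
As a sketch of what that one-off setup might look like on stock FreeBSD: the snippet below only builds and prints the commands unless dry_run is turned off. The device name /dev/ada9p1 is a placeholder for a spare partition, not something from this system, and whether TrueNAS 12 leaves a manually set dump device alone has not been confirmed in this thread. As far as I know, dumpon(8) alone is enough to arm a dump device, so the partition does not have to be active swap as well.

```python
import shlex
import subprocess


def crash_dump_plan(device, crash_dir="/var/crash"):
    """Return (arm, recover) command lists for a kernel dump on `device`.

    arm:     run now, before the next panic.
    recover: run after the panic and reboot, to copy the dump off the device.
    """
    arm = [
        ["dumpon", "-v", device],  # point kernel dumps at the spare partition
        ["dumpon", "-l"],          # show which dump device is now configured
    ]
    recover = [
        ["savecore", "-v", crash_dir, device],  # extract the dump into crash_dir
    ]
    return arm, recover


def run(commands, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for argv in commands:
        print(" ".join(shlex.quote(arg) for arg in argv))
        if not dry_run:
            subprocess.run(argv, check=True)


# Placeholder device: substitute a real, unused partition before running for real.
arm, recover = crash_dump_plan("/dev/ada9p1")
run(arm)
run(recover)
```

The dump partition needs to be big enough to hold the kernel dump, and a dump device set by hand like this does not persist across a reboot by itself.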

I'm going to see how things go over the next day - if it's still suffering from hanging ix interfaces - I'd hope/guess you can boot the system in the "previous" version still - the previous ZFS boot volumes all seem to be there. I have another box here I can try to run 12.x up on to poke & prod - but I'd rather not keep losing the main box.

-Tacks
 

seb101

Yes, you can easily go back to 11.3 so long as you don't upgrade the ZFS feature flags on your pools. Sadly I did, and now I'm stuck on TrueNAS 12.
 

styno

Patron
Joined
Apr 11, 2016
Messages
466
Fortunately I am not seeing this on the older Intel 82598EB 10 Gigabit Ethernet Controllers that I am using. They also use the ix driver and are plugged into a Supermicro X10SL7-F (and an X9SCL-F) running TrueNAS 12. They are connected over fiber.
I know this doesn't help you directly, but maybe it can help narrow down the issue.

Code:
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver> port 0xe020-0xe03f mem 0xf75a0000-0xf75bffff,0xf7540000-0xf757ffff,0xf75c4000-0xf75c7fff irq 16 at device 0.0 on pci1
ix0: Using 2048 TX descriptors and 2048 RX descriptors
ix0: Using 2 RX queues 2 TX queues
ix0: Using MSI-X interrupts with 3 vectors
ix0: allocated for 2 queues
ix0: allocated for 2 rx queues
ix0: Ethernet address: 00:1b:21:6b:74:1b
ix0: PCI Express Bus: Speed 2.5GT/s Width x8
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver> port 0xe000-0xe01f mem 0xf7580000-0xf759ffff,0xf7500000-0xf753ffff,0xf75c0000-0xf75c3fff irq 17 at device 0.1 on pci1
ix1: Using 2048 TX descriptors and 2048 RX descriptors
ix1: Using 2 RX queues 2 TX queues
ix1: Using MSI-X interrupts with 3 vectors
ix1: allocated for 2 queues
ix1: allocated for 2 rx queues
ix1: Ethernet address: 00:1b:21:6b:74:1a
ix1: PCI Express Bus: Speed 2.5GT/s Width x8
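
To line up the attach messages from a card that hangs against one that doesn't, something like the following could pull the queue counts, MSI-X vectors and PCIe link details out of dmesg text. It is a sketch only: the field names are invented for illustration, and the patterns cover just the ix lines quoted in this thread.

```python
import re
from collections import defaultdict

DEVICE = re.compile(r"^(ix\d+): (.*)$")

# One pattern per value of interest, keyed by an illustrative field name.
PATTERNS = {
    "rx_queues": re.compile(r"Using (\d+) RX queues"),
    "tx_queues": re.compile(r"(\d+) TX queues"),
    "msix_vectors": re.compile(r"MSI-X interrupts with (\d+) vectors"),
    "pcie_speed": re.compile(r"PCI Express Bus: Speed (\S+)"),
    "pcie_width": re.compile(r"Width (x\d+)"),
}


def summarize_ix(dmesg_text):
    """Return {device: {field: value}} for every ixN line in dmesg_text."""
    summary = defaultdict(dict)
    for raw in dmesg_text.splitlines():
        match = DEVICE.match(raw.strip())
        if not match:
            continue
        device, rest = match.groups()
        for field, pattern in PATTERNS.items():
            hit = pattern.search(rest)
            if hit:
                summary[device][field] = hit.group(1)
    return dict(summary)


HANGING = """\
ix4: Using 2048 TX descriptors and 2048 RX descriptors
ix4: Using 8 RX queues 8 TX queues
ix4: Using MSI-X interrupts with 9 vectors
ix4: PCI Express Bus: Speed 5.0GT/s Width x8
"""

WORKING = """\
ix0: Using 2048 TX descriptors and 2048 RX descriptors
ix0: Using 2 RX queues 2 TX queues
ix0: Using MSI-X interrupts with 3 vectors
ix0: PCI Express Bus: Speed 2.5GT/s Width x8
"""

for label, text in (("hanging", HANGING), ("working", WORKING)):
    print(label, summarize_ix(text))
```

On the excerpts in this thread, the cards that hang attach with 8 RX/TX queues, 9 MSI-X vectors and a 5.0GT/s link, while the 82598EB cards attach with 2 queues, 3 vectors and 2.5GT/s. Whether any of those differences has anything to do with the hang is not established here.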
 

Fullreg

Cadet
Joined
Jan 22, 2020
Messages
1
I upgraded to TrueNAS 12 on Oct 29 and have had no communication with my network since. I deleted the NIC and re-added it with the same IP and netmask, but from the console I can't ping the gateway. My NIC is a 10G Chelsio with Twinax. It worked 24/7 until 2019 09.
 

tackyone

I'm going to have to set this up on another box. I've had to restart the system umpteen times - sometimes both ix interfaces 'hang' - other times, only the single (non-lagged) one hangs.

When this happens things like 'tcpdump' literally show no traffic captured at all - so something is getting stuck somewhere.

Unfortunately - no dmesg / console errors get logged, and there's nothing anywhere to indicate there's any problem - but it's just so unreliable now that I've reverted back to 11.3 (which fortunately I could).
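
Since nothing gets logged at the moment of the hang, one option on a test box would be a small watchdog that notices when a LAN host stops answering and captures interface state right then. The Python below is a minimal sketch under several assumptions: the host and interface names are whatever you pass in, the function names are made up, the ping flags are the FreeBSD ones, and the list of diagnostic commands is just a starting point.

```python
import subprocess
import time


def host_reachable(host):
    """One ping; FreeBSD's -t is an overall timeout in seconds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-t", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def diagnostic_commands(iface):
    """Commands worth capturing at hang time; assumes ixN interface naming."""
    unit = iface[2:]  # "ix4" -> "4"
    return [
        ["ifconfig", iface],
        ["netstat", "-b", "-n", "-i"],
        ["sysctl", f"dev.ix.{unit}"],
        ["vmstat", "-i"],
    ]


def snapshot(iface):
    """Run each diagnostic command and print its output."""
    for argv in diagnostic_commands(iface):
        proc = subprocess.run(argv, capture_output=True, text=True)
        print(f"$ {' '.join(argv)}\n{proc.stdout}{proc.stderr}")


def watch(host, iface, probe=host_reachable, on_hang=snapshot,
          interval=60, checks=None, sleep=time.sleep):
    """Probe `host` every `interval` seconds; call on_hang on each up->down change.

    Runs forever when checks is None. Returns the number of hangs seen.
    """
    was_up = True
    hangs = 0
    done = 0
    while checks is None or done < checks:
        up = probe(host)
        if was_up and not up:
            hangs += 1
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} {host} unreachable, capturing {iface} state")
            on_hang(iface)
        was_up = up
        done += 1
        if checks is None or done < checks:
            sleep(interval)
    return hangs
```

It would be started with something like watch("<gateway IP>", "ix4") with the output redirected to a file on a local disk, so the capture survives even when the network is down. The snapshot only fires on the transition from reachable to unreachable, so a long outage produces one capture rather than one per minute.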

I'll try setting up another box with the same NICs and TrueNAS CORE 12 again - it'll be great if it exhibits the same problem - but due to the lack of any errors anywhere, I'm kind of wondering what to do next.

I think I'll see if I can set it up for crashdumps - because that's all I can think of doing at the moment :(

-Tacks
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
@tackyone Did you submit a report to IX? If so can you post it here? I'd like to follow it as I'm also having issues with my 10G NICs.
 

tackyone

MikeyG said:
@tackyone Did you submit a report to IX? If so can you post it here? I'd like to follow it as I'm also having issues with my 10G NICs.

I thought @seb101 had - I'll have a dig around the bug reporting site.

I'm back running 11.3-U3.2 now - and, as you'd hope - nothing's gone bump in the night yet with the NICs.

-Tacks
 

tackyone

Ok - first time submitting a bug (there were a couple of other ixgbe issues - but they didn't seem to be the same, e.g. dmesg output complaining about firmware issues) - anyhow:


Now exists to cover this. Once I've got the spare running - I'll try and update it with dmesg output etc. (as I'm back on 11.3 now I can't easily get that from the 12.0 boot).

-Tacks
 