Problems with Intel XXV710 NIC, trying to track down root of issue

DemohFoxfire

Dabbler
Joined
May 2, 2023
Messages
11
This system is still being built up before going into production, but I ran into some pretty weird problems.

Supermicro X9 series board
Intel DC P4608 6.4 TB (2x 3.2 TB)
LSI 9211-8i, active with drives attached
LSI 9300-8i, in the system so I can firmware update it
Intel XXV710-DA2

Using the 3008 firmware from a guide here, I went to flash the 9300-8i with sas3flash over a PuTTY SSH session while I was preparing some tests with a Windows Server 2019 VM on ESXi 7, connected via both ports on the Intel card. The VM uses the Windows iSCSI initiator (the NICs are passed through to the VM as raw networks instead of using VMware iSCSI), as I was just doing some benchmarking. The drives on the 9211 were presented via iSCSI on ixl0, while the NVMe was presented as 2 targets on ixl1.

I was in the middle of writing a series of 4 GB files to the 9211 drives when I performed the firmware update on the 9300. Since each -listall came back with 1 controller and both were ID 0, I went ahead without specifying -c.

Code:
root@truenas[/tmp/firmware/9300-8i]#
root@truenas[/tmp/firmware/9300-8i]#
root@truenas[/tmp/firmware/9300-8i]#
root@truenas[/tmp/firmware/9300-8i]# sas2flash -listall
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS2008(B2)     20.00.07.00    14.01.00.08    07.39.02.00     00:85:00:00

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
root@truenas[/tmp/firmware/9300-8i]#
root@truenas[/tmp/firmware/9300-8i]#
root@truenas[/tmp/firmware/9300-8i]#
root@truenas[/tmp/firmware/9300-8i]# sas3flash -listall
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

        Adapter Selected is a Avago SAS: SAS3008(C0)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS3008(C0)  15.00.00.00    0e.00.00.07    08.35.00.00     00:84:00:00

        Finished Processing Commands Successfully.
        Exiting SAS3Flash.
root@truenas[/tmp/firmware/9300-8i]#
root@truenas[/tmp/firmware/9300-8i]#
root@truenas[/tmp/firmware/9300-8i]#
root@truenas[/tmp/firmware/9300-8i]# sas3flash -o -f SAS9300_8i_IT.bin
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

        Advanced Mode Set

        Adapter Selected is a Avago SAS: SAS3008(C0)

        Executing Operation: Flash Firmware Image

                Firmware Image has a Valid Checksum.
                Firmware Version 16.00.12.00
                Firmware Image compatible with Controller.

                Valid NVDATA Image found.
                NVDATA Major Version 0e.01
                Checking for a compatible NVData image...

                NVDATA Device ID and Chip Revision match verified.
                NVDATA SubSystem Vendor and SubSystem Device ID match verified.
                NVDATA Versions Compatible.
                Valid Initialization Image verified.
                Valid BootLoader Image verified.

                Beginning Firmware Download...
                Firmware Download Successful.

                Verifying Download...

                Firmware Flash Successful.

                Resetting Adapter...



After a while I heard the fans spin up and the server was mid-POST. I saw the option ROM for the 9300 and entered it; it showed the new firmware. I let the TrueNAS server continue to boot. When I hit Enter in PuTTY, I got the confirmation that the session had disconnected.

sas3flash -listall showed the correct new firmware version. Not trusting the "flash successful" message and trying to recreate the reboot, I ran the flash again without any activity. I didn't bother with the Windows server, as iSCSI was "reconnecting" but never reconnected. The server didn't reboot, sas3flash completed successfully this time, and the -listall output didn't change.

That in itself is odd, but I decided to restart my testing, delete the failed file and recreate it (I was just writing dummy files with random data, 4 GB each, no big deal), and iSCSI wouldn't reconnect.

Windows couldn't ping the 2 portals. TrueNAS couldn't ping the 2 Windows NICs. Wireshark on Windows doesn't show any traffic from the TrueNAS server EXCEPT for LLDP packets from the 2 Intel MAC addresses on the respective Windows NICs.

I've link cycled the DAC cables, changed the IP from 2.1 to 2.2 and back on one of the interfaces, and rebooted both servers, but I can't get a peep out of the TrueNAS ixl interfaces.


Code:
root@truenas[~]#
root@truenas[~]#
root@truenas[~]#
root@truenas[~]# ping 172.16.1.10
PING 172.16.1.10 (172.16.1.10): 56 data bytes
^C
--- 172.16.1.10 ping statistics ---
62 packets transmitted, 0 packets received, 100.0% packet loss
root@truenas[~]#
root@truenas[~]#
root@truenas[~]# ping 172.16.2.10
PING 172.16.2.10 (172.16.2.10): 56 data bytes
^C
--- 172.16.2.10 ping statistics ---
90 packets transmitted, 0 packets received, 100.0% packet loss
root@truenas[~]#
root@truenas[~]#
root@truenas[~]#
root@truenas[~]# ifconfig
igb0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4e527bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
        ether 00:25:90:4f:93:c0
        inet 192.168.10.164 netmask 0xffffff00 broadcast 192.168.10.255
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=9<PERFORMNUD,IFDISABLED>
igb1: flags=8822<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
        ether 00:25:90:4f:93:c1
        media: Ethernet autoselect
        status: no carrier
        nd6 options=9<PERFORMNUD,IFDISABLED>
igb2: flags=8822<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
        ether 00:25:90:4f:93:c2
        media: Ethernet autoselect
        status: no carrier
        nd6 options=9<PERFORMNUD,IFDISABLED>
igb3: flags=8822<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
        ether 00:25:90:4f:93:c3
        media: Ethernet autoselect
        status: no carrier
        nd6 options=9<PERFORMNUD,IFDISABLED>
ixl0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
        ether 40:a6:b7:9a:c6:24
        inet 172.16.1.1 netmask 0xffffff00 broadcast 172.16.1.255
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex>)
        status: active
        nd6 options=9<PERFORMNUD,IFDISABLED>
ixl1: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
        ether 40:a6:b7:9a:c6:25
        inet 172.16.2.1 netmask 0xffffff00 broadcast 172.16.2.255
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex>)
        status: active
        nd6 options=9<PERFORMNUD,IFDISABLED>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x7
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
pflog0: flags=0<> metric 0 mtu 33160
        groups: pflog
root@truenas[~]#
root@truenas[~]#
root@truenas[~]#
root@truenas[~]# dmesg | grep ixl0
ixl0: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.1-k> mem 0xfa000000-0xfaffffff,0xfb008000-0xfb00ffff irq 50 at device 0.0 numa-domain 1 on pci14
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 80003646 oem 1.263.0
ixl0: PF-ID[0]: VFs 64, MSI-X 129, VF MSI-X 5, QPs 768, MDIO & I2C
ixl0: Using 1024 TX descriptors and 1024 RX descriptors
ixl0: Using 8 RX queues 8 TX queues
ixl0: Using MSI-X interrupts with 9 vectors
ixl0: taskqgroup_attach_cpu failed 22
ixl0: Ethernet address: 40:a6:b7:9a:c6:24
ixl0: Allocating 8 queues for PF LAN VSI; 8 queues active
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: Link is up, 10 Gbps Full Duplex, Requested FEC: CL108 RS-FEC, Negotiated FEC: None, Autoneg: False, Flow Control: None
ixl0: link state changed to UP
debugnet_any_ifnet_update: Bad dn_init result from ixl0 (ifp 0xfffff801050f7800), ignoring.
ixl0: link state changed to DOWN
ixl0: Link is up, 10 Gbps Full Duplex, Requested FEC: CL108 RS-FEC, Negotiated FEC: None, Autoneg: False, Flow Control: None
ixl0: link state changed to UP
ixl0: link state changed to DOWN
ixl0: Link is up, 10 Gbps Full Duplex, Requested FEC: CL108 RS-FEC, Negotiated FEC: None, Autoneg: False, Flow Control: None
ixl0: link state changed to UP
ixl0: link state changed to DOWN
ixl0: Link is up, 10 Gbps Full Duplex, Requested FEC: CL108 RS-FEC, Negotiated FEC: None, Autoneg: False, Flow Control: None
ixl0: link state changed to UP
root@truenas[~]#
root@truenas[~]#
root@truenas[~]#
root@truenas[~]# dmesg | grep ixl1
ixl1: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.1-k> mem 0xf9000000-0xf9ffffff,0xfb000000-0xfb007fff irq 50 at device 0.1 numa-domain 1 on pci14
ixl1: fw 6.0.48442 api 1.7 nvm 6.01 etid 80003646 oem 1.263.0
ixl1: PF-ID[1]: VFs 64, MSI-X 129, VF MSI-X 5, QPs 768, MDIO & I2C
ixl1: Using 1024 TX descriptors and 1024 RX descriptors
ixl1: Using 8 RX queues 8 TX queues
ixl1: Using MSI-X interrupts with 9 vectors
ixl1: taskqgroup_attach_cpu failed 22
ixl1: Ethernet address: 40:a6:b7:9a:c6:25
ixl1: Allocating 8 queues for PF LAN VSI; 8 queues active
ixl1: PCI Express Bus: Speed 8.0GT/s Width x8
ixl1: SR-IOV ready
ixl1: Link is up, 10 Gbps Full Duplex, Requested FEC: CL108 RS-FEC, Negotiated FEC: None, Autoneg: False, Flow Control: None
ixl1: link state changed to UP
debugnet_any_ifnet_update: Bad dn_init result from ixl1 (ifp 0xfffff81483a31800), ignoring.
ixl1: link state changed to DOWN
ixl1: Link is up, 10 Gbps Full Duplex, Requested FEC: CL108 RS-FEC, Negotiated FEC: None, Autoneg: False, Flow Control: None
ixl1: link state changed to UP
ixl1: link state changed to DOWN
ixl1: Link is up, 10 Gbps Full Duplex, Requested FEC: CL108 RS-FEC, Negotiated FEC: None, Autoneg: False, Flow Control: None
ixl1: link state changed to UP
ixl1: link state changed to DOWN
ixl1: Link is up, 10 Gbps Full Duplex, Requested FEC: CL108 RS-FEC, Negotiated FEC: None, Autoneg: False, Flow Control: None
ixl1: link state changed to UP
root@truenas[~]#
root@truenas[~]#




I'm at a loss on this one. I'll continue testing tomorrow, but if anybody has ideas I am all ears. It's not too big of a deal, as I can just blow away both servers entirely since it's all sandbox right now, but I would really love to get to the bottom of this one. The up/down events were mostly me swapping the DAC cables, watching the MAC addresses change in Wireshark, and swapping them back.
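
To separate deliberate cable swaps from unexpected link flaps, it can help to count the link state changes per interface. The following is only a rough sketch: the function name count_link_changes and the SAMPLE text are made up for illustration, and the pattern assumes the message format seen in the dmesg output above.

```python
import re
from collections import Counter

# A few lines in the same format as the dmesg output above (sample only).
SAMPLE = """\
ixl0: link state changed to UP
debugnet_any_ifnet_update: Bad dn_init result from ixl0 (ifp 0xfffff801050f7800), ignoring.
ixl0: link state changed to DOWN
ixl0: link state changed to UP
ixl1: link state changed to UP
ixl1: link state changed to DOWN
ixl1: link state changed to UP
"""

# Matches "<ifname>: link state changed to UP|DOWN" and nothing else.
LINK_RE = re.compile(r"^(\w+): link state changed to (UP|DOWN)$")


def count_link_changes(dmesg_text):
    """Count UP/DOWN transitions per interface (hypothetical helper)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = LINK_RE.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


changes = count_link_changes(SAMPLE)
for (ifname, state), n in sorted(changes.items()):
    print(f"{ifname} {state}: {n}")
```

If the counts are higher than the number of times the cables were actually touched, the extra transitions are worth a closer look.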
 

DemohFoxfire

Put some time into it today. At some point the system linked up without any issues, seemingly at random, and I was about to start some testing. I had all of my iSCSI targets and initiators set up, ran a few test file transfers, etc. All was well. I then changed the topology to bring up a 2nd TrueNAS (a VM this time) with 128 GB of RAM (mainly for ARC) as an iSCSI target on network A, with the Windows box's initiator mounting that TrueNAS box's volume on network A. The Windows box now only talks to the existing TrueNAS box (the one that started this thread) via network B.

Initial transfers went well. I shut down the Windows box, then rebooted the 2 TrueNAS servers. I was never able to get any network traffic out of that Intel card again.

I did find that I would get this any time I ran ifconfig ixl0 down, then up:
Code:
ixl0: TX queue 1 still enabled!
ixl0: TX queue 2 still enabled!
ixl0: TX queue 3 still enabled!
ixl0: TX queue 4 still enabled!
ixl0: TX queue 5 still enabled!
ixl0: TX queue 6 still enabled!
ixl0: TX queue 7 still enabled!
ixl0: TX queue 0 still enabled!
ixl0: TX queue 1 still enabled!
ixl0: TX queue 2 still enabled!
ixl0: TX queue 3 still enabled!
ixl0: TX queue 4 still enabled!
ixl0: TX queue 5 still enabled!
ixl0: TX queue 6 still enabled!
ixl0: TX queue 7 still enabled!
ixl0: TX queue 0 still enabled!



Most concerning were the buffers. I found some post about the buffers, so here is the output of netstat -i:


Code:
Welcome to TrueNAS

Warning: the supported mechanisms for making configuration changes
are the TrueNAS WebUI and API exclusively. ALL OTHERS ARE
NOT SUPPORTED AND WILL RESULT IN UNDEFINED BEHAVIOR AND MAY
RESULT IN SYSTEM FAILURE.

root@truenas[~]#
root@truenas[~]# uptime
 4:36PM  up 3 mins, 1 user, load averages: 0.10, 0.13, 0.06
root@truenas[~]#
root@truenas[~]#
root@truenas[~]#
root@truenas[~]# netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
igb0   1500 <Link#1>      00:25:90:4f:93:c0    34546     0     0     7916     0     0
igb0      - 192.168.10.0/ xxxxxxxxxoffice.ve    10908     -     -     7906     -     -
igb1*  1500 <Link#2>      00:25:90:4f:93:c1        0     0     0        0     0     0
igb2*  1500 <Link#3>      00:25:90:4f:93:c2        0     0     0        0     0     0
igb3*  1500 <Link#4>      00:25:90:4f:93:c3        0     0     0        0     0     0
ixl0   1500 <Link#5>      40:a6:b7:9a:c6:24 281474976710659     0 4294967276       11     0     0
ixl0      - 172.16.1.0/24 172.16.1.1               0     -     -       42     -     -
ixl1   1500 <Link#6>      40:a6:b7:9a:c6:25 281474976710659     0 4294967276       14     0     0
ixl1      - 172.16.2.0/24 172.16.2.1               0     -     -       40     -     -
lo0   16384 <Link#7>      lo0                     55     0     0       36     0     0
lo0       - localhost     localhost                8     -     -        8     -     -
lo0       - fe80::%lo0/64 fe80::1%lo0              0     -     -        0     -     -
lo0       - your-net      localhost               28     -     -       28     -     -
pflog 33160 <Link#8>      pflog0                   0     0     0        0     0     0
root@truenas[~]#
root@truenas[~]#



Those numbers, 281474976710659 and 4294967276, looked oddly familiar: buffer overflow or underflow. 281474976710659 comes out to hex 1000000000003, which is a few numbers off from 281474976710655 (FFFFFFFFFFFF), and 4294967276 is FFFFFFEC, 20 short of 2^32.
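
The arithmetic can be checked with a few lines of Python. This is just a sketch: describe_counter is a made-up helper, and treating the values as sitting next to 32-bit and 48-bit boundaries is my assumption from the numbers themselves, not something the driver output confirms.

```python
def describe_counter(value):
    """Show a counter in hex and relative to two power-of-two boundaries.

    The 32-bit and 48-bit widths are assumptions picked because the
    observed values sit right next to those boundaries.
    """
    return {
        "hex": hex(value),
        "minus_2_32": value - 2**32,  # distance from the 32-bit boundary
        "minus_2_48": value - 2**48,  # distance from the 48-bit boundary
    }


ipkts = describe_counter(281474976710659)  # Ipkts reported for ixl0/ixl1
idrop = describe_counter(4294967276)       # Idrop reported for ixl0/ixl1

print(ipkts["hex"], ipkts["minus_2_48"])
print(idrop["hex"], idrop["minus_2_32"])
```

Read this way, Ipkts is 3 past 2^48 and Idrop is 20 below 2^32, which is what a counter that wrapped or went negative would look like.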

Multiple reboots, same thing (note the uptime in the output). Warm boot, cold boot, same result. Updated the firmware of the card (6.01 to 9.20) and got the same results.
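
For repeating this check after each reboot, a small parser over the netstat -i output can flag counters that could not possibly be real for the given uptime. This is a sketch under assumptions: suspicious_interfaces, SAMPLE, and the max_pps default are hypothetical, with 15,000,000 packets per second chosen as roughly 10GbE line rate at minimum frame size. Only the <Link#n> rows are inspected.

```python
# Link-level rows copied from the netstat -i output above (sample only).
SAMPLE = """\
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
igb0   1500 <Link#1>      00:25:90:4f:93:c0    34546     0     0     7916     0     0
ixl0   1500 <Link#5>      40:a6:b7:9a:c6:24 281474976710659     0 4294967276       11     0     0
ixl1   1500 <Link#6>      40:a6:b7:9a:c6:25 281474976710659     0 4294967276       14     0     0
"""


def suspicious_interfaces(netstat_text, uptime_seconds, max_pps=15_000_000):
    """Return {ifname: [counter names]} for counters above a plausible limit.

    max_pps is an assumed upper bound on packets per second; anything above
    uptime_seconds * max_pps cannot be a genuine packet count.
    """
    limit = uptime_seconds * max_pps
    flagged = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        # Link rows have 10 columns and "<Link#n>" in the Network column.
        if len(fields) != 10 or not fields[2].startswith("<Link#"):
            continue
        name = fields[0].rstrip("*")  # "*" marks an interface that is down
        counters = (("Ipkts", int(fields[4])), ("Idrop", int(fields[6])))
        bad = [label for label, value in counters if value > limit]
        if bad:
            flagged[name] = bad
    return flagged


# Uptime was 3 minutes in the output above.
flagged = suspicious_interfaces(SAMPLE, uptime_seconds=180)
print(flagged)
```

With the sample rows, only the two ixl interfaces are flagged; igb0's counters are well within what 3 minutes of uptime allows.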

I've tried enabling / disabling hardware offloading, changing IP addresses, etc., but I'm at a loss on this buffer issue.

Faulty card? Bad driver? I'm trying to rule out either hardware or software so I can determine my next steps. It's easy enough for me to load up TrueNAS again, but that doesn't give me closure.
 

DemohFoxfire

For the nomads browsing the search engine results, I will leave this solution here to provide closure on this issue. (I apologize: I say buffer overflow, but in reality I mean Intel's DRM that locks these cards to specific DACs.)

Intel-coded 25GbE DACs were the solution.
I have used 1 brand of 10Gb DAC, I think Cisco; those worked from a connectivity standpoint, but I do not recall whether there was buffer overflow.
FS generic-coded 25Gb DACs caused the buffer overflow, with no connectivity after a reboot.
Ubiquiti 25Gb DACs caused the buffer overflow but did have connectivity. I didn't test for long, as this required the 2 servers to be 'butt to butt' because I only had a pair of super short DACs.

After receiving some Intel-coded DACs from FS for use with the XXV710 NICs, all is well, so I consider this issue solved (except for the unexpected reboot from flashing).
 