LACP Intel x710 connection issues

rienk.dejong

Cadet
Joined
Mar 19, 2020
Messages
4
Hi all,

I've been having some network issues with our new FreeNAS server. I already posted in another forum to try to figure out the issue. Link to the other thread:
I think the root of the issue I've been having is in FreeNAS / FreeBSD, but I'm not sure, and the way I fixed it doesn't feel future-proof.

Let's start with the hardware involved:
FreeNAS server:
  • Mainboard: Supermicro H11SSL-I
  • Processor: AMD EPYC™ Rome 7262
  • Network card: AOC-STGF-i2S-O Intel X710 chip (2 x 10Gbit SFP+)
  • 128 GB RAM
  • 2 x 240GB SSD for OS
  • 10 x 4TB HDD + 4 x 480GB SSD for storage pool
Application / Xen XCP-ng servers:
  • Mainboard: Supermicro H11SSL-I
  • Processor: AMD EPYC™ Rome 7402P
  • Network card: AOC-STGF-i2S-O Intel X710 chip (2 x 10Gbit SFP+)
  • 128 GB RAM
  • 2 x 240GB SSD for OS
Switch gear:
  • Ubiquiti UniFi Switch 16XG (16 x 10Gbit)
    • Connected to the servers with SFP+ DAC cables
  • UniFi Switch 16 PoE
The 1Gbit ports of the servers are used for management interfaces, and the 10Gbit ports for storage and office network shares.
To separate the traffic I configured separate VLANs for office and SAN traffic.
Below is a network diagram.
[attachment: simple_network_diagram.png]


When I had everything initially configured as above, with both 10Gbit connections on each server set up as an LACP lagg, I had trouble pinging some hosts (VMs and physical machines).
For instance I could ping the following:
  • 172.16.8.2 (FreeNAS) to 172.16.8.3 (XCP-ng)
  • 172.16.8.2 (FreeNAS) to 172.16.8.4 (XCP-ng)
  • 172.16.8.2 (FreeNAS) to 172.16.8.101 (Debian VM @ 172.16.8.5)
  • 172.16.8.5 (XCP-ng) to 172.16.8.3 (XCP-ng)
  • 172.16.8.5 (XCP-ng) to 172.16.8.4 (XCP-ng)
  • 172.16.8.5 (XCP-ng) to 172.16.8.33 (Debian VM @ 172.16.8.3)
  • 172.16.8.5 (XCP-ng) to 172.16.8.101 (Debian VM @ 172.16.8.5)
But I couldn't ping the following:
  • 172.16.8.2 (FreeNAS) to 172.16.8.5 (XCP-ng)
  • 172.16.8.2 (FreeNAS) to 172.16.8.33 (Debian VM @ 172.16.8.3)
When I used arping from the Linux host I couldn't reach, I found that only the broadcast probes got a reply; the unicast probes failed.

Eventually I found that disconnecting one of the two 10Gbit ports on the FreeNAS box solved the problem. If I reconnected that cable and pulled the other one instead, I got a “Network is down” error.

With some googling I found people with similar (though not identical) problems on older versions of FreeNAS, and they all came down to the driver for the card.
So I downloaded http://pkg.freebsd.org/FreeBSD:11:amd64/latest/All/intel-ixl-kmod-1.11.9.txz ,
copied the if_ixl_updated.ko it contains to the /boot/modules folder, and loaded it with a loader tunable.
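
In case it helps anyone, this is roughly what I did from the shell (a sketch; the tunable name if_ixl_updated_load is derived from the module's file name, and the path inside the archive may differ):

  # fetch the standalone ixl driver package and unpack it
  fetch http://pkg.freebsd.org/FreeBSD:11:amd64/latest/All/intel-ixl-kmod-1.11.9.txz
  tar -xf intel-ixl-kmod-1.11.9.txz

  # copy the kernel module into place
  cp boot/modules/if_ixl_updated.ko /boot/modules/

  # FreeNAS UI: System -> Tunables -> Add
  #   Variable: if_ixl_updated_load   Value: YES   Type: loader
  # (on plain FreeBSD this would be if_ixl_updated_load="YES" in /boot/loader.conf)

  # after a reboot, check that the new module is the one in use
  kldstat | grep ixl
  dmesg | grep ixl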

It seems to work now, at least after a reboot of the FreeNAS box.

The thing that changed is:
original output of dmesg | grep ixl:
ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.11.9-k>
output of dmesg | grep ixl with the driver from pkg.freebsd.org:
ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.11.9>


My main question is: does anybody have the same experience with a similar setup?
And can anybody make sense of why the above driver change would solve my issue? I'm not 100% comfortable just loading another driver that isn't from the official FreeNAS repository.

Kind Regards,
Rienk
 

rienk.dejong

Cadet
Joined
Mar 19, 2020
Messages
4
I've read the thread above, but from what I could see, the issue there was that two interfaces had IP addresses in the same subnet.
In my situation I have an LACP lagg configured both on the FreeNAS side and on the switch, so I only have one interface to assign an IP to.
 
Joined
Dec 29, 2014
Messages
1,135
Your FreeNAS may be doing LACP, but are the XCP hosts also doing LACP? I suspect the MAC addresses are going down the wrong interface and one side or the other is discarding the frames. Try your tests with only one NIC connected to each host. If everything works then, something is not aggregating the LAGG/port channel correctly.
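
If it helps, you can check the negotiated LACP state on both ends before pulling cables (a sketch; I'm assuming the default lagg0 name on FreeNAS and an Open vSwitch bond called bond0 on XCP-ng, which may differ on your systems):

  # FreeBSD/FreeNAS: each laggport should show ACTIVE,COLLECTING,DISTRIBUTING
  ifconfig lagg0
  # XCP-ng (Open vSwitch): per-member bond and LACP status
  ovs-appctl bond/show bond0
  ovs-appctl lacp/show bond0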
 

rienk.dejong

Cadet
Joined
Mar 19, 2020
Messages
4
Thanks for the tip. I haven't tested that.
On the XCP side all links are also configured as LACP, and they seem to work fine because pings between XCP hosts work.
If I have time this week I'll switch back to the original driver to see if the problem returns, and I'll also try disconnecting one link on the XCP-ng side.
 

calpwns

Cadet
Joined
Feb 29, 2020
Messages
4

I registered specifically to reply to this, as I have a similar setup (Supermicro Xeon-D X11SDV variant with built-in Intel X722 SFP+ NICs, ESXi 6.7u3b, XG-16 with LACP). I'm getting crashes and hangs with the stock driver (ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.11.9-k>) when trying to configure a new LAGG connection on a fresh install; the VM hangs on that output and turns off. Granted, I have VT-d turned on and I'm passing the NICs through directly to the VM, to keep ESXi's strange distributed-switch features out of the troubleshooting mix. I've managed to make it work once, but I'm constantly getting crashes. If I remove the NICs from the VM, FreeNAS boots again.

But here's the strange bit: if I use the CLI to set up the LAGG with LACP, MTU 9000, and an IP address in a /24 subnet, I lose host connectivity and the CLI changes to "no connections". Rebooting clears the changes, but a stray "lagg0" entry survives my adding/deleting of aggregation options, and I can't seem to delete it afterwards. I have the XG-16 set up for aggregation, but no dice. I'm using genuine Cisco 10Gb-SR transceivers with OM3 fiber.

You mention replacing the driver - I have no idea how to do that in FreeNAS, and frankly, I don't know if I want to replace this driver if I'm going to have connectivity issues, as it's going to be a 100TB+ storage pool.

Care to share how you replaced it? Have you had any issues since?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
This is a standard issue with any sort of LAGG. All sides have to agree on the hash algorithm used to route traffic down the members of the LAGG. When you experience this sort of issue, there's a mismatch in hash algorithms: usually one side is set to the default of MAC address (layer 2), while the other side is hashing on IPs (layer 3).
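
On the FreeBSD side you can inspect and change the hash from the shell (a sketch assuming the interface is called lagg0; the switch side is set in the UniFi controller):

  # show the lagg configuration, including the lagghash currently in use
  ifconfig lagg0
  # hash on layer 2 only (MAC addresses), to match a layer-2-only switch
  ifconfig lagg0 lagghash l2
  # or hash on layers 2, 3 and 4 so different flows can use different members
  ifconfig lagg0 lagghash l2,l3,l4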
 

calpwns

Cadet
Joined
Feb 29, 2020
Messages
4
This is a standard issue with any sort of LAGG. All sides have to agree on the hash algorithm used to route traffic down the members of the LAGG. When you experience this sort of issue, there's a mismatch in hash algorithms: usually one side is set to the default of MAC address (layer 2), while the other side is hashing on IPs (layer 3).
That's what I'm thinking - another reason why I kept ESXi and its 5000 different hash options out of the mix. From what I gather, the XG-16 only runs... passive LACP? ...and FreeNAS is looking for active?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
According to https://help.ui.com/hc/en-us/articl...-USW-Configuring-Link-Aggregation-Groups-LAG-, the XG-16 can only hash based on Layer 2 (MAC, VLAN, or EtherType). This setup is optimized for LAGGs between switches, not for LAGGs to servers. You'll not gain any benefit using LAGG to the servers, as all the communication to a single server will only use one link of the LAGG.
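
To see why, here's a toy sketch of a layer-2 hash (not the switch's real algorithm): the member link is picked from the MAC pair alone, so every frame between the same two hosts takes the same link no matter how many flows are running.

  # hypothetical server MAC ending :01, NAS MAC ending :02, 2-port LAGG:
  # XOR the last octets and take the result modulo the member count
  link=$(( (0x01 ^ 0x02) % 2 ))
  echo "every frame between this MAC pair egresses on member link ${link}"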

calpwns

Cadet
Joined
Feb 29, 2020
Messages
4
According to https://help.ui.com/hc/en-us/articl...-USW-Configuring-Link-Aggregation-Groups-LAG-, the XG-16 can only hash based on Layer 2 (MAC, VLAN, or EtherType). This setup is optimized for LAGGs between switches, not for LAGGs to servers. You'll not gain any benefit using LAGG to the servers, as all the communication to a single server will only use one link of the LAGG.

This new NAS will be hit pretty hard by multiple servers, so any additional aggregation I can get out of it would be welcome.

I'm researching the other driver to see if I can get it working. I'm also curious what the "-k" means in the driver version the OP swapped out.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Given the limitations of the switch, you'll be better off trying to implement multipath IO.
 

calpwns

Cadet
Joined
Feb 29, 2020
Messages
4
Given the limitations of the switch, you'll be better off trying to implement multipath IO.
In regards to iSCSI MPIO? I toyed around with the idea of iSCSI, but my topology is best suited to anything but block-based storage.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
OK, then if you're serving NFS and SMB, you can still do a form of multipath via DNS round-robin. Give each server two share IPs (one per NIC), and let DNS round-robin between them.
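
A minimal sketch of the DNS side in a BIND-style zone file (the second address is made up for illustration; each address would live on its own NIC rather than on a lagg):

  ; two A records for one name; resolvers rotate between them,
  ; spreading clients across both NICs
  freenas  IN  A  172.16.8.2
  freenas  IN  A  172.16.8.12   ; hypothetical second share IP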
 