High local network latency (including localhost?)

rvman

Cadet
Joined
Mar 26, 2021
Messages
3
Hello,

I've just installed a new server (Dell R640, dual 12-core Xeon Gold, 64 GB RAM) with the latest TrueNAS (TrueNAS-12.0-U2.1) and am getting it all set up.

Everything is pretty much set at defaults at the moment.

Everything works great, but I've run into a curious case of high latency on the local Ethernet ports, and even on the localhost interface, and it's driving me a bit silly now.

This server has an Intel X550 quad-port 10G card in it and it's hooked up to a Dell 10G switch.

Actual throughput seems to be OK: using iperf3 I get 9.8 Gbit/s transfer rates in both directions.

I've searched all around and can't find any reasonable explanation, so I'm asking here in case anyone else has seen this behaviour.

As a quick test, I booted a random Linux live image on the same hardware, and it does not show any of these symptoms at all, which seems to rule out a hardware issue.

Quick summary:

I would normally expect sub-ms ping times on a local 10G network (our other FreeNAS boxes are sub-ms).

Pinging localhost on this box:

Code:
root@nas1[~]# ping localhost
PING localhost (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=7.629 ms
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=8.518 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=12.191 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=11.324 ms
64 bytes from 127.0.0.1: icmp_seq=4 ttl=64 time=11.748 ms



Pinging the local Ethernet interface address (ix1):

Code:
root@nas1[~]# ping 10.100.0.130
PING 10.100.0.130 (10.100.0.130): 56 data bytes
64 bytes from 10.100.0.130: icmp_seq=0 ttl=64 time=14.793 ms
64 bytes from 10.100.0.130: icmp_seq=1 ttl=64 time=11.109 ms
64 bytes from 10.100.0.130: icmp_seq=2 ttl=64 time=9.899 ms
64 bytes from 10.100.0.130: icmp_seq=3 ttl=64 time=17.252 ms
64 bytes from 10.100.0.130: icmp_seq=4 ttl=64 time=12.138 ms


Given that the problem shows up without packets even leaving the box, I'm pretty much discounting a switch or cabling problem.

Curiously enough, though, pinging any external device via the second interface (ix1) returns normal, expected results:

Code:
root@ssdnas1[~]# ping 10.100.0.197
PING 10.100.0.197 (10.100.0.197): 56 data bytes
64 bytes from 10.100.0.197: icmp_seq=0 ttl=64 time=0.210 ms
64 bytes from 10.100.0.197: icmp_seq=1 ttl=64 time=0.235 ms
64 bytes from 10.100.0.197: icmp_seq=2 ttl=64 time=0.219 ms
64 bytes from 10.100.0.197: icmp_seq=3 ttl=64 time=0.226 ms


And this wouldn't be such a big deal, except that the reverse direction (pinging from 10.100.0.197) shows the same high-latency issue:

Code:
root@vmnas[~]# ping 10.100.0.130
PING 10.100.0.130 (10.100.0.130): 56 data bytes
64 bytes from 10.100.0.130: icmp_seq=0 ttl=64 time=12.191 ms
64 bytes from 10.100.0.130: icmp_seq=1 ttl=64 time=13.345 ms
64 bytes from 10.100.0.130: icmp_seq=2 ttl=64 time=9.856 ms
64 bytes from 10.100.0.130: icmp_seq=3 ttl=64 time=12.245 ms




And for completeness, here's the current interface configuration:
Code:
root@ssdnas1[~]# ifconfig -a
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether e4:43:4b:be:b5:a0
        inet 10.99.0.20 netmask 0xffffff00 broadcast 10.99.0.255
        media: Ethernet autoselect (10Gbase-T <full-duplex,rxpause,txpause>)
        status: active
        nd6 options=9<PERFORMNUD,IFDISABLED>
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether e4:43:4b:be:b5:a1
        inet 10.100.0.130 netmask 0xffffff00 broadcast 10.100.0.255
        media: Ethernet autoselect (10Gbase-T <full-duplex,rxpause,txpause>)
        status: active
        nd6 options=9<PERFORMNUD,IFDISABLED>
ix2: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether e4:43:4b:be:b5:a2
        media: Ethernet autoselect
        status: no carrier
        nd6 options=1<PERFORMNUD>
ix3: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether e4:43:4b:be:b5:a3
        media: Ethernet autoselect
        status: no carrier
        nd6 options=1<PERFORMNUD>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x5
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
pflog0: flags=0<> metric 0 mtu 33160
        groups: pflog


Has anyone run into this or have any suggestions on where I should look to get this resolved?

Thanks!
 

HolyK

Ninja Turtle
Moderator
Joined
May 26, 2011
Messages
654
Do you get the same or different values when pinging "127.0.0.1" versus "localhost"? It might sound silly, but try it...
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
This sounds like possible fragmentation/reassembly at the lo0 loopback interface. What's the output of ifconfig lo0? In particular, what MTU is displayed for lo0? Also, what's the output of sysctl -a | grep mss?
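For reference, the arithmetic behind that fragmentation check can be sketched like this (a minimal illustrative helper, not TrueNAS tooling, assuming IPv4: the largest ICMP echo payload that fits in a single packet is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP echo header):

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one packet for a given MTU:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP echo header).
max_icmp_payload() {
    mtu=$1
    echo $((mtu - 20 - 8))
}

# With lo0 at its default MTU of 16384, as shown later in the thread:
#   ping -s "$(max_icmp_payload 16384)" localhost
# would send the largest echo that needs no fragmentation.
```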
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Check the YottaMark on the Intel card. There are fake X550s out there.
 

rvman

Cadet
Joined
Mar 26, 2021
Messages
3
Some additional notes:

This is a brand-new Dell R640, so I hope the hardware is genuine :smile:. Plus, this affects loopback the same way, which leads me to believe it's not hardware-related.

- Pinging "127.0.0.1" gives the same result.

lo0 ifconfig:
Code:
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x5
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>


I didn't mention it originally, but I have 2 of these servers (identical hardware/configuration) and they both behave identically. Or, well, they *did*.

I've not touched the first one since the original report (no logins, no reboots), but it now reports proper ping times!?

Code:
root@nas1[~]# ping localhost
PING localhost (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.089 ms
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.151 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.143 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.213 ms


All the other interfaces now operate properly as well, so something has worked itself out.

The other R640, however, still behaves as originally reported, with high latency (12-17 ms+).

Weird....
 

rvman

Cadet
Joined
Mar 26, 2021
Messages
3
One more discovery...

I realized I had a ping going to the host from another box in the background. While that other ping was running, pings to localhost were within expected times.

When I stopped the ping from the other box, the localhost pings suddenly jumped up again.

This is an idle box for the most part, but it seems there needs to be some sort of ongoing network traffic for the latency to stay normal?
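One way to quantify that idle-vs-busy difference (an illustrative helper, assuming the BSD ping output format shown in the earlier posts) is to average the "time=" fields from a ping run, once on an idle box and once with background traffic flowing:

```shell
#!/bin/sh
# Average the "time=N.NNN ms" values from ping output piped in on stdin.
# Illustrative helper, assuming the BSD ping output format shown above.
avg_rtt_ms() {
    awk -F'time=' '/time=/ { sub(/ ms.*/, "", $2); sum += $2; n++ }
                   END { if (n) printf "%.3f\n", sum / n }'
}

# Usage: compare runs with and without a background ping from another box:
#   ping -c 10 localhost | avg_rtt_ms
```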
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This is a brand new Dell R640, so I hope the hardware is genuine :smile: , plus, this affects loopback the same way, which leads me to believe it's not hardware related.

An interrupt storm or traffic flood can affect the network stack in that way, so yes, it can be hardware-related, even if the particular interface isn't a hardware device.

The X550 is not a super-common choice of card, and most of the time when I see one show up on the forums it's in the context of problems, so my first thought tends to head in the direction of fake cards. 10GBase-T can also be problematic, but that usually shows up as performance problems, and I can't think of how it would impact loopback. Mmm.

Fake card obviously isn't the only possibility, but it made sense to get it out of the way.
 