[Help] RC1 SMB/NFS/HTTPS/SSH disconnects when routed

mikensan · Nov 2, 2021

Since BETA 2 (and now RC1), I’ve had issues accessing services hosted directly on my TrueNAS Scale server over a routed IP (192.168.0.10 > 192.168.1.5), while on the same subnet there are no problems (192.168.1.4 > 192.168.1.5). The service becomes unresponsive after about a minute and then I have to reconnect / refresh. I’ve observed this behavior with HTTPS/SMB/NFS/SSH. I expect iSCSI would have this issue too, but can’t be bothered to test since the big 4 are affected anyway.

However, here’s the kicker: I can ping through the gateway all day (granted, it's not TCP and not stateful, so it is different), and I can access a container (Nextcloud) hosted on TrueNAS Scale through the gateway without any issue. My gut tells me it’s definitely something in the network stack of Scale that’s not happy.

My gateway/firewall is pfsense, and pretty much default, so it /should/ not be closing states at specific intervals. No other service between subnets is impacted, only those hosted directly on TrueNAS Scale.

morganL · Nov 2, 2021

Your gateway may be allowing ping and http, but not other protocols. You would have to look at pfsense settings carefully.

mikensan · Nov 3, 2021

morganL said:
Your gateway may be allowing ping and http, but not other protocols. You would have to look at pfsense settings carefully.

HTTP/S is one of the affected protocols as well; ping (the only non-TCP test) and other non-TrueNAS services work fine through the same IP.

Example: I go to https://truenas.domain.com and after about a minute, it is unresponsive and I must refresh the page a few times.
Example: I go to https://truenas.domain.com:9000 (Nextcloud K8) and I experience no problems.

mikensan · Nov 3, 2021

Here you can see normal activity, couple dropped packets, and then outright packet loss. This is a packet capture from pfsense, it's passing everything between the two:

mikensan · Nov 3, 2021

Just in case, same issue observed in HTTP (80/tcp):

And here is a capture from using nextcloud via the same interface to a container (9001/tcp):

paxswill · Nov 3, 2021

Does your TrueNAS Scale host have multiple interfaces? If it does, could you try a packet dump on all interfaces while connecting to one of the impacted services, like so:

Code:

tcpdump -vne -i any -w all_interfaces.pcap

When you look at it in Wireshark, are the response packets going out on the same interface that the original requests came in on?

mikensan · Nov 4, 2021

paxswill said:
Does your TrueNAS Scale host have multiple interfaces? If it does, could you try a packet dump on all interfaces while connecting to one of the impacted services, like so:

Code:
tcpdump -vne -i any -w all_interfaces.pcap

When you look at it in Wireshark, are the response packets going out on the same interface that the original requests came in on?

Thanks for thinking this through with me. I do have multiple interfaces, and it does look like it's sending out of the wrong interface. I recently moved the IP to a different interface while troubleshooting this issue; it was originally on a Mellanox 10Gb card. I thought maybe it was the driver or something with the card, so I deleted the VLAN10 interface I had created, set up a switchport with native VLAN 10, and connected an Ethernet cable to an available port on the box. That way I thought I'd be removing both the VLAN tags and the older NIC from the equation.

Funny though, it starts off this way and "works" for 30-60 seconds.

Desktop > Scale: 1gbe correct NIC
Scale > Desktop: 10gb old NIC.
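
The same asymmetry can probably be confirmed without a packet capture by asking the kernel which route it would pick for a reply. What follows is a minimal sketch, not a tested tool: it assumes the addresses from this thread (NAS 192.168.1.5, client 192.168.0.10), the interface name `eno1` is a placeholder, and `egress_dev` and `check_reply_path` are hypothetical helper names. It relies on `ip route get <dst> from <src>` printing a `dev <interface>` field.

```shell
#!/bin/sh
# Print the interface named after the "dev" keyword in `ip route get` output
# (read from stdin). Prints nothing if no "dev" field is present.
egress_dev() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Report whether replies from a local address would leave via the expected NIC.
# $1: local address on the NAS  $2: client address  $3: expected interface
check_reply_path() {
    actual="$(/usr/bin/ip route get "$2" from "$1" | egress_dev)"
    if [ "$actual" = "$3" ]; then
        printf 'ok: replies from %s to %s leave via %s\n' "$1" "$2" "$actual"
    else
        printf 'mismatch: replies from %s to %s leave via %s, expected %s\n' \
            "$1" "$2" "${actual:-unknown}" "$3"
        return 1
    fi
}

# Example call (interface name is a placeholder; run on the NAS itself):
# check_reply_path 192.168.1.5 192.168.0.10 eno1
```

If the reported interface differs from the one holding the source address, replies are taking a different path than the requests, which matches the behavior described above.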

paxswill · Nov 4, 2021

I've run into this before with multi-homed Linux hosts, and have used routing policy rules to fix it. This is a quick and dirty hack (emphasizing the dirty and hack part) I'm using to work around this issue right now:

Code:

#!/bin/sh

set -eu

EXPECTED_VERSION="22.02-RC.1"
ACTUAL_VERSION="$(cat /etc/version)"

# Super quick check to prevent this script from running on later versions
if [ "$ACTUAL_VERSION" != "$EXPECTED_VERSION" ]; then
        printf "TrueNAS Scale version (%s) not supported (%s)\n" \
                "$ACTUAL_VERSION" \
                "$EXPECTED_VERSION"
        exit 1
fi

# Argument list:
# 1: IP address of the interface
# 2: Gateway address for that interface
# 3: Interface name
# 4: Table ID number.
#
# The Table ID is also used as the rule priority, so pick a number lower than
# 32760 or so, and not 0, 253, 254, or 255.
create_rule_route() {
        /usr/bin/ip rule add from "$1" table "$4" priority "$4"
        /usr/bin/ip route replace default via "$2" dev "$3" table "$4"
}

# I'm using VLANs 60 and 90, with corresponding subnets. I am adding the VLAN
# tags to 10000 to get the routing table numbers.

create_rule_route "172.17.60.2/24" "172.17.60.1" vlan60 10060
create_rule_route "172.17.90.2/24" "172.17.90.1" vlan90 10090

Again, this is a dirty hack until iXsystems addresses the actual issue (which seems to be tracked in NAS-113103). I have the script set to run as a postinit startup script, and it seems to be working alright. I also removed the default gateway set through the web UI, as the script sets that as well.
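
One caveat if the script is ever run by hand a second time: `ip rule add` is not idempotent, and as far as I know recent kernels reject a duplicate rule with "File exists", which `set -eu` would turn into an abort. Below is a minimal sketch of a re-runnable variant that also checks the table ID constraints from the comment in the script. `create_rule_route_idempotent`, `valid_table_id`, and the `IP_CMD` override are hypothetical names added for illustration; they are not part of the original script or of TrueNAS.

```shell
#!/bin/sh
# IP_CMD is an assumption added so the commands can be previewed, e.g.
#   IP_CMD=echo sh this_script.sh
IP_CMD="${IP_CMD:-/usr/bin/ip}"

# Accept only plain decimal table IDs that avoid the reserved tables
# (0, 253, 254, 255) and sort before the main table's rule (priority 32766).
valid_table_id() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;       # empty or not a number
        0*|253|254|255) return 1 ;;    # zero, leading zero, or reserved
    esac
    [ "$1" -lt 32766 ]
}

# Same arguments as create_rule_route:
# 1: IP address of the interface  2: gateway  3: interface name  4: table ID
create_rule_route_idempotent() {
    valid_table_id "$4" || {
        echo "refusing table ID $4" >&2
        return 1
    }
    # Drop a rule left over from an earlier run; ignore "no such rule" errors.
    "$IP_CMD" rule del from "$1" table "$4" priority "$4" 2>/dev/null || true
    "$IP_CMD" rule add from "$1" table "$4" priority "$4"
    # `route replace` already overwrites an existing default in this table.
    "$IP_CMD" route replace default via "$2" dev "$3" table "$4"
}

# Example calls, mirroring the original script:
# create_rule_route_idempotent "172.17.60.2/24" "172.17.60.1" vlan60 10060
# create_rule_route_idempotent "172.17.90.2/24" "172.17.90.1" vlan90 10090
```

Afterwards, `ip rule show` and `ip route show table 10060` should list the new rule and the per-table default route.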

mikensan · Nov 4, 2021

paxswill said:
I've run into this before with multi-homed Linux hosts, and have used routing policy rules to fix it. This is a quick and dirty hack (emphasizing the dirty and hack part) I'm using to work around this issue right now:

Code:
#!/bin/sh

set -eu

EXPECTED_VERSION="22.02-RC.1"
ACTUAL_VERSION="$(cat /etc/version)"

# Super quick check to prevent this script from running on later versions
if [ "$ACTUAL_VERSION" != "$EXPECTED_VERSION" ]; then
        printf "TrueNAS Scale version (%s) not supported (%s)\n" \
                "$ACTUAL_VERSION" \
                "$EXPECTED_VERSION"
        exit 1
fi

# Argument list:
# 1: IP address of the interface
# 2: Gateway address for that interface
# 3: Interface name
# 4: Table ID number.
#
# The Table ID is also used as the rule priority, so pick a number lower than
# 32760 or so, and not 0, 253, 254, or 255.
create_rule_route() {
        /usr/bin/ip rule add from "$1" table "$4" priority "$4"
        /usr/bin/ip route replace default via "$2" dev "$3" table "$4"
}

# I'm using VLANs 60 and 90, with corresponding subnets. I am adding the VLAN
# tags to 10000 to get the routing table numbers.

create_rule_route "172.17.60.2/24" "172.17.60.1" vlan60 10060
create_rule_route "172.17.90.2/24" "172.17.90.1" vlan90 10090

Again, this is a dirty hack until iXsystems addresses the actual issue (which seems to be tracked in NAS-113103). I have the script set to run as a postinit startup script, and it seems to be working alright. I also removed the default gateway set through the web UI, as the script sets that as well.

Very cool, I will have to try that tomorrow. Yup that's my bug submission lol. Right now my workaround is just making sure I am on the same subnet but obviously I don't want every service on every subnet lol.
