[Help] RC1 SMB/NFS/HTTPS/SSH disconnects when routed

mikensan

Dabbler
Joined
Jan 11, 2016
Messages
16
Since BETA 2 (and now RC1), I’ve had issues accessing services hosted directly on my TrueNAS Scale server over a routed IP (192.168.0.10 > 192.168.1.5); on the same subnet there are no problems (192.168.1.4 > 192.168.1.5). The service becomes unresponsive after about a minute, and then I have to reconnect or refresh. I’ve observed this behavior with HTTPS, SMB, NFS, and SSH. I expect iSCSI would have the same issue, but I haven’t bothered to test it since the big four are affected anyway.

However, here’s the kicker: I can ping through the gateway all day (granted, ping is not TCP and not stateful in the same way, so it is different), and I can access a container (Nextcloud) hosted on TrueNAS Scale through the gateway without any issue. My gut tells me something in Scale’s network stack is unhappy.

My gateway/firewall is pfSense, pretty much at default settings, so it /should/ not be closing states at specific intervals. No other service between subnets is impacted, only those hosted directly on TrueNAS Scale.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Your gateway may be allowing ping and HTTP, but not other protocols. You would have to look at the pfSense settings carefully.
 

mikensan

Dabbler
Joined
Jan 11, 2016
Messages
16
HTTP/S is one of the affected protocols as well. Ping (the only non-TCP test) and other non-TrueNAS services work fine through the same IP.

Example: I go to https://truenas.domain.com and after about a minute, it is unresponsive and I must refresh the page a few times.
Example: I go to https://truenas.domain.com:9000 (Nextcloud K8) and I experience no problems.
 

mikensan

Dabbler
Joined
Jan 11, 2016
Messages
16
Here you can see normal activity, a couple of dropped packets, and then outright packet loss. This is a packet capture from pfSense, which is passing everything between the two:
1635935840258.png
 

mikensan

Dabbler
Joined
Jan 11, 2016
Messages
16
Just in case, same issue observed in HTTP (80/tcp):
1635945738164.png


And here is a capture from using nextcloud via the same interface to a container (9001/tcp):
1635946275572.png
 

paxswill

Cadet
Joined
Jul 16, 2019
Messages
5
Does your TrueNAS Scale host have multiple interfaces? If it does, could you try a packet dump on all interfaces while connecting to one of the impacted services, like so:
Code:
tcpdump -vne -i any -w all_interfaces.pcap

When you look at it in Wireshark, are the response packets going out on the same interface that the original requests came in on?
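As a quick cross-check alongside the capture, the kernel can also be asked which interface it would choose for a reply. The sketch below assumes iproute2's `ip route get` is available on the Scale host; the helper name `egress_dev` and the example addresses are illustrative and would need to be replaced with your own.

```shell
#!/bin/sh
# Hypothetical helper: read `ip route get` output on stdin and print the
# interface named after the "dev" keyword on the first line that has one.
egress_dev() {
        awk '{
                for (i = 1; i < NF; i++) {
                        if ($i == "dev") { print $(i + 1); exit }
                }
        }'
}

# Example usage on the NAS (addresses are assumptions; substitute your own):
#   ip route get 192.168.0.10 from 192.168.1.5 | egress_dev
# If the printed interface is not the one that owns 192.168.1.5, replies are
# leaving on a different NIC than the one the requests arrived on.
```

This only reports the kernel's current routing decision, so it complements the packet capture rather than replacing it.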
 

mikensan

Dabbler
Joined
Jan 11, 2016
Messages
16
Thanks for thinking this through with me. I do have multiple interfaces, and it does look like replies are being sent out of the wrong interface. I recently moved the IP to a different interface while troubleshooting this issue; it was originally on a Mellanox 10GbE card. I thought maybe it was the driver or something with the card, so I deleted the VLAN 10 interface I had created, set up a switchport with native VLAN 10, and connected an Ethernet cable to an available port on the box. That way I thought I was removing both VLAN tags and an older NIC from the equation.

Funny though, it starts off this way and "works" for 30-60 seconds.

Desktop > Scale: 1GbE NIC (correct)
Scale > Desktop: 10GbE NIC (old)

1636037988461.png


1636038608743.png
 


paxswill

Cadet
Joined
Jul 16, 2019
Messages
5
I've run into this before with multi-homed Linux hosts, and have used routing policy rules to fix it. This is a quick and dirty hack (emphasizing the dirty and hack part) I'm using to work around this issue right now:
Code:
#!/bin/sh

set -eu

EXPECTED_VERSION="22.02-RC.1"
ACTUAL_VERSION="$(cat /etc/version)"

# Super quick check to prevent this script from running on later versions
if [ "$ACTUAL_VERSION" != "$EXPECTED_VERSION" ]; then
        printf "TrueNAS Scale version (%s) not supported (%s)\n" \
                "$ACTUAL_VERSION" \
                "$EXPECTED_VERSION"
        exit 1
fi

# Argument list:
# 1: IP address of the interface
# 2: Gateway address for that interface
# 3: Interface name
# 4: Table ID number.
#
# The Table ID is also used as the rule priority, so pick a number lower than
# 32760 or so, and not 0, 253, 254, or 255.
create_rule_route() {
        /usr/bin/ip rule add from "$1" table "$4" priority "$4"
        /usr/bin/ip route replace default via "$2" dev "$3" table "$4"
}

# I'm using VLANs 60 and 90, with corresponding subnets. I am adding the VLAN
# tags to 10000 to get the routing table numbers.

create_rule_route "172.17.60.2/24" "172.17.60.1" vlan60 10060
create_rule_route "172.17.90.2/24" "172.17.90.1" vlan90 10090

Again, this is a dirty hack until iXsystems addresses the actual issue (which seems to be tracked in NAS-113103). I have the script set to run as a postinit startup script, and it seems to be working alright. I also removed the default gateway set through the web UI, as the script sets that as well.
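The table-numbering convention in the script (VLAN tag plus 10000) could be wrapped in a small helper so the IDs are not typed by hand. This is only a sketch: `table_id_for_vlan` is a hypothetical name, and limiting the tag to 1 through 4094 is an assumption based on the standard 802.1Q range, which keeps every result well below the roughly 32760 ceiling and away from tables 0, 253, 254, and 255.

```shell
#!/bin/sh
# Hypothetical helper: map a VLAN tag to a routing table ID (tag + 10000).
# Valid tags (1-4094) yield IDs 10001-14094, which avoid the reserved
# tables 0, 253, 254, 255 and stay below the ~32760 priority ceiling.
table_id_for_vlan() {
        vlan="$1"
        # Accept only 1-4 decimal digits with no leading zero, so shell
        # arithmetic never treats the value as octal.
        case "$vlan" in
                ''|*[!0-9]*|0*|?????*)
                        echo "invalid VLAN tag: $vlan" >&2
                        return 1
                        ;;
        esac
        if [ "$vlan" -gt 4094 ]; then
                echo "VLAN tag out of range: $vlan" >&2
                return 1
        fi
        echo $((vlan + 10000))
}

# Example usage with the function from the script above:
#   create_rule_route "172.17.60.2/24" "172.17.60.1" vlan60 "$(table_id_for_vlan 60)"
```

Because the ID doubles as the rule priority in the script, keeping the mapping in one place makes it harder for two VLANs to end up with the same table.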
 

mikensan

Dabbler
Joined
Jan 11, 2016
Messages
16
Very cool, I will have to try that tomorrow. Yup, that's my bug submission lol. Right now my workaround is just making sure I am on the same subnet, but obviously I don't want every service on every subnet lol.
 