iSCSI Multipath Networking in SCALE

r34lj4k3

Cadet
Joined
Jan 24, 2019
Messages
6
Hello,

I've just migrated from CORE over to SCALE (super excited for Linux over FreeBSD btw) and upon first boot, I noticed my iSCSI networking was broken.

Hardware is a Dell R720xd with 26 SSDs, 384 GB ECC DDR3-1333, 2x E5-2643 v2, a PERC flashed to IT mode, and a Dell rNDC with dual 1 Gb Ethernet and dual 10 Gb SFP+.

Previous config under CORE was working perfectly:

10 Gb SFP+ port 1 was on VLAN 101 all the way through the switch to the NICs on the server; vmkping from the ESXi host and a regular ping from the TrueNAS server both succeeded.
10 Gb SFP+ port 2 was on VLAN 102, same story.

After the upgrade, even after re-making the VLAN interfaces and assigning ports, I can only ever get one of them to ping at a time, though either one can be the active one. I am fairly certain this is a networking issue or a bug, as there is no outbound traffic (except for the failed pings sent from TrueNAS) and no inbound traffic on whichever interface isn't working. That path is listed as DEAD in the iSCSI paths view on VMware ESXi 8.
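
For anyone else chasing the same symptom, these are the sort of Linux-side checks involved (a sketch; per the MAC addresses in the ifconfig output below, vlan102 rides on eno2 in my case):

ip -d link show vlan102     # confirm the 802.1Q ID and parent device
ip addr show vlan102        # confirm the address assignment
tcpdump -ni eno2 vlan 102   # watch the parent NIC for any tagged frames arriving at all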

Output of ifconfig -a:
root@Stronghold[~]# ifconfig -a
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether c8:1f:66:ec:e9:33 txqueuelen 1000 (Ethernet)
RX packets 20036112 bytes 21176045187 (19.7 GiB)
RX errors 0 dropped 116181 overruns 0 frame 0
TX packets 20409168 bytes 26716749643 (24.8 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 143 memory 0xd5000000-d57fffff

eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether c8:1f:66:ec:e9:35 txqueuelen 1000 (Ethernet)
RX packets 197953 bytes 21273790 (20.2 MiB)
RX errors 0 dropped 116173 overruns 0 frame 0
TX packets 246 bytes 18435 (18.0 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 170 memory 0xd6000000-d67fffff

eno3: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether c8:1f:66:ec:e9:37 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 170 memory 0xd7000000-d77fffff

eno4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.249 netmask 255.255.255.0 broadcast 192.168.1.255
ether c8:1f:66:ec:e9:39 txqueuelen 1000 (Ethernet)
RX packets 203629 bytes 21705551 (20.7 MiB)
RX errors 0 dropped 116163 overruns 0 frame 0
TX packets 10580 bytes 9433136 (8.9 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 191 memory 0xd8000000-d87fffff

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 6044 bytes 1572156 (1.4 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 6044 bytes 1572156 (1.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

vlan101: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.101.100 netmask 255.255.255.0 broadcast 192.168.101.255
ether c8:1f:66:ec:e9:33 txqueuelen 1000 (Ethernet)
RX packets 7296752 bytes 20066216937 (18.6 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8661678 bytes 25778147747 (24.0 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

vlan102: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.102.100 netmask 255.255.255.0 broadcast 192.168.102.255
ether c8:1f:66:ec:e9:35 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 37 bytes 8165 (7.9 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Networking settings page: [screenshot]


From the esxi side (hostname Bastion):
[root@Bastion:~] vmkping -I vmk1 192.168.101.100
PING 192.168.101.100 (192.168.101.100): 56 data bytes
64 bytes from 192.168.101.100: icmp_seq=0 ttl=64 time=0.146 ms
64 bytes from 192.168.101.100: icmp_seq=1 ttl=64 time=0.179 ms
64 bytes from 192.168.101.100: icmp_seq=2 ttl=64 time=0.140 ms

--- 192.168.101.100 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.140/0.155/0.179 ms

[root@Bastion:~] vmkping -I vmk2 192.168.102.100
PING 192.168.102.100 (192.168.102.100): 56 data bytes
sendto() failed (Host is down)
[root@Bastion:~]




The Round Robin config set from before: [screenshot]
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
super excited for Linux over FreeBSD btw

You're super excited to be running the thing that iXsystems said they were not prioritizing on SCALE, while recommending that everybody who needs iSCSI keep using CORE?

Okay then....?
 

r34lj4k3

Cadet
Joined
Jan 24, 2019
Messages
6
You're super excited to be running the thing that iXsystems said they were not prioritizing on SCALE, while recommending that everybody who needs iSCSI keep using CORE?

Okay then....?

Hi jgreco, big fan.

That being said, I did not see that as part of the migration guide. Can you point me in that direction?

The excitement stems from my hatred of FreeBSD and a love of Linux.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hi jgreco, big fan.

That being said, I did not see that as part of the migration guide. Can you point me in that direction?

The excitement stems from my hatred of FreeBSD and a love of Linux.

I haven't seen any "migration guide." I'm just working off what has been said. It was made very clear that there are a lot of issues with SCALE that are not going to be resolved in the short term, especially including things like the sucky Linux memory management (half the memory for ARC) and a variety of performance issues. My understanding is that iXsystems is focused on making certain subsystems such as Kubernetes, containers, and scale-out features work well. They are not particularly interested in investing time to fix use cases already addressed by CORE, such as iSCSI, where iXsystems actually invested significant time and effort in creating a very high performance kernel iSCSI subsystem. We already know the Linux iSCSI stuff kinda sucks.

Why do you hate FreeBSD, and why do you even care? It's an appliance. Your interactions with it should be through the GUI.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Have you tried removing your NMP claimrules for TrueNAS and letting it revert to the default behavior? I don't have a SCALE iSCSI setup on hand right at the moment, but I'll see if I can spin one up.
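
Roughly, from the ESXi shell, that would be something like this (a sketch; the vendor/model strings and PSP options shown are the common community rule for TrueNAS and must match whatever you originally added):

esxcli storage nmp satp rule list | grep -i truenas
esxcli storage nmp satp rule remove -s VMW_SATP_ALUA -P VMW_PSP_RR -O iops=1 -V "TrueNAS" -M "iSCSI Disk"

followed by a reboot or a rescan of the storage adapters.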

After the upgrade, even after re-making the VLAN interfaces and assigning ports, I can only ever get one of them to ping at a time, though either one can be the active one. I am fairly certain this is a networking issue or a bug, as there is no outbound traffic (except for the failed pings sent from TrueNAS) and no inbound traffic on whichever interface isn't working. That path is listed as DEAD in the iSCSI paths view on VMware ESXi 8.

I'd also suggest removing the VLAN tags from the TrueNAS SCALE machine and setting the switch-side ports to edge-type, "native VLAN" - the Cisco IOS equivalent here would be akin to switchport access vlan 101 and switchport access vlan 102 in the interface configuration.
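
As a sketch, the switch side would look something like this in IOS (interface names are placeholders for whatever your two storage-facing ports actually are):

interface TenGigabitEthernet1/0/1
 switchport mode access
 switchport access vlan 101
!
interface TenGigabitEthernet1/0/2
 switchport mode access
 switchport access vlan 102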

Is it possible to revert to a CORE boot environment to restore full functionality?
Can you try booting a new install of CORE? The upgrade is normally one-way, but the pool itself should import on either.
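
(The GUI import is the supported route, but underneath it's essentially the standard ZFS operation, roughly:

zpool import          # list pools visible to this host
zpool import tank     # import one by name; "tank" is a placeholder

just don't zpool upgrade the pool if you might want to move back.)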
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
The excitement stems from my hatred of FreeBSD and a love of Linux.
Kinda' funny, 'cause for me it's the other way around. I friggin' dislike Linux and its tendency to reinvent the wheel a million times, often arguably making it worse. I mean, just look at the messy introduction history of systemd and PulseAudio, to name a couple. That said, Linux does tend to have better desktop HW/SW support, which is why I use it for some of my workstations, but it's FreeBSD all the way for all my server machines.

Also, I'm currently running an experimental SCALE setup with a dummy pool (no real data) and maybe 4 TrueCharts apps, and the system constantly hovers around 8-11% CPU usage.... It's literally doing NOTHING and no one is using it. Meanwhile, my CORE box, which is actually running production shares and a bunch of jails with production services, mostly idles at 0-2%.... Never gonna use SCALE for my production NAS due to this.
 

r34lj4k3

Cadet
Joined
Jan 24, 2019
Messages
6
I haven't seen any "migration guide." I'm just working off what has been said. It was made very clear that there are a lot of issues with SCALE that are not going to be resolved in the short term, especially including things like the sucky Linux memory management (half the memory for ARC) and a variety of performance issues. My understanding is that iXsystems is focused on making certain subsystems such as Kubernetes, containers, and scale-out features work well. They are not particularly interested in investing time to fix use cases already addressed by CORE, such as iSCSI, where iXsystems actually invested significant time and effort in creating a very high performance kernel iSCSI subsystem. We already know the Linux iSCSI stuff kinda sucks.

Why do you hate FreeBSD, and why do you even care? It's an appliance. Your interactions with it should be through the GUI.
Here's the migration guide:


I think we've all needed to hit the CLI for various things; last time it was to pull more SMART information via the command line. I just find it very annoying that a lot of the standard Linux commands don't work, plus the lack of a BASH environment, which is where I spend most of my time in Linux.
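
For the SMART case it was just the usual smartmontools invocations (device names are illustrative; drives show up as daN on CORE and sdN on SCALE):

smartctl --scan        # enumerate the devices smartctl can see
smartctl -a /dev/sda   # full SMART dump for a single drive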

Is it possible to revert to a CORE boot environment to restore full functionality?
Can you try booting a new install of CORE? The upgrade is normally one-way, but the pool itself should import on either.
I ended up booting from the TrueNAS CORE OS still loaded on my NAS, and everything came back right away. I had not upgraded the pool after booting into the SCALE OS, just in case. I did see that it was supposedly a one-way transition, but after loading back into CORE it even picked up the previously broken iSCSI connection automagically. Guess I won't be on Linux for a while :(

Also, I'm currently running an experimental SCALE setup with a dummy pool (no real data) and maybe 4 TrueCharts apps, and the system constantly hovers around 8-11% CPU usage.... It's literally doing NOTHING and no one is using it. Meanwhile, my CORE box, which is actually running production shares and a bunch of jails with production services, mostly idles at 0-2%.... Never gonna use SCALE for my production NAS due to this.
I didn't have this experience with CPU usage; maybe it's service- or hardware-dependent?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Here's the migration guide:

So perhaps I'm a bit confused. You expected the migration guide to give you reasons not to migrate? There are already conspicuous warnings that SCALE is not an upgrade from CORE; in much the same way users are expected to understand the difference between a truck and an SUV when purchasing a vehicle, you are expected to understand that these are different things.

I think we've all needed to hit the CLI for various things; last time it was to pull more SMART information via the command line. I just find it very annoying that a lot of the standard Linux commands don't work, plus the lack of a BASH environment, which is where I spend most of my time in Linux.

I feel the same way about the idiotic ZSH environment thrust upon us in older versions of TrueNAS. I mostly hate on BASH because an entire generation was raised that cannot tell the difference between BASH and their rear ends; Bourne shell is NOT the same thing as BASH and those of us who write true Bourne scripting would appreciate it if the BASHies would take their damn BASHisms and choke to death on them. Do NOT frickin' shebang /bin/sh if you are writing in BASH! (just a quick venting there, heh) In any case, a "lack of the standard Linux commands" is likely related to your PATH; try defining some dotfiles, especially if you are using a non-root administrative account, due to the issues I pointed out in
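
Something along these lines in the account's ~/.profile is the usual fix (the PATH value below is just the conventional default, not anything I've verified against a specific SCALE release):

# ~/.profile -- make sure the sbin directories get searched for a non-root admin login
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
export PATH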

 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
I feel the same way about the idiotic ZSH environment thrust upon us in older versions of TrueNAS. I mostly hate on BASH because an entire generation was raised that cannot tell the difference between BASH and their rear ends; Bourne shell is NOT the same thing as BASH and those of us who write true Bourne scripting would appreciate it if the BASHies would take their damn BASHisms and choke to death on them. Do NOT frickin' shebang /bin/sh if you are writing in BASH! (just a quick venting there, heh)
I'm 100% with you on this one. Linux changing the default shell to bash/zsh and symlinking /bin/sh to it really is 100% to blame for this. Hence why, in the FreeBSD world, we call these things Linuxisms/Bashisms. Don't even get me started on systemd and the whole pile of software with hard dependencies on it, making it impossible to port to other POSIX systems.
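
For anyone who hasn't been bitten yet, a minimal illustration of the complaint (the first form is a BASHism and dies under a strict POSIX /bin/sh such as FreeBSD's sh or Debian's dash):

#!/bin/sh
# BASHism: [[ ]] is not POSIX and fails under ash/dash-style shells
if [[ "$1" == foo* ]]; then echo match; fi
# Portable Bourne equivalent:
case "$1" in foo*) echo match ;; esac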
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
I didn't have this experience with CPU usage; maybe it's service- or hardware-dependent?
I don't think there's anything in my hardware that would cause it (it's listed in my signature as the primary system). It's also just a simple VM with NO DATA and only 4 TrueCharts apps installed that weren't even configured yet (just the default deployment). I wanted to test the apps, but didn't bother once I saw the CPU (2 cores) constantly hovering around 5-13%.... on a VM that essentially has no data, no real users, and not even a configured app? What could the k3s process possibly be doing? My CORE VM (also 2 cores) sees no such issue, and it's an actual production server with TBs of data, real users, and a transmission jail seeding 20-30 torrents, and it's barely using 2%, if even that.
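
If anyone wants to chase where that idle burn actually goes, roughly this from the SCALE shell should show it (a sketch; assuming the k3s binary, which embeds its own kubectl, is on the PATH):

top -b -n 1 -o %CPU | head -20   # one snapshot, sorted by CPU usage
k3s kubectl get pods -A          # what the embedded Kubernetes is actually running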
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Hey OP!

I didn't run into this problem because when I migrated from CORE to SCALE, I also migrated my VMs from VMware to SCALE. That being said, have you tried making your management network non-native? In other words, VLAN 101 and VLAN 102 are tagged interfaces; can you also change your management network (on eno4) from native/untagged to tagged and see if that resolves the problem? I've seen weirder things happen....

I don't see anything "wrong" with your config (except that you only have a single uplink and not a LAG, but that has nothing to do with your problem :P).
 