LAGG spanning multiple nonstacked switches

Joined
Feb 3, 2015
Messages
10
I've got a pair of Zyxel XS3700 10G layer 2 switches in my setup, and they do not support any kind of stacking, as far as I can tell.

For redundancy purposes, I'd like to create a LAGG on my FreeNAS boxes, with one interface connected to each switch. The only possible protocol is failover, I think, but that should work, right?
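Roughly what I have in mind, expressed as plain ifconfig commands for illustration (interface names and the address are just placeholders; on FreeNAS it would of course be set up through the GUI):

Code:
# one member cabled to each Zyxel; only the active port carries traffic
ifconfig lagg0 create
ifconfig lagg0 laggproto failover laggport ix0 laggport ix1
ifconfig lagg0 inet 192.168.1.10/24 up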

Thanks,

Torkil
 
Joined
Feb 3, 2015
Messages
10
Did you try this, and if so, what worked?

Not yet, but I'll update when I do get around to it. I discussed it with a guy from iX and we're pretty sure it will work, though only with failover. The next question is whether it will work with a TrueNAS HA setup.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It may be dodgy and quite possibly won't work. Link aggregation isn't exactly designed to do this (MLAG implementations aside), and there are often strange problems when you break networking paradigms in odd ways. As you note, failover is the only option that has even a chance of working. You may want to investigate net.link.lagg.failover_rx_all as something to set in any case.
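For testing from a shell that's just:

Code:
# accept frames received on the inactive failover lagg port(s) instead of dropping them
sysctl net.link.lagg.failover_rx_all=1

On FreeNAS you'd persist it as a tunable of type "sysctl" so it survives a reboot.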
 
Joined
Feb 3, 2015
Messages
10
Thanks for the tip, I'll keep that in mind when testing.

Additionally, I'm contemplating these two topologies; any comments? The Junipers only have 4 SFP+ 10G ports, which limits my options.

The Juniper switches are EX3300-24s configured as a Virtual Chassis; the Zyxels are XS3700s. Some version of STP will be used on the Zyxels to avoid switching loops, and also on the uplinks, to keep traffic on the fast one.

I'll have a number of servers connected to the Zyxels (TrueNAS, FreeNAS and ESXi), and about 20 Linux hosts connected to the Junipers with bonded 1G links. My goal is redundancy, so that a dying switch doesn't take down all my infrastructure; failover is fine, both on the FreeNAS/TrueNAS boxes and on the ESXi vSwitches.

[attached: topology diagrams]
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, you're going to end up with topology weirdness since STP's going to shut down some of the paths. The question is whether or not they'd be the ones you want, and the truth is that they probably won't be.

In your first scenario, it is either the link between the Zyxels or one of the links between the Junipers and one of the Zyxels that is likely to get put into blocking mode, assuming the Junis hold the root.

In the second scenario, the link between the Zyxels and one set of links between the Junis and the Zyxels will get put into blocking mode.

I'm not entirely clear on what's needed for the Juniper "virtual chassis" setup you're using. Presumably management traffic, but only switching traffic where topology dictates? I will note that in the second scenario there is no case where 10G worth of traffic would flow between the Junipers, but I suppose if you have the ports...
 
Joined
Feb 3, 2015
Messages
10
In the simplest one, I would have the Juniper VC/stack hold the root and block the path between the Zyxels with STP, as depicted below. The blocked path would become active if I lose either a Zyxel or a Juniper, and connectivity would be restored?

Juniper recommends a dual link between the devices in a VC to create a ring, but with only two devices that's sort of irrelevant. My suggestion is that the Juniper VC can be regarded as a single device for this design, so I'll only have to worry about STP on the Zyxels.

[attached: topology diagram]
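On the Juniper side I'd simply pin the root to the VC, something along these lines in Junos (assuming RSTP):

Code:
# on the EX3300 Virtual Chassis: make it win the root election
set protocols rstp bridge-priority 4k
# the Zyxels stay at the default priority (32768), so the VC remains root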
 
Joined
Feb 3, 2015
Messages
10
I guess skipping STP altogether makes it a lot simpler, and as long as both FreeNAS/TrueNAS and the vSwitch on ESXi can handle being connected to two different non-stacked edge switches, I'll still be able to survive a switch going south. If it's a Zyxel that dies, the scenario is identical; if it's a Juniper, it'll take one of the Zyxels with it, but everything will still have a link on the other switch. Any suggestions or better topologies given my hardware? =)

[attached: topology diagram]
 
Joined
Feb 3, 2015
Messages
10
Hooked it up like this for testing:

Code:
         Client
            |
Zyxel --- Juniper --- Zyxel
     \               /
      FreeNAS w. failover


It seems to work; hardly a hiccup when I unplug and replug the two interfaces.
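For the test I just kept a ping running from the client and watched which laggport was marked active on the FreeNAS box (the address is an example):

Code:
# on the FreeNAS box: list the lagg members and see which one is currently active
ifconfig lagg0
# on the client: continuous ping to spot any drops while cables are pulled
ping 192.168.1.10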
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, I didn't say you wanted to skip STP... but you want to be very aware what the implications will be. Especially if you are doing this with a single vSwitch on ESXi.

In our most recent design here, we've got something that somewhat resembles the above. Two Dell Networking N4032F's form the network core in the server rack, and two Dell Networking 7048P's provide edge switching in the distribution frame (can be scaled as needed).

There are some differences: Nothing's stacked. We have many, many vlans in the environment which appear on all switches. For around two decades, we've done fully redundant core networks, but that used to be at layer 3 with OSPF. It is still designed that way but now we can handle redundancy at layer 2 as well.

[attached: foo.png]

The two N4032F's are connected at 40Gbps. Each N4032F has a 20Gbps LACP to each 7048P. Clients typically can't be redundant but in theory could be. To get from one edge switch to the other requires going to the core. The nonobvious thing is that STP will shut down the 20Gbps LACP links from the non-root N4032F to both 7048P's, meaning that traffic from that N4032F to the edge always traverses the 40Gbps link. Unless that's down, in which case one of the blocked 20Gbps LACPs will be enabled. Since the edges really don't need more than 10Gbps, this is a very robust design that ought to last a few years. ;-)

The ESXi hosts each have a pair of vSwitches, with 4 10G uplinks. In our environment we have LOTS of vlans, so some are presented on one vSwitch and some on the other. This is tied up with the layer 3 routing protocols, and would be very confusing without the historical perspective, but it basically causes some load distribution between the switches. However, with modern gear we can have link aggregation in failover mode, so the links in green are backup links that only come up if the primary goes down. This means that either N4032F can be rebooted without an issue; all vlans and traffic fall back onto the other switch and take backup pathing as needed. It's also expandable as needed.

So that's all lots of networking fun and doesn't even touch on the layer 3 goodness.

Now back to yours.

In your diagram, you've tried to connect the two Zyxels together, but the thing I'd have to wonder is, how much traffic will actually be traversing that route? Keeping them separate is probably a better idea; in the unusual case where you would actually need traffic to flow from one edge switch to another, let it traverse the Junipers.

In the architecture I've drawn above, I have the advantage of multiple vlans, so I can encourage traffic not to traverse the 40Gbps link by putting all of "vlan A, B, C" on vSwitch 0 on each ESXi, and then "vlan D, E, F" on vSwitch 1 on each ESXi. In that model, one core switch tends to handle most of the traffic for those specific vlans.

You don't have that advantage, and you might find that you get into an oversubscription situation if you allow traffic to flow between Zyxels in that manner, because your upstream connectivity on that one Juni is 10G. I'm unclear on an optimal way to deal with this. There seem to be a bunch of suboptimal possibilities, but I think it requires a better understanding of what your traffic patterns are likely to be, to understand the implications of STP blocking, and this probably goes beyond the sort of help you'll get on a forum that isn't even focused on networking topologies.

I got tired of all that crap years ago and so as you can see above, I just do a kind of maximal configuration and then don't worry about it.
 
Joined
Feb 3, 2015
Messages
10
Well, I didn't say you wanted to skip STP... but you want to be very aware what the implications will be. Especially if you are doing this with a single vSwitch on ESXi.
In your diagram, you've tried to connect the two Zyxels together, but the thing I'd have to wonder is, how much traffic will actually be traversing that route? Keeping them separate is probably a better idea; in the unusual case where you would actually need traffic to flow from one edge switch to another, let it traverse the Junipers.

Neat setup.

I'm getting a little bit tired myself, so I think I'll try to keep it as simple as possible, keeping the Zyxels separate, as you also recommend. I'll probably get the best performance by confining my internal storage traffic to just one Zyxel anyway, avoiding the hop between them. I can also still lose/reboot any of the switches and keep connectivity, so my goal is met.

I'm also looking at a solution for my HA TrueNAS. Originally I figured LAGG failover was the only option, but a nicer solution could be to create an LACP LAGG for each node, one LAGG on each Zyxel. If I lose a switch, I recover by doing a node failover instead of a LAGG failover, with the added bonus of LACP for my clients.
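Per node that would be something along these lines (plain ifconfig for illustration, placeholder names; on TrueNAS it's done through the GUI, and the Zyxel needs a matching LACP trunk on those two ports):

Code:
# both members of the aggregate go to the same Zyxel
ifconfig lagg0 create
ifconfig lagg0 laggproto lacp laggport ix0 laggport ix1
ifconfig lagg0 inet 10.0.10.21/24 up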

Thanks for the tips =)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I can't tell exactly what you're thinking.

For LACP you need both links of the aggregate to land on the same switch, except perhaps in the case of the Junipers: if they support MLAG, splitting the aggregate between the units may work, since the STP computations are done on a virtual-chassis-wide basis.

So do you mean that you are thinking of a 2x1G LAGG from the primary TrueNAS node to one Zyxel and then another 2x1G LAGG from the secondary to the other? That's fine of course, but there you may have just invented a scenario where you could be better off contemplating hooking the Zyxels together and making sure the STP does the right thing through careful application of preferences. How many 1G ports do the TrueNAS nodes have, anyways?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The TrueNAS nodes can have anywhere from 2 to 4 on-board ports, and they have multiple PCIe slots, so you could add multiple 4-port cards.
 
Joined
Feb 3, 2015
Messages
10
The actual servers I'm going to connect to the Zyxels are these:
  1. 2 x SuperMicro ESXi hypervisors, each with 2 x 10G electrical
  2. 2 x FreeNAS boxes, each with 2 x 10G optical
  3. 1 x TrueNAS Z20, with 2 x 10G optical
All the servers also have 2 x 1G, but those interfaces are connected to an internal network at the facility where I work; on the NAS boxes they are configured as an LACP LAGG.

My current idea is depicted below. The hypervisors are to be connected to both Zyxels, with an active interface on each. The FreeNAS boxes are to be connected to both Zyxels with a failover LAGG. The TrueNAS nodes are to be connected one to each Zyxel with an LACP LAGG. Unfortunately the Zyxels don't support MLAG, whereas the Junipers do, but the Junipers only have 4 10G ports each, and of those one is needed for the uplink to my ISP and at least one for the VC, leaving only one or two free on each.
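For the hypervisors I'm assuming the standard vSwitch can simply have one vmnic to each Zyxel with both left active, roughly like this if I'm reading the esxcli docs right (vSwitch and vmnic names are placeholders):

Code:
# attach one 10G uplink per Zyxel to the vSwitch and keep both active
esxcli network vswitch standard uplink add --uplink-name=vmnic2 --vswitch-name=vSwitch1
esxcli network vswitch standard uplink add --uplink-name=vmnic3 --vswitch-name=vSwitch1
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch1 --active-uplinks=vmnic2,vmnic3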

How do you envision STP preferences improving this? The Zyxels could be connected by more than one link if needed.

[attached: topology diagram]
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Ok. I finally figured out what was nagging at me and then figured out what you should really do here. Give me a little bit to hammer this out. It's obvious ...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
[attached: jun.png]


Okay. So here's the thing. Your setup is inherently asymmetric, and that's unfortunate. What you want is for the 10GE link from Juniper 1 to Zyxel 1 to be the blocked port. Now let's consider the failure cases.

1) Juniper 1 fails. You're in big pain. But the Zyxels have a fast path to Juniper 2 and the 1GE uplink. Yay.

2) Juniper 2 fails. You're in modest pain. Zyxels have a fast comm path to each other, and a 10GE to Juniper 1, which only has 10GE ingress/egress anyways. Yay.

3) Zyxel 2 fails. Traffic falls back onto the 10GE normally blocked link to Juniper 1. You have 11Gbps of ingress/egress but only 10Gbps to the Zyxels. Not totally ideal but not bad.

4) Zyxel 1 fails. You've still got the fast path to Juniper 1 and 2. Yay.

Now the catch here is that what you really want is not to be switching packets from switch to switch to switch to switch all the time if you can avoid it, so you put stuff you EXPECT to stay local on Zyxel 1, and stuff you EXPECT to go out ingress/egress on Zyxel 2 ... if you can.

An astute observer would notice that there is not much opportunity for that 2 x 10GE LACP between the Junis to be maximized. It is, however, creating a higher level of availability for the virtual chassis. Likewise, it could be argued that the 2 x 10GE LACP from Zyxel 2 to Juniper 2 is somewhat underutilized. I would imagine that there is a possibility at some point where the 1G uplink there could become a link aggregation, though. The goal is to avoid creating traffic hot spots through unanticipated STP blocking transitions.
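Back-of-the-envelope, with 802.1t-style default path costs (2,000 for 10GE, 1,000 for a 20G aggregate, though whether a LACP bundle is costed on aggregate or per-link bandwidth varies by implementation) and the Juniper side holding the root, Zyxel 1 sees roughly:

Code:
root path via its direct 10GE to Juniper 1:             2,000
root path via Zyxel 2 and the 20G LACP to Juniper 2:    2,000 + 1,000 (or 2,000) = 3,000 (or 4,000)

Either way the direct link wins, so left to its own devices RSTP blocks the Zyxel-to-Zyxel link rather than the one you want; you have to push the path cost (or port priority) on Zyxel 1's port toward Juniper 1 above that before the blocking lands where the analysis above assumes.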
 
Joined
Feb 3, 2015
Messages
10
Very nice analysis and write-up, thanks =)

I failed to mention that the 1G uplink is not electrical but optical, so either Juniper1<->Juniper2 or Juniper2<->Zyxel2 will have to be a single link, as the Junipers only have 4 SFP+ ports. My uplinks will be configured so STP blocks the 1G path.

As to my traffic, I have the following workloads. My "clients" are either desktop connections coming in through the Internet connection or calculations running on 20 compute nodes, the latter connected to both Junipers through bonded 1G interfaces.
  1. Users connect through SSH and get an X11 desktop. This hits the hypervisors.
  2. Users submit jobs to the cluster. This hits the NAS boxes for data.
  3. Snapshots, which move between the NAS boxes
The only truly local traffic is the snapshot transfers, so I guess I could do the following, which will keep NAS<->NAS traffic on Zyxel1. This would mean that the rest of my traffic to the storage servers has to take the longest path, though.


[attached: topology diagram]
 

Josh2079

Cadet
Joined
Sep 6, 2014
Messages
5
@jgreco Just FYI, the Junipers in VC mode are the equivalent of two Cisco switches in a stack, so STP won't actually work the way you'd anticipate. You can utilize things like redundant trunk grouping, cross-member link aggregation, etc. Unfortunately, you cannot do a 2 x 10Gb LACP link between the members; it doesn't work that way.

I could see you doing some sort of EtherChannel from each Zyxel switch to both Junipers, since they allow cross-member link aggregation; that would provide you redundancy and some fault tolerance. I don't, however, know that that is really the best solution.
 