An MLAG Problem in a Split Brain Situation

Multi-Chassis Link Aggregation Groups, often abbreviated MLAG, but also known under the acronyms CLAG, MC-LAG, MCT, SMLT, VLT, or vPC, provide a relatively simple method for active/active Layer 2 (Ethernet) redundancy. The usual implementation is based on two switches that work with independent control planes and use some (proprietary) protocol to negotiate MLAG operations.

Terminology

Basically every vendor of networking equipment, i.e., switches in the context of MLAG, uses different terminology to describe more or less the same thing. I will use the terms MLAG, peer link, and split brain as follows:

MLAG
provides the ability to connect one device, e.g., a server, to two independent switches using a Link Aggregation Group (LAG). MLAG functionality is restricted to LAG operations, as opposed to creating a single virtual switch out of two or more independent switches.
peer link
is the connection used between two independent switches to realize MLAG operations. This link is used for synchronization of data, e.g., MAC addresses, and state, e.g., port status. Keepalives can be sent on this link as well, although some implementations require a dedicated keepalive connection. The peer link is often implemented as two direct connections between the MLAG peers combined into a LAG. Some implementations allow the use of a virtual peer link instead.
split brain
refers to failure of the peer link with both MLAG peer switches still functioning.

ARP Resolution Failure

In a split brain situation, ARP resolution by the MLAG peer switches may no longer work reliably.

Basic Example Setup

The potential problem arises in a split brain situation, i.e., when the peer link between the two MLAG peer switches of an MLAG pair fails. It already affects a simple MLAG setup comprising just two switches, Sw1 and Sw2, and two servers, Srv1 and Srv2. Two servers provide some motivation for first hop gateway functionality on the MLAG peer switches, because they could be in different VLANs with different subnets.

[Figure: basic MLAG setup]

Without Failures

As long as there are no link failures and no single-attached devices (also known as orphan ports), the connection between the two switches of an MLAG pair is nearly unused. Most data frames received by one of the MLAG peer switches are forwarded to another local (MLAG) port.

Pure Layer 2

The above holds for nearly all frames in a pure Layer 2 setup without link failures or single-attached devices. As a result, the MLAG construct then functions even in a split brain situation, i.e., when the direct links between the MLAG peer switches all fail. Lack of MAC address synchronization between the MLAG peer switches results in additional unknown unicast flooding, but end-system connectivity is maintained (unless the network is overloaded by the extra flooding).
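The flooding effect can be illustrated with a small simulation. This is only a sketch under assumptions: the `Switch` class, the port names, the MAC strings, and the third server Srv3 (added to make the flooding visible) are all made up for this example, and real switches are far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Toy Layer 2 switch whose MAC table is NOT synchronized with a peer."""
    name: str
    ports: dict  # port name -> attached device name
    mac_table: dict = field(default_factory=dict)  # learned MAC -> port name

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the source MAC, then forward; returns (receivers, flooded)."""
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is not None:
            return [self.ports[out_port]], False
        # Unknown unicast: flood out of every port except the ingress port.
        return [dev for port, dev in self.ports.items() if port != in_port], True


# Split brain: both switches run on their own, nothing is synchronized.
# Srv3 is a hypothetical third dual-attached server.
ports = {"p1": "Srv1", "p2": "Srv2", "p3": "Srv3"}
sw1 = Switch("Sw1", dict(ports))
sw2 = Switch("Sw2", dict(ports))

# Srv1 hashes its frames for Srv2 onto the link to Sw2 ...
print(sw2.receive("p1", "mac-srv1", "mac-srv2"))
# ... while Srv2 hashes its answers onto the link to Sw1.
print(sw1.receive("p2", "mac-srv2", "mac-srv1"))
# Sw2 never sees Srv2 as a source, so it keeps flooding.
print(sw2.receive("p1", "mac-srv1", "mac-srv2"))
```

In this sketch every frame still reaches its destination, but it is also delivered to Srv3, because neither switch ever learns the destination MAC address from its own traffic.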

First Hop Gateway on MLAG Peer Switches

But as soon as the MLAG peer switches start to perform some Layer 3 (IP) functionality, the peer link is required. For example, ARP resolution by one of the MLAG peer switches may require the response frame to traverse the peer link, depending on the load sharing decision performed locally on the responding device.

[Figure: ARP illustration]

If the peer link between the two MLAG peer switches fails, ARP resolution may fail, inhibiting some IP communication.

[Figure: ARP failure illustration]

Device-local load sharing decisions determine whether ARP resolution works or fails for some other device. Some IP communication will work, some will fail. Mitigations for this problem exist and should be considered for any MLAG deployment.
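A minimal sketch of this dependency on the load sharing decision follows. The hash function `lag_member` is hypothetical (real LAG hashing is implementation specific and often uses more header fields), and the MAC addresses are made up. Link index 0 is assumed to lead to Sw1, link index 1 to Sw2.

```python
import zlib


def lag_member(src_mac: str, dst_mac: str, n_links: int = 2) -> int:
    """Hypothetical LAG hash: map a MAC pair to a member link index."""
    return zlib.crc32(f"{src_mac}>{dst_mac}".encode()) % n_links


def arp_reply_arrives(requesting_switch: int, server_mac: str,
                      gateway_mac: str, peer_link_up: bool) -> bool:
    """True if the server's ARP reply reaches the switch that asked.

    Link index 0 leads to Sw1, link index 1 leads to Sw2.
    """
    link = lag_member(server_mac, gateway_mac)
    if link == requesting_switch:
        return True  # reply arrives directly at the requesting switch
    # Reply arrived at the other peer and has to cross the peer link.
    return peer_link_up


# Made-up server MAC addresses; Sw1 (index 0) sends the ARP requests.
servers = [f"02:00:00:00:00:{i:02x}" for i in range(1, 9)]
for mac in servers:
    ok = arp_reply_arrives(0, mac, "02:00:00:00:01:01", peer_link_up=False)
    print(mac, "resolved by Sw1" if ok else "NOT resolved by Sw1")
```

With a working peer link every reply arrives. In a split brain situation only those replies arrive that the server happens to hash onto the link towards the requesting switch.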

A Split Brain Mitigation Mechanism

One way to mitigate this problem (and similar ones) is to disable the MLAG ports of one of the MLAG peers in a split brain situation. This requires defining one of the peers to fulfill a primary role, and the other to fulfill a secondary role. Then the secondary MLAG peer is configured to disable its MLAG ports if a split brain situation is detected.

[Figure: ISC failure mitigation illustration]

Usually an additional, potentially logical, link between the MLAG peers is established to allow the secondary MLAG peer to distinguish between failure of the MLAG peer link and failure of the primary MLAG peer (this is an additional keepalive connection). An MLAG setup without this additional connection cannot provide mitigation of both the failure of any MLAG peer switch and failure of just the MLAG peer connection. The same basic idea is supported in some switch stacking and many chassis bonding solutions as well.
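The decision the secondary MLAG peer has to make can be sketched as follows. The names `Action` and `secondary_action` are made up for this illustration and do not correspond to any vendor's implementation.

```python
from enum import Enum


class Action(Enum):
    NORMAL_OPERATION = "peer link is up, keep MLAG ports enabled"
    DISABLE_MLAG_PORTS = "split brain, primary is alive: disable MLAG ports"
    CONTINUE_ALONE = "primary seems to be down: keep MLAG ports enabled"


def secondary_action(peer_link_up: bool, keepalive_up: bool) -> Action:
    """Decide what the secondary MLAG peer should do (illustrative only)."""
    if peer_link_up:
        return Action.NORMAL_OPERATION
    if keepalive_up:
        # Peer link failed, but the primary still answers keepalives.
        return Action.DISABLE_MLAG_PORTS
    # Neither peer link nor keepalive works: assume the primary has failed.
    return Action.CONTINUE_ALONE


for peer_link_up in (True, False):
    for keepalive_up in (True, False):
        print(peer_link_up, keepalive_up, secondary_action(peer_link_up, keepalive_up).name)
```

Without the separate keepalive connection, the last two cases cannot be told apart, which is why such a setup can handle only one of the two failure types.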

Combining MLAG and VXLAN

If the MLAG implementation is combined with an anycast VTEP for redundant server connectivity to a VXLAN fabric, the secondary MLAG peer's anycast VTEP needs to be disabled in a split brain situation, in addition to its (other) MLAG ports. The important part is to no longer advertise the anycast VTEP IP into the underlay.

ARP Synchronization via Keepalive Connection

Another mitigation would be to use the additional keepalive connection for synchronization of ARP cache contents instead of disabling the MLAG ports of the secondary peer.

EVPN Control Plane for VXLAN as Mitigation

If the VXLAN deployment uses a control plane protocol like EVPN, information about MAC address to IP address association can be distributed via this protocol. This not only potentially allows ARP requests to be answered locally, which suppresses ARP flooding, it can also mitigate the ARP resolution problem described above, as long as the control plane protocol provides this feature and still works correctly after a failure of the MLAG peer link, e.g., via Layer 3 uplinks. This mitigation can replace disabling the MLAG ports of the secondary MLAG peer, because it prevents a split brain situation when the MLAG peer link fails.
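The idea of answering ARP requests from control plane information can be sketched like this. The class `MacIpTable` and the function `handle_arp_request` are hypothetical names for this illustration. In EVPN the bindings would typically come from MAC/IP Advertisement routes, but the sketch does not model the protocol itself.

```python
from typing import Dict, Optional, Tuple


class MacIpTable:
    """MAC/IP bindings learned from the control plane, not from ARP replies."""

    def __init__(self) -> None:
        self._bindings: Dict[str, str] = {}

    def learn(self, ip: str, mac: str) -> None:
        """Store a binding received via the control plane."""
        self._bindings[ip] = mac

    def withdraw(self, ip: str) -> None:
        """Remove a binding that has been withdrawn."""
        self._bindings.pop(ip, None)

    def resolve(self, ip: str) -> Optional[str]:
        return self._bindings.get(ip)


def handle_arp_request(table: MacIpTable, target_ip: str) -> Tuple[str, Optional[str]]:
    """Answer locally if the binding is known, otherwise fall back to flooding."""
    mac = table.resolve(target_ip)
    if mac is not None:
        return ("reply-locally", mac)
    return ("flood", None)


table = MacIpTable()
table.learn("192.0.2.10", "02:00:00:00:00:0a")  # documentation address, made-up MAC
print(handle_arp_request(table, "192.0.2.10"))
print(handle_arp_request(table, "192.0.2.20"))
```

Because the binding arrives via the control plane, resolution no longer depends on which MLAG peer happens to receive the ARP reply.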

Virtual Peer Link

In a network comprising more than just two switches, i.e., with switches in addition to the MLAG pair (possibly additional MLAG pairs), a virtual peer link can be used. A virtual peer link implements peer link functionality by encapsulating frames for the MLAG peer for transport over the switch's uplinks. This allows the MLAG peers to be treated equally, without primary or secondary designation. If an MLAG peer loses all uplinks, it needs to disable all MLAG ports. As long as at least one uplink of the other peer is still working, the MLAG construct provides connectivity. This obviously does not work if there are only direct connections between the two MLAG peer switches and thus no uplinks at all.

It may in general be helpful to disable all downlinks of a switch that has lost all its uplinks in order to signal downstream devices (e.g., servers) to switch to another port (unless there are single-attached devices). Some hypervisor vendors expect this network behavior.
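The uplink tracking rule is simple enough to write down directly. The function names below are made up for this sketch, and uplink state is reduced to a list of booleans.

```python
from typing import Iterable


def keep_mlag_ports_up(uplinks_up: Iterable[bool]) -> bool:
    """A peer keeps its MLAG ports enabled while at least one uplink works."""
    return any(uplinks_up)


def pair_has_connectivity(sw1_uplinks: Iterable[bool],
                          sw2_uplinks: Iterable[bool]) -> bool:
    """The MLAG construct works while at least one peer keeps its ports up."""
    return keep_mlag_ports_up(sw1_uplinks) or keep_mlag_ports_up(sw2_uplinks)


# Sw1 has lost both uplinks, Sw2 still has one working uplink.
print(keep_mlag_ports_up([False, False]))
print(pair_has_connectivity([False, False], [True, False]))
# No uplinks at all, i.e., only direct connections between the peers.
print(pair_has_connectivity([], []))
```

An empty uplink list yields the same result as all uplinks being down, which matches the observation that this scheme cannot work for an isolated MLAG pair without uplinks.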

Avoid Single-Attached Devices

If there are any single-attached devices connected to any of the MLAG peer switches, a failure of the connection between the MLAG peers cannot always be mitigated. Thus the network design should avoid single-attached devices, e.g., by adding an additional switch to connect all hosts that cannot use an MLAG connection.

A single-attached device is similar to an MLAG-attached device where all but one link has failed. An MLAG construct cannot protect against arbitrary multiple failures, but a single-attached device is equivalent to a dual-attached device where some failure has already occurred. Thus the first real failure is actually a multiple-failure situation for single-attached devices.

