BGP in the Data Center: Part Two

In part one of this series, I explained the challenges with traditional data center architectures. Here’s how to overcome these challenges, starting with an oldie but goodie that has been resurrected.

CLOS Architecture to the Rescue

The whole story begins with the “new” architecture called CLOS. New in this context means new in data centers: telephony network engineer Charles Clos developed this architecture in the 1950s to meet similar scalability requirements, and it is being used today to improve performance and resiliency. Let’s take a closer look at the CLOS architecture in the data center.

CLOS Architecture in the Data Center

Figure 1. CLOS Data Center Topology

A CLOS topology (Figure 1) consists of spine and leaf layers. Servers connect to leaf switches (Top of Rack – TOR), and each leaf connects to every spine. There are no direct leaf-to-leaf or spine-to-spine links. Here are the architectural advantages of this topology:

  • Whether or not they are in the same VLAN, each server is three hops away from any other. That’s why this is called a 3-stage CLOS topology. It can be expanded to a 5-stage CLOS by dividing the topology into clusters and adding another top-spine layer (also known as a super-spine layer). No matter how many stages there are, the total hop count is the same between any two servers, so consistent latency can be maintained throughout the data center.
  • Multi-Chassis Link Aggregation Group (MLAG or MCLAG) is still available on the server side. Servers can be connected to two different leaf or TOR switches for redundancy and load balancing. Moreover, because the connectivity matrix is so rich in this topology, failures are handled gracefully: even if two spine switches go down at the same time, connectivity between servers remains.
  • The CLOS topology scales horizontally, which is very cost effective. The bandwidth capacity between servers can be increased by adding more spine-leaf links or by adding more spine switches. Since each newly added spine switch connects to every leaf, server-to-server bandwidth/throughput increases significantly.
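The scaling arithmetic behind these advantages can be sketched as a quick calculation. This is a minimal illustration, not a sizing tool: it assumes identical switches, one uplink from each leaf to each spine, and leaf ports split evenly between servers and uplinks (no oversubscription); the 32-port and 10G figures are invented for the example.

```python
# Capacity of a 3-stage CLOS (spine-leaf) fabric built from identical
# n-port switches. Each leaf uses half its ports for servers and half
# for spine uplinks; each spine port connects to a distinct leaf.

def clos_capacity(ports_per_switch: int, link_gbps: int = 10) -> dict:
    down = ports_per_switch // 2          # leaf ports facing servers
    up = ports_per_switch - down          # leaf ports facing spines
    spines = up                           # one uplink per spine, so at most 'up' spines
    leaves = ports_per_switch             # each spine connects to every leaf
    servers = leaves * down
    # Each added spine adds one uplink per leaf, so leaf uplink bandwidth
    # grows linearly with the number of spines -- horizontal scaling.
    uplink_gbps_per_leaf = spines * link_gbps
    return {"spines": spines, "leaves": leaves,
            "servers": servers, "uplink_gbps_per_leaf": uplink_gbps_per_leaf}

print(clos_capacity(32))
# With hypothetical 32-port switches: 16 spines, 32 leaves,
# 512 servers, and 160 Gbps of uplink bandwidth per leaf.
```

The point of the sketch is the linear relationship: more spines means proportionally more leaf uplink bandwidth, without replacing any existing hardware.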

How can this be cost effective compared to the traditional design? The key point is that spine switches do not have to be as big and expensive as the core switches in a traditional design. If there are so many TORs/leaves that the spine runs into port limitations, one option is to select spine switches with very high port counts (up to 2,300 10G ports). The better approach in that case, though, is usually to build a 5-stage, 7-stage, or multi-pod architecture.
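The port limitation above can be made concrete with a back-of-the-envelope calculation: in a 3-stage fabric, the number of leaves is capped by the spine’s port count, which caps the server count too. The port and server-per-leaf figures below are hypothetical examples.

```python
# Server-count ceiling of a 3-stage CLOS fabric: each spine port
# connects to one leaf, so spine port count caps the number of leaves.

def max_servers_3stage(spine_ports: int, servers_per_leaf: int) -> int:
    max_leaves = spine_ports              # one leaf per spine port
    return max_leaves * servers_per_leaf

print(max_servers_3stage(64, 48))    # modest fixed-form spine: 3072 servers
print(max_servers_3stage(2300, 48))  # large chassis spine: 110400 servers
```

Once the required server count exceeds what even a large chassis spine can support, moving to a 5-stage or multi-pod design scales out the fabric instead of scaling up the spine.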

L2 or L3 in CLOS?

It would be a huge mistake to design a data center based on the CLOS architecture and use only L2 switching instead of L3 routing. Compared to L2 switching, L3 routing has many benefits, not only for scalability and resiliency but also for visibility, which is quite important to planning and operations teams.

But how should L3 routing be integrated into the design? Partially or fully? Which protocol should be used? Is it possible to get rid of L2 switching? First, here’s a design that separates TOR switches from the leaf layer and makes the spine to leaf connections L3 (Figure 2).

CLOS Topology using Layer 3 Routing for Spine to Leaf Connections

Figure 2. CLOS Topology using Layer 3 Routing for Spine to Leaf Connections

In this design, TOR switches are connected to leaf switches with L2 links, and the spine-leaf connections are L3. MLAG enables operators to utilize all the links and aggregate bandwidth toward the leaf switches.

Is it possible to eliminate the use of STP and MLAG? In general, the size of L2 domains should be limited: L2 environments are difficult to troubleshoot, and large L2 domains risk broadcast storms. MLAG is also vendor dependent and introduces limitations of its own, such as supporting only two leaves per MLAG domain.

What about expanding L3 routing all the way down to the TORs? Here’s how the topology looks in that case (Figure 3).

CLOS Topology using only Layer 3 Routing

Figure 3. CLOS Topology using only Layer 3 Routing
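As a rough illustration of what “L3 all the way down to the TOR” can look like in practice, here is a hypothetical FRRouting-style eBGP snippet for one TOR switch. All AS numbers, router IDs, neighbor addresses, and prefixes are invented for the example; this is a sketch of the pattern, not a recommended configuration.

```
! Hypothetical TOR switch "tor1": every fabric-facing link is routed,
! so no STP or MLAG is needed in the fabric.
router bgp 65101
 bgp router-id 10.0.0.11
 ! One eBGP session per uplink, one per device in the layer above.
 neighbor 10.1.1.0 remote-as 65001
 neighbor 10.1.2.0 remote-as 65002
 address-family ipv4 unicast
  ! Advertise only the locally attached server subnet.
  network 10.10.1.0/24
  ! ECMP across all uplinks provides redundancy and load balancing.
  maximum-paths 64
 exit-address-family
```

Whether BGP or an IGP is the right protocol for these routed links is exactly the question taken up next.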

In the third and final part of this blog series, I will compare and contrast BGP and IGP to help determine which protocol to use in the CLOS architecture.

The BGP in the Data Center blog series, written by Onsel Kuluk, Systems Engineer at Packet Design, is now available as an e-book. Download your free copy here: