BGP in the Data Center: Part Three

In Part 2, I introduced the CLOS architecture as a way to overcome many limitations of traditional data center designs and stressed the importance of using L3 routing. This raises the question: Which routing protocol should be used in such an architecture? How about RIP? Just kidding! Below are the pros and cons of using an IGP (OSPF/ISIS) versus BGP, and, if you go with BGP, whether to use iBGP or eBGP in the data center.

I’ll consider OSPF and ISIS together as I don’t want to dive into an argument that is similar to comparing the relative merits of a Lamborghini and a Ferrari.

At first glance, the choice may seem to depend mostly on the size of the data center: an IGP can easily scale to a few thousand prefixes in total. That isn't wrong, but there are more considerations than sizing.

  • From a configuration perspective, an IGP is easier to deploy than BGP, especially considering the number of peers that must be explicitly configured in BGP. Automation is therefore a must in BGP deployments (see the configuration sketch after this list).
  • An IGP is more likely to be supported by TOR switches. In BGP deployments, this limitation could force operators to collapse the TOR and leaf layers into one.
  • As stated above, BGP handles a high number of prefixes better, and it has a limited event propagation scope. In a connectivity matrix this dense (its complexity depends on the number of switches), an IGP will propagate link-state changes throughout the network, even to nodes that are not impacted, and SPF calculations will run on every node. Mechanisms such as Incremental SPF or Partial SPF can mitigate this, but they add complexity.
  • If an IGP is used, BGP will still remain in the data center, most likely at the edge. That means two routing protocols in play, and redistribution between them will be necessary. If BGP is chosen instead, it can be the only routing protocol.
  • With its attributes and filters, BGP provides much better flexibility for controlling traffic, and it offers per-hop traffic engineering capability. AS path visibility also helps operators troubleshoot problems more easily.
  • BGP can be extended across data centers and used as a control plane protocol (EVPN) for VXLAN overlays. This provides many benefits such as active/active multi-homing.
  • By default, an IGP handles ECMP more readily, whereas BGP configuration must be adjusted to meet load-balancing requirements.
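
The automation point above is easiest to see with a template. Below is a minimal Python sketch, with hypothetical device names, ASNs, and peering addresses, that renders eBGP neighbor stanzas in a loosely FRR-like syntax for a tiny leaf-spine fabric; it is an illustration of the configuration volume, not a reference implementation.

```python
# Minimal sketch: render eBGP neighbor stanzas for a small leaf-spine fabric.
# Device names, ASNs, and addresses are hypothetical, and the output loosely
# mimics FRR-style syntax; it is not tied to any particular vendor.

SPINE_AS = 64512                                # spines share one AS in this design
spine_peer_ips = ["10.0.0.1", "10.0.0.2"]       # example spine peering addresses
leaf_asns = {"leaf1": 64513, "leaf2": 64514}    # each leaf/pod gets its own AS

def leaf_bgp_config(name: str, asn: int) -> str:
    """Return a BGP stanza for one leaf, peering with every spine."""
    lines = [f"! {name}", f"router bgp {asn}"]
    for ip in spine_peer_ips:
        lines.append(f"  neighbor {ip} remote-as {SPINE_AS}")
    return "\n".join(lines)

if __name__ == "__main__":
    for name, asn in leaf_asns.items():
        print(leaf_bgp_config(name, asn))
        print()
```

Even this toy fabric needs four neighbor statements; with dozens of leaves and several spines, the count grows into the hundreds, which is why hand-editing BGP configurations does not scale.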


When enabling BGP in data center networks, whether to use iBGP or eBGP is a fair question. Both come with obstacles, but eBGP-based designs are seen more commonly, because iBGP can be tricky to deploy, especially in large-scale data centers.

One of the issues with iBGP is route distribution between nodes: a full mesh does not scale, so spine switches are typically configured as route reflectors. That introduces another problem: route reflectors reflect only their best paths, so the leaves cannot see, and therefore cannot use, all ECMP paths. The BGP add-path feature has to be enabled to push all routes down to the leaves. In eBGP, "AS_Path multipath relax" (also known as multipath multiple-as) needs to be enabled instead. Even so, eBGP does not require maintaining route reflectors in the data center.
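
To make that difference concrete, here is a small conceptual sketch of the advertisement behavior described above: without add-path a route reflector passes on only its single best path per prefix, while with add-path (in an "advertise all" mode) it passes on every known path. The path data and tie-breaking are simplified inventions, not a full BGP decision process.

```python
# Conceptual sketch of route-reflector advertisement behavior for one prefix.
# Paths are (description, as_path_length, igp_metric) tuples; the data is invented.

paths = [
    ("via spine1", 3, 10),
    ("via spine2", 3, 10),   # equally good: an ECMP candidate
    ("via spine3", 4, 10),   # longer AS path: not a best path
]

def best_paths(candidates):
    """All paths tied for best (shortest AS path, then lowest IGP metric)."""
    key = lambda p: (p[1], p[2])
    best_key = min(key(p) for p in candidates)
    return [p for p in candidates if key(p) == best_key]

def reflector_advertises(candidates, add_path=False):
    """Without add-path a reflector passes on a single best path per prefix;
    with add-path (in an 'all paths' mode) it can pass on every known path."""
    return list(candidates) if add_path else best_paths(candidates)[:1]

print(reflector_advertises(paths))                  # one path -> leaves lose ECMP
print(reflector_advertises(paths, add_path=True))   # all paths -> leaves keep ECMP
```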

As stated before, BGP has extensive per-hop traffic engineering capabilities. With iBGP, only part of this capability is available, whereas eBGP provides better per-hop visibility, for example by directly comparing the BGP Loc-RIB with the Adj-RIB-In and Adj-RIB-Out on each router. In terms of traffic engineering and troubleshooting, eBGP is therefore more advantageous than iBGP.
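
As a rough illustration of that kind of comparison (the prefixes are invented, and a real deployment would pull these tables from BMP, the device CLI, or an API), a simple set difference between the RIB views already shows where a route is being dropped:

```python
# Rough illustration of comparing BGP RIB views on one router to spot where a
# route is being dropped. Prefixes and table contents are invented examples.

adj_rib_in  = {"10.1.0.0/24", "10.2.0.0/24", "10.3.0.0/24"}  # received from peers
loc_rib     = {"10.1.0.0/24", "10.2.0.0/24"}                 # accepted after inbound policy
adj_rib_out = {"10.1.0.0/24"}                                # advertised after outbound policy

rejected_inbound  = adj_rib_in - loc_rib      # filtered or not selected on the way in
filtered_outbound = loc_rib - adj_rib_out     # accepted locally but not advertised

print("Dropped by inbound policy/selection:", sorted(rejected_inbound))
print("Held back by outbound policy:", sorted(filtered_outbound))
```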

CLOS with eBGP Design

As seen in the figure below, spine switches share one common AS. Each leaf instance (pod) and each TOR switch has its own AS. Why are spines located in one AS while each TOR switch has its own? This is to avoid path hunting issues in BGP. Here are some concerns about this architecture.

Figure 1. CLOS Topology using eBGP for Spine, Leaf and TOR Switches


ECMP is one of the most critical requirements, because without it not all links can be used; this is one of the reasons for avoiding STP (see part one). As noted earlier, an IGP handles this requirement easily, whereas BGP needs some adjustments.

The topology above is a multi-pod (multi-instance) design, which reduces the number of links between the leaf and spine layers and allows operators to put more than one leaf node in each AS. This type of setup is common, especially in large data center networks. Looking at the possible paths from server A to server H, there are four different paths, each with the same AS_Path attribute (64515 64513 64512 64514 64518). If the other attributes are also identical, BGP can utilize all of these ECMP paths once multipath functionality is enabled.

It seems straightforward in this topology, but what if each leaf switch has its own AS, or servers are connected to more than one TOR switch for redundancy? Then the AS_Path attributes of the candidate paths will have the same length but different contents, and that alone does not suffice for multipath: for BGP to route traffic over multiple paths, all attributes have to match, including the content of the AS_Path attribute. Fortunately, there is a way to relax this requirement: enabling the "AS Path multipath relax" feature lets BGP ignore the content of the AS_Path and install multiple best routes as long as the lengths are identical.
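
The sketch below models that selection rule using the AS path from the topology above plus a hypothetical alternative of equal length but different content (the ASNs 64516 and 64517 are made up for illustration): standard multipath requires identical AS_Path contents, while multipath relax only requires identical lengths.

```python
# Sketch of BGP multipath grouping with and without "as-path multipath relax".
# The first AS path matches the topology discussed above; the second is a
# hypothetical alternative of equal length but different content.

candidate_paths = [
    [64515, 64513, 64512, 64514, 64518],   # via one spine/leaf combination
    [64515, 64516, 64512, 64517, 64518],   # hypothetical: per-leaf ASNs differ
]

def ecmp_group(paths, multipath_relax=False):
    """Return the paths BGP could load-share over, assuming all other attributes tie.
    Without relax, AS_Path contents must be identical; with relax, only the length."""
    reference = paths[0]
    if multipath_relax:
        return [p for p in paths if len(p) == len(reference)]
    return [p for p in paths if p == reference]

print(len(ecmp_group(candidate_paths)))                        # 1 -> only one path used
print(len(ecmp_group(candidate_paths, multipath_relax=True)))  # 2 -> both paths used
```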


Given BGP's nature and original area of use, convergence time was not initially one of its primary concerns; stability was prioritized over fast convergence. However, several fast-convergence enhancements have since been introduced (a sketch that captures these recommendations follows the list):

  • BGP neighbor fall-over (iBGP) and/or BGP fast external fall-over (eBGP) should be enabled.
  • BFD can also be used, but until recently there was a limitation where BFD took no action when a single member link of a LAG failed; support for this still needs to be verified with vendors.
  • The route advertisement interval for eBGP peering should be set to zero, which is already the default value for iBGP peering.
  • The Keepalive timer should be set to five seconds at most and hold time should be set to 15 seconds.
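
The settings above can be captured in a short checklist. The sketch below is illustrative only: the field names and checks simply encode the recommendations in the list (fall-over enabled, advertisement interval of zero for eBGP, keepalive of at most five seconds, hold time of 15 seconds) and do not correspond to any vendor feature.

```python
# Minimal sketch: capture the fast-convergence recommendations above and
# sanity-check a proposed peer configuration. Field names and the checks
# simply encode the list above and are illustrative, not a vendor feature.

from dataclasses import dataclass

@dataclass
class PeerTimers:
    ebgp: bool                   # eBGP (True) or iBGP (False) session
    fall_over: bool              # neighbor fall-over / fast external fall-over enabled
    advertisement_interval: int  # seconds; 0 recommended for eBGP (default for iBGP)
    keepalive: int               # seconds; at most 5 recommended
    holdtime: int                # seconds; 15 recommended

def check(t: PeerTimers) -> list:
    issues = []
    if not t.fall_over:
        issues.append("enable neighbor fall-over / fast external fall-over")
    if t.ebgp and t.advertisement_interval != 0:
        issues.append("set the eBGP advertisement interval to 0")
    if t.keepalive > 5:
        issues.append("keepalive should be at most 5 seconds")
    if t.holdtime != 15:
        issues.append("hold time should be 15 seconds")
    return issues

# Typical vendor defaults for an eBGP session would fail most of these checks:
print(check(PeerTimers(ebgp=True, fall_over=False,
                       advertisement_interval=30, keepalive=60, holdtime=180)))
```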

Number of ASes to Be Used

Since each TOR switch is located in its own AS, how are operators able to scale the number of ASes, especially considering the number of TOR switches in a large data center?

  • There are 1,023 private 2-byte ASNs (64512-65534). If this is not sufficient, one option is to use 4-byte ASNs, whose private range provides millions of additional ASNs (a quick sizing sketch follows this list).
  • TOR ASNs can be reused. In this case, the BGP allowas-in feature needs to be enabled on the TOR switches. This relaxes one of BGP's main loop-avoidance mechanisms, so the TOR switches will accept routes even though they see their own ASN in received updates.
  • If pods/instances are used and leaf switches share an AS, summarization can create black holes when a specific prefix is withdrawn on the TOR side. To avoid this, the more specific prefixes should not be hidden behind a summary.
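
To gauge the private ASN budget, the sketch below (with invented fabric sizes) counts how many ASNs the spine/pod/TOR numbering scheme consumes and compares it with the 2-byte private range, including the effect of reusing TOR ASNs via allowas-in.

```python
# Sketch: estimate private-ASN consumption for the scheme above (one AS for the
# spines, one per pod, one per TOR) and compare it with the 2-byte private range.
# Fabric sizes are invented examples.

PRIVATE_2BYTE = range(64512, 65535)        # 64512..65534 -> 1023 ASNs (RFC 6996)

def asns_needed(pods: int, tors_per_pod: int, reuse_tor_asns: bool = False) -> int:
    """One shared spine AS + one AS per pod + one AS per TOR.
    With allowas-in, TOR ASNs can be reused across pods, so only one pod's worth counts."""
    tor_asns = tors_per_pod if reuse_tor_asns else pods * tors_per_pod
    return 1 + pods + tor_asns

for pods, tors in [(8, 32), (32, 64)]:
    need = asns_needed(pods, tors)
    fits = "fits" if need <= len(PRIVATE_2BYTE) else "exceeds"
    print(f"{pods} pods x {tors} TORs -> {need} ASNs ({fits} the 2-byte private range; "
          f"with reuse: {asns_needed(pods, tors, reuse_tor_asns=True)})")
```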

Evolving with BGP

BGP is a protocol that meets several requirements and architectural needs in various segments of the network. In this series, I covered the challenges in traditional data center architectures, walked through the CLOS topology as an answer, explained why L3 routing is preferable to L2 switching, and assessed IGP vs. BGP. I personally think the CLOS with eBGP-based architecture is going to be seen more commonly in data centers. In fact, whether the choice is BGP or an IGP, moving the whole data center network from L2 to L3 is highly beneficial in itself.

Packet Design’s Explorer Suite provides extensive network visibility by correlating L3 control plane protocols (OSPF, ISIS, BGP, MP-BGP, 6PE, etc.) with services, traffic, and performance data.
