In Part 2, I introduced the CLOS architecture as a way to overcome many limitations of traditional data center architectures and stressed the importance of using L3 routing in the design. This raises the question: which routing protocol should be used in such an architecture? How about RIP? Just kidding! Here are the pros and cons of using an IGP (OSPF/ISIS) versus BGP and, if you go with BGP, whether to use iBGP or eBGP in the data center.
I’ll consider OSPF and ISIS together as I don’t want to dive into an argument that is similar to comparing the relative merits of a Lamborghini and a Ferrari.
At first glance, the choice may seem to depend heavily on the size of the data center. An IGP could easily scale if there are only a few thousand prefixes in total. This isn’t wrong, but there are more considerations than size.
When looking to enable BGP in data center networks, whether to use iBGP or eBGP is a good question. There are similar obstacles in both. However, eBGP-based design is seen more commonly, because iBGP can be tricky to deploy, especially in large-scale data centers.
One of the issues with iBGP is route reflection between nodes. Spine switches can be configured as route reflectors, but this causes another issue: a route reflector reflects only its best path for each prefix, so the leaves cannot use ECMP. The BGP add-path feature has to be enabled to push all routes to the leaves. eBGP needs an adjustment of its own, since “AS_Path Multipath Relax (Multi-AS Pathing)” has to be enabled. Even so, eBGP does not require maintaining route reflectors in the data center.
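To make the route-reflector limitation concrete, here is a minimal Python sketch of a deliberately simplified model. The `Path` tuple, `reflector_advertises`, `ecmp_next_hops`, and the `send_limit` parameter are hypothetical names chosen for illustration, not any vendor's API, and real best-path selection has many more steps than the single metric comparison assumed here.

```python
from typing import List, Tuple

# (next_hop, igp_metric): a tiny stand-in for a BGP path, assumed for illustration.
Path = Tuple[str, int]


def reflector_advertises(paths: List[Path], add_path: bool = False,
                         send_limit: int = 4) -> List[Path]:
    """Return what a route reflector would send to a client for one prefix.

    Without add-path only the single best path is reflected. With add-path
    (modeled here as "send up to send_limit paths") the client sees more.
    """
    # Simplified ranking: lowest metric first, next hop as a tie-breaker.
    ranked = sorted(paths, key=lambda p: (p[1], p[0]))
    return ranked[:send_limit] if add_path else ranked[:1]


def ecmp_next_hops(received: List[Path]) -> List[str]:
    """Next hops a leaf could load-balance over: all paths tied on best metric."""
    if not received:
        return []
    best_metric = min(metric for _, metric in received)
    return [nh for nh, metric in received if metric == best_metric]


# Three paths to one prefix as seen by the reflector; two are equal cost.
paths = [("10.1.1.1", 10), ("10.1.1.2", 10), ("10.1.1.3", 20)]

print(ecmp_next_hops(reflector_advertises(paths)))                 # one next hop
print(ecmp_next_hops(reflector_advertises(paths, add_path=True)))  # two next hops
```

In this model the leaf has only one next hop to choose from until the reflector is allowed to send more than its best path, which is the gap that add-path is meant to close.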
As stated before, BGP has extensive capabilities for per-hop traffic engineering. With iBGP, it is possible to use part of this capability, but eBGP’s attributes provide better visibility, such as directly comparing the BGP Loc-RIB to Adj-RIB-In and Adj-RIB-Out. In terms of traffic engineering and troubleshooting, eBGP is more advantageous than iBGP.
As seen in the figure below, spine switches share one common AS. Each leaf instance (pod) and each TOR switch has its own AS. Why are spines located in one AS while each TOR switch has its own? This is to avoid path hunting issues in BGP. Here are some concerns about this architecture.
ECMP is one of the most critical points, because without it not all links could be used. This is indeed one of the reasons we avoid STP (see Part 1). As stated in Part 2, an IGP can easily deal with this requirement, whereas BGP needs some adjustments.
The topology above is a multiple pod/instance design, which reduces the number of links between the leaf and spine layers. It also allows operators to put more than one leaf node in each AS. This type of setup is commonly seen, especially in large data center networks. Looking at the possible paths from server A to server H, there are four different paths, and each of them has the same AS_Path attribute (64515 64513 64512 64514 64518). If the other attributes are also the same, BGP can utilize all ECMP paths once multi-path functionality is enabled.
It seems quite straightforward in this topology, but what if each leaf switch has its own AS, or servers are connected to more than one TOR switch for the sake of redundancy? Then the AS Path attributes of the candidate routes will have the same length but different contents, and equal length alone does not suffice for BGP multipath: by default, all attributes have to match, including the content of the AS Path attribute. Fortunately, there is a way around this requirement. Enabling the “AS Path multi-path relax” feature lets BGP ignore the content of the AS Path attribute and install multiple best routes as long as the length is identical.
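The difference between the default behavior and the relaxed one can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the `Path` class, the `multipath_set` function, and the `relax` flag are hypothetical names, the AS numbers and next hops are made up, and only local preference, AS Path, and MED are compared, whereas a real implementation checks more attributes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Path:
    next_hop: str
    as_path: Tuple[int, ...]
    local_pref: int = 100
    med: int = 0


def multipath_set(paths: List[Path], relax: bool = False) -> List[Path]:
    """Return the paths that could be installed together for one prefix."""
    if not paths:
        return []
    # Simplified best path: highest local_pref, shortest AS Path, lowest MED.
    best = min(paths, key=lambda p: (-p.local_pref, len(p.as_path), p.med))

    def eligible(p: Path) -> bool:
        if (p.local_pref, p.med) != (best.local_pref, best.med):
            return False
        if len(p.as_path) != len(best.as_path):
            return False
        # Default: AS Path content must match. Relaxed: equal length is enough.
        return relax or p.as_path == best.as_path

    return [p for p in paths if eligible(p)]


# A prefix behind a server dual-homed to two TORs in different (assumed) ASes,
# plus one longer path that should never be part of the multipath set.
paths = [
    Path("10.0.0.1", (64515,)),
    Path("10.0.0.2", (64516,)),
    Path("10.0.0.3", (64513, 64517)),
]

print(len(multipath_set(paths)))              # strict comparison
print(len(multipath_set(paths, relax=True)))  # relaxed comparison
```

With the strict comparison only the first path qualifies. With the relaxed comparison both one-hop paths are installed, and the longer path is excluded either way.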
Due to its nature and area of use, convergence time was not initially a primary concern in BGP. Stability was prioritized over fast convergence. However, some fast convergence enhancements have since been introduced to BGP:
Since each TOR switch is located in its own AS, how are operators able to scale the number of ASes, especially considering the number of TOR switches in a large data center?
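One input to that question is how many private AS numbers exist in the first place. The short Python sketch below does the arithmetic using the private-use ranges reserved by RFC 6996, which is background I am bringing in rather than something stated above. The `fits_private_pool` helper is a hypothetical name used only for illustration.

```python
# Private-use AS number ranges reserved by RFC 6996 (bounds inclusive).
PRIVATE_2BYTE = range(64512, 65534 + 1)
PRIVATE_4BYTE = range(4200000000, 4294967294 + 1)


def fits_private_pool(device_count: int, four_byte: bool = False) -> bool:
    """Check whether giving every device its own private ASN is possible."""
    pool = PRIVATE_4BYTE if four_byte else PRIVATE_2BYTE
    return device_count <= len(pool)


print(len(PRIVATE_2BYTE))  # 1023
print(len(PRIVATE_4BYTE))  # 94967295
print(fits_private_pool(2000))                  # False
print(fits_private_pool(2000, four_byte=True))  # True
```

The AS numbers used in this series (64512 and up) come from the 2-byte private range, which offers 1023 values. A design that assigns one AS per TOR switch therefore runs out at roughly a thousand devices unless it moves to 4-byte AS numbers or reuses AS numbers in some way.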
BGP is a protocol that meets several requirements and architectural needs in various segments of the network. In this series, I covered the challenges in traditional data center architectures, walked through the CLOS topology as an answer, explained why L3 routing is preferable to L2 switching, and assessed IGP vs. BGP. I personally think the CLOS with eBGP-based architecture is going to be seen more commonly in data centers. In fact, whether you choose BGP or an IGP, moving the whole data center network from L2 to L3 is very beneficial by itself in many respects.