Here is the continuation of this two-part series based on an article our CTO Cengiz Alaettinoglu published in the APNIC blog. In part one, we looked at how traffic engineering has evolved from an offline model to today’s widely used on-device model based on RSVP-TE. While RSVP-TE has advantages over the offline model, it also presents some challenges. Here, we detail these challenges and show why SDN is a better solution.
A major challenge with RSVP-TE is that obtaining the complete traffic matrix requires a tunnel between every pair of routers in the network. This leads to what is known as the “n-squared” or “full-mesh” problem: the number of tunnels grows with the square of the number of routers.
For example, a small service provider with 75 routers and 300 links can have 1,600 tunnels (when creating a full mesh). A medium-sized service provider with a little less than 500 routers can have up to 20,000 tunnels, and a large service provider with 1,900 routers and 8,000 links can end up with more than one hundred thousand tunnels (132,000 tunnels to be precise)! Imagine the plight of a network engineer who has to configure and manage this many tunnels.
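The growth behind these numbers can be sketched with a little arithmetic. RSVP-TE tunnels are unidirectional, so a full mesh among n tunnel endpoints needs n(n-1) tunnels. Note that the counts quoted above are smaller than a full mesh across every router would give (75 routers would imply 5,550 tunnels), so the sketch also estimates how many endpoints each quoted count would correspond to. Reading those endpoints as a subset of the routers, such as edge routers, is an assumption and not something the article states. The function names are illustrative.

```python
import math

def full_mesh_tunnels(endpoints: int) -> int:
    """Unidirectional tunnels needed to connect every ordered pair of endpoints."""
    return endpoints * (endpoints - 1)

def endpoints_for(tunnels: int) -> int:
    """Smallest mesh size whose full mesh needs at least `tunnels` tunnels."""
    # Solve n * (n - 1) >= tunnels for n, then correct for float rounding.
    n = math.ceil((1 + math.sqrt(1 + 4 * tunnels)) / 2)
    while n > 1 and full_mesh_tunnels(n - 1) >= tunnels:
        n -= 1
    while full_mesh_tunnels(n) < tunnels:
        n += 1
    return n

if __name__ == "__main__":
    for quoted in (1_600, 20_000, 132_000):
        print(quoted, "tunnels would need about", endpoints_for(quoted), "meshed endpoints")
```

Doubling the number of meshed endpoints roughly quadruples the tunnel count, which is why the configuration burden grows so much faster than the network itself.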
The n-squared problem also creates another issue. RSVP-TE relies on the IGP to flood the available bandwidth of each link through the network. With so many tunnels reserving bandwidth, the IGP ends up spending more effort propagating available-bandwidth updates than on its primary task of propagating link up and down status.
The other challenge arising from RSVP-TE is the race condition it triggers when a link fails. Take the case of the medium-sized service provider with 450 routers, 2,000 links, and 20,000 tunnels. While most of the links in this provider’s network may carry fewer than 200 tunnels, there can be one link that carries around 1,000 tunnels.
If that one link fails, all 1,000 tunnels on it must find a new path, which means every affected headend router has to run CSPF and re-optimize its tunnels. This triggers a race condition: each headend router optimizes independently for itself, going after the same finite amount of bandwidth without being aware of the other routers’ requirements. When that happens, some of the tunnels fail to find a path. On a network-wide scale, re-optimization can be considered to have failed because not all tunnels found a path.
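The race can be illustrated with a deliberately simplified, single-round model. It assumes every headend chooses a path from the same bandwidth snapshot the IGP flooded before anyone re-signaled, and that reservations are then admitted first come, first served. The `Tunnel` class, path names, and bandwidth figures are hypothetical, and a real network would also retry failed tunnels over time.

```python
from dataclasses import dataclass

@dataclass
class Tunnel:
    name: str
    bandwidth: int  # Mbps

def independent_reroute(tunnels, paths):
    """Each headend picks a path from the same stale snapshot, then signals in turn."""
    snapshot = dict(paths)  # what every headend believes is free
    actual = dict(paths)    # bandwidth really left as reservations arrive
    placed, failed = [], []
    for t in tunnels:
        # All headends see the same snapshot, so they all favor the same path.
        candidates = [p for p, free in snapshot.items() if free >= t.bandwidth]
        choice = max(candidates, key=lambda p: snapshot[p], default=None)
        if choice is not None and actual[choice] >= t.bandwidth:
            actual[choice] -= t.bandwidth
            placed.append((t.name, choice))
        else:
            failed.append(t.name)  # lost the race for the chosen path
    return placed, failed

if __name__ == "__main__":
    tunnels = [Tunnel(f"t{i}", 200) for i in range(8)]
    placed, failed = independent_reroute(tunnels, {"A": 1000, "B": 800})
    print("placed:", placed)
    print("failed:", failed)
```

In this toy case the two alternate paths have 1,800 Mbps between them and the tunnels need only 1,600 Mbps, yet three tunnels fail because every headend converges on path A while path B stays empty.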
Combined with the n-squared problem, this means that in a medium-sized service provider network about five to six percent of the tunnels (nearly 1,200 of the 20,000) are down most of the time. Imagine how high these numbers can go in larger networks. It is very time consuming for engineers to triage why 1,200 tunnels are down at the same time.
Each of these challenges has its own solution. Operators can achieve adaptive traffic engineering, meaning a quick response to a failure, with a real-time model of the network. The n-squared problem is addressed by creating only as many tunnels as necessary, which reduces both the IGP overhead and the signaling overhead. Network providers can overcome the race conditions, which are triggered when each router independently tries to reserve bandwidth, by going back to the global view of the network that was part of the original offline traffic engineering model. Finally, software can help with the manageability issues that arise from having n-squared tunnels. And an approach that incorporates all these solutions? Software Defined Networking, or SDN.
The SDN-based approach builds on several recent network enhancements. Segment Routing is less complex than RSVP-TE and simplifies the IP/MPLS control plane; it can replace RSVP-TE, although RSVP-TE can still be used in this approach. It can set up any type of path in the network with low overhead. Push-based telemetry built on YANG models provides the traffic matrix, replacing NetFlow, which has traditionally been used to create traffic matrices. The real-time topology of the network comes from the SDN controller, which is part of the network control plane.
Finally, the actual task of traffic engineering is once again removed from the router and moved onto an SDN application. The SDN application not only responds to traffic demands based on the real-time network state, but it can also be programmed to handle future traffic demand.
For example, it can reserve bandwidth for a big game later in the week, which cannot be done by routers that respond only to current demands. In addition, SDN applications do not depend on the IGP to learn bandwidth availability; instead, they can build traffic matrices from either push-based telemetry (YANG) or traditional NetFlow.
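Reserving bandwidth for a future event can be sketched as a per-link booking calendar. This is a minimal illustration under stated assumptions, not how any particular SDN application implements it: the `LinkCalendar` class is hypothetical, time is expressed as plain integer hours, and a booking covers the half-open window from `start` up to but not including `end`.

```python
from dataclasses import dataclass, field

@dataclass
class LinkCalendar:
    capacity: int                                 # Mbps
    bookings: list = field(default_factory=list)  # (start, end, mbps) tuples

    def peak_usage(self, start, end):
        """Highest total booked bandwidth at any instant within [start, end)."""
        # Usage only rises at a booking's start, so checking those instants is enough.
        points = {start} | {s for s, _, _ in self.bookings if start <= s < end}
        return max(
            sum(bw for s, e, bw in self.bookings if s <= p < e) for p in points
        )

    def reserve(self, start, end, mbps):
        """Book bandwidth for a future window if capacity allows at every instant."""
        if self.peak_usage(start, end) + mbps > self.capacity:
            return False
        self.bookings.append((start, end, mbps))
        return True

if __name__ == "__main__":
    link = LinkCalendar(capacity=1000)
    print(link.reserve(18, 22, 700))  # the big game
    print(link.reserve(20, 23, 400))  # overlaps the game window
    print(link.reserve(22, 24, 400))  # starts once the game booking ends
```

A router reacting only to current demand has no equivalent of the calendar: it can admit or reject a tunnel now, but it cannot hold capacity for a window that has not started yet.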
The SDN application now takes care of computing the paths and allocating the bandwidth for the entire network. This centralized approach addresses both key challenges of on-device traffic engineering. It overcomes race conditions because the SDN app has a global network view, which allows for network-wide resource optimization. It also overcomes the n-squared problem by creating tunnels only where they are needed and will have a positive impact.
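The benefit of a global view can be sketched with a simple greedy placement. This is an illustrative heuristic only, not the algorithm any specific SDN application uses: it assumes a single decision point that always knows the bandwidth actually left on each candidate path. The function name, path names, and figures are hypothetical.

```python
def central_placement(demands, capacity):
    """Greedy placement using one always-current view of free bandwidth."""
    free = dict(capacity)
    placement, unplaced = {}, []
    # Place the largest demands first so that smaller ones can fill the gaps.
    for name, bw in sorted(demands.items(), key=lambda kv: kv[1], reverse=True):
        path = max(free, key=free.get)  # path with the most bandwidth left
        if free[path] >= bw:
            free[path] -= bw
            placement[name] = path
        else:
            unplaced.append(name)
    return placement, unplaced, free

if __name__ == "__main__":
    demands = {f"t{i}": 200 for i in range(8)}
    placement, unplaced, free = central_placement(demands, {"A": 1000, "B": 800})
    print("placement:", placement)
    print("unplaced:", unplaced)
    print("left over:", free)
```

Because each decision sees the effect of the previous ones, eight 200 Mbps tunnels spread across both paths and all of them are placed, whereas independent choices made against one shared snapshot would pile onto the path that initially looked best.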
Network providers also benefit from freedom of choice if traffic engineering is shifted to a stand-alone, third-party application rather than being tied to the SDN controller. This way, if they come across a new SDN application with more features or a better algorithm, they can switch to it without being locked in by the SDN controller or the routers that depend on it.
This is why we consider the SDN application-based approach the best way to overcome these traffic engineering challenges. With it, network providers can make their networks scalable and flexible enough to handle the ever-changing demands of users.