If you are a network engineer, should you be apprehensive? After all, you still trust a 27-year-old mechanism – conceived in 1989 as a short-term solution to be soon replaced by a better one – to send expensive data around the globe. And what would that be? Border Gateway Protocol, of course!
Jokingly called the “three-napkin-protocol,” because the idea was first written on the back of a napkin, BGP enabled the Internet to scale and continue its explosive growth to what it is today. This is something that BGP’s predecessor, EGP, was not capable of providing. While attempts were and are being made to design alternatives, none has come close to replacing BGP, which is now the standard for inter-AS routing.
BGP is designed to work based on trust, and that may be both its greatest strength and its greatest weakness. If you run an Autonomous System, then by default, unless you configure explicit filters, you trust, accept, and propagate the routes advertised by your peers, no questions asked. It is this quality of BGP that allows the whole of the Internet to be quickly updated about the best available path to a destination. And therein lies the weakness.
In 2008, state-owned Pakistan Telecom took YouTube offline for a couple of hours while trying to block videos deemed anti-Islamic. The telco advertised a more specific route, a /24 prefix covering just 256 addresses, into YouTube’s address space. An upstream provider that, in a perfect BGP world, would have rejected the announcement instead accepted and forwarded it.
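As a rough illustration of the check a stricter upstream could apply, the sketch below accepts an announcement only when the prefix falls inside a block registered to the announcing AS. The AS numbers, the prefixes (taken from documentation ranges), and the names ALLOWED_PREFIXES and accept_announcement are all hypothetical; real routers express this kind of rule with prefix lists or route policies, not Python.

```python
import ipaddress

# Hypothetical registry: the address blocks each customer AS may originate.
ALLOWED_PREFIXES = {
    64500: [ipaddress.ip_network("203.0.113.0/24")],
}


def accept_announcement(origin_as, prefix):
    """Return True only if prefix lies inside a block registered to origin_as."""
    announced = ipaddress.ip_network(prefix)
    for allowed in ALLOWED_PREFIXES.get(origin_as, []):
        # subnet_of() requires both networks to be the same IP version.
        if announced.version == allowed.version and announced.subnet_of(allowed):
            return True
    return False


if __name__ == "__main__":
    print(accept_announcement(64500, "203.0.113.0/25"))   # inside the registered block
    print(accept_announcement(64500, "198.51.100.0/24"))  # someone else's space
```

A filter like this only helps if every upstream maintains accurate registrations for its customers, which is exactly where the trust model tends to break down in practice.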
Seven years later nothing had changed. In 2015, Telekom Malaysia almost brought down the Internet. The telco advertised routes to 179,000 prefixes, which a US-based tier 1 provider accepted and forwarded to its peer providers and customers. As a result, Telekom Malaysia was inundated with most of the world’s Internet traffic, which it was not prepared to handle. This caused a considerable slowdown of the Internet.
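One common safeguard against a flood of unexpected routes is a per-session maximum-prefix limit. The sketch below models the idea in a simplified form: once a peer sends more distinct prefixes than the configured ceiling, the session is torn down and its routes are discarded. The class name PeerSession, its fields, and the limit used here are assumptions made for illustration, not any vendor’s implementation.

```python
class PeerSession:
    """Toy model of a BGP session guarded by a maximum-prefix limit."""

    def __init__(self, peer_as, max_prefixes):
        self.peer_as = peer_as
        self.max_prefixes = max_prefixes
        self.prefixes = set()
        self.established = True

    def receive(self, prefix):
        """Record an announced prefix; return False if it was not accepted."""
        if not self.established:
            return False
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            # Limit exceeded: drop the session and everything learned from it.
            self.established = False
            self.prefixes.clear()
            return False
        return True


if __name__ == "__main__":
    session = PeerSession(peer_as=64501, max_prefixes=3)
    for i in range(5):
        accepted = session.receive(f"10.0.{i}.0/24")
        print(f"10.0.{i}.0/24 accepted={accepted}")
```

The design choice is deliberately blunt: losing one session is cheaper than propagating tens of thousands of leaked routes to everyone downstream.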
Known as route leaks, these and other similar incidents occur because the inherent nature of BGP is to trust. It is this same trust-based approach that abets BGP hijacking. Configure an edge router to announce more specific prefixes for a destination, and peer networks will accept them and divert traffic intended for that destination to the attacker. There are numerous incidents throughout Internet history, some accidental and some deliberate, that have caused data to traverse the most confounding paths. In one incident, two Denver computers exchanged their traffic through Iceland, and in another, British traffic to the Atomic Weapons Establishment flowed through Ukraine! Hopefully this was accidental!
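Why does a more specific prefix win? Routers forward on the longest matching prefix, so a narrower announcement overrides a broader one for the addresses it covers. The sketch below shows that selection rule with Python’s standard ipaddress module; the prefixes come from documentation ranges, and the AS labels and the best_route helper are hypothetical.

```python
import ipaddress


def best_route(routes, destination):
    """Return the next hop for destination using longest-prefix match."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, next_hop))
    if not matches:
        return None
    # The longest (most specific) matching prefix wins.
    return max(matches, key=lambda m: m[0])[1]


# Hypothetical routing table: a legitimate /24 and a hijacker's narrower /25.
ROUTES = {
    "203.0.113.0/24": "AS64496 (legitimate)",
    "203.0.113.128/25": "AS64511 (hijacker)",
}

if __name__ == "__main__":
    print(best_route(ROUTES, "203.0.113.5"))    # only the /24 matches
    print(best_route(ROUTES, "203.0.113.200"))  # the /25 is more specific
```

Addresses outside the hijacker’s /25 still reach the legitimate origin, which is one reason partial hijacks can go unnoticed for a while.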
The network has more to throw at you. Route flaps, link congestion, prefix changes, missing routes, and incorrectly configured filters and policies can all lead to routing problems. The causes can be anything from network hardware faults to misconfigurations to fat fingers. Whatever the reason for your routing troubles, as a provider of network services the last thing you want to hear is customers complaining that you have violated SLAs.
Routing issues can take a long time to resolve. Networks are large and complex, the problems are varied, and the root cause is often difficult to find. In the chaos, you need to find what is affecting data delivery, while making sure you do not violate SLAs. If you think “show ip bgp” or “debug ip ospf *” are all you need to solve your routing troubles, you may be in for a long night.
Because both inter- and intra-AS routing are complex and dynamic, traditional SNMP-based monitoring tools cannot explain why data traversed a slower or unknown route for five minutes a week ago. If you use such a tool that provides routing information via SNMP or probes for operational monitoring, can you afford to wait until the next ‘SNMP get’ request is triggered and completed, or until a probe collects and sends its data over the Internet to your server? I wouldn’t want to wait that long. I would prefer to begin my troubleshooting as soon as I see the first signs of anomalous behavior.
Help comes in the form of proactive monitoring that provides real-time visibility into the complete data path of the services you care about, a capability known as route analytics. Route analytics can speed up slow, manual CLI-based troubleshooting by providing real-time and historical information about routing performance, and it can help you make informed decisions to optimize peering relationships. Add to this the capability to be alerted to routing incidents, such as changes to paths, prefixes, and route state, and the ability to drill down on a path change to look at the underlying metrics, and you can make great strides toward reducing MTTR and achieving SLA targets.
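To make the alerting idea concrete, here is a minimal sketch that compares two snapshots of a routing table and reports new announcements, withdrawals, and AS-path changes. It is an illustration only, not how any particular route analytics product works; the snapshot format (prefix mapped to a tuple of AS numbers), the diff_routes helper, and the sample data are all assumptions.

```python
def diff_routes(before, after):
    """Compare two {prefix: as_path} snapshots and list routing events.

    Each event is a tuple: (kind, prefix, old_path, new_path).
    """
    events = []
    for prefix in sorted(set(before) | set(after)):
        old_path = before.get(prefix)
        new_path = after.get(prefix)
        if old_path is None:
            events.append(("announced", prefix, None, new_path))
        elif new_path is None:
            events.append(("withdrawn", prefix, old_path, None))
        elif old_path != new_path:
            events.append(("path_changed", prefix, old_path, new_path))
    return events


if __name__ == "__main__":
    # Hypothetical snapshots taken a few minutes apart.
    earlier = {
        "203.0.113.0/24": (64496, 64500),
        "198.51.100.0/24": (64496, 64501),
    }
    later = {
        "203.0.113.0/24": (64496, 64502, 64500),
        "192.0.2.0/24": (64496, 64503),
    }
    for event in diff_routes(earlier, later):
        print(event)
```

Keeping the timestamped snapshots around is what makes it possible to answer the “why did traffic take that path last week?” question after the fact.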
With route analytics, you can trust your routes and worry less.