Everyday BGP Peering Challenges

In a previous life I was a routing engineer at an Internet Service Provider. One of my responsibilities was peering analytics: making sure that all peering was optimized to allow for the best routing between our networks. This was my introduction to hot potato routing. In general, our goal was to carry traffic on our network for as little time as possible before handing it off to our peer transit provider’s network. Of course, their job was the same, and that led (and still leads) to a lot of asymmetry in BGP routing, especially when carrying OTT services like Netflix.

One of the common issues that I had to deal with was ensuring consistent routes from peer ASes. If I had five peering sessions with a service provider, I needed to ensure that the sets of routes I received at those five locations matched to within a percent or two. If they didn’t, I could no longer do hot potato routing for all the prefixes that I received, and this could have significant service delivery consequences.
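
A minimal sketch of that check in Python, assuming the received-prefix count for each session has already been pulled from the routers. The function name, the session labels, and the counts are hypothetical, and the 2% tolerance simply mirrors the "percent or two" rule of thumb above.

```python
def inconsistent_sessions(prefix_counts, tolerance=0.02):
    """Return sessions whose received-prefix count falls more than
    `tolerance` (as a fraction) below the largest count seen from
    the same peer AS."""
    if not prefix_counts:
        return {}
    baseline = max(prefix_counts.values())
    if baseline == 0:
        return {}
    flagged = {}
    for session, count in prefix_counts.items():
        shortfall = (baseline - count) / baseline
        if shortfall > tolerance:
            flagged[session] = shortfall
    return flagged


# Hypothetical counts for five sessions with one peer AS.
counts = {"sea": 10000, "sjc": 9950, "dfw": 10000, "iad": 9990, "nyc": 9000}
print(inconsistent_sessions(counts))  # {'nyc': 0.1}
```

Comparing against the largest count is only one possible baseline; a median would be less sensitive to a single session that receives extra routes.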

To illustrate this point, let’s assume I have two peers that provide me with their full routes, one on the west coast and one on the east coast. I can send traffic that originates on the east coast to the east coast peer and west coast traffic to the west coast peer. Let’s say I am trying to get to Amazon.com from the west coast, but my west coast peer does not advertise the Amazon prefixes to me while my east coast peer does. In this case, I will carry that traffic across my backbone to the east coast, where it will exit my network. It will then get carried back across the east coast provider’s backbone to the west coast and the Amazon server. Obviously, this is not ideal for either company. So the goal is to receive the same set of prefixes from every comparable peer.
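
The scenario above can be sketched as a set comparison: for each location, list the prefixes that were received somewhere else but not there. This is an illustration only; the function name and site labels are hypothetical, and the prefixes are documentation ranges rather than real Amazon routes.

```python
def missing_prefixes(routes_by_site):
    """Map each site to the prefixes received at some other site
    but not received at that site."""
    all_prefixes = set().union(*routes_by_site.values())
    return {
        site: sorted(all_prefixes - set(routes))
        for site, routes in routes_by_site.items()
    }


# Hypothetical received routes at two peering locations.
routes = {
    "west": {"192.0.2.0/24", "198.51.100.0/24"},
    "east": {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"},
}
print(missing_prefixes(routes))
# {'west': ['203.0.113.0/24'], 'east': []}
```

Any prefix listed for a site is one where traffic entering at that site would have to cross the backbone to exit elsewhere.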

This sounds easy, and for the most part, it was. Examining the number of prefixes received from each peering partner by querying the routers was pretty straightforward, but doing it regularly was tedious. And as the number of peers grew, so did the task. Soon, I was analyzing about 70 peering sessions from a couple of dozen different ASes. Some ASes had two sessions, some had six. Each session had to be compared with the other sessions to the same AS to keep my network stable and efficient. I wound up performing this week-long task about every six months. It wasn’t the most efficient process.

If I found significant differences between the prefixes received on different sessions with the same AS, I would then figure out whether we were dropping the routes on our inbound filter (were the filters the same at each location?) or whether the peer was not advertising them to us correctly. I would gather and document the evidence and then contact the peer to tell them what was happening. Inevitably, they would ask when the issue occurred, and because the analyses were so infrequent my best answer was usually “in the last six months.” Clearly, this wasn’t that helpful. What would have helped is an alert whenever the set of advertised prefixes changed significantly or whenever a peer advertised routes that it shouldn’t (like my own routes), plus the ability to view a history of the sessions so that I could diagnose and fix the problem quickly.
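
As a rough sketch of what such an alert might look like, the Python below keeps the last timestamped prefix count per session and flags two conditions: a count change beyond a threshold, and a received prefix that falls inside our own address space. Every name here (`check_snapshot`, `leaked_own_routes`, the session label, the 2% threshold) is hypothetical, and this is not how any particular product implements it.

```python
import ipaddress
from datetime import datetime, timezone


def leaked_own_routes(received, own_blocks):
    """Return received prefixes that fall inside our own address blocks."""
    own = [ipaddress.ip_network(block) for block in own_blocks]
    leaked = []
    for prefix in received:
        net = ipaddress.ip_network(prefix)
        # subnet_of() requires both networks to be the same IP version.
        if any(net.version == o.version and net.subnet_of(o) for o in own):
            leaked.append(prefix)
    return leaked


def check_snapshot(history, session, received, own_blocks,
                   threshold=0.02, now=None):
    """Compare a new snapshot with the previous one for this session,
    record it in `history`, and return a list of alert strings."""
    now = now or datetime.now(timezone.utc)
    alerts = []
    count = len(received)
    previous = history.get(session)
    if previous is not None:
        _, prev_count = previous
        if prev_count and abs(count - prev_count) / prev_count > threshold:
            alerts.append(
                f"{now.isoformat()} {session}: "
                f"prefix count {prev_count} -> {count}"
            )
    for prefix in leaked_own_routes(received, own_blocks):
        alerts.append(
            f"{now.isoformat()} {session}: peer advertised our prefix {prefix}"
        )
    history[session] = (now, count)
    return alerts


# Hypothetical usage: our address space is 192.0.2.0/24.
history = {}
ours = ["192.0.2.0/24"]
check_snapshot(history, "peer-west", ["198.51.100.0/24", "203.0.113.0/24"], ours)
for alert in check_snapshot(history, "peer-west", ["192.0.2.128/25"], ours):
    print(alert)
```

Because each alert carries the time of the snapshot that triggered it, the answer to "when did this start?" is bounded by the polling interval rather than by the gap between manual audits.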

The BGP peering challenges I faced at the ISP help me appreciate the benefits of Packet Design’s routing analytics. It would have been far better to be able to tell my peer, “Well, it happened at 3:15 AM on Saturday.” They might still have been reluctant to admit responsibility, but I know that it would have shortened the mean time to repair.

You can learn more about BGP peering analytics in this technical white paper.