Close to the wire: How route analytics can help prevent BGP outages

At around 3:00 a.m. Eastern Daylight Time on August 13th, Internet users started reporting slow connectivity and intermittent outages. The problems affected many large networks and hosting providers, including eBay, Comcast, and Time Warner.

The root cause was that some older Cisco routers have a default limit of 512k Border Gateway Protocol (BGP) routing entries in their TCAM (ternary content-addressable memory). The global routing table typically holds around 500k entries, which leaves only a small buffer. When BGP prefix aggregation for a major service provider’s systems temporarily failed, the provider quickly fixed the problem on its end, but not before 15,000 new prefixes were announced into the global routing table, pushing it past the 512k limit.

There is a workaround that raises the maximum routing table size on these routers, but one has to wonder why they were running so close to the limit to begin with. In short, there is clearly a need for a larger safety margin.
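To make the margin concrete, here is a rough Python sketch that takes the round figures above at face value. It assumes that "512k" means 512,000 entries and that the table held exactly 500,000 routes before the burst. Real limits and table sizes vary by platform and vantage point, and the function and constant names are illustrative only.

```python
# Assumed round figures from the article, not measured values.
TCAM_LIMIT = 512_000   # default routing-entry limit on the affected routers
BASELINE = 500_000     # approximate size of the global table before the event
BURST = 15_000         # extra prefixes announced during the aggregation failure


def headroom(limit: int, current: int) -> float:
    """Return the fraction of table capacity that is still free."""
    return (limit - current) / limit


def survives_burst(limit: int, current: int, burst: int) -> bool:
    """Return True if the table still fits after `burst` extra prefixes."""
    return current + burst <= limit


print(f"Free capacity before the event: {headroom(TCAM_LIMIT, BASELINE):.1%}")
print(f"Survives a {BURST:,}-prefix burst: {survives_burst(TCAM_LIMIT, BASELINE, BURST)}")
```

Under these assumptions only about 2.3% of the table was free, so a burst of 15,000 prefixes, roughly 3% of the baseline, was enough to overflow it.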

The August 13th event highlights one of the reasons that route analytics is more important than ever. With the visibility and proactive monitoring that route analytics provides, the ISP would have been alerted to the aberration in the number of advertised BGP prefixes. Route analytics helps network teams correctly diagnose problems like this. Conventional operational management tools cannot, because they rely on point-in-time discovery methods such as SNMP polling and CLI screen scraping. The result is a lower mean time to repair.
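As a minimal sketch of what alerting on an aberration in prefix counts could look like, the Python below compares each new prefix-count sample against the median of recent samples and against a capacity threshold. It does not show how any particular route analytics product works. The class name, the thresholds, and the sample values are all hypothetical, and a real system would take its counts from a live BGP feed rather than a hard-coded list.

```python
from collections import deque
from statistics import median


class PrefixCountMonitor:
    """Hypothetical monitor that flags sudden jumps and low headroom."""

    def __init__(self, capacity, window=12, max_jump=5_000, warn_ratio=0.9):
        self.capacity = capacity              # assumed hardware table limit
        self.max_jump = max_jump              # allowed rise above recent median
        self.warn_ratio = warn_ratio          # warn when this full
        self.history = deque(maxlen=window)   # most recent samples only

    def observe(self, count):
        """Record one sample and return a list of alert codes."""
        alerts = []
        if self.history:
            baseline = median(self.history)
            if count - baseline > self.max_jump:
                alerts.append("jump")
        if count >= self.warn_ratio * self.capacity:
            alerts.append("capacity")
        self.history.append(count)
        return alerts


monitor = PrefixCountMonitor(capacity=512_000)
for sample in (500_000, 500_050, 500_100, 515_000):  # made-up samples
    print(sample, monitor.observe(sample))
```

With the default 90% warning ratio, the capacity alert fires on the very first sample, which is the article's point: a table sitting at roughly 500k of 512k entries was already worth an alarm before the burst arrived. The final sample additionally raises the jump alert.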

We’ve seen a number of outages related to BGP misconfiguration and simple overload. Another concern is that BGP was not designed with security in mind and is vulnerable to route hijacking, as we saw with the Pakistani government’s efforts to block YouTube.

However, one reason to be optimistic about BGP going forward is that software-defined networking (SDN) should help reduce some of these risks. As Cengiz Alaettinoglu explains in his recent blog post about thwarting BGP security hijacking, routing policies change constantly due to the high number of policy objects needed to represent all BGP routes and ASes. These changes are made during maintenance windows, which are typically scheduled once or a few times a week. This leaves gaps during which new policies are not yet in effect, and this can result in poor routing performance. With SDN, routing configuration changes will happen automatically in near-real time, reducing the need for maintenance windows.

Either way, route analytics technology detects these types of problems and provides ongoing network audits to help operators prevent them, even in dynamic SDN environments. Automation is good, but as we saw with the outage of August 13th, real-time monitoring is still necessary to make sure that automated systems are functioning as intended.