Routing needs no introduction. Almost every organization in the world that needs connectivity depends on BGP or IGP routing for data delivery. Every organization with a routed network will have to deal with routing issues at some point, and those issues can be hard to troubleshoot. Most routing-related trouble tickets are closed either within a couple of minutes (when the issue resolves by itself) or after hours of work (when the NOC team has to find the root cause). Now imagine how hard troubleshooting can be in a service provider network that caters to hundreds of customers and has thousands of routing paths.
There are numerous reasons why routing issues are hard to troubleshoot, some of which we have discussed in previous blog posts. Let us consider a use case where a customer complains about intermittent connectivity to a critical application. In the normal course of troubleshooting, we figure out where the application is hosted, determine the path taken by traffic from the customer’s source to the destination, find the problematic path, link or node, troubleshoot and resolve the issue, and connectivity is back up! But in the real world, troubleshooting routing issues is never that easy.
If we have an intermittent service delivery issue, we might start troubleshooting by trying to determine traffic paths using traceroute. But this will not work if the issue was reported after the network had already reconverged. While traceroute will show us the current routing path, it holds no record of the paths taken before convergence, and thus we have no information about the path that had intermittent connectivity. We cannot recreate the problem, so that is a ticket closed without a resolution.
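To make the limitation concrete, here is a minimal Python sketch of what it would take to give traceroute a memory: parse each run into a hop list and keep a timestamped history that can be queried later. It assumes the numeric output format of `traceroute -n` and that some scheduler (not shown) runs the probe periodically. The names `parse_hops` and `PathHistory` are hypothetical, chosen for illustration only.

```python
import re
from bisect import bisect_right

# Matches a hop line such as " 2  10.0.1.1  1.2 ms ..." or " 3  * * *".
# The header line ("traceroute to ...") does not start with a hop number.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)")


def parse_hops(traceroute_output):
    """Return the first responder seen at each hop ('*' if none answered)."""
    hops = []
    for line in traceroute_output.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append(match.group(2))
    return hops


class PathHistory:
    """Timestamped record of observed paths (hypothetical helper)."""

    def __init__(self):
        self._times = []
        self._paths = []

    def record(self, timestamp, hops):
        # Store a snapshot only when the path differs from the last one.
        # Timestamps are assumed to arrive in increasing order.
        if not self._paths or self._paths[-1] != hops:
            self._times.append(timestamp)
            self._paths.append(list(hops))

    def path_at(self, timestamp):
        """Return the path in effect at `timestamp`, or None if unknown."""
        index = bisect_right(self._times, timestamp)
        return self._paths[index - 1] if index else None

    def changes(self):
        """Return (timestamp, path) pairs, one per observed path change."""
        return list(zip(self._times, self._paths))
```

Even with such a history, the snapshots only cover one source and destination pair at the probe interval, so a short-lived path change between two probes would still be missed.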
Now, say the issue persists. By running a traceroute, we can find the current path from the source to the destination and all the hops along it. But how do we know which node or link on that path is the root cause of the problem? Finding out requires connecting to each router along the path, verifying that everything is fine on it, and continuing until we identify the problematic node or link. It could easily take a couple of hours or more to check all the routers along the path and get everything back up and running. That is a ticket closed with a resolution, but possibly with an SLA violation.
But troubleshooting routing issues can be made quick and easy if we have real-time route analytics and the right tool, like the Route Explorer module of the Packet Design Explorer Suite, to capture real-time and historical routing data with analytics and reports.
Real-time route analytics records all live IGP and BGP routing events. These are used to build and maintain an always-current model of the network control plane. The data is displayed in real time as a live topology map of the network and is also stored for troubleshooting and historical analysis. This means that even if the routing issue was reported after the routes converged, all we need to do is play back the routing events from the captured routing history, find the problematic link(s) and troubleshoot only the routers and links concerned. This not only reduces the MTTR but also helps ensure that routing-related trouble tickets are closed with a resolution.
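The playback idea can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not how Route Explorer is implemented: it assumes routing events have already been reduced to timestamped link-state changes, models only a single IGP metric per link, and uses hypothetical names (`LinkEvent`, `topology_at`, `shortest_path`, `flapping_links`).

```python
import heapq
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkEvent:
    """One recorded link-state change (hypothetical data model)."""
    timestamp: float
    a: str
    b: str
    metric: Optional[int]  # None means the link went down


def _key(a, b):
    # Links are treated as undirected, so store endpoints in sorted order.
    return tuple(sorted((a, b)))


def topology_at(events, when):
    """Replay events up to `when` and return {link: metric}."""
    links = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.timestamp > when:
            break
        if ev.metric is None:
            links.pop(_key(ev.a, ev.b), None)
        else:
            links[_key(ev.a, ev.b)] = ev.metric
    return links


def shortest_path(links, src, dst):
    """Dijkstra over the replayed topology; returns a hop list or None."""
    adjacency = {}
    for (a, b), metric in links.items():
        adjacency.setdefault(a, []).append((b, metric))
        adjacency.setdefault(b, []).append((a, metric))
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, metric in adjacency.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + metric, neighbor, path + [neighbor]))
    return None


def flapping_links(events, start, end, min_changes=3):
    """Links with at least `min_changes` state changes in [start, end]."""
    counts = {}
    for ev in events:
        if start <= ev.timestamp <= end:
            counts[_key(ev.a, ev.b)] = counts.get(_key(ev.a, ev.b), 0) + 1
    return [link for link, count in counts.items() if count >= min_changes]
```

With a history like this, the intermittent-connectivity ticket becomes a query: call `flapping_links` over the window in which the customer reported trouble, then use `topology_at` and `shortest_path` to confirm which path the traffic was on at each point. Only the routers at either end of the flapping link then need hands-on attention.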
But if you are still not convinced that this can be done in minutes, check out this video where we use the Packet Design Explorer Suite to troubleshoot an intermittent routing issue in less than 5 minutes: