Because many of the world’s largest service providers and network operators use Packet Design, we get to see how they operate and witness the sort of issues they deal with during their everyday operations. We thought we would share some of these stories periodically through our Life in the Control Plane blog series. After all, life in the control plane is never dull!
It is well documented that one of the major causes of network downtime is change management. Because SDN and automation are still in the early adoption phase, most network operators have to schedule maintenance windows to make the necessary changes or upgrades. Add the capacity for human error to this and the result can be anything from a minor service degradation to a countrywide outage.
Here’s an example of how a large U.S. network service provider handled a significant network change that could have caused a major issue. In order to satisfy organizational security policies, the service provider decided to upgrade the IOS on all their Cisco ASR routers.
The operations team decided to upgrade all their devices at one time in order to avoid multiple maintenance activities. To do this, they first had to determine their non-peak traffic hours and fix a maintenance window. The service provider uses Packet Design’s Explorer Suite in their network for a number of requirements, including route analytics and network planning. Using Traffic Explorer, which, in conjunction with Route Explorer, provides visibility into network-wide routing paths and traffic behavior, the service provider determined their non-peak traffic hours over the week and scheduled the IOS upgrades.
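As a rough illustration of the idea of picking a maintenance window from a week of traffic data, here is a minimal Python sketch. It is not how Traffic Explorer works internally and does not use any Packet Design API; the function name `quietest_window`, the hourly sample list, and the choice to treat the week as circular are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def quietest_window(hourly_traffic: Sequence[float], window_hours: int) -> Tuple[int, float]:
    """Return (start_hour, total_traffic) for the least busy contiguous window.

    hourly_traffic is a hypothetical list of per-hour traffic totals covering
    one week. The series is treated as circular, so a window may wrap from the
    end of the week back to its start.
    """
    n = len(hourly_traffic)
    if not 0 < window_hours <= n:
        raise ValueError("window_hours must be between 1 and the number of samples")

    # Total for the window that starts at hour 0.
    current = sum(hourly_traffic[:window_hours])
    best_start, best_total = 0, current

    # Slide the window one hour at a time: add the hour entering the window,
    # subtract the hour leaving it.
    for start in range(1, n):
        entering = hourly_traffic[(start + window_hours - 1) % n]
        leaving = hourly_traffic[start - 1]
        current += entering - leaving
        if current < best_total:
            best_start, best_total = start, current

    return best_start, best_total
```

Given 168 hourly samples for a week, `quietest_window(samples, 4)` would return the starting hour of the quietest four-hour stretch, which an operator could then check against business constraints before fixing the window.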
During the scheduled maintenance window, the network operations team upgraded the IOS on all the ASR routers that were part of their network. The code upgrade went without any glitches, and the routers were all upgraded to the new code. After the upgrade, the operations team checked all the routers, around 20 of them, and made sure that all the interfaces on the devices were up and had Tx and Rx traffic.
To further ensure that the network was behaving as before the changes, the service provider ran a Routing Comparison report in Route Explorer. The report compares the state of the network elements in a topology, thus providing a before/after view of the routers. In this case, the report showed that four ports on one router that were previously up were now down. But the view from the router showed that all ports on the router were up.
On further analysis using data from Route Explorer, the operations team found that an entire 4x10G line card, which carried two of their backbone links and two redundant links to local PoPs, had disappeared from the view of the device chassis. So, although the device reported that all the ports it could see were up, the four ports on that line card had not been detected by the router IOS after the upgrade.
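The before/after check that exposed the missing line card can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the snapshot format (a dictionary keyed by router and port name), the function name `ports_no_longer_up`, and the router and port names in the usage note are hypothetical, and none of it reflects Route Explorer's actual data model or report logic.

```python
from typing import Dict, Tuple

PortKey = Tuple[str, str]  # (router name, port name), a hypothetical key format


def ports_no_longer_up(before: Dict[PortKey, str], after: Dict[PortKey, str]) -> Dict[PortKey, str]:
    """Return ports that were 'up' before the change but are not 'up' afterwards.

    A port that is absent from the 'after' snapshot is reported as 'missing'.
    This is the case a device-local check cannot catch, because the device
    only lists the ports it can still detect.
    """
    regressions = {}
    for key, state in before.items():
        if state != "up":
            continue  # only ports that were healthy before the change matter here
        new_state = after.get(key, "missing")
        if new_state != "up":
            regressions[key] = new_state
    return regressions
```

For example, if the pre-upgrade snapshot lists ports `te1/0` through `te1/3` on a router as up and the post-upgrade snapshot omits them entirely, the function reports all four as `missing`, even though every port remaining in the post-upgrade snapshot is up.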
With Route Explorer, the service provider was able to proactively identify and resolve an issue that otherwise would have caused major downtime. The provider continues to use Route Explorer for route analytics and as a standard component in their network change management processes, thus allowing them to avoid unintended consequences from maintenance activities, including SLA violations.