When everything else is done right, the one thing an enterprise still has to do to stay in business is stay up. In other words, near-zero downtime is what keeps you in business.
Downtime can be as simple as Amazon’s 20-minute outage, which cost the company $3.8 million. It can also be as complex as the route leak caused by Telekom Malaysia when peering with Level 3, which slowed down almost the entire Internet. Yes, really!
What is the cost of network downtime? Reports vary, but most put it between $140k and $540k per hour! That, of course, does not include the value of lost opportunities. If an outage hits a customer at a critical business hour, they will most likely start scouting for alternatives.
Beyond the lost opportunities, customers and revenue, there is also lost productivity and the man-hour cost of waiting for, or attempting, a fix. Finally, there are the intangibles, such as damage to reputation and corporate image, neither of which is easy to put a number on.
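To make the hourly figures concrete, here is a rough Python sketch that scales the quoted range to an outage of any length. The function name is mine, and so is the assumption that cost grows linearly with duration; the reports give hourly averages, nothing more.

```python
# Hourly cost range quoted in the text above (USD per hour).
LOW_PER_HOUR = 140_000
HIGH_PER_HOUR = 540_000


def outage_cost_range(minutes: float) -> tuple[float, float]:
    """Return the (low, high) direct cost of an outage lasting `minutes`.

    Assumes cost scales linearly with duration, which is a simplification.
    """
    return LOW_PER_HOUR * minutes / 60, HIGH_PER_HOUR * minutes / 60


low, high = outage_cost_range(20)
print(f"20-minute outage: ${low:,.0f} to ${high:,.0f}")
```

These are averages across many organizations. As the Amazon figure shows, a business that earns its revenue online can lose far more in the same 20 minutes.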
Networks have evolved over time – from ‘being flat’ in the beginning to being a complex mesh. Yet, the usual suspects that lead to downtime remain unchanged. Here are my top three:
Hardware Failure – Hardware can fail at any time, and there is not much anybody can do about it. Even the biggest networks, with reasonable redundancy in place, cannot always escape downtime caused by hardware failure. Remember Telstra in March?
While nothing can prevent a hardware failure, you can be prepared to deal with one when it occurs. To start with, SNMP tools can monitor the health of hardware devices and flag potential issues. Hardware health reports help a network admin take preventive action before a device fails and its ripple effect causes a major outage. However, not all hardware failures are ‘hard’ conditions. For example, a failing router interface can cause intermittent route flapping, where the router announces itself alternately “up” and “down” in quick succession. SNMP tools can easily miss these conditions because they occur between polling cycles.
It is good practice to have a recovery plan for when failures occur. This may mean rerouting traffic over a redundant link, keeping standby hardware loaded with the essential configuration, or anything else the network engineer knows is best for their network. While hardware failures can never be eliminated, having a recovery plan will definitely help reduce the outage time.
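A small simulation shows why polling misses short flaps. This is only a sketch: the down intervals and the 60-second polling interval are made-up values, and no real SNMP library is involved.

```python
# Hypothetical down intervals (start_s, end_s) for a flapping interface.
DOWN_INTERVALS = [(70, 73), (130, 132), (200, 204), (310, 312)]


def is_up(t: int) -> bool:
    """True if the simulated interface is up at second t."""
    return not any(start <= t < end for start, end in DOWN_INTERVALS)


def outages_seen(poll_interval_s: int, duration_s: int = 360) -> int:
    """Count up-to-down transitions visible to a poller sampling every poll_interval_s seconds."""
    seen = 0
    previous_up = True
    for t in range(0, duration_s + 1, poll_interval_s):
        up = is_up(t)
        if previous_up and not up:
            seen += 1
        previous_up = up
    return seen


print("60 s polling sees", outages_seen(60), "of", len(DOWN_INTERVALS), "flaps")
print("1 s polling sees", outages_seen(1), "of", len(DOWN_INTERVALS), "flaps")
```

Every flap here lasts a few seconds and falls between two polls, so the 60-second poller reports a perfectly healthy link.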
Configuration Errors – To err is human and, unforgivingly, the network goes down. An Avaya study found that 82% of companies surveyed have faced network downtime due to human error. Be it fat fingers or a configuration change applied without considering the after-effects, most organizations – including giants Facebook and Google – have been victims of human error.
While misconfiguration due to human error cannot be avoided entirely, organizations can follow a change management policy, backed by network change management tools, to prevent unauthorized changes. There are also times when network admins make changes without having considered all the “what-if” scenarios, which in turn leads to an outage. Tools that simulate network changes and show their impact beforehand help avoid these expensive outages.
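A toy version of such a “what-if” check might look like the sketch below. The topology, router names and function names are all hypothetical, and real simulation tools model far more than reachability (routing metrics, policies, traffic load). The idea is simply to test a change against a model of the network before touching the real one.

```python
from collections import deque

# Hypothetical topology: each pair is a bidirectional link between two routers.
LINKS = {("core1", "core2"), ("core1", "edge1"), ("core2", "edge1"), ("core2", "edge2")}


def reachable(links, source):
    """Return the set of routers reachable from source over the given links (BFS)."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def isolated_by_removing(link, source="core1"):
    """Routers that lose reachability from source if `link` is taken out of service."""
    before = reachable(LINKS, source)
    after = reachable(LINKS - {link}, source)
    return before - after


for link in sorted(LINKS):
    lost = isolated_by_removing(link)
    verdict = f"isolates {sorted(lost)}" if lost else "safe (redundant path exists)"
    print(link, "->", verdict)
```

In this made-up network only the core2 to edge2 link is a single point of failure; the other three links can each be taken down without cutting anyone off.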
Routing Issues – While a range of network issues can lead to an outage, routing issues play a major role because they are difficult to troubleshoot in large and complex networks. Link congestion, route flaps, route leaks and route hijacks are a few examples. There have been numerous instances where route leaks have slowed down the Internet or knocked out major services, and where route hijacks have dealt major damage.
Because you cannot trust the Internet to handle your routes perfectly, monitoring them with a route analytics tool is the most dependable safeguard. With route monitoring in place, you can be alerted to path changes, router state changes, prefix changes, prefix floods or any other anomalous routing behavior that can lead to downtime.
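To give a feel for what such alerts involve, here is a minimal sketch that compares two snapshots of a routing table. The snapshot format (prefix mapped to a tuple of AS numbers), the function name and the flood threshold are assumptions of mine, not the behavior of any particular product. The prefixes and AS numbers come from the ranges reserved for documentation.

```python
def diff_routes(before, after, flood_threshold=3):
    """Compare two {prefix: as_path} snapshots and return a list of alert strings."""
    alerts = []
    # Prefixes present in both snapshots whose path has changed.
    for prefix in sorted(before.keys() & after.keys()):
        if before[prefix] != after[prefix]:
            alerts.append(f"PATH CHANGE {prefix}: {before[prefix]} -> {after[prefix]}")
    # Prefixes that disappeared between snapshots.
    for prefix in sorted(before.keys() - after.keys()):
        alerts.append(f"WITHDRAWN {prefix}")
    # A burst of new prefixes in one interval is treated as a possible flood.
    new_prefixes = after.keys() - before.keys()
    if len(new_prefixes) >= flood_threshold:
        alerts.append(f"PREFIX FLOOD: {len(new_prefixes)} new prefixes")
    return alerts


before = {
    "192.0.2.0/24": (64496, 64497),
    "198.51.100.0/24": (64496, 64498),
}
after = {
    "192.0.2.0/24": (64496, 64499, 64497),  # path now detours via AS 64499
    "203.0.113.0/26": (64500,),
    "203.0.113.64/26": (64500,),
    "203.0.113.128/26": (64500,),
}

alerts = diff_routes(before, after)
for alert in alerts:
    print(alert)
```

A real tool works from a live feed of routing updates rather than periodic snapshots, which is what lets it catch the short-lived events that polling misses.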
Having real-time routing telemetry and analytics designed for today’s complex IP/MPLS networks with end-to-end visibility will get you closer to “five-nines” availability. To stay up in business, bring up the uptime.
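For a sense of how tight “five-nines” is, the short sketch below converts an availability target into a yearly downtime budget. The function name is mine and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes/year")
```

At 99.999% the budget is a little over five minutes for the whole year, so a single 20-minute outage uses up several years’ worth.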