The Care and Feeding of a High-Maintenance Network

A network is an organic creation. The minute it’s born, when all new core and edge connections are made and routing is turned up, things begin to change. Many changes are self-driven, the result of unexpected interactions such as equal-cost multipath (ECMP) routing and asymmetric paths. Other changes are due to the random nature of the Internet and are readily noticeable at the peering points into the newborn network.

Some people think that once the switch is turned on, things will just work as designed. I’ve found that is rarely the case. Networks need care and feeding, and that requires tools to check on the processing capacity, resource consumption, and well-being of the network and its individual elements.

For the monitoring aspect of this “care and feeding,” simple SNMP tools may be used. They are perfectly adequate for tracking and graphing CPU utilization, available memory, and throughput on the connections between network elements. However, when it comes to understanding the network’s routing and traffic patterns, SNMP-based tools are rarely the best method.

Today’s dynamic IP networks require visibility into what’s happening on the network when it’s happening. Real-time monitoring and alerting enable network engineers to discover anomalous routing and traffic behavior. The delay between SNMP polls makes such behavior difficult to spot. When you think about it, using SNMP polling without real-time analytics is like only checking on a newborn baby every 15 minutes. Instead, you could put a monitor on your nightstand that enables constant vigilance and detects junior’s inevitable cries for food, a change, or just a comforting touch. Real-time monitoring offers better peace of mind.

A good example of how networks evolve is when static routes are added to manage an expanding network. Static routes are great if the network is small and not running any dynamic routing protocols. But what happens if someone on a large network wants to reach engineering resources at a remote research division? Should the static route be configured on every router along the path to the R&D division? Or should the static route be redistributed into the dynamic routing protocol? This is a “Danger, Will Robinson” moment in the life of the network. The chances of introducing instability by creating a routing loop increase dramatically.
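To make the loop risk concrete, here is a minimal sketch in Python of how a forwarding loop can be found by following next hops from router to router. The table layout, the router names, and the function name `find_forwarding_loop` are all hypothetical, invented for illustration; a real tool would build these tables from collected routing data rather than a hand-written dictionary.

```python
from typing import Dict, List, Optional

# Hypothetical data model: router name -> {destination prefix -> next-hop router}.
# A next hop of None means the prefix is directly attached to that router.
NextHopTable = Dict[str, Dict[str, Optional[str]]]


def find_forwarding_loop(
    tables: NextHopTable, source: str, prefix: str
) -> Optional[List[str]]:
    """Follow next hops from source toward prefix.

    Returns the hop sequence, ending at the router that repeats, if the
    path loops. Returns None if the path reaches the destination or dies
    at a router with no route (a black hole rather than a loop).
    """
    path = [source]
    visited = {source}
    current = source
    while True:
        routes = tables.get(current, {})
        if prefix not in routes:
            return None  # no route here: traffic is dropped, not looped
        next_hop = routes[prefix]
        if next_hop is None:
            return None  # directly attached: destination reached
        path.append(next_hop)
        if next_hop in visited:
            return path  # we have been at this router before: a loop
        visited.add(next_hop)
        current = next_hop


RND_PREFIX = "10.9.0.0/16"  # made-up prefix for the R&D division

# Healthy case: edge -> core -> rnd, where the prefix is attached.
healthy = {
    "edge": {RND_PREFIX: "core"},
    "core": {RND_PREFIX: "rnd"},
    "rnd": {RND_PREFIX: None},
}

# Broken case: a static route on core points back toward edge.
broken = {
    "edge": {RND_PREFIX: "core"},
    "core": {RND_PREFIX: "edge"},
}

print(find_forwarding_loop(healthy, "edge", RND_PREFIX))
print(find_forwarding_loop(broken, "edge", RND_PREFIX))
```

The check itself is trivial. The hard part in practice is having an accurate, current view of every router's next hops, including the static ones.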

In situations like this, where static routes introduce routing loops or unexpected routing changes, SNMP-based monitoring tools are almost worthless. While they are well suited to monitoring reachability, they offer no visibility into the behavior of the Layer 3 routing topology. Using them to figure out why links are flapping after the introduction of static or redistributed routes will be futile. This is especially true if the problematic static route was introduced sometime in the past.

Incidentally, one of the “nice” things about routing loops that cause flapping in a network is the signature periodicity of the up/down or announcement/withdrawal of the routes. A pattern of events like this, shown in a route analytics histogram, stands out like a crying baby on a packed airplane, making it easy to identify the problem and when it was introduced into the network.
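The periodicity described above can also be checked numerically. The sketch below is one simple way to do it, not how any particular route analytics product works: it takes the timestamps of withdrawal events for a single prefix and reports a period when the gaps between events are nearly constant. The function name `flap_period` and the thresholds `min_events` and `max_cv` are assumptions chosen for illustration.

```python
from statistics import mean, pstdev
from typing import List, Optional


def flap_period(
    timestamps: List[float], min_events: int = 4, max_cv: float = 0.1
) -> Optional[float]:
    """Return the mean gap between events if they are regularly spaced.

    Regularity is judged by the coefficient of variation (standard
    deviation of the gaps divided by their mean). Returns None when
    there are too few events or the gaps vary too much.
    """
    if len(timestamps) < min_events:
        return None
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap <= 0:
        return None  # all events share one timestamp: no usable period
    variation = pstdev(gaps) / average_gap
    return average_gap if variation <= max_cv else None


# Withdrawals every 30 seconds: the signature of a periodic flap.
print(flap_period([0, 30, 60, 90, 120]))

# Irregular events: no period is reported.
print(flap_period([0, 5, 70, 72, 300]))
```

The earliest timestamp in a run of regular events is also a useful clue, since it suggests when the problem was introduced.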

Routing analysis technology offers real-time visibility into routing changes and can also perform sophisticated path analysis, incorporating static and dynamic routing information from specific points in time. This is important for SDN as well, where having an automated network doesn’t mean you can take a hands-off management approach. Like the network, the network engineer’s job continues to morph. Having a good management foundation and the right tools and practices will keep our networks, and our businesses, up and running effectively.