Internet Infrastructure
May 12, 2003

N+I Highlight: A Case Study in Troubleshooting a Network Outage

Router operations may be low-level issues for many IT shops.  However, if you can cut downtime, think of all the screaming users you won't have to hear from. When the network went down at NetWorld+Interop, a major networking trade show, quick action by operations staff got users back online. The emergency offered an example of the correct way to troubleshoot a network.

This year’s NetWorld+Interop show (N+I) in Las Vegas was similar to most contemporary trade shows.  It was a smaller event compared with previous years, with an estimated 28,000 attendees as opposed to 40,000 last year and 61,000 two years ago.  The show floor was smaller, consolidated into 150,000 square feet instead of spreading over 320,000.

The most interesting part for this show-goer happened on the morning of Wednesday, April 30, 2003.  Around 9:30 that morning, I was in the press room when, suddenly, network access went down.  At first, it was a minor inconvenience.  Anyone in this business expects temporary network disruptions.  As the outage stretched from five minutes to 10, to 20, someone joked, “Here we are at a networking show, and they have the same problems as everyone else.”  One journalist was literally screaming at the pressroom attendant, “I have a 10 o’clock deadline!  Can’t you do anything about this?”  At 9:50, the network came back online, just before the show floor opened at 10.  I don’t think he made his deadline.

Anyone working in IT can identify with the attendant — some irate user yelling at you when the problem is probably outside of your responsibility.  Deciding to find out what happened, I talked to two people integrally involved in running the N+I network:  Dan Backman, corporate systems engineer at Extreme Networks, and Kim Dimick, director of technical services at Packet Design.  Backman and Dimick gave me the scoop on the outage and the quick recovery.

The Setup

Dimick stated, “The N+I network went from nothing to a small ISP in about three weeks.”  There were 15 companies involved, including Extreme, Packet Design, Computer Associates, Fluke, NetScreen, Spirent, Aruba, and Sockeye.  “We designed it like you would a metro-area network,” said Backman.  It consisted of a fiber-optic backbone connecting 18 routers, all of which used the OSPF (Open Shortest Path First) protocol to direct backbone traffic over the shortest route to the Internet.  For redundancy, the network was multihomed to Qwest and Optigate.
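
For readers who want to see what "shortest route" means in practice, here is a minimal Python sketch of the shortest-path-first calculation (Dijkstra's algorithm) that link-state protocols such as OSPF are built on.  The four-router topology, the router names, and the link costs are invented for illustration; they are not the N+I design, and a real OSPF implementation does considerably more than this.

```python
import heapq

def shortest_path_costs(links, source):
    """Dijkstra's algorithm: lowest total cost from source to each reachable router.

    links maps router -> {neighbor: cost}. All names and costs are hypothetical.
    """
    costs = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, router = heapq.heappop(queue)
        if cost > costs.get(router, float("inf")):
            continue  # stale queue entry; a cheaper path was already found
        for neighbor, link_cost in links.get(router, {}).items():
            new_cost = cost + link_cost
            if new_cost < costs.get(neighbor, float("inf")):
                costs[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return costs

# Hypothetical topology, not the actual show network.
example_links = {
    "booth": {"core1": 1, "core2": 5},
    "core1": {"booth": 1, "core2": 1, "edge": 10},
    "core2": {"booth": 5, "core1": 1, "edge": 2},
    "edge": {"core1": 10, "core2": 2},
}

print(shortest_path_costs(example_links, "booth"))
```

In this made-up topology, traffic from "booth" to "edge" goes through both core routers at a total cost of 4 rather than taking the direct but expensive core1-to-edge link.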

The network was set up in a hot-staging environment in Belmont, CA, where it was tested for the usual issues, such as security, capacity, and fault tolerance.  It was then broken down, packaged, and trucked to Las Vegas, where it was recreated over a four-hour period.  By April 23, the network was ready for N+I’s 275 exhibitors.

Trouble Starts

The first few days showed no real problems.  “There were some Linux worms crawling around the NOC (network operations center),” said Backman, “but that was from dirty laptops and didn’t affect anyone but us.”

Then, at 9:27 a.m. on April 30, something went wrong.  Suddenly, no traffic was getting to or from the Internet from the N+I network.  The NOC team scrambled into action.  Backman was startled in his hotel room by his cell phone, pager, and hotel phone ringing simultaneously.  He was quickly on the phone to the NOC, reading out CLI (command line interface) actions to ping routers to determine their availability.  Brandon Ross, from Sockeye Networks, quickly confirmed that it was not a service provider outage.  Rather, something was wrong inside the network.  As Backman said, “We can still ping.  What the heck is wrong with OSPF?” 

In the Packet Design booth, Dimick and his team knew something was wrong seconds after the outage occurred.  The company’s Route Explorer product graphically represents changing traffic flows and routing paths.  It also records path changes over time and correlates network events with the network configuration at a point in time.  By 9:30, Route Explorer showed a series of downed links, all going through the same point.
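
The reasoning behind "all going through the same point" can be sketched in a few lines of Python.  This is not how Route Explorer works internally, which the article does not describe; it is only an illustration of the idea, and the router names are hypothetical.

```python
def common_endpoints(downed_links):
    """Return the routers that sit at one end of every downed link.

    downed_links is a list of (router_a, router_b) pairs. Names are hypothetical.
    """
    endpoint_sets = [set(link) for link in downed_links]
    if not endpoint_sets:
        return set()
    return set.intersection(*endpoint_sets)

# Hypothetical list of links reported down at the same moment.
example_downed = [("r7", "r2"), ("r7", "r5"), ("r9", "r7")]

print(common_endpoints(example_downed))
```

If every failed link shares one router, as it does for "r7" in this example, that router is the first place to look.  An empty result would suggest several unrelated failures instead.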

Problem Analysis and Resolution

With the graphic interface, Dimick knew at a glance that it was one specific router that had gone offline, killing the entire network.  He ran to the NOC in time to overhear the technician on the phone with Backman, finding the problem.  Backman’s team had also brought up the Route Explorer, having used it during the hot-stage testing in Belmont.  They too had localized the problem to the same router.

The team then worked through a number of options in a few minutes.  They found that, although unicast routing was working, link-local multicast routing was not.  BGP (Border Gateway Protocol) was telling traffic to use an OSPF route that it could not reach.  Also, the router could not make any OSPF route announcements and was unable to tell neighboring routers to remove the wounded router from the network.
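
The interaction described above can be modeled in a short Python sketch.  It rests on two facts: OSPF routers discover each other with hello packets sent to a link-local multicast address, and neighbors treat each other as usable only when hellos are heard in both directions.  Everything else here is a simplifying assumption made for illustration: the router names, the prefixes (taken from documentation address ranges), and the idea of reducing BGP route selection to a single next-hop reachability check.

```python
def two_way_adjacencies(hellos_heard):
    """Keep only router pairs whose hellos were heard in both directions.

    hellos_heard is a set of (sender, receiver) pairs; names are hypothetical.
    """
    return {
        frozenset((sender, receiver))
        for sender, receiver in hellos_heard
        if sender != receiver and (receiver, sender) in hellos_heard
    }

def reachable_routers(adjacencies, start):
    """Walk the adjacency graph and collect every router reachable from start."""
    seen = {start}
    pending = [start]
    while pending:
        router = pending.pop()
        for pair in adjacencies:
            if router in pair:
                (neighbor,) = pair - {router}
                if neighbor not in seen:
                    seen.add(neighbor)
                    pending.append(neighbor)
    return seen

def usable_bgp_routes(bgp_routes, adjacencies, local_router):
    """Keep only BGP routes whose next hop the interior routing can reach."""
    reachable = reachable_routers(adjacencies, local_router)
    return {prefix: hop for prefix, hop in bgp_routes.items() if hop in reachable}

# Hypothetical failure: border1 hears core, but core never hears border1.
example_hellos = {
    ("core", "border1"),
    ("core", "border2"),
    ("border2", "core"),
}
example_bgp_routes = {"192.0.2.0/24": "border1", "198.51.100.0/24": "border2"}

adjacencies = two_way_adjacencies(example_hellos)
print(usable_bgp_routes(example_bgp_routes, adjacencies, "core"))
```

In this toy model the route through "border1" drops out because its next hop is no longer reachable, even though a plain ping to that router might still succeed.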

The NOC team manually redirected all the traffic to the second external router and pushed the change around the network.  By 9:50, everything was up and running.  Backman arrived from the hotel, and additional analysis revealed that the wounded router’s network interface card had failed.  Using a spare NIC kept on hand, the team had the wounded router working within 30 minutes.

Lessons Learned

What is remarkable is not that the network outage happened.  These things happen all the time.  What is remarkable is the speed with which Backman, Dimick, and their teams localized the problem to a specific router.  First-level troubleshooting showed that everything was working.  Backman summarized the problem: “This was a Byzantine failure in the physical interface on the router that is hard to locate without a deep and specific knowledge of OSPF.  The failure had blocked OSPF communication in one direction on two of the router interfaces.  The next-hop for some of the BGP routes then became unreachable.”

“We were able to quickly rule out what was not the problem,” said Backman.  Having a host of experts in network routing and operations on hand, along with the Route Explorer, took the fix time “from hours down to minutes,” according to Backman. “Without the Route Explorer, we would have had to spend a lot more time digging,” which could have meant longer downtime.  And, he said, analyzing OSPF protocols is “intimidating at worst, time consuming at best.”

Most problems are not as esoteric as this OSPF problem, but most operations staffs do not have the expertise that was sitting around the N+I NOC.  Router operations may be low-level issues for many IT shops.  However, if you can cut downtime, think of all the screaming users you won’t have to hear from.


~ Michael Hoch

To provide us with your feedback on this research, please go to www.aberdeen.com/feedback .

Analyst Name: Michael Hoch
Practice Area: Internet Infrastructure

Aberdeen Group
260 Franklin Street
Boston, MA 02110-3112
www.aberdeen.com
Aberdeen Group is a leading market analysis and positioning services firm that helps Information Technology vendors establish leadership in emerging markets. Steeped in technology and armed with end-user field research, Aberdeen analysts answer clients' critical business and technology questions in the context of the Internet economy and across the product lifecycle. This document is the result of independent research initiated and performed by Aberdeen Group. Aberdeen Group believes its findings are objective and represent the best analysis available at the time of publication.

Phone: (617) 723-7890

This Document is for Electronic Distribution Only
-- REPRODUCTION PROHIBITED --

Copyright © 2003 Aberdeen Group, Inc., Boston, Massachusetts

The trademarks and registered trademarks of the corporations mentioned in this publication are the property of their respective holders. Unless otherwise noted, the entire contents of this publication are copyrighted by Aberdeen Group, Inc. and may not be reproduced, stored in a retrieval system, or transmitted in any form or by any means without prior written consent of the publisher. It first appeared on www.aberdeen.com on May 12, 2003.