Posted by: Eric Siegel
This is the first of what will be a series of discussions on
management of complex networks and systems, a problem that's confronting us all.
Before I became an analyst, I spent years in freezing server rooms in the
middle of crises, listening to lots of threats ("If you don't fix it fast
we're going to ...") and gobbling lots of free pizza.
And those were the "good old days"; now, things
are worse. Everything is more real-time; networks, appliances, servers, and
distributed applications are more intertwined; network and data flow topologies
are out of control; nothing is stable; it's almost impossible to reproduce the
production system in a test environment. And yet we're expected to detect
incidents and repair them as soon as they occur, if not sooner.
It's a nightmare. We have to get a solid grip on how to
manage this mess, or we're all doomed to even more stress and stale pizza and
late nights.
So let's see if we can come up with a scheme that makes our
lives easier without being too expensive or impossible to implement.
I'd like to start by discussing triage-based incident
management, which I've been working on and writing about for a few years now.
It seems, at least to me, to be a decent alternative to the classical strategy
of spending staggering amounts of money on system monitoring solutions that
measure everything that's easily measured (and for which the vendor can bill
licensing fees!), then hoping that an expensive "expert system" will
somehow sort through that overwhelming flood of data to understand what is
going on and assist the operations staff.
Triage-based incident management rests on rapid incident
classification. The basic idea is to find out where the problems are and
which group is responsible for them, and then to turn the incident over to the
responsible group with enough credible evidence to quash any attempts at
finger-pointing. The first goal is to determine the seriousness of the
incident; the second is to decide which specialist group should handle it.
Incident classification is based on measuring the service
delivered across the boundaries of clearly defined subsystems, such as DNS,
queries to back-end databases, transport across network backbones, and transit
time through firewalls and load balancer subsystems. Classification looks
primarily at services delivered, not at individual element measurements such
as CPU time and queue length.
For incident classification to work well, clear demarcation
points need to be identified at the boundaries between responsibility domains,
and there should be easily understood, credible, real-time measurements of the
services being delivered across those boundaries. For example, the service desk
can instantly see that DNS response time has suddenly jumped to 30 seconds, or
that transit time across a critical VLAN has tripled (or the VLAN has vanished
entirely!). They'll then know that they have to call the DNS people or the
people responsible for the VLAN. They won't know why the service is failing,
but they will be able to prove, in a few seconds, that there really are
problems and they'll know the appropriate group to handle the issue.
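To make the boundary idea concrete, here is a minimal Python sketch of that kind of check. Everything in it is an assumption made for illustration: the `Boundary` record, the boundary names, the owning groups, the baseline values, and the "three times the baseline" rule are all hypothetical. It only shows the shape of the idea: one service measurement per demarcation point, mapped to the group that owns it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    name: str            # demarcation point, e.g. "dns" (hypothetical name)
    owner: str           # group responsible for that domain (hypothetical)
    baseline_s: float    # typical healthy service time, in seconds (assumed)
    factor: float = 3.0  # assumed rule: this many times baseline is "degraded"


def degraded(boundaries, measured):
    """Return (owner, evidence) for each boundary whose service looks bad.

    `measured` maps a boundary name to its latest service time in seconds.
    A missing entry or None is treated as "the service did not respond".
    """
    findings = []
    for b in boundaries:
        value = measured.get(b.name)
        if value is None:
            findings.append((b.owner, f"{b.name}: no response"))
        elif value >= b.baseline_s * b.factor:
            findings.append(
                (b.owner, f"{b.name}: {value:.2f}s vs baseline {b.baseline_s:.2f}s")
            )
    return findings


# Illustrative values only; real baselines come from your own measurements.
BOUNDARIES = [
    Boundary("dns", "DNS team", 0.05),
    Boundary("vlan-transit", "Network team", 0.002),
    Boundary("identity-db", "Database team", 0.10),
]
```

Calling `degraded(BOUNDARIES, {"dns": 30.0, "vlan-transit": None, "identity-db": 0.09})` flags the DNS team and the network team, each with a one-line piece of evidence, and says nothing about why either service is failing. That is the point: the service desk needs proof and an owner, not a diagnosis.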
In actual use, the service desk uses end-to-end performance
metrics, such as total transaction response time or VoIP voice quality, to
detect incidents, then uses other service measurements to instantly sort the
incident into one of three groups:
- A well-known incident with a standard fix (e.g., reboot the
server that runs some packaged software that hangs occasionally and that seems
to restart without any further problems)
- A new incident that clearly is being caused by problems in a
single "responsibility domain" (e.g., a DNS problem is slowing down
all transactions, or a slow identity database response is slowing all new
logons, so it's obvious which responsible organization gets the trouble ticket)
- A new incident that doesn't clearly fall into a
responsibility domain (e.g., some subtle interaction among system elements is
causing a confusing set of symptoms; diagnostic procedures should be started
and higher-level diagnosticians should be notified)
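The three-way sort above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Triage` enum, the `classify` function, the idea of a symptom "signature" string, and the `known_fixes` lookup table are all hypothetical names invented for the example, not part of any particular product.

```python
from enum import Enum


class Triage(Enum):
    KNOWN_FIX = "well-known incident with a standard fix"
    SINGLE_DOMAIN = "new incident in a single responsibility domain"
    UNCLEAR = "new incident with no clear responsibility domain"


def classify(signature, degraded_owners, known_fixes):
    """Sort an incident into one of the three groups.

    signature       -- short label for the observed symptom (assumed format)
    degraded_owners -- groups whose boundary measurements look bad
    known_fixes     -- maps a known signature to its standard fix
    """
    # Group 1: something seen before, with a standard fix on file.
    if signature in known_fixes:
        return Triage.KNOWN_FIX, known_fixes[signature]

    # Group 2: exactly one responsibility domain is misbehaving.
    owners = set(degraded_owners)
    if len(owners) == 1:
        return Triage.SINGLE_DOMAIN, f"open ticket for {owners.pop()}"

    # Group 3: no domain, or several at once, so escalate.
    return (
        Triage.UNCLEAR,
        "start diagnostic procedures and notify higher-level diagnosticians",
    )
```

Note the order of the checks: the known-fix lookup comes first because it is the cheapest to resolve, and anything that does not point at exactly one domain falls through to escalation rather than being guessed at.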
Within each responsibility domain there can be additional,
lower-level "diagnostic domains" with the same types of
clearly defined demarcation boundaries and measurements to assist in rapid
isolation of the incident to a particular subsystem.
In effect, all this is just what I call "pre-planning
your next meltdown." After all, what do we do when we're confronted with a
major crisis in the server room (after the obligatory freakout and
power-cycling of every box we can find, of course)? We typically start putting
in measurements or trying to find existing measurements to isolate the problem
to a subsystem. Although we usually don't know what the performance is like
when the thing is working, we can at least guess that a healthy response time
is not some insanely long value.
What we're doing with triage-based incident management is
just pre-designing the measurements we'll use to diagnose system problems, but
we're doing it early, so we can get the benefit of a baseline when the thing
actually fails, and we don't have to do it while listening to threats from top
management.
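As a rough illustration of why the baseline matters, here is a small Python sketch that compares a current measurement with samples collected while the system was healthy. The function names and the choice of the median as the baseline are assumptions made for this example; the median is used only because a single bad sample in the history does not distort it much.

```python
from statistics import median


def baseline(healthy_samples):
    """Summarize service times recorded while things were working."""
    return median(healthy_samples)


def slowdown(healthy_samples, current):
    """How many times slower the current measurement is than the baseline."""
    return current / baseline(healthy_samples)
```

With a history on hand, "DNS is slow" becomes "DNS is several hundred times slower than it was last week", which is a much harder statement to argue with in the middle of a crisis.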
Well, that's a start. I'm going to keep writing on this
topic, and I'd love to see your comments. What are YOU doing to manage the
mess? Does triage-based incident
management seem reasonable?
(Burton Group subscribers, see: "A Framework for
Network Incident Management" and the 2007 Catalyst conference presentation "Pre-Plan Your Next Network Meltdown" for much longer discussions of these ideas.)