Posted by: Eric Siegel
This is the first of what will be a series of discussions on management of complex networks and systems, a problem that's confronting us all. Before I became an analyst, I spent years in freezing server rooms in the middle of crises, listening to lots of threats ("If you don't fix it fast we're going to ...") and gobbling lots of free pizza.
And those were the "good old days"; now, things are worse. Everything is more real-time; networks, appliances, servers, and distributed applications are more intertwined; network and data flow topologies are out of control; nothing is stable; it's almost impossible to reproduce the production system in a test environment. And yet we're expected to detect incidents and repair them as soon as they occur, if not sooner.
It's a nightmare. We have to get a solid grip on how to manage this mess, or we're all doomed to even more stress and stale pizza and late nights.
So let's see if we can come up with a scheme that makes our lives easier without being too expensive or impossible to implement.
I'd like to start by discussing triage-based incident management, which I've been working on and writing about for a few years now. It seems, at least to me, to be a decent alternative to the classical strategy of spending staggering amounts of money on system monitoring solutions that measure everything that's easily measured (and for which the vendor can bill licensing fees!), then hoping that an expensive "expert system" will somehow sort through that overwhelming flood of data to understand what is going on and assist the operations staff.
Triage-based incident management rests on rapid incident classification. The basic idea is to find out where the problems are and which group is responsible for them, and then to turn the incident over to that group with enough credible evidence to quash any attempts at finger-pointing. The first goal is to determine the seriousness of the incident; the second is to decide which specialist group should handle it.
Incident classification is based on measuring the service delivered across the boundaries of clearly-defined subsystems, such as DNS, queries to back-end databases, transport across network backbones, and transit time through firewalls and load balancer subsystems. It looks primarily at services delivered, not at individual element measurements such as CPU time and queue length.
For incident classification to work well, clear demarcation points need to be identified at the boundaries between responsibility domains, and there should be easily-understood, credible, real-time measurements of the services being delivered across those boundaries. For example, the service desk can instantly see that DNS response time has suddenly jumped to 30 seconds, or that transit time across a critical VLAN has tripled (or the VLAN has vanished entirely!). They'll then know that they have to call the DNS people or the people responsible for the VLAN. They won't know why the service is failing, but they will be able to prove, in a few seconds, that there really are problems and they'll know the appropriate group to handle the issue.
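To make the demarcation-point idea concrete, here is a minimal Python sketch of one boundary measurement. It is only an illustration, not a description of any particular product: the names `measure_seconds` and `check_boundary`, the owner label, and the time limit are all assumptions of mine, and the probe is whatever zero-argument function exercises the service at that boundary.

```python
import time


def measure_seconds(probe):
    """Time one call to `probe`, a zero-argument function that exercises a service."""
    start = time.perf_counter()
    probe()
    return time.perf_counter() - start


def check_boundary(name, probe, owner, limit_seconds):
    """Return a small evidence record for one demarcation point.

    `name`, `owner`, and `limit_seconds` are hypothetical labels chosen by
    whoever defines the responsibility domains.
    """
    try:
        elapsed = measure_seconds(probe)
    except Exception as exc:
        # The service could not be reached at all: report it as failed.
        return {"boundary": name, "owner": owner, "ok": False, "error": repr(exc)}
    return {
        "boundary": name,
        "owner": owner,
        "ok": elapsed <= limit_seconds,
        "seconds": elapsed,
    }
```

A DNS probe could be as simple as `lambda: socket.getaddrinfo("example.com", 80)` from the standard `socket` module, although local resolver caching may hide a slow server, so a real probe would probably query the DNS server directly. The record says who owns the boundary and how long the service took, which is the kind of evidence the service desk needs; it deliberately says nothing about why.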
In actual use, the service desk uses end-to-end performance metrics, such as total transaction response time or VoIP voice quality, to detect incidents, then uses other service measurements to instantly sort the incident into one of three groups:
- A well-known incident with a standard fix (e.g., reboot the server running a packaged application that hangs occasionally but seems to restart without any further problems)
- A new incident that clearly is being caused by problems in a single "responsibility domain" (e.g., a DNS problem is slowing down all transactions, or a slow identity database response is slowing all new logons, so it's obvious which responsible organization gets the trouble ticket)
- A new incident that doesn't clearly fall into a responsibility domain (e.g., some subtle interaction among system elements is causing a confusing set of symptoms; diagnostic procedures should be started and higher-level diagnosticians should be notified)
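The three-way sort above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a finished tool: the `runbook` dictionary of known symptoms, the `BoundaryReading` record, and the "three times baseline" rule for calling a domain degraded are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Triage(Enum):
    KNOWN_FIX = "well-known incident with a standard fix"
    SINGLE_DOMAIN = "new incident in a single responsibility domain"
    UNCLEAR = "new incident with no clear responsibility domain"


@dataclass
class BoundaryReading:
    domain: str      # responsibility domain, e.g. "dns" (hypothetical name)
    value: float     # current service measurement at the boundary
    baseline: float  # value recorded while the service was healthy


def degraded_domains(readings, factor=3.0):
    """Return the domains whose reading is at least `factor` times baseline."""
    return sorted({r.domain for r in readings if r.value >= factor * r.baseline})


def triage(symptom, readings, runbook, factor=3.0):
    """Sort an incident into one of the three groups listed above."""
    if symptom in runbook:
        # Group 1: we have seen this before and know the standard fix.
        return Triage.KNOWN_FIX, runbook[symptom]
    bad = degraded_domains(readings, factor)
    if len(bad) == 1:
        # Group 2: exactly one domain is out of line, so it gets the ticket.
        return Triage.SINGLE_DOMAIN, bad[0]
    # Group 3: no domain or several domains look bad; escalate for diagnosis.
    return Triage.UNCLEAR, bad
```

For example, a DNS reading of 30 seconds against a 0.05-second baseline, with every other boundary normal, returns the single-domain group with `"dns"` as the owner of the ticket. When nothing or several things look degraded, the sketch hands back the list of suspects for the higher-level diagnosticians.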
Within each responsibility domain there can be additional, lower-level "diagnostic domains" with the same types of clearly-defined demarcation boundaries and measurements to assist in rapid isolation of the incident to a particular subsystem.
In effect, all this is just what I call "pre-planning your next meltdown." After all, what do we do when we're confronted with a major crisis in the server room (after the obligatory freakout and power-cycling of every box we can find, of course)? We typically start putting in measurements or trying to find existing measurements to isolate the problem to a subsystem. Although we usually don't know what the performance is like when the thing is working, we can guess that a normal response time is not some insanely long value.
What we're doing with triage-based incident management is just pre-designing the measurements we'll use to diagnose system problems, but we're doing it early, so we can get the benefit of a baseline when the thing actually fails, and we don't have to do it while listening to threats from top management.
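The value of collecting a baseline early can be shown with a small Python sketch. It assumes, purely for illustration, that "normal" is the median of the most recent healthy samples; the class name `RollingBaseline` and the window size are mine, and a real deployment might prefer percentiles or time-of-day profiles.

```python
from collections import deque
from statistics import median


class RollingBaseline:
    """Keep recent measurements for one boundary and compare new ones to them."""

    def __init__(self, window=1000):
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def record(self, value):
        """Store one measurement taken while the service is running normally."""
        self.samples.append(value)

    def ratio(self, current):
        """Return current / median baseline, or None if no usable baseline exists."""
        if not self.samples:
            return None
        base = median(self.samples)
        return current / base if base > 0 else None
```

With healthy samples of 40, 50, and 60 milliseconds on record, a reading of 150 milliseconds gives a ratio of 3.0, so "transit time has tripled" becomes a number we can show someone instead of a guess made in the middle of the crisis.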
Well, that's a start. I'm going to keep writing on this topic, and I'd love to see your comments. What are YOU doing to manage the mess? Does triage-based incident management seem reasonable?
(Burton Group subscribers, see: "A Framework for Network Incident Management" and the 2007 Catalyst conference presentation "Pre-Plan Your Next Network Meltdown" for much longer discussions of these ideas.)

It's a worthwhile concept to consider and in fact is the way most IT organizations operate.
The challenge has been that there is no central repository or system-of-record for IT. While there are massive amounts of heterogeneous data being generated every minute, it is never captured. And historically IT management systems require the admin to specify what they want to monitor before something breaks.
We believe a disruptive approach is to collect everything and use IT Search as the mechanism to instantly access the data that is needed. This is what we are doing at Paglo. Paglo automatically collects all of the data about the computers, servers, network, and users and instantly allows IT admins to search and visualize the data. Paglo is the search engine for IT and is offered as IT Management SaaS.
[It would be great to talk to you about what we are doing - send me an email at bdehaaff at paglo dot com]
Posted by: Brian de Haaff | December 13, 2008 at 08:47 PM