Posted by: Eric Siegel
OK, now I'm worrying about how we're going to manage VMs without losing our minds.
You see, I've been hearing great stories about how there are going to be huge racks of servers containing zillions of VMs: virtual servers, virtual switches, virtual routers, virtual firewalls, virtual performance optimization appliances, virtual everything! And the system will instantly adapt to changing workloads, spawning new, fully-configured servers with their complete environment, along with switches, routers, whatever.
And if a newly-spawned virtual server is too many virtual-router hops away from closely-coupled applications or processes, well, then, that new virtual server will be automatically re-provisioned and re-configured closer to where it should be for minimum latency!
Yes, the entire mass of virtual whatevers is going to be in constant, thrilling motion! Everything is going to be moving around, minute by minute! Total efficiency in hardware utilization! Minimum transport latency! Automation! I just can't wait! It's going to be just totally wonderful!
Until it breaks.
And when it does break (and it will), I don't want to be the miserable technician trying to figure out where everything was when that transaction failed, trying to track down precisely what subtle interaction of data flow, firewall, optimization device, VLAN, switch... you get the picture... was the source of the difficulty.
It'll be like yelling "STOP!" six hours too late at a frenzied set of hyperactive squirrels loaded on caffeine and springing among the trees while chattering at each other. Now, where were they six hours ago, when the problem occurred? All the actors in the drama will have melted into thin air. I'll bet the guilty one won't even be there any more; he appeared, messed something up, and then vanished into the virtual fabric, leaving just the rack behind.
What will the vendors give us? Probably only the assurance that the virtual servers and switches and routers will have virtual MIBs, just like the real ones. But I know where the real ones are, and I can put my hand on them and monitor them. What happens if a virtual switch appears, switches a gigabyte or two, and then vanishes -- poof! -- inside of a minute? Where did it go? What did it do while it was alive? We'll be MIB-less in the fabric, without any vision.
I can already see a rule I'd like to impose on this mess, which will tie in with the "triage-based incident management" discussed in my post "Network and Systems Management: Over the Edge." It's this:
"Just because a production group can create a maze of unmeasured real and virtual servers, network components, and interconnections that change from one minute to the next is no reason it should. Edsger Dijkstra wrote a famous letter to the editor titled 'GO TO Considered Harmful' (Communications of the ACM, August 1968) in which he made the case for more program structure and less intertwined process flow. The same applies to IT infrastructure." [This is from my new Burton Group paper, "Using Network Management Platforms for Triage-Based Incident Management."]
I'll bet we network folks are going to be stuck in the middle of this VMess, and a miserable place it will be. We'd better start figuring out how we're going to handle it when incidents occur. And I think that triage orientation, with clearly-defined diagnostic domains and rules to ensure that the virtual servers providing a service are all grouped within the same domain, where we can evaluate them as a group with active measurement techniques, is a good place to start.
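To make the grouping rule a little more concrete, here is a rough sketch of the kind of check I have in mind. It assumes you can get an inventory of (service, VM, diagnostic domain) placements from somewhere; the function name, the tuple layout, and the sample names are all made up for illustration and don't correspond to any vendor's tool.

```python
from collections import defaultdict


def find_domain_violations(placements):
    """Return services whose VMs span more than one diagnostic domain.

    `placements` is a hypothetical inventory: an iterable of
    (service, vm, domain) tuples. The result maps each offending
    service to the sorted list of domains its VMs were found in.
    """
    domains_by_service = defaultdict(set)
    for service, _vm, domain in placements:
        domains_by_service[service].add(domain)
    return {
        service: sorted(domains)
        for service, domains in domains_by_service.items()
        if len(domains) > 1
    }


# Illustrative data only: "orders" breaks the rule, "billing" does not.
placements = [
    ("billing", "vm-web-1", "domain-a"),
    ("billing", "vm-db-1", "domain-a"),
    ("orders", "vm-web-2", "domain-a"),
    ("orders", "vm-db-2", "domain-b"),
]

print(find_domain_violations(placements))
# prints {'orders': ['domain-a', 'domain-b']}
```

Run something like this every time the automation moves a VM, and at least you'd know which services had wandered outside the domain where you can measure them as a group.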

I have been thinking and worrying about this issue for a while. The VM Infrastructure Client creates a false sense of security and fosters complacency in the SAs. Strict operational procedures and change control help, but I worry about the day I have to troubleshoot or recover in the middle of the day. There is ample opportunity for tool set improvement, but nothing replaces good planning and sound management acumen.
Posted by: Henry Mayorga | January 07, 2009 at 02:30 PM