posted by: Eric Siegel
While I'm out here blogging today, here's an idea from the other area I cover: network management.
Please, programmers, when a communications call hands you an error, don't just quietly retry the retryable ones and never bother to log the problem! We really need to know what is going on; we really need you to log all the errors and messages you get from the communications calls and network systems. Maybe the intermittent error didn't crash your program, but it's out there screwing around with someone else's program nevertheless. We'd really love to be able to see the history of the errors and almost-errors in a log somewhere when the production systems are down, people are climbing the walls, and management is making the usual threats.
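Here's roughly what I mean, as a minimal Python sketch and not a finished library. The names RetryableError and call_with_logged_retries are made up for illustration; substitute whatever transient exceptions your own communications layer actually raises. The point is only that every failed attempt gets a log line, even when a later retry succeeds.

```python
import logging
import time

log = logging.getLogger("comms")


class RetryableError(Exception):
    """Hypothetical stand-in for a transient communications failure."""


def call_with_logged_retries(operation, attempts=3, delay_seconds=0.0):
    """Run operation(), retrying on RetryableError and logging each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except RetryableError as exc:
            # Log every failed attempt, even if a later retry succeeds,
            # so the "almost-errors" still show up in the history.
            log.warning("retryable error on attempt %d of %d: %s",
                        attempt, attempts, exc)
            if attempt == attempts:
                log.error("giving up after %d attempts", attempts)
                raise
            time.sleep(delay_seconds)
```

A call that fails twice and then works still returns its result normally, but it leaves two warnings behind for the operations staff to find.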
And another idea: when you write that error message, please don't just issue some generic message: "Things are screwed up somehow, but I've no real idea what's going on. Some damn pointer somewhere is pointing to Mars, maybe. So I'm just going to toss this transaction into the crapper and reset, or panic, or whatever."
Instead, how about a message that contains two parts? The first part is a series of numbers that give a detailed description of the error, along with the application ID, the transaction ID, the step you'd just completed in that transaction, the specific identity of the servers, databases and other subsystems that were open, etc. The second part is a number that includes the time of day and maybe a token that we can match against what was going on in the system log files and monitoring files at that time. The end user can read those numbers to the service desk from the screen when the error message shows up, and we can then sort on the first numbers to see if we're suddenly getting a lot of errors with the same database, or server, or whatever. We can use the second number to look at what was going on in our other log files and in measurement system reports. That'll really help cut down our mean time to repair incidents.
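To make the two-part idea concrete, here is one possible shape for it in Python. Everything about the layout is my own assumption: the field order, the hyphen separator, the timestamp format, and the function names build_error_code and count_by_resource are all hypothetical. Pick whatever your service desk can read aloud without tears. (This sketch also assumes no field contains a hyphen.)

```python
import secrets
from collections import Counter
from datetime import datetime, timezone


def build_error_code(error_code, app_id, transaction_id, step, resource,
                     now=None, token=None):
    """Return (detail, correlation): the two parts shown to the end user."""
    now = now or datetime.now(timezone.utc)
    token = token or secrets.token_hex(4)  # short random token for log matching
    # Part one: what failed and where. Fixed field order so it can be sorted.
    detail = f"{error_code}-{app_id}-{transaction_id}-{step}-{resource}"
    # Part two: when it failed, plus a token to look up in other log files.
    correlation = f"{now:%Y%m%dT%H%M%SZ}-{token}"
    return detail, correlation


def count_by_resource(detail_codes):
    """Count reported errors per resource (the fifth field of part one)."""
    return Counter(code.split("-")[4] for code in detail_codes)
```

The application would write both parts to its own log as well as to the screen, so the correlation part the user reads out can be searched for directly, and count_by_resource gives a quick tally of which database or server keeps turning up.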
I know that your deadlines are insane and your management is foaming. But think of all the time you'll save the operations staff and yourself (when you get called in to consult on the crisis, as you will). Those time savings will doubtless come when you're up to your eyeballs in a different project with different insane deadlines, and you'll be really happy with your nice error messages when they save you lots of time!
Well, end of entry. Thanks for listening!
