I recently heard a talk by Mark Settle, the CIO of BMC Software. Mark’s talk was titled “Managing Global IT Assets in Tough Times,” and it covered several aspects of IT cost awareness. During the talk, Mark told a story that bears repeating.
In Mark’s organization, one server in a four-node cluster hung due to a software bug. Users were unaffected thanks to the cluster architecture. But the NOC received a systems alert reporting the server problem, and an operator was dispatched to restart the server.
After the incident, a time-and-motion analysis was performed. While the act of rebooting the server may seem simple, it was not the end of the chain. The operator reported the action to his supervisor, and time was spent on discussion. The supervisor reported the incident to his manager. The incident report moved up the chain until it was discussed by a high-level group. And of course, appropriate incident tracking documentation was generated as well.
Once all of these activities were totaled, the analysis found that the cost to reboot that one server, one time, was $5K.
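To make that kind of tally concrete, here is a minimal sketch of how the activities in the chain might be added up. Every figure in it (head counts, hours, and hourly rates) is a hypothetical placeholder of my own, not a number from Mark’s analysis, so the total is not meant to match his $5K. The `Activity` class and `incident_cost` function are likewise names I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """One step in the incident chain (all values are hypothetical)."""
    name: str
    people: int          # how many people were involved
    hours_each: float    # time each person spent
    hourly_rate: float   # assumed fully loaded cost per person-hour

    @property
    def cost(self) -> float:
        return self.people * self.hours_each * self.hourly_rate


def incident_cost(activities):
    """Total cost of an incident as the sum of its activities."""
    return sum(a.cost for a in activities)


# Illustrative placeholders only; substitute figures from your own analysis.
activities = [
    Activity("operator restarts server", 1, 0.5, 60.0),
    Activity("operator briefs supervisor", 2, 0.5, 70.0),
    Activity("supervisor reports to manager", 2, 0.5, 90.0),
    Activity("high-level review meeting", 6, 1.0, 120.0),
    Activity("incident documentation", 1, 2.0, 60.0),
]

for a in activities:
    print(f"{a.name:32s} ${a.cost:8.2f}")
print(f"{'total':32s} ${incident_cost(activities):8.2f}")
```

Even with made-up numbers, the shape of the result is worth noticing: the reboot itself is the smallest line item, and the meetings around it dominate the total.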
And now my first confession. While I have considered cost impact from big-picture perspectives such as business continuity assessments, I have never before considered the cost of such an isolated action. This story made me stop and evaluate IT incidents in a totally different frame of mind. And I hope that with this posting, you will too.
In these economic times, organizations are squeezing every dollar out of every department, including IT. The challenge for IT managers is to keep providing the services the organization demands while also meeting demands for lower costs, demands which, as Mark pointed out, usually equate to reducing headcount. But headcount reduction in turn reduces the quality of the services IT can deliver. It’s a trap you can fall into unless you can offer alternative ways to cut costs, which brings me back to that $5K server reboot. With this kind of analysis, you can identify other costs that may be reduced as an alternative to headcount.
I admit I was shocked at the cost of the reboot, and I’ve been wondering how that cost might be reduced. My second confession is that I have not developed an answer. In a modern IT department, there are processes and procedures that have been adopted based on best practices, such as ITIL. The procedures define the steps that must be taken to report and document an issue, and there is nothing wrong with these procedures. But following them exacts a price.
I am not suggesting abandoning ITIL, or any best-practices system. Far from it. But I will point out that it is important to be aware of their financial impact. And more than that, it is vital that these best practices be used in such a way that incident costs can be determined, as Mark was able to do in his organization. So the answer isn’t missing; the question is wrong. The question isn’t how the cost can be reduced. The question is, how can the reboot cost information we now have be used?
Remember: our task in IT is to provide valuable information to the organizations we serve. But in addition to being providers of information, we must be consumers of information. And we must use that information to make our task of serving the organization more effective, more efficient, and more economical.
Note: I’d like to thank Mark Settle for permission to use some material from his presentation for this posting.