Monday, November 29, 2010

Enterprise Operations Assurance

As technology becomes more integral to operations and more interconnected across every aspect of the enterprise, the risk and cost of failure are growing. Yet many companies introduce changes into their production environment on a frequent – even daily – basis, with only unit-level testing, if any. Changes introduced through projects may be tested somewhat more thoroughly, but even then their potential impact on connected systems is rarely tested.

The best explanation for this shocking but true predicament is the boiling frog syndrome. As the story goes, if you drop a frog into boiling water it will jump out immediately, but if you place it in cold water and then heat it slowly, it will stay in until it boils to death. Whether or not real frogs behave this way, the story turns on the rate of change and is often used to describe the human tendency to miss or dismiss significant changes that occur gradually.

Here’s how it happens to enterprise operations: A project is formed to develop, acquire or configure a set of functionality within a given timeframe. Almost inevitably the scope creeps up and surprises introduce delays, threatening the delivery schedule. Unfortunately, delivery dates are seldom allowed to slip, because of other forces and expectations, including compensation incentives as well as operational and logistical momentum.

As a result, projects are often declared complete on the scheduled date, and any remaining functional gaps or issues are simply transferred to the maintenance organization, where they become part of the ongoing stream of changes and corrections that keep systems operational on a daily basis. This stream of changes is rarely routed through a test organization; instead it relies on individual participants to verify their own work or perform spot checks as best they can. Enterprise-level, end-to-end testing is entirely absent.

The net result of inadequate testing can, of course, be disastrous. Software failures have killed people, bankrupted companies, cost hundreds of millions, and caused other losses and embarrassments too numerous to mention. This assumption of risk is tolerated either out of ignorance – as with the boiling frog – or out of faith that since nothing dramatic has failed in the past, nothing will in the future. The fact is that not all errors are disastrous, and many organizations have become extremely adept at handling problems before they become critical.

But failures don’t have to be dramatic to be traumatic. Many companies find themselves struggling to manage their backlog as the maintenance budget drains away resources. Do you know how much of your maintenance budget is invested in improvements versus fixing problems? Does anyone know how much business productivity is lost to software issues? In other words, do you know what it costs not to test enough?

On the other hand, how is it even possible to test enough on a daily basis? Typical enterprises have hundreds of applications, interconnected internally or externally, across a complex platform landscape. Application ownership and business process expertise are typically organized within functional silos with little or no upstream or downstream visibility, and few or no resources dedicated full-time to testing.

What is missing in most companies is an enterprise-level validation of end-to-end business processes, what I’m calling Enterprise Operations Assurance (EOA), that executes every single day before changes are accepted into production. The only way it is possible – and it is possible – is through automation. And the only way it is feasible is to focus on it.

Automate or Else

Automation is a must because there simply aren’t enough hours in a day to manually execute all the end-to-end processes that operate the enterprise. For example, it took one company several days to manually execute all the way from Order to Cash; the automated test took 44 seconds. Most of the difference came from the logistics of coordinating individual roles and resources across functional silos; the remainder came from the difference in execution efficiency between a manual and an automated test.

So what automation does is make it possible – on a daily basis – to execute the hundreds of end-to-end business process variations that are necessary to assure that essential enterprise operations will continue uninterrupted after any changes are made. These are the “no matter what” tests, as in “No matter what, we have to be able to sell our products, buy inventory, pay our people, etc.”

Understand that EOA is not organized or prioritized by what is changing; changes are happening too fast everywhere and may not even be planned or documented. Instead, it is organized and prioritized around what must not fail, whether it has changed or not.
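
To make that idea concrete, here is a minimal Python sketch of what such a daily run might look like. Everything in it is hypothetical: the names BusinessProcess, MUST_NOT_FAIL, run_assurance and release_allowed are mine, and the Order to Cash steps are stand-ins for whatever automated checks your own systems would need. The point is only the shape: the suite is keyed by the processes that must not fail, not by a list of what changed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A step receives a shared context dict and returns True on success.
Step = Callable[[Dict[str, object]], bool]


@dataclass
class BusinessProcess:
    """One end-to-end process, expressed as an ordered list of steps."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self) -> bool:
        context: Dict[str, object] = {}
        for step in self.steps:
            if not step(context):
                return False  # stop at the first broken link in the chain
        return True


# Hypothetical stand-ins for real automated checks against real systems.
def create_order(ctx: Dict[str, object]) -> bool:
    ctx["order_id"] = 1001
    return True


def ship_order(ctx: Dict[str, object]) -> bool:
    return "order_id" in ctx


def invoice_order(ctx: Dict[str, object]) -> bool:
    if "order_id" not in ctx:
        return False
    ctx["invoice_id"] = 5001
    return True


def collect_cash(ctx: Dict[str, object]) -> bool:
    return "invoice_id" in ctx


# The suite lists what must not fail, whether or not it has changed.
MUST_NOT_FAIL = [
    BusinessProcess(
        "order_to_cash",
        [create_order, ship_order, invoice_order, collect_cash],
    ),
]


def run_assurance(processes: List[BusinessProcess]) -> Dict[str, bool]:
    """Run every process and record pass/fail by process name."""
    return {process.name: process.run() for process in processes}


def release_allowed(results: Dict[str, bool]) -> bool:
    """Accept changes only if the suite is non-empty and everything passed."""
    return bool(results) and all(results.values())
```

In a real shop each step would drive an actual application, and release_allowed would sit in front of whatever mechanism promotes changes into production.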

Focus or Fail

Although EOA won’t work without automation, automation alone won’t bring EOA into existence. You still need to pay attention to it, and by that I mean someone has to take ownership of acquiring the business processes from all the silos, integrating them into a coherent, automated end-to-end sequence, and adding the infrastructure necessary to make it repeatable day after day or night after night.

The good news is, we’re not talking about adding a huge staff and workload. Business process experts already have to document their procedures for training and perform manual testing; all that is missing is the enterprise-level view and the automation capability. Any additional investment will be more than paid for by the reduced costs of manual testing and of chasing problems caused by unexpected impacts that inadequate test coverage failed to catch.

On the other hand, this is not a part-time job or sideline. EOA must be a full-time effort on the part of dedicated resources. Furthermore, it cannot be optional. It must be executed each and every time before any changes are introduced to production, whether that is daily, nightly, or even more often. Without focus and discipline, your critical enterprise operations remain at risk, and it is only a matter of time before you pay a steep price.

The bad news is that there is no clear owner. Because operations are distributed across functional silos, no single area has the vision or the mission to assure enterprise-wide operations. You might think IT should carry this banner, but most IT organizations are under unrelenting pressure to reduce costs, not increase them, and without a business imperative they are unlikely to invent work for themselves.

What’s the answer? I wish I knew.