Intro
The point of my last post about Control vs intelligence was to provide some background to an upcoming post extrapolating from my personal experiences. This post provides the background on another subject: changes and change management. If you have ever worked in a large organization anywhere near IT, you will be able to relate.
Lead up
After I finished my interview for a job at a large outsourcing company, the hiring manager walked me to the elevator and asked me, “Are you familiar with change management?”
I had not worked in a large organization before and I had no idea what he was talking about, so I tried to interpret it. I assumed that he was talking about managing the introduction of new technologies, new software and processes, and managing the transition, so I started talking about my involvement with such changes. In the small enterprises I had worked in before, formal change management as it is understood in large organizations simply did not exist.
I know better now. After many years of exposure to change management practices, I know that they are the perfect examples of what ails outsourcers and large enterprises:
…Sloppy concepts multiplied by dumb processes with bureaucratic, uncontrolled and undocumented executions based on arbitrary decisions to achieve political, not technological goals.
Don’t misunderstand me: the concept, the idea, is very important. I just haven’t seen any place yet where it is done properly, where there is a clear conceptual understanding of what it is supposed to be, what it is supposed to achieve and how it is supposed to work.
What are the business objectives of change management?
The main goal is to know what is happening in the environment, to be able to track changes (systems management).
The secondary goal is to manage the expectations of the client/users (service management).
The third, and most elusive, of the goals is to control who is doing what and when, the actions of the IT professionals (people management).
The ideal is to have a system where at any point we can understand and explain not only how the system is configured but also why so and how it got to be that way through well documented, well understood and often tested steps that we call changes. It is a great ideal but reality is far from that ideal. The reason why the ideal is never met is the hopeless mingling of the above three goals without any real attempt to conceptually divide them and to practically separate them.
What is the practice?
Since we have no other way to track what is going on in most of our managed environments, the change management process ends up being a replacement for proper documentation and configuration management. Its usefulness in this respect is very limited.
Toward the clients, it is an attempt to show that we are in control of the environment while also spreading responsibility by involving them even in the most trivial of decisions.
The process is most detrimental to the IT professionals whose activities are hindered by the convoluted approval process and who are discouraged from taking ‘ownership’ of the systems they manage.
My definition above may seem harsh, but I have yet to see a change management system that I could rate higher than three on a scale of 1 to 10. I have definitely seen enough to judge and hopefully also enough to understand what is wrong with them and how they could be done better.
So let’s explore.
The problem
Every now and then the IT admins, the ones who actually work on the changes, receive an e-mail warning about the serious consequences of performing ‘unauthorized’ changes. What is ‘unauthorized’ is simple to understand: anything that did not go through the process. What is ‘change’ is not defined at all. The best practical definition experience can guide me to is: “anything that may interrupt or in any way hinder (now, or in the future) the operation of the system.”
The result is operational sclerosis. Since nobody has a clear picture of what will be deemed to be a change, we try to avoid doing anything that may be. I have seen countless hours spent first on avoiding ridiculously simple decisions, then on waiting for approval from someone who cannot even understand what we are talking about. Picture rebooting a hung server. Nothing works on the server, it is not doing what it is supposed to do, but it cannot be rebooted without proper authorization.
We may be spending hours every week creating change management records for routine maintenance procedures. I was regularly implementing ‘changes’ that were weekly tasks of rebooting three servers. The effort to administer the change record for these tasks was greater than the action itself, which is actually not a change in any sense of the word.
There are beautifully code-named templates, teams to create layered and ever growing processes of checks and approvals, yet in the end, nothing gets safer. The only result is that people are very, very careful about doing anything at all. It is better if the system goes down on its own than as the result of someone trying to fix a problem.
The very idea of clustering is that if any service fails, it can be simply and automatically restored on the failover node. Clusters are built to do it automatically. We can obviously NOT create a formal change request, with its two-week-long, seven-level approval process, for an automatic failover, yet many of my clients treat the 30-second interruption as something that requires a change record. Once we have a change record, we need to follow the process and, if we did not, answer questions with a full explanation of why we could not. The CR will still have to go through the full approval process even if it is just the confirmation of something that has been done already. God forbid that you would fail to properly fill out some irrelevant section, such as explaining why user acceptance testing was not performed for the reboot. If we miss dotting an ‘i’ or crossing a ‘t’, the change will be rejected and we will have to go through the whole process again.
We seem to be in a trap. Although I am sure that many a change manager would extol the virtues of the processes, and argue for even more, I am not sure that they would make us safer.
NOT because documenting changes is not important, not because processes are not important, but because we are not documenting the changes but the processes, and because we do not have a proper understanding, a proper definition, of what a change is.
In the end, the change management process is just a convoluted CYA (cover your behind) exercise both for the individual members of the support delivery organization and for the organization itself vis-à-vis its clients. Change management is not about control over the environment but about spreading responsibility and accountability as thinly as possible. The process, as it is, has limited use: when something breaks, it is easy to find out what changed recently that may have caused the problem, but this is a benefit that can be easily achieved by more effective means.
Why does change management tend to end up the way it usually does?
The reason, I suggest, is the lack of conceptual thinking.
The questions
What is a change?
Now really, what is and what is not? If you knock over a chair by accident then you stand it up – did anything change? You just restored an error condition to its normal state.
You get a promotion and your job title changes. Changing the job title on your AD account is a change, is it not? Definitely more so than restarting a failed service, is it not? If I give you full access rights to the mailbox of the company’s CEO, isn’t that a change?
Is a service interruption a change? Is regular maintenance a change? Is troubleshooting a change?
If you open a port on a firewall, is that a change? Is closing down one a change? If you give access rights to a person or a process to a printer or a mail relay, isn’t that a change?
Service interruption
One source of the confusion about the role of change management is that it is mushed up with obligations about service availability. If the service is not available because there is no power in the building, how is that a change? When the power comes back, everything will continue exactly as it was working before. What difference does it make if we know about it ahead of time?
Maintenance
Maintenance is usually a planned service interruption in a planned maintenance window when service interruption is expected. Whatever the purpose of the maintenance is, it is never really a change. Cleaning up data drives, compressing or defragmenting databases are not changes. Yes, of course, something may go wrong during routine maintenance, but that is not a change but an incident. Yes, of course, some functionality tests need to be performed after the completion of the maintenance, but that should also be considered as a part of the maintenance process.
Patching
We do patching on our devices, operating systems and applications. The purpose of patching is to fix problems that were found across all the users of the product. Patching is designed to increase the stability, reliability and security of the products.
Troubleshooting
When we have major system outages, trying to fix the problem often involves changes. Most of these go undocumented and completely escape the formal change process. Some of these changes are simple fixes of misconfigurations, some may be necessary to address problems created by changes that a vendor service pack introduced, and some are substantial; the bureaucratic process does not know how to deal with any of them.
Staging
While this subject deserves a post of its own, I should mention staging here as well. The value of staging environments is highly questionable as there is absolutely no way to properly replicate an environment and its full workload (I challenge anybody to show me one). For some clients we are subjecting things we do in a staging environment to the full change management process with all its silly sub-processes such as user acceptance testing.
The importance of process
If, after all of these questions, you are under the impression that I do not like change management, you are mistaken. What I do not like is that instead of a sensible process with important checks, we developed it into a giant CYA (cover your behind) exercise where we treat the process as more important than the original goal. The goal is to ensure that what is done is properly understood, planned, executed and documented.
My contention is that change management, the way it is usually done, is not working too well.
The problem is that when the process turns out not to be working as expected, we try to fix it by doing more of the same, making it more bureaucratic, more complicated, more cumbersome.
The foundation of this approach is an unshakeable faith in control. Faith in our ability to have control, the belief that we can fully control ever more complex systems and the conviction that control is the only right way to handle complexity.
I would challenge both of these foundational beliefs. What I believe is that there is a superior approach: focusing on our ability to deal with problems instead of just trying to prevent them.
Change management should not be a replacement for configuration management, access auditing or properly managed documentation.
Answers
How can we sort out the mess? Let me start with definitions:
Define what is a change
…and remove from the process everything that is not.
What is a change
An Infrastructure Change is something that permanently changes the way the affected system operates.
Installation of new hardware or software, including major functional upgrades such as service packs.
Change of configuration or policy.
Security changes affecting the operating ability of the systems.
The installation or connection of a new component in the environment affecting the operation of the overall system.
What is not a change?
There is a very simple test to determine what is NOT a change:
If it does not require a change in some documentation, it is not a change.
Patching is not a change.
Maintenance is not a change.
Server reboots, service restarts and cluster failovers are NOT changes.
Changes to access lists, relay lists and IP address translations are not changes.
Opening a port on a firewall is not a change.
The last three items in the list above should most definitely be documented, and some of them should be tested after implementation, but since they do not change how the system operates, they are not changes and should not be subjected to the same complicated testing and multilevel approval processes as real changes.
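As a rough illustration, the two lists above could be written down as a lookup that tells an admin which path an activity takes. This is only a sketch: the activity labels, the `Handling` enum and the `classify` function are hypothetical names made up for this example, not part of any change management product.

```python
from enum import Enum


class Handling(Enum):
    CHANGE = "full change process"
    LOG_ONLY = "document and audit, no approval chain"


# Hypothetical activity labels; the grouping follows the two lists above.
CHANGES = {
    "install_hardware",
    "install_software",
    "service_pack_upgrade",
    "config_or_policy_change",
    "security_change",
    "connect_new_component",
}
NOT_CHANGES = {
    "patching",
    "maintenance",
    "server_reboot",
    "service_restart",
    "cluster_failover",
    "access_list_update",
    "relay_list_update",
    "ip_translation_update",
    "open_firewall_port",
}


def classify(activity: str) -> Handling:
    """Return how an activity should be handled under the definitions above."""
    if activity in CHANGES:
        return Handling.CHANGE
    if activity in NOT_CHANGES:
        return Handling.LOG_ONLY
    # Unknown activities are not guessed at; someone has to extend the lists.
    raise ValueError(f"unclassified activity: {activity!r}")
```

The point of refusing to guess on unknown labels is that the definition has to be written down once, up front, instead of being decided case by case in an approval meeting.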
Once this is sorted out, 75-80% of the changes will disappear and we can start dealing with the important tasks:
Documenting everything
Run books
The function of a run book is to document all changes and disruptive actions in an IT environment. The entries should be simple, searchable records in a database.
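A run book of this kind does not need much. The following is a minimal sketch using SQLite from the Python standard library; the table layout, the column names and the helper functions (`open_runbook`, `log_entry`, `search`) are assumptions chosen for illustration, not a description of any existing tool.

```python
import sqlite3
from datetime import datetime, timezone


def open_runbook(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the run book database. The schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS runbook (
               id INTEGER PRIMARY KEY,
               logged_at TEXT NOT NULL,
               system TEXT NOT NULL,
               who TEXT NOT NULL,
               action TEXT NOT NULL,
               reason TEXT NOT NULL
           )"""
    )
    return conn


def log_entry(conn: sqlite3.Connection, system: str, who: str,
              action: str, reason: str) -> int:
    """Record one change or disruptive action and return its row id."""
    cur = conn.execute(
        "INSERT INTO runbook (logged_at, system, who, action, reason) "
        "VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), system, who, action, reason),
    )
    conn.commit()
    return cur.lastrowid


def search(conn: sqlite3.Connection, term: str) -> list:
    """Find entries whose system, action or reason contains the term."""
    like = f"%{term}%"
    return conn.execute(
        "SELECT logged_at, system, who, action, reason FROM runbook "
        "WHERE system LIKE ? OR action LIKE ? OR reason LIKE ? "
        "ORDER BY logged_at",
        (like, like, like),
    ).fetchall()
```

Logging a reboot then takes one call, for example `log_entry(conn, "mail01", "alice", "rebooted server", "hung after backup")`, and `search(conn, "reboot")` finds it again when something breaks later.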
Configuration management
The state of every managed environment - including rationales for any particular configuration - should be accurately represented in some sort of documentation.
Activity calendar
I had a few close calls working on changes while other processes or changes were still going on. Documenting automated processes and overlaying them with planned changes would probably do more good than user acceptance testing of a reboot.
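The overlay itself is a simple interval check. Here is a sketch, assuming every automated process and planned change is recorded with a system name and a start and end time; the `Activity` record and the `conflicts` function are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Activity:
    name: str
    system: str
    start: datetime
    end: datetime


def conflicts(planned: Activity, calendar: list[Activity]) -> list[Activity]:
    """Return calendar entries on the same system whose window overlaps the planned one."""
    return [
        other
        for other in calendar
        if other.system == planned.system
        # Two windows overlap when each one starts before the other ends.
        and other.start < planned.end
        and planned.start < other.end
    ]
```

A planned reboot of a database server at 02:30 would be flagged against a nightly backup running from 01:00 to 03:00 on the same server, while the same reboot at 03:00, or on another server, would pass.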
Change auditing
Change auditing is an automated process. There are several products on the market that can collect and analyze any change in most environments. Adding an IP to a relay, changing firewall configuration, changing access rights on a directory object should all be logged. There is no need for the change management process as long as we have well-articulated policies and full accountability.
Which finally leads me to the main point of this post: control does not equal accountability.
Control is an illusion. Our perception of it should change.
I appreciate your staying with me on this subject. There is just one more to cover before I get to exploring their implications in the larger world.
Let me finish with a silly question:
When you looked at the title of this post, did the voice of David Bowie jump into your head?