Wednesday, 27 November 2013

Differences between Incident Management and Problem Management

There is sometimes considered to be a conflict of interest between Problem Management and Incident Management, which I have discussed briefly in other posts. Problem Management is mostly about finding the right answer. It is a qualitative method of problem solving, taking considered decisions to identify and eradicate the root cause of an issue.

Incident Management is by nature a quantitative approach that focuses on speed. Incidents should be resolved as quickly as possible, which frequently leads to quick decision making to restore service. On the surface the two processes align rather well: once the Incident is resolved, Problem Management begins to investigate the root cause. However, there are obvious conflicts of interest between Incident and Problem Management.

Incident Management is concerned with speed. This is understandable, because the incident is having a current impact on customers. Incident Management don't have time to take considered decisions; they usually have to make quick ones. For example, the email service has gone down. Technical Support believes the issue is most likely a corrupt database, which needs a restore from backup that will take 8 hours.

Tech Support estimates that it is 95% certain that this will be the resolving action. There is, however, a 5% chance that it is actually a memory issue, in which case a simple reboot of the application servers will resolve the impact, and that will take just 10 minutes. This action will, however, delete all diagnostic logs.

Incident Management decide either to do the reboot and then the database restore, or to run both actions in parallel. As a Problem Manager I can empathise with their decision, but either way it is not great news for Problem Management, as there are now no diagnostic logs to identify why the database became corrupt in the first place.
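To see why the reboot is so tempting, the figures quoted by Tech Support can be turned into a rough expected-impact comparison. This is only a sketch: it assumes exactly one of the two causes is the real one, that the quoted durations are accurate, and that running both actions in parallel does not slow either of them down. None of those assumptions comes from Tech Support, and the names used are purely illustrative.

```python
# Illustrative only: expected minutes of impact for the email example,
# using the figures quoted by Tech Support.
# Assumption: exactly one of the two suspected causes is the real one.

RESTORE_MINUTES = 8 * 60   # database restore from backup
REBOOT_MINUTES = 10        # application server reboot
P_DATABASE = 95            # percent chance the database is corrupt
P_MEMORY = 5               # percent chance it is a memory issue


def expected_minutes(if_database: int, if_memory: int) -> float:
    """Weight the outage length under each cause by its probability."""
    return (P_DATABASE * if_database + P_MEMORY * if_memory) / 100


strategies = {
    # Reboot first; if that does not help, start the restore.
    "reboot, then restore": expected_minutes(
        REBOOT_MINUTES + RESTORE_MINUTES, REBOOT_MINUTES
    ),
    # Restore first; only reboot if the restore does not help.
    "restore, then reboot": expected_minutes(
        RESTORE_MINUTES, RESTORE_MINUTES + REBOOT_MINUTES
    ),
    # Start both at once; service returns when the right one finishes.
    "both in parallel": expected_minutes(RESTORE_MINUTES, REBOOT_MINUTES),
}

for name, minutes in strategies.items():
    print(f"{name}: {minutes:.1f} expected minutes of impact")
```

Under these assumptions the two options that reboot straight away both come out ahead of waiting for the restore to finish first, and both of them wipe the diagnostic logs.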

Another example: the sales application is inaccessible to customers and Tech Support suggests that the issue is caused by a memory leak in the application servers. To restore service, Tech Support recommends that Incident Management reboot the application servers, which will take 10 minutes. The reboot will delete all diagnostic logs. They can, however, be downloaded before the reboot, which will add thirty minutes to the restoration time.

In this scenario the response depends on the Incident Manager, since the decision is somewhat subjective. Many will demand the instant reboot; others will appreciate the value of the diagnostic logs and take action to retain them. It is interesting to discuss this point with Incident Management to understand their viewpoint. I also sometimes ask the question: if you wouldn't retain the logs in the above example, how quickly would they need to be retrievable (e.g. only 5 minutes added to the duration of the impact) for you to download them? I think this often makes for an interesting debate.

I have never found a working example of a formula for this decision. It would be useful to have one, though, as the real driver for the Incident Management team is to restore service quickly. They don't get any praise for retaining logs when they have had an extra 30 minutes of service disruption. If the incident was getting close to breaching its SLA (Service Level Agreement) the decision would also change, which again does not feel like the correct approach. The driver should be to add the utmost value to the customer or get the best return on investment (ROI). In the above example the investment is time and the return would be Problem Management identifying the cause to avoid further recurrences.
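A naive version of that effort-versus-reward comparison might look like the sketch below. It is not an established formula. The function name, its parameters, and the example figures for recurrence probability and recurrence length are all hypothetical, chosen only to show the shape of the trade-off in the sales application example.

```python
def worth_retaining_logs(
    extra_minutes: float,
    recurrence_probability: float,
    minutes_per_recurrence: float,
    fix_probability_with_logs: float,
) -> bool:
    """Naive break-even test (hypothetical, not an established formula).

    Retaining the logs costs `extra_minutes` of impact now. The return is
    the recurrence impact we expect to avoid if the logs let Problem
    Management find and remove the root cause.
    """
    expected_minutes_saved = (
        recurrence_probability
        * minutes_per_recurrence
        * fix_probability_with_logs
    )
    return expected_minutes_saved > extra_minutes


# Made-up inputs: a 50% chance of one further 40-minute outage, and the
# logs are assumed to guarantee that the root cause is found.
print(worth_retaining_logs(30, 0.5, 40, 1.0))  # 30 extra minutes for the logs
print(worth_retaining_logs(5, 0.5, 40, 1.0))   # only 5 extra minutes
```

With these made-up inputs the 30-minute download does not pay for itself but a 5-minute one does, which is the kind of threshold the question to Incident Management above is probing.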

In reality it is not as easy a call as a simple return on investment calculation (effort vs reward). There are so many variables to think about, and experience is usually the best way to get the desired result. Incident Management come under similar pressures to Problem Management when it comes to recurrences. In my opinion most people don't know Problem Management exists and believe that the Incident Management team fixes all issues and that the action taken by them stops the issue from happening again. To the customer, an Incident fix is naturally a root cause fix rather than a reboot. They live in sweet naivety. The more the customer knows about IT infrastructure, the more they appreciate the complexity, but also the more they notice recurring impacts and explore ways to absolve themselves of responsibility.

A decision to retain logs takes two major variables into account: the current impact and the chance of recurrence.

How do Problem Management and Incident Management interact with one another?

Considering the above, it seems a stretch to imagine Incident Management and Problem Management working well together, as their drivers are frequently at cross purposes.

Close collaboration and a very close working relationship are necessary because, although the processes often diverge at key points, they rely on one another. Problem Management relies on quality information being retained during the Incident. Incident Management depends on Problem Management to eradicate root causes and so avoid recurrences, which would otherwise create new incidents for Incident Management to manage. They also share a common goal of improving the experience of the customer, although I find this difficult to instil in people; self-interest is easier for people to relate to.

Key performance indicators (KPIs) can drive unsuitable behaviours, as seen in the examples above. I want to talk further about KPIs and Problem Management metrics in a different post (soon). With KPIs that do not strictly align to a real goal, it is crucial to understand the mutual benefit that these processes have for each other and to be able to articulate the benefit of taking a particular action.

Incident Management is a completely different mindset to Problem Management. You could use a US high school analogy and say Incident Management are the jocks: they get all the praise for restoring service and are always in the limelight. Problem Management are the geeks, working away in the background saving the organisation millions without the business ever knowing it. Close working relationships can break any barrier. It is vital to the success of the two processes for the Incident and Problem Managers to have a good working relationship.

Author: Craig Kelly
Source: Link

1 comment:

  1. Thank you. At work we don't do proper incident management, and I tell my boss the importance of it.