Problem with problem management

Discussion on issues related directly or largely to ITIL problem management.
ITIL Expert
Posts: 117
Joined: Fri Dec 14, 2007 7:00 pm

Sun Sep 27, 2009 11:59 am

Hi All,

We tend to agree, sometimes openly and sometimes in a muted voice, that problem management is one area which could have been defined better than it is now. The problem management domain is not as evolved as the incident management domain. It's easier to define the KPIs of incident management, but it's tougher to define the problem management KPIs. The reason is that the two functions overlap in more places than one. In one of the articles on this forum, I read that if the CIO has to terminate the incident management and problem management functions of the organization, he needs to sack just one guy and not two.
The problem with problem management has been in the way the function is portrayed. For example, an incident has to recur for the problem managers to open the Kepner-Tregoe books and do RCA. If incidents of whatever severity don't recur, problem management remains dormant. We will have to look at, and possibly eliminate, the recurring element of an incident to give a boost to the problem management function. I have gone through the possible KPIs of problem management. I am listing a few which I picked from this forum itself.

*percentage reduction in average time to resolve Problems
*percentage reduction of the time to implement fixes to Known Errors
*percentage reduction of the time to diagnose Problems
*percentage reduction of the average number of undiagnosed Problems
*percentage reduction of the average backlog of 'open' Problems and errors.

Reduction in the cost of Problems to Users:
*percentage reduction of the impact of Problems on Users
*reduction in the business disruption caused by Incidents and Problems
*percentage reduction in the number of Problems escalated (missed target)
*percentage reduction in the IT Problem Management budget
*increased percentage of proactive Changes raised by Problem Management, particularly from Major Incident and Problem reviews.
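
As a rough illustration of how a "percentage reduction" KPI of this kind could be computed, here is a minimal sketch. The function name and the sample resolution times are made up for the example; a real figure would come from whatever the ITSM tool reports per period.

```python
from statistics import mean


def percentage_reduction(previous: float, current: float) -> float:
    """Percentage reduction from the previous period to the current one.

    A negative result means the value went up instead of down.
    """
    if previous == 0:
        raise ValueError("previous period value must be non-zero")
    return (previous - current) / previous * 100.0


# Hypothetical resolution times (hours) for Problems closed in two quarters.
last_quarter_hours = [40.0, 60.0, 80.0]
this_quarter_hours = [30.0, 45.0, 60.0]

reduction = percentage_reduction(mean(last_quarter_hours), mean(this_quarter_hours))
print(f"Reduction in average time to resolve Problems: {reduction:.1f}%")
```

The same helper would apply to the other "percentage reduction" items in the list (time to diagnose, backlog of open Problems and so on), as long as the two periods are measured the same way.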

These KPIs make problem management a dormant function, and by all means the guy in charge of incident management can be given the additional responsibility of problem management. When we release ITSM positions in our organization, we release positions like a) Availability manager b) Capacity manager c) Incident and Problem manager (just to name a few). There is no separate position for the problem manager, and they don't normally find a seat at the table.

The problem lies in a few inherent definitions. A disruption is termed an 'incident'. For example, users not getting mails due to an Exchange server failure is a major incident, users who cannot access network devices due to boxes running out of capacity is an incident, and transactions not happening in the bank's ATMs due to a link failure is again a major incident. Let's look at this from the end user's perspective. The end user would say: I have a problem with the mails, I have a problem with the network share. The bank's customer will say that there is a problem with the ATM. 'Incident' is lingo understood by the IT folks only and not by the end users. I am saying this because I firmly believe that what we call proactive incident management should be the function of problem management. What is managed proactively is a problem for the end users, not incidents.

Problem management should be held accountable for incidents if they happen. Before problem management performs the RCA of a major incident or a recurring incident, it should be able to answer why the incident happened in the first place. The idea is that incidents should not happen, and this should be the prime KPI of problem management. The incident management KPI should not be anything more than providing a quick workaround, thus ensuring limited or no damage to the organization's finances and reputation and reducing the service minutes lost for the end users. Ensuring that incidents don't happen, and that they show a reducing trend from last month or last quarter, should be the problem management function. This is not limited to incidents which have occurred in the past but covers any incident for that matter. This will ensure that the problem manager has a 'seat at the table' for all functions, be it change, capacity, availability etc. They can police all these functions to ensure that incidents don't happen.

Does anyone agree with my thoughts? Can we think of some KPIs which make the problem management a quintessential function of the ITSM? Please feel free to disagree with my thoughts.

ITIL Expert
Posts: 207
Joined: Mon Sep 26, 2005 8:00 pm

Mon Sep 28, 2009 12:19 am

By no means can I fully comprehend the issue that you are presenting here but I can tell you this:

1) In principle, every technical incident is the symptom of a problem. Realistically, however, you are not going to do an RCA for every incident. So, placing the bar at a certain severity might help.

2) Problem Mgt is a cornerstone of Service Support because it is the process that bridges incident mgt and change mgt. Very often, you will see that an incident (and major incidents especially) gets fixed by changing something. In an organization where both Incident & Change are fairly mature, Problem mgt can have difficulties in establishing itself. This is difficult organizational change mgt.

3) The activities of problem mgt don't need to be absolutely distinct from incident mgt, however. Very often, people will find the root cause of an incident while resolving it. When you acquire the capability of identifying that, and isolating it to (1) create records and (2) measure it, you've got a basis for problem mgt.

Like any other process, it can take numerous forms. I know of a place where Problem Mgt is only executed on Major Incidents. It makes sense there, given a number of parameters. It may not make sense somewhere else.

Maybe what you need is not a big bang approach, but rather a progressive process improvement approach:

1) Figure out what the fundamental requirements of problem management must be. You can get a pretty good list out of ISO20000.
2) Identify what you are doing, how consistently, and identify an internal best practice.
3) Promote the practice as an example, gain acceptance, generalize its use, move to the next one.

...some food for thought...
Fabien Papleux

Technology Consulting | Service Excellence
Red Badge Certified

Twitter @itilgeek
ITIL Expert
Posts: 1894
Joined: Mon Mar 03, 2008 7:00 pm
Location: Helensburgh

Mon Sep 28, 2009 3:04 am

viv121 wrote: an incident has to re-occur for the problem managers to open the Kepner-Tregoe books and do RCA. If the incidents of whatever severity don't re-occur, the problem management remains dormant.
I know this is covered in Fabien's detailed response, but I want to emphasise it somewhat.

It is absolutely not correct. It is perfectly possible to raise a problem without there ever having been a related service incident, never mind waiting for one or more repeats.

When an incident does occur, all you need is:

- a degree of uncertainty about the cause
- a sense that a recurrence is more than remotely possible
- a sense that other incidents may stem from the same cause
- a judgement that there is some cost risk from further incidents (sufficient to justify expending more on prevention and/or circumvention)

Then you probably want to raise a problem. And you don't need many of these pre-conditions. The higher the possible cost, the lower the probability you will tolerate; the less well you understand the cause, the more important to pursue an investigation.
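
To make the trade-off concrete, here is a minimal sketch of one way such a judgement could be expressed. It is only an illustration, not an ITIL rule: the function names, the expected-loss style threshold and the factor applied when the cause is not understood are all assumptions made for the example.

```python
# Assumption for illustration: when the cause is not understood, tolerate
# only half the recurrence probability we would otherwise accept.
UNKNOWN_CAUSE_FACTOR = 0.5


def tolerated_probability(cost_per_incident: float, prevention_cost: float) -> float:
    """Recurrence probability we are willing to live with.

    The higher the possible cost of a further incident, the lower the
    probability that is tolerated before spending on prevention.
    """
    if cost_per_incident <= 0:
        return 1.0  # nothing to lose, so any probability is tolerable
    return min(1.0, prevention_cost / cost_per_incident)


def should_raise_problem(
    recurrence_probability: float,
    cost_per_incident: float,
    prevention_cost: float,
    cause_understood: bool,
) -> bool:
    """Return True when the estimated risk justifies raising a problem."""
    threshold = tolerated_probability(cost_per_incident, prevention_cost)
    if not cause_understood:
        threshold *= UNKNOWN_CAUSE_FACTOR
    return recurrence_probability > threshold


# Hypothetical estimates: a costly incident with a modest chance of recurrence.
print(should_raise_problem(0.10, 100_000.0, 5_000.0, cause_understood=True))
# A cheap incident with the same chance of recurrence and an unknown cause.
print(should_raise_problem(0.10, 1_000.0, 5_000.0, cause_understood=False))
```

In practice the inputs are rough estimates, so a sketch like this is more useful for making the reasoning visible than for producing a precise answer.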

As for problems without incidents, an example:

Performance trend analysis shows an increasing utilization of resources beyond predictions. If it continues much longer, there will be detriment to delivered services. You don't wait for services to breach their performance levels before you start investigating the cause. You don't even wait for the situation to infringe your internal threshold levels. So you do not have a service incident. Of course, you can report the situation as if it were an incident if your management system relies on such an arrangement, but that is just a convenience.
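
A minimal sketch of that kind of trend analysis, assuming evenly spaced utilization samples and a simple straight-line fit (real capacity tooling would likely do something more refined). The function names, the sample figures and the 80% threshold are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def linear_fit(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_threshold(values: Sequence[float], threshold: float) -> Optional[float]:
    """Periods after the latest sample until the trend line reaches the threshold.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = linear_fit(values)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


# Hypothetical weekly utilization (%) and a hypothetical internal threshold.
weekly_utilization = [50.0, 52.0, 54.0, 56.0, 58.0]
weeks_left = periods_until_threshold(weekly_utilization, threshold=80.0)
print(f"Weeks until the trend reaches the threshold: {weeks_left}")
```

A short lead time from a check like this is a reason to raise a problem and investigate the cause well before any service level is breached.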
"Method goes far to prevent trouble in business: for it makes the task easy, hinders confusion, saves abundance of time, and instructs those that have business depending, both what to do and what to hope."
William Penn 1644-1718
ITIL Expert
Posts: 441
Joined: Sat Oct 06, 2007 8:00 pm
Location: Jakarta, INA

Mon Sep 28, 2009 10:19 am

From me, just a reminder.
Processes tend to span functions within the organisation. Therefore it is important to define the responsibilities associated with the activities in the process that have to be performed. To remain flexible, it is advisable to use the concept of roles. A role is defined as a set of responsibilities, activities and authorisations. In this chapter, very brief examples of relevant roles within the process are defined.

Roles should be assigned to people or groups within an organisation. This assignment can be full-time or part-time, depending on the role and the organisation.
I quote the above from the blue book, Chapter 6.11, Roles within Problem Management.
My point is that Problem Management need not be formalized into the organization structure. Problem Management is a process, not a function, meaning that the people who run the process could be anyone authorized for that purpose. The Problem Manager, however, is required to ensure that the standard is followed, as well as to perform the other tasks stated in the blue book.

It is recommended that the Service Desk Manager and the Problem Manager roles are not combined because of the conflicting interests inherent in these roles.
I assume that Service Desk Manager is identical to the Incident Manager.
ITIL Expert
Posts: 117
Joined: Fri Dec 14, 2007 7:00 pm

Sat Oct 03, 2009 11:25 am

Thanks everyone for your responses. I know my question was long and to some degree sensible (not sure if it's only me who thinks so :) ). I am somewhat responsible for drawing the line between the problem management and incident management functions in my organization. I just needed to know if an incident avoided is a success of problem management. What we call an incident is a problem for the end user. Should the task of avoiding incidents rest completely with problem management? The blue book surely has something to say, but I am not too sure it does a lot of good to the PM function. Do we restrict incident management ONLY to fighting the fire and informing the user base and senior management? Problem management takes up the 'morning report', which is a report on ongoing incidents, resolved incidents, root causes etc. Problem management also takes up alerts in the wake of upcoming changes to avoid incidents. A decrease in multi-user incidents also becomes the prime KPI for problem management. An effective KEDB for the service desk and second line support becomes a KPI of problem management.

I think the debate stands closed with some fantastic answers coming from you. However, any further thoughts would be welcome.
Posts: 23
Joined: Thu Jul 23, 2009 8:00 pm
Location: Sydney, Australia

Sun Dec 06, 2009 4:19 pm

G'day Viv,

I was wondering how this is now going for you and would like to add my input...
viv121 wrote: I just needed to know if an incident avoided is a success of Problem management.
It depends what you mean by that. I would say that if you have good event management and alerting in place then you can have a good foundation for proactive problem management.
For example, a server that hosts a vital business function (let's say the ERP system) gets to the point where it has 75% disk space being utilised. This event triggers an alert and an incident is created and assigned to the Windows Server team to investigate.
They see that for months, and maybe years, the server has been running steady at 40-50% disk space utilisation. So they see that something unusual has happened to create this spike in disk utilisation.
They go about creating a problem record, linked to that alert/incident, to run RCA and discover that the disk space has been filling up because the SQL transaction log has changed from being circular (archiving and overwriting) to appending. This is what is causing the disk space to get chewed up over time.
The techies can now resolve this problem by investigating why the logging type was changed and then putting in an RFC to change it back - if approved.
If all goes to plan then the problem is resolved before it actually causes an impact to users, and related incidents are prevented.

So in this case, absolutely, the prevention of incidents was a direct result of good proactive problem management.
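
The flow in that example could be sketched roughly as below. This is only an illustration under stated assumptions: the function names, the 10-point margin above the historical peak and the action strings are hypothetical, and a real event management tool would have its own rules.

```python
from typing import List, Sequence


def is_unusual(history: Sequence[float], current: float, margin: float = 10.0) -> bool:
    """True when the current reading is well above anything seen in the baseline."""
    return current > max(history) + margin


def handle_disk_event(
    history: Sequence[float], current: float, alert_threshold: float = 75.0
) -> List[str]:
    """Decide which records to create for a disk utilisation reading (in %)."""
    actions: List[str] = []
    if current >= alert_threshold:
        # The event breaches the alert threshold, so an incident is raised.
        actions.append("create incident")
        if is_unusual(history, current):
            # The reading does not fit the long-running baseline: investigate why.
            actions.append("create linked problem record")
    return actions


# Baseline of 40-50% utilisation, then a reading at the 75% alert threshold.
baseline = [42.0, 45.0, 48.0, 50.0, 47.0]
print(handle_disk_event(baseline, 75.0))
```

A server that has always run close to the threshold would get the incident but not the problem record, which matches the idea that it is the departure from the baseline that prompts the RCA.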
viv121 wrote: Should the task of avoiding incidents rest completely with the problem management? Blue book surely have something to say but am not too sure if it does a lot of good to the PM function
Again, I guess that depends. It can be argued that in the above case, the Capacity and Availability processes should be very involved and have a good link back to incident and problem management. Not all organisations are at the same ITIL maturity level, however, for different reasons.
viv121 wrote: Do we restrict the incident management ONLY to fight the fire, inform the user base and senior management.
That is pretty much the job of IM. They can be helped though to restore service ASAP by having a good KEDB and/or Knowledge system at their fingertips.

I hope that helps.
ITIL V3 Capability - Operational Support & Analysis Certified