ITIL :: View topic - Major Problem / Major Incident
Joined: Jul 09, 2009 Posts: 2 Location: Little Rock, AR. USA
Posted: Fri Jul 10, 2009 8:11 am Post subject: Major Problem / Major Incident
Disclaimer: I apologize in advance if this has already been asked/addressed, but I searched the forum and did not find a similar thread.
The questions are related more to business implementation than they are to ITIL process. I'm curious how others handle these situations in their organizations. Specifically, let's say you have a high-impact high-urgency incident (service down for an extended period of time), root cause is unknown, no solution in sight.
0.) Do you have strict business criteria to declare this a "Major Incident" (and/or "Major Problem"), or is it more of a judgment call by senior management?
1.) How do you successfully continue to run your processes in parallel? (i.e. PM subset of folks investigates for root cause while IM subset of folks looks for workaround).
2.) If you run in parallel, do you have one person with accountability over the whole "war room" situation?
2.a.) Is that person the Incident Manager, the Problem Manager, or another party?
3.) How do you manage resources when both Incident and Problem resolution efforts need the same SMEs? (developers for example).
Joined: Mar 04, 2008 Posts: 1883 Location: Newcastle-under-Lyme
Posted: Fri Jul 10, 2009 6:05 pm Post subject:
Andy,
Little Rock is a long way from here. So I won't bother throwing stones.
0.) You do both; you have criteria based on, for example, loss (for indeterminate time) or imminent loss of vital services; but senior business management can wade in if in their judgement an incident is threatening the business; part of the issue is often going to be uncertainty of recovery time.
1.) The priority is effective restoration of service. That may well require deep investigation (in effect root cause analysis), but it has nothing to do with Problem Management. If there is a possibility of recurrence (one way to determine that there is in fact a problem), then the activities (not the management functions) merge into problem analysis and investigation once resources are free from service restoration. There may be no underlying problem once the incident is dealt with, and the major incident review can check and confirm whether that is the case.
2.) The war room is around the incident and, if there is an underlying problem that threatens repeat incidents (for example), especially if they could occur at any time or sooner, then you move the war room to resolving the problem as soon as the incident is fixed.
2a.) In a major incident situation, you do not assign lead role on basis of normal roles; you want the most capable person available running the show; it would be natural for the most senior person present to take charge and s/he may well delegate coalface management to the best and most experienced person available while retaining close contact with what is going on and with senior business management; in any event whoever leads must have authority (at least for the duration of the incident) to acquire and deploy whatever resources are required.
3.) I think this is implicitly answered above; the priority is effective restoration of service.
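The criteria in answer 0 can be written down as a simple check. This is only a sketch of the idea, not an ITIL-defined rule: the `IncidentStatus` fields and the `is_major_incident` function are hypothetical names, and each organization would set its own thresholds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentStatus:
    """Hypothetical snapshot of an incident; all field names are assumptions."""
    vital_service: bool                        # does it affect a vital service?
    service_lost: bool                         # is the service currently down?
    loss_imminent: bool                        # is loss of service expected shortly?
    estimated_recovery_hours: Optional[float]  # None means recovery time is unknown
    management_override: bool = False          # senior business management has stepped in


def is_major_incident(status: IncidentStatus) -> bool:
    # Senior business management can declare a major incident on judgement alone.
    if status.management_override:
        return True
    # Otherwise the criteria only apply to vital services.
    if not status.vital_service:
        return False
    # Loss for an indeterminate time, or imminent loss, meets the criteria.
    indeterminate_loss = status.service_lost and status.estimated_recovery_hours is None
    return indeterminate_loss or status.loss_imminent
```

The override is checked first so that a judgement call by the business always wins over the written criteria, which matches "you do both".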
Problem Management is not a front line activity. If you expect the service to drop again in an hour, or even a day, you are still deep in incident management because you are working on an imminent breach. If you expect the service to stay up but you are not sure whether it might go down again sometime, then you can apply problem management to it, and your first step would probably be to validate the workaround that was achieved and streamline its application.
_________________
"Method goes far to prevent trouble in business: for it makes the task easy, hinders confusion, saves abundance of time, and instructs those that have business depending, both what to do and what to hope."
William Penn 1644-1718
This is an interesting topic. I was discussing this with our problem manager here... My understanding of the problem process was that a Problem becomes a Known Error and THEN has a workaround applied. This doesn't fit Incident Management - Incident Management seems to be able to apply workarounds without understanding the root cause in order to restore service.
Does anyone else find it confusing that Problem Management highlights the linear notion of Problem->Known Error->Workaround/solution, and that Incident Management allows Workarounds without understanding the root cause?
Should we have different names for the blind-faith bandaid type of workaround that incident management comes up with (bounce the box!) and the reasoned workarounds that Problem Management comes up with after understanding Root Cause?
Joined: Mar 04, 2008 Posts: 1883 Location: Newcastle-under-Lyme
Posted: Mon Jul 13, 2009 9:53 pm Post subject:
milligna wrote:
Should we have different names for the blind-faith bandaid type of workaround that incident management comes up with (bounce the box!) and the reasoned workarounds that Problem Management comes up with after understanding Root Cause?
I don't see the necessity. It is not strictly "blind faith" since the workaround has been applied to resolve at least one incident before it ever reaches the status of workaround. Even after "root cause" is established and a workaround derived, this is not set in stone. It is perfectly possible to change the workaround after this if a better one is found for the business. There is more than a technical component to a workaround.
Properly speaking a problem does not become a "known error", it spawns a known error or it acquires an attribute of "known error".
It is clear from the recent posts in several threads that this area is confusing. I do not propose to try to unravel ITIL, if for no other reason than I do not have access to the books. However, the best way to do something is the best way to do it regardless of book learning.
To my mind the high emphasis on "known error" is symptomatic of an environment where development of software is important. In many service environments (probably most outside of the really big organizations), it seems overkill to start managing a sophisticated "known error" system on top of Problem Management.
I prefer to unravel the knot. You need a workaround to resolve an incident. The first one you come up with should be used until either you find a better one (possibly through "root cause analysis") or you find a serious defect in the one you are using (in which case you find another one pronto).
You don't really want staff ferreting around a bunch of incident records, a bunch of problem records and a bunch of known error records every time there is an incident. You want one unified search ("match the symptoms and tell me the workaround"). You do not care about all the technical terms at that point and the only time you are interested in workarounds is when you have incidents.
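A minimal sketch of such a unified search might look like the following. The `Record` structure, the `find_workarounds` function, and the sample references are all hypothetical; a real service desk tool would have its own data model and a much better matching method than simple symptom overlap.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Record:
    """Hypothetical unified record; the record type is just an attribute."""
    kind: str                         # "incident", "problem" or "known_error"
    ref: str                          # illustrative reference number
    symptoms: Set[str]
    workaround: Optional[str] = None  # None if no workaround is recorded yet


def find_workarounds(observed: Set[str], records: List[Record]) -> List[Record]:
    """Return records that share a symptom and carry a workaround, best match first."""
    matches = [r for r in records if r.workaround and observed & r.symptoms]
    # Rank by the number of shared symptoms, regardless of record type.
    return sorted(matches, key=lambda r: len(observed & r.symptoms), reverse=True)


records = [
    Record("incident", "INC-1", {"login timeout", "high cpu"}, "restart the application server"),
    Record("known_error", "KE-7", {"login timeout", "high cpu", "db lock"}, "clear stale database locks"),
    Record("problem", "PRB-3", {"disk full"}),
]

for match in find_workarounds({"login timeout", "db lock"}, records):
    print(match.ref, "->", match.workaround)
```

The person handling the incident supplies symptoms and gets workarounds back; whether a match came from an incident, a problem or a known error record is secondary information.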
Equally, you are only interested in "known errors" as distinct from "problems" because this gives you better information as to what is happening to cause the incident and therefore a better confidence in and understanding of using the workaround.
Service management is also interested in "known error" status as part of tracking current problems as to their current state of progress.
I prefer to think of all these terms as concepts and to use my understanding of the concepts to apply good process where and how it is needed.
The relationships that matter are the practical ones. The theory is just there to help you understand things. Draw up a process flow that will provide good incident resolution and good problem resolution and let the concepts of "known error" and "workaround" fall into their natural place.
Joined: Jul 24, 2009 Posts: 23 Location: Sydney, Australia
Posted: Mon Jul 27, 2009 12:10 pm Post subject:
AndyBostian - I agree with Diarmid and UKVIKING. I would just like to add that the way we work with this situation is that we have a criterion for creating a problem investigation: a P2 incident record with a high urgency to resolve. This is usually when a large number of users are impacted and the service outage / interruption is impacting business. An example of that would be a line outage to a site.
The key focus is to restore the service to the business as soon as possible. The speed of the service restoration, in this case, will depend on whether the line provider needs to be involved etc. If a simple restart of the modem or NTU restores service then great... otherwise it may be through a reduced performance solution, such as a wireless service (until the line is back up again).
On the problem management side, the reason for the outage would be assessed (if problem investigation is warranted). In some organisations and situations it is decided that, since the root cause was outside of the internal IT group's control, it is not worth spending the time and effort to pursue root cause with the vendor. This all depends on how your organisation works. In other cases... service disruptions are assessed against the vendor's agreed uptime and action is taken from there.
_________________
ITIL V3 Capability - Operational Support & Analysis Certified