Posted: Mon Feb 19, 2007 11:42 pm Post subject: Root cause analysis & process improvement action items
When doing an after-the-fact assessment of a high-impact problem, I see at least three ITIL disciplines that need to be considered:
- Availability management: What was the outage time, and how did the incident affect our customers?
- Problem management: What was the root cause of the problem?
- Incident management: Is there anything in our incident management process we could improve so that, should this happen again, our MTTR (mean time to restore) is reduced?
My question is this: what is best practice for gathering this data? Today we use two methods. We gather the availability management data during an Urgent Problem meeting that we hold each day. For the problem and incident management information, we use something we call an After Action Review (AAR), which combines both the problem and incident management questions. We then keep the problem ticket open until all action items are addressed, whether it's an action item to fix the root cause or one to improve the incident management process. This gives us a false reading on how many problems are still awaiting root cause analysis, but if we don't keep the ticket open, we seem to lose accountability for making the incident management process improvements.
I'd appreciate others sharing how they handle these action items to ensure accountability and make sure all action items are followed through.
You may want to create statuses that reflect the true state of the problem investigation. For example, you could use statuses like Pending Root Cause Analysis, Pending Change Implementation, or Pending Process Improvement Implementation.
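A minimal sketch of how those statuses could separate the root-cause backlog from tickets that are only waiting on follow-up actions. It assumes problem records are simple in-memory objects; the names `ProblemStatus`, `Problem`, and `count_by_status` are hypothetical and not taken from any particular ITSM tool.

```python
from dataclasses import dataclass
from enum import Enum


class ProblemStatus(Enum):
    """Hypothetical status values for a problem record (names are illustrative)."""
    PENDING_ROOT_CAUSE_ANALYSIS = "Pending Root Cause Analysis"
    PENDING_CHANGE_IMPLEMENTATION = "Pending Change Implementation"
    PENDING_PROCESS_IMPROVEMENT = "Pending Process Improvement Implementation"
    CLOSED = "Closed"


@dataclass
class Problem:
    ticket_id: str  # hypothetical ticket identifier
    status: ProblemStatus


def count_by_status(problems: list[Problem]) -> dict[ProblemStatus, int]:
    """Count problems per status so the RCA backlog can be reported on its own."""
    counts = {status: 0 for status in ProblemStatus}
    for problem in problems:
        counts[problem.status] += 1
    return counts


problems = [
    Problem("PRB-1", ProblemStatus.PENDING_ROOT_CAUSE_ANALYSIS),
    Problem("PRB-2", ProblemStatus.PENDING_PROCESS_IMPROVEMENT),
    Problem("PRB-3", ProblemStatus.PENDING_PROCESS_IMPROVEMENT),
]
counts = count_by_status(problems)
```

With this split, the ticket stays open (so accountability for the process improvement is kept), but only one of the three tickets counts as awaiting root cause analysis.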
We operate a similar process. All "major system failures" (predominantly P1s and P2s) have a problem manager allocated to support the incident management process. In this capacity, the PM mainly gathers data, works with the resolvers, and provides business escalation, communication, and milestones.
Once the service is restored, we create what we refer to as our Major Incident Report within 48 hours of restoration. This looks at both technical and process (including incident and problem management) failings.
All remedial actions are allocated to an individual within the report, along with a recommended completion date. The report is sent to everyone who has a vested interest or an action allocated to them, and they have 72 hours to respond. After the 72 hours, the report is locked, and the allocated owners (if they have not challenged it) accept the action and milestone. These are then transferred into the PMDB and tracked by the problem manager.
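The 48-hour and 72-hour windows in this process can be sketched as simple deadline arithmetic. This is an illustration only: `RemedialAction`, `report_due`, `lock_time`, and `actions_to_track` are hypothetical names, and what happens to challenged actions is not described in the post, so the sketch simply leaves them out of the transfer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Windows taken from the process described in the post.
REPORT_DUE_AFTER_RESTORATION = timedelta(hours=48)
RESPONSE_WINDOW = timedelta(hours=72)


@dataclass
class RemedialAction:
    """One action from a Major Incident Report (field names are hypothetical)."""
    description: str
    owner: str
    target_date: datetime
    challenged: bool = False


def report_due(restored_at: datetime) -> datetime:
    """Latest time the Major Incident Report should be created."""
    return restored_at + REPORT_DUE_AFTER_RESTORATION


def lock_time(sent_at: datetime) -> datetime:
    """Time at which the report is locked and unchallenged actions are accepted."""
    return sent_at + RESPONSE_WINDOW


def actions_to_track(actions: list[RemedialAction], sent_at: datetime,
                     now: datetime) -> list[RemedialAction]:
    """Return accepted actions to transfer for tracking once the report is locked.

    Before the lock time nothing is transferred, because owners can still
    challenge their actions. Challenged actions are excluded (assumption).
    """
    if now < lock_time(sent_at):
        return []
    return [action for action in actions if not action.challenged]
```

For example, a service restored at 09:00 on 19 Feb would need its report by 09:00 on 21 Feb, and a report sent at 12:00 on 20 Feb would lock at 12:00 on 23 Feb.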
On a weekly basis, each of our IT account managers (of which there are seven) receives a report of their incident, change, and problem statistics. The IT director also receives this, and compliance with it is used as part of their performance review.
The secret: good process and management buy-in.
Rob
Pondering the question: why don't small companies get it?