Posted: Wed May 09, 2007 12:44 pm Post subject: proper incident management of a known problem
Assumptions: we know for sure there is a problem with a server that affects many users. The server group has a problem ticket. We also know it won't be long before users get their access back.
If calls come in related to the server, I assume we still have to record each one as an "incident", not a "problem". Is that correct? And I assume we should attach each incident to the Problem ticket, correct? Should we close the incident tickets immediately if we know the server will be up very soon?
Similarly, users sometimes call because their network is not working, but the cause is a fault on the network itself, not on the user's side. Can we close these incident tickets immediately, since very often the users are satisfied to know that the problem is not just at their PC? We know it will be fixed within the next hour.
Joined: Jan 12, 2007 Posts: 48 Location: Warsaw, Poland
Posted: Wed May 09, 2007 2:53 pm Post subject:
In theory the Incident is closed only after the service is available.
It is however a good question - how is it done in practice?
A user's satisfaction may change from minute to minute, but the Incident remains a fact. The Service Desk staff may also misjudge the user's mood, so I would not rely on that.
I would suggest recording all Incidents, associating them with the Problem, and leaving them open until the server works again.
There is yet another question associated with this one. When the server is down, the Service Desk is flooded with calls. How are they measured? Are they included in the normal statistics, or are they treated differently? I assume it depends on what you measure, but I am interested in when you treat such situations differently and how you adjust the statistics.
_________________ Krzysztof (Chris) Baczkiewicz
IT Standards Support
We are probably going to count all Incident, Problem, and Change tickets, even if the incidents are caused by a Problem, or by a Change ticket that the Problem created, for the same issue. I don't know what most people do here. If you don't count them all, it is hard to sort out which incidents are associated with the Problem and which Changes were caused by which Problem. My software does not do that kind of sophisticated counting. I would think you need someone like a Problem or Incident Manager to monitor this type of situation manually and record how many incidents are associated with the Problem.
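If the tool cannot do this counting itself, a small script over an export of the tickets could. The following is only a rough sketch: the `Incident` record and its `problem_id` field are made-up names for illustration, not fields of any particular service desk product.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Incident:
    incident_id: str
    # Hypothetical link field: None means the incident is not attached to a Problem.
    problem_id: Optional[str] = None


def incidents_per_problem(incidents: List[Incident]) -> Counter:
    """Count how many incidents are attached to each Problem ticket."""
    return Counter(i.problem_id for i in incidents if i.problem_id is not None)


def split_for_reporting(incidents: List[Incident]) -> Tuple[List[Incident], List[Incident]]:
    """Separate Problem-linked incidents from standalone ones so both can be reported."""
    linked = [i for i in incidents if i.problem_id is not None]
    standalone = [i for i in incidents if i.problem_id is None]
    return linked, standalone


sample = [
    Incident("INC-1", "PRB-7"),
    Incident("INC-2", "PRB-7"),
    Incident("INC-3"),
    Incident("INC-4", "PRB-9"),
]
counts = incidents_per_problem(sample)
linked, standalone = split_for_reporting(sample)
print(counts["PRB-7"], len(linked), len(standalone))
```

Keeping the linked and standalone lists separate means every ticket is still counted, while the flood of calls caused by one outage can be shown on its own line in the statistics.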
The incident is opened when the service is disrupted. However many users call, it remains one incident: there is one service disruption, although it may impact several users. This is where the discussion begins on whether to create calls or incidents: either log a call for each user and relate the calls to the incident, or create an incident for each user and work on the parent incident.
The incident remains open until the service is back up, no matter how that is achieved. If you can duplicate the service (e.g. the datastore) of your failing server on a temporary server that is accessible, the incident would be closed, because your users can work. However, the problem is not fixed and would require a planned intervention.
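To make the distinction concrete, here is a minimal sketch of the "one parent incident, many calls" model, assuming a simple in-memory structure. The class and method names (`ParentIncident`, `log_call`, `restore_service`) are invented for illustration and do not come from any specific tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Problem:
    problem_id: str
    resolved: bool = False  # stays False until the root cause is fixed


@dataclass
class ParentIncident:
    incident_id: str
    problem: Problem
    calls: List[str] = field(default_factory=list)
    is_open: bool = True

    def log_call(self, user: str) -> None:
        """Relate another user's call to the single service disruption."""
        self.calls.append(user)

    def restore_service(self) -> None:
        """Close the incident once users can work again, e.g. via a temporary server.
        The linked Problem is deliberately left untouched."""
        self.is_open = False


incident = ParentIncident("INC-100", Problem("PRB-7"))
for user in ("alice", "bob", "carol"):
    incident.log_call(user)

incident.restore_service()  # workaround in place, service is back
print(incident.is_open, incident.problem.resolved, len(incident.calls))
```

After the workaround the incident is closed with three related calls, while the Problem remains unresolved and still needs its planned intervention.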
Let's say your network is out on a daily basis for at least an hour; this might be a problem/known error. Focus on monitoring tools that can probe availability and close the incident automatically when the service comes back up. Focus on the problem, because your end users might get frustrated that they are dealing with it daily. Also, you might hit the threshold of your availability targets.
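The automatic closing could look roughly like the sketch below. It assumes the monitoring tool can be wrapped in a function that answers "is this service up?"; here that probe is a stand-in (`fake_probe`), and the `auto_close` helper and ticket ids are hypothetical.

```python
from typing import Callable, Dict, List


def auto_close(open_incidents: Dict[str, str], is_up: Callable[[str], bool]) -> List[str]:
    """Close every open incident whose service the probe reports as available.

    open_incidents maps incident id -> affected service name.
    Closed incidents are removed from the mapping and their ids are returned.
    """
    closed = [inc for inc, service in open_incidents.items() if is_up(service)]
    for inc in closed:
        del open_incidents[inc]
    return closed


# Stand-in for a real monitoring check (assumption: the tool exposes up/down per service).
status = {"network": True, "fileserver": False}


def fake_probe(service: str) -> bool:
    return status.get(service, False)


open_incidents = {"INC-200": "network", "INC-201": "fileserver"}
closed_now = auto_close(open_incidents, fake_probe)
print(closed_now, open_incidents)
```

Run on a schedule, a check like this closes the network incident as soon as the probe sees the service again, and leaves the file server incident open. The recurring Problem record would still be tracked separately.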