Can an incident uncover multiple problems?

Discussion on issues related directly or largely to ITIL problem management.
Leatherneck71
Newbie
Posts: 3
Joined: Thu Oct 18, 2018 1:11 am

Thu Oct 18, 2018 1:50 am

I am having debates / discussions with different members of the Service Management Group about how we handle our Problem Tickets and their assignment, and I am hoping to get some guidance here. Perhaps we are not following the framework as well as I think we are.

Example:

Loss of network connectivity causes multiple applications to stop functioning and the service is down. This is the Incident.

Root Cause:
Human error: a rookie engineer accidentally shut down both the primary and secondary uplinks to a switch, closed his CR and went for a hot cocoa. He realizes the mistake after 20 min when the Service Desk pages him, then hurries to bring the links back up.

One application does not come back up; all the others do. Monitoring does not catch that this one application is still down. Oops. This is discovered later, and a restart of the application is needed. Some of the files being transferred are also lost for one of the applications that did come up, because storage ran out of room.

Now this is where we may go astray from the framework. Honestly, I am not sure...
We hold a Major Problem Review where we look to identify what went wrong, what we could have done better etc. This is done by the Problem Manager.

So now we have one problem ticket from the incident. The Network team owns the problem ticket because they caused the outage.
During the MPR the following items are identified, and a "problem task" is created under the problem ticket for each:
Task 1: Monitoring for the application that did not recover needs to be enhanced (Tools Team)
Task 2: Application had to be restarted even after connectivity was established, need to fix that. (Application Team)
Task 3: Storage for one of the Apps needs to be adjusted (Storage Team)
Task 4: Review change procedure for verifying interfaces before shutting down and update (Network)

Okay, so I don't want to get carried away here, and my example has some pretty easy fixes. But let's say the application fix could take weeks for the dev team to write code for, the storage is actually outdated and full so an entire new unit needs to be added, and the monitoring tool needs an upgrade before it can actually catch the down app.

Here is my issue with this. The Network team now owns the problem ticket. That makes them accountable for even the tasks that are out of their area of responsibility. They have to chase the other teams to make sure they are actively working the sub task items.

It seems to me that during this "Incident" you actually uncovered multiple "Problems".
Tools team should get their OWN problem ticket, they can have tasks for what it would take to solve their piece.
Application Team should get their OWN problem ticket, they can have tasks for what it would take to solve their piece.
Storage should get their OWN problem ticket, they can have tasks for what it would take to solve their piece.

I could be totally off base here, and maybe all of the tasks should fall under the single problem ticket. But then should it be the Network team owning the problem, or should a Problem Manager own the problem and drive each team?

I am really causing some heated debates here at work over this. My team resolves our part quickly, then is left with an open problem ticket under our control that could go on for weeks or even months (and we report on problem tickets as KPI items).

I also feel that with one problem ticket you mask the "problems" of other groups, because you are just reporting on the "network" problem and everything else is hidden as a task.
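
To make the masking concern concrete, here is a minimal Python sketch of the two ways of recording the example above. It is only an illustration under stated assumptions: the `Task` and `Problem` classes, the `open_problems_by_owner` helper and the rule that a problem stays open until all of its tasks are done are hypothetical, not taken from any ITSM tool or from ITIL itself. The Network task is marked done to mirror "my team resolves our part quickly".

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Task:
    team: str
    summary: str
    done: bool = False


@dataclass
class Problem:
    owner: str
    summary: str
    tasks: list[Task] = field(default_factory=list)

    @property
    def is_open(self) -> bool:
        # Assumption: a problem stays open until every task under it is done.
        return any(not task.done for task in self.tasks)


def open_problems_by_owner(problems):
    """Count open problem tickets per owning team (a hypothetical KPI view)."""
    return Counter(p.owner for p in problems if p.is_open)


# The four findings from the review: (team, summary, done).
findings = [
    ("Tools", "Enhance monitoring for the application that did not recover", False),
    ("Application", "Remove the need for a manual restart after connectivity returns", False),
    ("Storage", "Adjust storage for the affected application", False),
    ("Network", "Review change procedure for verifying interfaces", True),
]

# Option A: one problem ticket owned by Network, everything else as tasks.
single = [Problem("Network", "Network outage follow-up",
                  [Task(team, summary, done) for team, summary, done in findings])]

# Option B: one problem ticket per team, each with its own task.
split = [Problem(team, summary, [Task(team, summary, done)])
         for team, summary, done in findings]

kpi_single = open_problems_by_owner(single)
kpi_split = open_problems_by_owner(split)
print(dict(kpi_single))
print(dict(kpi_split))
```

In option A the only open ticket in the report belongs to Network, even though Network's own task is finished. In option B the open tickets sit with the Tools, Application and Storage teams, and Network's ticket drops off the open list.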

Whew, anyone who read all of that, thank you! Anyone who actually replies....much appreciated!

Thanks all.


UKVIKING
ITIL Expert
Posts: 3639
Joined: Fri Sep 15, 2006 8:00 pm
Location: London, UK

Thu Oct 18, 2018 10:01 am

Clarifications - nit picking
An incident is an interruption to the service being delivered.
A Problem is the unknown underlying root cause of one or more incidents.

What you have here is a gaggle of incidents and NO Problem per se.

The engineer who did the stupid should be fired if he is that DUMB to shut down both links and walk off.
His team lead should be as well. But that is because I am a certified as88ole.
What he did indicates that there is a major process failure in that department. Escalate it to your common mgmt.

All of the services that were shut down as a result need to be identified and an incident raised for each service, and then each service verified as restored. Do this even if the incident has been resolved. THIS WILL SHOW THE DAMN IMPACT the idiot caused.

The ones that did not recover - well - work to get them restored.

NOTE: They may have other issues that are the cause of the incident.

NOTE2: A problem is NOT needed to investigate why something went down. That is part of the Incident mgmt. process.
John Hardesty
ITSM Manager's Certificate (Red Badge)

Change Management is POWER & CONTROL. /....evil laughter
Leatherneck71

Thu Oct 18, 2018 10:21 am

Thanks for the input. For the record, the incident described is a hypothetical example to demonstrate the steps and how the multiple "incidents" get strung together within our apparently incorrect format. I would also argue that you are not a certified a-hole but just one who is firm on following processes and procedures as well as a hefty dose of common sense!

I am going to have to keep going over the appropriate terminology until it becomes natural, but I do very much appreciate the clarifications (not nit picking) and corrections. Keep me honest.

Thanks again!
UKVIKING

Thu Oct 18, 2018 2:19 pm

With your name as a reference, I would presume - not assume - that you are a misguided child?
Leatherneck71

Thu Oct 18, 2018 3:31 pm

I am one of my Uncle's Misguided Children, Semper Fi.
89 to 95 - Not as mean, not as lean.