SLA with low volumes

Forum to discuss ITIL issues and disciplines covering Service Delivery.
botty1963
Newbie
Posts: 1
Joined: Mon Aug 17, 2020 5:16 am

Mon Aug 17, 2020 5:27 am

Hi all. I have a customer where around 80% of the Incidents raised are ultimately found not to be valid Incidents, because they are not faults within our solution. We spend a lot of resource time verifying this, and our SLA performance on genuine Incidents suffers as a result.

I am looking to amend the contract to take this into account, as we effectively fail the SLA almost every month because of this issue. I would therefore like to create a mechanism that accounts for the time we spend working on Incidents outside our remit and offsets it against the current SLA target. It also needs to take into account that we get low volumes of some priorities, P1s for example.

So as an example, we may get 8 P1s in a month. Six of them turn out not to be genuine, but we work on each for an average of, say, 6 hrs (so 36 hrs of opportunity cost). We then fail one of the 2 genuine Incidents and end up with a 50% success rate.
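
To make the arithmetic in this example concrete, here is a rough Python sketch. The `Incident` record and its `valid`, `met_sla` and `hours_spent` fields are names invented for illustration and do not come from any ticketing tool. Measuring the SLA on genuine Incidents only is just one possible mechanism, not an agreed contract term.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    valid: bool               # hypothetical: True if the fault was within our solution
    met_sla: bool             # hypothetical: True if resolved within the SLA target
    hours_spent: float = 0.0  # resource time spent on the ticket

def sla_compliance(incidents):
    """Percentage of incidents that met the SLA, or None for an empty list."""
    if not incidents:
        return None
    return 100.0 * sum(i.met_sla for i in incidents) / len(incidents)

# The month from the example: 6 invalid P1s at 6 hrs each, 2 genuine P1s, one failed.
# met_sla on the invalid tickets is a placeholder and is not used below.
month = [Incident(valid=False, met_sla=True, hours_spent=6.0) for _ in range(6)]
month += [Incident(valid=True, met_sla=True), Incident(valid=True, met_sla=False)]

genuine = [i for i in month if i.valid]
genuine_rate = sla_compliance(genuine)                           # 50.0
wasted_hours = sum(i.hours_spent for i in month if not i.valid)  # 36.0
points_per_failure = 100.0 / len(genuine)                        # 50.0
```

With only 2 genuine P1s in the month, a single failure moves the result by 50 percentage points, which is the low-volume problem described above.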

Any suggestions would be gratefully received.


Corde Wagner
Senior Itiler
Posts: 60
Joined: Fri Nov 10, 2006 7:00 pm
Location: El Dorado Hills, California

Fri Sep 11, 2020 7:34 pm

Greetings,

I do not have enough information to fully understand the issues you are facing, but from the information you have provided, here are some thoughts and questions:

You wrote: 80% of the Incidents raised are ultimately identified as not being valid Incidents as they are not faults within our solution.
Response:
o An incident is an incident: if something that should be working is not working, that is an incident, so it’s not clear what you consider to be “valid”.
o If what is being reported is not supported by your organization (your solution?), that should be identifiable through a catalog of the supported services in your solution. Each incident is aligned with that catalog, and an incident that falls outside it is categorized (and prioritized) accordingly.

You wrote: I would like to create a mechanism that takes into account this time that we spend working on Incidents that are outside of our remit and mitigates against the current SLA target.
Response:
o All incidents must be recorded in the ticketing system, which starts the clock on incident resolution.
o Each incident should be categorized based on supported services and systems.
o The incident team taking in the reported incident must gather, or be given, the impact and severity of the incident.
o Based on the combination of the category (impacted service) and the severity/impact, the incident should be assigned a priority (I assume this is what you mean by “P1’s”).
o The incident management team should handle P1 incidents (the highest urgency in your prioritization scheme), and the SLA and service catalog should give them enough information to quickly determine whether the incident is actually part of your “solution” and whether the P1 priority is correct.
o As a best practice, all P1 incidents should be reviewed (Major Incident Management and Major Problem Management). Where an incident was not properly prioritized, follow up with those who made the initial determination of the priority and provide retraining.
o If the client is overestimating the urgency and impact, which happens a LOT, that too is a signal that client training is needed, and something your team should be aware of so they can better manage the call.
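
As a rough illustration of the intake steps above, here is a minimal Python sketch. The service names, the impact/urgency matrix and the `triage` function are all hypothetical; real values would come from your own service catalog and SLA.

```python
# Hypothetical catalog of supported services; real entries come from the contract.
SUPPORTED_SERVICES = {"billing-api", "customer-portal"}

# Hypothetical matrix mapping (impact, urgency) to a priority.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "low"): "P2",
    ("low", "high"): "P3",
    ("low", "low"): "P4",
}

def triage(service, impact, urgency):
    """Return (in_scope, priority) for a newly recorded incident.

    Out-of-scope incidents are still recorded and prioritized; the
    in_scope flag only says whether the service is in the catalog.
    """
    in_scope = service in SUPPORTED_SERVICES
    priority = PRIORITY_MATRIX[(impact, urgency)]
    return in_scope, priority

supported = triage("billing-api", "high", "high")
unsupported = triage("third-party-email", "high", "high")
```

Keeping the scope check separate from the priority means a ticket for an unsupported service is flagged at intake, before anyone spends hours investigating it.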

You wrote: So as an example, we may get 8 x P1's per month, 6 of which we work on say for an average of 6hrs (so 36hrs opportunity cost), we then fail one of the 2 genuine Incidents and end up with a 50% success rate.
Response:
o It’s not clear what “8 x P1” and “failing one of the 2 genuine incidents” mean, but it sounds like either only 2 of the 8 incidents were really P1 priority, or only 2 of the 8 were real incidents. Either way, see the responses above!
  - Ensure you have a clearly documented catalog of services.
  - Ensure you have clearly documented incident recording processes and procedures that are especially precise on the service, the categorization, and the impact/severity (put these in the details field of the ticket if necessary), and that the prioritization is as true as the incident team can make it. Train and retrain as necessary.
o Look deeply into why resolution is taking longer than what you have in your SLAs. Is the environment you are supporting so complex (or so poorly documented) that it’s difficult to understand what is truly impacted? Is there a great deal of “technical debt” contributing to the longer duration of incidents?
  - Do you have an effective and well-trained “Major Incident Manager” (aka incident commander) leading the resolution effort?
  - Do you have the right people working the incident, or does your major incident management team have to wait for the one available specialist to get online, adding to the long restoration times? (If so, fix that situation!)
  - Technical team members will often spend too much time poking around looking for something wrong, because they don’t know what to do or are afraid to ask for help. Be sure to impose time limits in the major incident process, after which restoration teams must escalate to the next level.
  - Are your teams using the ticketing system to look for previous incidents that are the same or similar, so they can reuse past resolutions?
  - Do you have a known error database (ITIL problem management) or knowledge base that your teams use to research possible issues and solutions?
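
The escalation time limits mentioned above could be sketched roughly as follows. The level names and durations are invented for illustration only; the right limits depend on your own SLA targets.

```python
from datetime import datetime, timedelta

# Hypothetical limits: how long each support level may work a P1 before escalating.
ESCALATION_LIMITS = [
    ("level-1", timedelta(minutes=30)),
    ("level-2", timedelta(hours=1)),
    ("level-3", timedelta(hours=2)),
]

def current_level(opened_at, now):
    """Return the support level that should own the incident at time `now`."""
    elapsed = now - opened_at
    deadline = timedelta(0)
    for level, limit in ESCALATION_LIMITS:
        deadline += limit  # limits are cumulative from the time the ticket opened
        if elapsed < deadline:
            return level
    return "major-incident-manager"  # every limit has been exhausted

opened = datetime(2020, 9, 11, 9, 0)
after_10_min = current_level(opened, opened + timedelta(minutes=10))
after_45_min = current_level(opened, opened + timedelta(minutes=45))
after_4_hrs = current_level(opened, opened + timedelta(hours=4))
```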

Documented processes and supported services will help.
Corde Wagner
ITIL 4 Managing Professional - ITIL v3 Expert - v2 Red Badge - VeriSM-Plus - Certified Agile Service Manager
vsudo
Newbie
Posts: 3
Joined: Sat Apr 03, 2021 1:01 am

Thu May 20, 2021 11:43 am

I was confused by your description of this incident case. I know it was a long time ago, but could you share more information about the case and how you resolved it? Thanks!