ITIL :: View topic - Alerting leads to Problem mgt : how to process this ?
Posted: Fri Jun 06, 2008 1:01 am Post subject: Alerting leads to Problem mgt : how to process this ?
Hi,
Our organization is currently facing a "tsunami" of alerts since we recently plugged in the monitoring tool. This, of course, severely reduces the Service Delivery team's productivity.
A short investigation of the alerts that pop up every day (around 1,500 per day) reveals that:
- a large number are caused by the same incident,
- the cause can be technical or application-related,
- eradication plans often involve both the Operations and Project teams.
Has anybody already faced the same problem and drawn up a procedure (or better, a process) to bring this issue under control?
Joined: Mar 31, 2008 Posts: 109 Location: North West England
Posted: Fri Jun 06, 2008 1:39 am Post subject:
JLB
Been there and have got the t-shirt.
First of all, are the alerts valid? If you don't care what an alert is telling you, then turn it off. If you do care, then raise problem records, prioritise them and fix the underlying issue.
Also, can your system aggregate alerts, i.e. raise 1 incident with a count variable? For example, if a directory is full now (causing an alert), it's likely to still be full in 10 minutes (when the alert fires again). A sensible monitoring system would see that as 1 incident that started at time X and has caused Y alerts. If a user rings the service desk every 10 minutes to tell them that their PC is still broken, do you record these as separate incidents?
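To make the aggregation idea concrete, here is a minimal Python sketch. It assumes each alert arrives as a (timestamp, host, check) tuple and that host plus check is a good enough key to decide two alerts belong to the same incident. The `Incident` class, the key choice and the sample data are all hypothetical, not features of any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    """One open incident that absorbs repeated alerts for the same cause."""
    key: tuple            # (host, check): hypothetical de-duplication key
    started_at: datetime  # time X: when the first alert fired
    alert_count: int = 1  # Y: how many alerts were folded into this incident


def aggregate(alerts):
    """Fold a stream of (timestamp, host, check) alerts into incidents."""
    open_incidents = {}
    for timestamp, host, check in alerts:
        key = (host, check)
        if key in open_incidents:
            # Same cause as an incident already open: just bump the counter.
            open_incidents[key].alert_count += 1
        else:
            open_incidents[key] = Incident(key, started_at=timestamp)
    return list(open_incidents.values())


# Illustrative data only: a full directory re-alerting every 10 minutes.
alerts = [
    (datetime(2008, 6, 6, 9, 0), "srv01", "disk_full"),
    (datetime(2008, 6, 6, 9, 5), "srv02", "cpu_high"),
    (datetime(2008, 6, 6, 9, 10), "srv01", "disk_full"),
    (datetime(2008, 6, 6, 9, 20), "srv01", "disk_full"),
]
incidents = aggregate(alerts)
```

With this sample, four alerts collapse into two incidents, and the `disk_full` incident carries a count of three. A real tool would also need a rule for closing an incident so that a later recurrence opens a new one.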
Lastly, it's worth mentioning that all a monitoring tool does is alert you to incidents before a user does. Therefore, if you are having problems with how to handle the alerts, I would turn it off and get your incident management process sorted out first.
Hope that helps
Mick _________________ Mick Smith
Change, Configuration and Release Manager
Can you set up the monitoring tool so that it can identify Severity 1, 2, 3, etc. events?
By this I mean: don't stop the events from being created, but look at the conditions used to fire each event. For example, if the condition is a warning, set it to create a Severity 3 incident. If the event is for a major issue (e.g. CPU failure or a database going offline), have it create a Sev 1 incident and have your incident management tool send an alert to the relevant people. That way you get alerted to the major incidents but have all the others recorded as event / incident records for review.
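A rough Python sketch of that routing rule follows. The condition names, the Severity 2 entry and the `notify` / `record` callbacks are assumptions made for illustration. Only the warning to Sev 3 and the CPU failure / database offline to Sev 1 mappings come from the post.

```python
# Hypothetical mapping from monitoring condition to incident severity.
SEVERITY_BY_CONDITION = {
    "cpu_failure": 1,       # from the post: major issue, Sev 1
    "database_offline": 1,  # from the post: major issue, Sev 1
    "error": 2,             # assumption: an intermediate level
    "warning": 3,           # from the post: warning, Sev 3
}
DEFAULT_SEVERITY = 3        # assumption: unknown conditions are low priority


def handle_event(condition, notify, record):
    """Record every event as an incident; page people only for Sev 1."""
    severity = SEVERITY_BY_CONDITION.get(condition, DEFAULT_SEVERITY)
    incident = {"condition": condition, "severity": severity}
    record(incident)        # nothing is dropped: all events stay reviewable
    if severity == 1:
        notify(incident)    # only major incidents interrupt someone
    return incident


recorded, paged = [], []
for condition in ["warning", "cpu_failure", "error", "database_offline"]:
    handle_event(condition, notify=paged.append, record=recorded.append)
```

Here all four events end up in `recorded`, while only the two Sev 1 events reach `paged`.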
It does help, but it does not solve the overall issue of relating the flood of events that come in. Get a handle on the important ones first and then look to deal with the rest.
Granted, a CPU failure can trigger a multitude of other events. Sometimes I have found you just have to stop the flood of events, fix the issue and then switch it back on. _________________ Mark O'Loughlin
ITSM / ITIL Consultant
Joined: Sep 16, 2006 Posts: 3110 Location: London, UK
Posted: Fri Jun 06, 2008 2:02 am Post subject:
Usually NMS tools generate alerts so that INCIDENT records get created, actioned and closed.
Problems (ITIL-defined problems) usually don't get created via the NMS tool.
NMS tools also have filters like: if an alert clears within 5 minutes, it disappears.
However, this does not mean that the team doing the NMS monitoring role ignores all alerts because they appear/disappear within 5 minutes..... because they appear/disappear within 5 minutes..... because they appear/disappear within 5 minutes.....
Something called Human Intelligence needs to decide whether the pattern is reason enough to generate an INCIDENT ticket,
and then whether the incident ticket warrants raising a Problem ticket to solve the unknown underlying problem, rather than just restoring service (Incident)!!!! _________________ John Hardesty
ITSM Manager's Certificate (Red Badge)
Change Management is POWER & CONTROL. /....evil laughter
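The 5 minute filter and the "recurring pattern" point can be combined in a small Python sketch. It assumes each alert is a (source, raised_at, cleared_at) tuple. The `FLAP_THRESHOLD` value and the `triage` function are hypothetical, and the code only surfaces candidates: a human still decides whether an incident or a problem is raised.

```python
from collections import Counter
from datetime import datetime, timedelta

CLEAR_WINDOW = timedelta(minutes=5)  # from the post: alerts clearing within 5 minutes are filtered
FLAP_THRESHOLD = 3                   # assumption: repeats before a human should take a look


def triage(alerts):
    """Split alerts into those needing an incident and recurring short-lived ones.

    Each alert is (source, raised_at, cleared_at); cleared_at is None
    while the alert is still active.
    """
    needs_incident = []
    short_lived = Counter()
    for source, raised_at, cleared_at in alerts:
        if cleared_at is not None and cleared_at - raised_at <= CLEAR_WINDOW:
            short_lived[source] += 1       # filtered out, but still counted
        else:
            needs_incident.append(source)
    # Sources that keep appearing and disappearing are flagged for review.
    needs_review = sorted(s for s, n in short_lived.items() if n >= FLAP_THRESHOLD)
    return needs_incident, needs_review


t0 = datetime(2008, 6, 6, 8, 0)
minutes = lambda n: timedelta(minutes=n)
alerts = [
    ("router-a", t0, t0 + minutes(2)),
    ("router-a", t0 + minutes(10), t0 + minutes(13)),
    ("router-a", t0 + minutes(20), t0 + minutes(21)),
    ("switch-b", t0, t0 + minutes(1)),
    ("db-c", t0, None),
]
needs_incident, needs_review = triage(alerts)
```

In this sample, `db-c` needs an incident straight away, `switch-b` is silently filtered, and `router-a` is flagged because it flapped three times.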
Posted: Fri Jun 06, 2008 10:43 pm Post subject: Thanks & further details
Thank you for those very quick answers.
Actually, I think I could have been more specific in my question: my current concern is to set up a sound organization and process to decide, for a set of correlated alerts (e.g. same object, same server...), whether I should:
- tune my monitoring tools (alerting thresholds, for instance),
- change something on the hardware,
- update the instructions manual
- ask analysts to patch their developments,
- whatever ...
I was thinking more of setting up a process, along the lines of a draft I could send you (since I don't know how to link an image here).
Whom do I have to involve? In which cases? Who should be in charge of coordination? And so on...
Please excuse my English (we French people are not always very at ease with your language).
Kind regards,
JLB
Joined: Sep 16, 2006 Posts: 3110 Location: London, UK
Posted: Fri Jun 06, 2008 11:05 pm Post subject:
The answer to your question is: it depends.
First,
you have to have a defined Incident mgmt process.
This should have an addendum to deal with automated tools, alerts and System Monitoring (SM) alerts.
These SM alerts should be used to create incidents.
If the alert says 'insufficient memory... system crash',
then the system people for that system would get the incident, restore service and THEN investigate why there was insufficient memory.
This is PROBLEM MGMT.
For example, the investigation reveals that an application
(call it SarkozyThoughtProcess - hey - I saw a comment that his wife ... like his six brains ----)
needs # amount of RAM and ## amount of hard drive for swap space.
There seems to be insufficient hard drive space... therefore the solution to the problem is to add a new hard drive with ##^8 and use it solely for this application.
a change is raised to implement
it gets approved
it gets scheduled
it gets implemented
--------- meanwhile... the system suffers the alerts, the system team restores service, and tickets are raised (incidents are generated and linked to the existing problem, which is being dealt with)
and lo, after the implementation... the alerts disappear
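The incident-to-problem linkage in this example could be modelled as below. This is only a sketch under stated assumptions: the `Problem` and `Incident` classes, their fields and the status values are hypothetical and not taken from any ITSM tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Problem:
    summary: str
    status: str = "open"  # stays "open" until the approved change is implemented
    incident_ids: list = field(default_factory=list)


@dataclass
class Incident:
    incident_id: int
    summary: str
    problem: Optional[Problem] = None  # set once the incident is linked


def link(incident, problem):
    """Attach a recurring incident to the problem already under investigation."""
    incident.problem = problem
    problem.incident_ids.append(incident.incident_id)


problem = Problem("Insufficient swap space for the application")

# While the change waits for approval and scheduling, each new alert still
# produces an incident, but it is linked instead of investigated from scratch.
for n in range(1, 4):
    link(Incident(n, "insufficient memory alert"), problem)

# After the change is implemented and the alerts stop, the problem is closed.
problem.status = "resolved"
```

The list of linked incidents also gives a simple measure of how much the problem cost while the change was pending.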
-----------------------------
In regards to your question... all five can be done or not done, depending on the results of the analysis of the problem.
If the alert threshold is set too low or too high, then this should feed into the System mgmt people to investigate the impact of more or fewer alerts.
....NOTE: Before the 5 minute rule went into effect, every alert had to have an incident ticket, otherwise it did not clear.
We generated thousands of useless tickets. _________________ John Hardesty
ITSM Manager's Certificate (Red Badge)
Change Management is POWER & CONTROL. /....evil laughter
What I understand is that the process I'm trying to define seems to be a hazy mix of incident and problem mgt, since the number of alerts is a bit confusing and generates far more problems than my team is able to deal with.
I shall think about an effective dashboard to help follow up on so many eradication action plans scattered among so many IT teams.
Any template?
In any case, MANY MANY thanks for your advice: one often needs somebody to remind one of the basics of Service Management when operations are ... intense.
Regards,
JLB
The three levels should be, say, 85%, 90% and 95% of space,
so then your incident (problem) process too would be linked to doing something about each level through a series of incident ticket states. _________________ John Hardesty
ITSM Manager's Certificate (Red Badge)
Change Management is POWER & CONTROL. /....evil laughter
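As a small illustration of the three-level idea, the Python sketch below maps disk usage to a ticket state. The 85 / 90 / 95 percentages come from the post. The state names and the `incident_state` function are assumptions chosen for the example.

```python
# Thresholds (percent of space used) come from the post;
# the state names are hypothetical. Highest level is checked first.
THRESHOLDS = [(95, "critical"), (90, "major"), (85, "warning")]


def incident_state(used_percent):
    """Return the ticket state for a usage level, or None below all thresholds."""
    for threshold, state in THRESHOLDS:
        if used_percent >= threshold:
            return state
    return None


states = [incident_state(p) for p in (84, 85, 92, 95)]
```

Each state can then be tied to a specific action in the incident process, so that one ticket moves through the states instead of three unrelated alerts being raised.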