Note: ITIL is a registered trademark of OGC. This portal is totally independent and is in no way related to them. See our Feedback Page for more information.


The Itil Community Forum: Forums

ITIL :: View topic - Alerting leads to Problem mgt : how to process this ?
jayalebee
Newbie


Joined: Jun 05, 2008
Posts: 3
Location: PARIS

Posted: Fri Jun 06, 2008 1:01 am    Post subject: Alerting leads to Problem mgt : how to process this ?

Hi,

Our organization is currently facing a "tsunami" of alerts since we recently plugged in the monitoring tool. This of course dreadfully diminishes the Service Delivery team's productivity.
A short investigation of the alerts that pop up every day (around 1500 / day) reveals that:
- a large number are caused by the same incident
- the cause can be technical or application-related
- eradication plans often involve both Operations and Project teams.

Has anybody already faced the same problem and drawn up a procedure (or better, a process) to bring this issue under control?

Many thanks in advance.
Regards
JLB
mnsmith
Senior Itiler


Joined: Mar 31, 2008
Posts: 109
Location: North West England

Posted: Fri Jun 06, 2008 1:39 am

JLB

Been there and have got the t-shirt.

First of all, are the alerts valid? If you don't care what an alert is telling you, then turn it off. If you do care, then raise problem records, prioritise them and fix the underlying issue.

Also, can your system aggregate alerts, i.e. raise 1 incident with a count variable? For example, if a directory is full now (causing an alert), it's likely to still be full in 10 minutes (when the alert fires again). A sensible monitoring system would see that as 1 incident that started at time X and has caused Y alerts. If a user rings the service desk every 10 minutes to tell them that their PC is still broken, do you record these as separate incidents?
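To make the "1 incident with a count variable" idea concrete, here is a minimal Python sketch. It is only an illustration, not the behaviour of any particular monitoring product: the `Incident` class, the `aggregate` function and the choice of `(host, check)` as the correlation key are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    key: tuple            # (host, check): a hypothetical correlation key
    started_at: datetime  # "time X" of the first alert
    last_seen: datetime   # time of the most recent repeat
    alert_count: int = 1  # "Y alerts" caused by this incident


def aggregate(alerts):
    """Fold repeated alerts for the same (host, check) into one incident.

    `alerts` is an iterable of (timestamp, host, check) tuples.
    """
    incidents = {}
    for ts, host, check in sorted(alerts):
        key = (host, check)
        incident = incidents.get(key)
        if incident is None:
            # First alert for this key: open a new incident.
            incidents[key] = Incident(key, started_at=ts, last_seen=ts)
        else:
            # Repeat alert: bump the counter instead of opening another incident.
            incident.alert_count += 1
            incident.last_seen = ts
    return list(incidents.values())


# A full directory that re-alerts every 10 minutes for an hour.
start = datetime(2008, 6, 6, 9, 0)
alerts = [(start + timedelta(minutes=10 * i), "srv01", "disk_full") for i in range(6)]
for incident in aggregate(alerts):
    print(incident.key, incident.started_at, incident.alert_count)
```

With this kind of grouping, the six alerts above end up as a single incident with a count of 6, which is the same treatment a service desk would give a user who keeps ringing about the same broken PC.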

Lastly, it's worth mentioning that all a monitoring tool does is alert you to incidents before a user does. Therefore, if you are having problems with how to handle the alerts, I would turn it off and get your incident management process sorted out first.

Hope that helps

Mick
_________________
Mick Smith
Change, Configuration and Release Manager
Mark-OLoughlin
Senior Itiler


Joined: Oct 12, 2007
Posts: 306
Location: Ireland

Posted: Fri Jun 06, 2008 1:40 am

Hi,

Can you set up the monitoring tool so that it can identify Severity 1, 2, 3 etc. type events?

By this I mean don't stop the events from being created, but look at the conditions used to fire the event, i.e. if the condition is a warning event then set it to create a Severity 3 incident. If the event is for a major issue, e.g. CPU failure or a database going offline, then have it create a Sev 1 incident and have your incident management tool send an alert to the relevant people. That way you get alerted to the major incidents but have all others recorded as events / incident records for review.
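As a rough sketch of that routing rule in Python: every event is still recorded, but only Severity 1 triggers a notification. The event type names, the `route_event` function and the choice of Severity 3 as the default for unknown event types are assumptions made for illustration, not settings from any specific tool.

```python
# Hypothetical mapping from event type to incident severity.
SEVERITY_BY_EVENT = {
    "cpu_failure": 1,       # major issue: Sev 1
    "database_offline": 1,  # major issue: Sev 1
    "warning": 3,           # warning condition: Sev 3
}
DEFAULT_SEVERITY = 3  # assumption: unknown event types are recorded as Sev 3


def route_event(event_type):
    """Record every event as an incident; notify people only for Sev 1."""
    severity = SEVERITY_BY_EVENT.get(event_type, DEFAULT_SEVERITY)
    return {
        "event_type": event_type,
        "severity": severity,
        "notify": severity == 1,
    }


for event_type in ("cpu_failure", "warning", "fan_speed_low"):
    print(route_event(event_type))
```

Nothing is dropped here: the lower severity records simply sit in the incident tool for later review instead of paging anyone.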

It does help, but it does not solve the overall issue of relating the flood of events that come in. But get a handle on the important ones and then look to deal with the rest.

Granted, a CPU failure can trigger a multitude of other events. Sometimes I have found you just have to stop the flood of events, fix the issue and then switch it back on.
_________________
Mark O'Loughlin
ITSM / ITIL Consultant
UKVIKING
Senior Itiler


Joined: Sep 16, 2006
Posts: 3251
Location: London, UK

Posted: Fri Jun 06, 2008 2:02 am

Usually NMS tools generate alerts so that INCIDENT records get created, actioned and closed.

Problems (ITIL-defined problems) usually don't get created via the NMS tool.

NMS tools also have filters like: if an alert clears within 5 minutes, it disappears.

However, this does not mean that the team doing the NMS monitoring role ignores all alerts just because they appear and disappear within 5 minutes.

Something called Human Intelligence needs to decide if the pattern is reason enough to generate an INCIDENT ticket,

and then if the incident ticket warrants a Problem ticket being raised to solve the unknown underlying problem rather than just restore service (Incident)!
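One way to picture the 5 minute filter without losing the pattern is the Python sketch below. Short-lived alerts are not turned into incident candidates, but they are counted, and anything that keeps flapping is put in front of a person. The `triage` function, the `REVIEW_AFTER` threshold of 3 and the event tuple layout are assumptions for the example; the final decision to raise an incident or a problem stays with a human.

```python
from datetime import datetime, timedelta

CLEAR_WINDOW = timedelta(minutes=5)  # alerts clearing within this window are filtered
REVIEW_AFTER = 3  # hypothetical: this many short-lived repeats go to human review


def triage(events):
    """Split events into incident candidates and flapping checks to review.

    `events` is a list of (check, raised_at, cleared_at) tuples, where
    cleared_at is None if the alert is still active.
    """
    incident_candidates = []
    short_lived = {}
    for check, raised_at, cleared_at in events:
        if cleared_at is not None and cleared_at - raised_at <= CLEAR_WINDOW:
            # Filtered by the 5 minute rule, but still counted.
            short_lived[check] = short_lived.get(check, 0) + 1
        else:
            incident_candidates.append(check)
    needs_review = sorted(c for c, n in short_lived.items() if n >= REVIEW_AFTER)
    return incident_candidates, needs_review


t0 = datetime(2008, 6, 6, 9, 0)
minutes = lambda n: timedelta(minutes=n)
events = [
    ("link_flap", t0, t0 + minutes(2)),
    ("link_flap", t0 + minutes(10), t0 + minutes(13)),
    ("link_flap", t0 + minutes(20), t0 + minutes(21)),
    ("disk_full", t0, None),
    ("cpu_high", t0, t0 + minutes(1)),
]
candidates, review = triage(events)
print("incident candidates:", candidates)
print("patterns for a human to review:", review)
```

In this example `disk_full` becomes an incident candidate, the single `cpu_high` blip is ignored, and `link_flap` is listed for review because it cleared quickly three times.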
_________________
John Hardesty
ITSM Manager's Certificate (Red Badge)

Change Management is POWER & CONTROL. /....evil laughter
jayalebee
Newbie


Joined: Jun 05, 2008
Posts: 3
Location: PARIS

Posted: Fri Jun 06, 2008 10:43 pm    Post subject: Thanks & further details

Thank you for those very quick answers.
Actually, I think I could have been more specific in my question: my current concern is to set up a sound organization and process to best decide, for a set of correlated alerts (e.g. same object, same server ...), whether I should:
- tune my monitoring tools (alerting thresholds for instance),
- change something on the hardware,
- update the instructions manual,
- ask analysts to patch their developments,
- whatever ...

I was thinking more of setting up a process like a draft I could send you (since I don't know how to link an image here).
Who do I have to involve? In what case? Who should be in charge of coordination? And so on ...

Please excuse my English (we French people are not always very at ease with your language).
Kind regards,
JLB
UKVIKING
Senior Itiler


Joined: Sep 16, 2006
Posts: 3251
Location: London, UK

Posted: Fri Jun 06, 2008 11:05 pm

The answer to your question is: it depends.

First

You have to have a defined Incident mgmt process first.
This should have an addendum to deal with automated tools, alerts & System monitoring (SM) alerts.

These SM alerts should be used to create incidents.

If the alert says 'insufficient memory... system crash'

then the system people for that system would get the incident, restore service and THEN investigate why there was insufficient memory.

This is PROBLEM MGMT.

For example, the investigation reveals that an application

Call it SarkozyTHoughtProcess - hey - I saw a comment that his wife ... like his six brains ----

needs # amount of RAM and ## amount of hard drive for swap space.

There seems to be insufficient hard drive space, therefore the solution to the problem is: add a new hard drive with ##^8 and use it solely for this application.

a change is raised to implement
it gets approved
it gets scheduled
it gets implemented

---------meanwhile... the system suffers the alerts, the system team restores service, and tickets (incidents) are generated and linked to the existing problem which is being dealt with

and lo, after the implementation, the alerts disappear
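The "meanwhile" step, where repeat incidents are linked to the problem that is already being worked on, can be sketched in Python as below. This is only an illustration: the `Problem` and `ProblemRegister` names, the `(host, symptom)` matching key and the ticket IDs are all hypothetical. Note that the problem record is opened explicitly by a person, not by the monitoring tool.

```python
from dataclasses import dataclass, field


@dataclass
class Problem:
    problem_id: str
    cause_key: tuple  # (host, symptom): a hypothetical matching key
    incident_ids: list = field(default_factory=list)


class ProblemRegister:
    """Keeps open problems and links repeat incidents to them."""

    def __init__(self):
        self._open = {}

    def open_problem(self, problem_id, host, symptom):
        # Raised by a person after investigation, not by the monitoring tool.
        problem = Problem(problem_id, (host, symptom))
        self._open[problem.cause_key] = problem
        return problem

    def link_incident(self, incident_id, host, symptom):
        # Link the incident to a matching open problem, if there is one.
        problem = self._open.get((host, symptom))
        if problem is not None:
            problem.incident_ids.append(incident_id)
        return problem

    def close_problem(self, host, symptom):
        # Called once the approved change has been implemented.
        return self._open.pop((host, symptom), None)


register = ProblemRegister()
register.open_problem("PRB-1", "app01", "insufficient_memory")
register.link_incident("INC-101", "app01", "insufficient_memory")
register.link_incident("INC-102", "app01", "insufficient_memory")
closed = register.close_problem("app01", "insufficient_memory")
print(closed.problem_id, closed.incident_ids)
```

Each incident still gets its own ticket so that service is restored, but the link shows how often the known problem bites while the change is waiting to be implemented.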

-----------------------------

In regard to your question: all five can be done or not be done, depending on the results of the analysis of the problem.

If the alert threshold is set too low or too high, then this should be fed back to the System mgmt people to investigate the impact of more or fewer alerts.

NOTE: Before the 5 minute rule went into effect, every alert had to have an incident ticket, otherwise it did not clear.
We generated thousands of useless tickets.
_________________
John Hardesty
ITSM Manager's Certificate (Red Badge)

Change Management is POWER & CONTROL. /....evil laughter
jayalebee
Newbie


Joined: Jun 05, 2008
Posts: 3
Location: PARIS

Posted: Fri Jun 06, 2008 11:33 pm

Thanks John,

What I understand is that the process I'm trying to define seems to be a hazy mix of incident and problem mgt, since the number of alerts is confusing and generates far more problems than my team is able to deal with.
I shall think of an effective dashboard that helps to follow up so many eradication action plans scattered among so many IT teams.
Any template?

In any case, MANY MANY thanks for your advice: one often needs somebody to remind one of the basics of Service Management when operations are ... intense.
Regards,
JLB
UKVIKING
Senior Itiler


Joined: Sep 16, 2006
Posts: 3251
Location: London, UK

Posted: Sat Jun 07, 2008 1:55 am

Please separate the alert management process from the incident and problem mgmt processes.

Develop it with your system monitoring team, with reps from systems, applications, etc.

And decide on the basic alerts for outages, degradation etc. and what they mean.

Then decide the levels for each alert type.
For example, swap space alerts for unix systems should have 2-3 levels:

warning threshold (medium to low priority)
alert threshold (medium priority incident)
critical threshold (critical incident)

The three levels should be, say, 85%, 90% and 95% of space.

So then your incident (problem) process too would be linked to do something about each level through a series of incident ticket states.
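A minimal Python sketch of the three-level swap space example, using the 85% / 90% / 95% figures above. The `classify_swap_usage` function is hypothetical, and treating a value exactly on a threshold as having reached that level is an assumption; a real monitoring tool would have its own configuration for this.

```python
# (threshold %, level, resulting priority), checked from most to least severe.
SWAP_THRESHOLDS = [
    (95, "critical", "critical incident"),
    (90, "alert", "medium priority incident"),
    (85, "warning", "medium to low priority"),
]


def classify_swap_usage(percent_used):
    """Return (level, priority) for a swap usage percentage, or None below 85%."""
    for threshold, level, priority in SWAP_THRESHOLDS:
        if percent_used >= threshold:  # assumption: reaching the threshold counts
            return level, priority
    return None


for usage in (70, 86, 92, 97):
    print(usage, classify_swap_usage(usage))
```

Each returned level can then be mapped to a ticket state or action in the incident process, so that a warning is handled differently from a critical alert.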
_________________
John Hardesty
ITSM Manager's Certificate (Red Badge)

Change Management is POWER & CONTROL. /....evil laughter
All times are GMT + 10 Hours