ITIL :: View topic - Reducing Fake alerts - Best Practice
Posted: Mon May 18, 2009 2:58 pm Post subject: Reducing Fake alerts - Best Practice
Hi,
I would like a suggestion on Event Management.
We are moving to a Service Desk model, and as part of the service desk we have alerts coming to our Lotus Notes.
- We receive about 500 to 650 alerts a day.
- Most of these alerts are fake alerts.
- Of these 500 to 650 alerts, we work on only about 100 per day on average.
I do not see this as a best practice, as we need to assign an analyst just to monitor the alert emails that we receive.
Could you suggest how we can reduce the fake alerts?
Hello,
Decide which alerts are of interest to you and adjust the templates of the monitoring software accordingly.
I have used HP OpenView in the past, and at first we were getting alerts of all descriptions.
Example: a service stopping on a server would generate an alert. Multiply that scenario by 500 servers and you have a problem.
Tweaking the templates to suit your needs and your clients' needs is the way to go.
Good luck.
Joined: Sep 16, 2006 Posts: 3118 Location: London, UK
Posted: Mon May 18, 2009 6:13 pm Post subject:
What do you mean by fake alerts?
An alert only happens when the conditions for which the alert is set are met.
That being said, you need to either
a) adjust the criteria for the alert
or
b) establish procedures for dealing with the alerts
ANECDOTE
I used HP OV to manage and monitor the network portion of our environment:
five data centres, with the switches and routers in each all reporting.
There were a lot of alerts from routers about a link going up/down for 1 second.
Unless the link was down for 5 minutes, the alert was ignored.
Unless the link was bouncing more than x times, the repeating alerts were ignored.
You have to combine the tweaking of the NMS tools with operational process and procedure. _________________ John Hardesty
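The two rules in the anecdote could be sketched roughly as below. This is only an illustration, not HP OV configuration: the class `LinkFilter`, its method names, the flap-counting window, and the default of 5 flaps (standing in for the unspecified "x times") are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LinkFilter:
    """Hypothetical filter for link up/down events on one interface."""
    min_down_seconds: float = 300.0     # "down for 5 minutes" from the anecdote
    max_flaps: int = 5                  # assumed value for the unspecified "x times"
    flap_window_seconds: float = 600.0  # assumed window for counting bounces
    down_since: Optional[float] = None
    flap_times: List[float] = field(default_factory=list)

    def on_down(self, now: float) -> bool:
        """Record a link-down event; alert only if the link is bouncing too often."""
        self.down_since = now
        # Keep only the bounces that fall inside the counting window.
        self.flap_times = [t for t in self.flap_times
                           if now - t <= self.flap_window_seconds]
        self.flap_times.append(now)
        return len(self.flap_times) > self.max_flaps

    def on_up(self, now: float) -> None:
        """Record a link-up event; the outage timer is cleared."""
        self.down_since = None

    def on_poll(self, now: float) -> bool:
        """Alert if the link has stayed down for at least min_down_seconds."""
        return (self.down_since is not None
                and now - self.down_since >= self.min_down_seconds)
```

A one-second bounce never reaches `min_down_seconds`, so it produces no alert unless it keeps repeating inside the window.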
ITSM Manager's Certificate (Red Badge)
Change Management is POWER & CONTROL. /....evil laughter
Joined: Mar 04, 2008 Posts: 1883 Location: Newcastle-under-Lyme
Posted: Mon May 18, 2009 7:08 pm Post subject:
Which just goes to prove that automation makes things so much easier...or something.
My view when you are about to implement a system like this:
1. switch off all alerts.
2. decide what you want alerted about, thresholds, severity etc.
3. decide if there is anything else you want someone to know about, but not the service desk (perhaps a log history).
4. switch on the alerts you want at the level you want and routed to whom you want and graded how you want.
5. test.
6. implement.
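As a rough sketch of steps 1 to 4, the "off by default" idea can be expressed as a rule table in which only explicitly listed events produce anything. The event names, thresholds, severities and routes below are invented for illustration; they are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AlertRule:
    threshold: float  # value at or above which the rule fires
    severity: str     # grading, e.g. "critical" or "warning"
    route: str        # who is told: a team name, or "log_only" for history only


# Step 1: nothing alerts unless it appears here.
# Steps 2-4: each entry fixes a threshold, a severity and a destination.
RULES: Dict[str, AlertRule] = {
    "cpu_percent": AlertRule(95.0, "warning", "server_team"),
    "disk_percent": AlertRule(90.0, "critical", "service_desk"),
    "login_failures": AlertRule(20.0, "info", "log_only"),  # step 3: log history
}


def route_event(name: str, value: float) -> Optional[AlertRule]:
    """Return the matching rule if the event should be raised, else None."""
    rule = RULES.get(name)
    if rule is None or value < rule.threshold:
        return None
    return rule
```

Step 5 then amounts to feeding known values through `route_event` and checking that each one lands where you expect before going live.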
Harness the power before you use it. _________________ "Method goes far to prevent trouble in business: for it makes the task easy, hinders confusion, saves abundance of time, and instructs those that have business depending, both what to do and what to hope."
William Penn 1644-1718
Posted: Sun Sep 19, 2010 2:21 pm Post subject: Event Tickets
Very interesting topic.
But I'm wondering why it is not possible to make the reporting tool close the fake alert automatically.
Let's say MOM/Tivoli notices that a service on a particular server has not responded for 5 seconds, and an event ticket is generated.
The cause may be overload on the server or some other issue, but the service reports back as normal at the next check.
Why can't the tool be configured so that this event is closed automatically, rather than making the engineer log in to the server and check the service?
After all, this is a fake event.
Joined: Mar 04, 2008 Posts: 1883 Location: Newcastle-under-Lyme
Posted: Sun Sep 19, 2010 7:05 pm Post subject:
Well dlinking, an interesting idea, but if you have decided that 5 seconds does not indicate loss of service why not just change the alert criterion to ten or twenty or whatever does indicate a loss of service, and save all that "artificial intelligence" stuff for a rainy day?
Why add a further layer of (probably insufficient) automation to deal with the deficiency of the first one? _________________ "Method goes far to prevent trouble in business: for it makes the task easy, hinders confusion, saves abundance of time, and instructs those that have business depending, both what to do and what to hope."
William Penn 1644-1718
First of all, let's not use the term "fake alert". That suggests that it was an alert about an event that did not occur, which is not true. The events did occur as per the defined thresholds in place. The only issue in this situation is that many of the alerts are apparently not actionable.
An event is the occurrence of a condition that meets a number of predefined criteria. These criteria typically consist of a change in state (up/down, on/off), a threshold being exceeded (> 95% utilization of a resource), the existence of a particular warning or error message in a log file, or something of that nature. You don't necessarily have to send an alert for every defined event. You only want alerts for events that are actionable. Those are typically events that already are incidents or that may become incidents if no action is taken. Most monitoring tools will allow you to define rules around this, usually based on duration, frequency of occurrence, and so on. Most tools also allow events to be automatically marked as 'closed' when the condition that started the event no longer exists.
Let's look at some examples:
1) You want an event generated when a server exceeds 95% CPU utilization. However, you only want to receive an alert and have an event ticket opened when CPU remains above 95% for 5 consecutive polls at 10 minute polling intervals (in other words: for at least 40 minutes CPU is over 95% every time it is checked). You may also want the event ticket to be automatically closed by the monitoring tool once CPU drops under 95%. In this example you are still capturing all events, and the monitoring tool will typically store these in an event database. However, an event ticket (which requires somebody to take a look at the server) and the corresponding alert will only be issued when the server continues to run at such high utilization. After all, a single spike over 95% is not a concern for many servers.
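A minimal sketch of the rule in example 1, assuming the monitoring tool calls a function once per poll. `CpuTicketTracker` and `record_poll` are hypothetical names, and recording the raw events in the event database is left out.

```python
from dataclasses import dataclass


@dataclass
class CpuTicketTracker:
    """Hypothetical per-server state for the 95% CPU rule."""
    threshold: float = 95.0   # CPU percentage that counts as an event
    required_polls: int = 5   # consecutive polls needed before a ticket
    consecutive: int = 0
    ticket_open: bool = False

    def record_poll(self, cpu_percent: float) -> str:
        """Return "open", "close" or "none" for this poll."""
        if cpu_percent > self.threshold:
            self.consecutive += 1
            if self.consecutive >= self.required_polls and not self.ticket_open:
                self.ticket_open = True
                return "open"   # sustained high CPU: alert and open a ticket
            return "none"       # still an event, but not yet actionable
        # CPU is back at or below the threshold: reset and auto-close if needed.
        self.consecutive = 0
        if self.ticket_open:
            self.ticket_open = False
            return "close"
        return "none"
```

A single spike resets on the next normal poll and never opens a ticket, while a ticket that was opened is closed by the tool itself as soon as the condition clears.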
2) You are monitoring Uninterruptible Power Supplies (UPS). One type of event that you might be interested in is when utility power is lost and the UPS battery has to supply power. When this occurs there really is nothing wrong; the UPS is doing what it is designed to do. Whether you want to receive an alert and have an event ticket opened depends on what you want to do with it. If you need the alert so that you can quickly and gracefully shut down the servers that depend on the UPS, then by all means generate it (for the sake of this example let's assume the UPS does not initiate this shutdown automatically). However, if this shutdown needs to occur within 5 minutes (before the UPS battery dies) and it normally takes you at least 20 minutes to respond to an alert, then the alert might be useless and not worth sending. Having this event captured in the event database is useful in any case, as it allows you to determine the success rate of the UPS.
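The reasoning in example 2 comes down to one comparison, sketched here with a hypothetical helper. The function name and the optional shutdown-duration parameter are assumptions added for illustration.

```python
def ups_alert_is_actionable(battery_runtime_min: float,
                            response_time_min: float,
                            shutdown_duration_min: float = 0.0) -> bool:
    """True if someone can respond and finish the shutdown before the battery dies."""
    return response_time_min + shutdown_duration_min <= battery_runtime_min
```

With the figures from the example, `ups_alert_is_actionable(5, 20)` is false, so the event would be logged but no alert sent.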
Bottom line: don't generate alerts and event tickets if they are not actionable. You can still capture the events for after-the-fact analysis. _________________ Manager of Problem Management
Fortune 100 Company
ITIL Certified