Posted: Wed Nov 21, 2007 3:45 am Post subject: Define an Outage
During a cross-functional meeting today, an interesting observation was raised. What, in fact, is an outage?
Sometimes, the answer is black & white. A service is completely unavailable because somebody shut a server down or some such event.
Other times, however, we have a grey area. Suppose, for example, that a server is experiencing a "memory leak" due to a particular application. The server is up and running, the application is up and running, but response is so slow that the system is unusable. Although the switch is on, nobody can use the system to do their job - or maybe 1% of the people can - or 5% - or 25%?
Or suppose that an application is up and running, but for some reason a popular module of the application is non-responsive. The application (service) isn't down, but for all intents and purposes it's disrupted.
We expect our technicians to tell us if a service is or is not experiencing an outage. When they are left to their own judgement, we get inconsistency across the enterprise (this is a Fortune 500 company).
What black & white criteria can we present to our technicians to define what constitutes an outage?
Joined: Jan 03, 2007 Posts: 189 Location: Redmond, WA
Posted: Wed Nov 21, 2007 6:56 am Post subject: Re: Define an Outage
I think you may be allowing the Technical side of the house to define an outage. According to ITIL, the Business is the group that defines an outage. This is done through the Service Level Management process and encompasses both interruptions of service and performance degradation.
Also, ITIL doesn't use the term Outage except when the Availability Management process is calculating availability. ITIL uses the term Incident. An Incident is defined as any event outside the normal delivery of a service that causes (or may cause) an interruption or degradation in the service.
Per this definition, a server in a cluster that fails would meet the criteria of an Incident. The increased risk of an outage and the increased risk of performance degradation, even though there was no perceivable effect on the delivery of the service, are grounds for it to be considered an Incident.
If you are measuring Availability and need to know the times when services are unavailable, then it is dependent on Service Level Management to define what levels of service delivery are acceptable to the business.
Considering that I wear both an Availability and an SLM hat within the organization, are you saying that the definition of "Outage" is negotiable with the customer? I was hoping that there was some black & white definition within some ITIL book that I've just overlooked.
Joined: Sep 16, 2006 Posts: 3476 Location: London, UK
Posted: Wed Nov 21, 2007 7:45 pm Post subject:
As stated already, an outage is defined based on context.
If there is a cluster of 10 web servers and 1 panic boots, the service is NOT affected (directly) but the server is affected (directly)
There should be an incident record tracking the panic reboot, because a panic reboot is not part of normal service (DIGRESS: Just because a Microsoft O/S reboots as part of its poor design does not mean it is part of the service GRIN END DIGRESS)
So from an Incident Mgmt POV, there was an incident w/ an outage / down time while the server was rebooting.
From a Problem Mgmt POV, should this incident be used as a reason to initiate a Problem record? Does the same server panic boot often? Do servers with the same O/S, service pack, patch level, architecture, make/model or applications always panic boot? Look for trends.
From an Availability Management POV, where it tracks the Web Service, there was NO downtime, as 9 of the 10 servers still provided service.
If AM is tracking % of a cluster and performance metrics and the service still hummed along, there was no outage. However, if AM tracks the service performance and the performance failed to meet the defined spec, then AM is impacted.
From an SLM POV, how is the SLA written? If the SLA states that the Web Service, based on clustered servers, can be w/in SLA if #% of the cluster is active and # performance is met, then the SLA is either breached or not depending on those terms.
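To make the "#% of cluster active and # performance met" idea concrete, here is a rough sketch of how such a check might look. It is only an illustration: the names `SlaTerms`, `Sample`, `within_sla` and `availability`, and the numbers in the usage note, are assumptions of mine and do not come from any ITIL book. The real thresholds have to be agreed with the business through SLM.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SlaTerms:
    """Hypothetical SLA terms, as agreed with the business."""
    min_active_fraction: float  # share of the cluster that must be up
    max_response_ms: float      # slowest response still considered usable


@dataclass(frozen=True)
class Sample:
    """One monitoring measurement of the clustered service."""
    active_servers: int
    total_servers: int
    response_ms: float


def within_sla(sample: Sample, terms: SlaTerms) -> bool:
    """True if this measurement satisfies both SLA conditions."""
    active_fraction = sample.active_servers / sample.total_servers
    return (active_fraction >= terms.min_active_fraction
            and sample.response_ms <= terms.max_response_ms)


def availability(samples: Sequence[Sample], terms: SlaTerms) -> float:
    """Fraction of measurements in which the service met the SLA."""
    if not samples:
        raise ValueError("need at least one sample")
    ok = sum(1 for s in samples if within_sla(s, terms))
    return ok / len(samples)
```

With assumed terms of `SlaTerms(min_active_fraction=0.8, max_response_ms=2000)`, a sample with 9 of 10 servers up and a 500 ms response counts as within SLA, while the same cluster at 5000 ms does not, even though every switch is on. Same event, different answer, depending on what the SLA says.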
In other words.
IT DEPENDS...... which is the paradigm for ITIL
_________________
John Hardesty
ITSM Manager's Certificate (Red Badge)
Change Management is POWER & CONTROL. /....evil laughter