Joined: Mar 10, 2008 Posts: 401 Location: Sunderland
Posted: Fri Jan 30, 2009 10:17 pm Post subject: Designing for availability
Sooo... my customer wants 99.5% availability from their perspective.
The service in question is based on a Genesys front end with a SAP engine. The infrastructure footprint covers locally deployed components/hardware in the UK and most of the SAP kit in Germany.
Our method of calculating whether we can deliver required availability is along the lines of 'resilient servers x network x routers x desktop.....etc' (i.e. the traditional straight out of an ITIL book view of designing for availability). This is far too simplistic for my liking as it doesn't account for the internal and external support required to keep these components running and to fix them when they go wrong.
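As a rough illustration of the component-multiplication method described above, here is a minimal Python sketch. The component names follow the post; the availability figures are invented for illustration, and the calculation assumes that components fail independently and that every one of them must be up for the service to work.

```python
from math import prod

# Hypothetical per-component availabilities (fraction of agreed service time).
components = {
    "resilient servers": 0.999,
    "network": 0.998,
    "routers": 0.999,
    "desktop": 0.995,
}


def series_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


end_to_end = series_availability(components.values())
print(f"End-to-end availability: {end_to_end:.4%}")
print("Meets 99.5% target:", end_to_end >= 0.995)
```

Even with each component at or above 99.5%, the chain as a whole falls below it, and nothing in this product says anything about how quickly a failed component gets fixed.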
So, in the interests of challenging our SLM guys and support groups as to whether the customer requirement is truly underpinned, I'm looking to apply some more exacting science to the calculations. My thoughts are along these lines: where support hours don't cover online availability hours, what is the likelihood of a component failing, and how much downtime are we likely to be exposed to?
I have never really seen this type of thing done particularly well, so if anybody has had success in scientifically covering support as well as hardware/software components in underpinning availability, I would be very interested in knowing how you did it.
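One way to start putting numbers on the support question is to add the wait for support to the hands-on repair time. The sketch below is only an illustration: the support hours, failure rate and repair time are hypothetical, failures are assumed equally likely at any time of day, and the service is assumed to run 24x7.

```python
SUPPORT_START = 8          # hypothetical: support staffed 08:00-20:00 daily
SUPPORT_END = 20
FAILURES_PER_YEAR = 4      # hypothetical component failure rate
HANDS_ON_REPAIR_HOURS = 2  # hypothetical repair time once staff are engaged
SERVICE_HOURS_PER_YEAR = 365 * 24  # assumes a 24x7 service


def wait_for_support(hour_of_day):
    """Hours until support is available for a failure at this time of day."""
    if SUPPORT_START <= hour_of_day < SUPPORT_END:
        return 0.0
    if hour_of_day < SUPPORT_START:
        return SUPPORT_START - hour_of_day
    return 24 - hour_of_day + SUPPORT_START


# Sample every minute of the day to get the average wait for support.
samples = [minute / 60 for minute in range(24 * 60)]
mean_wait = sum(wait_for_support(h) for h in samples) / len(samples)

downtime_per_failure = mean_wait + HANDS_ON_REPAIR_HOURS
annual_downtime = FAILURES_PER_YEAR * downtime_per_failure
availability = 1 - annual_downtime / SERVICE_HOURS_PER_YEAR

print(f"Mean wait for support: {mean_wait:.2f} hours")
print(f"Expected annual downtime: {annual_downtime:.1f} hours")
print(f"Expected availability: {availability:.3%}")
```

The point of the exercise is that the wait term can easily be larger than the repair term, which is the part the component-only calculation leaves out.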
Support is easy to factor in; start with third party contracts and negotiate a lower 'out of hours' downtime rate with your customers to reflect the reduced operational impact.
Factoring in support time is doable if you are managing your support team time/effort accurately.
UJ _________________ Did I just say that out loud?
Joined: Mar 10, 2008 Posts: 401 Location: Sunderland
Posted: Sat Jan 31, 2009 1:01 am Post subject:
UJ
This is only 'out of hours' for some of the IT Support Organisation, not the Business Customer. They want 99.5% for the entire hours of service, and they expect us to work out how that can be delivered and charge them for it appropriately, rather than tell them that we can't offer that but can offer something different.
The key here is that it's from the customer's perspective, not the IT organisation's.
Joined: Sep 16, 2006 Posts: 3110 Location: London, UK
Posted: Sat Jan 31, 2009 1:35 am Post subject:
BorisBear,
the following is a real situation
One of the companies I had worked for had an application that was supposed to be available 24x7 w/99.5% availability
The application was supported by the 3rd party application developers
The application resided on Unix O/S supported by a SUN approved local country IT provider (3rd party)
The company packaged this and all we did was manage the NMS tools and escalated out.
Funny thing on the way to the service.
Sun has a hierarchical support system w/response times - diamond, gold, silver, etc. - however the hierarchy is
their vendor had a different interpretation on support and response
One of the application servers died on a Thursday night. The server was not on any of the Diamond or Platinum or ultimate support / response models. It was on the lowest. The response was within 24 hours; the service fix time was even longer.
The server did not get fixed until the following Wednesday / Thursday
- however, it was fixed in spec w/the contract hours for that MACHINE, not for the company or the application / service
So the 99.5% was not met because somebody decided not to spend extra money on a maintenance / support contract _________________ John Hardesty
ITSM Manager's Certificate (Red Badge)
Change Management is POWER & CONTROL. /....evil laughter
Joined: Mar 04, 2008 Posts: 1883 Location: Newcastle-under-Lyme
Posted: Sat Jan 31, 2009 1:52 am Post subject:
BorisBear wrote:
The key here is that it's from the customer's perspective, not the IT organisation's.
The customer perspective of availability is the only one there is. Availability means availability to the customer.
I think you are already on the right lines in identifying the factors involved. The two main questions that need answering before working out costs and preparing commitment are:
1. What is the time frame for the 99.5%? It makes a big difference whether it is a year, a month, a week, a day, an hour ( (: ).
2. What is the tolerance to time down and how does it vary?
The easy case for you is if you have a robust system and they want 99.5% over a year and they can survive a fail during unsocial hours, then you might get away with fixing by 06:00, say. Probably not realistic, but it illustrates some issues.
The more critical their business imperative, the higher the risk/cost is going to be.
Do you have historical data on outages and recovery times? What is your minimum recovery time for a DB crash? A network node failure? An OS crash? etc. How long does it take to detect an outage? How long to get support to the 'coal face'? How long to achieve diagnosis and initiate correction and recovery? How much do these figures vary from their averages?
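If incident records with phase timings exist, these questions can be answered with a few lines of arithmetic. A minimal sketch, assuming each incident has been logged with the minutes spent in four phases; the phase names and all the figures are hypothetical, not taken from any real tool or record.

```python
from statistics import mean, pstdev

PHASES = ("detect", "respond", "diagnose", "recover")

# Hypothetical incident records: minutes spent in each phase.
incidents = [
    {"detect": 5, "respond": 20, "diagnose": 30, "recover": 45},
    {"detect": 10, "respond": 60, "diagnose": 25, "recover": 40},
    {"detect": 3, "respond": 15, "diagnose": 90, "recover": 60},
]

# Total outage length per incident is the sum of its phases.
totals = [sum(incident[phase] for phase in PHASES) for incident in incidents]

for phase in PHASES:
    values = [incident[phase] for incident in incidents]
    print(f"{phase:9s} mean {mean(values):6.1f} min, spread {pstdev(values):5.1f} min")

print(f"total     mean {mean(totals):6.1f} min, worst {max(totals)} min")
```

Splitting the total this way shows which phase contributes most to the outage length and which one varies most, which is where contingency is likely to be needed.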
How much contingency needs to be there for investigating mysterious outages?
If gearing up for round the clock quick response is expensive, perhaps some aspects can be put to a third party who already operate round the clock. This will almost certainly be true if your organization is not large.
The true test of how serious they are about the figure is when they see the cost and risk analysis. How near to certainty do you need to be before the risk is acceptable? _________________ "Method goes far to prevent trouble in business: for it makes the task easy, hinders confusion, saves abundance of time, and instructs those that have business depending, both what to do and what to hope."
William Penn 1644-1718
Joined: Mar 10, 2008 Posts: 401 Location: Sunderland
Posted: Mon Feb 02, 2009 6:23 pm Post subject:
Diarmid - I think the truth is that because we don't have historical data we won't be able to offer 99.5% with any degree of confidence and therefore will probably work towards this over a period of a few months.
We do have some periods of the day where availability is more critical than at other times so we could weight availability/downtime accordingly. I guess what I'm struggling with is how to put some science behind what the support groups do particularly as we're very much an organisation that works in silos without joined up ownership of service provision. We're not going to solve those problems overnight but some tips on how to tackle the science and pin down the support groups would be good.
Joined: Mar 04, 2008 Posts: 1883 Location: Newcastle-under-Lyme
Posted: Mon Feb 02, 2009 7:58 pm Post subject:
Boris,
since your issue is time, you have to measure time.
If your silos are deep, then you measure from the time the incident enters the silo until it emerges and you have to obtain a commitment from the silo manager on how long things take. And you design slick interfaces between the silos.
If your silos are shallow, then you measure all the phases and possibly measure processing time (time sheets) by individual staff and manage it all co-operatively with the silo managers.
Either way you are measuring time and putting yourself in a position to ask for improvements in specific areas. A priority could be to integrate the procedures used by the silos, or build an integrated procedure from scratch. There won't be silos then.
Without the data you can't predict.
Once you have the data, you will see your shortfall. But you can't just tell the business to spend more to address the shortfall. You also have to make improvements. Chances are that breaking down the silos will help a lot. If you set up improvement programs with targets and then meet those targets, then the business will listen to you when you show them the limits of your present capacity.
Joined: Sep 16, 2006 Posts: 3110 Location: London, UK
Posted: Mon Feb 02, 2009 8:22 pm Post subject:
BorisBear,
One of the hares I used to split about SLAs was response time
The tool we used for incident mgmt accepted emails from customers (specific email addresses of course). This would create a 'ticket' in the system that would reply to the customer when we accepted the ticket.
there was a 15 minute SLA for Priority 1 (P1) issues. One customer thought everything was a P1.
The SD would make sure the tickets were accepted and the email went to customer w/in 15 minutes
So if the service is M-F 0800 - 2000, then the availability is against that time frame.
Now I also used to split hairs w/availability.
If the web site was up and able to accept requests (we used external SiteScope), then the availability was 100% if the requests were between 0800 & 2000 (GMT in this case).
Hare splitting I know ...but Bugs Bunny was always my favorite _________________ John Hardesty
Joined: Oct 07, 2007 Posts: 441 Location: Jakarta, INA
Posted: Mon Feb 02, 2009 10:16 pm Post subject:
Hi,
Referring to the first post, it is common to exclude the maintenance window from the agreed availability.
Availability could be calculated on a monthly, quarterly, or annual basis.
But usually providers would calculate availability on an annual basis to give themselves flexibility in managing the service, although the report is on a monthly basis.
For instance the full calendar availability is 365 x 24 x 60 = 525,600 minutes (= 31,536,000 seconds).
Let's say the maintenance window is 10% = 52,560 minutes.
That makes your 100% availability equal to 473,040 minutes.
99.5% availability means that the 0.5% (about 2,365 minutes) of unplanned downtime could be spread anywhere throughout the year.
Further, you could break down the availability to wrap up all the equipment from end to end, or you could also set availability for individual equipment, or a combination of both.
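The arithmetic above can be reproduced in a few lines. This is only a sketch: the function name is made up for illustration, and it uses a maintenance share of 10%, which is what the 52,560-minute maintenance figure and the 473,040-minute total correspond to.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget(target, maintenance_fraction, period_minutes=MINUTES_PER_YEAR):
    """Return (agreed service minutes, allowed unplanned downtime minutes)."""
    agreed = period_minutes * (1 - maintenance_fraction)
    return agreed, agreed * (1 - target)


agreed, budget = downtime_budget(target=0.995, maintenance_fraction=0.10)
print(f"Agreed service time: {agreed:,.0f} minutes")
print(f"Unplanned downtime budget: {budget:,.1f} minutes ({budget / 60:.1f} hours)")
```

Passing a different `period_minutes` (for example the minutes in one month) gives the budget for a shorter measurement period.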
Joined: Mar 04, 2008 Posts: 1883 Location: Newcastle-under-Lyme
Posted: Mon Feb 02, 2009 11:02 pm Post subject:
asrilrm wrote:
99.5% availability means that the 0.5% (equals 2,365 minutes) unplanned downtime could be spread anywhere throughout the year.
Which is not good if it all happens on one day. (The Longest Day with a vengeance).
The business has to be protected and it won't wash if you are within the literal terms of your contract but the service has been down too long or too often.
Where an annual figure is used it should be qualified by other limits over shorter periods. There can be such things as:
- no breaks lasting more than four hours
- 98% availability within any rolling four weeks
- no more than three breaks in any month
These tools keep the service provider 'good' while allowing a little contingent flexibility.
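The three example limits listed in this post can be checked mechanically against an outage log. A minimal sketch, with an invented log and a deliberately simplified rolling-window test: it assumes a 24x7 service, only examines four-week windows that open at the start of an outage, and counts an outage in full if it starts inside the window.

```python
from collections import Counter
from datetime import datetime, timedelta

MAX_BREAK_MINUTES = 4 * 60
MAX_BREAKS_PER_MONTH = 3
WINDOW = timedelta(weeks=4)
MIN_WINDOW_AVAILABILITY = 0.98

# Hypothetical outage log: (start time, duration in minutes).
outages = [
    (datetime(2009, 1, 5, 9, 0), 30),
    (datetime(2009, 1, 12, 14, 0), 300),
    (datetime(2009, 1, 20, 2, 0), 45),
    (datetime(2009, 2, 3, 11, 0), 20),
]


def check_limits(outages):
    """Return a list of human-readable breaches of the three limits."""
    breaches = []

    # Limit 1: no single break longer than four hours.
    for start, minutes in outages:
        if minutes > MAX_BREAK_MINUTES:
            breaches.append(f"{start:%Y-%m-%d}: break of {minutes} min exceeds four hours")

    # Limit 2: no more than three breaks in any calendar month.
    per_month = Counter((start.year, start.month) for start, _ in outages)
    for (year, month), count in sorted(per_month.items()):
        if count > MAX_BREAKS_PER_MONTH:
            breaches.append(f"{year}-{month:02d}: {count} breaks in the month")

    # Limit 3 (simplified): 98% availability in four-week windows that
    # open at the start of each outage.
    window_minutes = WINDOW.total_seconds() / 60
    for window_start, _ in outages:
        down = sum(m for s, m in outages if window_start <= s < window_start + WINDOW)
        availability = 1 - down / window_minutes
        if availability < MIN_WINDOW_AVAILABILITY:
            breaches.append(f"{window_start:%Y-%m-%d}: four-week availability {availability:.2%}")

    return breaches


for breach in check_limits(outages):
    print(breach)
```

A production check would need to clip outages at window boundaries and respect the agreed service hours; this version is only meant to show that the qualifiers are easy to test once outages are logged with start times and durations.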
Joined: Oct 13, 2006 Posts: 116 Location: South Africa
Posted: Tue Feb 03, 2009 9:05 pm Post subject:
Good points already made - I'll try not to repeat stuff.
This is what you should be taking to the customer.
- Tell them to forget about measures like 99.5%, even if they've read such things in the press, unless they have some solid history (from what you've said, they don't have history from you) or solid industry benchmarks, that allow them to match a percent availability with an actual business risk.
In other words, focus on the length of outages as much as, or more than, total lost time. As discussed in previous replies!
- Negotiate requirements for each distinct period ... such as main business hours, critical month-end times, early evening "best efforts" support, etc (I put that phrase in quotes most deliberately) ... and do your calculations separately for each period.
- I have seen the ITIL-style calculations done well, with full consideration of the different formulae for redundant independent components and so on. But the long-term value is only created if you make actual measurements of customer service availability and internal component availability and use them to calibrate and verify your calculations.
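For reference, the usual formulae for independent components are multiplication for components in series, and one minus the product of the unavailabilities for redundant components. A minimal sketch with invented figures, assuming failures really are independent (shared power, shared support teams and common software faults all break that assumption):

```python
from math import prod


def series(*availabilities):
    """All components must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities):
    """Any one component is enough: one minus the product of unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)


server = 0.99    # hypothetical single-server availability
network = 0.999  # hypothetical network availability

single = series(server, network)
redundant = series(parallel(server, server), network)

print(f"Single server plus network: {single:.4%}")
print(f"Redundant pair plus network: {redundant:.4%}")
```

These are predictions only; as the post says, they earn their keep when they are compared against measured service and component availability.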
Joined: Mar 10, 2008 Posts: 401 Location: Sunderland
Posted: Tue Feb 03, 2009 11:08 pm Post subject:
Diarmid wrote:
asrilrm wrote:
99.5% availability means that the 0.5% (equals 2,365 minutes) unplanned downtime could be spread anywhere throughout the year.
Which is not good if it all happens on one day. (The Longest Day with a vengeance).
The business has to be protected and it won't wash if you are within the literal terms of your contract but the service has been down too long or too often
Where an annual figure is used it should be qualified by other limits over shorter periods. There can be such things as:
- no breaks lasting more than four hours
- 98% availability within any rolling four weeks
- no more than three breaks in any month
These tools keep the service provider 'good' while allowing a little contingent flexibility.
...is the right answer. I don't know any of the big vendors who don't measure services at least in part on a monthly basis, especially given that service reviews and improvement initiatives need to be 'of the moment'.
Joined: Mar 10, 2008 Posts: 401 Location: Sunderland
Posted: Tue Feb 03, 2009 11:09 pm Post subject:
JoePearson wrote:
Good points already made - I'll try not to repeat stuff.
This is what you should be taking to the customer.
- Tell them to forget about measures like 99.5%, even if they've read such things in the press, unless they have some solid history (from what you've said, they don't have history from you) or solid industry benchmarks, that allow them to match a percent availability with an actual business risk.
In other words, focus on the length of outages as much as, or more than, total lost time. As discussed in previous replies!
- Negotiate requirements for each distinct period ... such as main business hours, critical month-end times, early evening "best efforts" support, etc (I put that phrase in quotes most deliberately) ... and do your calculations separately for each period.
- I have seen the ITIL-style calculations done well, with full consideration of the different formulae for redundant independent components and so on. But the long-term value is only created if you make actual measurements of customer service availability and internal component availability and use them to calibrate and verify your calculations.
Hmmmm...not sure I agree. What if the service is up and down like a gigolo's bottom, with each downtime period being just a few minutes? In my experience this can be an even worse customer experience.
Joined: Mar 10, 2008 Posts: 401 Location: Sunderland
Posted: Tue Feb 03, 2009 11:12 pm Post subject:
asrilrm wrote:
Hi,
Referring the first post, it is common to exclude maintenance window from the agreed availability.
Availability could be calculated in a monthly, quarterly, or annually basis.
But usually providers would calculate availability in annual basis for the reason to give them the flexibility of managing the service, although the report is in monthly basis.
For instance the full calendar availability is 365 x 24 x 60 = 525,600 minutes (= 31,536,000 seconds).
Let's say the maintenance window is 10% = 52,560 minutes.
That makes your 100% availability equals 473,040 minutes.
99.5% availability means that the 0.5% (equals 2,365 minutes) unplanned downtime could be spread anywhere throughout the year.
Further, you could breakdown the availability to wrap up all the equipments from end to end, of you could also set availability for individual equipment, or combination of both.
Cheers,
Asril
Asril - I think you have misunderstood me... I wasn't suggesting that maintenance windows be part of the availability calculation. At the core of my original question was that we have expected failure rates and availability predictions for hardware components, but we don't understand the impact that the availability and capability of support staff have on dealing with incidents when they occur.