Posted: Wed Aug 26, 2009 5:47 pm Post subject: Capacity Mgmt Metrics
Been having a confrontation with a newly appointed associate director who has recently been involved in conducting an ISO20k internal audit.
Within the report he has stated that CPM is failing because no 'processing capacity' metrics are monitored, measured, and reported, which prevents correct decisions from being made.
Now my issue here, on which I would appreciate your thoughts and comments, is this: we monitor CPU & Memory usage with Ganglia (previously using Orca). We have real-time access to the information; however, we choose not to measure the CPU or Memory levels of 800+ servers (a mixture of Windows, Linux, AIX, Unix, physical, virtual etc). My justification is that we receive alerts when thresholds are reached so that the users are not affected, and that when new applications / services are being added the guys have access to this information to judge whether or not to add it to box X. Finally, to measure CPU & Memory efficiently over so many servers, the data would have to be aggregated into some sort of average, overall figure, which then makes it completely ineffective.
Do you guys measure 'processing capacity', and how do you do it?
Joined: Mar 04, 2008 Posts: 1883 Location: Newcastle-under-Lyme
Posted: Wed Aug 26, 2009 7:44 pm Post subject:
Tony,
if you do not know the present average and peak utilization of a machine, how do you determine its capacity to support a new application, a modified application with new functionality, or a change in utilization patterns? The absence of threshold alerts is no guarantor of available capacity for additional usage.
Performance monitoring is not Capacity Management although it is a necessary component.
Now the practical aspect is, as always, cost and risk. What is the cost of maintaining current baselines for each machine? What are the consequences if a particular machine "overloads" (on the busiest day of the year, for example)?
So it can be okay to make a judgement call, especially for small, stable machines running non-critical systems. It can come down to how tolerant your business activity is of small glitches in responsiveness occurring for a period; how readily you can either add capacity or reshuffle apps between boxes; and how acceptable it is to regress and move a newly added system when the capacity is found to be wanting.
But consider this scenario:
A project is proposed to revamp a cluster of applications that are spread over 20 small servers. This will be a big deal. Therefore, either:
- you set up life size tests to measure everything as it will be
- you measure the new apps and apply modelling techniques (or at least extrapolations although that is less reliable) to predict impact on capacity
- you make a judgement call (I don't think so)
If the first is too expensive (on people, machines and time) and possibly rather difficult, then you have to look at the second. This is fine. But if you do not have current baselines for these machines, you will have to extend the project to obtain them before you proceed (typically that involves working through peak demand periods, which may be monthly, for example).
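As a rough illustration of what a "current baseline" and a simple extrapolation might look like, here is a minimal Python sketch. The function names, the 80% ceiling and the sample figures are all hypothetical, and the check is the plain linear extrapolation that the post already describes as the less reliable option, not a proper model.

```python
from statistics import mean


def baseline(samples):
    """Return (average, peak) CPU utilisation in percent for one machine."""
    return mean(samples), max(samples)


def fits(samples, extra_load_pct, ceiling_pct=80.0):
    """Naive linear check: does observed peak plus the estimated extra
    load stay at or under the chosen ceiling? (Assumes load simply adds.)"""
    _, peak = baseline(samples)
    return peak + extra_load_pct <= ceiling_pct


# Made-up utilisation samples covering one peak period.
cpu_samples = [35.0, 42.0, 58.0, 71.0, 49.0]

avg, peak = baseline(cpu_samples)
print(f"average={avg:.1f}% peak={peak:.1f}%")
print("room for +5 points: ", fits(cpu_samples, 5.0))
print("room for +15 points:", fits(cpu_samples, 15.0))
```

The samples only mean something if they cover a genuine peak demand period, which is why gathering the baseline can end up extending the project.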
So one of the risks is that it might become important in the future. But it is still cost and risk.
Your auditor is only correct if you cannot show the cost and risk analysis to support what you do.
_________________
"Method goes far to prevent trouble in business: for it makes the task easy, hinders confusion, saves abundance of time, and instructs those that have business depending, both what to do and what to hope."
William Penn 1644-1718
if you do not know the present average and peak utilization of a machine, how do you determine its capacity to support a new application, a modified application with new functionality, a change in utilization patterns. The absence of threshold alerts is no guarantor of available capacity for additional usage.
Ganglia provides the view of the CPU & Memory usage per server, and retains historical information to be able to see 'normal' performance. Therefore, in the example you cited we would review the performance of the servers to know whether they could handle the changes.
The key is that we do not make this a specific metric to report on due to the volume of the servers.
I know ITIL does not view thresholds as being a pro-active mechanism, however, I disagree with this as this provides IT with the early warning to ensure that users are not impacted.
I'm sure there is something I am missing, just cannot see it yet.
Joined: Mar 04, 2008 Posts: 1883 Location: Newcastle-under-Lyme
Posted: Thu Aug 27, 2009 12:05 am Post subject:
SwissTony wrote:
The key is that we do not make this a specific metric to report on due to the volume of the servers.
Report to whom? Who is interested in the capacity of every single server apart from the Capacity Manager and the Operations Manager? The business isn't, and neither should the CIO or the head of Service Management be, although those latter two would expect to see evidence that you do have all that information reliably at your fingertips.
Reports to senior management on the performance of the Capacity Management function should be focussed on overall capacity and how ready it is to meet planned and unplanned variance in demand; activities in support of projects and future planning for business, application and technology/equipment changes; any interesting trends that are emerging; and issues resolved and under investigation.
I have to say that threshold monitoring is essentially reactive in terms of Capacity Management (although it can be considered pro-active in terms of Incident and Problem Management) as there is no element of prediction or anticipation involved; rather, reaching it is something you react to. It's a bit moot though, because the words slip about too much in this area. It's probably better just to be clear about what you require and how you achieve that.
My real point is that threshold monitoring is incapable of predicting the impact of additional workloads on a system because that impact is not additive. Bottlenecks can appear out of nowhere.
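One way to see why the impact is not additive is a textbook single-server queueing approximation (M/M/1), where mean response time is the service time divided by (1 - utilisation). This model is an assumption brought in for illustration only; it is not taken from the thread, and real systems with multiple resources behave less tidily. The 0.1 second service time is made up.

```python
def response_time(service_time_s, utilisation):
    """Mean response time under an M/M/1 approximation: R = S / (1 - U)."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1)")
    return service_time_s / (1.0 - utilisation)


service_time = 0.1  # hypothetical seconds of service per request

for u in (0.5, 0.7, 0.9):
    print(f"utilisation {u:.0%}: response ~{response_time(service_time, u):.2f}s")

# The same 20-point increase costs far more when starting from 70% than from 50%.
step_from_50 = response_time(service_time, 0.7) - response_time(service_time, 0.5)
step_from_70 = response_time(service_time, 0.9) - response_time(service_time, 0.7)
print(f"+20 points from 50%: +{step_from_50:.2f}s; from 70%: +{step_from_70:.2f}s")
```

A threshold set at, say, 80% stays silent throughout the first step, yet the machine has already used up most of its tolerance for further load.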
[PS. Reference to bottleneck implies no particular preference for John Fahey.]
Joined: Feb 27, 2009 Posts: 16 Location: North Coast, USA
Posted: Thu Aug 27, 2009 12:37 am Post subject:
Tony,
I've been bouncing around a few sites concerned with measurement and capacity...
One term that often shows up alongside Processing Capacity is Cost of Transaction. (Forgive me if it's in ITIL as well; I've been introduced to a lot of terms from many sources lately.)
So not only do you need to capture the box metrics, but also usage / throughput / transactional metrics to match up with performance metrics.
... we haven't gotten there yet either.
Then, when you know of a change in demand coming down the line (seasonal, new customer, etc.), you can estimate what the increase in transactions will demand from your infrastructure.
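A minimal sketch of that idea in Python, assuming CPU is the resource of interest and that cost per transaction stays constant as volume grows (a simplification, given that impact is not additive near saturation). All names and figures here are hypothetical.

```python
def cpu_seconds_per_transaction(cpu_seconds_used, transactions):
    """Cost of a transaction, expressed as CPU-seconds consumed per transaction."""
    if transactions <= 0:
        raise ValueError("transactions must be positive")
    return cpu_seconds_used / transactions


def projected_utilisation(cost_per_txn_s, txn_per_second, cores):
    """Fraction of total CPU needed to sustain a given transaction rate."""
    return cost_per_txn_s * txn_per_second / cores


# Made-up measurement: one hour, 1800 CPU-seconds consumed, 90,000 transactions.
cost = cpu_seconds_per_transaction(1800.0, 90_000)

current = projected_utilisation(cost, txn_per_second=25, cores=4)
forecast = projected_utilisation(cost, txn_per_second=120, cores=4)
print(f"cost per transaction: {cost:.3f} CPU-seconds")
print(f"current ~{current:.1%}, forecast ~{forecast:.1%} of 4 cores")
```

The same pairing works for memory, I/O or network, as long as the box metric and the transaction count are captured over the same interval.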
For new apps or enhancements, ideally that relationship would be captured during the QA phase.
And yes, when you roll historical data into averages you lose visibility of reality. So maybe keeping peaks, or an average of peaks, alongside a variance metric like standard deviation or confidence interval (6Sigma background) might give a better picture.
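For example, a summary along those lines could be computed per server roughly as follows. This is only a sketch: the `summarise` helper, the day labels and the sample values are made up, and population standard deviation is used simply as one possible variance metric.

```python
from statistics import mean, pstdev


def summarise(daily_samples):
    """Summarise utilisation (%) given a dict of day label -> list of samples."""
    peaks = [max(samples) for samples in daily_samples.values()]
    everything = [x for samples in daily_samples.values() for x in samples]
    return {
        "mean": mean(everything),       # the figure that hides reality
        "stdev": pstdev(everything),    # spread around that mean
        "max_peak": max(peaks),         # worst single observation
        "mean_of_peaks": mean(peaks),   # typical daily high-water mark
    }


# Made-up samples for one server over three days.
history = {
    "mon": [20.0, 30.0, 90.0],
    "tue": [25.0, 35.0, 60.0],
    "wed": [20.0, 30.0, 50.0],
}

summary = summarise(history)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

In this made-up data the overall mean sits at 40% while the average daily peak is well above it, which is exactly the gap that a single averaged figure would hide.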