Posted: Wed Dec 07, 2005 5:49 am Post subject: Batch job failures and ITIL
I'm in the process of re-engineering our current incident and problem management processes to be more in line with the ITIL standards. One issue that has caught my attention: do others treat batch job failures as "Incidents"?
For example, a scheduled job fails. The IS-Operations team will see the alert, provide some limited initial support, then document the issue. Sometimes they can re-run the job immediately; other times it will require app dev or infrastructure help. For all of these occurrences, I can see the value of tracking them as incidents/problems in the same way we would treat an incident called in to the service desk. If nothing else, it would allow us to track and trend the occurrences of job failures through the processes to determine root causes/known errors.
Have others incorporated these types of incidents into their incident/problem management process? If so, how?
Joined: Nov 01, 2004 Posts: 81 Location: Sask, Canada
Posted: Wed Dec 07, 2005 10:52 am Post subject:
Hi, Carl - our company tracks production batch job failures as incidents just as you describe, with the benefits as you describe.
The mainframe environment is so mature that the incidents rarely become problems, which makes my life as a Problem Analyst much easier.
The quick answer would be yes: any disruption to normal service operation would be classed as an Incident and would therefore invoke the Incident Management process, regardless of service or Configuration Item (CI), with an individual Incident Record created.
Problem Management, as you say, utilising Error Control, would help with trend analysis by looking at Incident Records to ascertain the root cause of the Incidents and obtain a work-around for the Service Desk/Incident Management to use in the event of a future occurrence (as recorded within the Known Error Db). Ultimately, if required and cost-effective for your organisation, structural Changes to the service or CIs can be made, interfacing with Change Management, to eradicate the recurrence of any further Incidents or related Problems.
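As a rough illustration of the trend analysis described above, here is a minimal Python sketch that counts Incident Records per batch job and flags repeat offenders as candidates for a Problem record. The job names, failure categories, record layout and the threshold of 3 are all made up for the example; a real service desk tool will have its own export format.

```python
from collections import Counter

# Hypothetical incident records: (job name, failure category).
incidents = [
    ("nightly_billing", "disk full"),
    ("nightly_billing", "disk full"),
    ("gl_extract", "bad input file"),
    ("nightly_billing", "timeout"),
]

def problem_candidates(records, threshold=3):
    """Return (job, count) pairs for jobs with at least `threshold` incidents,
    most frequent first. The threshold is an assumed, arbitrary cut-off."""
    counts = Counter(job for job, _category in records)
    return [(job, n) for job, n in counts.most_common() if n >= threshold]

candidates = problem_candidates(incidents)
print(candidates)
```

Grouping by failure category instead of job name would work the same way and may point more directly at a Known Error.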
It may also be beneficial to utilise Availability Management, in agreement with the business, using such metrics as MTTR, MTBSI and MTBF and such techniques as Service Outage Analysis, Business Impact Analysis, and good old Technical Observation Posts to correctly assess the true impact of the system Incident, as well as getting a gang dedicated to resolving the underlying root cause.
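For anyone wanting to see how those three metrics fall out of the Incident Records, here is a small Python sketch. It assumes each record carries a failure-detected time and a service-restored time (the timestamps below are invented), and it uses the usual Availability Management readings: MTTR as average downtime per incident, MTBF as average uptime from one restoration to the next failure, and MTBSI as average time from the start of one incident to the start of the next.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (failure detected, service restored).
incidents = [
    (datetime(2005, 12, 1, 2, 0), datetime(2005, 12, 1, 3, 0)),
    (datetime(2005, 12, 3, 2, 0), datetime(2005, 12, 3, 2, 30)),
    (datetime(2005, 12, 6, 2, 0), datetime(2005, 12, 6, 4, 0)),
]

def mean(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)

def availability_metrics(records):
    """Return (MTTR, MTBF, MTBSI) for two or more incident records."""
    records = sorted(records)
    pairs = list(zip(records, records[1:]))
    mttr = mean([end - start for start, end in records])    # downtime per incident
    mtbf = mean([nxt[0] - cur[1] for cur, nxt in pairs])    # restore -> next failure
    mtbsi = mean([nxt[0] - cur[0] for cur, nxt in pairs])   # failure -> next failure
    return mttr, mtbf, mtbsi

mttr, mtbf, mtbsi = availability_metrics(incidents)
print(mttr, mtbf, mtbsi)
```

Running this per job, or per service, gives figures that can be compared against whatever availability targets have been agreed with the business.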