Problem determination and resolution (PDR) is at the core of Incident and Problem Management. PDR is the process of detecting an anomaly in a monitored system, identifying the nature of the anomaly so that it can be routed to the appropriate support team, determining its root cause, and fixing or eliminating that cause. PDR accounts for a substantial part of operational costs, so faster, more effective PDR can substantially reduce system administration costs. The methodologies described by the authors in this chapter automate critical aspects of PDR, such as problem classification for targeted diagnosis and the structuring of solved problem tickets to offer systematized resolutions to support personnel.
Problem determination and resolution, or PDR, is the core process of ITIL Incident and Problem Management (Office of Government Commerce (OGC): IT Infrastructure Library (ITIL), 2002). Incident Management is an IT service management process area that aims to restore normal service operation as quickly as possible and to minimize the impact on business operations. Incidents are the result of a service failing or under-performing. The cause of an incident may be obvious, in which case the appropriate support team can address it without any further action. However, when an incident is not the result of a known problem or error, the Problem Management process may need to become involved. Problem Management is also an IT service management process area. It aims to resolve the root causes of incidents, thereby minimizing the harmful business impact of incidents and problems caused by IT infrastructure issues, and to proactively prevent the recurrence of incidents related to these issues. Both processes result in new problem records, also known as tickets, being raised. The information related to the symptom and the resolution of the problem is recorded in the ticket for potential reuse by the support personnel when a similar issue is reported.
The manual steps involved in PDR, such as classifying problems in order to route them to the support teams, searching past tickets of similar issues for resolutions, and preventing known issues, are time-consuming and error-prone. The methodologies described in this chapter automate these critical aspects of PDR by means of analytical techniques.
First, the authors address the automated classification of the problems that users experience and require support for. The goal of this classification is to identify the specific nature of the problem by comparing the symptoms at hand to available training data, such as historical performance data and log data. To this end, the authors organize the space of problem types into a hierarchical taxonomy, which can be predetermined. They propose an efficient hierarchical incremental learning algorithm that is capable of adjusting the parameters of its internal local classifiers in real time. Compared to traditional batch learning algorithms, this online learning framework can significantly decrease the computational complexity of the training process by learning from new instances in an incremental fashion. At the same time, it reduces the amount of memory required to store the training instances. This problem classification technique was used to enhance the remote monitoring infrastructure SNAPPiMON (SNAPPiMON, n.d.) through integration with the ISA (IBM Support Assistant, n.d.) serviceability workbench, which assists with problem diagnosis, resolution, and service request submission.
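To make the idea of incremental learning over a problem taxonomy concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a plain multiclass perceptron as the local classifier at each internal taxonomy node, bag-of-words features over symptom text, and a small hypothetical taxonomy and ticket stream; all class names, function names, and labels are illustrative assumptions.

```python
from collections import defaultdict


class LocalPerceptron:
    """Online multiclass perceptron that chooses among one node's children."""

    def __init__(self, children):
        self.children = list(children)
        self.weights = {c: defaultdict(float) for c in self.children}

    def score(self, child, features):
        w = self.weights[child]
        return sum(w.get(name, 0.0) * value for name, value in features.items())

    def predict(self, features):
        # Ties are broken in favour of the child listed first.
        return max(self.children, key=lambda c: self.score(c, features))

    def update(self, features, target):
        # Mistake-driven update: weights change only on a wrong guess.
        guess = self.predict(features)
        if guess != target:
            for name, value in features.items():
                self.weights[target][name] += value
                self.weights[guess][name] -= value


class HierarchicalOnlineClassifier:
    """One local classifier per internal node of a predetermined taxonomy."""

    def __init__(self, taxonomy, root="root"):
        self.root = root
        self.local = {node: LocalPerceptron(kids) for node, kids in taxonomy.items()}
        self.parent = {kid: node for node, kids in taxonomy.items() for kid in kids}

    def path_to(self, leaf):
        path = [leaf]
        while path[-1] != self.root:
            path.append(self.parent[path[-1]])
        return path[::-1]

    def learn_one(self, features, leaf):
        # Only classifiers on the root-to-leaf path are touched, and the
        # instance itself is not stored after the update.
        path = self.path_to(leaf)
        for node, child in zip(path, path[1:]):
            self.local[node].update(features, child)

    def predict(self, features):
        # Greedy top-down descent until a leaf is reached.
        node = self.root
        while node in self.local:
            node = self.local[node].predict(features)
        return node


def featurize(text):
    """Bag-of-words counts over whitespace tokens (an assumed feature map)."""
    counts = defaultdict(float)
    for token in text.lower().split():
        counts[token] += 1.0
    return dict(counts)


# Hypothetical taxonomy and symptom stream, for illustration only.
taxonomy = {
    "root": ["database", "network"],
    "database": ["db_connection", "db_deadlock"],
    "network": ["dns", "timeout"],
}
stream = [
    ("database connection refused on port", "db_connection"),
    ("database deadlock detected in transaction", "db_deadlock"),
    ("network dns lookup failed for host", "dns"),
    ("network request timeout after retries", "timeout"),
]

model = HierarchicalOnlineClassifier(taxonomy)
for _ in range(5):  # replaying the stream stands in for instances arriving over time
    for text, label in stream:
        model.learn_one(featurize(text), label)

print(model.predict(featurize("dns lookup failed")))
```

In this sketch each new instance updates only the local classifiers along its taxonomy path, so the per-instance cost grows with the depth of the taxonomy rather than with the total number of problem types, and no training instances need to be retained.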
Second, the authors present the automatic structuring of solved problem tickets, which consist of free-form, heterogeneous textual data. Most existing ticketing data is not explicitly structured, is highly noisy, and is very heterogeneous in content, which makes it hard to apply common data mining techniques effectively to analyze and search the raw data. An example of such an analysis is the detection of the units of information that contain the steps taken by technical personnel to resolve a particular customer issue. The support team, which has to solve IT problems over the phone, benefits from being given a set of steps that guide them in fixing the customers’ issues. Thus, having access to relevant tickets that describe similar problems encountered by other users, together with their aggregated resolution actions, saves the support team time and reduces the time to repair for the customer.