Tuning up IT Services using Monitoring Configuration Analytics

Liang Tang (Florida International University, USA), Chunqiu Zeng (Florida International University, USA), Tao Li (Florida International University, USA), Larisa (Laura) Shwartz (IBM T.J. Watson Research Center, USA) and Genady Ya. Grabarnik (St. John's University, USA)
DOI: 10.4018/978-1-4666-8496-6.ch007

Abstract

The goal of monitoring is early identification of an issue through assessment of system vital signs against acceptable thresholds. Symptoms of system degradation constitute an event and are flagged for sending to support personnel as an incident ticket. Event management is based on the hypothesis that a violation of a specific threshold identifies a problem. This hypothesis gives rise to two types of errors: false positives, which correspond to ticketing a harmless condition, and false negatives, which correspond to failing to flag the degradation of a vital sign for ticketing. This chapter describes the methodology and analytics for minimizing errors of both types. Furthermore, many IT service providers rely on partial automation for diagnosis and resolution of incidents created by monitoring. This chapter proposes a methodology that allows moving from partially to fully automated problem resolution by eliminating the ambiguity of service incident description and classification in a highly variable service delivery environment.

Introduction

Large computing systems are often constructed in distributed IT environments and maintained by IT service providers. IT service providers face an increasingly competitive landscape and growing industry requirements. Modern forms of distributed computing (for example, cloud computing) provide some standardization of the initial hardware and software configuration. However, to support most enterprise-level applications, a dedicated infrastructure for the given application must be created and maintained on behalf of each outsourcing customer. This requirement creates great variability in the services provided by IT support teams. These issues contribute largely to the fact that routine maintenance of information systems remains only semi-automated, with much of the work still performed manually. Significant initiatives such as autonomic computing raised awareness of the problem in the scientific and industrial communities and helped introduce more sophisticated and automated procedures, which increase productivity and help guarantee the overall quality of the delivered service (Kephart & Chess, 2003).

Automatic problem detection is typically realized by system monitoring software, such as IBM Tivoli Monitoring (IBM, 1996) and HP OpenView (HP, 2007). System monitoring is an automated reactive system that provides an effective and reliable means of ensuring that degradation of the vital signs, defined by acceptable thresholds with monitoring conditions, is flagged as a problem candidate (a monitoring event) and sent to the service delivery teams as an incident ticket. A threshold together with its corresponding monitoring condition constitutes a monitoring situation. A monitoring system usually consists of various monitoring situations that monitor different components of the target system. For instance, to guard against excessive CPU consumption, an administrator can add a monitoring situation to the monitoring system as follows.

  • Condition: CPU_UTILIZATION > 90%

  • Duration: 20 minutes

  • Severity: Warning

In this monitoring situation, “Condition” is the rule that triggers the alert, in which “CPU_UTILIZATION” is a variable predefined in the monitoring system that represents the CPU utilization of the target system, and “90%” is the threshold. “Duration” is the length of time for which the condition persists. “Severity” is the severity level of the triggered alert. A monitoring system usually has thousands of predefined variables capturing different resource consumption levels, statuses, and errors, such as the available disk space, network speed, number of TCP connection attempts, or database table space. Consequently, system administrators are able to specify various monitoring situations for a particular IT environment. The collection of specified monitoring situations is part of the configuration of the monitoring system. System administrators can add, remove, change, and deploy these situations to each agent running on the servers (or devices). Based on the specified rules, each agent periodically probes the server (or device) and sends a triggered alert (that is, a monitoring event) to the central console whenever any monitoring situation is satisfied.
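
The example situation above can be sketched in a few lines of Python to make the roles of the three fields concrete. This is only an illustrative sketch, not the configuration format or API of any monitoring product: the names MonitoringSituation and triggered_alerts are hypothetical, and the code assumes that an alert fires once when the condition has held continuously for at least the configured duration, with samples supplied as (minute, value) pairs in time order.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class MonitoringSituation:
    """Hypothetical container for one monitoring situation."""
    variable: str        # predefined variable name, e.g. "CPU_UTILIZATION"
    threshold: float     # threshold in percent, e.g. 90.0
    duration_min: int    # minutes the condition must persist (assumed semantics)
    severity: str        # severity level attached to the alert


def triggered_alerts(
    situation: MonitoringSituation,
    samples: Iterable[Tuple[int, float]],
) -> List[Tuple[int, str]]:
    """Return (minute, severity) pairs at which the situation fires.

    Assumption: one alert per continuous breach, raised at the first
    probe where the breach has lasted at least duration_min minutes.
    """
    alerts: List[Tuple[int, str]] = []
    breach_start = None   # minute at which the current breach began
    fired = False         # whether this breach has already raised an alert
    for minute, value in samples:
        if value > situation.threshold:
            if breach_start is None:
                breach_start = minute
                fired = False
            if not fired and minute - breach_start >= situation.duration_min:
                alerts.append((minute, situation.severity))
                fired = True
        else:
            # Condition no longer holds, so the breach is over.
            breach_start = None
            fired = False
    return alerts


# The situation from the text: CPU_UTILIZATION > 90% for 20 minutes, Warning.
cpu_situation = MonitoringSituation("CPU_UTILIZATION", 90.0, 20, "Warning")
```

Under these assumptions, a brief spike above 90% that subsides before 20 minutes have passed raises no alert, which is one way a duration setting can reduce tickets for harmless conditions.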
