Special Offers
- IGI Global’s New Emerging Topic e-Book Collections
  Acquire highly focused and affordable Cutting-Edge Peer-Reviewed Research Content through a selection of 17 topic-focused e-Book Collections discounted up to 90%, compared to list prices. Collection topics include Artificial Intelligence, Data Science, Language Learning, Marketing and Customer Relations, Sustainability, and many more. Hosted on the InfoSci^® platform, these collections feature no DRM, no additional cost for multi-user licensing, no embargo of content, full-text PDF & HTML format, and more.
  Learn More
- Open Access Book (Free Access) - Encyclopedia of Information Science and Technology, Sixth Edition (ISBN: 9781668473665)
  The Encyclopedia of Information Science and Technology, Sixth Edition) continues the legacy set forth by the first five editions by providing comprehensive coverage and up-to-date definitions of the most important issues, concepts, and trends pertaining to technological advancements and information management within a variety of settings and industries. The entire book is being published under open access.
  Read Now
- Open Access Book (Free Access) - Food Sustainability, Environmental Awareness, and Adaptation and Mitigation Strategies for Developing Countries (ISBN: 9781668456293)
  Food Sustainability, Environmental Awareness, and Adaptation and Mitigation Strategies for Developing Countries provides information on the recent technology, mitigation, and environmental protection that must be applied for food sustainability in developing countries. This book is being published under Platinum Open Access through funding from Diponegoro University, Indonesia.
  Read Now
- Open Access Book (Free Access) - New Models of Higher Education: Unbundled, Rebundled, Customized, and DIY (ISBN: 9781668438091)
  The Walmart Corporation and the Lumina Foundation have provided funding to make New Models of Higher Education: Unbundled, Rebundled, Customized, and DIY fully open access, completely removing any paywall between scholars in education and the latest research on new models for the future of higher education.
  Read Now
- Open Access Book (Free Access) - Handbook of Research on the Global View of Open Access and Scholarly Communications (ISBN: 9781799898054)
  Through a collaboration between IGI Global and the University of North Texas, the Handbook of Research on the Global View of Open Access and Scholarly Communications has been published as fully open access, completely removing any paywall between researchers of any field, and the latest research on the equitable and inclusive nature of Open Access and all of its complications.
  Read Now
Books
- - Books by Subject
  - Business, Administration, & Management
  - Scientific, Technical, & Medical (STM)
  - Education
  - Books by Field
Journals
- - Journals
  - OnDemand Journal Articles
  - Journals by Subject
  - Business, Administration, & Management
  - Scientific, Technical, & Medical (STM)
  - Education
  - Journals by Field
e-Collections
Open Access
- View All Open Access Opportunities
  Search across all of IGI Global’s available open access publishing opportunities to unleash your research potential.
  Find an Open Access Journal for Your Next Manuscript
  Search across all of IGI Global’s available open access publishing opportunities to unleash your research potential.
  Submit an Open Access Book Proposal
  Learn more about open access book publishing and how it can propel your research forward in the field.
  Convert Your Work to Open Access
  Already published? You can convert your work to open access to increase its impact through IGI Global’s Restrospective Open Access Program.
  Utilize Open Access Collection Database
  Open up your research potential by utilizing our open access content or integrating the open access collection into your library
  Consider Open Access Agreements
  For Libraries: consider no-cost or investment-level open access agreements with IGI Global to support your faculty's research endeavors.
  Search Funding Resources
  Looking for additional funding resources to support your open accesss endeavors? View industry resources compiled by our open access team.
  Review Open Access Policies & Ethical Guidelines
  Considering IGI Global to publish your work under open access? Review IGI Global’s open access policies and ethical guidelines
Publish with Us
Resources
- - Instructors
  - Course Adoption
  - Teaching Cases
  - K-12 Online Learning Collection
  - Authors and Editors
  - eEditorial Discovery^® System
  - Peer Review Process
  - Ethics and Malpractice
  - COPE Membership
  - Fair Use Policy
  - Open Access Publishing
  - FAQ
Catalogs
About Us
Newsroom

Towards a Holistic Approach to Fault Management : Wheels Within a Wheel

Moises Goldszmidt, Miroslaw Malek, Simin Nadjm-Tehrani, Priya Narasimhan, Felix Salfner, Paul A. S. Ward, John Wilkes

Source Title: Dependability and Computer Engineering: Concepts for Software-Intensive Systems

DOI: 10.4018/978-1-60960-747-0.ch001

OnDemand:

(Individual Chapters)

Available

$37.50

Current Special Offers

No Current Special Offers

Abstract

Systems with high dependability requirements are increasingly relying on complex on-line fault management systems. Such fault management systems involve a combination of multiple steps – monitoring, data analysis, planning, and execution – that are typically independently developed and optimized. We argue that it is inefficient and ineffective to improve any particular fault management step without taking into account its interactions and dependencies with the rest of the steps. Through six real-life examples, we demonstrate this inefficiency and how it results in systems that either under-perform or are over-budget. We propose a holistic approach to fault management that is aware of all relevant aspects, and explicitly considers the couplings between the different fault management steps. We believe it will produce systems that will better meet cost, performance, and dependability objectives.

Chapter Preview

Top

Introduction

Large, complex systems frequently experience faults, and those faults need to be handled to limit the damage they cause. Fault management is the set of processes used to ensure dependability of the service, i.e., uninterrupted, reliable, secure and correct operation, at reasonable cost. An important sub-goal in modern fault management approaches is to automate as much as possible to reduce human intervention and administration, which is expensive, error-prone and may even be infeasible in large systems. In turn, that requires a clear statement of the objectives and processes used to achieve the desired outcomes.

Fault management systems typically include many steps, which are often carried out sequentially, as shown in Figure 1: system monitoring, analysis of monitored data, planning of recovery strategies, and execution of mitigation actions. Although this taxonomy is a convenient compartmentalization, it is usually ineffective and inefficient to optimize a particular fault management step in isolation, without taking into account its interaction with the rest of the steps and how these interactions affect the overall objectives. That is, rather than asking “how should we improve a particular step?” a better question is “how should we configure the steps to maximize the overall benefit?”

Figure 1.

The loop of actions performed in fault management systems

For example, if the planning step offers only three possible recovery actions – say, to reboot a machine, reimage a machine, or call the operator – it is unnecessary for the analysis step to do anything other than to map the outcome of the monitoring step to one of these three actions. Any further effort in the analysis step is irrelevant as it has no bearing on the overall fault management outcome; indeed, such further effort might complicate the overall system (e.g., introduce bugs), waste cycles at runtime (which may impact availability), and waste developers’ time on meaningless tasks.

We propose a holistic approach to the problem of deciding how much effort to invest where by addressing all four steps (monitoring, analysis, planning and execution) and their influences on each other, keeping in mind the main objectives – cost minimization and high availability. Our approach allows local optimization within the operational envelope of each fault management step, but links the global adaptation of these local optimizations to the (global) business goals of the system, resulting in a highly effective and coordinated combination of the four fault management steps. Therefore, we avoid local optimizations that do not help achieve the overall goal.

In support of our arguments, we present six real-life examples that explore the pitfalls of merely-local optimizations. The price of not having a proper focus on key objectives and ignoring the holistic approach is high and can no longer be neglected. This paper is a call to arms, to energize the community and give the impetus needed to overcome pitfalls of local optimization, and thereby improve the overall effectiveness and cost of systems.

Historically, academic research on automated fault management has been mainly driven by technical challenges, while the focus in industry has emphasized economic issues and customer service agreements. Much academic fault management research has focused on one or more aspects of monitoring, analysis or adaption, omitting considerations of overall business objectives, budget constraints or the total cost of ownership. (There are a few exceptions, such as the International Workshop on Business-Driven IT Management, or BDIM, http://businessdrivenitmanagement.org/).

Research on monitoring solutions deal with the challenge of collecting system data with minimal impact on the target system’s performance. Algorithms for failure prediction, automated diagnosis and root-cause analysis aim at figuring out whether a failure is looming or where a technical defect is located in multi-processor systems that deal with reconfiguring multiple machines consisting of diverse software and hardware platforms. Unfortunately, analyses of quantifiable effectiveness in terms of availability enhancement and cost reduction are rare.

Complete Chapter List

Search this Book:

Reset

MLA

APA

Chicago

Export Reference

Towards a Holistic Approach to Fault Management : Wheels Within a Wheel

Abstract

Introduction

Complete Chapter List