Scalable Fault Tolerance for Large-Scale Parallel and Distributed Computing

Scalable Fault Tolerance for Large-Scale Parallel and Distributed Computing

Zizhong Chen (Colorado School of Mines, USA)
Copyright: © 2010 |Pages: 24
DOI: 10.4018/978-1-60566-661-7.ch033
OnDemand PDF Download:
$30.00
List Price: $37.50

Abstract

Today’s long running scientific applications typically tolerate failures by checkpoint/restart in which all process states of an application are saved into stable storage periodically. However, as the number of processors in a system increases, the amount of data that need to be saved into stable storage also increases linearly. Therefore, the classical checkpoint/restart approach has a potential scalability problem for large parallel systems. In this chapter, we introduce some scalable techniques to tolerate a small number of process failures in large parallel and distributed computing. We present several encoding strategies for diskless checkpointing to improve the scalability of the technique. We introduce the algorithm-based checkpoint-free fault tolerance technique to tolerate fail-stop failures without checkpoint or rollback recovery. Coding approaches and floating-point erasure correcting codes are also introduced to help applications to survive multiple simultaneous process failures. The introduced techniques are scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. Experimental results demonstrate that the introduced techniques are highly scalable.
Chapter Preview
Top

Background

Current parallel programming paradigms for high-performance distributed computing systems are typically based on the Message-Passing Interface (MPI) specification (Message Passing Interface Forum,1994). However, the current MPI specification does not specify the behavior of an MPI implementation when one or more process failures occur during runtime. MPI gives the user the choice between two possibilities on how to handle failures. The first one, which is the default mode of MPI, is to immediately abort all survival processes of the application. The second possibility is just slightly more flexible, handing control back to the user application without guaranteeing that any further communication can occur.

FT-MPI Overview

FT-MPI (Fagg, Gabriel, Losilca, Angskun, Chen, Pjesivac-Grbovic, et al., 2004) is a fault tolerant version of MPI that is able to provide basic system services to support fault survivable applications. FT-MPI implements the complete MPI-1.2 specification and parts of the MPI-2 functionality, and extends some of the semantics of MPI to support self-healing applications. FT-MPI is able to survive the failure of n − 1 processes in an n-process job, and, if required, can re-spawn the failed processes. However, fault tolerant applications have to be implemented in a self-healing way so that they can survive failures. Although FT-MPI provides basic system services to support self-healing applications, prevailing benchmarks show that the performance of FT-MPI is comparable (Fagg, Gabriel, Bosilca, Angskun, Chen, Pjesivac-Grbovic, et al., 2005) to the current state-of-the-art non-fault-tolerant MPI implementations.

Key Terms in this Chapter

Parallel and Distributed Computing: Parallel and distributed computing is a sub-field of computer science that handles computing involving more than one processing unit.

Checkpointing: Checkpointing is a type of techniques to incorporate fault tolerance into a system.

Message Passing Interface: Message Passing Interface is a specification for an API that allows different processes to communicate with one another.

Fault Tolerance: Fault tolerance is the property of a system that enables it to continue operating properly after a failure occurred in the system.

Pipeline: A pipeline is a set of data processing elements connected in series so that the output of one element is the input of the next one.

Fail-Stop Failure: Fail-stop failure is a type of failures that cause the component of a system experiencing this type of failure stops operating.

Erasure Correction Codes: An erasure correction code transforms a message of n blocks into a message with more than n blocks, such that the original message can be recovered from a subset of those blocks.

Complete Chapter List

Search this Book:
Reset