Soft-Checkpointing Based Hybrid Synchronous Checkpointing Protocol for Mobile Distributed Systems

Soft-Checkpointing Based Hybrid Synchronous Checkpointing Protocol for Mobile Distributed Systems

Parveen Kumar, Rachit Garg
DOI: 10.4018/978-1-61350-101-6.ch318
OnDemand:
(Individual Chapters)
Available
$37.50
No Current Special Offers
TOTAL SAVINGS: $37.50

Abstract

Minimum-process coordinated checkpointing is a suitable approach to introduce fault tolerance in mobile distributed systems transparently. In order to balance the checkpointing overhead and the loss of computation on recovery, the authors propose a hybrid checkpointing algorithm, wherein an all-process coordinated checkpoint is taken after the execution of minimum-process coordinated checkpointing algorithm for a fixed number of times. In coordinated checkpointing, if a single process fails to take its checkpoint; all the checkpointing effort goes waste, because, each process has to abort its tentative checkpoint. In order to take the tentative checkpoint, an MH (Mobile Host) needs to transfer large checkpoint data to its local MSS over wireless channels. In this regard, the authors propose that in the first phase, all concerned MHs will take soft checkpoint only. Soft checkpoint is similar to mutable checkpoint. In this case, if some process fails to take checkpoint in the first phase, then MHs need to abort their soft checkpoints only. The effort of taking a soft checkpoint is negligibly small as compared to the tentative one. In the minimum-process coordinated checkpointing algorithm, an effort has been made to minimize the number of useless checkpoints and blocking of processes using probabilistic approach.
Chapter Preview
Top

Background

Mobile Hosts (MHs) are increasingly becoming common in distributed systems due to their availability, cost, and mobile connectivity. They are also considered suitable for effective and efficient disaster management. In case of disaster, the static connectivity may not work; therefore, we have to depend on mobile computing environments in such cases. An MH is a computer that may retain its connectivity with the rest of the distributed system through a wireless network while on move. An MH communicates with the other nodes of the distributed system via a special node called mobile support station (MSS). A “cell” is a geographical area around an MSS in which it can support an MH. An MSS has both wired and wireless links and it acts as an interface between the static network and a part of the mobile network. Static nodes are connected by a high speed wired network (Acharya & Badrinath, 1994).

A checkpoint is a local state of a process saved on the stable storage. In a distributed system, since the processes in the system do not share memory, a global state of the system is defined as a set of local states, one from each process. The state of channels corresponding to a global state is the set of messages sent but not yet received. A global state is said to be “consistent” if it contains no orphan message; i.e., a message whose receive event is recorded, but its send event is lost (Chandy & Lamport, 1985). To recover from a failure, the system restarts its execution from the previous consistent global state saved on the stable storage during fault-free execution. This saves all the computation done up to the last checkpointed state and only the computation done thereafter needs to be redone.

In coordinated or synchronous checkpointing, processes take checkpoints in such a manner that the resulting global state is consistent. Mostly it follows the two-phase commit structure (Chandy & Lamport, 1985). In the first phase, processes take tentative checkpoints, and in the second phase, these are made permanent. The main advantage is that only one permanent checkpoint and at most one tentative checkpoint is required to be stored. In the case of a fault, processes rollback to the last checkpointed state (Elnozahy et al., 2002). The Chandy and Lamport (1985) algorithm is the earliest non-blocking all-process coordinated checkpointing algorithm. In this algorithm, markers are sent along all channels in the network which leads to a message complexity of O(N2), and requires channels to be FIFO.

We have to deal with various issues while designing a checkpointing algorithm for mobile distributed systems (Acharya & Badrinath, 1994; Parkash & Singhal, 1996). These issues are mobility, disconnections, finite power source, vulnerable to physical damage, lack of stable storage etc. Prakash and Singhal (1996) proposed a nonblocking minimum-process coordinated checkpointing protocol for mobile distributed systems. They proposed that a good checkpointing protocol for mobile distributed systems should have low overheads on MHs and wireless channels; and it should avoid awakening of an MH in doze mode operation. The disconnection of an MH should not lead to infinite wait state. The algorithm should be non-intrusive and it should force minimum number of processes to take their local checkpoints. In minimum-process coordinated checkpointing algorithms, some blocking of the processes takes place (Koo & Toueg, 1987) or some useless checkpoints are taken (Cao & Singhal, 2001).

Complete Chapter List

Search this Book:
Reset