Unreliable Failure Detectors for Mobile Ad-hoc Networks

Unreliable Failure Detectors for Mobile Ad-hoc Networks

Luciana Arantes, Fabíola Greve, Pierre Sens
DOI: 10.4018/978-1-60960-042-6.ch063
(Individual Chapters)
No Current Special Offers


Failure detection is an important abstraction for the development of fault-tolerant middleware, such as group communication toolkits, replication and transaction services. Unreliable failure detector (namely, FD) can be seen as an oracle which provides information about process failures. The dynamics and self-organization of Mobile Ad-hoc Networks (MANETs) introduce new restrictions and challenges for the implementation of FDs with which static traditional networks do not have to cope. It is worth mentioning that in some way, fault tolerance is more critical for MANETs than for the latter, since wireless network can present high error rates and mobile nodes are more prone to failures, physical damages or transient disconnections. The aim of this chapter is thus to discuss the impact of all these characteristics, intrinsic to MANET, in the implementation of FDs. It presents a survey of the few works about FD implementations for wireless networks, including the different possible assumptions to overcome the dynamics and lack of both global view and synchrony of MANETs.
Chapter Preview


The distributed computing scenario is rapidly evolving for integrating unstructured, self-organizing and dynamic systems, such as peer-to-peer, wireless sensor and mobile ad-hoc networks. Nonetheless, the issue of designing reliable services which can support the high dynamics of these systems is a challenge. Current large-scale distributed applications are usually built on top of failure-prone asynchronous distributed systems, i.e., systems where there are no bounds for communication delays, neither for process speeds and nodes can crash. The design of fault-tolerant applications on top of such systems is a very difficult task (or even impossible) due to the difficulty of correctly distinguish a crashed process from a process which is slow or with which communication is slow. There are even some fundamental problems, such as the consensus, which is impossible to be solved deterministically in a pure asynchronous distributed system where nodes can crash (Fischer, M. J., Lynch, M. J., & Paterson, M. S., 1985). Roughly, a consensus is an agreement problem which allows a set of processes to agree on a common proposal output value. To circumvent such difficulties and impossibilities, Chandra and Toueg (Chandra, T. D., & Toueg, S., 1996) have introduced the concept of unreliable failure detectors.

Unreliable failure detectors (namely, FDs) can be seen as oracles which provide information about process failures. Each process has access to a local failure detector module which, when queried, returns a list of processes that it currently suspects of having crashed. A local failure detector is unreliable in the sense that it may not suspect a crashed process or suspect a correct one. In other words, it may erroneously add to its list a process which is actually correct. However, if the detector later believes that suspecting this process is a mistake, it then removes the process from its list. FDs are abstractly characterized by two properties: completeness and accuracy. Two kinds of completeness and four kinds of accuracy are defined in (Chandra, T. D., & Toueg, S., 1996), which once combined yield eight classes of failure detectors. All of them can be used to solve the above mentioned consensus and many other agreement problems in a crash-prone asynchronous distributed system.

Many papers (Chandra, T. D., & Toueg, S., 1996), (Bertier, M., Marin, O., & Sens P., 2002), (Larrea, M., Arévalo, S., & Fernandez, A., 1999), (Gupta, I., Chandra, T. D., & Goldszmidt, G. S., 2001) in the literature have proposed to implement some or all of the classes of FDs. Nevertheless, the majority of them consider fully connected static networks with reliable links where nodes fail by crashing. The number of crashes is usually bounded. Both the initial number of processes in the system and the identity of processes are known by all processes. In addition, some synchrony, such as eventually timely links or relative difference in link latencies, is usually added to the system. However, all these assumptions are not actually suitable for mobile ad hoc networks (MANETs) which are characterized as extremely dynamic systems where connections between nodes frequently change due to different reasons such as arbitrary failures, lack of node energy, limited bandwidth, disconnections, node arrivals and departures, mobility of nodes, etc. Furthermore, in MANETs, nodes do not have a global knowledge of the system and the number of participant nodes is unknown. The network is not fully connected and a node can only send messages to nodes that are within its transmission range. Hence, it may happen that a message sent by a node should be routed through a set of intermediate nodes until reaching the destination node. Therefore, the dynamics and self-organization of MANETs introduce new restrictions and challenges for the implementation of FDs to which static traditional networks do not face.

Key Terms in this Chapter

Query-Response Communication: It is an abstraction that can be used to implement timer free (or asynchronous) failure detector in which a process sends a “request” to a set of processes and waits for k “replies”. When the bound k is reached, processes which do not reply are added in the suspected list.

Heartbeat Failure Detector: It is an implementation of FD where every process q periodically sends an “I am alive” message to the processes in charge of detecting its failure. If a process p does not receive such a message from q after the expiration of a timeout, it adds q to its list of suspected processes.

Lease Failure Detector: A process q sends to processes an “I am alive” message and in addition, sends a request for a lease for some duration d . Some time before d expires, q must send a request for lease renewal to all the processes. Otherwise, if a process does not receive any message from q after d , it will put q in its suspected list.

Partial Synchronous System: A distributed system is partial synchronous if there are bounds on transmission delay, clock drift and processing time and these bounds are unknown.

Pinging Failure Detector: It is an implementation of FD where every process p periodically monitors a set of processes by sending them “Are you alive?” messages. Upon reception of such messages, the monitored process replies with an “I am alive” message. If process p times out on a process q , it adds q to its list of suspected processes.

Consensus: Consensus is a fundamental agreement problem where each process proposes an initial value and all correct processes (those not crashed) must agree on a single value.

Impossibility of FLP: Fischer, Lynch and Paterson (FLP) shown in 1985 that consensus is impossible to be solved deterministically in an asynchronous distributed system with reliable channels and prone to a single process crash.

Unreliable Failure Detector (FD): It is an oracle which provides information about process failures. A FD is unreliable in the sense that it can make mistakes, that is, it may not suspect a crashed process or suspect a correct one.

Complete Chapter List

Search this Book: