A Failure Detection System for Large Scale Distributed Systems


Andrei Lavinia, Ciprian Dobre, Florin Pop, Valentin Cristea
DOI: 10.4018/978-1-4666-2647-8.ch008


Failure detection is a fundamental building block for ensuring fault tolerance in large scale distributed systems. It is also a difficult problem. Resources under heavy loads can be mistaken as being failed. The failure of a network link can be detected by the lack of a response, but this also occurs when a computational resource fails. Although progress has been made, no existing approach provides a system that covers all essential aspects related to a distributed environment. This paper presents a failure detection system based on adaptive, decentralized failure detectors. The system is developed as an independent substrate, working asynchronously and independent of the application flow. It uses a hierarchical protocol, creating a clustering mechanism that ensures a dynamic configuration and traffic optimization. It also uses a gossip strategy for failure detection at local levels to minimize detection time and remove wrong suspicions. Results show that the system scales with the number of monitored resources, while still considering the QoS requirements of both applications and resources.
Chapter Preview

Fault tolerance in large scale distributed systems (LSDS) relies on some form of failure detection system. Such a failure detector typically runs a detection algorithm and communicates with the services it monitors. This model was first proposed, in the form of an "oracle" detection service, by Chandra and Toueg (1996).

The failure detection module is independent of the main application flow. It is responsible for monitoring a subset of the processes within the monitored system and for maintaining a list of those it currently suspects to have crashed. A process can query its local failure detector module at any time to check the status of another process. The list of suspected processes is updated continuously, so that at any time new processes can be added and old ones removed. Such a failure detector is considered unreliable because it is allowed to make mistakes up to a certain degree: a module might erroneously suspect a correct process (a wrong suspicion) or fail to detect a process that has already crashed. At any given time, two failure detector modules may therefore hold different lists of suspected processes.
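The behavior described above can be sketched as a small heartbeat-based detector: each monitored process periodically reports a heartbeat, and a query recomputes the suspect list from the elapsed time since the last report. This is a minimal illustration, not the chapter's implementation; the names (`heartbeat`, `suspects`, `timeout`) and the fixed-timeout policy are assumptions made for the example.

```python
import time


class FailureDetector:
    """Unreliable heartbeat-based failure detector (illustrative sketch).

    A process is suspected if no heartbeat has arrived within `timeout`
    seconds. The detector can be wrong in both directions: a slow but
    correct process may be suspected, and a crashed process is only
    detected after the timeout elapses.
    """

    def __init__(self, timeout=2.0, clock=time.monotonic):
        self.timeout = timeout    # seconds of silence before suspicion
        self.clock = clock        # injectable clock, eases testing
        self.last_seen = {}       # process id -> time of last heartbeat

    def heartbeat(self, pid):
        # Record a heartbeat. A previously suspected process that
        # reports again is implicitly removed from the suspect list,
        # correcting a wrong suspicion.
        self.last_seen[pid] = self.clock()

    def suspects(self):
        # The suspect list is recomputed on every query, so entries can
        # be added or removed at any time between two queries.
        now = self.clock()
        return {pid for pid, t in self.last_seen.items()
                if now - t > self.timeout}
```

A query simply inspects `suspects()`; two detector instances fed different heartbeat streams would naturally return different lists, matching the unreliability noted above.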

