A Failure Detection System for Large Scale Distributed Systems

Andrei Lavinia, Ciprian Dobre, Florin Pop, Valentin Cristea

Source Title: International Journal of Distributed Systems and Technologies (IJDST), 2(3)

ISSN: 1947-3532 | EISSN: 1947-3540 | EISBN13: 9781613506639 | DOI: 10.4018/jdst.2011070105

MLA

Lavinia, Andrei, et al. "A Failure Detection System for Large Scale Distributed Systems." IJDST, vol. 2, no. 3, 2011, pp. 64-87. http://doi.org/10.4018/jdst.2011070105

APA

Lavinia, A., Dobre, C., Pop, F., & Cristea, V. (2011). A Failure Detection System for Large Scale Distributed Systems. International Journal of Distributed Systems and Technologies (IJDST), 2(3), 64-87. http://doi.org/10.4018/jdst.2011070105

Chicago

Lavinia, Andrei, et al. "A Failure Detection System for Large Scale Distributed Systems," International Journal of Distributed Systems and Technologies (IJDST) 2, no. 3 (2011): 64-87. http://doi.org/10.4018/jdst.2011070105


Abstract

Failure detection is a fundamental building block for ensuring fault tolerance in large scale distributed systems. It is also a difficult problem: a resource under heavy load can be mistaken for a failed one, and a missing response can indicate either a failed network link or a failed computational resource. Although progress has been made, no existing approach provides a system that covers all essential aspects of a distributed environment. This paper presents a failure detection system based on adaptive, decentralized failure detectors. The system is developed as an independent substrate that works asynchronously and independently of the application flow. It uses a hierarchical protocol that creates a clustering mechanism to ensure dynamic configuration and traffic optimization. It also uses a gossip strategy for failure detection at the local level to minimize detection time and remove wrong suspicions. Results show that the system scales with the number of monitored resources while still taking into account the QoS requirements of both applications and resources.
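The abstract does not describe the gossip protocol in detail, so the following is only a minimal sketch of a generic gossip-style heartbeat detector, not the authors' implementation. It assumes a fixed timeout (`fail_after`) rather than the adaptive detectors the paper describes, and the class and method names (`GossipNode`, `beat`, `merge`, `suspects`) are hypothetical. It illustrates two ideas from the abstract: a peer is suspected when its heartbeat counter stops increasing, and a wrong suspicion is removed as soon as fresher information about that peer arrives.

```python
from dataclasses import dataclass, field


@dataclass
class GossipNode:
    """One member of a local group. All names here are illustrative."""
    name: str
    fail_after: float  # assumed fixed timeout in seconds (not adaptive)
    # peer name -> (highest heartbeat counter seen, local time it last increased)
    table: dict = field(default_factory=dict)

    def beat(self, now: float) -> None:
        """Increment this node's own heartbeat counter."""
        counter, _ = self.table.get(self.name, (0, now))
        self.table[self.name] = (counter + 1, now)

    def merge(self, remote: dict, now: float) -> None:
        """Merge a gossiped table, keeping the higher counter per peer."""
        for peer, (counter, _) in remote.items():
            known, _ = self.table.get(peer, (-1, now))
            if counter > known:
                self.table[peer] = (counter, now)

    def suspects(self, now: float) -> list:
        """Peers whose counter has not increased within fail_after seconds."""
        return sorted(
            peer for peer, (_, seen) in self.table.items()
            if peer != self.name and now - seen > self.fail_after
        )


a = GossipNode("a", fail_after=3.0)
b = GossipNode("b", fail_after=3.0)
c = GossipNode("c", fail_after=3.0)

# Time 0: every node beats; b and c gossip their tables to a.
for node in (a, b, c):
    node.beat(now=0.0)
a.merge(b.table, now=0.0)
a.merge(c.table, now=0.0)

# Times 2 and 4: c is silent (overloaded or failed); b keeps gossiping to a.
for t in (2.0, 4.0):
    a.beat(now=t)
    b.beat(now=t)
    a.merge(b.table, now=t)

suspected = a.suspects(now=4.0)   # c has been silent for more than 3 seconds

# Time 5: c was only slow; its next gossip message clears the suspicion.
c.beat(now=5.0)
a.merge(c.table, now=5.0)
recovered = a.suspects(now=5.0)
```

In this run `suspected` contains only `"c"` and `recovered` is empty. An adaptive detector would replace the constant `fail_after` with a value estimated from observed heartbeat arrival times.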
