Abstract
Within distributed computing platforms, some computing abilities (or services) are offered to clients. To build dynamic applications using such services as basic blocks, a critical prerequisite is to discover those services. Traditional approaches to the service discovery problem have relied upon centralized solutions, which do not scale well on large, unreliable platforms. In this chapter, we first give an overview of the state of the art of service discovery solutions based on peer-to-peer (P2P) technologies, which allow such functionality to remain efficient at large scale. We then focus on one of these approaches: the Distributed Lexicographic Placement Table (DLPT) architecture, which provides particular mechanisms for load balancing and fault tolerance. This solution centers around three key points. First, it calls upon an indexing system structured as a prefix tree, allowing multi-attribute range queries. Second, it allows the mapping of such structures onto heterogeneous and dynamic networks and proposes some load balancing heuristics for this mapping. Third, as our target platform is dynamic and unreliable, we describe its fault-tolerance mechanisms, based on self-stabilization. Finally, we present the software prototype of this architecture and its early experiments.
1. Introduction
Any device in a computational grid provides some computing abilities (software components, scientific computing libraries, binaries …). A device may, for instance, be able to multiply matrices. Offered to the community, these abilities may be called services. Users around the world may need to multiply matrices while being unable to do so locally, for instance because the multiplication program is missing. The user then needs to use such a service remotely. A critical prerequisite is the discovery of such a service. In other words, the user needs information about how to connect to a device providing the sought service (e.g., protocol, address, ports), and how to use it (encoding of the matrices, location of the result …). The notion of service was generalized by the SOA reference model (Newcomer & Lomow, 2004), which is an attempt to define a standard architecture for using business computing entities. SOA is introduced as a paradigm for organizing and exploiting distributed capabilities that may be under the control of different ownership domains. The SOA standard describes a service as a mechanism to enable access to one or more capabilities, where the service is accessed using a prescribed interface and exercised consistent with constraints and policies as specified by the service description. This description is what users are looking for. What is missing is a directory of services that users could consult to find what they need.
The actual implementation and integration of a P2P service discovery system into computational grids requires a set of protocols allowing servers, clients and the service discovery system to communicate. Figure 1 illustrates the whole architecture: servers declare their services (the BLAS library (Dongarra et al., 1990), complex simulations …), while a client may, for example, need a service running under a Linux system and requiring 1 GB of memory.
Figure 1. Integration of a P2P service discovery system into a computational grid: servers declare their services through a registration (application-level) protocol, clients express their needs through queries, and the system sends a response (for instance, the address of a server satisfying the criteria given in the query) through a discovery (application-level) protocol.
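The two interactions in Figure 1 can be illustrated with a minimal Python sketch. It is not the architecture described in this chapter: the directory below is a single in-memory object rather than a P2P system, and every name in it (ServiceRecord, Directory, the field names, the example.org addresses) is hypothetical. It also assumes that the memory requirement is matched against the memory a server advertises, and that 1 GB is expressed as 1024 MB.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """One declared service. All field names are hypothetical."""
    name: str        # service name, e.g. "DGEMM"
    address: str     # where the server can be reached
    os: str          # operating system of the server
    memory_mb: int   # memory advertised by the server, in MB


@dataclass
class Directory:
    """In-memory stand-in for the discovery system (not distributed)."""
    records: list = field(default_factory=list)

    def register(self, record):
        # Registration step: a server declares one service.
        self.records.append(record)

    def discover(self, name, os, min_memory_mb):
        # Discovery step: return the addresses of servers matching the query.
        return [r.address for r in self.records
                if r.name == name
                and r.os == os
                and r.memory_mb >= min_memory_mb]


directory = Directory()
directory.register(ServiceRecord("DGEMM", "node-a.example.org:4000", "Linux", 2048))
directory.register(ServiceRecord("DGEMM", "node-b.example.org:4000", "Linux", 512))
directory.register(ServiceRecord("DGEMM", "node-c.example.org:4000", "Solaris", 4096))

# A client asks for DGEMM on Linux with at least 1 GB (1024 MB) of memory.
matches = directory.discover("DGEMM", os="Linux", min_memory_mb=1024)
print(matches)
```

A single directory object of this kind is exactly the centralized design that scales poorly; the sketch only fixes the vocabulary of registration and discovery.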
One important problem is to make such service discovery possible within the platforms emerging today, which are large (gathering a high number of geographically distributed nodes), heterogeneous (in terms of hardware, operating systems, and network performance), and dynamic (processors are constantly joining and leaving the system). Moreover, in such large environments, the number of services and requests is constantly growing. Under these conditions, maintaining a view of the available services and answering requests becomes a far more challenging problem than maintaining a simple directory. New applications need to perform a set of tasks (basic services) with dependencies (expressed through workflows) and have a set of Quality of Service (QoS) requirements. The service discovery system should be able to answer a query such as:
Services == DGEMM,DTRSM,DGEMV; Memory >= 512; Storage >= 4096
meaning that the application to be built needs the DGEMM, DTRSM and DGEMV routines (from the BLAS library) and that the QoS requirements are at least 512 MB of memory and at least 4096 MB (4 GB) of storage space.
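To make the semantics of such a query concrete, the following Python sketch parses it and evaluates it against a few server descriptions. It is only an illustration under stated assumptions: the query grammar is inferred from the single example above, the server names and their attributes are invented, and the matching is a linear scan rather than the prefix-tree indexing discussed in this chapter. The sketch also assumes that the requested routines may be provided by different servers, each of which must meet the QoS bounds.

```python
import operator

# Comparison symbols accepted by this sketch (inferred from the example query).
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}


def parse_query(text):
    """Split 'Attr OP value; ...' into (attribute, symbol, value) triples."""
    constraints = []
    for clause in text.split(";"):
        clause = clause.strip()
        if not clause:
            continue
        for symbol in (">=", "<=", "=="):
            if symbol in clause:
                attr, value = clause.split(symbol, 1)
                constraints.append((attr.strip(), symbol, value.strip()))
                break
        else:
            raise ValueError(f"no comparison operator in clause: {clause!r}")
    return constraints


def answer(query, servers):
    """For each requested service, list the servers that offer it and
    satisfy every numeric constraint of the query."""
    wanted, numeric = [], []
    for attr, symbol, value in parse_query(query):
        if attr == "Services":
            wanted = [s.strip() for s in value.split(",")]
        else:
            numeric.append((attr, OPS[symbol], int(value)))
    return {
        service: [
            name for name, desc in servers.items()
            if service in desc["Services"]
            and all(op(desc[attr], bound) for attr, op, bound in numeric)
        ]
        for service in wanted
    }


# Hypothetical server descriptions; Memory and Storage are in MB.
servers = {
    "s1": {"Services": {"DGEMM", "DTRSM"}, "Memory": 1024, "Storage": 8192},
    "s2": {"Services": {"DGEMV"}, "Memory": 512, "Storage": 4096},
    "s3": {"Services": {"DGEMM", "DGEMV"}, "Memory": 256, "Storage": 16384},
}

QUERY = "Services == DGEMM,DTRSM,DGEMV; Memory >= 512; Storage >= 4096"
result = answer(QUERY, servers)
print(result)
```

Here s3 is excluded from every answer because it advertises only 256 MB of memory, while s2 qualifies because both bounds are inclusive.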
New infrastructures fail to provide a minimum set of reliable components able to maintain a global view of the platform. Grid software, initially designed assuming such a stable infrastructure, needs to be redesigned to face the nature of emerging platforms. Techniques introduced by the peer-to-peer (P2P) community, which offer purely decentralized ways to share resources on very large and dynamic platforms, appear to be of high interest for the grid community. Peer-to-peer technologies offer robust tools for large-scale content distribution (content possibly being information on available services), while addressing failures (departing processors). In other words, P2P technologies could address the scale and the dynamic nature of computational grids. This convergence was first articulated a few years ago (Iamnitchi & Foster, 2003). The solution on which the remainder of this chapter focuses is a mature example of this convergence.
Key Terms in this Chapter
Grid Computing: Aggregation of heterogeneous and geographically distributed computational resources
Load Balancing: Sharing a workload among several computing devices
Service Discovery: Finding a computing ability corresponding to some needs expressed through a query
Peer-to-Peer Systems: Systems in which each node executes the same software component
Self-Stabilization: General technique for tolerating transient faults
Prefix Trees: Data structures for storing data encoded as alphanumeric strings
Mapping: Distributing the nodes of a logical structure on the network
Fault-Tolerance: Ability to face the dynamic and faulty nature of a platform