The State of the Art and Open Problems in Data Replication in Grid Environments


Mohammad Shorfuzzaman (University of Manitoba, Canada), Rasit Eskicioglu (University of Manitoba, Canada) and Peter Graham (University of Manitoba, Canada)
Copyright: © 2010 |Pages: 31
DOI: 10.4018/978-1-60566-661-7.ch022

Abstract

Data Grids provide services and infrastructure for distributed data-intensive applications that need to access, transfer and modify massive datasets stored at distributed locations around the world. For example, the next-generation of scientific applications such as many in high-energy physics, molecular modeling, and earth sciences will involve large collections of data created from simulations or experiments. The size of these data collections is expected to be of multi-terabyte or even petabyte scale in many applications. Ensuring efficient, reliable, secure and fast access to such large data is hindered by the high latencies of the Internet. The need to manage and access multiple petabytes of data in Grid environments, as well as to ensure data availability and access optimization are challenges that must be addressed. To improve data access efficiency, data can be replicated at multiple locations so that a user can access the data from a site near where it will be processed. In addition to the reduction of data access time, replication in Data Grids also uses network and storage resources more efficiently. In this chapter, the state of current research on data replication and arising challenges for the new generation of data-intensive grid environments are reviewed and open problems are identified. First, fundamental data replication strategies are reviewed which offer high data availability, low bandwidth consumption, increased fault tolerance, and improved scalability of the overall system. Then, specific algorithms for selecting appropriate replicas and maintaining replica consistency are discussed. The impact of data replication on job scheduling performance in Data Grids is also analyzed. A set of appropriate metrics including access latency, bandwidth savings, server load, and storage overhead for use in making critical comparisons of various data replication techniques is also discussed. 
Overall, this chapter provides a comprehensive study of replication techniques in Data Grids that not only serves as a tool for understanding this evolving research area but also provides a reference against which future efforts may be mapped.

Introduction

The popularity of the Internet as well as the availability of powerful computers and high-speed network technologies is changing the way we use computers today. These technology opportunities have also led to the possibility of using distributed computers as a single, unified computing resource, leading to what is popularly known as Grid Computing (Kesselman & Foster, 1998). Grids enable the sharing, selection, and aggregation of a wide variety of resources including supercomputers, storage systems, data sources, and specialized devices that are geographically distributed and owned by different organizations for solving large-scale computational and data intensive problems in science, engineering, and commerce (Venugopal, Buyya, & Ramamohanarao, 2006).

Data Grids deal with providing services and infrastructure for distributed data-intensive applications that need to access, transfer and modify massive datasets stored across distributed storage resources. For example, scientists working in areas as diverse as high energy physics, bioinformatics, and earth observation need to access large amounts of data. These datasets are expected to reach terabyte or even petabyte scale for some applications. Maintaining a local copy of the data at each site that needs it is extremely expensive. Storing such huge amounts of data in a centralized manner is also impractical, due to the greatly increased data access times it would impose. Given the high latency of the wide-area networks that underlie many Grid systems, and the need to access and manage several petabytes of data in Grid environments, data availability and access optimization are key challenges to be addressed.

An important technique to speed up data access in Data Grid systems is to replicate the data at multiple locations, so that a user can access the data from a site in their vicinity (Venugopal et al., 2006). Data replication not only reduces access costs, but also increases data availability for most applications. Experience from parallel and distributed systems design shows that replication provides higher data availability, lower bandwidth consumption, increased fault tolerance, and improved scalability. However, the replication algorithms used in such systems cannot always be directly applied to Data Grid systems, because the wide-area (mostly hierarchical) network structures and special data access patterns of Data Grids differ from those of traditional parallel systems.
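The access-cost benefit described above can be illustrated with a small latency model. The sketch below is purely illustrative: it assumes a simple cost of round-trip time plus transfer time, and the site RTTs and bandwidths are made-up figures, not measurements from any real Grid.

```python
# Hypothetical sketch: estimate file access latency with and without a
# nearby replica, using latency = RTT + size / bandwidth.
# All RTT and bandwidth figures below are illustrative assumptions.

def access_latency(size_mb, rtt_ms, bandwidth_mb_per_s):
    """Time in seconds to fetch a file of size_mb from a given site."""
    return rtt_ms / 1000.0 + size_mb / bandwidth_mb_per_s

# A 1000 MB dataset fetched over a WAN link vs. from a local replica.
remote = access_latency(1000, rtt_ms=150, bandwidth_mb_per_s=10)
local = access_latency(1000, rtt_ms=2, bandwidth_mb_per_s=100)

print(f"remote: {remote:.2f}s, local replica: {local:.2f}s")
```

Even under this crude model, the nearby replica wins by an order of magnitude, which is the intuition behind most of the placement strategies surveyed later in the chapter.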

In this chapter, the state of the current research on data replication and its challenges for the new generation of data-intensive grid environments are reviewed and open problems are discussed. First, different data replication strategies are introduced that offer efficient replica placement in Data Grid systems. Then, various algorithms for selecting appropriate replicas and maintaining replica consistency are discussed. The impact of data replication on job scheduling performance in Data Grids is also investigated.

The main objective of this chapter, therefore, is to provide a basis for categorizing present and future developments in the area of replication in Data Grid systems. This chapter also aims to provide an understanding of the essential concepts of this evolving research area and to identify important and outstanding issues for further investigation.

The remainder of this chapter is organized as follows. First, an overview of the data replication problem is presented, describing the key issues involved in data replication. In the following section, progress made to date in the area of replication in Data Grid systems is reviewed. Following this, a critical comparison of data placement strategies, probably the core issue affecting replication efficiency in Data Grids, is provided. A summary is then given and some open research issues are identified.

Key Terms in this Chapter

Access Latency: Access latency is the time that elapses from when a node sends a request for a file until it receives the complete file.

Data Grids: Data Grids primarily deal with providing services and infrastructure for distributed data-intensive applications that need to access, transfer and modify massive datasets stored in distributed storage resources.

Replica Selection: A replica selection service discovers the available replicas and selects the best replica that matches the user’s location and quality of service (QoS) requirements.
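The definition above can be sketched as a simple selection policy: filter candidate replicas by a QoS latency bound, then prefer the highest available bandwidth. This is a minimal illustration, not any particular Grid middleware's algorithm; the site names, latencies, and bandwidths are assumptions.

```python
# Illustrative replica selection: keep replicas meeting a QoS latency
# bound, then pick the one with the highest available bandwidth.

def select_replica(replicas, max_latency_ms):
    """replicas: list of (site, latency_ms, bandwidth_mbps) tuples."""
    eligible = [r for r in replicas if r[1] <= max_latency_ms]
    if not eligible:
        return None  # no replica satisfies the QoS requirement
    return max(eligible, key=lambda r: r[2])  # prefer highest bandwidth

candidates = [("site_a", 20, 100), ("site_b", 5, 40), ("site_c", 300, 1000)]
best = select_replica(candidates, max_latency_ms=50)
print(best)  # site_c is fastest but violates the latency bound
```

Real selection services combine more criteria (server load, transfer history, user location), but the filter-then-rank structure is common.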

Job Scheduling: Job scheduling assigns incoming jobs to compute nodes in such a way that some evaluative conditions are met, such as the minimization of the overall execution time of the jobs.
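A data-aware variant of the scheduling idea above can be sketched as a greedy assignment: a job's estimated completion cost at a node includes the data transfer time, which drops to zero when the node already holds a replica of the job's input. The node names, queue times, and transfer costs are illustrative assumptions, not a real scheduler's parameters.

```python
# Hedged sketch of data-aware job scheduling: assign each job to the
# node minimizing (queue wait + input transfer time), where transfer
# time is zero if the node already holds a replica of the input file.

def schedule(jobs, nodes):
    """jobs: {job: input_file};
    nodes: {node: (queue_s, replica_set, transfer_s)}.
    Returns {job: node}, chosen greedily per job."""
    assignment = {}
    for job, input_file in jobs.items():
        def cost(node):
            queue_s, replicas, transfer_s = nodes[node]
            move = 0 if input_file in replicas else transfer_s
            return queue_s + move
        assignment[job] = min(nodes, key=cost)
    return assignment

nodes = {"n1": (10, {"f1"}, 60), "n2": (0, set(), 60)}
print(schedule({"j1": "f1"}, nodes))  # n1 wins: its replica saves 60s
```

This illustrates the coupling the chapter analyzes: replica placement decisions directly change which node a scheduler should prefer.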

Replica Placement: The replica placement service is the component of a Data Grid architecture that decides where in the system a file replica should be placed.
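One common family of placement strategies in the literature is popularity-driven: create a new replica at a site once local demand for a file crosses a threshold, subject to storage capacity. The sketch below is a minimal instance of that idea under assumed thresholds and capacities, not the policy of any specific system.

```python
# Minimal popularity-based placement rule: replicate a file at a site
# when its local access count exceeds a threshold and space permits.
# The threshold, counts, and capacities are illustrative assumptions.

def place_replicas(access_counts, capacities, file_size, threshold=10):
    """access_counts: {site: count}; capacities: {site: free_space_mb}.
    Returns the sites where a new replica should be created."""
    return [site for site, count in access_counts.items()
            if count > threshold and capacities.get(site, 0) >= file_size]

counts = {"site_a": 25, "site_b": 3, "site_c": 40}
free = {"site_a": 500, "site_b": 500, "site_c": 50}
print(place_replicas(counts, free, file_size=100))
```

Note that site_c is popular but lacks space, which is exactly the kind of constraint interaction that makes placement an optimization problem rather than a simple rule.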

Replica Consistency: The replica consistency problem deals with the update synchronization of multiple copies (replicas) of a file.
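One simple way to realize the update synchronization described above is version numbering: the master copy carries a version counter, and stale replicas are refreshed when their version lags. This is a deliberately simplified, single-writer sketch; real Grid consistency services must also handle concurrent writers and partial failures.

```python
# Illustrative version-based consistency: replicas whose version lags
# the master are overwritten with the master's data (single-writer,
# pull-style model; a simplification for illustration only).

class Replica:
    def __init__(self, data, version=0):
        self.data = data
        self.version = version

def synchronize(master, replicas):
    """Propagate the master's latest update to any stale replica."""
    for r in replicas:
        if r.version < master.version:
            r.data, r.version = master.data, master.version

master = Replica("v2-data", version=2)
copies = [Replica("v1-data", 1), Replica("v2-data", 2)]
synchronize(master, copies)
print([(r.data, r.version) for r in copies])  # both now at version 2
```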

Replication: Replication speeds up data access in Data Grid systems by storing copies of data at multiple locations, so that a user can access the data from a site in their vicinity.
