Replica Management in Data Intensive Distributed Science Applications

Ann Chervenak (University of Southern California, USA) and Robert Schuler (University of Southern California, USA)
DOI: 10.4018/978-1-61520-971-2.ch009

Abstract

Management of the large data sets produced by data-intensive scientific applications is complicated by the fact that participating institutions are often geographically distributed and separated by distinct administrative domains. A key data management problem in these distributed collaborations has been the creation and maintenance of replicated data sets. This chapter provides an overview of replica management schemes used in large, data-intensive, distributed scientific collaborations. Early replica management strategies focused on the development of robust, highly scalable catalogs for maintaining replica locations. In recent years, more sophisticated, application-specific replica management systems have been developed to support the requirements of scientific Virtual Organizations. These systems have motivated interest in application-independent, policy-driven schemes for replica management that can be tailored to meet the performance and reliability requirements of a range of scientific collaborations. The authors discuss data replication solutions that address the challenges posed by increasingly large data sets and the requirement to run data analyses at geographically distributed sites.

Introduction

In the last decade, a large number of distributed scientific collaborations have generated an ever-increasing amount of application data. These scientific collaborations include projects in high energy physics, gravitational wave physics, neuroscience, earthquake science, astronomy, and many others. Collaborations that span multiple institutions (see Figure 1) are typically called Virtual Organizations (Foster, Kesselman et al. 2001). Management of the large data sets produced by these data-intensive scientific applications is complicated by the fact that participating institutions are often geographically distributed and separated by distinct administrative domains. A key data management problem in these distributed collaborations has been the creation and maintenance of replicated data sets.

Figure 1.

Distributed collaboration or virtual organization

Data sets are replicated among the institutions of a scientific collaboration for a variety of reasons. One goal of replication is to provide high availability: if one site suffers a hardware failure, or if one region is affected by a network outage or a natural disaster, copies of the data sets can still be accessed at other locations. Another goal of replication is to improve performance by allowing a data set to be accessed at multiple locations. A client fetching the desired data can choose a particular site based on resource characteristics, such as the available bandwidth of storage systems or networks. Alternatively, clients can access portions of a file at multiple replica sites in parallel to increase the aggregate bandwidth available for downloading the data. Another reason for replicating data is scientists' desire to have copies of essential data sets at their home institutions. Finally, data sets may be replicated to improve the performance of scientific computations or workflows running on particular computational resources, such as a supercomputer or compute cluster; in this case, the data sets needed for a computation are frequently replicated or staged to storage resources on or near the computational site.
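The bandwidth-based selection described above can be sketched as follows. This is a minimal, illustrative example: the site names, bandwidth figures, and the select_replica helper are our own assumptions, not part of any real replica management API.

```python
# Hypothetical sketch of bandwidth-based replica selection among several
# registered copies of the same data set. All names and figures are illustrative.

def select_replica(replicas):
    """Pick the replica whose site currently reports the highest bandwidth."""
    return max(replicas, key=lambda r: r["bandwidth_mbps"])

replicas = [
    {"site": "site-a", "url": "gsiftp://site-a/data/run42.dat", "bandwidth_mbps": 120},
    {"site": "site-b", "url": "gsiftp://site-b/data/run42.dat", "bandwidth_mbps": 450},
    {"site": "site-c", "url": "gsiftp://site-c/data/run42.dat", "bandwidth_mbps": 300},
]

best = select_replica(replicas)
print(best["site"])  # the site with the most available bandwidth
```

A real deployment would obtain the bandwidth estimates from a monitoring service rather than static values, and might instead split the transfer across several of these sites in parallel to aggregate their bandwidth.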

Scientific data sets are frequently read-only, but in some cases, they may be updated to reflect corrections or recalibrations of data sets. When an original data set is updated, its replicas must also be updated using a consistency scheme agreed upon by the Virtual Organization. This may include a versioning scheme that keeps all versions of a data item and gives them increasing version numbers as data items are updated.
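The versioning scheme just described, in which every version of a data item is retained under an increasing version number, can be sketched as a small catalog. The class and method names here are illustrative assumptions, not the interface of any particular system.

```python
# Minimal sketch of a versioning consistency scheme: each update to a logical
# data item is retained and assigned the next version number, so replicas can
# be reconciled against a well-defined latest version. Names are illustrative.

class VersionedCatalog:
    def __init__(self):
        self._versions = {}  # logical name -> list of (version, location)

    def update(self, name, location):
        """Record a new version of a data item and return its version number."""
        entries = self._versions.setdefault(name, [])
        version = len(entries) + 1
        entries.append((version, location))
        return version

    def latest(self, name):
        """Return the (version, location) pair of the most recent version."""
        return self._versions[name][-1]

catalog = VersionedCatalog()
catalog.update("calib.dat", "gsiftp://site-a/calib.v1")
v = catalog.update("calib.dat", "gsiftp://site-a/calib.v2")
print(v, catalog.latest("calib.dat"))
```

Because earlier versions are never overwritten, computations that began against version 1 remain reproducible after the recalibrated version 2 is published.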

In this chapter, we trace the evolution of replica management techniques for distributed scientific collaborations, from an early focus on creating scalable catalogs for registration and discovery of read-only replicas of data items to more sophisticated, application-specific tools for distributed replication and finally to general tools for policy-driven replica management.


Systems For Cataloguing And Discovery Of Replicas

Early approaches to replica management in distributed scientific environments focused on developing scalable catalogs that record the locations of replicas and allow them to be discovered. These efforts include the Replica Location Service (RLS) framework that was jointly developed by the Globus project and the European DataGrid (EDG) project (Chervenak 2002). This framework was the basis for two RLS implementations by the Globus and EDG (Kunszt 2003) teams. We describe the implementation and usage of the Globus Replica Location Service below.
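At its core, a replica location catalog of this kind maintains mappings from logical file names to the physical locations of their replicas, supporting registration and discovery. The following sketch shows that mapping in miniature; the class and method names are our own, not the Globus RLS API.

```python
# Illustrative sketch of the logical-to-physical mapping an RLS-style replica
# catalog maintains: clients register replicas under a logical name and later
# discover all registered locations. Class and URL names are assumptions.

from collections import defaultdict

class ReplicaCatalog:
    def __init__(self):
        self._mappings = defaultdict(set)  # logical name -> physical replica URLs

    def register(self, logical_name, physical_url):
        """Register a new physical replica of a logical file."""
        self._mappings[logical_name].add(physical_url)

    def lookup(self, logical_name):
        """Discover all registered replica locations for a logical file."""
        return sorted(self._mappings[logical_name])

rc = ReplicaCatalog()
rc.register("lfn://experiment/run42.dat", "gsiftp://site-a/data/run42.dat")
rc.register("lfn://experiment/run42.dat", "gsiftp://site-b/data/run42.dat")
print(rc.lookup("lfn://experiment/run42.dat"))
```

The actual RLS framework distributes this function for scalability, with local catalogs at each site and index servers that aggregate their contents; the single in-memory dictionary above stands in for that distributed structure.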

Other early approaches to replica management in distributed environments include the Storage Resource Broker (Rajasekar 2003) and GridFarm (Tatebe 2003) systems, which manage replicas using a centralized metadata catalog. Unlike RLS, these catalogs maintain logical metadata that describes the content of the data in addition to attributes of physical replicas; these catalogs are also used to maintain replica consistency. The LHC Computing Grid (LCG) and Enabling Grids for E-sciencE (EGEE) projects at CERN have implemented file catalogs that combine replica location functionality with hierarchical file system metadata management (Baud, Casey et al. 2005; Kunszt, Badino et al. 2005; Munro and Koblitz 2006).
