Data Storage, Retrieval and Management

Valentin Cristea (Politehnica University of Bucharest, Romania), Ciprian Dobre (Politehnica University of Bucharest, Romania), Corina Stratan (Politehnica University of Bucharest, Romania) and Florin Pop (Politehnica University of Bucharest, Romania)
DOI: 10.4018/978-1-61520-703-9.ch006


The latest advances in network and distributed system technologies now allow the integration of a vast variety of services with almost unlimited processing power, using large amounts of data. Sharing of resources is often viewed as the key goal of distributed systems, and in this context the sharing of stored data appears as the most important aspect of distributed resource sharing. Scientific applications are the first to take advantage of such environments, as the requirements of current and future high performance computing experiments are pressing in terms of the ever higher volumes of generated data to be stored and managed. While these new environments reveal huge opportunities for large-scale distributed data storage and management, they also raise important technical challenges that need to be addressed. The ability to support persistent storage of data on behalf of users, the consistent distribution of up-to-date data, the reliable replication of fast-changing datasets, and the efficient management of large data transfers are just some of these new challenges. In this chapter we discuss the extent to which the existing distributed computing infrastructure is adequate for supporting the required data storage and management functionalities. We highlight the issues raised by storing data over large distributed environments and discuss recent research efforts dealing with the challenges of data retrieval, replication and fast data transfers. The interaction of data management with other data-sensitive, emerging technologies, such as workflow management, is also addressed.
Chapter Preview

Data Storage

Many approaches to building highly available and incrementally extendable distributed data storage systems have been proposed. Solutions range from distributed storage repositories to massively parallel, high performance storage systems. A large majority of these aim at virtualizing the data space, allowing users to access data on multiple storage systems that may be geographically dispersed. Independent of the technical solutions adopted, the common objective is to build a storage infrastructure able to support intensive computation on large datasets, on the order of petabytes, across widely distributed organizations.

Current storage facilities are developed to address the rapidly advancing needs of scientific communities, while taking advantage of the equally rapid evolution of network technologies in order to provide the most effective solutions with adequate, up-to-date performance. As these systems are architected and operated to guarantee full performance in support of both large-scale data management and real-time traffic, one of the main concerns is the highly demanding set of requirements they are expected to meet. In the following we outline the main challenges addressed by distributed storage systems.

Providing high availability proves to be the main issue in such environments: the storage should remain available, transparently to the users, whenever one or more storage units (disks, servers, tapes, etc.) fail. This translates into the high resilience levels expected from the storage infrastructure, i.e. the failure of a large number of storage units is tolerated without affecting the overall system's availability and consistency. The resilience level is closely coupled to the manner in which the distributed storage system handles corruption of the storage units or even of users: this can take various forms, ranging from hardware faults and software bugs to malicious intrusions or behavior. The term used in the literature for these issues is arbitrary (or Byzantine) faults; if they are not treated accordingly, affected systems can deviate from their implemented behavior. Approaches include the use of fault thresholds for long-term storage with service splitting (Chun et al., 2009), as well as algorithms that combine strong consistency and liveness guarantees with space efficiency (Dobre et al., 2008).
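
To make the notion of a fault threshold more concrete, the following is a minimal Python sketch of a voted read over replicated storage units. It is an illustration only and is not the algorithm of Chun et al. (2009) or Dobre et al. (2008). The function names `min_replicas` and `voted_read` are hypothetical. The sketch assumes that at most `f` replicas are faulty and that no write is in progress during the read. The `3f + 1` figure is the classic bound from Byzantine agreement; concrete storage protocols may require a different number of replicas.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def min_replicas(f: int) -> int:
    """Classic Byzantine agreement bound: 3f + 1 replicas to tolerate f faults.

    Assumption for illustration; real storage protocols may use other bounds.
    """
    if f < 0:
        raise ValueError("fault threshold f must be non-negative")
    return 3 * f + 1


def voted_read(replies: Sequence[Hashable], f: int) -> Optional[Hashable]:
    """Return a value reported by at least f + 1 replicas, or None.

    With at most f faulty replicas, f + 1 matching replies guarantee that
    at least one correct replica vouches for the returned value.
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None


if __name__ == "__main__":
    f = 1
    print("replicas needed:", min_replicas(f))
    # Three correct replicas agree; one faulty replica returns garbage.
    print("read result:", voted_read(["v2", "v2", "v2", "corrupt"], f))
```

The threshold `f + 1` is what makes the read tolerate arbitrary replies: up to `f` faulty units cannot, on their own, push a fabricated value past the vote. If no value reaches the threshold, the sketch returns `None` rather than guessing.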
