Campus Cloud Storage and Preservation: From Distributed File System to Data Sharing Service

Jinlei Jiang (Tsinghua University, China & Research Institute of Tsinghua University in Shenzhen, China), Xiaomeng Huang (Institute for Global Change Studies, China), Yongwei Wu (Tsinghua University, China & Research Institute of Tsinghua University in Shenzhen, China) and Guangwen Yang (Institute for Global Change Studies, China & Tsinghua University, China)
DOI: 10.4018/978-1-4666-2854-0.ch012

Abstract

We are now living in the era of big data. The large volume of data raises many issues related to data storage and management, stimulating the emergence of Cloud storage. Unlike traditional storage systems such as SAN (Storage Area Network) and NAS (Network Attached Storage), Cloud storage is delivered over a network and is easy to scale and easy to manage. With Cloud storage shielding complex technical details such as storage capacity, data location, data availability, reliability and security, users can concentrate on their business rather than on IT (Information Technology) system maintenance. However, developing a Cloud storage system is not an easy task because multiple factors are involved. In this chapter, the authors share their experience in the design and implementation of a Cloud storage system. They detail its key components, namely the distributed file system Carrier and the data sharing service Corsair. A case study is also given on its application at Tsinghua University.

1 Introduction

It is clear that we are now living in the era of big data. According to IDC (2011), the amount of new digital data created reached 1,200 exabytes in 2010, and more and more digital sources keep coming. For example, the new generation of fine-resolution climate models (Kouzes et al., 2009) will produce 8 petabytes per run for the same simulation, a 1000-fold increase in data volume. If all the data generated by the ATLAS experiment at the Large Hadron Collider (LHC) at CERN were recorded, they would fill 100,000 CDs (each holding 640 Mbytes) per second. The Large Synoptic Survey Telescope, scheduled to begin operation in 2016, will collect 140 terabytes of information every five days (Cukier, 2010). This large volume of data raises many issues in terms of storage, management and processing. In this chapter, we focus on data storage and management.

SAN (Storage Area Network) and NAS (Network Attached Storage) are two widely used techniques for storage sharing and management. SAN deploys a dedicated network for multiple servers to access storage devices (e.g., disk arrays and tape libraries) and provides no file abstraction, whereas NAS connects to a network and provides file-based data storage services to other devices on that network. Though SAN and NAS have gained wide adoption, they have inherent deficiencies in meeting these new needs. The SAN solution usually relies on high-end storage and communication devices, so both the one-time cost and the total cost of ownership (TCO) are high. As for the NAS solution, since both the data and the control commands flow through the NAS controller, the controller is apt to become a bottleneck of the whole system. In addition, a single NAS appliance has a capacity limit, and it is difficult to seamlessly combine the storage space of two different NAS appliances. Cloud storage came into being to address these deficiencies.

Unlike SAN and NAS, Cloud storage is delivered over a network and is easy to scale and easy to manage. Please note that the network used in Cloud storage is usually the Internet, whereas the network of SAN is a dedicated one and the network used in NAS is a local area network (LAN). With Cloud storage, users need not care about such complex technical details as storage capacity, data location, data availability, reliability and security. The only thing they need to do is make a contract with a service provider and pay for what they consume. In this way, users can concentrate more on their business rather than on IT (Information Technology) system maintenance. It is due to these features that Cloud storage attracts more and more attention. As a result, more and more Cloud storage services are available today, for example, Amazon S3 (Simple Storage Service), Google online storage (GDrive), Microsoft SkyDrive and Dropbox, to name just a few. In spite of this, it is not an easy task to build a Cloud storage system because multiple factors are involved. These factors include but are not limited to: 1) heterogeneity in storage devices, networks and operating systems (OSes), 2) various workloads to support, and 3) varied requirements on transparency, reliability, scalability, availability, cost and so on. In addition, to the best of our knowledge, the requirement of sharing data among users is not highlighted in current Cloud storage solutions except for GDrive and SkyDrive.

This chapter reports our experience in the design and implementation of a Cloud storage system. It aims to provide an example for people who wish to deliver services of this kind to their users. The rest of the chapter is organized as follows. In the coming section, we give an overview of distributed file systems and some related work, and explain the motivation of our work. Then our self-developed distributed file system Carrier and data sharing service Corsair are detailed in Section 3 and Section 4, respectively. Section 5 shows the application of Carrier and Corsair at Tsinghua University. Section 6 highlights our experience. The chapter ends in Section 7 with some conclusions and directions for future work.
