1. Introduction
With the development of cloud computing, cloud storage and related technologies have drawn a great deal of attention from both industry and academia (Hsu, Slagter, & Chung, 2015; Zhao, Qiao, & Raicu, 2015). In cloud computing, the amount of data is growing at an exponential rate (Aye & Thein, 2015). According to an IDC survey, the amount of data in the world doubles every two years; if it continues to grow at this rate, the digital universe will reach 44 ZB by 2020. This growth poses challenges for cloud storage systems in datacenters, which motivated the introduction of deduplication technology. By eliminating duplicate data chunks, deduplication improves storage utilization and reduces data transmission bandwidth (Xu & Tu, 2015; Manogar & Abirami, 2014). In a deduplication system, the data stream is first divided into chunks by an algorithm such as WFD (whole file detection), FSP (fixed-size partition), or CDC (content-defined chunking). A cryptographic hash function such as MD5 or SHA-1 is then applied to each chunk to produce its fingerprint, which determines whether that chunk has already been stored. If the same fingerprint already exists, the chunk is assumed to be identical: the duplicate is eliminated and a reference to the stored copy is recorded, so only one copy of each data chunk exists in the storage system (Li, He, Lin, & Wei, 2016; Yang, Ren, & Ma, 2015). Consequently, deduplication significantly reduces the amount of stored data, improving storage utilization and conserving network bandwidth (Chu, Ilyas, & Koutris, 2016). In addition, it decreases management, energy, and cooling overhead in the datacenter. Deduplication is therefore widely used in storage systems (Meyer & Bolosky, 2012).
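As a concrete illustration of the pipeline described above, the following minimal sketch combines FSP chunking with SHA-1 fingerprinting and reference counting. The `chunk_fixed` and `DedupStore` names are hypothetical helpers for illustration, not components defined in this paper:

```python
import hashlib

def chunk_fixed(data: bytes, chunk_size: int = 4096):
    """FSP: split a byte stream into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

class DedupStore:
    """Illustrative in-memory deduplicating store keyed by SHA-1 fingerprints."""
    def __init__(self):
        self.chunks = {}    # fingerprint -> chunk bytes (stored only once)
        self.refcount = {}  # fingerprint -> number of references to the chunk

    def put(self, chunk: bytes) -> str:
        fp = hashlib.sha1(chunk).hexdigest()
        if fp not in self.chunks:
            # Fingerprint not seen before: store the chunk itself.
            self.chunks[fp] = chunk
        # Duplicate or not, record one more reference to this chunk.
        self.refcount[fp] = self.refcount.get(fp, 0) + 1
        return fp

store = DedupStore()
data = b"A" * 8192 + b"B" * 4096   # two identical 4 KB chunks plus one unique
fps = [store.put(c) for c in chunk_fixed(data)]
print(len(fps), len(store.chunks))  # 3 logical chunks, but only 2 stored
```

The duplicate "A" chunk is detected by its fingerprint and stored only once, while its reference count rises to 2, which is exactly the space saving deduplication exploits.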
With deduplication, only one copy of each data chunk exists in the storage system, so node failures pose a direct threat to chunk availability (Fu, Lee, Feng, Chen, & Xiao, 2016). The availability of every file that depends on a chunk stored on a failed node is greatly reduced, and if a chunk is lost, all files that reference it become unavailable. The importance of a chunk therefore varies with its reference count. Moreover, by the temporal locality of data access, a chunk that has been accessed recently has a high probability of being accessed again soon (Zhao, Qiao, & Raicu, 2015), so a chunk's importance also depends on its access frequency. In addition, assigning different redundancy levels to chunks based on their access frequency can balance the load of the storage system (Zhou, Deng, & Xie, 2014). Once deduplication is adopted, the importance of a chunk thus has a great impact on data availability in the storage system.
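A redundancy policy of the kind sketched here might map a chunk's reference count and access frequency to a replica count. The formula and names below are illustrative assumptions for the sake of the example, not the policy actually proposed in this paper:

```python
import math

def replica_count(ref_count: int, access_freq: float,
                  base: int = 1, max_replicas: int = 4) -> int:
    """Hypothetical policy: chunks that are referenced by more files
    and accessed more often are more important, so they receive a
    higher redundancy level (more replicas), capped at max_replicas."""
    # Reference count contributes logarithmically; access frequency
    # (here a value in [0, 1]) scales the importance multiplicatively.
    importance = math.log2(1 + ref_count) * (1 + access_freq)
    return min(max_replicas, base + int(importance))
```

An unreferenced cold chunk keeps the base replica count, while a hot, widely shared chunk is pushed to the cap; the cap bounds the storage overhead that high-importance chunks can incur.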
Many research efforts aim to guarantee data availability when deduplication is adopted in a storage system, and the main idea is to add redundant information for data chunks. Replication and erasure coding are the two principal methods for increasing data redundancy; both protect against data loss and enhance chunk availability. In this article, we mainly consider replication. Increasing data redundancy improves availability, but it also increases storage overhead. In this paper, we analyze the characteristic information of data chunks and determine the redundancy degree of each chunk according to its importance. This method ensures data availability while optimizing storage utilization; that is, it achieves a tradeoff between storage space efficiency and data availability.
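The replication tradeoff can be made concrete: assuming independent node failures, each additional replica multiplies a chunk's unavailability by the per-node failure probability, while storage cost grows linearly with the replica count. The helper functions below are a sketch under that independence assumption, not part of the paper's model:

```python
def chunk_availability(node_avail: float, replicas: int) -> float:
    """Availability of a chunk stored on `replicas` independent nodes,
    each available with probability `node_avail`: the chunk is lost
    only if every replica's node is down at once."""
    return 1.0 - (1.0 - node_avail) ** replicas

def file_availability(chunk_avails) -> float:
    """A file is available only if all of its (deduplicated) chunks
    are available, so availabilities multiply."""
    prod = 1.0
    for a in chunk_avails:
        prod *= a
    return prod
```

With 90% node availability, one replica gives 0.9 chunk availability, two give 0.99, and three give 0.999: each replica adds the same storage cost but a diminishing availability gain, which is why redundancy degree is worth tuning per chunk rather than fixing globally.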