Flash-Based Storage Deduplication Techniques: A Survey

Ilya A. Chernov (Institute of Applied Mathematical Research, KRC of RAS, Petrozavodsk State University, Petrozavodsk, Russia), Evgeny Ivashko (Institute of Applied Mathematical Research, KRC of RAS, Petrozavodsk State University, Petrozavodsk, Russia), Dmitry Kositsyn (Petrozavodsk State University, Petrozavodsk, Russia), Vadim Ponomarev (Petrozavodsk State University, Petrozavodsk, Russia), Alexander Rumyantsev (Institute of Applied Mathematical Research, KRC of RAS, Petrozavodsk State University, Petrozavodsk, Russia) and Anton Shabaev (Petrozavodsk State University, Petrozavodsk, Russia)
DOI: 10.4018/IJERTCS.2019070103

Abstract

The exponential growth of the amount of data stored worldwide, together with a high level of data redundancy, motivates the active development of data deduplication techniques. The increasing popularity of solid-state drives (SSDs) as primary storage devices forces the adaptation of deduplication techniques to the technical peculiarities of this type of storage (such as write amplification and wear-out), which drives active research in the subdomain of data deduplication for SSD-equipped storage. In this survey, the authors summarize recent results on deduplication in SSD-enhanced storage and provide a novel taxonomy of the techniques. They classify the techniques by storage device complexity, from the sub-device level up to the storage network. Linux deduplication implementations are discussed, and the results of an experimental comparison of several widely used tools are presented. Finally, the authors briefly outline open problems in the field and possible directions for future research.

2. General Deduplication Process

The aim of deduplication is to reduce data redundancy. The idea is to store only unique data pieces and to replace each duplicate with a reference to the existing copy. A general deduplication process consists of four stages (see Figure 1).
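The idea of keeping each unique piece once and handing out references can be illustrated with a minimal sketch. It assumes that a SHA-256 digest serves as the reference; the class name `DedupStore` and its methods are hypothetical and are not taken from any of the surveyed systems.

```python
import hashlib


class DedupStore:
    """Toy store that keeps each unique data piece only once (illustrative)."""

    def __init__(self):
        # Maps a fingerprint (assumed here: SHA-256 hex digest) to the data.
        self.pieces = {}

    def put(self, piece: bytes) -> str:
        """Store a piece if it is new and return a reference to it."""
        fingerprint = hashlib.sha256(piece).hexdigest()
        # A duplicate piece is not stored again; only the reference is reused.
        self.pieces.setdefault(fingerprint, piece)
        return fingerprint

    def get(self, reference: str) -> bytes:
        """Resolve a reference back to the stored data."""
        return self.pieces[reference]


store = DedupStore()
refs = [store.put(p) for p in (b"alpha", b"beta", b"alpha")]
restored = b"".join(store.get(r) for r in refs)
```

In this sketch three pieces are written, but only two are physically kept, and the original sequence can still be rebuilt from the list of references.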

It starts with splitting the input data stream (e.g., a file) into data pieces called chunks. For file-level deduplication this stage is omitted, but chunk-level deduplication achieves a better deduplication ratio (Xia et al., 2016). A dynamic chunk size provides a higher deduplication ratio than a fixed one. In SSD-based storage systems, the chunk size usually equals the SSD page size (4 KB), which improves write performance, garbage collection, and wear levelling. However, several chunk-size options are considered, e.g.:

  • A page (Seo et al., 2015): page-sized chunking is the basis for a novel deduplicating Flash Translation Layer (FTL) scheme;

  • Several pages (Lee et al., 2011): different chunk sizes are compared in terms of performance;

  • A block (Ha et al., 2013): a new “block-level content-aware chunking” deduplication scheme is proposed to extend the SSD lifetime.
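Page-sized chunking can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme of any cited paper: `PAGE_SIZE`, `split_into_pages`, and `dedup_ratio` are hypothetical names, SHA-256 is an assumed fingerprint, and the deduplication ratio is taken here as the number of chunks divided by the number of unique chunks.

```python
import hashlib

PAGE_SIZE = 4096  # assumed SSD page size in bytes (4 KB, as in the text)


def split_into_pages(data: bytes, page_size: int = PAGE_SIZE) -> list:
    """Fixed-size chunking: cut the stream at page boundaries."""
    return [data[i:i + page_size] for i in range(0, len(data), page_size)]


def dedup_ratio(data: bytes, page_size: int = PAGE_SIZE) -> float:
    """Total chunks divided by unique chunks (1.0 means no duplicates)."""
    chunks = split_into_pages(data, page_size)
    if not chunks:
        return 1.0
    unique = {hashlib.sha256(chunk).digest() for chunk in chunks}
    return len(chunks) / len(unique)


page_a = b"A" * PAGE_SIZE
page_b = b"B" * PAGE_SIZE
stream = page_a + page_b + page_a + page_a

aligned_ratio = dedup_ratio(stream)          # pages line up with chunk boundaries
shifted_ratio = dedup_ratio(b"x" + stream)   # one inserted byte shifts every boundary
```

In this toy stream the aligned data contains two unique pages out of four, while inserting a single byte at the front shifts all chunk boundaries so that no two chunks match. This sensitivity of fixed boundaries to shifts is one commonly cited reason why a dynamic chunk size tends to give a higher deduplication ratio.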

Figure 1. General deduplication process
