A Context-Based Performance Enhancement Algorithm for Columnar Storage in MapReduce with Hive

Yashvardhan Sharma (Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani, India), Saurabh Verma (Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani, India), Sumit Kumar (Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani, India) and Shivam U. (Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani, India)
Copyright: © 2013 | Pages: 13
DOI: 10.4018/ijcac.2013100104

Abstract

To achieve high reliability and scalability, most large-scale data warehouse systems have adopted a cluster-based architecture. In this context, MapReduce has emerged as a promising architecture for large-scale data warehousing and data analytics on commodity clusters. The MapReduce framework offers several attractive features, such as high fault tolerance, scalability, and the ability to run on a variety of hardware from the low to the high end. These benefits, however, come at the cost of a substantial performance compromise. In this paper, we propose the design of a novel cluster-based data warehouse system, Daenyrys, for data processing on Hadoop, an open source implementation of the MapReduce framework under the umbrella of Apache. Daenyrys is a data management system that can decide on the optimum partitioning scheme for Hadoop's distributed file system (DFS). The optimum partitioning scheme improves the performance of the complete framework, and the choice of scheme depends on the query context. In Daenyrys, columns are formed into optimized groups that provide the basis for partitioning tables vertically. Daenyrys includes an algorithm that monitors the context of current queries and, based on its observations, re-partitions the DFS for better performance and resource utilization. In the proposed system, Hive, a MapReduce-based SQL-like query engine, is supported above the DFS.

2. Background

To carry out data-intensive analysis in a scalable, fault-tolerant and efficient manner in a distributed environment, Google introduced a distributed and parallel programming framework called MapReduce (Condie et al., 2010; Dean & Ghemawat, 2010). The MapReduce framework is attractive because it lets a programmer specify the analytical job, while the translation of that job into subtasks on multiple machines is completely automated. Under the umbrella of Apache, an open source implementation of the MapReduce framework, referred to as Hadoop, is freely available to both commercial and academic users. Given its easy access, Hadoop has become a popular choice for processing the big data produced by web applications and the business industry. Furthermore, owing to the success of Hadoop and MapReduce, there is significant interest in the traditional data warehousing industry in integrating the MapReduce paradigm for large-scale analytical processing of relational data. The two major efforts to provide a declarative interface on top of the Hadoop run-time environment are Pig from Yahoo! and Hive from Facebook (Dittrich et al., 2012; Kaldewey et al., 2012; Pavlo et al., 2009; Songting et al., 2010; Thusoo et al., 2009).
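
The division of labour described above, where the programmer supplies only the job logic and the framework handles everything else, can be illustrated with a minimal single-process sketch in Python. This is not Hadoop's API and not the system proposed in this paper: the function names (`run_job`, `salary_mapper`, `total_reducer`) and the sample rows are hypothetical, and the distribution across machines, scheduling and fault tolerance that a real framework provides are deliberately left out. The job computes a per-department salary total, roughly the kind of grouped aggregation a declarative layer might express as a `GROUP BY` query.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical input rows for illustration: (department, salary).
Row = Tuple[str, int]
Pair = Tuple[str, int]


def map_phase(rows: Iterable[Row],
              mapper: Callable[[Row], Iterator[Pair]]) -> Iterator[Pair]:
    """Apply the user-supplied mapper to every input row."""
    for row in rows:
        yield from mapper(row)


def shuffle(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group intermediate values by key (done by the framework, not the user)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Apply the user-supplied reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(rows: Iterable[Row],
            mapper: Callable[[Row], Iterator[Pair]],
            reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Chain map, shuffle and reduce; a real framework would distribute these."""
    return reduce_phase(shuffle(map_phase(rows, mapper)), reducer)


# --- The only parts the programmer writes for this job ---

def salary_mapper(row: Row) -> Iterator[Pair]:
    """Emit (department, salary) so salaries are grouped by department."""
    dept, salary = row
    yield dept, salary


def total_reducer(key: str, values: List[int]) -> int:
    """Sum all salaries that were grouped under one department."""
    return sum(values)


rows = [("eng", 100), ("ops", 70), ("eng", 120)]
totals = run_job(rows, salary_mapper, total_reducer)
```

Only `salary_mapper` and `total_reducer` are specific to the job; the remaining functions stand in for the machinery that a MapReduce framework automates.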
