An Optimal Data Placement Strategy for Improving System Performance of Massive Data Applications Using Graph Clustering

S. Vengadeswaran (National Institute of Technology, Tiruchirappalli, India) and S. R. Balasundaram (National Institute of Technology, Tiruchirappalli, India)
Copyright: © 2018 | Pages: 16
DOI: 10.4018/IJACI.2018070102

Abstract

This article describes how the time taken to execute a query and return the results increases exponentially as the data size increases, leading to longer waiting times for the user. Hadoop, with its distributed processing capability, is considered an efficient solution for processing such large data. Hadoop's Default Data Placement Strategy (HDDPS) allocates data blocks randomly across the cluster of nodes without considering any execution parameters. As a result, the blocks required for execution are often not available on the local machine, so the data has to be transferred across the network, which leads to a data locality issue. It is also commonly observed that most data-intensive applications show grouping semantics, so only a part of the Big-Data set is utilized during query execution. Because such execution parameters and grouping behavior are not considered, the default placement performs poorly, with shortcomings such as fewer local map task executions, increased query execution time, and higher query latency. To overcome these issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. First, the user history log is dynamically analyzed to identify the access pattern, which is depicted as a graph. Markov clustering, a graph clustering algorithm, is then applied to identify groupings within the dataset. Next, an Optimal Data Placement Algorithm (ODPA) is proposed based on statistical measures estimated from the clustered graph. This in turn reorganizes the default data layouts in HDFS to achieve improved performance for Big-Data sets in a heterogeneous distributed environment. The proposed strategy is tested on a 15-node cluster placed in a single-rack topology. The results show that it is more efficient for massive datasets, reducing query execution time by 26% and improving data locality by 38% compared to HDDPS.
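
The log-to-graph-to-clusters pipeline summarized in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names `co_access_matrix` and `markov_cluster`, the toy query log, and the parameter values (inflation of 2.0, the convergence tolerance, the membership threshold) are all assumptions. The sketch assumes that each log entry lists the block IDs one query touched, and that the edge weight between two blocks is the number of queries that accessed both.

```python
from itertools import combinations


def co_access_matrix(query_log, num_blocks):
    """Build a symmetric matrix whose entry [a][b] counts queries touching both blocks."""
    m = [[0] * num_blocks for _ in range(num_blocks)]
    for blocks in query_log:
        for a, b in combinations(sorted(set(blocks)), 2):
            m[a][b] += 1
            m[b][a] += 1
    return m


def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain O(n^3) product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(adjacency, inflation=2.0, iterations=50, tol=1e-9):
    """Basic Markov clustering: alternate expansion and inflation until stable."""
    n = len(adjacency)
    # Add self-loops, then turn edge weights into transition probabilities.
    m = normalize_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(iterations):
        expanded = matmul(m, m)  # expansion: two-step random walks
        inflated = normalize_columns(  # inflation: strengthen strong flows
            [[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical log: two queries read blocks 0-2, two other queries read blocks 3-4.
log = [[0, 1, 2], [0, 1, 2], [3, 4], [3, 4]]
print(markov_cluster(co_access_matrix(log, num_blocks=5)))
```

In this toy log the two groups never share a query, so the graph is disconnected and the grouping is unambiguous. Real access logs will contain cross-group edges, and the inflation value then controls how finely the graph is split.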
Article Preview

Introduction

Large volumes of data are generated every day in a variety of domains such as social networks, health care, finance, telecom, and government sectors. The data these domains generate are voluminous (GB, TB, and PB), varied (structured, semi-structured, or unstructured), and increasing at an unprecedented pace (Jain & Bhatnagar, 2016; Manogaran & Lopez, 2017). Big-Data is thus the term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time (Bihl et al., 2016; White, 2012). Processing such large volumes of data and retrieving usable information is computationally strenuous, which has led to the use of Hadoop to analyze and gain insights from the data (Baumgarten et al., 2013). The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models (https://hadoop.apache.org/; Narayanapppa et al., 2016; Sammer, 2012). Local storage and computation are achieved through the two core components of Hadoop, namely the Hadoop Distributed File System (HDFS) and MapReduce (MR). HDFS is a distributed file system designed for storing massive data reliably and streaming it with high bandwidth (Shvachko et al., 2010). By optimizing how data is stored in HDFS and where computation runs, queries can be answered sooner.

Hadoop follows a master-slave architecture with one Name-Node and multiple Data-Nodes. Whenever a file is pushed into HDFS for storage, it is split into a number of data blocks of the desired size, which are placed randomly across the available Data-Nodes. When a query is executed, meta-information about the location of the required blocks is obtained from the Name-Node, and the query is then executed on the Data-Nodes where the required blocks are located. The most important feature of Hadoop is this movement of the computation to the data rather than the other way around (Dean & Ghemawat, 2008). Hence the position of data across the Data-Nodes plays a significant role in efficient query processing. We therefore focus on finding an innovative data placement strategy so that queries are solved at the earliest possible time, enabling quick decisions and maximum utilization of resources. The real value of analyzing Big-Data lies in accelerating the time-to-answer, especially for streaming data, where an immediate response for better decision making is much desired (Lee et al., 2014; Wang et al., 2014; Yuan et al., 2010).
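
As a rough illustration of the block splitting and random placement described above, the following sketch models the Name-Node's block map as a plain dictionary. It is a simplified model and not HDFS code: replication and rack awareness are ignored, and the function names, node names, file size, and block size are hypothetical values chosen for the example.

```python
import random
from collections import defaultdict


def split_into_blocks(file_size_mb, block_size_mb):
    """Number of blocks needed for a file; the last block may be partial."""
    return -(-file_size_mb // block_size_mb)  # ceiling division


def place_randomly(num_blocks, data_nodes, rng):
    """Assign every block to one randomly chosen Data-Node (no replication)."""
    return {block: rng.choice(data_nodes) for block in range(num_blocks)}


def blocks_per_node(placement, required_blocks):
    """Look up, as the Name-Node would, which node holds each required block."""
    counts = defaultdict(int)
    for block in required_blocks:
        counts[placement[block]] += 1
    return dict(counts)


# Hypothetical cluster of four Data-Nodes and a 1000 MB file in 128 MB blocks.
nodes = ["dn1", "dn2", "dn3", "dn4"]
num_blocks = split_into_blocks(1000, 128)
placement = place_randomly(num_blocks, nodes, random.Random(0))

# A query that needs only blocks 0-3 may find them spread unevenly.
print(num_blocks, blocks_per_node(placement, [0, 1, 2, 3]))
```

Because the placement ignores which blocks are used together, the blocks a query needs can end up concentrated on a few nodes or scattered across all of them, which is the situation a grouping-aware strategy aims to avoid.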

Figure 1. Need for an optimal data placement: illustration

The time taken to execute a query and return the results increases exponentially as the data size increases, leading to longer waiting times for the user. The complexity of query execution is influenced by the volume of data, the amount of data requested by the query, the type of data, the complexity of the data, etc. In the worst cases, wait times can range from minutes to hours, days, or even weeks. One of the major reasons for slow execution is that the blocks required for execution are not available locally, so the data has to be transferred across the network, leading to increased execution time, as shown in Figure 1.
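
The effect of placement on local execution can be illustrated with a toy single-wave scheduler. This is a deliberately simplified model, not Hadoop's actual scheduler: it assumes every node offers a fixed number of map slots, one map task per block, and no replication. The function name `count_local_tasks`, the node names, and both placements are hypothetical.

```python
from collections import Counter


def count_local_tasks(block_locations, nodes, slots_per_node):
    """Return (local, remote) task counts for one scheduling wave.

    A task is local when it runs on the node that stores its block. Each
    node can run at most `slots_per_node` tasks, so any further blocks it
    holds must be read over the network by tasks running elsewhere.
    """
    held = Counter(block_locations.values())
    local = sum(min(held[node], slots_per_node) for node in nodes)
    remote = len(block_locations) - local
    return local, remote


nodes = ["dn1", "dn2", "dn3"]

# Six blocks needed by one query, five of them on the same node.
skewed = {0: "dn1", 1: "dn1", 2: "dn1", 3: "dn1", 4: "dn1", 5: "dn2"}
# The same six blocks spread evenly over the three nodes.
balanced = {0: "dn1", 1: "dn1", 2: "dn2", 3: "dn2", 4: "dn3", 5: "dn3"}

print(count_local_tasks(skewed, nodes, slots_per_node=2))    # (3, 3)
print(count_local_tasks(balanced, nodes, slots_per_node=2))  # (6, 0)
```

Under these assumptions the skewed layout forces half of the tasks to fetch their block over the network, while the balanced layout lets every task read locally.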
