Introduction
Large volumes of data are generated every day in a variety of domains such as social networks, health care, finance, telecom, and government sectors. The data these domains generate are voluminous (gigabytes, terabytes, and petabytes), varied (structured, semi-structured, or unstructured), and ever increasing at an unprecedented pace (Jain & Bhatnagar, 2016; Manogaran & Lopez, 2017). Big Data is thus the term applied to such large data sets, whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time (Bihl et al., 2016; White, 2012). Processing such large volumes of data and retrieving usable information is a computationally strenuous job, which has led to the use of Hadoop to analyze and gain insights from the data (Baumgarten et al., 2013). The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models (https://hadoop.apache.org/; Narayanapppa et al., 2016; Sammer, 2012). Local storage and computation are achieved through the two core components of Hadoop, namely the Hadoop Distributed File System (HDFS) and MapReduce (MR). HDFS is a distributed file system designed to store massive data reliably and to stream that data with high bandwidth (Shvachko et al., 2010). By optimizing how HDFS stores data and schedules computation, queries can be answered sooner.
Hadoop follows a master-slave architecture with one Name-Node and multiple Data-Nodes. Whenever a file is pushed into HDFS for storage, it is split into a number of data blocks of a desired size, which are placed randomly across the available Data-Nodes. When a query is executed, meta-information about the locality of the required blocks is obtained from the Name-Node, and the query is then executed on the Data-Nodes where those blocks are located. The most important feature of Hadoop is this movement of the computation to the data, rather than the other way around (Dean & Ghemawat, 2008). Hence the position of data across the Data-Nodes plays a significant role in efficient query processing. We therefore focus on finding an innovative data placement strategy so that queries are solved in the earliest possible time, enabling quick decisions as well as maximum utilization of resources. The real value of analyzing Big Data lies in accelerating the time-to-answer, especially for streaming data, where an immediate response is much desired for taking better decisions (Lee et al., 2014; Wang et al., 2014; Yuan et al., 2010).
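The block-splitting and placement behavior described above can be sketched in a few lines of Python. This is a simplified model with hypothetical node names and a default-style replication factor of three; real HDFS placement is additionally rack-aware and is driven by the NameNode, not by client code:

```python
import random

def place_blocks(file_size_mb, block_size_mb, datanodes, replication=3):
    """Split a file into fixed-size blocks and assign each block's
    replicas to distinct, randomly chosen Data-Nodes (a simplified
    model of HDFS placement, ignoring rack awareness)."""
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # Each replica of a block must land on a different node.
        replicas = random.sample(datanodes, min(replication, len(datanodes)))
        placement[block_id] = replicas
    return placement

# A 640 MB file with 128 MB blocks yields 5 blocks spread over 5 nodes.
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
layout = place_blocks(file_size_mb=640, block_size_mb=128, datanodes=nodes)
for block_id, replicas in layout.items():
    print(block_id, replicas)
```

Because placement is random, which node holds which block varies from run to run; this is exactly why a query may find some of its required blocks on remote nodes, motivating the placement strategy studied here.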
Figure 1. Need for an optimal data placement - Illustration
The time taken to execute a query and return the results grows rapidly as the data size increases, leading to longer waiting times for the user. The complexity of query execution is influenced by the volume of data, the amount of data the query requests, and the type and complexity of the data. Wait times can range from minutes to hours, to days, and even to weeks in the worst cases. One major reason for slow execution is the non-availability of the required blocks locally, so that the data has to be transferred across the network for execution, leading to increased execution time as shown in Figure 1.
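The cost of non-local blocks can be made concrete with a toy cost model. The throughput figures below (disk and network bandwidth in MB/s) are hypothetical assumptions chosen only to illustrate the effect, not measurements from the source:

```python
def query_time(blocks_needed, local_blocks, disk_mbps=200.0,
               network_mbps=50.0, block_size_mb=128):
    """Toy cost model: every block is read from disk, but non-local
    blocks must additionally cross the (slower) network.
    Bandwidth figures are illustrative assumptions."""
    remote_blocks = blocks_needed - local_blocks
    read_time = blocks_needed * block_size_mb / disk_mbps
    transfer_time = remote_blocks * block_size_mb / network_mbps
    return read_time + transfer_time

# All 10 required blocks local vs. only 5 local:
all_local = query_time(blocks_needed=10, local_blocks=10)   # 6.4 s
half_local = query_time(blocks_needed=10, local_blocks=5)   # 19.2 s
```

Even in this crude model, halving data locality triples the query time, which is the motivation for seeking a data placement strategy that maximizes the fraction of blocks available locally.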