A Predictive Map Task Scheduler for Optimizing Data Locality in MapReduce Clusters


Mohamed Merabet (EEDIS Laboratory, Djillali Liabes University of Sidi Bel Abbes, Sidi Bel Abbes, Algeria), Sidi Mohamed Benslimane (LabRI Laboratory, Ecole Superieure en Informatique, Sidi Bel-Abbes, Algeria), Mahmoud Barhamgi (Claude Bernard Lyon 1 University, Lyon, France) and Christine Bonnet (Université Lyon 1, Lyon, France)
Copyright: © 2018 | Pages: 14
DOI: 10.4018/IJGHPC.2018100101

Abstract

Data locality is becoming one of the most critical factors affecting the performance of MapReduce clusters, as network bisection bandwidth becomes a bottleneck. The task scheduler assigns the most appropriate map tasks to nodes. If map tasks are scheduled to nodes that do not hold their input data, these tasks must issue remote I/O operations to copy the data to the local node, which increases their execution time. In that case, a prefetching mechanism can be useful to preload the needed input data before a task is launched. The key challenge, therefore, is how to accurately predict the execution time of map tasks so that data prefetching can be used effectively, without any data-access delay. This article proposes a Predictive Map Task Scheduler that assigns the most suitable map tasks to nodes ahead of time. It uses a linear regression model for prediction and a data-locality-based algorithm for task scheduling. The experimental results show that the method greatly improves both data locality and the execution time of map tasks.
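The abstract mentions a linear regression model for predicting map task execution time. A minimal sketch of such a predictor is shown below; the choice of input-split size as the sole feature, and the sample profile data, are illustrative assumptions only — the paper's actual model may use different predictors.

```python
import numpy as np

# Hypothetical historical map-task profiles:
# (input split size in MB, observed runtime in seconds).
sizes = np.array([64.0, 64.0, 128.0, 128.0, 256.0])
runtimes = np.array([6.1, 5.9, 11.8, 12.2, 24.5])

# Fit runtime ~ a * size + b by ordinary least squares.
X = np.column_stack([sizes, np.ones_like(sizes)])
(a, b), *_ = np.linalg.lstsq(X, runtimes, rcond=None)

def predict_runtime(size_mb):
    """Predicted execution time (s) for a map task over a split of size_mb."""
    return a * size_mb + b
```

A scheduler can use such predictions to decide whether there is enough time to prefetch a remote split to a node before the task that needs it is launched.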

2. Architecture of MapReduce

MapReduce is a programming model for large-scale, data-intensive distributed processing. It divides execution into two phases: map and reduce. In the map phase, a large number of map tasks process data blocks independently. After all map tasks have finished, the reduce phase begins: the intermediate results of the map tasks are shuffled, sorted, and processed in parallel by one or more reduce tasks.
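The two phases above can be illustrated with the classic word-count example. This is a single-process sketch of the model's dataflow, not the distributed implementation:

```python
from collections import defaultdict
from itertools import chain

# Map phase: each map task processes one input block independently,
# emitting intermediate (key, value) pairs -- here, (word, 1).
def map_task(block):
    return [(word, 1) for word in block.split()]

# Shuffle: group intermediate pairs by key across all map outputs.
def shuffle(map_outputs):
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_outputs):
        groups[key].append(value)
    return groups

# Reduce phase: one reduce call per key, run only after all maps finish.
def reduce_task(key, values):
    return key, sum(values)

blocks = ["to be or", "not to be"]
map_outputs = [map_task(b) for b in blocks]
counts = dict(reduce_task(k, v) for k, v in shuffle(map_outputs).items())
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```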

A user submits a job comprising a map function and a reduce function, which are transformed into map and reduce tasks scheduled on slots hosted by the participating nodes of the cluster. HDFS partitions the input data into fixed, equal-size splits and distributes the splits across the cluster nodes. Each split is assigned to one map task.
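Split assignment can be sketched as follows. The 128 MB split size mirrors a common HDFS block size but is an assumption here; the actual value is configurable per cluster:

```python
SPLIT_SIZE = 128 * 1024 * 1024  # bytes; a common HDFS block size

def input_splits(file_size, split_size=SPLIT_SIZE):
    """Divide an input file into fixed-size splits.

    Returns (offset, length) pairs; each pair becomes one map task.
    """
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A 300 MB input yields three map tasks: two full splits and a remainder.
splits = input_splits(300 * 1024 * 1024)
```

Because one map task is bound to one split, a scheduler that knows where each split's replicas reside can prefer nodes holding the data locally — the locality problem the article addresses.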
