A Predictive Map Task Scheduler for Optimizing Data Locality in MapReduce Clusters

Mohamed Merabet, Sidi Mohamed Benslimane, Mahmoud Barhamgi, Christine Bonnet
Copyright: © 2018 | Pages: 14
DOI: 10.4018/IJGHPC.2018100101

Data locality is becoming one of the most critical factors affecting the performance of MapReduce clusters, because network bisection bandwidth becomes a bottleneck. The task scheduler assigns the most appropriate map tasks to nodes. If map tasks are scheduled on nodes that do not hold their input data, these tasks issue remote I/O operations to copy the data to the local node, which increases their execution time. In that case, a prefetching mechanism can be useful to preload the needed input data before the tasks are launched. The key challenge is therefore how to predict the execution time of map tasks accurately, so that data prefetching can be used effectively without any data access delay. This article proposes a Predictive Map Task Scheduler that assigns the most suitable map tasks to nodes ahead of time. It uses a linear regression model for prediction and a data-locality-based algorithm for task scheduling. The experimental results show that the method can greatly improve both the data locality and the execution time of map tasks.

2. Architecture of MapReduce

MapReduce is a programming model for large-scale, data-intensive distributed data processing. It divides execution into two phases: map and reduce. In the map phase, many map tasks process data blocks independently. After all map tasks have finished, the reduce phase begins. The intermediate results of the map tasks are shuffled, sorted, and processed in parallel by one or more reduce tasks.
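The two phases can be illustrated with a minimal, single-process sketch in Python. It is not Hadoop code: the word-count job and the names `map_task`, `shuffle_and_sort`, `reduce_task`, and `run_job` are assumptions chosen only to show the data flow from map, through shuffle and sort, to reduce.

```python
from collections import defaultdict

def map_task(block):
    """Map function (hypothetical word count): emit (word, 1) per word in one block."""
    return [(word, 1) for word in block.split()]

def shuffle_and_sort(map_outputs):
    """Group intermediate pairs by key and return the groups in key order."""
    groups = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())

def reduce_task(key, values):
    """Reduce function: sum all counts collected for one key."""
    return key, sum(values)

def run_job(blocks):
    # Map phase: each block is processed independently of the others.
    map_outputs = [map_task(block) for block in blocks]
    # The reduce phase starts only after every map task has finished.
    grouped = shuffle_and_sort(map_outputs)
    return dict(reduce_task(key, values) for key, values in grouped)

print(run_job(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

In a real cluster the map tasks run in parallel on different nodes and the shuffle moves data over the network; the sketch keeps everything in one process so the ordering of the phases is easy to follow.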

A user submits a job comprising a map function and a reduce function, which are subsequently transformed into map and reduce tasks scheduled on slots hosted by the participating nodes in the cluster. HDFS partitions the input data into fixed, equal-size splits and distributes the splits across the cluster nodes. Each split is assigned a map task.
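The relationship between splits, nodes, and map tasks can be sketched as follows. This is an illustration under stated assumptions, not the article's scheduler or HDFS's actual placement policy: the round-robin placement, the single copy per split, and the function names `make_splits`, `place_splits`, and `pick_map_task` are all hypothetical.

```python
def make_splits(total_bytes, split_bytes):
    """Cut the input into fixed-size (offset, length) splits; the last may be shorter."""
    return [(offset, min(split_bytes, total_bytes - offset))
            for offset in range(0, total_bytes, split_bytes)]

def place_splits(splits, nodes):
    """Assumed round-robin placement: split i (and map task i) lives on one node."""
    return {i: nodes[i % len(nodes)] for i in range(len(splits))}

def pick_map_task(node, pending, placement):
    """Prefer a pending task whose split is stored on `node`.

    Returns (task_id, is_local). A non-local pick means the task would
    have to read its split through remote I/O.
    """
    for task_id in pending:
        if placement[task_id] == node:
            return task_id, True
    return (pending[0], False) if pending else (None, False)

splits = make_splits(300, 128)               # three splits: 128, 128, 44 bytes
placement = place_splits(splits, ["n1", "n2"])
print(pick_map_task("n2", [0, 1, 2], placement))  # (1, True)
print(pick_map_task("n3", [0, 2], placement))     # (0, False)
```

The second call shows the case that hurts performance: node `n3` has a free slot but holds none of the pending splits, so whichever task it receives must fetch its input from another node.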
