Survey on Various MapReduce Scheduling Algorithms

Vaibhav Pandey (Punjab Engineering College (Deemed), India) and Poonam Saini (Punjab Engineering College (Deemed), India)
DOI: 10.4018/978-1-5225-8407-0.ch022

Abstract

The advent of social networking and the Internet of Things (IoT) has resulted in exponential growth of data in the last few years. This, in turn, has increased the need to process and analyze such data for better decision making. To meet this need, new architectures for parallel processing have emerged. Hadoop MapReduce (MR) is a programming model that is considered one of the most powerful computation tools for processing data on a cluster of commodity nodes. However, managing such clusters while meeting various quality requirements necessitates efficient MR scheduling. The chapter classifies MR scheduling algorithms according to their applicability and the quality of service (QoS) parameters they address. After the classification, a detailed study of MR schedulers is presented, along with a comparison of the schedulers on various parameters.

MapReduce Background

Hadoop is an open source framework for storing and processing huge data sets on a cluster of commodity hardware (“Apache Hadoop,” 2018). It enables applications to work with petabytes of data across a large number of computationally independent computers. The core of Hadoop consists of two components: the Hadoop Distributed File System (HDFS) and MapReduce (MR).

MapReduce is the main processing component and is often referred to as the heart of Hadoop. It is a framework for executing applications that process large data sets on a cluster of commodity hardware. Both the input and the output of a MapReduce job are sets of key-value pairs. A single master node runs the JobTracker (JT) process, and multiple slave nodes run TaskTracker (TT) processes. The master node is responsible for scheduling tasks on the slave nodes, monitoring those tasks, and handling failures by re-executing the failed tasks. The slave nodes, in turn, follow the instructions of the master node and execute the assigned tasks, which are of two kinds: map and reduce. The map function takes a <key, value> pair as input and produces intermediate <key, value> pairs as output. The intermediate pairs that share the same key are then grouped together and given as input to the reduce function. Finally, the reduce function performs the reduce operation on each group, and its output is appended to a final output file. Figure 1 shows a schematic view of task execution in Hadoop using MapReduce.

Figure 1. Task execution in Hadoop MapReduce
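
The data flow described above (map, grouping of intermediate pairs by key, then reduce) can be illustrated with a minimal single-process sketch in Python. This is not the Hadoop API and involves no JobTracker or TaskTracker; the function names map_words, reduce_counts, and run_job are hypothetical and chosen only for illustration, and word counting is assumed as the example application.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Illustrative sketch only: these names are hypothetical, not Hadoop classes.


def map_words(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Map: take one <key, value> pair (line number, line text)
    and emit an intermediate <word, 1> pair for every word."""
    for word in value.lower().split():
        yield word, 1


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all intermediate values that share one key."""
    return key, sum(values)


def run_job(
    records: Iterable[Tuple[int, str]],
    mapper: Callable[[int, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> List[Tuple[str, int]]:
    """Run map, group-by-key, and reduce sequentially in one process."""
    groups: Dict[str, List[int]] = defaultdict(list)

    # Map phase: each input pair may produce several intermediate pairs.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Grouping step: collect intermediate values by key.
            groups[out_key].append(out_value)

    # Reduce phase: one reduce call per distinct intermediate key.
    # The list returned here stands in for the final output file.
    return [reducer(k, groups[k]) for k in sorted(groups)]


if __name__ == "__main__":
    lines = [(0, "map reduce map"), (1, "reduce map")]
    print(run_job(lines, map_words, reduce_counts))
    # prints [('map', 3), ('reduce', 2)]
```

In an actual Hadoop cluster, the map and reduce calls would run as separate tasks on the slave nodes under the master's scheduling; the sketch only mirrors the order in which the data is transformed.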
