A Survey on MapReduce Implementations

Amer Al-Badarneh (Jordan University of Science & Technology, Irbid, Jordan), Amr Mohammad (Jordan University of Science & Technology, Irbid, Jordan) and Salah Harb (Jordan University of Science & Technology, Irbid, Jordan)
Copyright: © 2016 | Pages: 29
DOI: 10.4018/IJCAC.2016010104


MapReduce, a highly successful platform for parallel data processing, is gaining significant momentum in both academia and industry as the volume of data to capture, transform, and analyse grows rapidly. Although MapReduce is used in many applications to analyse large-scale data sets, scientists and researchers still debate its efficiency, performance, and usability for supporting more classes of applications. This survey presents a comprehensive review of various implementations of the MapReduce framework. The authors first give an overview of the MapReduce programming model. They then describe the technical aspects of the most successful implementations of the MapReduce framework reported in the literature and discuss their main strengths and weaknesses. Finally, the authors compare the MapReduce implementations and discuss open issues and challenges in enhancing MapReduce.

1. Introduction

In this age of data explosion, with high-performance, low-cost computers widely available and network technologies advancing rapidly, large-scale distributed cluster systems can be extended to process and analyze massive data sets in parallel (DeWitt et al., 2008). MapReduce, introduced by Google (Dean & Ghemawat, 2004), is a distributed programming model that is widely used to process massive volumes of data on large clusters. The core idea behind the MapReduce framework is to let users implement their own map and reduce functions, while the system is responsible for scheduling and synchronizing the map and reduce tasks (Li, Ooi, Özsu, & Wu, 2014).
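The division of labour described above can be illustrated with a minimal single-process sketch in Python. It is not taken from any of the surveyed implementations: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and the "system" side is reduced to an in-memory grouping step in place of real scheduling and synchronization across a cluster.

```python
from collections import defaultdict

# User-supplied map function (hypothetical name): emits (word, 1) per word.
def map_words(_doc_id, text):
    for word in text.lower().split():
        yield word, 1

# User-supplied reduce function (hypothetical name): sums the counts of a word.
def reduce_counts(word, counts):
    return word, sum(counts)

# Stand-in for the framework: applies map_fn to every input record,
# groups the intermediate values by key, then applies reduce_fn per key.
# A real framework would distribute these steps over many machines.
def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

documents = [
    ("d1", "the quick brown fox"),
    ("d2", "the lazy dog and the fox"),
]
word_counts = run_mapreduce(documents, map_words, reduce_counts)
print(word_counts)
```

Only `map_words` and `reduce_counts` contain application logic; everything inside `run_mapreduce` is what the framework would take over from the user.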

Currently, MapReduce is used in a wide range of applications, including machine learning (Chu et al., 2006; Wolfe, Haghighi, & Klein, 2008), singular value decomposition (Reza & Gunnar, 2014), textual retrieval (Elsayed, Lin, & Oard, 2008), statistical machine translation (Dyer, Cordova, Mont, & Lin, 2008), content pattern analysis (Guo, Tan, Chen, Zhang, & Zhao, 2009), and click-stream sessionization (Friedman, Pawlowski, & Cieslewicz, 2009), among others. Furthermore, the MapReduce framework has been adapted to numerous computing environments, such as volunteer computing environments (Lin et al., 2010), dynamic cloud environments (Marozzo, Talia, & Trunfio, 2012), desktop grids (Tang, Moca, Chevalier, He, & Fedak, 2010), mobile environments (Dou, Kalogeraki, Gunopulos, Mielikainen, & Tuulos, 2010), and multi-core and many-core systems (Chen, Chen, & Zang, 2010).

Before introducing the MapReduce framework, Google used many different implementations to process and compute large data. Most of the input data was very large, but the computations were relatively simple. The computations therefore had to be distributed across hundreds of computers to finish in a reasonable time. The success of MapReduce comes from its distinguishing features, including flexibility, scalability, fault tolerance, and efficiency (Li, Ooi, Özsu, & Wu, 2014). In addition, it can process large datasets while hiding the difficulty of data parallelization, since application developers need only define the parameters that control data distribution and parallelism (Lämmel, 2008).
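As a rough illustration of what such parameters might look like, the sketch below routes intermediate key/value pairs to a configurable number of reduce partitions using a hash of the key modulo the partition count. The names `partition`, `shuffle`, and `num_reducers` are assumptions made for this example, and CRC32 is used only because it is deterministic across runs; the surveyed implementations may expose different settings.

```python
import zlib
from collections import defaultdict

# Hypothetical partition function: maps a key to one of num_reducers buckets.
# zlib.crc32 is deterministic, unlike Python's built-in hash() for strings.
def partition(key, num_reducers):
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Groups intermediate (key, value) pairs into one dictionary per reducer,
# so that all values of a given key land in the same partition.
def shuffle(pairs, num_reducers):
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets

intermediate = [("fox", 1), ("dog", 1), ("fox", 1), ("cat", 1)]
buckets = shuffle(intermediate, num_reducers=3)

for index, bucket in enumerate(buckets):
    print(index, dict(bucket))
```

Changing `num_reducers` changes how many reduce tasks could run in parallel without touching the map or reduce logic, which is the kind of separation the paragraph above describes.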

However, MapReduce has functional limitations and is unsuitable for particular types of applications (Pavlo et al., 2009; Ordonez, Song, & Garcia-Alvarado, 2010; Stonebraker et al., 2010). Hence, considerable research effort has gone into overcoming the limitations of the MapReduce model (Zaharia, Konwinski, Joseph, Katz, & Stoica, 2008; Jiang, Ooi, Shi, & Wu, 2010; Condie et al., 2010; Bu, Howe, Balazinska, & Ernst, 2010; Thusoo et al., 2009; Olston, Reed, Srivastava, Kumar, & Tomkins, 2008; Pike, Dorward, Griesemer, & Quinlan, 2005; Chambers et al., 2010).

The goal of this paper is to provide a detailed study of the current MapReduce implementations focusing on their pros and cons. A comparison between MapReduce implementations is introduced and a set of open issues and ideas is provided to enhance and improve the MapReduce framework.
