1. Introduction
In this age of data explosion, the availability of high-performance, low-cost computers and the rapid advancement of network technologies mean that large-scale distributed cluster systems can be used to process and analyze massive data sets in parallel (DeWitt et al., 2008). MapReduce, introduced by Google (Dean & Ghemawat, 2004), is a distributed programming model that is widely used to process massive volumes of data on large clusters. The core idea behind the MapReduce framework is to let users implement their own map and reduce functions, while the system is responsible for scheduling and synchronizing the map and reduce tasks (Li, Ooi, Özsu, & Wu, 2014).
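As a minimal illustration of this division of labor, the following Python sketch runs a word count in a single process. It is not taken from Google's implementation or from any particular framework: the names `map_word_count`, `reduce_word_count`, and `run_mapreduce` are hypothetical, and the sequential loop in `run_mapreduce` only stands in for the scheduling, shuffling, and synchronization that a real system would perform across many machines.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """User-defined map function: emit (word, 1) for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """User-defined reduce function: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process stand-in for the framework.

    A real system would run the map and reduce tasks in parallel on a
    cluster; here the phases are executed one after the other.
    """
    # Map phase plus shuffle: group every emitted value by its key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: call the user's reduce function once per key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


documents = [("d1", "map reduce map"), ("d2", "reduce shuffle")]
word_counts = run_mapreduce(documents, map_word_count, reduce_word_count)
print(word_counts)
```

Only the two user-defined functions are specific to word counting; under the assumptions of this sketch, a different analysis would replace them and leave `run_mapreduce` unchanged.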
Currently, MapReduce is used in a wide range of applications, including machine learning (Chu et al., 2006; Wolfe, Haghighi, & Klein, 2008), singular value decomposition (Reza & Gunnar, 2014), textual retrieval (Elsayed, Lin, & Oard, 2008), statistical machine translation (Dyer, Cordova, Mont, & Lin, 2008), content pattern analysis (Guo, Tan, Chen, Zhang, & Zhao, 2009), and click-stream sessionization (Friedman, Pawlowski, & Cieslewicz, 2009), among others. Furthermore, the MapReduce framework has been adapted to numerous computing settings, such as volunteer computing environments (Lin et al., 2010), dynamic cloud environments (Marozzo, Talia, & Trunfio, 2012), desktop grids (Tang, Moca, Chevalier, He, & Fedak, 2010), mobile environments (Dou, Kalogeraki, Gunopulos, Mielikainen, & Tuulos, 2010), and multi-core and many-core systems (Chen, Chen, & Zang, 2010).
Before introducing the MapReduce framework, Google used many different implementations to process and compute large data sets. Most of the input data was very large, but the computations were relatively simple. The computations therefore had to be distributed across hundreds of computers to finish in a reasonable time. The success of MapReduce comes from its distinguishing features, including flexibility, scalability, fault tolerance, and efficiency (Li, Ooi, Özsu, & Wu, 2014). In addition, it can process large datasets, and it hides the difficulty of data parallelization, since application developers need only define the parameters that control data distribution and parallelism (Lämmel, 2008).
However, MapReduce has some functional limitations and is unsuitable for particular types of applications (Pavlo et al., 2009; Ordonez, Song, & Garcia-Alvarado, 2010; Stonebraker et al., 2010). Hence, considerable research effort has gone into overcoming the limitations of the MapReduce model (Zahria, Konwinski, Joseph, Katz, & Stoica, 2008; Jiang, Ooi, Shi, & Wu, 2010; Condie et al., 2010; Bu, Howe, Balazinska, & Ernst, 2010; Thusoo et al., 2009; Olston, Reed, Srivastava, Kumar, & Tomkins, 2008; Pike, Doward, Griesemer, & Quinlan, 2005; Chambers et al., 2010).
The goal of this paper is to provide a detailed study of current MapReduce implementations, focusing on their pros and cons. A comparison of these implementations is presented, and a set of open issues and ideas for enhancing the MapReduce framework is provided.