Recent Developments on Security and Reliability in Large-Scale Data Processing with MapReduce

Christian Esposito, Massimo Ficco
Copyright: © 2016 | Pages: 20
DOI: 10.4018/IJDWM.2016010104

Abstract

The demand to access large volumes of data, distributed across hundreds or thousands of machines, has opened new opportunities in commerce, science, and computing applications. MapReduce is a paradigm that offers a programming model and an associated implementation for processing massive datasets in a parallel fashion, using non-dedicated distributed computing hardware. It has been successfully adopted in several academic and industrial projects for Big Data Analytics. However, since such analytics is increasingly demanded within the context of mission-critical applications, security and reliability in MapReduce frameworks are strongly required in order to manage sensitive information and to obtain the right answer at the right time. In this paper, the authors present the main implementation of the MapReduce programming paradigm, provided by Apache under the name of Hadoop. They illustrate the security and reliability concerns in the context of a large-scale data processing infrastructure. They review the available solutions and their limitations in supporting security and reliability within the context of MapReduce frameworks. The authors conclude by describing the ongoing evolution of such solutions and the open issues for improvement, which could offer challenging research opportunities for academic researchers.
Article Preview

1. Introduction

Big data has become a buzzword for many research projects, since research on this topic is currently encouraged by substantial funding opportunities. As a practical example, big data analytics is one of the Excellent Science priorities of Horizon 2020 on scientific infrastructures and the development of innovative high-value-added services1 for the European Commission, and a recent call, “FP7 - ICT - 2013 - 11 - 4.2 Scalable data analytics”, had a budget of 26 M€ on the topic of “Big Data R&D”. Big data is also attracting a lot of attention within industry since, as shown by several surveys, such as the one conducted by the IAIS in Sankt Augustin2, it can provide companies with strategic competitive advantages, increased sales, higher productivity, and reduced costs (Brynjolfsson et al., 2011). The definition of big data in the Gartner IT glossary3 is the following: “Big data is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”. Indeed, the need to process large amounts of data has caused a revolution in analytics, as stated by Roger Barga of Microsoft in his keynote speech at ACM DEBS 2013. Traditionally, data processing is conducted according to the classic data mining approach, or Batch Processing (Babcock et al., 2002): data need to be stored before analysis takes place. Companies used to have their own data centres to store and process their information of interest. Due to the increasing volume of data to be stored and processed, it is no longer economically affordable for small- and medium-sized companies to maintain their own storage and processing capabilities. The current tendency is to outsource such capabilities to third-party cloud platforms and to adopt the so-called Data as a Service model (Hong-Linh & Dustdar, 2009), which makes data management very cost-effective.

Not only has the ICT infrastructure for data management changed so as to realize the vision of big data analytics, but so has the way data are processed. Besides the traditional batch processing, there is also the “data-in-motion” approach, or Real-Time Processing (Gupta et al., 2012), where analytics is applied as soon as data are produced, while they move from their sources to their intended destinations. Such an approach imposes novel requirements in order to provide the scalability, low latency, and performance needed by big data projects with real-time processing demands. At the moment, IT managers use batch and real-time processing in roughly equal measure; however, as illustrated in a survey of 200 IT professionals about big data analytics conducted by Intel4, demand for real-time processing is expected to increase.

Such a change in the way industry looks at data management is pushing towards a radical rethink of the ICT platforms that support it, leaving behind the old means of analytics, i.e., descriptive analytics based on data warehouses. We are currently witnessing the rapid success of novel data processing platforms based on the MapReduce programming paradigm (Dean & Ghemawat, 2008), which makes it easy to implement programs that process large amounts of data in a parallel fashion on non-dedicated distributed computing hardware. As a concrete example, an open-source implementation of MapReduce, named Apache Hadoop5, has emerged as the leading big data solution, with more than 1,200 people across 80 companies contributing to its development since its emergence in 2005. However, the Gartner Emerging Technologies Hype Cycle of 20136 shows that the area of Big Data is approaching its peak of expectations. This means that such analytics are not yet mature technologies, and Apache Hadoop and similar products are still relatively new; thus, there is a great deal of confusion about their strengths and weaknesses. In fact, several research challenges remain to be completely resolved, and several research opportunities can be taken advantage of. The above-mentioned Intel survey clearly indicates the importance of achieving better analytics than what is possible with the current MapReduce-based platforms such as Hadoop.
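To make the MapReduce programming model concrete, the following is a minimal sketch of the canonical word-count job written against the Hadoop Java API (org.apache.hadoop.mapreduce), essentially the introductory example from the Hadoop documentation: the map function emits a (word, 1) pair for every token in its input split, the framework groups the intermediate pairs by key, and the reduce function sums the counts gathered for each word. Class names and command-line paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that the reducer class also serves as a combiner: partial sums are computed locally on each mapper node before the shuffle phase, which reduces the volume of intermediate data moved across the network.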
