Big Data Processing: Application of Parallel Processing Technique to Big Data by Using MapReduce

Abhishek Mukherjee (VIT University, India), Chetan Kumar (VIT University, India) and Leonid Datta (VIT University, India)
DOI: 10.4018/978-1-5225-3643-7.ch009


This chapter describes MapReduce, a programming model for distributed, parallel computation over very large datasets. Because MapReduce jobs run readily on commodity servers, the model reduces server maintenance costs and removes the need for dedicated servers to run these processes. The chapter covers various approaches to the MapReduce programming model and how to use it efficiently for scalable text-based analysis in domains such as machine learning, data analytics, and data science. It also discusses how to apply MapReduce techniques effectively in these fields and how to fit the MapReduce programming model to any text mining application.
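As a rough illustration of the programming model described above, the following is a minimal single-process sketch of a word count, the usual introductory text-analysis task. It is not taken from the chapter: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would run the map and reduce steps on many commodity servers rather than in one Python process.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (illustrative only)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big compute"]))
```

Each map call only needs its own document and each reduce call only needs the values for its own key, which is what would allow a framework to spread the work across many machines.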
Chapter Preview

Main Focus Of The Chapter

The main focus of this chapter is the modern systems used to process large volumes of information. To keep up with and process all this information in a timely manner, new technologies have been developed and old ones have been improved upon. Massively parallel processing, or MPP (Metropolis, N., 1986), and MapReduce are such technologies.

MPP coordinates the processing of huge datasets across many processors. This enables fast execution of complex queries against large data warehouses. The technology arose primarily because huge amounts of data were inundating applications that were not designed for such volumes. The need to process this data for analysis pushed designers to build very fast processing systems. Without the methods MPP employs, a query may take a very long time to finish, making modern business intelligence systems and data warehouses less than useful. MPP is at the heart of many kinds of big data solutions and has advanced considerably as a critical technology. Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse solution and a popular cloud-based data warehousing service (Gupta, A., Agarwal, D., Tan, D., Kulesza, J., Pathak, R., Stefani, S., & Srinivasan, V., 2015), uses an MPP architecture to achieve extremely fast query execution. MPP is one of the five key performance enablers of Redshift, alongside columnar data storage, data compression, query optimization, and compiled code.
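The coordinated processing described above can be sketched, under heavy simplification, as a scatter-gather aggregation: the rows are split into slices, each worker aggregates its own slice, and a coordinator merges the partial results. This sketch is an assumption-laden illustration and not how Redshift or any particular MPP system is implemented. The names `partition`, `partial_aggregate`, `merge`, and `parallel_sum_by_region` are hypothetical, the `(region, amount)` rows are made-up sample data, and threads on one machine stand in for the separate processors or nodes of a real system.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_workers):
    """Split the rows round-robin into one slice per worker."""
    return [rows[i::n_workers] for i in range(n_workers)]


def partial_aggregate(rows):
    """Per-worker step: sum the amount for each region in this slice only."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals


def merge(partials):
    """Coordinator step: combine the per-worker partial sums."""
    merged = {}
    for partial in partials:
        for region, subtotal in partial.items():
            merged[region] = merged.get(region, 0) + subtotal
    return merged


def parallel_sum_by_region(rows, n_workers=4):
    """Scatter the rows to workers, aggregate in parallel, gather the results."""
    slices = partition(rows, n_workers)
    # Threads are a stand-in here; an MPP system would use separate nodes.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_aggregate, slices))
    return merge(partials)


if __name__ == "__main__":
    sample = [("north", 10), ("south", 5), ("north", 7), ("east", 3)]
    print(parallel_sum_by_region(sample, n_workers=2))
```

Because each worker touches only its own slice, adding workers shortens the per-worker scan, while the final merge stays small relative to the raw data.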
