Hadoop Framework for Handling Big Data Needs

Rupali Ahuja (University of Delhi, India)
DOI: 10.4018/978-1-5225-3142-5.ch004


The data generated today has outgrown the storage and computing capabilities of traditional software frameworks. Large volumes of data, if aggregated and analyzed properly, can yield useful insights that help predict human behavior, increase revenue, acquire and retain customers, improve operations, combat crime, cure diseases, and more. In short, the results of effective Big Data analysis provide actionable intelligence for both human and machine consumption. New tools, techniques, technologies, and methods are being developed to store, retrieve, manage, aggregate, correlate, and analyze Big Data. Hadoop is a popular software framework for handling Big Data needs; it provides a distributed framework for the processing and storage of large datasets. This chapter discusses the Hadoop framework in detail, covering its features, applications, popular distributions, and its storage and visualization tools.
Chapter Preview


The foundation of Hadoop was laid by Doug Cutting and Mike Cafarella at Yahoo Incorporated (Wikipedia, 2015b) in 2005. Hadoop was originally developed to support web indexing in the Nutch search engine project. The main components of Hadoop, i.e., MapReduce and HDFS, are inspired by Google's research papers. MapReduce is a programming model developed by Google in the early 2000s for indexing the Web. HDFS is derived from the Google File System (GFS). HBase, the database component of Hadoop, is inspired by Google's Bigtable. Currently, Hadoop is an open source project of the Apache Software Foundation (Apache Hadoop, 2015) and is being continuously improved and enhanced by thousands of contributors worldwide. Top IT giants such as Yahoo, Facebook, Google, Microsoft, eBay, and EMC use Hadoop to handle their Big Data needs. Hadoop is a Java-based framework and requires the Java Runtime Environment for its execution. Yahoo has more than 100,000 CPUs in over 40,000 servers running Hadoop, with its biggest Hadoop cluster running 4,500 nodes (Assay, 2014). Figure 1 depicts the Hadoop timeline.

Key Terms in this Chapter

Cloudera: A software company that provides commercial support and services to enterprise Hadoop users.

Hortonworks: A software company that develops and supports an open source big-data platform based on Apache Hadoop.

Hadoop: An open source project of the Apache Software Foundation consisting of a software framework for storing, processing, and analyzing large data sets.

MapReduce: The data processing framework of Hadoop, which performs data-intensive computation on large data sets by dividing tasks across several machines and finally combining their results.
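
The classic illustration of this divide-and-combine model is word counting. The following is a minimal, single-process Python sketch of the two phases; in a real Hadoop job these would be Mapper and Reducer classes (typically written in Java), and the framework itself would distribute the map tasks and shuffle the intermediate pairs between machines:

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit an intermediate (word, 1) pair for every word."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce step: sum all values that share the same key."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data", "big hadoop data"]
result = reduce_phase(map_phase(docs))
# result == {"big": 2, "data": 2, "hadoop": 1}
```

The key property, which Hadoop exploits for parallelism, is that map calls are independent of one another and reduce only needs the pairs grouped by key.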

HDFS (Hadoop Distributed File System): The distributed, scalable storage system of the Hadoop framework, which stores large quantities of data in a distributed fashion across clusters of commodity hardware.
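
To illustrate how HDFS lays files out, the sketch below splits a file into fixed-size blocks, which HDFS then replicates across machines. The 128 MB block size mirrors a common HDFS default, but it is configurable per cluster; the function itself is a conceptual illustration, not part of any Hadoop API:

```python
def split_into_blocks(file_size, block_size=128 * 1024**2):
    """Return the byte sizes of the HDFS blocks a file would occupy.

    A file is stored as a sequence of full blocks plus one final,
    possibly smaller, block holding the remainder.
    """
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block;
# each block is then replicated (3 copies by default) on different nodes.
blocks = split_into_blocks(300 * 1024**2)
```

Storing data as independently replicated blocks is what lets HDFS tolerate the failure of individual commodity machines without data loss.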

MapR: A software company that develops and distributes software derived from Apache Hadoop.

Sqoop: An Apache Software Foundation project for transferring data between Hadoop and external data stores such as relational databases, data warehouses, and mainframes.

HBase: A distributed, column-oriented database of the Hadoop ecosystem.

Hive: A data warehouse infrastructure built on top of HDFS that allows data to be queried using an SQL-like language called HiveQL.

Ambari: A Hadoop ecosystem project that provisions, manages, administers, and monitors Hadoop clusters.
