Driving Big Data with Hadoop Technologies

Siddesh G. M., Srinidhi Hiriyannaiah, K. G. Srinivasa
DOI: 10.4018/978-1-4666-5864-6.ch010


The Internet has driven the computing world from a few gigabytes of information to terabytes and petabytes. These volumes of information come from a variety of sources and span structured to unstructured data formats. The information must be updated quickly and be available on demand on inexpensive infrastructure. Data that spans the three Vs, namely Volume, Variety, and Velocity, is called Big Data. The challenge is to store and process this Big Data, run analytics on it, make critical decisions based on the results, and obtain the best outcomes. In this chapter, the authors discuss the capabilities of Big Data, its uses, and the processing of Big Data using Hadoop technologies and tools from the Apache Software Foundation.
Chapter Preview

2. What Is Big Data?

The world of computing is driven by data, and data can change the way we perceive and interact with the world. The data generated by computing devices are generally stored in conventional databases. These data have to be processed and analyzed effectively, which is critical for decisions on introducing new products, generating quarterly reports, maintaining customer relationships, managing finances, and thus understanding the world (LaValle, Hopkins, Lesser, Shockley, & Kruschwitz, 2010). In the telecom industry, call data records need to be analyzed to ensure quality of service for customers (Schroeck, Shockley, Smart, Romero-Morales, & Tufano, 2012; Banerjee, 2011). Another example is the online retail industry, which tracks each browsing click by customers to make smarter shipping and inventory decisions (Schroeck et al., 2012). The banking sector needs to track both customer and financial details to understand how money is managed and transferred (Hickins, 2013). Since the Internet has reached all around the globe, data are generated from many sources, such as blogs, social networking sites, videos, business transactions, traffic-flow sensors, and GPS information from satellites; data with these characteristics are termed Big Data (Schroeck et al., 2012). Big Data spans three basic characteristics, namely volume, variety, and velocity, commonly called the 3 Vs, which provide a better view of the different aspects of Big Data and of the platforms available to exploit them.

Key Terms in this Chapter

Hadoop Distributed File System (HDFS): HDFS is a distributed file system, the primary storage of Hadoop, that allows computations to be carried out in parallel using the MapReduce paradigm.
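
HDFS stores each file as a sequence of fixed-size blocks replicated across nodes. The following is a minimal Python sketch of that block-splitting idea only (a toy block size of 128 bytes stands in for the real default, which is on the order of 64 MB or 128 MB depending on the Hadoop release; function and variable names are illustrative, not Hadoop APIs):

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file's bytes into fixed-size blocks, as HDFS does on write.

    The last block may be shorter than block_size, just as in HDFS.
    """
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 300-byte "file" split with a toy 128-byte block size.
file_bytes = b"x" * 300
blocks = split_into_blocks(file_bytes, block_size=128)
print([len(b) for b in blocks])  # [128, 128, 44]
```

In the real system, each of these blocks would additionally be replicated (three copies by default) across different DataNodes for fault tolerance.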

Big Data: Data that spans the three Vs, namely Volume, Variety, and Velocity.

Sqoop: A command-line interface application that imports data directly from relational database (RDBMS) systems into the Hadoop file system.
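
As a hedged illustration of what such an import looks like, the snippet below assembles a typical Sqoop invocation (shown here as a Python argument list; the JDBC URL, table name, and target directory are hypothetical, while `--connect`, `--table`, and `--target-dir` are standard Sqoop import options):

```python
# Hypothetical connection details; only the sqoop flags themselves are standard.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",  # JDBC URL of the source RDBMS (assumed)
    "--table", "orders",                       # source table to import (assumed)
    "--target-dir", "/user/hadoop/orders",     # destination directory in HDFS (assumed)
]
print(" ".join(cmd))
```

On a configured cluster this command would run MapReduce jobs under the hood to copy the table's rows into files under the HDFS target directory.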

Pig: A platform on top of Hadoop that helps in analyzing large data sets stored in the Hadoop file system.

Hive: Hive is a data warehousing solution built on top of Hadoop, queried with the HiveQL query language.

HBase: HBase is a distributed, column-oriented, non-relational database that runs on top of HDFS.

MapReduce: MapReduce is a framework and programming model for carrying out tasks in parallel across a large cluster of computers.
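
The classic word-count example can sketch the model in plain Python (a single-process simulation of the map, shuffle, and reduce phases; the function names are illustrative, not Hadoop APIs):

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) pairs for every word in one input split."""
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key (here, sum the counts)."""
    return key, sum(values)

documents = ["big data needs big storage", "hadoop stores big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # 3
```

In a real Hadoop job the map and reduce functions run on different machines and the shuffle moves data over the network; this sketch only shows the data flow between the phases.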

Hadoop: Hadoop is a platform that helps in storing and accessing Big Data across clusters of systems.
