Hadoop History and Architecture

Copyright: © 2019 | Pages: 13
DOI: 10.4018/978-1-5225-3790-8.ch003

Abstract

As the name indicates, this chapter explains the evolution of Hadoop. Doug Cutting started a full-text search library called Lucene. After Lucene moved to the Apache Software Foundation, he built a web crawler called Apache Nutch on top of it. The Google File System was then taken as a reference for the Nutch Distributed File System, and Google's MapReduce model was later integrated, which led to Hadoop. The chapter illustrates the whole path from Lucene to Apache Hadoop. It also explains the different versions of Hadoop, the procedure for downloading the software, and the mechanism for verifying the download. The architecture of Hadoop is then detailed: a Hadoop cluster is a set of commodity machines grouped together, and the arrangement of these machines in different racks is shown. After reading this chapter, the reader will understand how Hadoop evolved and how its architecture is organized.

Hadoop History

Doug Cutting started writing the first version of Lucene in 1997. Lucene is a full-text search library. In 2000, he open sourced Lucene under the GPL license, and many people started using it. The following year (i.e., 2001), Lucene was moved to the Apache Software Foundation. By the end of 2001, Doug Cutting had started indexing web pages, and University of Washington student Mike Cafarella joined him in this work. The resulting product was called “Apache Nutch”. Nutch is a web crawler that follows links from page to page and uses Lucene to index the contents of each page to make it searchable. Installed on a single machine, it achieved an indexing rate of 100 pages per second. To improve the performance of Nutch, Doug Cutting and Mike Cafarella used four machines. Space allocation and data exchange between these four machines had to be done manually, which made operating the system very complex.

They went on to try to build a scalable search engine with a reliable, fault-tolerant, and schemaless design. In 2003, Google published a paper on the Google File System (GFS), a scalable distributed file system. Taking GFS as a reference, Cutting and Cafarella started an implementation in Java. They named the new file system the Nutch Distributed File System (NDFS). NDFS presents a cluster of nodes as a single reliable file system, making the operational complexity transparent to users. It also handles system failures without user intervention.

Dean and Ghemawat (2004) of Google published a paper on MapReduce, a programming model well suited to data processing on large clusters. Considering its simplicity and power, Cutting integrated MapReduce into Nutch in 2005. In February 2006, Cutting pulled NDFS and MapReduce out of Nutch and started a new project under Lucene. He named the new system Hadoop and released it as open source.
