Apache Hadoop is an open source framework for storing and processing massive amounts of data. At its core, Hadoop performs distributed computing across a cluster of computers built from commodity hardware. This chapter covers the single-node and multinode setup of a Hadoop environment, along with the Hadoop user commands and administration commands. Hadoop has two components: the Hadoop Distributed File System (HDFS) for storage and MapReduce/YARN for processing. Single-node processing can be done in standalone or pseudo-distributed mode, whereas multinode processing uses cluster mode. The execution procedure for each environment is briefly stated. The chapter then explores the Hadoop user commands for operations such as copying files to and from the distributed file system, running a jar, creating an archive, and displaying the version and classpath. Finally, the Hadoop administration commands manage the configuration, including functions such as balancing the cluster and running the dfs admin, MapReduce admin, namenode, and secondary namenode.
Single Node Setup
A single-node Hadoop installation runs HDFS and YARN on a Linux environment (White, 2015).
The first prerequisite for installing Hadoop is Java. To verify whether Java is already installed, run the following at the command prompt:
$ java -version
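As a minimal sketch, the check above could be scripted as follows. It assumes a POSIX-style shell; the function names parse_java_version and check_java are illustrative and are not part of Java or Hadoop. Because java -version writes its report to standard error, the sketch redirects that stream before parsing it.

```shell
# Extract the quoted version string from the first line of `java -version`
# output, e.g. 'java version "1.7.0_80"' yields 1.7.0_80.
parse_java_version() {
  sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Report whether a Java runtime is available on the PATH.
# (check_java is a hypothetical helper name used only for this example.)
check_java() {
  if command -v java >/dev/null 2>&1; then
    echo "Java found: $(java -version 2>&1 | parse_java_version)"
  else
    echo "Java not found; install a JDK before setting up Hadoop"
  fi
}

check_java
```

The sketch only reports the version string; deciding whether that version is recent enough for a given Hadoop release is left to the reader.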
If a recent version is reported, the Java runtime environment will support Hadoop. If not, Java has to be installed on the system. To make Java available in the local user environment, set the Java path by adding the following lines to the ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0
export PATH=$PATH:$JAVA_HOME/bin
The changes can now be applied to the current working environment with:
$ source ~/.bashrc
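After sourcing the file, it can be worth confirming that JAVA_HOME actually points at a Java installation. The following is a minimal sketch under the assumption that a valid Java home contains an executable bin/java; the helper name java_home_ok is hypothetical.

```shell
# Succeed if the given directory looks like a Java home,
# i.e. it is non-empty and contains an executable bin/java.
# (java_home_ok is an illustrative name, not a standard command.)
java_home_ok() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if java_home_ok "${JAVA_HOME:-}"; then
  echo "JAVA_HOME is set to $JAVA_HOME"
else
  echo "JAVA_HOME is unset or does not contain bin/java"
fi
```

If the second message appears, the path written to ~/.bashrc most likely does not match the directory where the JDK was installed.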
The second prerequisite is to configure SSH. Hadoop uses SSH to start and stop the daemons distributed across the cluster, so SSH must be set up to allow password-less login between the Hadoop machines in the cluster. This is achieved with a public/private key pair that authenticates the user, with the public key shared across the cluster. First, check whether ssh can connect to localhost without a passphrase:
$ ssh localhost
If ssh to localhost asks for a passphrase, generate a key pair with an empty passphrase using the following command.
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
The new key is then authorized for SSH access to the machine using the commands:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Finally, the SSH server configuration can be checked in the file /etc/ssh/sshd_config.
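The key-authorization step above appends the public key every time it is run, which can leave duplicate entries in authorized_keys. The sketch below shows one possible way to make that step repeatable and to test password-less login without being prompted. The function names authorize_key and passwordless_ok are hypothetical; BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Append a public key to an authorized_keys file only if it is not already
# listed, then restrict the file's permissions to the owner.
# (authorize_key is an illustrative helper, not an OpenSSH command.)
authorize_key() {
  ak_pub="$1"
  ak_auth="$2"
  mkdir -p "$(dirname "$ak_auth")"
  touch "$ak_auth"
  # -F: fixed string, -x: match the whole line, -q: no output
  grep -qxF -- "$(cat "$ak_pub")" "$ak_auth" || cat "$ak_pub" >> "$ak_auth"
  chmod 0600 "$ak_auth"
}

# BatchMode=yes makes ssh fail instead of prompting for a password or
# passphrase, so the exit status shows whether password-less login works.
passwordless_ok() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}
```

A typical use would be authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys followed by passwordless_ok localhost, whose exit status is zero only if the login succeeds without a prompt.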
Hadoop Operation Modes
A Hadoop cluster can run in one of three modes:
- Standalone Mode: Hadoop works in this mode by default, with execution configured as a single Java process.
- Pseudo-Distributed Mode: Distributed processing is simulated on a single machine. Each Hadoop daemon, such as those for HDFS and MapReduce, runs as a separate Java process.
- Fully Distributed/Cluster Mode: Hadoop is fully distributed across a minimum of two machines running as a cluster. The machines work in a master-slave architecture, which is explained below.
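One practical way to see the difference between the modes is to look at the Java processes on a machine. The sketch below counts Hadoop daemon processes in the output of the JDK's jps tool, which prints one "pid MainClass" pair per line. It assumes the daemons appear under their usual class names (NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager); the function name count_hadoop_daemons is hypothetical.

```shell
# Count Hadoop daemon JVMs in jps-style input ("<pid> <MainClass>" per line).
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status at zero in that case.
count_hadoop_daemons() {
  grep -cE '^[0-9]+ (NameNode|SecondaryNameNode|DataNode|ResourceManager|NodeManager)$' || true
}

# Only query the local machine if jps is available on the PATH.
if command -v jps >/dev/null 2>&1; then
  echo "Hadoop daemons running on this host: $(jps | count_hadoop_daemons)"
fi
```

Under these assumptions, a count of zero is what standalone mode would show, since no daemons are started, while a pseudo-distributed machine would list several daemons as separate Java processes on the same host.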