As the name indicates, this chapter explains the additional tools provided by the Hadoop distribution: Hadoop Streaming, Hadoop Archives, DistCp, Rumen, GridMix, and Scheduler Load Simulator. Hadoop Streaming is a utility that allows the user to run any executable or script as the mapper and the reducer. Hadoop Archives is used for archiving old files and directories. DistCp is used for copying files within a cluster and across different clusters. Rumen is a tool that extracts meaningful data from JobHistory files and analyzes it; it is used for statistical analysis. GridMix is a benchmark for Hadoop. It takes a job trace, which can be generated by Rumen, and creates synthetic jobs with the same pattern as the trace. Scheduler Load Simulator is a tool for simulating different loads and scheduling methods such as FIFO and Fair Scheduler. This chapter explains each tool and gives the syntax of its commands, so that the reader can use these tools effectively.
The additional tools provided by the Hadoop distribution are:
- Hadoop Streaming
- Hadoop Archives
- DistCp
- Rumen
- Gridmix
- Scheduler Load Simulator
- Benchmarking
Let us see all these tools one by one in detail.
Hadoop Streaming
Hadoop Streaming is a utility for creating and running Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
Working of Hadoop Streaming
Both the mapper and the reducer can be executables. These executables read their input line by line from stdin and write their output to stdout. When the mapper is initialized, each mapper task launches the executable as a separate process. The mapper task converts its input into lines and feeds them to the stdin of the process. It then collects the line-oriented output from the stdout of the process and converts each line into a key/value pair. This key/value pair is the output of the mapper.
Similarly, each reducer task converts its input key/value pairs into lines and feeds them to the stdin of the process. The reducer collects the line-oriented output from the stdout of the process and converts each line into a key/value pair. This key/value pair is the output of the reducer. This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.
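The line-based protocol described above can be sketched in Python. This is an illustrative in-process simulation, not part of Hadoop: the function names are hypothetical, and the sketch assumes the default streaming convention in which the text up to the first tab character of a line is the key and the remainder is the value. Sorting the mapper output stands in for the sort by key that the framework performs between the map and reduce phases.

```python
from itertools import groupby


def split_key_value(line):
    """Split a line at the first tab, as streaming does by default.

    A line without a tab becomes a key with no value (None here).
    """
    key, sep, value = line.rstrip("\n").partition("\t")
    return (key, value if sep else None)


def map_words(lines):
    """Mapper logic: emit one 'word<TAB>1' line for every word read."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_counts(lines):
    """Reducer logic: sum the values of consecutive lines that share a key."""
    pairs = (split_key_value(line) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Simulate the pipeline: map, sort the lines (a stand-in for the
# framework's sort by key), then reduce.
mapped = sorted(map_words(["to be or not", "to be"]))
result = list(reduce_counts(mapped))
print(result)
```

In an actual streaming job, the mapper and reducer logic would live in two separate scripts, each reading from `sys.stdin` and printing to `sys.stdout`.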
Streaming tasks that exit with a non-zero status are considered failed tasks. The user can control this behavior by setting stream.non.zero.exit.is.failure to true or false. By default, it is true.
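As a small illustration of how a streaming script might use its exit status, the sketch below returns 1 when it meets a malformed record and 0 otherwise. The comma-separated record format and the function name `run_mapper` are assumptions made for this example only.

```python
import io
import sys


def run_mapper(lines, out):
    """Write 'key<TAB>value' lines to out and return the exit status to use."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) != 2:
            # Report the problem on stderr and signal failure to the caller.
            print(f"malformed record: {line!r}", file=sys.stderr)
            return 1
        out.write(f"{fields[0]}\t{fields[1]}\n")
    return 0


# In a real script the last line would be:
#     sys.exit(run_mapper(sys.stdin, sys.stdout))
# Here the function is exercised on in-memory data instead.
good_output = io.StringIO()
ok_status = run_mapper(["a,1\n", "b,2\n"], good_output)
bad_status = run_mapper(["a,1\n", "broken\n"], io.StringIO())
print(ok_status, bad_status)
```

With the default setting of stream.non.zero.exit.is.failure, a task whose script exits with the second status would be treated as failed.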
Streaming Commands
The streaming commands are of the following form:
hadoop command [genericOptions] [streamingOptions]
The generic options must be placed before streaming options.
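The ordering rule can be made explicit with a small helper that assembles the argument list. This is a sketch under stated assumptions: the helper name is hypothetical, the `hadoop jar` invocation with the jar name `hadoop-streaming.jar` is a placeholder for the streaming jar of a particular installation, and `in_dir` and `out_dir` are made-up paths.

```python
def build_streaming_command(generic_options, streaming_options,
                            jar="hadoop-streaming.jar"):
    """Return an argument list with generic options ahead of streaming options.

    The jar name is a placeholder; the real path depends on the installation.
    """
    return ["hadoop", "jar", jar, *generic_options, *streaming_options]


cmd = build_streaming_command(
    # Generic option: set a configuration property with -D.
    generic_options=["-D", "stream.non.zero.exit.is.failure=true"],
    # Streaming options: input, output, and the executables to run.
    streaming_options=["-input", "in_dir", "-output", "out_dir",
                       "-mapper", "/bin/cat", "-reducer", "/usr/bin/wc"],
)
print(" ".join(cmd))
```

Keeping the two groups as separate arguments makes it hard to place a generic option such as `-D` after a streaming option by accident.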
Required Streaming Parameters
The required streaming options are described in Table 1.
Table 1. Required Streaming parameters
Sl. No. | Required Parameters | Description |
1. | -input directoryname or filename | Input location for mapper |
2. | -output directoryname | Output location for reducer |