Hadoop Tools

Copyright: © 2019 | Pages: 47
DOI: 10.4018/978-1-5225-3790-8.ch009

Abstract

As the name indicates, this chapter explains the additional tools provided with the Hadoop distribution: Hadoop Streaming, Hadoop Archives, DistCp, Rumen, GridMix, and Scheduler Load Simulator. Hadoop Streaming is a utility that allows the user to supply any executable or script as the mapper and the reducer. Hadoop Archives is used for archiving old files and directories. DistCp is used for copying files within a cluster and across different clusters. Rumen is a tool that extracts meaningful data from JobHistory files and analyzes it; it is used for statistical analysis. GridMix is a benchmark for Hadoop. It takes a trace of jobs and creates synthetic jobs with the same pattern as the trace. The trace can be generated by the Rumen tool. Scheduler Load Simulator is a tool for simulating different loads and scheduling methods such as FIFO and the Fair Scheduler. This chapter explains all of these tools and gives the syntax of the commands for each one. After reading this chapter, the reader will be able to use these tools effectively.
Chapter Preview

Tools

The additional tools provided by the Hadoop distribution are:

  • Hadoop Streaming

  • Hadoop Archives

  • DistCp

  • Rumen

  • GridMix

  • Scheduler Load Simulator

  • Benchmarking

Let us see all these tools one by one in detail.

Hadoop Streaming

Hadoop Streaming is a utility for creating and running Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Working of Hadoop Streaming

Both the mapper and the reducer can be executables. These executables read their input line by line from stdin and write their output to stdout. Each mapper task launches the executable as a separate process when the mapper is initialized. The mapper task converts its input into lines and feeds them to the stdin of the process. It then collects the lines from the stdout of the process and converts each line into a key/value pair. This key/value pair is the output of the mapper.

Each reducer task converts its input key/value pairs into lines and feeds them to the stdin of the process. It then collects the line-oriented output from the stdout of the process and converts each line into a key/value pair. This key/value pair is the output of the reducer. This is the basis of the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.
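The line-oriented protocol described above can be illustrated with a small word-count sketch in Python. It is only an illustration, not code from the chapter: the function names run_mapper, run_reducer, and simulate are hypothetical, the tab character is assumed to separate key and value (the usual streaming default), and simulate is a local stand-in for the framework's sort step so that the example can run without a cluster.

```python
import io
from itertools import groupby


def run_mapper(stdin, stdout):
    """Emit one 'word<TAB>1' line for every word read from stdin."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")


def run_reducer(stdin, stdout):
    """Sum the counts per key. Assumes the input lines arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        stdout.write(f"{key}\t{total}\n")


def simulate(text):
    """Hypothetical local stand-in for the framework: map, sort, reduce."""
    mapped = io.StringIO()
    run_mapper(io.StringIO(text), mapped)
    # The framework sorts mapper output by key before the reduce phase.
    shuffled = "".join(sorted(mapped.getvalue().splitlines(keepends=True)))
    reduced = io.StringIO()
    run_reducer(io.StringIO(shuffled), reduced)
    return reduced.getvalue()


print(simulate("a b a\nb a\n"), end="")
```

In an actual streaming job, each function would live in its own script and be called with sys.stdin and sys.stdout, since the framework talks to the process only through those two streams.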

By default, a streaming task that exits with a non-zero status is considered a failed task. The user can control this behavior by setting stream.non.zero.exit.is.failure to true or false. By default, it is true.
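As a rough sketch of how a script might use its exit status to signal failure, the following Python function returns the status that a streaming executable would pass to the operating system. The name run_strict_mapper and the rule that every line must contain a tab are assumptions made for this illustration only.

```python
import io


def run_strict_mapper(stdin, stdout, stderr):
    """Copy 'key<TAB>value' lines through and return a process exit status."""
    for number, line in enumerate(stdin, start=1):
        if "\t" not in line:
            stderr.write(f"line {number}: missing tab separator\n")
            return 1  # non-zero: treated as a failed task by default
        stdout.write(line)
    return 0  # zero: treated as a successful task


good = run_strict_mapper(io.StringIO("k1\tv1\n"), io.StringIO(), io.StringIO())
bad = run_strict_mapper(io.StringIO("no separator\n"), io.StringIO(), io.StringIO())
print(good, bad)
```

A real script would typically end with sys.exit(run_strict_mapper(sys.stdin, sys.stdout, sys.stderr)). With stream.non.zero.exit.is.failure set to false, the non-zero status would no longer mark the task as failed.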

Streaming Commands

The streaming commands are of the following form:

hadoop command [genericOptions] [streamingOptions]

The generic options must be placed before the streaming options.
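The ordering rule can be made concrete with a short Python sketch that assembles the argument list for a streaming job. The helper build_streaming_command, the jar path hadoop-streaming.jar, and the directory and script names are hypothetical placeholders. The -mapper and -reducer options and the -D mapreduce.job.reduces property come from the Hadoop Streaming documentation and are not listed in this excerpt.

```python
def build_streaming_command(jar, generic_options, streaming_options):
    """Return the argument list with generic options ahead of streaming options."""
    return ["hadoop", "jar", jar, *generic_options, *streaming_options]


command = build_streaming_command(
    jar="hadoop-streaming.jar",  # hypothetical path; depends on the installation
    generic_options=["-D", "mapreduce.job.reduces=2"],
    streaming_options=[
        "-input", "in_dir",        # placeholder input directory
        "-output", "out_dir",      # placeholder output directory
        "-mapper", "mapper.py",    # placeholder mapper script
        "-reducer", "reducer.py",  # placeholder reducer script
    ],
)
print(" ".join(command))
```

Keeping the two groups of options in separate lists makes it hard to place a generic option such as -D after a streaming option by accident.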

Required Streaming Parameters

The required streaming options are described in Table 1.

Table 1.
Required streaming parameters

Sl. No.   Required Parameter                    Description
1.        -input directoryname or filename      Input location for mapper
2.        -output directoryname                 Output location for reducer
