MLA

T. Revathi, et al. "Hadoop Tools." Big Data Processing With Hadoop, IGI Global Scientific Publishing, 2019, pp.169-215. https://doi.org/10.4018/978-1-5225-3790-8.ch009

APA

T. Revathi, K. Muneeswaran, & M. Blessa Binolin Pepsi (2019). Hadoop Tools. IGI Global Scientific Publishing. https://doi.org/10.4018/978-1-5225-3790-8.ch009

Chicago

T. Revathi, K. Muneeswaran, and M. Blessa Binolin Pepsi. "Hadoop Tools." In Big Data Processing With Hadoop. Hershey, PA: IGI Global Scientific Publishing, 2019. https://doi.org/10.4018/978-1-5225-3790-8.ch009


Hadoop Tools

Source Title: Big Data Processing With Hadoop

DOI: 10.4018/978-1-5225-3790-8.ch009

Abstract

As the name indicates, this chapter explains the additional tools provided by the Hadoop distribution: Hadoop Streaming, Hadoop Archives, DistCp, Rumen, GridMix, and Scheduler Load Simulator. Hadoop Streaming is a utility that allows the user to use any executable or script as the mapper and the reducer. Hadoop Archives is used for archiving old files and directories. DistCp is used for copying files within a cluster and across different clusters. Rumen is a tool that extracts meaningful data from JobHistory files and analyzes it; it is used for statistical analysis. GridMix is a benchmark for Hadoop. It takes a job trace and creates synthetic jobs with the same pattern as the trace. The trace can be generated by the Rumen tool. Scheduler Load Simulator is a tool for simulating different loads and scheduling methods such as FIFO and Fair Scheduler. This chapter explains all the tools and gives the syntax of the commands for each tool. After reading this chapter, the reader will be able to use these tools effectively.


Tools

The additional tools provided by the Hadoop distribution are:

• Hadoop Streaming
• Hadoop Archives
• DistCp
• Rumen
• Gridmix
• Scheduler Load Simulator
• Benchmarking

Let us see all these tools one by one in detail.


Hadoop Streaming

Hadoop Streaming is a utility for creating and running Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Working of Hadoop Streaming

Both the mapper and the reducer can be executables. These executables read their input line by line from stdin and write their output to stdout. When a mapper task is initialized, it launches the executable as a separate process. The mapper task converts its input into lines and feeds them to the stdin of the process. It then collects the lines from the stdout of the process and converts each line into a key/value pair. These key/value pairs are the output of the mapper.

Each reducer task converts its input key/value pairs into lines and feeds them to the stdin of the process. It then collects the line-oriented output from the stdout of the process and converts each line into a key/value pair. These key/value pairs are the output of the reducer. This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.
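The line-based protocol described above can be sketched in Python. This is a minimal illustration, not code from the chapter: the word-count logic and the function names run_mapper and run_reducer are assumptions, and the sketch assumes the default convention in which a tab separates the key from the value on each line. The sorted() call in the usage note stands in for the sort by key that the framework performs between the map and reduce phases.

```python
from itertools import groupby


def run_mapper(lines):
    """Turn each input line into 'word<TAB>1' output lines."""
    for line in lines:
        for word in line.split():
            # One output line per word: key and value separated by a tab.
            yield f"{word}\t1"


def run_reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    # Split each line at the first tab into a (key, value) pair.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"
```

To try the two stages locally, feed the mapper a few lines, sort its output, and pass the result to the reducer, for example list(run_reducer(sorted(run_mapper(["a b a\n", "b a\n"])))). In an actual streaming script, the lines would come from sys.stdin and each result would be written to stdout with print.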

By default, a streaming task that exits with a non-zero status is considered a failed task. The user can control this behavior by setting stream.non.zero.exit.is.failure to true or false. The default value is true.
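From the script's side, the exit status is the only signal involved. The sketch below shows one way a Python streaming script might report failure; the validation rule (every line must contain a tab) and the names exit_status and main are hypothetical and chosen only for illustration.

```python
import sys


def exit_status(lines):
    """Return 0 if every line looks like 'key<TAB>value', else 1."""
    for line in lines:
        if "\t" not in line:
            # Hypothetical rule: a record without a tab is malformed.
            print(f"malformed record: {line!r}", file=sys.stderr)
            return 1  # non-zero status
    return 0  # zero status


def main():
    # Pass the status to the operating system so the framework can see it.
    sys.exit(exit_status(sys.stdin))
```

A real script would call main() when run as a program. With stream.non.zero.exit.is.failure left at true, a status of 1 marks the task as failed; with the property set to false, the same status would not.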

Streaming Commands

The streaming commands are of the following form:

hadoop command [genericOptions] [streamingOptions]

The generic options must be placed before streaming options.

Required Streaming Parameters

The required streaming options are described in Table 1.

Table 1. Required streaming parameters

Sl. No.	Required Parameters	Description
1.	-input directoryname or filename	Input location for mapper
2.	-output directoryname	Output location for reducer
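As an illustration of the ordering rule and the two parameters listed above, the following sketch assembles a streaming command as an argument list. The function name build_streaming_command, the jar name hadoop-streaming.jar, and the paths in the usage note are placeholders, not values from the chapter. The command is only assembled, not executed.

```python
def build_streaming_command(input_path, output_path, generic_options=()):
    """Assemble the argument list: generic options first, then streaming options."""
    # The jar name is a placeholder; the real file name depends on the installation.
    command = ["hadoop", "jar", "hadoop-streaming.jar"]
    # Generic options (for example, -D property=value) must come first.
    command += list(generic_options)
    # Streaming options follow the generic options.
    command += ["-input", input_path, "-output", output_path]
    return command
```

For example, build_streaming_command("/data/in", "/data/out", ["-D", "stream.non.zero.exit.is.failure=false"]) places the -D setting ahead of -input and -output. Joining the list with spaces gives a line that can be inspected before it is run in a shell.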
