Statistical Visualization of Big Data Through Hadoop Streaming in RStudio

Chitresh Verma, Rajiv Pandey
DOI: 10.4018/978-1-5225-3142-5.ch019

Data visualization provides a visual representation of a data set so that the data can be interpreted meaningfully from a human perspective. Statistical visualization calls for tools, algorithms, and techniques that support and render graphical models. This chapter explores the features of R and RStudio in detail. The combination of Hadoop and R for Big Data analytics and data visualization is demonstrated through code snippets. The integration of R and Hadoop is explained in detail with the help of a utility called the Hadoop streaming jar. Various R packages and their integration with Hadoop operations in the R environment are explained through examples. The process of data streaming is presented using the different readers of the Hadoop streaming package. A case-based statistical project is considered in which the data set is visualized after dual execution using Hadoop MapReduce and an R script.
Chapter Preview

Data Visualization

Data visualization is done not only with standard charts and graphs but also with more technologically advanced forms such as infographics, real-time dials and gauges, and heat maps (Spakov & Miniotas, 2015). Visualization outputs such as charts and bars can also be interactive, changing at the click of a button. Data visualization is a well-developed domain in which accomplished designers and data scientists have worked to build effective combinations of visuals for data interpretation. It is not only a creative exercise; it also decodes the data for the viewer in a meaningful way. In other words, the gap between the actual data and logical inference can be bridged only through data visualization. A data designer uses imagination to build a representation of the data that the audience can easily comprehend. Every combination of data and illustration serves this single purpose.

What Is Data Visualization?

Data visualization is the process of extracting meaningful information from a vast amount of data and presenting it in pictorial form so that end users can understand it better (Chen et al., 2007). It is the science of filtering and isolating data and then visualizing it with different representation techniques.

To the viewer, the product of data visualization may look like information moving from point A to point B. The data visualization process involves not only designing reports and charts but also presenting them in a way that the spectator can interpret with the least amount of effort.

Key Terms in this Chapter

chmod: A UNIX command used for changing the permissions of files and directories. Various mode codes can be used with chmod. “chmod 755” gives the owner permission to read, write, and execute, and gives everyone else permission to read and execute; this mode is commonly used in web server environments.
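As a rough illustration of the 755 mode described in the chmod entry, the following Python sketch decodes an octal mode into its rwx string and applies it to a temporary file. It assumes a POSIX system, and the helper names `describe_mode` and `demo` are hypothetical, chosen only for this example.

```python
import os
import stat
import tempfile


def describe_mode(mode: int) -> str:
    """Return the nine permission characters for a mode, e.g. 0o755 -> 'rwxr-xr-x'."""
    # stat.filemode() prefixes a file-type character; drop it and keep
    # the owner/group/others permission triplets.
    return stat.filemode(mode)[1:]


def demo() -> str:
    """Apply mode 755 to a temporary file and report the resulting permissions.

    Assumption: POSIX semantics (on Windows, os.chmod only toggles read-only).
    """
    with tempfile.NamedTemporaryFile() as tmp:
        os.chmod(tmp.name, 0o755)  # same effect as `chmod 755 <file>`
        mode = stat.S_IMODE(os.stat(tmp.name).st_mode)
    return describe_mode(mode)
```

Each octal digit covers one class of user: 7 (read + write + execute) for the owner, and 5 (read + execute) for the group and for others.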

Trend Analysis: A mathematical technique commonly used by traders in the stock market to predict future stock prices. It exploits historical data to determine stock performance.

Framework: A software framework is a set of general rules and functions that perform specific work. Such frameworks act as a guiding structure for engineers and users when building applications.

Hadoop Streaming: A utility packaged with Hadoop that helps run mapper and reducer functions from the terminal using multiple programming languages.
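The mapper and reducer contract behind Hadoop Streaming can be sketched in a few lines. The example below is a minimal word count written in Python rather than the R scripts this chapter works with, and it only simulates the map, sort, and reduce stages locally; no Hadoop cluster is involved. The function names `mapper`, `reducer`, and `simulate` are hypothetical.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper writes to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key.

    Assumption: records arrive sorted by key, which is how the framework
    delivers mapper output to a streaming reducer on stdin.
    """
    parsed = (rec.rstrip("\n").split("\t", 1) for rec in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def simulate(lines):
    """Local stand-in for map -> shuffle/sort -> reduce."""
    return list(reducer(sorted(mapper(lines))))
```

In an actual job, each function would live in its own executable script that reads `sys.stdin` and prints its records, and the two scripts would be passed to the streaming jar through its `-mapper` and `-reducer` options together with `-input` and `-output` paths. The location of the jar depends on the installation.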

Real-Time Dials and Gauges: Data visualization techniques used for displaying data load and flow speed. They mostly use JavaScript APIs for animation.

JAR: JAR is an acronym for Java Archive. It is a package that bundles many files, mostly Java class files, images, and metadata. JAR files are used as libraries and executables within the Java environment.

Statistical Analysis: A branch of data analytics in which sample sets are compiled to draw mathematical conclusions using special software and tools.

Chunk: A chunk, or data chunk, is a unit of information carried within a Stream Control Transmission Protocol (SCTP) packet. Its details are specified in RFC 2960, the SCTP standard.

Network Analysis: Data analysis that uses related graphs and other linked data structures to produce a specific output. It is generally done in projects related to planning, design, travel, and marketing.

Heat Maps: An illustration technique that uses colors to mimic degrees of heat. Real geographical maps can also be overlaid with color coding in this technique.

Visualization: The illustration of data for the purpose of gaining insight into otherwise unseen information. It is used extensively in interpreting data from sports, medical science, education, and management.
