Census Data Analysis and Visualization Using R Tool: A Case Study

Veena Gadad (Rashtreeya Vidyalaya College of Engineering, India) and Sowmyarani C. N. (Rashtreeya Vidyalaya College of Engineering, India)
Copyright: © 2019 | Pages: 25
DOI: 10.4018/978-1-5225-7277-0.ch008

Abstract

The growing use of the internet has led to huge amounts of data being collected from a variety of sources such as surveys, censuses, and sensors in the Internet of Things. The resulting data is termed big data, and its analysis informs major decisions. Because the collected data is in raw form, its inherent properties are difficult to understand, and it becomes a liability if it is not analyzed, summarized, and visualized. Although text can articulate the relations between facts and explain findings, presenting data in tables and graphs conveys the information more effectively. Data visualization is the presentation of data using tools that create visual images in order to gain more insight into the data. Data analysis is the processing and interpretation of data to discover useful information and to draw inferences from the values. This chapter concerns the use of the R tool and its effectiveness for data analysis and intelligent data visualization, through experiments on a data set obtained from the University of California Irvine Machine Learning Repository.

Introduction

R is an open source programming language whose main purpose is to provide a user-friendly way to perform data analysis, statistics, and data visualization. The IEEE Spectrum survey “The top programming languages of 2017” (Cass, 2018) places R in sixth position and Python in first position among the top 48 programming languages used by data scientists for analysis. As of June 2018, R ranks 10th in the TIOBE index, a measure of the popularity of programming languages (TIOBE The Software Quality Company, 2018). R is popular for the following reasons:

  1. R is an open source programming language. There are no subscription costs and no license management to deal with. The libraries of the language are freely accessible.

  2. R is well suited to statistical analysis. Data can be read in a variety of formats, and many operations can be performed on it using functionality aimed at the modern statistician. The packages “dplyr” and “ggplot2” are examples for data manipulation and plotting, respectively.

  3. R provides strong data visualization tools for creating graphs, bar charts, multi-panel lattice charts, scatter plots, and custom-designed graphics.

  4. R has consistent online support. Being open source, the language has loyal support from statisticians, scientists, and engineers.

Big data has the potential to revolutionize operations and strategy; however, there is a paucity of empirical research (Wamba, S. F, 2015). It is difficult to describe big data without performing data analysis and visualization. This article discusses the important features of R for managing big data and illustrates them with experiments on the UCI repository Adult data set. The main objectives of this article are as follows:

  1. Performing descriptive analysis to quantitatively describe or summarize the properties of the collected data. This includes examining the mean, standard deviation, minimum, maximum, and median for numeric data, or the frequency of observations for nominal data.

  2. Intelligent data visualization for descriptive analysis with graphs such as histograms, scatter plots, and QQ plots.

  3. Exploratory data analysis to understand the properties of the data set and find patterns in it with visual methods (R, pp. 10-50). R provides a number of functions useful for exploratory data analysis, such as box plots, histograms, scatter plots, and violin plots.

  4. Performing statistical tests to make statistical inferences and draw conclusions about the data. R provides functions to determine the p-value, which is compared with the significance level alpha to test the null hypothesis.

  5. Generating dynamic documents using R Markdown and the R programming language.
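As a rough illustration of the first four objectives, the following base R sketch computes descriptive statistics, draws a histogram, a box plot, and a QQ plot, and runs a two-sample t-test. It is not taken from the chapter: the data frame `adult_toy`, its column names, and all of its values are made up for demonstration (the real Adult data set would be loaded from the UCI repository instead), and the significance level of 0.05 is an assumed convention.

```r
# Hypothetical toy data standing in for a few Adult data set columns.
# All values are made up for illustration; they are not census records.
adult_toy <- data.frame(
  age            = c(25, 32, 47, 51, 38, 29, 44, 60, 35, 41),
  hours_per_week = c(40, 35, 45, 50, 40, 20, 40, 60, 38, 42),
  income         = factor(c("<=50K", "<=50K", ">50K", ">50K", "<=50K",
                            "<=50K", "<=50K", ">50K", "<=50K", ">50K"))
)

# Objective 1: descriptive analysis.
# Numeric column: mean, standard deviation, minimum, maximum, median.
age_stats <- c(
  mean   = mean(adult_toy$age),
  sd     = sd(adult_toy$age),
  min    = min(adult_toy$age),
  max    = max(adult_toy$age),
  median = median(adult_toy$age)
)
# Nominal column: frequency of each category.
income_counts <- table(adult_toy$income)

# Objectives 2 and 3: visualization and exploratory analysis.
# Plots are written to a temporary PDF so the script also runs without a display.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file, width = 9, height = 3)
par(mfrow = c(1, 3))
hist(adult_toy$age, main = "Histogram of age", xlab = "Age (years)")
boxplot(hours_per_week ~ income, data = adult_toy,
        main = "Hours per week by income",
        xlab = "Income class", ylab = "Hours per week")
qqnorm(adult_toy$age, main = "Normal QQ plot of age")
qqline(adult_toy$age)
dev.off()

# Objective 4: statistical test.
# Null hypothesis: mean age is equal in the two income classes.
alpha <- 0.05                                   # assumed significance level
age_test <- t.test(age ~ income, data = adult_toy)  # Welch two-sample t-test
reject_null <- age_test$p.value < alpha

print(age_stats)
print(income_counts)
print(age_test$p.value)
```

With only ten made-up rows the test result carries no meaning; the point is the workflow of summarizing, plotting, and then comparing the p-value with alpha.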

Organization of the Paper

The first part of the paper presents a managerial perspective on big data, a description of the data set used to explore the R tool, and existing proprietary and open source tools for data analysis and visualization. The rest of the paper discusses important R libraries that can be used for effective data analysis and data visualization, using census data as a case study. The last part of the paper discusses report generation using R Markdown.
