Exploratory Data Analysis

Copyright: © 2018 |Pages: 25
DOI: 10.4018/978-1-5225-3270-5.ch006

Abstract

Exploratory data analysis (EDA) summarizes the main characteristics of a dataset through tools such as nearest-neighbor indexes, standard deviation, scatterplots or quadrat analysis. This EDA chapter is divided into several sections covering the myGeoffice© options, including the graphical mode of their outputs: file data input (after all, any analysis demands data); descriptive study of the variable (mean, kurtosis, distribution plot, etc.); 2D-3D data posting (spatial location of the data samples); cutoff layout map (a color-coded spatial plot showing which sample values fall above or below a chosen threshold); G and Ripley's K indexes (to disclose clustered, uniform and random spatial sampling); Gaussian kernel density (a non-parametric way to estimate the spatial probability density function of a variable); Student's t-test and F-test (a parametric approach to check statistical differences between two sub-regions), including a brief section on the two-way ANOVA technique; quadrat analysis (comparison of the statistically expected and actual counts of objects within spatial sampling areas to test randomness and clustering); XX profile scatterplot (silhouette view of the data along the XX axis); and YY profile scatterplot (silhouette view of the data along the YY axis).

Introduction

Exploratory data analysis (EDA) summarizes the main characteristics of a dataset through tools such as nearest-neighbor indexes, standard deviation, scatterplots or quadrat analysis. This chapter covers the ten EDA options of myGeoffice©, including the graphical mode of their outputs:

  • File data input (after all, any analysis demands data)

  • Descriptive study of the variable (mean, kurtosis, distribution plot, etc.)

  • 2D-3D data posting (spatial location of the data samples)

  • Cutoff layout map (a color-coded spatial plot showing which sample values fall above or below a chosen threshold)

  • G and Ripley’s K indexes (to disclose clustered, uniform and random spatial sampling)

  • Gaussian kernel density (a non-parametric way to estimate the spatial probability density function of a variable)

  • Student’s t-test and F-test (a parametric approach to check statistical differences between two sub-regions), including a brief sub-section on the two-way ANOVA technique

  • Quadrat analysis (comparison of the statistically expected and actual counts of objects within spatial sampling areas to test randomness and clustering)

  • XX profile scatterplot (silhouette view of the data along XX axis)

  • YY profile scatterplot (silhouette view of the data along YY axis)

Be reminded that this second menu of myGeoffice© deals with variables that are continuous in nature, such as groundwater or air pollution levels. Discrete variables (the number of diamonds in different sub-regions, for example) should instead be investigated and evaluated with discrete distributions such as the Poisson, compound Poisson (see Clark & Harper (2000) for further information), Bernoulli and negative binomial distributions.
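As a rough illustration of the quadrat analysis option listed above, the sketch below counts sample locations in a regular grid and computes the variance-to-mean ratio (VMR) of the counts. Under complete spatial randomness the counts are approximately Poisson, so the VMR is close to 1; values well above 1 suggest clustering and values well below 1 suggest a uniform layout. The function names, the k-by-k grid over the bounding box of the points and the choice of the VMR statistic are assumptions of this sketch, not a description of how myGeoffice© computes its result.

```python
def quadrat_counts(points, k):
    """Count (x, y) points in each cell of a k-by-k grid over their bounding box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    # Fall back to 1.0 so a degenerate (zero-width) box does not divide by zero.
    width = (max(xs) - min_x) or 1.0
    height = (max(ys) - min_y) or 1.0
    counts = [[0] * k for _ in range(k)]
    for x, y in points:
        # Points lying on the upper edge are clamped into the last cell.
        col = min(int((x - min_x) / width * k), k - 1)
        row = min(int((y - min_y) / height * k), k - 1)
        counts[row][col] += 1
    return counts


def variance_to_mean_ratio(counts):
    """Sample variance of the quadrat counts divided by their mean."""
    flat = [c for row in counts for c in row]
    mean = sum(flat) / len(flat)
    variance = sum((c - mean) ** 2 for c in flat) / (len(flat) - 1)
    return variance / mean
```

A perfectly regular lattice of points gives identical counts in every quadrat and therefore a VMR of zero, while a layout with most points piled into one quadrat gives a VMR far above 1.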

Primary File Input

Every analysis requires data, and myGeoffice© is no exception. The first step of this second menu is loading an ASCII text file in which each line holds four fields separated by a single TAB: Sample_id, Coord_x, Coord_y, Sample_value (use a DOT or COMMA as the decimal separator for real values, but not in the first ID field). In addition, the first and fourth columns cannot be negative, and the input dataset must hold between 50 and 200 observations (see Figure 1). For practical purposes towards the second part of this writing, the dataset used for analysis comes from the GSLIB® book of Clayton Deutsch and André Journel (1998) and can be downloaded directly from myGeoffice© or gslib.com.

Figure 1.

It is desirable that the input observation sites cover a roughly square area, since maps are drawn in a square layout
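The file rules described above can be checked before uploading with a short script. The following is a minimal sketch only: the function name `parse_samples` is hypothetical, and it assumes that the ID is an integer and that the file has no header row, neither of which the text states explicitly.

```python
def parse_samples(text, min_rows=50, max_rows=200):
    """Parse TAB-separated lines of Sample_id, Coord_x, Coord_y, Sample_value."""
    samples = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 TAB-separated fields")
        # Assumption: the ID is an integer (no decimal separator allowed).
        sample_id = int(fields[0])
        # Real values may use either DOT or COMMA as the decimal separator.
        x, y, value = (float(f.strip().replace(",", ".")) for f in fields[1:])
        if sample_id < 0 or value < 0:
            raise ValueError(f"line {line_no}: ID and value cannot be negative")
        samples.append((sample_id, x, y, value))
    if not min_rows <= len(samples) <= max_rows:
        raise ValueError(f"expected {min_rows} to {max_rows} observations, "
                         f"got {len(samples)}")
    return samples
```

Calling `parse_samples(open("data.txt").read())`, where `data.txt` is a placeholder file name, returns a list of `(id, x, y, value)` tuples or raises `ValueError` at the first rule that fails.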

Descriptive Study

The first step before autocorrelation analysis, Kriging or geostatistical simulation (the statistical approach) is the assessment of univariate and bivariate descriptive indexes: central tendency measures (mean, mode, median), distribution graphs (QQ and PP plots, cumulative frequency table and histogram), spread measures (variance, standard deviation, range and inter-quartile range), skewness, coefficient of variation (useful for identifying lognormal distributions), kurtosis and the scattergram. For the user’s reference, an ideal Gaussian dataset should have a range spanning between 4 and 6 standard deviations, a skewness of zero and a (non-excess) kurtosis of three.
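The univariate indexes named above can be reproduced with the Python standard library. The sketch below is illustrative only: the function name `describe` is hypothetical, and it uses population (divide-by-n) moments, which is an assumption, since the text does not say which estimator myGeoffice© applies. Kurtosis is reported in its non-excess form so that a Gaussian sample gives a value near three, matching the reference figure in the text.

```python
import math
import statistics


def describe(values):
    """Return basic descriptive indexes for a list of sample values."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population central moments of order 2, 3 and 4 (assumption of this sketch).
    m2, m3, m4 = (sum((v - mean) ** p for v in values) / n for p in (2, 3, 4))
    std = math.sqrt(m2)
    q1, _, q3 = statistics.quantiles(values, n=4)
    data_range = max(values) - min(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "variance": m2,
        "std": std,
        "range": data_range,
        "iqr": q3 - q1,
        "range_in_std": data_range / std,  # roughly 4 to 6 for Gaussian data
        "skewness": m3 / m2 ** 1.5,        # 0 for a symmetric distribution
        "kurtosis": m4 / m2 ** 2,          # 3 for a Gaussian distribution
        "cv": std / mean,                  # assumes a non-zero mean
    }
```

Comparing `range_in_std`, `skewness` and `kurtosis` against 4 to 6, zero and three gives a quick first check of how far a dataset departs from the Gaussian ideal.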
