Measuring the Effects of Data Mining on Inference

Tom Burr (Statistical Sciences, Los Alamos National Laboratory, USA) and S. Tobin (Nuclear Nonproliferation and Security, Los Alamos National Laboratory, USA)
DOI: 10.4018/978-1-4666-5888-2.ch176


Data mining is a term used to describe various types of exploratory data analysis whose purposes are to select data models, estimate model parameters, and generate hypotheses that can be tested on future data. Assessments of model predictions are known to be overly optimistic when they are generated from the same data that were used to select the model and estimate its parameters. For this reason, most statistical procedures assume that the data model is selected prior to data collection. As an alternative that adjusts for data mining, we describe steps that should be taken to account for “choosing the best” among many candidate models.
Chapter Preview


As a simple example that we use throughout and refer to as “Example 1,” suppose that a data analyst plans to evaluate three linear models for a set of n observations of {y, x1, x2}. Models M1, M2, and M3 fit the response y as a linear function of predictor x1, of predictor x2, or of both predictors x1 and x2, respectively. The expected values of y in models M1–M3 are E(y) = β1x1, E(y) = β2x2, and E(y) = β1x1 + β2x2, respectively, and each y is assumed to be generated as y = E(y) + e, where e is a random error term with mean 0 and variance σ². A model selection criterion, such as the adjusted residual sum of squares, can be used to select a model from M1, M2, and M3. The adjusted residual sum of squares is defined as

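$$R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},$$

where $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$ is the unadjusted $R^2$, $\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the error sum of squares (with $\hat{y}_i$ the fitted value for observation $i$), $\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the sum of squared y values around the mean of y, and p is the number of model predictors (p = 1 for M1 and M2 and p = 2 for M3).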
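
To make Example 1 concrete, the following is a minimal sketch of fitting M1, M2, and M3 to simulated data and picking the model with the largest adjusted criterion. It assumes the common form R²_adj = 1 − (1 − R²)(n − 1)/(n − p − 1), with R² = 1 − SSE/SST. The function names, the simulated coefficients (1.0 and 0.5), the noise level, and the sample size are illustrative assumptions and are not taken from the chapter.

```python
import random

def fit_least_squares(columns, y):
    """Least-squares coefficients for y = sum_j beta_j * x_j (no intercept).

    `columns` holds one or two predictor columns, matching models M1-M3.
    """
    if len(columns) == 1:
        x = columns[0]
        return [sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)]
    x1, x2 = columns
    s11 = sum(a * a for a in x1)
    s22 = sum(a * a for a in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * b for a, b in zip(x1, y))
    s2y = sum(a * b for a, b in zip(x2, y))
    det = s11 * s22 - s12 * s12  # zero only if x1 and x2 are collinear
    # Solve the 2x2 normal equations by Cramer's rule.
    return [(s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det]

def adjusted_r2(columns, y):
    """Adjusted R^2 under the assumed form 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    n, p = len(y), len(columns)
    beta = fit_least_squares(columns, y)
    fitted = [sum(b * col[i] for b, col in zip(beta, columns)) for i in range(n)]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error sum of squares
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)  # squared deviations around the mean
    r2 = 1.0 - sse / sst
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Simulated data (illustrative assumption): y depends on both predictors.
rng = random.Random(0)
n = 50
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

candidates = {"M1": [x1], "M2": [x2], "M3": [x1, x2]}
scores = {name: adjusted_r2(cols, y) for name, cols in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: adjusted R^2 = {score:.3f}")
print("selected:", best)
```

Because the winning score is computed on the same observations that were used to choose the model, it is exactly the kind of quantity that tends to be overly optimistic.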

Key Terms in this Chapter

Regression: Fitting a model y = f(x1, x2, …, xp) + error relating the p predictors x1, x2, …, xp to the response y. If the function f is linear in the parameters, then the regression is a linear regression, even if f is not linear in the predictors x1 and x2 (a model with squared or product terms of the predictors is one such case).

Bootstrap: A resample-the-data strategy that has many applications related to uncertainty quantification.

Principal Component Analysis: A procedure that uses an orthogonal transform to convert a set of observations of correlated variables into a set of values of linearly uncorrelated variables called principal components.

Data Mining (also known as Exploratory Data Analysis): Exploring the data to discover possible patterns and relationships that could lead to useful information.
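
The Bootstrap entry above can be illustrated with a short sketch that resamples (x, y) pairs with replacement to estimate the standard error of a fitted slope in a one-predictor, no-intercept model of the form E(y) = βx. The function names, the simulated data, and the number of resamples are illustrative assumptions. This is one simple bootstrap variant, not necessarily the procedure used in the chapter.

```python
import random
import statistics

def slope_no_intercept(x, y):
    """Least-squares slope for the model E(y) = beta * x."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def bootstrap_se(x, y, n_boot=1000, seed=0):
    """Bootstrap standard error of the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(y)
    estimates = []
    for _ in range(n_boot):
        # Draw n row indices with replacement and refit on the resample.
        idx = [rng.randrange(n) for _ in range(n)]
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        estimates.append(slope_no_intercept(xb, yb))
    return statistics.stdev(estimates)

# Simulated data (illustrative assumption): true slope 1.0, unit noise.
data_rng = random.Random(42)
x = [data_rng.gauss(0, 1) for _ in range(50)]
y = [1.0 * a + data_rng.gauss(0, 1) for a in x]

estimate = slope_no_intercept(x, y)
se = bootstrap_se(x, y)
print(f"slope estimate = {estimate:.3f}, bootstrap SE = {se:.3f}")
```

Resampling whole (x, y) pairs keeps each response attached to its predictor value, so the sketch does not rely on a particular error distribution.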
