# Measuring the Effects of Data Mining on Inference

Tom Burr (Statistical Sciences, Los Alamos National Laboratory, USA) and S. Tobin (Nuclear Nonproliferation and Security, Los Alamos National Laboratory, USA)

Copyright: © 2015
Pages: 9

DOI: 10.4018/978-1-4666-5888-2.ch176

## Abstract

Data mining is a term used to describe various types of exploratory data analysis whose purposes are to select data models, estimate model parameters, and generate hypotheses that can be tested on future data. It is known that model predictions are overly optimistic when generated from the same data that are used to select a model and estimate its parameters. Therefore, most statistical procedures assume that the data model is selected prior to data collection. Alternatively, to adjust for data mining, we describe steps that should be taken to account for “choosing the best” among many candidate models.

## Background

As a simple example, which we use throughout and refer to as “Example 1,” suppose that a data analyst plans to evaluate three linear models for a set of *n* observations of {*y*, *x*_{1}, *x*_{2}}. Models M1, M2, and M3 fit the response *y* as a linear function of predictor *x*_{1}, of predictor *x*_{2}, or of both predictors *x*_{1} and *x*_{2}, respectively. The expected values of *y* in models M1–M3 are E(*y*) = β_{1}*x*_{1}, E(*y*) = β_{2}*x*_{2}, and E(*y*) = β_{1}*x*_{1} + β_{2}*x*_{2}, respectively, and each *y* is assumed to be generated as *y* = E(*y*) + *e*, where *e* is a random error term with mean 0 and variance σ^{2}. Some type of model selection criterion, such as the adjusted residual sum of squares, can be used to select a model from M1, M2, and M3. The adjusted residual sum of squares is defined as

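$$R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1},$$

where $R^2 = 1 - SSE/SST$ is the unadjusted $R^2$, $SSE$ is the error sum of squares, $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the sum of squared *y* values around the mean $\bar{y}$ of *y*, and *p* is the number of model predictors (*p* = 1 for M1 and M2 and *p* = 2 for M3).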
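The selection step in Example 1 can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own code: it assumes the standard adjusted form 1 − (1 − R²)(n − 1)/(n − p − 1), fits each model by least squares without an intercept (matching E(*y*) as written for M1–M3), and uses simulated data whose sample size, coefficient, and error variance are arbitrary choices. The function names `adjusted_r2` and `select_model` are hypothetical.

```python
import numpy as np

def adjusted_r2(y, X):
    """Adjusted R^2 for a least-squares fit of y on the columns of X (no intercept)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)      # error sum of squares
    sst = np.sum((y - y.mean()) ** 2)      # squared deviations of y around its mean
    r2 = 1.0 - sse / sst                   # unadjusted R^2
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def select_model(y, x1, x2):
    """Score M1, M2, M3 and return the name of the highest-scoring model."""
    candidates = {
        "M1": np.column_stack([x1]),       # y ~ x1
        "M2": np.column_stack([x2]),       # y ~ x2
        "M3": np.column_stack([x1, x2]),   # y ~ x1 + x2
    }
    scores = {name: adjusted_r2(y, X) for name, X in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Simulated data (assumed for illustration): truth is M1 with beta_1 = 1, sigma = 1.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 * x1 + rng.normal(size=n)

best, scores = select_model(y, x1, x2)
print(best, scores)
```

Because the winning score is computed from the same observations that were used to pick the winner, it tends to overstate how well the chosen model will predict new data, which is the effect this chapter sets out to measure.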

## Key Terms in this Chapter

Regression: Fitting a model *y* = *f*(*x*_{1}, *x*_{2}, …, *x*_{p}) + error relating the *p* predictors *x*_{1}, *x*_{2}, …, *x*_{p} to the response *y*. If the function *f* is linear in the parameters, then the regression is a linear regression, even when *f* is not linear in the predictors (for instance, when it contains a term such as *x*_{1}*x*_{2}).

Bootstrap: A resample-the-data strategy that has many applications related to uncertainty quantification.

Principal Component Analysis: A procedure that uses an orthogonal transform to convert a set of observations of correlated variables into a set of values of linearly uncorrelated variables called principal components.

Data Mining (also known as Exploratory Data Analysis): Exploring the data to discover possible patterns and relationships that could lead to useful information.