Abstract
Data mining is a term used to describe various types of exploratory data analysis whose purposes are to select data models, estimate model parameters, and generate hypotheses that can be tested on future data. It is known that model predictions are overly optimistic when generated from the same data that are used to select a model and estimate its parameters. Therefore, most statistical procedures assume that the data model is selected prior to data collection. Alternatively, to adjust for data mining, we describe steps that should be taken to account for “choosing the best” among many candidate models.
Background
As a simple example, used throughout and referred to as “Example 1,” suppose that a data analyst plans to evaluate three linear models for a set of n observations of {y, x1, x2}. Models M1, M2, and M3 fit the response y as a linear function of predictor x1, of predictor x2, or of both predictors x1 and x2, respectively. The expected values of y under models M1–M3 are E(y) = β1x1, E(y) = β2x2, and E(y) = β1x1 + β2x2, respectively, and each y is assumed to be generated as y = E(y) + e, where e is a random error term with mean 0 and variance σ². Some type of model selection criterion, such as the adjusted residual sum of squares (the adjusted R²), can be used to select a model from M1, M2, and M3. The adjusted residual sum of squares is defined as
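$$R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},$$
where $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$ is the unadjusted $R^2$, $\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the error sum of squares (with $\hat{y}_i$ the fitted value for observation $i$), $\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the sum of squared y values around the mean $\bar{y}$ of y, and p is the number of model predictors (p = 1 for M1 and M2 and p = 2 for M3).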
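The selection step in Example 1 can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own procedure: it assumes the common form of the adjusted criterion, 1 − (1 − R²)(n − 1)/(n − p − 1), and the simulated data, the true coefficients (1.0 and 0.5), the sample size, and the helper names `fit_no_intercept` and `adjusted_r2` are all hypothetical choices made for the example.

```python
import random


def fit_no_intercept(columns, y):
    """Least-squares coefficients for E(y) = sum_j beta_j * x_j (one or two predictors)."""
    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))

    if len(columns) == 1:
        x = columns[0]
        return [dot(x, y) / dot(x, x)]
    # Two predictors: solve the 2x2 normal equations in closed form.
    x1, x2 = columns
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    s1y, s2y = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 ** 2
    return [(s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det]


def adjusted_r2(columns, y):
    """Adjusted criterion, assuming the form 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    n, p = len(y), len(columns)
    beta = fit_no_intercept(columns, y)
    fitted = [sum(b * col[i] for b, col in zip(beta, columns)) for i in range(n)]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error sum of squares
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)  # squared deviations around the mean
    r2 = 1 - sse / sst
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical data: y depends on both predictors, so M3 is the true model.
rng = random.Random(1)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

models = {"M1": [x1], "M2": [x2], "M3": [x1, x2]}
scores = {name: adjusted_r2(cols, y) for name, cols in models.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: adjusted criterion = {score:.3f}")
print("selected model:", best)
```

Because the reported score of the selected model is the maximum over all candidates computed on the same data, it tends to overstate how well that model will predict new observations, which is the optimism that adjustments for data mining aim to correct.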
Key Terms in this Chapter
Regression: Fitting a model y = f(x1, x2, …, xp) + error relating the p predictors x1, x2, …, xp to the response y. If the function f is linear in the parameters, for example f(x1, x2) = β1x1² + β2x1x2, then the regression is a linear regression (even though this f is not linear in the predictors x1 and x2).
Bootstrap: A resample-the-data strategy that has many applications related to uncertainty quantification.
Principal Component Analysis: A procedure that uses an orthogonal transform to convert a set of observations of correlated variables into a set of values of linearly uncorrelated variables called principal components.
Data Mining (also known as Exploratory Data Analysis): Exploring the data to discover possible patterns and relationships that could lead to useful information.
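As an illustration of the Bootstrap entry above, the following sketch resamples a small data set with replacement and uses the spread of the recomputed sample means to form a percentile interval. It is only one of many bootstrap variants; the data values, the number of resamples, the 95% level, and the helper name `bootstrap_means` are hypothetical choices made for the example.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Sample means of n_resamples data sets drawn with replacement from data."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]


# Hypothetical observations.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]

means = sorted(bootstrap_means(data))
# Percentile interval: the 2.5% and 97.5% points of the bootstrap distribution.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]

print(f"sample mean: {statistics.fmean(data):.2f}")
print(f"approximate 95% percentile interval: ({lower:.2f}, {upper:.2f})")
```

The same resampling loop applies to other statistics, such as a regression coefficient or a model selection score, by replacing the mean with the quantity of interest.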