Antonio Bella (Universidad Politécnica de Valencia, Spain), Cèsar Ferri (Universidad Politécnica de Valencia, Spain), José Hernández-Orallo (Universidad Politécnica de Valencia, Spain) and María José Ramírez-Quintana (Universidad Politécnica de Valencia, Spain)

Source Title: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques

Copyright: © 2010
Pages: 19
DOI: 10.4018/978-1-60566-766-9.ch006

Chapter Preview

One of the main goals of machine learning methods is to build a model or hypothesis from a set of data (also called evidence). After this learning process, the quality of the hypothesis must be evaluated as precisely as possible. For instance, if prediction errors have negative consequences in a certain application domain of a model (for example, detection of carcinogenic cells), it is important to know the exact accuracy of the model. Therefore, the model evaluation stage is crucial for the real application of machine learning techniques. Generally, the quality of predictive models is evaluated by using a training set and a test set (which are usually obtained by partitioning the evidence into two disjoint sets) or by using some kind of cross-validation or bootstrap if more reliable estimations are desired. These evaluation methods work for any kind of estimation measure. It is important to note that different measures can be used depending on the model. For classification models, the most common measures are accuracy (the complement of error), the F-measure, or the macro-average. In probabilistic classification, besides the percentage of correctly classified instances, other measures such as log loss, mean squared error (MSE, or Brier score), or area under the ROC curve (AUC) are used. For regression models, the most common measures are the MSE, the mean absolute error (MAE), or the correlation coefficient.

With the same result for a quality metric (e.g. MAE), two different models might have a different error distribution. For instance, a regression model *R*_{1} that always predicts the true value plus 1 has an MAE of 1. So does a model *R*_{2} that predicts the true value for *n* - 1 examples and has an error of *n* for one example. However, model *R*_{1} seems to be more reliable or stable, i.e., its error is more predictable. Similarly, with the same result for a quality metric (e.g. accuracy), two different models might differ in their error self-assessment. For instance, a classification model *C*_{1} which is correct in 90% of the cases with a confidence of 0.91 for every prediction is preferable to a model *C*_{2} which is correct in 90% of the cases with a confidence of 0.99 for every prediction. The error self-assessment, i.e., the purported confidence, is more accurate in *C*_{1} than in *C*_{2}.
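The contrast between *R*_{1} and *R*_{2} can be checked numerically. The sketch below uses small made-up data (the values and *n* = 5 are assumptions for illustration): both models obtain the same MAE, even though one spreads its error uniformly and the other concentrates it in a single example.

```python
# Hypothetical data: two regression models with identical MAE but
# very different error distributions, as described in the text.
true = [3.0, 5.0, 2.0, 7.0, 4.0]       # n = 5 true values
pred_r1 = [v + 1 for v in true]        # R1: always off by exactly 1
pred_r2 = list(true)                   # R2: perfect on n - 1 examples...
pred_r2[0] = true[0] + 5               # ...but off by n = 5 on one example

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(mae(true, pred_r1))  # 1.0 -- uniform, predictable error
print(mae(true, pred_r2))  # 1.0 -- same MAE, concentrated in one example
```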

In both cases (classification and regression), an overall picture of the empirical results is helpful in order to improve the reliability or confidence of the models. In the case of regression, the model *R*_{1}, which always predicts the true value plus 1, is clearly uncalibrated, since its predictions are always 1 unit above the real value. By subtracting 1 unit from all the predictions, *R*_{1} could be calibrated and, interestingly, *R*_{2} can be calibrated in the same way. In the case of classification, a global calibration requires the confidence estimates to be around 0.9, since both models are right 90% of the time.

Thus, calibration can be understood in many ways, but it is usually built around two related issues: how error is distributed and how self-assessment (confidence or probability estimation) is performed. Even though both ideas can be applied to both regression and classification, this chapter focuses on error distribution for regression and self-assessment for classification.

Estimating probabilities or confidence values is crucial in many real applications. For example, if probability estimates are accurate, decisions with a good assessment of risks and costs can be made using utility models or other techniques from decision making. Additionally, the integration of these techniques with other models (e.g. multiclassifiers) or with previous knowledge becomes more robust. In classification, probabilities can be understood as degrees of confidence, especially in binary classification, thus accompanying every prediction with a reliability score (DeGroot & Fienberg, 1982). In regression, predictions might be accompanied by confidence intervals or by probability density functions.

Calibration Measure: any kind of quality function that is able to assess the degree of calibration of a predictive model.

Distribution Calibration in Classification (or simply “class calibration”): the degree of approximation of the true or empirical class distribution with the estimated class distribution.

Calibration Technique: any technique that aims to improve probability estimation or to improve error distribution of a given model.

Reliability Diagrams: In these diagrams, the prediction space is discretised into 10 intervals (from 0 to 0.1, from 0.1 to 0.2, etc.). The examples whose probability is between 0 and 0.1 go into the first interval, the examples between 0.1 and 0.2 go into the second, etc. For each interval, the mean predicted value (in other words, the mean predicted probability) is plotted (x axis) against the fraction of positive real cases (y axis). If the model is calibrated, the points will be close to the diagonal.
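The binning procedure in this definition can be sketched directly. The example probabilities and labels below are assumptions for illustration; each returned pair is one point of the diagram (mean predicted probability, fraction of positives).

```python
# Sketch: compute reliability-diagram points by binning predicted
# probabilities into 10 equal-width intervals and pairing each bin's
# mean prediction (x axis) with its fraction of positives (y axis).
def reliability_points(probs, labels, bins=10):
    points = []
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        # A probability falls in bin b if lo <= p < hi
        # (the last bin also includes p == 1.0).
        in_bin = [(p, y) for p, y in zip(probs, labels)
                  if lo <= p < hi or (b == bins - 1 and p == 1.0)]
        if in_bin:
            mean_pred = sum(p for p, _ in in_bin) / len(in_bin)
            frac_pos = sum(y for _, y in in_bin) / len(in_bin)
            points.append((mean_pred, frac_pos))
    return points

probs = [0.05, 0.15, 0.82, 0.95, 0.91]   # hypothetical predictions
labels = [0, 0, 1, 1, 1]                 # hypothetical true classes
points = reliability_points(probs, labels)
# For a calibrated model, these points lie close to the diagonal x = y.
```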

Confusion Matrix: a visual way of showing the recount of cases of the predicted classes and their actual values. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
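A minimal sketch of this recount, keeping the convention stated above (rows for actual classes, columns for predicted classes). The class names and label sequences are hypothetical.

```python
# Sketch: build a confusion matrix as a nested dict, indexed first by
# the actual class (row) and then by the predicted class (column).
def confusion_matrix(actual, predicted, classes):
    m = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

actual    = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]
cm = confusion_matrix(actual, predicted, ["pos", "neg"])
print(cm["pos"]["pos"])  # 2 -- correctly predicted positives
```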

Distribution Calibration in Regression: any technique that reduces the bias on the relation between the expected value of the estimated value and the mean of the real value.

Probabilistic Calibration for Classification: any technique that improves the degree of approximation of the predicted probabilities to the actual probabilities.
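As one concrete instance of such a technique, the sketch below implements histogram binning, a simple calibration method that replaces each predicted probability with the empirical fraction of positives in its bin. This particular choice is an assumption for illustration; the chapter may cover other techniques, and the scores and labels used here are made up.

```python
# Sketch of histogram binning: learn, from validation scores and labels,
# a mapping from each probability bin to its empirical positive rate.
def fit_binning(probs, labels, bins=10):
    frac = {}
    for b in range(bins):
        in_bin = [y for p, y in zip(probs, labels)
                  if b / bins <= p < (b + 1) / bins
                  or (b == bins - 1 and p == 1.0)]
        if in_bin:
            frac[b] = sum(in_bin) / len(in_bin)

    def calibrate(p):
        b = min(int(p * bins), bins - 1)
        return frac.get(b, p)   # fall back to the raw score for empty bins
    return calibrate

# Hypothetical validation scores and true classes:
cal = fit_binning([0.92, 0.95, 0.55, 0.58], [1, 1, 0, 1])
print(cal(0.94))  # 1.0 -- this bin's empirical positive rate
print(cal(0.56))  # 0.5
```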

Probabilistic Calibration for Regression: for “density forecasting” models, any calibration technique that makes the predicted density functions specific to each prediction: narrow when the prediction is confident, and broader when it is less so.
