Classification with Incomplete Data

Pedro J. García-Laencina, Juan Morales-Sánchez, Rafael Verdú-Monedero, Jorge Larrey-Ruiz, José-Luis Sancho-Gómez, Aníbal R. Figueiras-Vidal
DOI: 10.4018/978-1-60566-766-9.ch007


Many real-world classification scenarios suffer from a common drawback: missing, or incomplete, data. The ability to handle missing data has become a fundamental requirement for pattern classification, because the absence of values for relevant data attributes can seriously affect the accuracy of classification results. This chapter focuses on incomplete pattern classification. Research on this topic continues to grow, and many solutions based on machine learning are known to be useful and efficient. This chapter analyzes the most popular and appropriate machine-learning-based missing data techniques for solving pattern classification tasks, highlighting their advantages and disadvantages.
Chapter Preview


Pattern classification is the discipline of building machines to classify data (patterns or input vectors) based either on a priori knowledge or on statistical information extracted from the patterns (Bishop, 1995; Duda et al., 2000; Jain et al., 2000; Ripley, 1996). This research field has developed since the 1960s, and it has progressed to a great extent in parallel with the growth of research on knowledge-based systems and artificial neural networks. Pattern classification has been successfully applied in several scientific areas, such as computer science, engineering, statistics, biology, and medicine, among others. These applications include biometrics (personal identification based on physical attributes such as fingerprints and irises), medical diagnosis (CAD, computer-aided diagnosis), financial index prediction, and industrial automation (fault detection in industrial processes). Many of these real-world applications suffer from a common drawback: missing or unknown data (incomplete feature vectors). For example, in an industrial experiment some results can be missing because of mechanical or electronic failures during the data acquisition process (Lakshminarayan et al., 2004; Nguyen et al., 2003). In medical diagnosis, some tests cannot be performed, either because the hospital lacks the necessary medical equipment or because some medical tests are not appropriate for certain patients (Jerez et al., 2006; Liu et al., 2005; Markey & Patel, 2004; Proschan et al., 2001). In the same context, another example is an examination by a doctor who performs several different kinds of tests: some results may be available instantly, while others may take several days to complete. Even so, it might be necessary to reach a preliminary diagnosis immediately, using only the test results that are available.
Missing data is a subject that has been treated extensively in the statistical analysis literature (Allison, 2001; Little & Rubin, 2002; Schaffer, 1997) and also, though to a lesser extent, in the pattern recognition literature. The unavailability of data hinders decision-making processes because decisions depend on information. Most scientific, business, and economic decisions are in some way related to the information available at the time they are made. For example, most business evaluations and decisions are highly dependent on the availability of sales and other information, whereas advances in research are based on the discovery of knowledge from various experiments and measured parameters. The ability to handle missing data has become a fundamental requirement for pattern classification because inappropriate treatment of missing data may cause large errors or false classification results. In addition, missing data is an increasingly common problem in real-world data. A clear indication of its importance is that 45% of the data sets in the UCI repository, one of the most widely used collections of data sets for benchmarking machine learning procedures, have missing values.

In general, pattern classification with missing data involves two different problems: handling missing values and pattern classification. Most of the approaches in the literature can be grouped into four types, depending on how both problems are solved. Figure 1 summarizes the different approaches to pattern classification with missing data.

Figure 1. Methods for pattern classification with incomplete data. This scheme shows the different procedures that are analyzed in this chapter.


Intuitively, the easiest way to deal with missing values is simply to delete the incomplete data. In a multivariate environment, however, missing values may occur in one or more attributes, and missing components often affect a significant portion of the whole data set, so deleting the incomplete items may cause a substantial loss of information.
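
The loss of information caused by deletion can be illustrated with a minimal sketch. The toy data matrix, the use of NaN to mark missing values, and the helper name `listwise_delete` are assumptions made for illustration; they do not come from the chapter.

```python
import numpy as np

def listwise_delete(X):
    """Keep only the rows (patterns) of X that contain no NaN."""
    complete = ~np.isnan(X).any(axis=1)
    return X[complete]

# Hypothetical toy data: 5 patterns, 4 features, NaN marks a missing value.
X = np.array([
    [1.0, 2.0, np.nan, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [np.nan, 1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0, np.nan],
    [9.0, 8.0, 7.0, 6.0],
])

# Share of individual entries that are missing (3 of 20).
missing_fraction = np.isnan(X).mean()

# Share of whole patterns discarded by deletion (3 of 5).
X_complete = listwise_delete(X)
discarded_fraction = 1 - len(X_complete) / len(X)

print(f"missing entries: {missing_fraction:.0%}")
print(f"discarded patterns: {discarded_fraction:.0%}")
```

In this toy case only a small share of the entries is missing, yet most of the patterns are discarded, because a single missing component is enough to remove an entire pattern.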

Key Terms in this Chapter

Missing Data Pattern: describes which values are observed in the input data matrix and which values are missing.

Missing Data: data are said to be missing when there is no information for one or more patterns on one or more features in a research study.

Imputation: a generic term for filling in unknown features with plausible values provided by a missing data estimator. Missing values are estimated from the available data.

Pattern Classification: a scientific discipline whose aim is the classification of objects into a set of categories or classes.

Multiple Imputation: a procedure which replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. The multiply imputed data sets are then analyzed by using standard procedures for complete data and combining the results from these analyses.

Marginalization: compensates for missing values by integrating the unreliable features of a pattern out of the distribution of a class, constrained by the upper and lower bounds on the true values of these components that are implicit in their observed values. The resulting distributions, which have a smaller number of components, are then used to compute the likelihood for that pattern.

Not Missing At Random (NMAR): data are NMAR when the missing data pattern is non-random and depends on the missing variables themselves. In this situation, the missing values cannot be predicted from the available variables in the database alone, and the missingness mechanism is informative.

Listwise/Casewise Analysis: deletes the cases that have missing data for any one variable used in a particular study.

Pattern: an entity that can be represented by a set of properties and variables, which are known as features or attributes. For example, a pattern can be a fingerprint image, a human face, or a speech signal.

Missing Data Indicator Matrix: defines the missing data pattern.

Missing Completely At Random (MCAR): data are MCAR when the event that a particular item is missing is independent of both the observed and the unobserved features.

Expectation-Maximization Algorithm: an efficient iterative procedure for computing the maximum likelihood estimates of parameters in probabilistic models in the presence of missing or hidden data.

Predictive Accuracy: an imputation procedure should maximise the preservation of true values. That is, it should result in imputed values that are as ‘close’ as possible to the true values.

Missing At Random (MAR): data are MAR when the missing data pattern is independent of all unobserved features, although it may be traceable or predictable from other variables in the database.

Error Rate in Classification: the proportion of patterns that have been incorrectly classified by a decision model.

Distributional Accuracy: an imputation procedure should preserve the distribution of the true data values. That is, marginal and higher order distributions of the imputed data values should be essentially the same as the corresponding distributions of the true values.

Missing Data Mechanism: the relationship between missingness and the known attributes in the input data matrix, i.e., the probability that a set of values is missing given the values taken by the observed and missing features.

Available-Case Analysis: uses only the cases with available features for a particular study.
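
Several of the terms above can be made concrete with a short sketch on a toy data matrix. The data, the convention that 1 marks a missing entry in the indicator matrix, and the choice of the feature mean as the imputed value are assumptions made for illustration; the chapter does not prescribe them.

```python
import numpy as np

# Hypothetical toy data: 4 patterns, 3 features, NaN marks a missing value.
X = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, 8.0, 9.0],
    [np.nan, 2.0, 6.0],
])

# Missing data indicator matrix (assumed convention: 1 = missing, 0 = observed).
M = np.isnan(X).astype(int)

# Listwise/casewise analysis: keep only patterns with no missing feature.
complete_rows = X[M.sum(axis=1) == 0]

# Available-case analysis: each feature mean uses every observed value
# of that feature, even when other features of the pattern are missing.
available_case_means = np.nanmean(X, axis=0)

# Imputation (single, mean-based): fill each missing entry with the
# mean of the observed values of the same feature.
X_imputed = np.where(M == 1, available_case_means, X)

print(M)
print(complete_rows)
print(available_case_means)
print(X_imputed)
```

Here the listwise analysis retains a single pattern, while the available-case means draw on three observed values per feature. Mean imputation is only one simple estimator; it fills every gap in a feature with the same value and therefore does not, in general, preserve the distribution of the true values.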
