Computational Intelligence for Missing Data Imputation, Estimation, and Management: Knowledge Optimization Techniques

Tshilidzi Marwala (University of Witwatersrand, South Africa)
Release Date: April, 2009|Copyright: © 2009 |Pages: 326
ISBN13: 9781605663364|ISBN10: 1605663360|EISBN13: 9781605663371|DOI: 10.4018/978-1-60566-336-4

Description

The issue of missing data imputation has been extensively explored in information engineering, but it still needs a new focus and approach in research.

Computational Intelligence for Missing Data Imputation, Estimation, and Management: Knowledge Optimization Techniques focuses on methods to estimate missing values given the observed data. Providing a defining body of research valuable to those involved in the field of study, this book presents current and new computational intelligence techniques that allow computers to learn the underlying structure of data.

Topics Covered

The many academic areas covered in this publication include, but are not limited to:

  • Artificial Neural Networks
  • Hybrid approach to missing data
  • Introduction of missing data
  • Maximum likelihood approach
  • Missing data estimation approaches
  • Missing data estimation method
  • Missing data estimation methodology
  • Missing data imputation
  • Missing data mechanism
  • Missing data patterns
  • Optimization Techniques

Reviews and Testimonials

This book is carefully written to give a good balance between theory and application of various missing data estimation techniques.

– Tshilidzi Marwala, University of Witwatersrand, South Africa

An opening discussion of traditional missing data issues is followed by chapters covering a range of computational intelligence.

– Book News Inc. (June 2009)

Paradoxically, in these days of information glut, there is a concurrent problem of data loss--missing and incomplete data. Statisticians have generated a wealth of knowledge on the methods of handling missing data. While solving differential equations, it is common to encounter such problems as missing initial conditions, missing boundary conditions, and unspecified location of the boundary contour. Such inverse problems, said to be improperly posed, are often solved by regularization--a method of systematically guessing the missing values. [...] A decent attempt at assembling the available tools under one cover.

– Rao Vemuri, Computing Reviews

Preface

In real life, a set of data invariably contains missing values. The problem then is to reconstitute the most probable values, through processes such as interpolation and extrapolation, before using that set.

Methods for resolving the problem of missing data have been extensively explored in statistical texts (Abdella, 2005; Little & Rubin, 1987). The initial work on compensating for missing data focused on improving survey data. In this book, missing data interpolation is called imputation to distinguish it from the statistical approach. Imputation is viewed as an alternative approach to dealing with missing data. There are two ways to deal with missing data: estimate the missing values, or delete any vector (data set) that contains missing value(s). This book focuses on methods that estimate the missing values.

Of particular importance in missing data interpolation is an analysis of the nature of the missing data, which is termed the missing data mechanism. Little and Rubin (1987) categorized three missing data mechanisms, namely: Missing At Random (MAR), Missing Completely At Random (MCAR) and a non-ignorable case also known as Missing Not At Random (MNAR).

In the first case, MAR occurs when the probability that variable X is missing depends on other variables, but not on X itself. An example is the case where two variables are measured: the vibration level of a machine and its temperature, X. If a very high vibration level causes the temperature sensor to fall off, so that values of X, high and low alike, become missing because of the other variable (the vibration level), this is termed MAR.

MCAR occurs when the probability that variable X is missing is unrelated to the value of X itself or to any other variable in the data set. This refers to data sets where the absence of data does not depend on the variable of interest or on any other variable in the data set (Rubin, 1978).

MNAR occurs when the probability that variable X is missing is related to the value of X itself, even if the other variables are controlled in the analysis (Allison, 2000). An example of this is when, in a survey of weights of candidates, a person omits to mention his or her weight because its value is very high. In analyzing survey data, these mechanisms are very powerful and useful. Knowing these mechanisms assists one in choosing which missing data imputation method is best to use.
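
The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the book: the function names, value ranges and thresholds are all hypothetical, and the MAR and MNAR cases use a hard threshold where a real mechanism would usually be probabilistic.

```python
import random

def make_records(n, seed=0):
    """Generate hypothetical (vibration, temperature) pairs for illustration."""
    rng = random.Random(seed)
    return [(rng.uniform(0.0, 10.0), rng.uniform(20.0, 120.0)) for _ in range(n)]

def mask_mcar(records, p, seed=1):
    """MCAR: temperature is dropped with fixed probability p, whatever the values are."""
    rng = random.Random(seed)
    return [(v, None if rng.random() < p else t) for v, t in records]

def mask_mar(records, vibration_limit):
    """MAR: temperature is dropped when the other variable (vibration) is high."""
    return [(v, None if v > vibration_limit else t) for v, t in records]

def mask_mnar(records, temperature_limit):
    """MNAR: temperature is dropped when its own value is high."""
    return [(v, None if t > temperature_limit else t) for v, t in records]

records = make_records(1000)
mcar = mask_mcar(records, p=0.2)
mar = mask_mar(records, vibration_limit=8.0)
mnar = mask_mnar(records, temperature_limit=100.0)
```

In the MAR case the observed vibration level fully explains which temperatures are missing, whereas in the MNAR case the explanation lies in the very values that were lost, which is why that case is called non-ignorable.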

However, in many engineering problems, where on-line decision support tools are becoming widely used, these mechanisms are proving to be insignificant (Marwala & Hunt, 1999). For example, if an aircraft is flying over the Atlantic Ocean and one of its critical sensors fails, there is simply no time to investigate why that particular sensor has failed and, thereby, identify its missing value mechanism. What ought to be done in this situation is to quickly estimate the sensor’s value, so that an on-line auto-pilot system can continue to operate.

In using decision support tools, if data become missing, it is extremely important, particularly for critical applications, that the missing data estimation technique is accurate. The methods introduced in this book are computational intelligence methods and have proven to be very successful in modeling complex problems such as speech recognition (Nelwamondo, Mahola, & Marwala, 2006). In this book, many methods are considered. These include

  • the multi-layer perceptron model (Marwala, 2000),
  • radial basis functions (Bishop, 1995),
  • Gaussian mixture models (Chen, Chen, & Hou, 2004),
  • rough sets (Wu, Mi, & Zhang, 2003),
  • support vector machines (Drezet & Harrison, 2001),
  • decision trees (Ssali & Marwala, 2008),
  • fuzzy ARTMAP (Carpenter et al., 1992) and
  • extension neural networks (Mohamed, Tettey, & Marwala, 2006).

    Descriptions and implementations of these missing data estimation processes follow (Bishop, 1995; Marwala, 2007). These methods are implemented in both the Bayesian and maximum-likelihood frameworks (Marwala, 2001).

    It is still very difficult to know beforehand which of these computational intelligence methods is ideal for missing data imputation. For this reason, hybrid methods are also introduced and implemented in this book for missing data imputation. In particular, ensemble methods that use more than one learning algorithm are considered (Perrone & Cooper, 1993). Some of these methods are computationally intensive, and as a result, the book introduces methods that are computationally efficient, such as principal component analysis (Adams et al., 2002) and dynamic programming (Bellman, 1957; Bertsekas, 2000).

    In this book, many optimization methods are used. For example, to train multi-layer perceptrons, a scaled conjugate gradient optimization method (Møller, 1993) is used. Other optimization methods used are:

  • the expectation maximization algorithm (Dempster, Laird, & Rubin, 1977),
  • genetic algorithms (Goldberg, 1989),
  • particle swarm optimization (Poli, Langdon, & Holland, 2005),
  • hill climbing (Tanaka, Toumiya, & Suzuki, 1997) and
  • simulated annealing (Tavakkoli-Moghaddam, Safaei, & Gholipour, 2006).

    It is difficult to know in advance which optimization method to use for the missing data estimation process and, therefore, this book also explores various hybrid optimization techniques. Some of the hybrid optimization techniques that are considered in this book include the hybrid of genetic algorithms and particle swarm optimization.

    Traditional missing data imputation methods have been largely based on static models. Even computational intelligence methods are traditionally constructed in a static manner. These methods are static in the sense that they are the same over time. For situations where the concepts are drifting and, therefore, the data are non-stationary, these methods fail (Kubat & Widmer, 1996). Many engineering problems have to model systems that are continuously changing because of aging. Therefore, for many engineering problems, data imputation methods are required that are immune to, or at least can handle, these changes in the character of the systems. This problem, therefore, requires missing data models that evolve with the systems on which they are based. Evolutionary methods have been successful in designing learning machines that evolve with systems. These evolutionary methods include genetic algorithms (Goldberg, 2002), fuzzy ARTMAP (Carpenter et al., 1992) and particle swarm optimization (Kennedy & Eberhart, 1995), and they are described in detail in this book.

    Throughout this book, examples from the literature and case studies are used to illustrate the effectiveness of the presented missing data estimation methods. Some of the case studies used include the artificial taster, HIV and a mechanical system.

    SUMMARY OF THE BOOK

    In Chapter 1, traditional missing data issues, such as missing data patterns and mechanisms, are described. Attention is paid to the best models to deal with particular missing data mechanisms. A review of traditional missing data imputation methods is conducted, and the methods reviewed include case deletion and prediction rules (Acock, 2005). The case deletion methods reviewed are list-wise and pair-wise deletion. The prediction rule imputation techniques reviewed are mean substitution, hot-deck, regression and decision trees. Two missing data examples are studied, namely, the Sudoku puzzle and a mechanical system.
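
Two of the traditional methods named above, list-wise deletion and mean substitution, are simple enough to sketch in a few lines. The code below is a minimal illustration rather than the book's implementation: the function names and the sample rows are hypothetical, missing entries are represented by `None`, and every column is assumed to have at least one observed value.

```python
def listwise_delete(rows):
    """List-wise deletion: keep only rows with no missing entries."""
    return [row for row in rows if all(value is not None for value in row)]

def mean_substitute(rows):
    """Mean substitution: replace each missing entry with its column's observed mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))  # assumes at least one observed value
    return [[means[j] if value is None else value for j, value in enumerate(row)]
            for row in rows]

rows = [[1.0, 10.0], [2.0, None], [None, 30.0], [4.0, 40.0]]
complete_rows = listwise_delete(rows)   # drops the two incomplete rows
filled_rows = mean_substitute(rows)     # keeps all four rows
```

List-wise deletion discards half of this small sample, while mean substitution keeps every row but shrinks the variance of each imputed column; that trade-off is part of the motivation for model-based estimation.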

    Missing data estimation processes require mathematical models that capture interrelationships amongst the variables. In Chapter 2, a method is presented that is aimed at approximating missing data and, thereby, capturing variables’ interrelationships by combining genetic algorithms and autoassociative neural networks. The neural network architectures implemented are the multi-layer perceptron and the radial basis function neural networks (Russell & Norvig, 1995). The proposed procedures are tested and then compared for missing data imputation.

    The ability to identify a model which captures the interrelationships between the variables is very important. Different models bring unique perspectives to the missing data problem and one way to maximize the performance of the missing data procedure is to hybridize different methods. In Chapter 3, hybrid autoassociative neural network models are developed and used in conjunction with genetic algorithms (Goldberg, 2002) to estimate missing data. One hybrid technique combines three neural networks to form a hybrid autoassociative network, while the other merges principal component analysis and neural networks. These procedures are compared to the Bayesian auto-associative neural network (Bishop, 1995) and the genetic algorithm approach.

    In Chapter 4, two techniques, i.e., Gaussian mixture models trained using the Expectation Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) and the combined auto-associative neural networks and particle swarm optimization methods, are implemented for missing data estimation and then compared. Of particular interest is the nature of the data in the analysis that suits each of these methods.

    Chapter 5 investigates an imputation technique based on rough sets computation (Wu, Mi, & Zhang, 2003). The characteristic relations are introduced to describe incompletely specified decision tables and then used for missing data estimation. Empirical results obtained using real data are given and insights into the problem of missing data are derived.

    In Chapter 6, autoassociative neural networks, principal components analysis and support vector regression (Marivate, Nelwamondo, & Marwala, 2008) are all combined with genetic algorithms, and then used to impute missing variables. The impact of using the principal component analysis on the overall performance of the autoassociative network and support vector regression is then assessed.

    In Chapter 7, a committee of networks is introduced for missing data estimation. This committee of networks consists of a multi-layer perceptron, support vector machines and radial basis functions. It is constructed through a weighted combination of the three networks. The networks committee is implemented collectively with a hybrid of the genetic algorithm and the particle-swarm optimization method for missing data estimation, and is then tested and assessed. Furthermore, evolutionary methods are used to evolve a committee of networks. The results of this committee are compared to the results from a traditional committee and stand-alone networks.
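
A weighted combination of three estimators can be sketched as follows. The preface does not say how the book chooses the committee weights, so the inverse-validation-error weighting used here is an assumption, and the function names and numbers are hypothetical.

```python
def inverse_error_weights(validation_errors):
    """Assumed scheme: weight each member by 1/error, normalized to sum to one."""
    inverse = [1.0 / error for error in validation_errors]
    total = sum(inverse)
    return [w / total for w in inverse]

def committee_estimate(estimates, weights):
    """Weighted average of the members' estimates of one missing value."""
    return sum(w * y for w, y in zip(weights, estimates)) / sum(weights)

# Hypothetical validation errors for the MLP, SVM and RBF members, in that order.
weights = inverse_error_weights([0.1, 0.2, 0.4])
# Hypothetical estimates of the same missing value from the three members.
combined = committee_estimate([36.0, 38.0, 41.0], weights)
```

With these numbers the most accurate member receives a weight of 4/7, so the combined estimate sits closest to its prediction while still drawing on the other two.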

    The use of inferential sensors is common in on-line fault detection systems in various control applications. A problem arises when sensors fail while the system is designed to make a decision based on the data from those sensors. Various techniques to handle missing data are discussed in Chapter 8. Firstly, a novel algorithm that classifies and regresses in the presence of missing data is proposed. The algorithm is tested for both classification and regression problems. Secondly, an estimation algorithm that uses an ensemble of regressors within the context of the boosting mechanism is proposed. Hybrid genetic algorithms and fast simulated annealing are used to predict missing values and the results are compared.

    In Chapter 9, a classifier method is presented that is based on a missing data estimation framework, and which uses auto-associative multi-layer perceptron neural networks and genetic algorithms. The method is tested and compared to a conventional feed-forward neural network using classification accuracies and the area under the receiver operating characteristic curve.

    In Chapter 10, various optimization methods are compared with the aim of optimizing the missing data estimation equation, which is built from auto-associative neural networks with the missing values as design variables. These optimization techniques are the genetic algorithm, particle swarm optimization, hill climbing and simulated annealing. They are tested and the results obtained are compared.
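
The idea of treating missing values as design variables can be illustrated with a toy example. In the sketch below, which is an assumption-laden illustration and not the book's code, a fixed linear projection stands in for a trained auto-associative network (it encodes the hypothetical relationship x2 = 2 * x1), the error is the squared difference between a record and its reconstruction, and a simple hill climber searches over the unknown entry. All names are hypothetical.

```python
import math

# Hypothetical learned structure: records lie along the direction (1, 2), i.e. x2 = 2 * x1.
DIRECTION = (1.0 / math.sqrt(5.0), 2.0 / math.sqrt(5.0))

def reconstruct(x):
    """Stand-in for a trained auto-associative model: project x onto DIRECTION."""
    score = x[0] * DIRECTION[0] + x[1] * DIRECTION[1]
    return (score * DIRECTION[0], score * DIRECTION[1])

def reconstruction_error(x_known, x_unknown):
    """Squared difference between the record and the model's reconstruction of it."""
    x = (x_known, x_unknown)
    x_hat = reconstruct(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def hill_climb(x_known, start=0.0, step=1.0, tol=1e-6):
    """Move to a neighbouring value while it lowers the error; otherwise halve the step."""
    current = start
    current_error = reconstruction_error(x_known, current)
    while step > tol:
        moved = False
        for candidate in (current + step, current - step):
            error = reconstruction_error(x_known, candidate)
            if error < current_error:
                current, current_error = candidate, error
                moved = True
                break
        if not moved:
            step /= 2.0
    return current

estimate = hill_climb(x_known=3.0)  # converges towards 6.0 under the assumed structure
```

Because this toy error surface is convex, hill climbing is sufficient; with a non-linear network the surface can have many local minima, which is one reason to compare it against global methods such as genetic algorithms, particle swarm optimization and simulated annealing.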

    In implementing solutions to the missing data estimation problem using optimization techniques, the definition of variable bounds is of critical importance. Chapter 11 introduces a novel paradigm to impute missing data that combines decision trees with an auto-associative neural network and principal component analysis. This is designed to answer the crucial question of whether the optimization bounds actually matter in the estimation of missing data. In the model, a decision tree is used to predict search bounds for a hybrid simulated annealing and genetic algorithm that minimizes an error function derived from the respective models. The results obtained are compared.

    Chapter 12 presents a control mechanism to assess the effect of a demographic variable, education level, on the HIV risk of individuals. This is intended to assist in understanding the extent to which the spread of HIV can be controlled by using the variable education level. This control mechanism is based on missing data frameworks where the missing data are the set points for control. An inverse neural network model and a missing data approximation model, based on an auto-associative neural network and the genetic algorithm, are used for the control mechanism and the results obtained are then compared.

    In Chapter 13, a computational intelligence approach to predicting missing data in the presence of concept drift is presented, using an ensemble of multi-layered feed-forward neural networks. An algorithm that detects concept drift is presented. Six instances prior to the occurrence of missing data are used to approximate the missing values. The algorithm is applied to a simulated time-series data set resembling the non-stationary data from a sensor. Secondly, an algorithm that uses dynamic programming and neural networks to solve the problem of missing data imputation is presented and tested, and the results are assessed. Thirdly, the impact of missing data estimation on fault classification in mechanical systems is studied. The missing data estimation method is based on auto-associative neural networks, where the network is trained to recall the input data through some non-linear neural network mapping, using a genetic algorithm. The classification methods used are extension neural networks and Gaussian mixture models.

    TARGET AUDIENCE OF THIS BOOK

    This book is intended for researchers and practitioners who use data analysis to build decision support systems. In particular, the target audience includes engineers, scientists and statisticians. The areas of engineering where decision support tools are becoming widely used (the target audience of this book) are aerospace, mechanical, civil, biomedical and electrical engineering. Furthermore, researchers in statistics and social science will also find the techniques introduced in this book to be highly applicable to their work. This book is carefully written to give a good balance between theory and application of various missing data estimation techniques. The applications selected reflect the target audience of this book and include examples from various branches of engineering.

    Author(s)/Editor(s) Biography

    Tshilidzi Marwala holds a Chair of Systems Engineering at the School of Electrical and Information Engineering at the University of the Witwatersrand. He is the youngest recipient of the Order of Mapungubwe (whose other recipients are Nobel Prize Winners Sydney Brenner and J.M. Coetzee) and was awarded the President Award by the National Research Foundation. He holds a Bachelor of Science in Mechanical Engineering (Magna Cum Laude) from Case Western Reserve University, a Master of Engineering from the University of Pretoria and a PhD in Engineering from the University of Cambridge (St John's College), and attended a Program for Leadership Development at Harvard Business School. He was a post-doctoral research associate at the Imperial College of Science, Technology and Medicine and from 2006 to 2007 was a visiting fellow at Harvard University. His research interests include theory and application of computational intelligence to engineering, computer science, finance, social science and medicine. He has published over 150 papers in journals, proceedings and book chapters and has supervised 30 master's and PhD theses. His book Computational Intelligence for Modelling Complex Systems is published by Research India Publications. He is the Associate Editor of the International Journal of Systems Science. His work has appeared in publications such as the New Scientist and Time Magazine. He was a Chair of the Local Loop Unbundling Committee, is a Deputy Chair of the Limpopo Business Support Agency and has been on boards of EOH (Pty) Ltd, City Power (Pty) Ltd, State Information Technology Agency (Pty) Ltd, Statistics South Africa and the National Advisory Council on Innovation. He is a trustee of the Bradlow Foundation as well as the Carl and Emily Fuchs Foundation. He is a Senior Member of the IEEE and a member of the ACM.