Missing data create various problems in analyzing and processing the data held in databases. This chapter introduces a method that approximates missing data in a database by combining genetic algorithms with neural networks. The method uses genetic algorithms to minimize an error function derived from an auto-associative neural network. Multi-Layer Perceptron (MLP) and Radial Basis Function (RBF) networks are each employed to form the auto-associative network. The chapter investigates how accurately the method predicts missing data as the number of missing values within a single record increases. No significant reduction in accuracy is observed as the number of missing values in a single record increases. It is also found that, for the data used, the MLP gives better results than the RBF network.
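As an illustration only, one common way to write an error function of this kind is given below. The notation is an assumption made for this sketch and is not taken from the text: \mathbf{x}_k denotes the known entries of a record, \mathbf{x}_u the unknown (missing) entries, f the trained auto-associative network, and \mathbf{w} its fixed weights.

```latex
E(\mathbf{x}_u) =
  \left\lVert
    \begin{bmatrix} \mathbf{x}_k \\ \mathbf{x}_u \end{bmatrix}
    - f\!\left( \begin{bmatrix} \mathbf{x}_k \\ \mathbf{x}_u \end{bmatrix}; \mathbf{w} \right)
  \right\rVert^{2}
```

Under this reading, the genetic algorithm searches over candidate values of \mathbf{x}_u and keeps those for which the network reproduces the completed record most closely, that is, for which E is smallest.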
Introduction
Inferences made from available data in many applications depend on the completeness and the quality of the data used in the analysis. Inferences made from complete data are therefore likely to be more accurate than those made from incomplete data. However, there are time-critical applications that require the values of some missing variables to be estimated or approximated, so that they can be supplied alongside the values of the other corresponding variables. Such situations may arise in a system that uses a number of instruments, where one or more of the sensors in the system fail. The values from the failed sensor then have to be estimated within a short time and with great precision by taking into account the values of the other sensors in the system. In such situations, approximating the missing values means estimating them while taking into account the inter-relationships that exist among the values of the other variables.
Missing data in a database may arise for various reasons, including data entry errors, respondents’ non-response to some items during data collection, and instrument failure. Table 1 presents a database consisting of five variables, namely x1, x2, x3, x4, and x5, where the values of some variables are missing. If we assume that the observations for some variables in various records are not available, it becomes critical to formulate a technique for estimating the values of the missing entries.
Table 1. Database with missing values
x1 | x2 | x3 | x4 | x5 |
25 | 3.5 | ? | 5000 | -3.5 |
? | 6.9 | 5.6 | ? | 0.5 |
45 | 3.6 | 9.5 | 1500 | 46.5 |
27 | 9.7 | ? | 3000 | ? |
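As a minimal sketch of how such a database might be held in memory before any estimation is attempted, the Python fragment below encodes Table 1 with `None` standing in for each "?" entry. The encoding and the names `table`, `missing_mask`, `missing_per_record`, and `complete_records` are assumptions made for illustration and are not part of the method described in this chapter.

```python
# Table 1, with each "?" entry encoded as None (an encoding chosen for this sketch).
columns = ["x1", "x2", "x3", "x4", "x5"]
table = [
    [25,   3.5, None, 5000, -3.5],
    [None, 6.9, 5.6,  None,  0.5],
    [45,   3.6, 9.5,  1500, 46.5],
    [27,   9.7, None, 3000, None],
]

# Boolean mask: True wherever a value is missing.
missing_mask = [[value is None for value in record] for record in table]

# Number of missing values in each record.
missing_per_record = [sum(mask) for mask in missing_mask]

# Zero-based indices of records with no missing values; only these could be
# used directly as complete examples.
complete_records = [i for i, n in enumerate(missing_per_record) if n == 0]

# Report which variables would have to be estimated in each record.
for number, mask in enumerate(missing_mask, start=1):
    names = [name for name, is_missing in zip(columns, mask) if is_missing]
    print(f"record {number}: missing {names if names else 'nothing'}")
```

In this table only the third record is complete, while the second and fourth records each lack two values, which is the situation of several missing values within a single record that the chapter investigates.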