The essence of data mining is to search for pertinent information that may exist in data (often large data sets). The vast amount of data now held in the world, a result of the increasing capacity of storage media, brings with it the issue of missing values (Olinsky et al., 2003; Brown and Kros, 2003). This article considers the general issue of missing values in data mining and, through the use of a data mining technique, demonstrates the effect of managing, or not managing, their presence. The issue of missing values was first exposited over forty years ago in Afifi and Elashoff (1966). Since then it has continually been the focus of study and explanation (El-Masri and Fox-Wasylyshyn, 2005), covering issues such as the nature of their presence and their management (Allison, 2000). With this in mind, one consistent, and arguably naïve, aspect of the missing value debate is the limited set of general strategies available for their management, the two main ones being the simple deletion of cases with missing data and some form of imputation of the missing values (see Elliott and Hawthorne, 2005). Examples of the specific investigation of missing data (and data quality) include those in data warehousing (Ma et al., 2000) and customer relationship management (Berry and Linoff, 2000).
An alternative strategy considered here is the retention of the missing values, with their subsequent contribution treated as ‘ignorance’ in any data mining undertaken on the original incomplete data set. A consequence of this retention is that full interpretability can be placed on the results found from the original incomplete data set. This strategy can be followed when using the nascent CaRBS technique for object classification (Beynon, 2005a, 2005b). CaRBS analyses are presented here to illustrate that data mining can manage the presence of missing values much more effectively than the more inhibitory traditional strategies.
An example data set is considered, with a noticeable level of missing values present in its original form. A further, critical increase in the number of missing values in the data set then illustrates the benefit of ‘intelligent’ data mining (in this case using CaRBS).
Underlying the need to concern oneself with the issue of missing values is the reality that most data analysis techniques were not designed for their presence (Schafer and Graham, 2002). It follows that an external level of management of the missing values is necessary. There is, however, underlying caution about the ad hoc manner in which missing values may be managed; this lack of thought is well expressed by Huang and Zhu (2002, p. 1613):
Inappropriate treatment of missing data may cause large errors or false results.
A recent article by Brown and Kros (2003) looked at the impact of missing values on data mining algorithms, including k-nearest neighbour, decision trees, association rules and neural networks. For each of these techniques, the presence of missing values is considered to have an impact, with a level of external management necessary to accommodate them. Indeed, perhaps the prevailing attitude is that having to manage the missing values is the norm, with little thought given to the consequences of doing so. Yet there is also the possibility that missing values differ in important ways from those that are present.
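The consequences of such management can be illustrated with a small example. The following Python fragment is a minimal sketch, not taken from any of the cited studies: the data values, the rule that the largest values go unreported, and the function names `listwise_delete` and `mean_impute` are all assumptions made purely for illustration. It contrasts the two traditional strategies, case deletion and mean imputation, when the missing values differ systematically from those that are present.

```python
from statistics import mean, pvariance

# Hypothetical complete data (illustrative values only).
complete = [12, 15, 18, 21, 24, 27, 30, 33, 36, 39]

# Assumed missingness rule: the three largest values are not reported.
observed = [v if v < 33 else None for v in complete]


def listwise_delete(values):
    """Drop every case whose value is missing."""
    return [v for v in values if v is not None]


def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


true_mean = mean(complete)
deleted_mean = mean(listwise_delete(observed))
imputed_mean = mean(mean_impute(observed))

deleted_variance = pvariance(listwise_delete(observed))
imputed_variance = pvariance(mean_impute(observed))

print("mean of complete data:   ", true_mean)
print("mean after deletion:     ", deleted_mean)
print("mean after imputation:   ", imputed_mean)
print("variance after deletion: ", deleted_variance)
print("variance after imputation:", imputed_variance)
```

Under these assumed data, both strategies understate the mean of the complete data, and mean imputation additionally shrinks the variance, since the imputed values contribute no spread of their own.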
While common, the specific notion of the management of missing values is not so clear, since it is often necessary first to understand the reasons for their presence (De Leeuw, 2001), and then how these reasons may dictate how the values should be further described. For example, in the case of large survey data, the missing data may be classed as (ibid.): Missing by design, Inapplicable item, Cognitive task too difficult, Refuse to respond, Don’t know and Inadequate score. Whether the data is survey based or from another source, a typical solution is to make simplifying assumptions about the mechanism that causes the missing data (Ramoni and Sebastiani, 2001). These mechanisms (causes) are consistently classified into three categories, based around the distributions of their presence, namely: