Classification is a form of data analysis that extracts models for predicting categorical class labels (Han & Kamber, 2001). Data classification has proven useful in a wide variety of applications. For example, a classification model can be built to categorize bank loan applications as either safe or risky. Building a classification model requires training data containing multiple independent variables and a dependent variable (the class label). A data record whose class label is known is termed “labeled”; a record whose class label is unknown is “unlabeled”.
In many situations a large amount of unlabeled data is available but only a small amount of labeled data. Using only labeled data to build classification models can ignore useful information contained in the unlabeled data. Furthermore, unlabeled data is often much cheaper and more plentiful than labeled data, so extracting useful information from it that reduces the need for labeled examples can be a significant benefit (Balcan & Blum, 2005). The default practice is to build a classification model from the labeled data alone and then use it to assign class labels to the unlabeled data. However, when there is too little labeled data, a model built this way can be biased and inaccurate, and the class labels it assigns to the unlabeled data can in turn be inaccurate. How to leverage the information contained in the unlabeled data to improve the accuracy of the classification model is therefore an important research question.
Research on handling unlabeled data can be broadly grouped into two streams, each motivated by a different scenario.
The first scenario covers applications where the modeler can acquire, at a cost, the labels corresponding to the unlabeled data. For example, consider the problem of predicting whether a video clip contains suspicious activity (such as the presence of a “most wanted” fugitive). Surveillance cameras produce vast amounts of video, and labeling experts exist (in law enforcement and the intelligence agencies). Hence labeling any video stream is possible, but it is an expensive task because it requires human time and interpretation (Yan et al., 2003). A similar example is the “speech-to-text” task of generating automatic transcriptions of speech fragments (Hakkani-Tur et al., 2004; Raina et al., 2007). People can listen to the speech fragments and write text transcriptions that serve as labels for those fragments, but this is an expensive task. The fields of active learning (e.g. MacKay, 1992; Saar-Tsechansky & Provost, 2001) and optimal experimental design (Atkinson, 1996) address how modelers can selectively acquire labels in this scenario. Active learning acquires labeled data incrementally, using the model learned so far to select particularly helpful additional training examples for labeling. When successful, active learning methods reduce the number of instances that must be labeled to achieve a particular level of accuracy (Saar-Tsechansky & Provost, 2001). Optimal experimental design studies the problem of deciding which subjects to experiment on (e.g. in medical trials) given limited resources (Atkinson, 1996).
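The incremental loop described above can be illustrated with a minimal sketch. It uses pool-based uncertainty sampling, which is only one common active learning strategy and is not necessarily the method used in the works cited here. The nearest-centroid classifier on a single numeric feature is a stand-in chosen for brevity, and the names fit_centroids, predict, margin, active_learn and oracle are hypothetical. The oracle plays the role of the costly human labeler, and the initial labeled set is assumed to contain at least one example of each of two classes.

```python
def fit_centroids(labeled):
    """Fit a nearest-centroid model: the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def margin(centroids, x):
    """Gap between the two nearest centroids; a small gap means low confidence.

    Assumes the model knows at least two classes.
    """
    distances = sorted(abs(x - c) for c in centroids.values())
    return distances[1] - distances[0]


def active_learn(labeled, pool, oracle, budget):
    """Acquire up to `budget` labels, always for the least certain pool item.

    labeled: list of (x, label) pairs covering at least two classes (assumption)
    pool:    list of unlabeled feature values
    oracle:  hypothetical callable standing in for the paid human labeler
    """
    labeled = list(labeled)  # copy so the caller's lists are not modified
    pool = list(pool)
    for _ in range(budget):
        if not pool:
            break
        centroids = fit_centroids(labeled)  # the model learned so far
        query = min(pool, key=lambda x: margin(centroids, x))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # pay for exactly one label
    return fit_centroids(labeled), labeled


# Illustrative data only: one made-up risk score per loan application.
seed = [(0.1, "safe"), (0.9, "risky")]
pool = [0.2, 0.4, 0.55, 0.7, 0.8]
model, labeled = active_learn(
    seed, pool, oracle=lambda x: "risky" if x >= 0.6 else "safe", budget=2
)
```

With this toy data the two purchased labels are for 0.55 and then 0.7, the points closest to the model's decision boundary at the time of each query, while points the model is already confident about (such as 0.2 and 0.8) are left unlabeled. In a real application the oracle would be a human annotator and the classifier would be whatever model the application calls for.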