This chapter introduces two practical techniques for improving Naïve Bayes text classifiers that are widely used for text classification. The Naïve Bayes has been evaluated to be a practical text classification algorithm due to its simple classification model, reasonable classification accuracy, and easy update of classification model. Thus, many researchers have a strong incentive to improve the Naïve Bayes by combining it with other meta-learning approaches such as EM (Expectation Maximization) and Boosting. The EM approach is to combine the Naïve Bayes with the EM algorithm and the Boosting approach is to use the Naïve Bayes as a base classifier in the AdaBoost algorithm. For both approaches, a special uncertainty measure fit for Naïve Bayes learning is used. In the Naïve Bayes learning framework, these approaches are expected to be practical solutions to the problem of lack of training documents in text classification systems.
The Naïve Bayes learning based classifier is simple yet surprisingly accurate, and thus has been used in many machine-learning related classification projects (A-Engelson & Dagan, 1999; Katakis, Tsoumakas & Vlahavas, 2006; Nigam, McCallum, Thrun & Mitchell, 1998). In particular, it is remarkably successful for text classification problems, despite the fact that text data generally has a large feature space (Katakis, Tsoumakas & Vlahavas, 2006; Nigam, McCallum, Thrun & Mitchell, 1998). Compared to other learning methods, the Naïve Bayes has a number of good features that serve text classification systems.
The learning of the Naïve Bayes classifier does not demand any other statistics than feature statistics, and no further complex generalization processes are required unlike the other machine learning methods such as support vector machine.
It is very easy to incrementally update the classification model of a given categories due to its simplicity. When a new training documents arrive, the feature statistics are updated and feature evaluation can be immediately calculated without the need of re-processing past data. This characteristic is essential in the case where the document collection is highly evolutionary.
Since a classification model can be developed with a single pass over the documents, a Naïve Bayes learning process is faster than that of other methods.
The Naïve Bayes learning can easily accommodate the degree of importance of features occurring in documents; for example, for learning, one may double or triple the frequency of terms occurring in titles in news articles and title tags in HTML documents.
Because of the above advantages, there has been a strong incentive to improve the Naïve Bayes text classifier by combining it with other learning techniques (Kim & Kim, 2004; Kim, Rim, Yook & Lim, 2002; Nigam, McCallum, Thrun & Mitchell, 1998). To improve the classification performance, we first should resolve the problem of lack of training examples in the real-world environment. In general, most machine learning methods including Naïve Bayes assumes the existence of good quality documents for training. However, this assumption is not effective in real world operational environments. Thus we must assume that rather than trying to prepare complete training examples at one time, new training documents are continuously provided for learning as a data stream. Thus in such an environment, continuous update of the current classification model is required whenever a new set of training examples are prepared. Thus, how to obtain training examples has become an important issue in practically developing a text classifier. To resolve such a problem, active learning approach was proposed in which the learner can actively choose the training documents from a pool of unlabeled documents (A-Engelson & Dagan, 1999). However, active learning approach needs additional cost for labeling unknown documents by human experts. In this chapter, I will introduce two approaches to improving Naïve Bayes with incomplete training set: one is the EM (Expectation Maximization) approach and the other is the Boosting one.
Key Terms in this Chapter
Classif ication Uncertainty: The degree of uncertainty in the classification of the example with respect to the current model derived from given training examples.
AdaBoost: a kind of boosting algorithms that builds subsequent classifiers by being tweaked in favor of those instances misclassified by previous classifiers.
Selective Sampling: A kind of active learning methods that selectively chooses a set of candidate training data from unlabeled data.
Boosting: A machine learning meta-learning algorithm for performing supervised learning that creates a single strong learner with a set of weak learner.
Naïve Bayes: Simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions.
Text Classification: The task of automatically assigning a set of text documents to a set of predefined classes.
EM Algorithm: An iterative method for estimating maximum likelihood in problems with incomplete (or unlabeled) data.