In traditional text categorization, a classifier is built using labeled training documents from a set of predefined classes. This chapter studies a different problem: partially supervised text categorization. Given a set P of positive documents of a particular class and a set U of unlabeled documents (which contains both hidden positive and hidden negative documents), we build a classifier using P and U to classify the data in U as well as future test data. The key feature of this problem is that no labeled negative documents are available, which makes traditional text classification techniques inapplicable. In this chapter, we introduce the main techniques, S-EM, PEBL, Roc-SVM, and A-EM, for solving this partially supervised problem. In many application domains, partially supervised text categorization is preferred because it avoids the labor-intensive effort of manually labeling negative documents.
Text categorization is an important problem and has been studied extensively in machine learning, information retrieval and natural language processing. To build a text classifier, the user first collects a set of training documents, which are labeled with predefined or known classes (labeling is often done manually). A classification algorithm is then applied to the training data to build a classifier which is subsequently employed to assign the predefined classes to documents in a test set (for evaluation) or future instances (in practice). This approach to building classifiers is called supervised learning/classification because the training documents all have pre-labeled classes.
Over the past few years, a special form of text categorization, partially supervised text categorization, has been proposed. The problem can be regarded as a two-class (positive and negative) classification problem in which there are labeled positive training data but no labeled negative training data. Owing to the lack of negative training data, building the classifier is only partially supervised. Since traditional classification techniques require both labeled positive and labeled negative documents to build a classifier, they are not suitable for this problem. Although it is possible to manually label some negative documents, doing so is labor-intensive and very time-consuming. Partially supervised text categorization studies how to build a classifier using only a set of positive documents and a set of unlabeled documents. Note that partially supervised text categorization is also called PU learning, where P and U represent the “positive set” and the “unlabeled set,” respectively. Additionally, although PU learning is commonly studied in the context of text and Web page domains, the general concepts and algorithms are also applicable to classification tasks in other domains. Next, we discuss the problem definition and some possible applications of PU learning.
With the growing volume of text documents on the Web, Internet news feeds, and digital libraries, one often wants to find the documents related to one’s interests. For instance, one may want to build a repository of machine learning (ML) papers. One can start with an initial set of ML papers (e.g., the ICML proceedings) and then find further ML papers in related online journals or conference series, e.g., the AI Journal, AAAI, IJCAI, SIGIR, and KDD. Note that collecting unlabeled documents is normally easy and inexpensive in many text and Web page domains, especially those involving online sources.
The ability to build classifiers without negative training data is particularly useful when one needs to find positive documents from many text collections or sources. Given a new collection, the algorithm can simply be run to find the positive documents in it. Following the above example, given a collection of AAAI papers (an unlabeled set), one can run the algorithm to identify the ML papers among them; given a set of SIGIR papers, one can run it again to find the ML papers there. If traditional classification techniques were used for this application, the user would have to label a negative training set for every online source, such as the AAAI, SIGIR, and KDD proceedings, which is time-consuming. In other words, one cannot take a classifier built using ML papers as the positive set (P) and manually labeled non-ML papers in AAAI as the negative set (N) and apply it to classify the SIGIR papers (as a test set), because the two sources come from different domains: the distribution of negative test documents in SIGIR (information retrieval papers) differs greatly from that of the negative training documents in N (artificial intelligence papers), so the classifier will suffer. A user would obviously prefer techniques that provide accurate classification without manually labeling any negative documents.
In the application above, the class of documents that one is interested in is called the positive class, and its documents are positive documents (e.g., the ML papers in the ICML proceedings). The set of positive documents is denoted by P. The unlabeled set U (e.g., the AAAI proceedings) contains two kinds of documents. Some belong to the positive class; these are the hidden positive documents in U. The rest are called negative class documents, or simply negative documents (e.g., the non-ML papers in the AAAI proceedings), since they do not belong to the positive class. Given the positive set P, PU learning aims to identify the documents in U that belong to the positive class. The setting of PU learning is shown in Figure 1.
Key Terms in this Chapter
1DNF: This technique first builds a positive feature set PF containing the words that occur in the positive set P more frequently than in the unlabeled set U. It then extracts as strong negatives those documents in U that do not contain any positive feature in PF. This technique typically extracts only a small number of negative documents.
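The 1DNF step above can be sketched in a few lines. This is a minimal illustration on hypothetical toy documents, assuming whitespace tokenization and document-frequency proportions as the frequency comparison; real implementations work over a properly tokenized corpus.

```python
from collections import Counter

def one_dnf(P, U):
    """1DNF extraction (sketch): PF holds words occurring in a larger
    proportion of P documents than of U documents; strong negatives are
    the U documents containing no PF word."""
    p_df = Counter(w for doc in P for w in set(doc.split()))
    u_df = Counter(w for doc in U for w in set(doc.split()))
    PF = {w for w, c in p_df.items() if c / len(P) > u_df.get(w, 0) / len(U)}
    strong_neg = [doc for doc in U if not (set(doc.split()) & PF)]
    return PF, strong_neg

# Hypothetical toy data: ML paper titles as P, mixed papers as U.
P = ["svm kernel learning", "neural network learning"]
U = ["svm learning paper", "retrieval index query", "query ranking index"]
PF, RN = one_dnf(P, U)
```

Here the two information-retrieval documents share no word with PF and are extracted as strong negatives, while the hidden ML paper in U is not, matching the behavior described above.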
Unlabeled Set: A mixed set U that contains both hidden positive documents (of the same class as P) and hidden negative documents. The unlabeled set U, combined with the positive set P, is used to build a classifier that identifies novel positive documents.
Roc-SVM: In its first step, Roc-SVM extracts reliable negative documents from the unlabeled set U using the Rocchio method. In its second step, it selects a good classifier from the set of classifiers built iteratively by SVM, unlike PEBL, which simply uses the last classifier built. Roc-SVM is robust and performs consistently well under a variety of conditions.
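The Rocchio-based first step of Roc-SVM can be sketched as follows. This is a minimal illustration assuming whitespace tokenization, L2-normalized term-frequency vectors, and the common Rocchio weights alpha = 16 and beta = 4; the documents are hypothetical toy data.

```python
import math
from collections import Counter

def unit_tf(doc):
    """L2-normalized term-frequency vector of a document."""
    v = Counter(doc.split())
    n = math.sqrt(sum(x * x for x in v.values()))
    return {w: x / n for w, x in v.items()}

def mean_vec(docs):
    acc = {}
    for d in docs:
        for w, x in unit_tf(d).items():
            acc[w] = acc.get(w, 0.0) + x
    return {w: x / len(docs) for w, x in acc.items()}

def cosine(a, b):
    num = sum(x * b.get(w, 0.0) for w, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return num / (na * nb) if na and nb else 0.0

def rocchio_negatives(P, U, alpha=16, beta=4):
    """Step 1 of Roc-SVM (sketch): build a positive prototype from P pushed
    away from U, and a 'negative' prototype from U pushed away from P; the
    U docs closer to the negative prototype are reliable negatives."""
    mp, mu = mean_vec(P), mean_vec(U)
    words = set(mp) | set(mu)
    c_pos = {w: alpha * mp.get(w, 0.0) - beta * mu.get(w, 0.0) for w in words}
    c_neg = {w: alpha * mu.get(w, 0.0) - beta * mp.get(w, 0.0) for w in words}
    return [d for d in U
            if cosine(unit_tf(d), c_neg) > cosine(unit_tf(d), c_pos)]

# Hypothetical toy data: ML paper titles as P, mixed papers as U.
P = ["svm kernel learning", "neural network learning"]
U = ["svm learning paper", "retrieval index query", "query ranking index"]
RN = rocchio_negatives(P, U)
```

On this toy data the two information-retrieval documents fall on the negative side of both prototypes and are extracted as reliable negatives, while the hidden ML paper stays in U for the SVM second step.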
Two-Step Strategy of PU Learning: The first step is to identify a set RN of reliable negative documents from the unlabeled set U; the second step is to build a classifier using the positive set P, the reliable negative set RN, and the remaining unlabeled set U’ (U’ = U − RN).
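The two steps above can be sketched end to end. In this minimal illustration both steps are deliberately crude stand-ins: word overlap with P replaces the Spy/1DNF/Rocchio extraction of step 1, and a word-overlap vote replaces the EM- or SVM-based classifier of step 2; the documents are hypothetical.

```python
def two_step_pu(P, U):
    """Sketch of the two-step PU strategy.
    Step 1: reliable negatives RN = U docs sharing no word with P.
    Step 2: classify the remaining docs U' = U - RN by comparing their
    word overlap with P against their overlap with RN."""
    pos_vocab = {w for d in P for w in d.split()}
    RN = [d for d in U if not (set(d.split()) & pos_vocab)]
    neg_vocab = {w for d in RN for w in d.split()}
    hidden_pos = [d for d in U if d not in RN
                  and len(set(d.split()) & pos_vocab)
                  > len(set(d.split()) & neg_vocab)]
    return RN, hidden_pos

# Hypothetical toy data: ML paper titles as P, mixed papers as U.
P = ["svm kernel learning", "neural network learning"]
U = ["svm learning paper", "retrieval index query", "query ranking index"]
RN, hidden = two_step_pu(P, U)
```

The skeleton is the point here: any of the extraction methods described in this chapter can replace step 1, and any classifier trained on P versus RN can replace step 2.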
PU Learning: Given a positive set P and an unlabeled set U, a classifier is built to identify hidden positive documents in U or in a separate test set T. The objective of PU learning is to accurately classify the documents in U or T into the positive class (documents of the same class as P) and the negative class (all other documents).
S-EM: S-EM works by sending some “spy” documents from the positive set P into the unlabeled set U. Since the spies are themselves positive documents, they should behave similarly to the hidden positive documents in U; their predicted probabilities therefore let the classifier reliably infer the behavior of the unknown positive documents and separate likely negative documents from likely positive ones in U.
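The spy step can be sketched as follows. This is a simplified illustration, not the full S-EM algorithm: it uses a plain multinomial Naive Bayes scorer, takes the lowest spy score as the threshold (actual S-EM uses a noise-tolerant percentile threshold and then iterates with EM), and the documents and the 15% spy ratio default are illustrative.

```python
import math
import random
from collections import Counter

def nb_scores(pos_docs, neg_docs, docs):
    """P(positive | doc) under multinomial Naive Bayes with Laplace smoothing."""
    vocab = {w for d in pos_docs + neg_docs for w in d.split()}
    def log_probs(train):
        counts = Counter(w for d in train for w in d.split())
        total = sum(counts.values())
        return {w: math.log((counts[w] + 1) / (total + len(vocab)))
                for w in vocab}
    lp, ln = log_probs(pos_docs), log_probs(neg_docs)
    prior = math.log(len(pos_docs) / len(neg_docs))
    out = []
    for d in docs:
        margin = prior + sum(lp[w] - ln[w] for w in d.split() if w in vocab)
        out.append(1.0 / (1.0 + math.exp(-margin)))
    return out

def spy_negatives(P, U, spy_ratio=0.15, seed=0):
    """Spy step (sketch): plant spies from P into U, train NB treating
    (P - spies) as positive and (U + spies) as negative, then keep as
    likely negatives the U docs scoring below the lowest spy score."""
    rng = random.Random(seed)
    spies = rng.sample(P, max(1, round(len(P) * spy_ratio)))
    mixed = U + spies
    scores = nb_scores([d for d in P if d not in spies], mixed, mixed)
    threshold = min(s for d, s in zip(mixed, scores) if d in spies)
    return [d for d, s in zip(U, scores) if s < threshold]

# Hypothetical toy data: ML paper titles as P, mixed papers as U.
P = ["svm kernel learning", "kernel svm learning model",
     "learning svm kernel", "svm learning kernel method"]
U = ["svm kernel learning paper", "retrieval index query",
     "query ranking index"]
RN = spy_negatives(P, U)
```

Because the spies are genuinely positive, any unlabeled document scoring well below every spy is very unlikely to be a hidden positive, which is what makes the extracted negatives reliable.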
Positive Set: A set P of positive documents of the class in which the user is interested. The documents in P essentially describe the user’s interests.