Sampling the Web as Training Data for Text Classification

Wei-Yen Day (National Taiwan University, Taiwan), Chun-Yi Chi (National Taiwan University, Taiwan), Ruey-Cheng Chen (National Taiwan University, Taiwan) and Pu-Jen Cheng (National Taiwan University, Taiwan)
Copyright: © 2010 | Pages: 19
DOI: 10.4018/jdls.2010100102

Data acquisition is a major concern in text classification. Conventional methods require considerable human effort to build a quality training collection, and that effort is not always available to researchers. In this paper, the authors explore ways to collect training data automatically by sampling the Web with a set of given class names. The basic idea is to generate appropriate keywords and submit them as queries to search engines to acquire training data. The first of the two methods presented samples the concepts common to the classes; the second samples the concepts that discriminate each class from the others. A series of experiments was carried out independently on two different datasets, and the results show that the proposed methods significantly improve classifier performance even without manually labeled training data. The authors' strategy for retrieving Web samples substantially helps conventional document classification in terms of accuracy and efficiency.
Article Preview

Document classification has been extensively studied in the fields of data mining and machine learning. Conventionally, document classification is a supervised learning task (Yang & Liu, 1999; Yang, 1999) in which adequately labeled documents must be provided so that classification models, i.e., classifiers, can be learned from them. In practice, however, this requirement has its limitations. First, the cost of manually labeling a sufficient number of training documents can be high. Second, the quality of manual labels is questionable, especially when the annotator is unfamiliar with the topics of the given classes. Third, in certain applications, such as email spam filtering, the prototypes of documents considered spam may change over time, which creates a need for a dynamic training corpus tailored to the application. Automatic methods for data acquisition can therefore be very important in real-world classification work and require further exploration.

Previous work on the automatic acquisition of training sets can be divided into two types. The first focuses on augmenting a small number of labeled training documents with a large pool of unlabeled documents. The key idea is to train an initial classifier, use it to label the unlabeled documents, and then retrain the classifier iteratively on the newly labeled data. Although classifying unlabeled data is efficient, human effort is still required at the beginning of the training process.
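The iterative labeling loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited work: the bag-of-words centroid classifier, the confidence measure, the `threshold` value, and the toy documents are all hypothetical choices made for the example.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_centroids(labeled):
    """Build one bag-of-words centroid (term counts) per class."""
    centroids = {}
    for text, label in labeled:
        centroids.setdefault(label, Counter()).update(tokenize(text))
    return centroids

def predict(centroids, text):
    """Return (label, confidence); confidence is the best class's share of term overlap."""
    tokens = tokenize(text)
    scores = {label: sum(c[t] for t in tokens) for label, c in centroids.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # no known term: leave the document unlabeled
    best = max(scores, key=scores.get)
    return best, scores[best] / total

def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Repeatedly label confident documents and retrain on the enlarged set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = train_centroids(labeled)
        confident, remaining = [], []
        for text in pool:
            label, conf = predict(centroids, text)
            if label is not None and conf >= threshold:
                confident.append((text, label))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new was labeled, so retraining would change nothing
        labeled.extend(confident)
        pool = remaining
    return train_centroids(labeled), labeled

# Hypothetical seed set (the manual effort) and unlabeled pool.
seed = [("apple iphone mac company", "tech"), ("apple fruit orchard juice", "food")]
pool = ["new iphone released", "orchard harvest season", "harvest season festival"]
model, all_labeled = self_train(seed, pool)
labels = dict(all_labeled)
```

In this toy run, "harvest season festival" shares no term with the seed documents, so it can only be labeled in the second round, after "orchard harvest season" has been added to the training set. The seed set is still written by hand, which is the limitation the paragraph points out.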

The second type of work focuses on collecting training data from the Web. As more data is put on the Web every day, there is great potential to exploit it and to devise algorithms that automatically fetch effective training data for diverse topics. A major challenge for Web-based methods is how to locate quality training data by sending effective queries, e.g., class names, to search engines. Work of this type includes Huang, Chuang, and Chien (2004), Huang, Lin, and Chien (2005), and Hung and Chien (2004, 2007), which present an approach that assumes the search results initially returned for a class name are relevant to the class. The search results are then treated as automatically labeled data, and additional terms associated with the class names are extracted from them. By sending the class names together with the associated terms as queries, appropriate training documents can be retrieved automatically. Although generating queries is more convenient than collecting training data manually, the quality of the initial search results may not always be good, especially when a given class covers multiple concepts. For example, the class “Apple” covers both a company and a fruit. This problem arises in a wide range of applications.
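The associated-term step can be illustrated with a small sketch. It assumes the snippets for a class name have already been fetched; the snippets, the stopword list, and the ranking by document frequency are hypothetical choices for illustration and are not taken from the cited works.

```python
from collections import Counter

# Small illustrative stopword list (an assumption, not from the source).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "for", "to", "its"}

def associated_terms(class_name, snippets, k=3):
    """Rank terms that co-occur with the class name in its own search results."""
    name_tokens = set(class_name.lower().split())
    doc_freq = Counter()
    for snippet in snippets:
        tokens = set(snippet.lower().split())
        doc_freq.update(tokens - STOPWORDS - name_tokens)
    # Highest document frequency first; alphabetical order breaks ties.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def expanded_queries(class_name, terms):
    """Combine the class name with each associated term to form new queries."""
    return [f"{class_name} {term}" for term in terms]

# Hypothetical snippets returned for the query "Apple".
snippets = [
    "apple unveils new iphone",
    "apple iphone sales rise",
    "apple pie recipe",
    "apple iphone review",
]
terms = associated_terms("Apple", snippets, k=1)
queries = expanded_queries("Apple", terms)
```

The snippet "apple pie recipe" shows the weakness noted above: results for an ambiguous class name mix concepts, so terms such as "pie" enter the candidate list even though they belong to a different sense of the class.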

The goal of this paper is, given a set of concept classes, to automatically acquire a training corpus based solely on the names of the given classes. As in our previous attempts, we produce keywords by expanding the concepts encompassed by the class names, submit them as queries to search engines, and use the returned snippets as training instances in the subsequent classification tasks. Two issues may arise with this technique. First, the given class names are usually very short and ambiguous, which makes the search results less relevant to the classes. Second, the expanded keywords generated for different classes may be so similar to each other that the corresponding search-result snippets have little power to distinguish one class from the others.
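The overall flow (expanded keywords, queries, snippets as labeled instances) might look like the sketch below. The `search` callable stands in for a real search engine and is an assumption, as are the function name `build_training_set`, the query format, and the in-memory `FAKE_INDEX`; the paper's actual query construction is not specified in this preview.

```python
from typing import Callable, Dict, List, Tuple

def build_training_set(
    class_keywords: Dict[str, List[str]],
    search: Callable[[str], List[str]],
    per_query: int = 10,
) -> List[Tuple[str, str]]:
    """Query once per (class, keyword) pair and label each snippet with its class."""
    training, seen = [], set()
    for label, keywords in class_keywords.items():
        for keyword in keywords:
            query = f"{label} {keyword}"
            for snippet in search(query)[:per_query]:
                if snippet not in seen:  # keep only the first copy of a snippet
                    seen.add(snippet)
                    training.append((snippet, label))
    return training

# Hypothetical stand-in for a search engine: a fixed query-to-snippets map.
FAKE_INDEX = {
    "Apple company": ["Apple reports quarterly earnings", "Apple opens new store"],
    "Apple iPod": ["Apple opens new store", "iPod nano review"],
    "Microsoft company": ["Microsoft acquires startup"],
}

def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])

training = build_training_set(
    {"Apple": ["company", "iPod"], "Microsoft": ["company"]}, fake_search
)
```

The resulting list of (snippet, label) pairs can be passed to any supervised classifier; no document in it was labeled by hand.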

We present two concept expansion methods, one for each of these problems. The first method, expansion by common concepts, aims to alleviate the problem of ambiguous class names. It uses the relations among the classes to discover the concepts they share. For example, “company” could be one of the common concepts of the classes “Apple” and “Microsoft”. Queries that combine the class names with the common concepts retrieve training documents relevant to the given classes. The second method, expansion by discriminative concepts, aims to find concepts that separate the given classes. For example, “iPod” could be one of the unique concepts of the class “Apple”. Queries that combine the class names with the discriminative concepts retrieve training documents that effectively distinguish one class from another.
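The distinction between the two kinds of concepts can be made concrete with a simple sketch. The preview does not describe how the paper computes them, so the criteria here are assumptions chosen for illustration: a common concept is a term found in the snippets of every class, and a discriminative concept is a term found in the snippets of exactly one class. The snippets are hypothetical.

```python
from collections import Counter

def term_doc_freq(snippets):
    """Count in how many snippets each term appears."""
    doc_freq = Counter()
    for snippet in snippets:
        doc_freq.update(set(snippet.lower().split()))
    return doc_freq

def _class_name_tokens(class_snippets):
    return {tok for name in class_snippets for tok in name.lower().split()}

def common_concepts(class_snippets, k=3):
    """Terms present in every class, ranked by their lowest per-class frequency."""
    dfs = [term_doc_freq(s) for s in class_snippets.values()]
    shared = set.intersection(*(set(df) for df in dfs))
    shared -= _class_name_tokens(class_snippets)
    return sorted(shared, key=lambda t: (-min(df[t] for df in dfs), t))[:k]

def discriminative_concepts(class_snippets, k=3):
    """For each class, terms that appear in no other class, ranked by frequency."""
    dfs = {name: term_doc_freq(s) for name, s in class_snippets.items()}
    names = _class_name_tokens(class_snippets)
    result = {}
    for name, df in dfs.items():
        others = set().union(*(set(o) for n, o in dfs.items() if n != name))
        unique = set(df) - others - names
        result[name] = sorted(unique, key=lambda t: (-df[t], t))[:k]
    return result

# Hypothetical search-result snippets per class.
class_snippets = {
    "Apple": ["apple company launches ipod", "apple ipod sales", "apple company stock"],
    "Microsoft": [
        "microsoft company releases windows",
        "microsoft windows update",
        "microsoft company stock",
    ],
}
common = common_concepts(class_snippets, k=1)
unique = discriminative_concepts(class_snippets, k=1)
```

With these snippets the sketch recovers "company" as the shared concept and "ipod" and "windows" as the class-specific ones, matching the examples in the text. A strict presence test like this is brittle on real data, where a weighted score would likely be needed.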
