In traditional text categorization, a classifier is built using labeled training documents from a set of predefined classes. This chapter studies a different problem: partially supervised text categorization. Given a set P of positive documents of a particular class and a set U of unlabeled documents (which contains both hidden positive and hidden negative documents), we build a classifier using P and U to classify the data in U as well as future test data. The key feature of this problem is that no labeled negative documents are available, which makes traditional text classification techniques inapplicable. In this chapter, we introduce the main techniques, S-EM, PEBL, Roc-SVM and A-EM, for solving the partially supervised problem. In many application domains, partially supervised text categorization is preferred since it saves the labor-intensive effort of manually labeling negative documents.
Text categorization is an important problem and has been studied extensively in machine learning, information retrieval and natural language processing. To build a text classifier, the user first collects a set of training documents, which are labeled with predefined or known classes (labeling is often done manually). A classification algorithm is then applied to the training data to build a classifier which is subsequently employed to assign the predefined classes to documents in a test set (for evaluation) or future instances (in practice). This approach to building classifiers is called supervised learning/classification because the training documents all have pre-labeled classes.
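To make the supervised setting concrete, the following is a minimal, self-contained sketch of training and applying a text classifier, using a multinomial naive Bayes model built from scratch (the toy documents and class names are illustrative; a real system would use an established library and proper preprocessing):

```python
# Minimal supervised text classification: train on labeled documents,
# then assign a predefined class to a new document.
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label) pairs. Returns priors, per-class word counts, vocabulary."""
    class_docs = defaultdict(int)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        words = text.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    total = sum(class_docs.values())
    priors = {c: n / total for c, n in class_docs.items()}
    return priors, word_counts, vocab

def classify_nb(text, priors, word_counts, vocab):
    """Return the most probable class, using Laplace-smoothed word likelihoods."""
    best, best_lp = None, float("-inf")
    for c in priors:
        lp = math.log(priors[c])
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[c][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy training set labeled with two predefined classes (hypothetical data).
training = [
    ("learning algorithms neural networks", "ML"),
    ("supervised learning classification models", "ML"),
    ("stock market prices trading", "Finance"),
    ("interest rates bond market", "Finance"),
]
model = train_nb(training)
print(classify_nb("neural networks for classification", *model))  # prints "ML"
```

Note that both classes require labeled training documents here; it is exactly this need for labeled negative (non-positive) examples that PU learning removes.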
Over the past few years, a special form of text categorization, partially supervised text categorization, has been proposed. This problem can be regarded as a two-class (positive and negative) classification problem in which there are labeled positive training data but no labeled negative training data. Due to the lack of negative training data, building the classifier is only partially supervised. Since traditional classification techniques require both labeled positive and labeled negative documents to build a classifier, they are not suitable for this problem. Although it is possible to manually label some negative documents, doing so is labor-intensive and very time-consuming. Partially supervised text categorization studies how to build a classifier using only a set of positive documents and a set of unlabeled documents. Note that partially supervised text categorization is also called PU learning, where P and U represent the “positive set” and the “unlabeled set” respectively. Additionally, although PU learning is commonly studied in the context of text and Web page domains, the general concepts and algorithms are also applicable to classification tasks in other domains. Next, we discuss the problem definition and some possible applications of PU learning.
With the growing volume of text documents on the Web, Internet news feeds, and digital libraries, one often wants to find those documents that are related to one’s interest. For instance, one may want to build a repository of machine learning (ML) papers. One can start with an initial set of ML papers (e.g., an ICML proceedings) and then find additional ML papers in related online journals or conference series, e.g., the AI journal, AAAI, IJCAI, SIGIR, and KDD. Note that collecting unlabeled documents is normally easy and inexpensive in many text and Web page domains, especially those involving online sources.
The ability to build classifiers without negative training data is particularly useful if one needs to find positive documents from many text collections or sources. Given a new collection, the algorithm can be run to find the positive documents in it. Following the above example, given a collection of AAAI papers (an unlabeled set), one can run the algorithm to identify the ML papers in it; given a set of SIGIR papers, one can run the algorithm again to find the ML papers there. If traditional classification techniques were used for this application, a user would have to label a negative training set for every online source, such as the AAAI, SIGIR and KDD proceedings, which is time-consuming. In other words, one cannot use a classifier built with ML papers as the positive set (P) and manually labeled non-ML papers in AAAI as the negative set (N) to classify the SIGIR papers (as a test set), because they are from different domains: the classifier will suffer since the distribution of the negative test documents in SIGIR (information retrieval papers) is very different from that of the negative training documents in N (artificial intelligence papers). A user would obviously prefer techniques that can provide accurate classification without manually labeling any negative documents.
In the application above, the class of documents that one is interested in is called the positive class, and its documents are simply called positive documents (e.g., the ML papers in an ICML proceedings). The set of labeled positive documents is denoted by P. The unlabeled set U (e.g., an AAAI proceedings) contains two kinds of documents. Some belong to the positive class; these are the hidden positive documents in U. The rest of the documents in U are called negative class documents, or simply negative documents (e.g., the non-ML papers in an AAAI proceedings), since they do not belong to the positive class. Given the positive set P, PU learning aims to identify the hidden positive documents in U. The PU learning setting is illustrated in Figure 1.
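The following toy sketch makes the PU data setup concrete: a small positive set P, an unlabeled set U mixing hidden positives and hidden negatives, and a deliberately naive heuristic (word overlap with P’s vocabulary) for flagging likely negative documents in U. The heuristic and the 0.5 threshold are illustrative assumptions only; they are not any of the chapter’s algorithms (S-EM, PEBL, Roc-SVM, A-EM), which use far more principled methods:

```python
# Toy PU setup: P is labeled positive; U is unlabeled, containing
# hidden positives and hidden negatives (labels shown only in comments).
P = [  # labeled positive documents (e.g., ML papers)
    "learning algorithms neural networks training",
    "supervised learning classification feature selection",
]
U = [  # unlabeled documents
    "neural networks learning deep models",       # hidden positive
    "query ranking retrieval evaluation",         # hidden negative
    "classification training data algorithms",    # hidden positive
    "database indexing transaction processing",   # hidden negative
]

# Vocabulary of the positive set.
pos_vocab = {w for doc in P for w in doc.split()}

def overlap(doc):
    """Fraction of a document's words that also appear in P's vocabulary."""
    words = doc.split()
    return sum(w in pos_vocab for w in words) / len(words)

# Documents in U with low overlap are candidate (likely) negatives;
# the threshold 0.5 is an arbitrary illustrative choice.
likely_negative = [doc for doc in U if overlap(doc) < 0.5]
for doc in likely_negative:
    print(doc)
```

On this toy data the heuristic happens to flag exactly the two hidden negatives, which conveys the intuition behind many PU techniques: first exploit the positive set to find documents in U that are probably negative, then use them to train a conventional classifier.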