Improving Techniques for Naïve Bayes Text Classifiers

Han-joon Kim
Copyright © 2009 | Pages: 17
DOI: 10.4018/978-1-59904-990-8.ch007

Abstract

This chapter introduces two practical techniques for improving Naïve Bayes text classifiers. Naïve Bayes is regarded as a practical text classification algorithm because of its simple classification model, reasonable classification accuracy, and easily updated model. Researchers therefore have a strong incentive to improve Naïve Bayes by combining it with meta-learning approaches such as EM (Expectation Maximization) and Boosting. The EM approach combines Naïve Bayes with the EM algorithm, while the Boosting approach uses Naïve Bayes as the base classifier in the AdaBoost algorithm. Both approaches rely on an uncertainty measure suited to Naïve Bayes learning. Within the Naïve Bayes learning framework, these approaches are expected to be practical solutions to the lack of training documents in text classification systems.
Chapter Preview

Introduction

Classifiers based on Naïve Bayes learning are simple yet surprisingly accurate, and have therefore been used in many machine-learning classification projects (A-Engelson & Dagan, 1999; Katakis, Tsoumakas & Vlahavas, 2006; Nigam, McCallum, Thrun & Mitchell, 1998). In particular, Naïve Bayes is remarkably successful for text classification problems, despite the fact that text data generally has a large feature space (Katakis, Tsoumakas & Vlahavas, 2006; Nigam, McCallum, Thrun & Mitchell, 1998). Compared to other learning methods, Naïve Bayes has a number of features that serve text classification systems well:

  • Learning a Naïve Bayes classifier demands nothing beyond feature statistics, and no further complex generalization process is required, unlike other machine learning methods such as support vector machines.

  • Because of its simplicity, the classification model for a given category is very easy to update incrementally. When new training documents arrive, the feature statistics are updated and feature evaluation can be recalculated immediately without re-processing past data (see the sketch after this list). This characteristic is essential when the document collection evolves rapidly.

  • Since a classification model can be built in a single pass over the documents, Naïve Bayes learning is faster than that of other methods.

  • Naïve Bayes learning can easily accommodate the degree of importance of features occurring in documents; for example, one may double or triple the frequency of terms occurring in the titles of news articles or in the title tags of HTML documents.
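
As a concrete illustration of these points, the following is a minimal sketch of a count-based multinomial Naïve Bayes classifier in Python. The class name, the per-document weight parameter, and the Laplace smoothing are illustrative assumptions rather than details from the chapter; the point is only that training reduces to accumulating counts, so new documents can be folded in, and title terms weighted more heavily, without re-processing past data.

```python
from collections import defaultdict
import math

class IncrementalNaiveBayes:
    """Multinomial Naive Bayes kept as raw counts, so new documents can be
    folded in without re-processing past data (illustrative sketch; names
    and smoothing choices are assumptions, not the chapter's)."""

    def __init__(self):
        self.class_doc_counts = defaultdict(int)                   # N(c)
        self.term_counts = defaultdict(lambda: defaultdict(int))   # N(w, c)
        self.class_totals = defaultdict(int)                       # sum_w N(w, c)
        self.vocabulary = set()

    def update(self, documents):
        """Single pass over new (tokens, label, weight) triples; the weight
        lets title terms, say, count double or triple."""
        for tokens, label, weight in documents:
            self.class_doc_counts[label] += 1
            for token in tokens:
                self.term_counts[label][token] += weight
                self.class_totals[label] += weight
                self.vocabulary.add(token)

    def predict(self, tokens):
        """Return argmax_c  log P(c) + sum_w log P(w|c), with Laplace smoothing."""
        total_docs = sum(self.class_doc_counts.values())
        v = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_doc_counts.items():
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log(
                    (self.term_counts[label][token] + 1) /
                    (self.class_totals[label] + v)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Folding in a later batch of training documents is then simply another call to update, which is what makes this kind of classifier attractive for evolving document collections.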

Because of these advantages, there has been a strong incentive to improve the Naïve Bayes text classifier by combining it with other learning techniques (Kim & Kim, 2004; Kim, Rim, Yook & Lim, 2002; Nigam, McCallum, Thrun & Mitchell, 1998). To improve classification performance, we first need to address the lack of training examples in real-world environments. Most machine learning methods, including Naïve Bayes, assume that good-quality training documents are available, but this assumption rarely holds in operational environments. Rather than trying to prepare a complete training set at one time, we must assume that new training documents arrive continuously as a data stream, so the current classification model must be updated whenever a new set of training examples becomes available. How to obtain training examples has therefore become an important issue in the practical development of a text classifier. One proposed solution is active learning, in which the learner actively chooses training documents from a pool of unlabeled documents (A-Engelson & Dagan, 1999); however, active learning incurs the additional cost of having human experts label unknown documents. In this chapter, I introduce two approaches to improving Naïve Bayes with an incomplete training set: the EM (Expectation Maximization) approach and the Boosting approach.
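
To make the EM idea concrete, here is a rough sketch of the kind of loop described by Nigam, McCallum, Thrun & Mitchell (1998), reusing the classifier sketch above. It uses hard label assignments in the E-step and omits the uncertainty measure discussed in the chapter, so it should be read as an outline of the general approach rather than the chapter's algorithm; the function and argument names are hypothetical.

```python
def em_naive_bayes(labeled_docs, unlabeled_docs, n_iterations=10):
    """Semi-supervised Naive Bayes in the spirit of Nigam et al. (1998):
    train on the labeled data, then alternate between (E) labeling the
    unlabeled pool with the current model and (M) retraining on the union.
    Hard-assignment sketch with hypothetical names, not the chapter's
    exact procedure.

    labeled_docs   : list of (tokens, label) pairs
    unlabeled_docs : list of token lists
    """
    model = IncrementalNaiveBayes()
    model.update([(tokens, label, 1) for tokens, label in labeled_docs])

    for _ in range(n_iterations):
        # E-step: assign provisional labels to the unlabeled documents.
        provisional = [(tokens, model.predict(tokens), 1) for tokens in unlabeled_docs]
        # M-step: retrain from scratch on labeled + provisionally labeled data.
        model = IncrementalNaiveBayes()
        model.update([(tokens, label, 1) for tokens, label in labeled_docs])
        model.update(provisional)
    return model
```

An analogous sketch for the Boosting approach would wrap the same classifier as the weak learner inside AdaBoost's re-weighting loop.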

Key Terms in this Chapter

Classification Uncertainty: The degree of uncertainty in the classification of an example with respect to the current model derived from the given training examples.

AdaBoost: A boosting algorithm that builds each subsequent classifier by tweaking the training distribution in favor of instances misclassified by previous classifiers.

Selective Sampling: An active learning method that selectively chooses a set of candidate training data from unlabeled data.

Boosting: A meta-learning algorithm for supervised learning that creates a single strong learner from a set of weak learners.

Naïve Bayes: A simple probabilistic classifier based on applying Bayes’ theorem with strong (naïve) independence assumptions.

Text Classification: The task of automatically assigning a set of text documents to a set of predefined classes.

EM Algorithm: An iterative method for finding maximum likelihood estimates in problems with incomplete (or unlabeled) data.
