On Document Representation and Term Weights in Text Classification

Ying Liu
Copyright: © 2009 |Pages: 22
DOI: 10.4018/978-1-59904-990-8.ch001

Abstract

In automated text classification, a bag-of-words representation combined with tfidf weighting is the most popular approach for converting textual documents into numeric vectors for classifier induction. In this chapter, we explore the potential of enriching the document representation with semantic information systematically discovered at the sentence level. The salient semantic information is identified using a frequent word sequence method. Unlike the classic tfidf weighting scheme, the proposed probability-based term weighting scheme directly reflects a term's strength in representing a specific category. An experimental study based on the semantically enriched document representation and the newly proposed probability-based term weighting scheme shows a significant improvement over the classic approach, i.e. bag-of-words plus tfidf, in terms of F-score. This study encourages us to further investigate the applicability of the semantically enriched document representation to a wide range of text-based mining tasks.

Introduction

Text classification (TC) is the task of categorizing documents into predefined thematic categories. In particular, it aims to find a mapping ξ from a set of documents D: {d1, …, di} to a set of thematic categories C: {C1, …, Cj}, i.e. ξ: D → C. Current research, dominated by supervised learning, typically constructs a text classifier in two main phases (Debole & Sebastiani, 2003; Sebastiani, 2002):

  • 1. Document indexing: the creation of numeric representations of documents.

    • Term selection: selecting a subset of the terms occurring in the collection to represent the documents in a sensible way, either to ease computation or to achieve the best classification effectiveness.

    • Term weighting: assigning a numeric value to each term to weight its contribution to distinguishing a document from others.

  • 2. Classifier induction: the building of a classifier by learning from the numeric representations of documents.

In the literature, the most popular approach to document indexing is probably the bag-of-words (BoW) approach, in which each unique word in the document is considered the smallest unit conveying information (Manning & Schütze, 1999; Sebastiani, 2002; van Rijsbergen, 1979), followed by the tfidf weighting scheme, i.e. term frequency (tf) times inverse document frequency (idf) (Salton & Buckley, 1988; Salton & McGill, 1983). Numerous machine learning algorithms have been applied to the task of text classification. These include, but are not limited to, Naïve Bayes (Lewis, 1992a; Rennie, Shih, Teevan, & Karger, 2003a), decision trees and decision rules (Apté, Damerau, & Weiss, 1994, 1998), artificial neural networks (Ng, Goh, & Low, 1997; Ruiz & Srinivasan, 2002), k-nearest neighbor (kNN) (Baoli, Qin, & Shiwen, 2004; Han, Karypis, & Kumar, 2001; Yang & Liu, 1999), Bayesian networks (Friedman, Geiger, & Goldszmidt, 1997; Heckerman, 1997), and, more recently, support vector machines (SVM) (Burges, 1998; Joachims, 1998; Vapnik, 1999). Researchers and professionals from relevant areas, e.g. machine learning, information retrieval and natural language processing, constantly introduce new algorithms, test data sets, benchmark results, etc. A comprehensive review of text classification was given by Sebastiani (2002).
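The BoW-plus-tfidf pipeline described above can be sketched in a few lines. The sketch below uses raw term frequency and idf = log(N/df), one common variant of the scheme; real toolkits offer smoothed and normalized alternatives.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Convert tokenized documents into tfidf-weighted vectors.

    Uses raw term frequency and idf = log(N / df), one common variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = tf(t, d) * log(N / df(t)) for every vocabulary term.
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

docs = [["machine", "learning", "text"],
        ["text", "classification", "text"],
        ["supply", "chain", "management"]]
vocab, vecs = tfidf_vectors(docs)
```

A term such as "text", which occurs in two of the three documents, receives a lower idf than a term confined to a single document, so frequent-but-common terms are down-weighted exactly as the scheme intends.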

While text classification has emerged as a fast-growing area, particularly due to the adoption of machine learning algorithms, some key questions remain. Although it is generally understood that word sequences, e.g. "finite state machine", "machine learning" and "supply chain management", convey more semantic information than single terms when representing textual documents, they have seldom been explored in either text classification or clustering. In fact, there is a lack of a systematic approach for generating such high-quality word sequences automatically, and the rich semantic information residing in word sequences has been ignored. Moreover, although the tfidf weighting scheme performs well, particularly in Web-based information search and retrieval, its weights do not directly indicate the closeness of the thematic relation between a term and a specific category.
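To make the contrast concrete: a probability-based weight can tie a term directly to a category by comparing how often the term appears inside versus outside that category. The sketch below is a hypothetical illustration of this idea using Laplace-smoothed relative document frequencies; it is not the chapter's exact formula, which is developed later in the text.

```python
def category_strength(df_in_cat, n_cat, df_out_cat, n_out):
    """Illustrative probability-based weight (NOT the chapter's formula).

    Estimates how strongly a term indicates category c by comparing
    Laplace-smoothed document frequencies inside and outside c:
        P(t | c) / (P(t | c) + P(t | not c))
    Returns a value in (0, 1); 0.5 means the term is uninformative.
    """
    p_t_given_c = (df_in_cat + 1) / (n_cat + 2)        # smoothed P(t | c)
    p_t_given_not_c = (df_out_cat + 1) / (n_out + 2)   # smoothed P(t | not c)
    return p_t_given_c / (p_t_given_c + p_t_given_not_c)

# A term in 40 of 50 in-category docs but only 5 of 100 others
# scores close to 1; a term equally common everywhere scores 0.5.
strong = category_strength(40, 50, 5, 100)
neutral = category_strength(25, 50, 50, 100)
```

Unlike idf, which only measures collection-wide rarity, such a weight varies per category, which is exactly the property the chapter argues tfidf lacks.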

Key Terms in this Chapter

Document Representation: Document representation is concerned with how textual documents should be represented in various tasks, e.g. text processing, retrieval, and knowledge discovery and mining. Its prevailing approach is the vector space model, i.e. a document di is represented as a vector of term weights di = (w1i, …, w|T|i), where T is the set of terms that occur at least once in the document collection D.

Text Classification: Text classification intends to categorize documents into a series of predefined thematic categories. In particular, it aims to find the mapping ξ, from a set of documents D: {d1, …, di} to a set of thematic categories C: {C1, …, Cj}, i.e. ξ: D → C.

Term Weighting Scheme: Term weighting is the process of computing and assigning a numeric value to each term to weight its contribution to distinguishing a particular document from others. The most popular approach is the tfidf weighting scheme, i.e. term frequency (tf) times inverse document frequency (idf).
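As a worked instance of the tfidf definition above: for a term occurring 3 times in a document, in a collection of 1000 documents of which 10 contain the term, the weight under the plain tf × log(N/df) variant is:

```python
import math

# tfidf for one term in one document:
# tf = 3 occurrences, N = 1000 documents in the collection,
# df = 10 documents contain the term.
tf, N, df = 3, 1000, 10
weight = tf * math.log(N / df)  # 3 * log(100), roughly 13.8
```

The rarer the term across the collection (smaller df), the larger log(N/df) grows, so the same tf yields a higher weight.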
