Thesaurus-Based Automatic Indexing

Luis M. de Campos

doi:10.4018/978-1-59904-990-8.ch020

Source Title: Handbook of Research on Text and Web Mining Technologies

Abstract

In this chapter, we present an application of thesauri in text mining: automatic indexing, in which documents are assigned descriptors drawn from the set defined by a thesaurus. We begin by presenting various definitions and a mathematical model of a thesaurus, and describe several examples of real-world thesauri used in official institutions. We then explore the problem of thesaurus-based automatic indexing by describing its difficulties and distinguishing features and by reviewing previous work in this area. Finally, we propose various lines of future research.

Chapter Preview

Introduction

Automated text categorization (Sebastiani, 2002) is a successful subfield of information management and has been the subject of more than six hundred research publications over more than forty-five years of work (Sebastiani & Gabrilovich, 2007). Although it is a research field in its own right (many scientists and engineers attempt to build good text classifiers using different methods), as part of text mining it can also support other ways of obtaining knowledge from documents, such as document summarization, concept or entity extraction, sentiment analysis, and document clustering. In information retrieval (van Rijsbergen, 1979), too, it is useful for documents to be categorized in advance into a list of classes so that they can be retrieved more accurately using the associated keywords as meta-information.

Text documents, like natural language in general, suffer from lexical ambiguity or polysemy (i.e. words or expressions that have more than one meaning, a special case of homonymy) and from synonymy (different terms or expressions for the same concept). Additionally, the most common representation of a text document is the bag-of-words approach (a model similar to the “first-order word approximation” used by Shannon in 1948), which reduces a document to a list of unrelated terms, usually losing the contextual meaning and structure of expressions present in the text.
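The loss of structure in a bag-of-words representation can be seen in a minimal Python sketch. The function name `bag_of_words` and the two example sentences are illustrative assumptions, not taken from the chapter, and the tokenizer is deliberately crude.

```python
from collections import Counter
import re

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into alphabetic tokens, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Two invented sentences with opposite meanings.
a = bag_of_words("The dog bites the man")
b = bag_of_words("The man bites the dog")

# Both sentences map to the same multiset of words, so this
# representation cannot tell who bit whom.
print(a == b)  # prints True
```

Word order, and with it part of the meaning, is discarded as soon as the text is reduced to term counts.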

One tool designed specifically to avoid ambiguities of any kind is a thesaurus: a set of terms with orthogonal meanings together with a set of hierarchical relationships between them. A thesaurus can be very useful in different areas of text mining (Berry, 2003) by removing ambiguity and identifying a document’s context. In this chapter, we present some thesaurus basics, a formal characterization, and several examples of real-world thesauri. We then focus on the problem of automatic indexing over a domain of categories defined by a thesaurus.
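As a rough illustration of these ideas, a thesaurus can be sketched in Python as two mappings: one from entry terms (including synonyms) to preferred descriptors, and one from each descriptor to its broader descriptor. Everything here is an assumption made for illustration: the toy vocabulary, the names `ENTRY_TERMS`, `BROADER`, `ancestors` and `index_document`, and the naive string matching, which is not the indexing method studied in the chapter.

```python
import re

# Hypothetical toy thesaurus; all terms are invented for illustration.
# Entry terms (including synonyms) map to a single preferred descriptor.
ENTRY_TERMS = {
    "car": "motor vehicle",
    "automobile": "motor vehicle",
    "motor vehicle": "motor vehicle",
    "bicycle": "bicycle",
    "bike": "bicycle",
}

# Hierarchical relationship: descriptor -> broader descriptor.
BROADER = {
    "motor vehicle": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "means of transport",
}

def ancestors(descriptor):
    """Follow broader-term links from a descriptor up to the top of the hierarchy."""
    chain = []
    while descriptor in BROADER:
        descriptor = BROADER[descriptor]
        chain.append(descriptor)
    return chain

def index_document(text):
    """Return, sorted, the descriptors whose entry terms occur in the text."""
    lowered = text.lower()
    found = set()
    for term, descriptor in ENTRY_TERMS.items():
        # Naive whole-word matching; real indexing must cope with ambiguity.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(descriptor)
    return sorted(found)

print(index_document("She sold her automobile and bought a bike."))
print(ancestors("bicycle"))
```

Mapping synonyms to one descriptor addresses synonymy, while the broader-term links give each descriptor a context in the hierarchy. Literal matching of this kind does nothing about polysemy, which is one reason the indexing problem is harder than this sketch suggests.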

Key Terms in this Chapter

Document Indexing: The act of describing a document by index terms, which serve as metadata indicating what the document is about or summarizing its content. The index terms are often selected from some form of controlled vocabulary, e.g. a thesaurus.

Text Categorization: The task of assigning an electronic document to one or more categories, based on its contents. If only one category is assigned, we refer to single-label categorization. If several categories can be assigned to the document, we are dealing with multi-label categorization.

Vector Space Model: An algebraic model for representing documents (not only text) as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing and relevancy ranking. Its first use was in the SMART Information Retrieval System.

Descriptor: A word or a list of words that represents a concept without ambiguity. It can be used to retrieve documents in an information system, for instance a catalog or a search engine.

Thesaurus: A list of all the important descriptors in a given domain of knowledge and, for each descriptor, the set of descriptors related to it.

Unsupervised Classification: A machine learning technique in which a model is fit to observations for which, unlike in supervised classification, there is no a priori known output. A data set of input objects is partitioned into different groups or clusters, so that the objects in each group share some common trait, e.g. proximity according to some defined distance measure.

Supervised Classification: A machine learning technique for creating a function from training data. The training data consist of pairs of input objects (typically vectors) and desired outputs (categories). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations in a “reasonable” way.
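Several of the terms above (vector space model, single-label text categorization, supervised classification) can be tied together in a small Python sketch. It is only an illustration under stated assumptions: the training texts, the category names, and the helper functions `vectorize`, `cosine`, `train` and `classify` are invented here, and the nearest-centroid rule is just one simple supervised method, not one attributed to the chapter.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a text as a sparse term-frequency vector (term -> count)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def train(examples):
    """Sum the vectors of each category's training texts (a centroid up to scaling)."""
    centroids = {}
    for text, category in examples:
        centroids.setdefault(category, Counter()).update(vectorize(text))
    return centroids

def classify(text, centroids):
    """Single-label categorization: pick the category with the most similar centroid."""
    vec = vectorize(text)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

# Invented training pairs of (input text, desired category).
TRAINING = [
    ("wheat harvest and crop yields", "agriculture"),
    ("irrigation of crop fields", "agriculture"),
    ("central bank raises interest rates", "finance"),
    ("bank loans and interest payments", "finance"),
]

centroids = train(TRAINING)
print(classify("interest rates on bank loans", centroids))  # prints finance
```

Because cosine similarity ignores vector length, summing the training vectors gives the same ranking as averaging them. A multi-label variant could return every category whose similarity exceeds a chosen threshold instead of only the best one.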
