Concept-Based Text Mining

Stanley Loh; Leandro Krug Wives; Daniel Lichtnow; José Palazzo M. de Oliveira

doi:10.4018/978-1-59904-990-8.ch021

Source Title: Handbook of Research on Text and Web Mining Technologies

DOI: 10.4018/978-1-59904-990-8.ch021

Abstract

The goal of this chapter is to present an approach to mining texts through the analysis of higher-level characteristics (called “concepts”), minimizing the vocabulary problem and the effort necessary to extract useful information. Instead of applying text mining techniques to terms or keywords that label texts or are extracted from them, the discovery process works over concepts extracted from the texts. Concepts represent real-world attributes (events, objects, feelings, actions, etc.) and, as seen in discourse analysis, they help to understand the ideas and ideologies present in texts. A prior classification task is necessary to identify concepts inside the texts. After that, mining techniques are applied over the concepts discovered. The chapter discusses different concept-based text mining techniques and presents results from different applications.

Chapter Preview

Introduction

Text mining is a useful way to examine the content of a text or a collection of texts. Many text mining approaches are based on words present in the texts or associated with them. However, such approaches are prone to the vocabulary problem. As discussed by Chen (1994), Chen et al. (1996) and Furnas (1987), texts are written in natural language, and this may cause semantic mistakes due to synonyms (different words for the same meaning), polysemy (the same word with many meanings), lemmas (words with the same root, like the verb “to marry” and the noun “marriage”) and quasi-synonyms (words related to the same subject, object or event, like “bomb” and “terrorist attack”).

The concept-based approach tries to minimize such confusion. Instead of mining words, it examines the concepts present in the texts. Concepts represent real-world phenomena (events, objects, subjects, feelings, actions, etc.) and they help to understand the ideas and ideologies present in texts.

One assumption is that a concept-based approach would minimize the vocabulary problem because concepts can be expressed with different words (synonyms), as in a semantic expansion approach, and concepts can hold:

a. Word variations: plural, gender, verbal conjugations;
b. Semantic associations: such as specializations and generalizations;
c. Contextual information (or quasi-synonyms): for example, “bomb” and “explosion”;
d. Semantic information: for example, “to be” versus “not to be”.

In Information Retrieval, concepts are used successfully to index and retrieve documents. Lin and Chen (1996) comment that “the concept-based retrieval capability has been considered by many researchers and practitioners to be an effective complement to the prevailing keyword search or user browsing”.

The goal of this chapter is to present an approach to mining texts through the analysis of high-level characteristics (called “concepts”), minimizing the vocabulary problem and the effort necessary to extract useful information. Instead of applying text mining techniques to terms or keywords that label texts or are extracted from them, the discovery process works over concepts extracted from the texts. A pre-processing classification step is necessary to identify concepts inside the texts. After that, mining techniques are applied over the concepts discovered.

The chapter begins by discussing related work, then presents techniques to identify concepts in texts and mining techniques applied over concepts. The chapter ends with a conclusion and a discussion of future trends.

Background

Feldman and colleagues (Feldman & Dagan, 1995; Feldman & Hirsh, 1997; Feldman & Dagan, 1998) address the problem of applying mining tools over keywords that are assigned to texts as attributes. These mining techniques use statistical analysis to discover association rules and interesting patterns over keyword distributions and associations. To perform the KDT (Knowledge Discovery in Texts) process, keywords must first be assigned to the texts. The authors do not discuss how keywords are assigned, suggesting that this may be done manually by humans or automatically by software tools. Similarly, Lin et al. (1998) use terms automatically extracted from texts to categorize documents and to find associations. The most frequent terms are assigned as keywords (attributes).

However, analyzing terms runs into the vocabulary problem. This problem occurs because the terms used by one person to describe an object, idea or situation may be different from the terms used by another person. For example, a murder may be described by one author with the term “murder” while another may use “homicide”. Thus, if a mining or analysis process is based only on the terms assigned to or extracted from texts, it may be misled by semantic gaps.
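The effect of the vocabulary problem on a simple frequency analysis can be sketched as follows. This is only an illustration, not the chapter's method: the concept dictionary, the concept names ("killing", "attack") and the sample texts are all hypothetical, and a plain term lookup stands in for the classification step that identifies concepts.

```python
import re
from collections import Counter

# Hypothetical concept dictionary (an assumption for illustration):
# each concept lists terms that may express it in a text.
CONCEPT_TERMS = {
    "killing": {"murder", "murders", "homicide", "homicides"},
    "attack": {"bomb", "bombs", "explosion", "explosions"},
}

# Inverted index: term -> concept.
TERM_TO_CONCEPT = {
    term: concept
    for concept, terms in CONCEPT_TERMS.items()
    for term in terms
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(texts):
    """Count how often each term occurs across all texts."""
    return Counter(tok for text in texts for tok in tokenize(text))


def concept_counts(texts):
    """Count, for each concept, the number of texts in which it appears."""
    counts = Counter()
    for text in texts:
        found = {TERM_TO_CONCEPT[t] for t in tokenize(text) if t in TERM_TO_CONCEPT}
        counts.update(found)
    return counts


texts = [
    "Police investigate a murder downtown.",
    "The homicide happened late at night.",
    "A bomb was found; the explosion was avoided.",
]

terms = term_counts(texts)
concepts = concept_counts(texts)

print(terms["murder"], terms["homicide"])  # 1 1: the same event is split across two terms
print(concepts["killing"])                 # 2: both texts are grouped under one concept
```

At the term level, "murder" and "homicide" each appear once and look unrelated; at the concept level, the two texts are counted together.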

Key Terms in this Chapter

Association Rules: Rules usually in the format X → Y, meaning that “if X is present in an object, then Y is also present in this object”.
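As a rough sketch, a rule of the form X → Y over concepts can be evaluated with the standard support and confidence measures. These measures are not named in this preview, so treating them as the evaluation criteria is an assumption; the function names and the sample documents (each represented as a set of concepts) are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    if not transactions:
        return 0.0
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Estimate of P(Y present | X present) for the rule X -> Y."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0
    return support(transactions, set(x) | set(y)) / support_x


# Hypothetical documents, each reduced to the set of concepts found in it.
docs = [
    {"attack", "killing"},
    {"attack", "killing", "politics"},
    {"attack"},
    {"politics"},
]

rule_support = support(docs, {"attack", "killing"})
rule_confidence = confidence(docs, {"attack"}, {"killing"})

print(rule_support)     # 0.5: two of the four documents contain both concepts
print(rule_confidence)  # about 0.667: two of the three "attack" documents also contain "killing"
```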

Concepts: Represent real-world phenomena (events, objects, subjects, feelings, actions, etc.) and help to understand the ideas and ideologies present in texts.

Distribution Analysis: Evaluation of the frequency of objects or attributes in a collection.

Clustering: Process that separates objects into groups (clusters) by evaluating the similarity between them. The goal is to put similar objects inside the same cluster and dissimilar ones in different clusters. The number of clusters may not be known in advance.

Vocabulary Problem: Problem generated by the use of natural languages and caused by semantic mistakes due to synonyms (different words for the same meaning), polysemy (the same word with many meanings), lemmas (words with the same root, like the verb “to marry” and the noun “marriage”) and quasi-synonyms (words related to the same subject, object or event, like “bomb” and “terrorist attack”).

Semantic Expansion: A technique that adds words to a set of words to better represent an object or meaning; it is used to restructure queries in information retrieval systems.

Concept-Based Text Mining: A new approach to text mining that applies statistical techniques over the concepts present in texts instead of over words.

Temporal Analysis: Application of mining techniques on objects or events chronologically ordered, following a time sequence.
