Concept-Based Text Mining

Stanley Loh (Lutheran University of Brazil, Brazil), Leandro Krug Wives (Federal University of Rio Grande do Sul, Brazil), Daniel Lichtnow (Catholic University of Pelotas, Brazil) and José Palazzo M. de Oliveira (Federal University of Rio Grande do Sul, Brazil)
Copyright: © 2009 | Pages: 13
DOI: 10.4018/978-1-59904-990-8.ch021

Abstract

The goal of this chapter is to present an approach to mining texts through the analysis of higher-level characteristics (called “concepts”), minimizing the vocabulary problem and the effort necessary to extract useful information. Instead of applying text mining techniques to terms or keywords that label or are extracted from texts, the discovery process works on concepts extracted from texts. Concepts represent real-world attributes (events, objects, feelings, actions, etc.) and, as seen in discourse analysis, they help to understand the ideas and ideologies present in texts. A prior classification task is necessary to identify concepts inside the texts. After that, mining techniques are applied to the concepts discovered. The chapter discusses different concept-based text mining techniques and presents results from different applications.

Introduction

Text mining is a useful way to examine the content of a text or a collection of texts. Many text mining approaches are based on words present in the texts or associated with them. However, such approaches are prone to the vocabulary problem. As discussed in (Chen, 1994), (Chen et al., 1996) and (Furnas, 1987), texts are written in natural language, and this may cause semantic mistakes due to synonyms (different words for the same meaning), polysemy (the same word with many meanings), lemmas (words with the same root, like the verb “to marry” and the noun “marriage”) and quasi-synonyms (words related to the same subject, object or event, like “bomb” and “terrorist attack”).

The concept-based approach tries to minimize such confusion. Instead of mining words, it examines the concepts present in the texts. Concepts represent real-world phenomena (events, objects, subjects, feelings, actions, etc.) and they help to understand the ideas and ideologies present in texts.

One assumption is that a concept-based approach would minimize the vocabulary problem because concepts can be expressed with different words (synonyms), as in a semantic expansion approach, and concepts can hold:

  a. Word variations: plural, gender, verbal conjugations;

  b. Semantic associations: such as specializations and generalizations;

  c. Contextual information (or quasi-synonyms): for example, “bomb” and “explosion”;

  d. Semantic information: for example, “to be” versus “not to be”.
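As a minimal sketch of how a concept can gather several surface forms, the Python fragment below maps cue terms to concepts and reports which concepts a text expresses. The concept names, the cue lists and the function names are illustrative assumptions, not the definitions used by the authors, and the plural handling is a deliberately naive stand-in for real word-variation analysis.

```python
import re

# Hypothetical concept definitions: each concept lists the cue terms that signal it.
CONCEPTS = {
    "violence": {"bomb", "explosion", "terrorist attack"},
    "marriage": {"marry", "married", "marriage"},
}

def normalize(text):
    """Lowercase the text and join its alphabetic words with single spaces."""
    return " ".join(re.findall(r"[a-z]+", text.lower()))

def identify_concepts(text, concepts=CONCEPTS):
    """Return the set of concepts whose cue terms occur in the text."""
    padded = f" {normalize(text)} "
    found = set()
    for concept, cues in concepts.items():
        for cue in cues:
            # Match whole words or phrases; also accept a trailing plural "s".
            if f" {cue} " in padded or f" {cue}s " in padded:
                found.add(concept)
                break
    return found
```

With these assumed cue lists, a text mentioning “bombs” and another mentioning “an explosion” are both assigned the same concept, even though they share no term.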

In Information Retrieval, concepts have been used successfully to index and retrieve documents. Lin and Chen (1996) comment that “the concept-based retrieval capability has been considered by many researchers and practitioners to be an effective complement to the prevailing keyword search or user browsing”.

The goal of this chapter is to present an approach to mining texts through the analysis of high-level characteristics (called “concepts”), minimizing the vocabulary problem and the effort necessary to extract useful information. Instead of applying text mining techniques to terms or keywords that label or are extracted from texts, the discovery process works on concepts extracted from texts. A pre-processing classification step is necessary to identify concepts inside the texts. After that, mining techniques are applied to the concepts discovered.
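The second step of this process can be illustrated with a small Python sketch. It assumes the classification step has already produced one set of concepts per text, and then applies a simple distribution analysis (the share of texts in which each concept appears). The function name and the sample concept sets are hypothetical.

```python
from collections import Counter

def concept_distribution(doc_concepts):
    """Return, for each concept, the fraction of documents that contain it."""
    counts = Counter()
    for concepts in doc_concepts:
        counts.update(set(concepts))  # count each concept once per document
    total = len(doc_concepts)
    return {c: n / total for c, n in counts.items()} if total else {}

# Hypothetical output of the classification step: one concept set per text.
docs = [
    {"violence", "politics"},
    {"violence"},
    {"economy"},
    {"politics", "economy"},
]
dist = concept_distribution(docs)
```

Other mining techniques could replace the distribution step without changing the first one, since they only need the concept sets as input.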

The chapter begins by discussing related work, then presents techniques to identify concepts in texts and mining techniques applied to concepts. The chapter ends with a conclusion and a discussion of future trends.

Background

Feldman and colleagues (Feldman & Dagan, 1995; Feldman & Hirsh, 1997; Feldman & Dagan, 1998) address the problem of applying mining tools to keywords that are assigned to texts as attributes. These mining techniques use statistical analysis to discover association rules and interesting patterns in keyword distributions and associations. To perform the KDT (Knowledge Discovery in Texts) process, keywords must be assigned to texts beforehand. The authors do not discuss how keywords are assigned to texts, suggesting that this may be done manually by humans or automatically by software tools. Similarly, Lin et al. (1998) use terms automatically extracted from texts to categorize documents and to find associations. The most frequent terms are assigned as keywords (attributes).
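To make the idea of association rules over keywords concrete, the following Python sketch computes the usual support and confidence measures for rules between single keywords. It is only an illustration under assumed thresholds and sample keyword sets, not the algorithm used in the works cited above.

```python
from itertools import permutations

def association_rules(keyword_sets, min_support=0.5, min_confidence=0.7):
    """Find rules X -> Y between single keywords, with support and confidence."""
    n = len(keyword_sets)
    items = sorted({k for s in keyword_sets for k in s})
    rules = []
    for x, y in permutations(items, 2):
        count_x = sum(1 for s in keyword_sets if x in s)
        count_xy = sum(1 for s in keyword_sets if x in s and y in s)
        support = count_xy / n            # share of texts containing both X and Y
        confidence = count_xy / count_x   # share of texts with X that also contain Y
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return rules

# Hypothetical keyword assignments, one set per text.
texts = [
    {"iran", "oil"},
    {"iran", "oil", "embargo"},
    {"oil"},
    {"iran", "oil"},
]
rules = association_rules(texts)
```

In this sample, the rule “iran → oil” holds in every text that mentions “iran”, while rules involving “embargo” fall below the assumed support threshold.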

However, analyzing terms runs into the vocabulary problem. This problem arises because the terms used by one person to describe an object, idea or situation may be different from the terms used by another person. For example, a murder may be described by one author with the term “murder”, while another may use “homicide”. Thus, if a mining or analysis process is based only on the terms assigned to or extracted from texts, it may be misled by semantic gaps.
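The effect of such a semantic gap on a simple count can be sketched in Python as follows. The mapping from terms to a shared label and the sample texts are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical mapping from surface terms to a shared concept label.
TERM_TO_CONCEPT = {"murder": "killing", "homicide": "killing"}

def document_frequency(docs, mapping=None):
    """Count in how many documents each term (or mapped concept) occurs."""
    counts = Counter()
    for terms in docs:
        # Replace a term by its concept label when a mapping is given.
        labels = {mapping.get(t, t) if mapping else t for t in terms}
        counts.update(labels)
    return counts

docs = [["murder", "police"], ["homicide", "police"], ["homicide", "trial"]]
by_term = document_frequency(docs)
by_concept = document_frequency(docs, TERM_TO_CONCEPT)
```

Counted by term, the same event type is split between “murder” (one text) and “homicide” (two texts); counted by the assumed concept label, all three texts are grouped together.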

Key Terms in this Chapter

Association Rules: Rules usually in the format X → Y, meaning that “if X is present in an object, then Y is also present in this object”.

Concepts: Representations of real-world phenomena (events, objects, subjects, feelings, actions, etc.) that help to understand the ideas and ideologies present in texts.

Distribution Analysis: Evaluation of the frequency of objects or attributes in a collection.

Clustering: Process that separates objects into groups (clusters) by evaluating the similarity between them. The goal is to put similar objects inside the same cluster and dissimilar ones in different clusters. The number of clusters may not be known beforehand.

Vocabulary Problem: Problem generated by the use of natural languages and caused by semantic mistakes due to synonyms (different words for the same meaning), polysemy (the same word with many meanings), lemmas (words with the same root, like the verb “to marry” and the noun “marriage”) and quasi-synonyms (words related to the same subject, object or event, like “bomb” and “terrorist attack”).

Semantic Expansion: A technique that adds words to a set of words to better represent an object or meaning; it is used to restructure a query in information retrieval systems.

Concept-Based Text Mining: A new approach to text mining that applies statistical techniques to concepts present in texts rather than to words.

Temporal Analysis: Application of mining techniques to objects or events ordered chronologically, following a time sequence.
