The goal of this chapter is to present an approach to mining texts through the analysis of higher-level characteristics (called “concepts”), minimizing the vocabulary problem and the effort necessary to extract useful information. Instead of applying text mining techniques to terms or keywords that label texts or are extracted from them, the discovery process works over concepts extracted from texts. Concepts represent real-world attributes (events, objects, feelings, actions, etc.) and, as seen in discourse analysis, they help to understand ideas and ideologies present in texts. A prior classification task is necessary to identify concepts inside the texts. After that, mining techniques are applied over the concepts discovered. The chapter will discuss different concept-based text mining techniques and present results from different applications.
Text mining is a useful way to examine the content of a text or a collection of texts. Many text mining approaches are based on words present in the texts or associated with them. However, such approaches are prone to the vocabulary problem. As discussed in (Chen, 1994), (Chen et al., 1996) and (Furnas, 1987), texts are written in natural language, and this may cause semantic mistakes due to synonyms (different words for the same meaning), polysemy (the same word with many meanings), lemmas (words with the same root, like the verb “to marry” and the noun “marriage”) and quasi-synonyms (words related to the same subject, object or event, like “bomb” and “terrorist attack”).
There is an approach, called concept-based, that tries to minimize such confusions. Instead of mining words, this approach examines concepts present in the texts. Concepts represent real-world phenomena (events, objects, subjects, feelings, actions, etc.) and they help to understand ideas and ideologies present in texts.
One assumption is that a concept-based approach would minimize the vocabulary problem because concepts can be expressed with different words (synonyms), as in a semantic expansion approach, and concepts can hold:
Word variations: plural, gender, verbal conjugations;
Semantic associations: such as specializations and generalizations;
Contextual information (or quasi-synonyms): for example, “bomb” and “explosion”;
Semantic information: for example, “to be” versus “not to be”.
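To make the idea concrete, the following is a minimal sketch of how a concept can group several surface forms, so that two texts with no words in common still map to the same concept. It is only an illustration under stated assumptions: the concept names, the term lists and the function `extract_concepts` are hypothetical, and a simple dictionary lookup stands in for the classification step. It covers word variations and quasi-synonyms but not negation (“to be” versus “not to be”).

```python
import re
from collections import Counter

# Hypothetical concept dictionary (an assumption for illustration):
# each concept lists word variations and quasi-synonyms that express it.
CONCEPT_TERMS = {
    "violence": {"bomb", "bombs", "explosion", "explosions", "attack", "attacks"},
    "marriage": {"marry", "marries", "married", "marriage", "marriages"},
}

# Invert the dictionary so each term points to its concept.
TERM_TO_CONCEPT = {
    term: concept
    for concept, terms in CONCEPT_TERMS.items()
    for term in terms
}


def extract_concepts(text):
    """Count the concepts expressed in a text by simple term lookup."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(TERM_TO_CONCEPT[t] for t in tokens if t in TERM_TO_CONCEPT)


doc_a = "A bomb went off near the station."
doc_b = "Two explosions were reported downtown."

concepts_a = extract_concepts(doc_a)
concepts_b = extract_concepts(doc_b)
print(concepts_a, concepts_b)
```

At the word level the two example sentences share no terms, yet both are assigned the same hypothetical concept, which is the effect the concept-based approach aims for.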
In Information Retrieval, concepts have been used successfully to index and retrieve documents. Lin and Chen (1996) comment that “the concept-based retrieval capability has been considered by many researchers and practitioners to be an effective complement to the prevailing keyword search or user browsing”.
The goal of this chapter is to present an approach to mining texts through the analysis of high-level characteristics (called “concepts”), minimizing the vocabulary problem and the effort necessary to extract useful information. Instead of applying text mining techniques to terms or keywords that label texts or are extracted from them, the discovery process works over concepts extracted from texts. A pre-processing classification step is necessary to identify concepts inside the texts. After that, mining techniques are applied over the concepts discovered.
The chapter begins by discussing related work, then presents techniques to identify concepts in texts and mining techniques applied over concepts. The chapter ends with a conclusion and a discussion of future trends.
Feldman and colleagues (Feldman & Dagan, 1995; Feldman & Hirsh, 1997; Feldman & Dagan, 1998) address the problem of applying mining tools over keywords that are assigned to texts as attributes. These mining techniques use statistical analysis to discover association rules and interesting patterns over keyword distributions and associations. To perform Knowledge Discovery in Texts (KDT), keywords must first be assigned to texts. The authors did not discuss how keywords are assigned to texts, suggesting that this process may be done manually by humans or automatically by software tools. Similarly, Lin et al. (1998) use terms automatically extracted from texts to categorize documents and to find associations. The most frequent terms are assigned as keywords (attributes).
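As an illustration of what discovering association rules over keywords can look like, the sketch below computes the standard support and confidence measures for pairwise rules of the form "keyword A implies keyword B". It is not the implementation used by the cited authors; the function name `association_rules`, the thresholds and the sample keyword sets are assumptions made for this example.

```python
from itertools import permutations


def association_rules(docs, min_support=0.5, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) for keyword pairs.

    docs is a list of keyword sets, one per text (illustrative input format).
    """
    n = len(docs)
    keywords = set().union(*docs)
    rules = []
    for a, b in permutations(sorted(keywords), 2):
        count_a = sum(1 for d in docs if a in d)
        count_ab = sum(1 for d in docs if a in d and b in d)
        support = count_ab / n           # share of texts containing both keywords
        confidence = count_ab / count_a  # share of texts with a that also contain b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Hypothetical keyword assignments for four texts.
docs = [
    {"iran", "oil", "usa"},
    {"iran", "oil"},
    {"oil", "usa"},
    {"iran", "oil", "japan"},
]

rules = association_rules(docs)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

Because the rules are computed over whatever labels are assigned to the texts, the same procedure can be run over concepts instead of keywords once the texts have been classified.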
However, analyzing terms runs into the vocabulary problem. This problem arises because the terms used by one person to describe an object, idea or situation may be different from the terms used by another person. For example, one author may describe a killing with the term “murder” while another uses “homicide”. Thus, if we perform mining or analysis based only on the terms assigned to or extracted from texts, the process may be misled by semantic gaps.