Topic Extraction for Ontology Learning

Marian-Andrei RIZOIU, Julien VELCIN
DOI: 10.4018/978-1-60960-625-1.ch003

Abstract

This chapter addresses the issue of topic extraction from text corpora for ontology learning. The first part provides an overview of some of the most significant solutions found in the literature today. These solutions deal mainly with the lower layers of the Ontology Learning Layer Cake and relate to the challenges of the Terms and Synonyms layers. The second part shows how these pieces can be bound together into an integrated system for extracting meaningful topics. While the extracted topics are not yet proper concepts, they constitute a convincing approach towards concept building and therefore ontology learning. The chapter concludes by discussing the research undertaken to fill the gap between topics and concepts, as well as the perspectives emerging today in the area of topic extraction.

Introduction

Recent years have seen an increased interest in research on ontology learning, especially from natural language texts. Special attention has been given to texts found on the Web, as they have specific features that we will present later in this chapter. Ontologies can be seen as collections of concepts linked together through relations. Therefore, ontology learning is closely connected to concept learning. Buitelaar, Cimiano, and Magnini (2005) divide the process of ontology learning into a chain of phases, the output of each phase being the input of the following one. An analysis of the state of the art in ontology learning at each of these phases can be found in Cimiano, Völker, and Studer (2006).

In order to place topic extraction in the context of the ontology learning process, we take the reader through a descending overview of the lower layers of the Ontology Learning Layer Cake (Buitelaar et al., 2005), highlighting the challenges at each step. Since ontologies are dynamic and keep evolving, mainly by refining concepts or replacing old concepts with new ones, special attention must be paid to the “concept” layer. Therefore, automated ontology learning is closely connected to concept learning. As shown in Cimiano et al. (2006), the main approach toward learning concepts and their taxonomy (the hierarchical relations between concepts) is conceptual clustering (Michalski and Stepp, 1983), an unsupervised machine learning technique closely connected to unsupervised hierarchical clustering. This approach generally outputs a concept tree in which each level is more specific than the previous one. At each level, clustering algorithms partition the collection of terms around each concept, yielding partitions of different granularity: larger clusters near the root and smaller ones toward the leaves. Examples of algorithms developed for this purpose are the well-known COBWEB (Fisher, 1987) and the more recent WebDCC (Godoy and Amandi, 2006). While this approach is promising and has shown good results, the resulting hierarchy is still very noisy and depends on both the quality of the extracted terms and their frequency in the text collection. Therefore, researchers have tried to improve the quality by allowing an expert to validate and guide the process. Others have turned to semi-supervised learning techniques, making the algorithm aware of external information,
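The idea of partitions at several granularity levels can be illustrated with a minimal sketch. It uses plain single-link agglomerative clustering over cosine distance, which is neither COBWEB nor WebDCC; the terms, their context-count vectors and the function names are invented for illustration only.

```python
# Illustrative sketch only: generic single-link agglomerative clustering,
# not COBWEB or WebDCC. All terms and counts below are invented toy data.
from itertools import combinations
from math import sqrt

# Hypothetical term -> co-occurrence counts with three context words.
TERM_VECTORS = {
    "cat": (5.0, 1.0, 0.0),
    "dog": (4.0, 3.0, 0.0),
    "car": (0.0, 1.0, 5.0),
    "bus": (0.0, 2.0, 4.0),
}


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def cluster_distance(c1, c2):
    """Single link: the smallest distance between any two members."""
    return min(
        cosine_distance(TERM_VECTORS[a], TERM_VECTORS[b]) for a in c1 for b in c2
    )


def agglomerate(terms):
    """Return every partition, from one term per cluster up to a single root."""
    clusters = [frozenset([t]) for t in sorted(terms)]
    levels = [list(clusters)]
    while len(clusters) > 1:
        # Merge the two closest clusters into one coarser cluster.
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        levels.append(list(clusters))
    return levels


levels = agglomerate(TERM_VECTORS)
for partition in reversed(levels):  # root first, leaves last
    print(sorted(sorted(c) for c in partition))
```

Reading the printed partitions from top to bottom gives the tree described above: one large cluster at the root, then two clusters of related terms, down to single terms at the leaves.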

Given these preliminary observations about the dependency of the upper layers of the cake on the quality of terms, we descend another step into the ontology layer cake. At the terms and synonyms layers, new challenges arise, such as extracting pertinent, non-ambiguous terms and dealing with disambiguation. The term extraction literature proposes solutions, among which we mention some recent ones such as Wong, Liu, and Bennamoun (2009) and Wong, Liu, and Bennamoun (2008). The purpose of the lower layers of the cake is to extract terms, to group synonyms under the same concept and, finally, to define the concepts, both in intension and in extension.

There are other approaches that pass through topics on the way towards concepts. Just like the latter (see the concept definition in Buitelaar et al. (2005)), the definition of a topic is controversial. While some researchers consider a topic to be just a cluster of documents that share a theme, others consider topics to be an abstraction of the grouped texts that needs a linguistic materialisation: a word, a phrase or a sentence that summarises the idea emerging from the texts. Figure 1 presents an example of the topics that can be extracted from text. More details about the experiments made with this system will be presented later, in section “Combining the two phases into an integrated system for extracting topics”.

Figure 1. Example of output of the topic extraction system
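
The two views of a topic mentioned above (a cluster of documents versus a word that summarises it) can be contrasted in a small sketch. It assumes the documents are already tokenised and clustered, and it picks one label word per cluster with a simple frequency-based score. The data, the scoring rule and the name label_topics are hypothetical and are not taken from the system described in this chapter.

```python
# Illustrative sketch only: the toy documents, the pre-assigned clusters and
# the scoring rule are assumptions, not the chapter's topic extraction system.
from collections import Counter

# Hypothetical clusters of already tokenised documents.
CLUSTERS = {
    0: [["ontology", "concept", "learning"], ["concept", "taxonomy", "learning"]],
    1: [["football", "match", "goal"], ["goal", "league", "learning"]],
}


def label_topics(clusters):
    """Choose one label word per cluster of documents."""
    # Term counts over the whole collection.
    overall = Counter(t for docs in clusters.values() for doc in docs for t in doc)
    labels = {}
    for cluster_id, docs in clusters.items():
        local = Counter(t for doc in docs for t in doc)
        # Score = in-cluster count times the share of the term's occurrences
        # that fall inside this cluster; sorting makes tie-breaking stable.
        labels[cluster_id] = max(
            sorted(local), key=lambda t: local[t] * local[t] / overall[t]
        )
    return labels


labels = label_topics(CLUSTERS)
print(labels)
```

A word such as "learning", which is spread over both clusters, scores lower than a word concentrated in one cluster, so it is not chosen as a label even though it is the most frequent word overall.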
