The problem of learning concept hierarchies and terminological ontologies can be divided into two sub-tasks: concept extraction and relation learning. The authors of this chapter describe a novel approach for learning relations automatically from unstructured text corpora based on probabilistic topic models. The authors provide a definition (Information Theory Principle for Concept Relationship) and a quantitative measure for establishing “broader” (or “narrower”) and “related” relations between concepts. They present a relation learning algorithm that uses the learned probabilistic topic models to interconnect concepts automatically into concept hierarchies and terminological ontologies. In the experiment, around 7,000 ontology statements expressed in terms of “broader” and “related” relations are generated using different combinations of model parameters. The ontology statements are evaluated by domain experts, and the results show that the highest precision of the learned ontologies is around 86.6% and that the structures of the learned ontologies remain stable when the parameter values of the ontology learning algorithm are changed.
Introduction
Concept hierarchies used in most contemporary digital libraries are created and maintained manually. One of the most subtle problems with the manual approach is that the engineering and subsequent maintenance processes are time-consuming. Manually created hierarchies are also prone to obsolescence as research advances (e.g., through the invention of new techniques). Consequently, the emergence of new terminology in various research areas cannot be easily reflected in such static concept hierarchies. It is also difficult to construct concept hierarchies with broader and deeper coverage, which can help users with query suggestion and expansion, browsing, and navigation. Unsupervised approaches to learning knowledge are perceived as a promising way to alleviate these problems while minimising human involvement and effort.
The Semantic Web (Berners-Lee, Hendler, & Lassila, 2001) has brought more research attention to knowledge acquisition using automated approaches. A number of existing works aim to learn different types of ontologies from unstructured text corpora using techniques from Natural Language Processing (Hearst, 1992; Cimiano & Staab, 2004; Cimiano, Pivk, Schmidt-Thieme, & Staab, 2005), Information Extraction (Cunningham, 2005; Cimiano & Völker, 2005; Kiryakov, Popov, Terziev, Manov, & Ognyanoff, 2004), and Machine Learning (clustering and classification) (Maedche, Pekar, & Staab, 2002; Biemann, 2005). An important and plausible assumption is that, given a sufficient amount of text in a domain, coverage of knowledge in that domain can be ensured (Cimiano, 2006). Although learned ontologies are less accurate than those created manually, the advantages of being inexpensive, time-saving, and resistant to obsolescence make the automated approaches attractive, especially in domains where semi-structured data is not available or cannot be directly transformed into a structured form.
A concept hierarchy or topic hierarchy can be viewed as a simple form of terminological ontology. In a terminological ontology, concepts are organised not only by more general/specific relations but also by other types of relations, such as “related”, defined in the SKOS ontology model (the “related” and “broader” relations are introduced in Sections “SKOS Ontology Model” and “Information Theory Principle for Concept Relationship”). In this chapter we explore a novel approach for learning terminological ontologies with respect to the SKOS model using Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003; Steyvers & Griffiths, 2005). The main objective is to establish “broader” and “related” relations between concepts using an unsupervised approach. The learned ontologies can be used to extend and expand existing topic hierarchies deployed in digital libraries and search engines, for purposes such as facilitating search, browsing, query suggestion, and document annotation.
The rest of the chapter is organised as follows. Section “Related Work” provides a short introduction to ontology categorisation and ontology learning tasks, and gives an overview of existing methods for ontology learning from unstructured text. Section “Introduction to Latent Dirichlet Allocation” presents background on Latent Dirichlet Allocation, in particular the generative process, model representation, and parameter estimation of the model. Section “Learning Relations in Terminological Ontologies” elaborates our approach for learning “broader” and “related” relations to construct terminological ontologies, focusing on the definition of the concept relationship principle and an iterative algorithm for organising concepts into ontologies with the SKOS relations. Section “Experiment” describes the experiment conducted on a dataset consisting of abstracts of publications in the Semantic Web research area; using the ontology learning algorithm with different parameters, around 7,000 ontology statements are generated. The evaluation results in terms of recall, precision, and F1 measures are reported in Section “Evaluation”. Section “Discussion and Future Work” concludes the chapter and describes future work.
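To make the notion of an ontology statement and its evaluation more concrete, the following is a minimal Python sketch and not the implementation used in this chapter. It assumes a statement can be represented as a (subject, relation, object) triple with the relation being either “broader” or “related”, and that precision, recall, and F1 take their standard set-based definitions. The `Statement` type, the function names, and the concept labels are hypothetical and chosen only for illustration.

```python
from typing import Iterable, NamedTuple


class Statement(NamedTuple):
    """Hypothetical container for one learned ontology statement."""
    subject: str
    relation: str  # assumed to be "broader" or "related"
    obj: str


def precision(learned: Iterable[Statement], correct: Iterable[Statement]) -> float:
    """Fraction of learned statements that are in the set judged correct."""
    learned_set = set(learned)
    if not learned_set:
        return 0.0
    return len(learned_set & set(correct)) / len(learned_set)


def recall(learned: Iterable[Statement], reference: Iterable[Statement]) -> float:
    """Fraction of reference statements that were learned."""
    reference_set = set(reference)
    if not reference_set:
        return 0.0
    return len(set(learned) & reference_set) / len(reference_set)


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


# Illustrative data only: these labels are made up, not taken from the experiment.
# Following SKOS, (A, "broader", B) is read as "B is a broader concept than A".
learned = [
    Statement("ontology learning", "broader", "knowledge acquisition"),
    Statement("ontology learning", "related", "information extraction"),
    Statement("topic model", "broader", "ontology learning"),
]
judged_correct = learned[:2]  # pretend the experts accepted the first two

p = precision(learned, judged_correct)
r = recall(learned, judged_correct)
print(f"precision={p:.3f} recall={r:.3f} F1={f1(p, r):.3f}")
```

Treating statements as hashable triples keeps the measures as plain set operations. How the reference set for recall is constructed in the actual evaluation is described in Section “Evaluation”; here the expert-accepted statements stand in for it purely as a placeholder.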