By providing interoperability and shared meaning across actors and domains, lightweight domain ontologies are a cornerstone technology of the Semantic Web. This chapter investigates evidence sources for ontology learning and describes a generic and extensible approach to ontology learning that combines such evidence sources to extract domain concepts, identify relations between the ontology’s concepts, and detect relation labels automatically. An implementation illustrates the presented ontology learning and relation labeling framework and serves as the basis for discussing possible pitfalls in ontology learning. Afterwards, three use cases demonstrate the usefulness of the presented framework and its application to real-world problems.
1 Introduction
Ontologies, which are commonly defined as explicit specifications of shared conceptualizations (Gruber, 1995), provide a reusable domain model with applications in knowledge engineering, natural language processing, e-commerce, intelligent information integration, bioinformatics, and related areas. Not all ontologies share the same degree of formal explicitness (Corcho, 2006), nor do they include all the components that can be expressed in a formal language, such as concept taxonomies and various types of formal axioms. Ontology research therefore distinguishes between lightweight and heavyweight ontologies (Studer et al., 1998). The manual creation of such conceptualizations for non-trivial domains is an expensive and cumbersome task which requires highly specialized human effort (Cimiano, 2006). Furthermore, as domains evolve, domain ontologies require constant refinement to remain useful.
Automated approaches to learning ontologies from existing data are intended to improve the productivity of ontology engineers. Buitelaar et al. (2005) organize the tasks in ontology learning into a set of layers. Ontology learning from text requires lexical entries to link single words or phrases to concepts. Synonym extraction helps to connect similar terms to a concept. Taxonomies provide the ontology's backbone, while non-taxonomic relations supply arbitrary links between the concepts. Finally, axioms are defined or acquired to derive additional facts.
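The layered outputs described above can be sketched as simple data structures. The following is an illustrative Python sketch; the example domain terms are hypothetical and not taken from the chapter:

```python
# Sketch of the ontology learning "layer cake" (Buitelaar et al., 2005).
# Each layer builds on the one below; all example terms are hypothetical.

terms = ["river", "stream", "lake", "water body"]           # lexical entries
synonyms = {"river": {"stream"}}                            # synonym layer
concepts = {                                                # concept layer:
    "River": {"river", "stream"},                           # concepts grouping
    "Lake": {"lake"},                                       # their lexical entries
    "WaterBody": {"water body"},
}
taxonomy = [("River", "WaterBody"), ("Lake", "WaterBody")]  # is-a backbone
relations = [("River", "flowsInto", "Lake")]                # non-taxonomic links
axioms = ["disjoint(River, Lake)"]                          # axiom layer
```

Each layer narrows the gap between raw text and a formal domain model: lexical entries and synonyms ground concepts in language, the taxonomy supplies the backbone, and relations and axioms add the remaining structure.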
Data sources for ontology learning typically include unstructured, semi-structured and structured data (Cimiano, 2006). Ontology learning from structured data consumes information sources such as database schemas or existing ontologies. This process is also called lifting, as it lifts or maps parts of existing schemas to new logical definitions. Since most available data appears in unstructured or semi-structured form, a major research focus over the last two decades has been the extraction of domain models from natural language text using a variety of methods. Cimiano (2006) presented an extensive overview of ontology learning methods from unstructured data. Many of these methods rely on corpus statistics, such as association rule mining (Maedche et al., 2002), co-occurrence analysis for term clustering (Wong et al., 2007), latent semantic analysis for detecting synonyms and concepts (Landauer & Dumais, 1997), and kernel methods for classifying semantic relations (Giuliano et al., 2007). Many corpus-based approaches build on Harris' distributional hypothesis (Harris, 1968), which states that terms or words are similar to the extent that they occur in syntactically similar contexts. Besides corpus statistics, researchers also apply linguistic parsing and linguistic patterns in ontology learning, building on the seminal work of Hearst (1992). Such patterns support taxonomy extraction (Liu et al., 2005), the detection of concepts and labeled relations in combination with Web statistics (Sánchez-Alonso & García, 2006), and Web-scale extraction of unnamed relations (Etzioni et al., 2008).
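To illustrate the pattern-based approach, the following is a minimal sketch of Hearst-style lexico-syntactic matching for the "NP such as NP, NP and NP" pattern. The regular expression, helper function, and example sentence are illustrative simplifications; a real system would use part-of-speech tagging and noun phrase chunking rather than plain regular expressions:

```python
import re

# Simplified Hearst pattern: a one-or-two-word candidate hypernym,
# followed by "such as" and an enumeration of candidate hyponyms.
PATTERN = re.compile(r"(\w+(?: \w+)?) such as ((?:\w+(?:, | and | or )?)+)")

def extract_hyponyms(text):
    """Return (hypernym, [hyponyms]) pairs found via the 'such as' pattern.

    Illustrative sketch only: real implementations operate on chunked
    noun phrases, not raw word sequences.
    """
    results = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        hyponyms = [h for h in re.split(r",\s*|\s+and\s+|\s+or\s+", m.group(2)) if h]
        results.append((hypernym, hyponyms))
    return results

pairs = extract_hyponyms(
    "The zoo keeps large mammals such as elephants, rhinos and hippos.")
# yields the hypernym "large mammals" with hyponyms elephants, rhinos, hippos
```

Each extracted pair is a candidate is-a link for the taxonomy layer; in practice such candidates are filtered using corpus or Web statistics before being added to the ontology.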