XAR: An Integrated Framework for Semantic Extraction and Annotation

Naveen Ashish (University of California-Irvine, USA) and Sharad Mehrotra (University of California-Irvine, USA)
DOI: 10.4018/978-1-60566-894-9.ch011

Abstract

The authors present the XAR framework for free text information extraction and semantic annotation. The language underpinning XAR, the authors argue, allows probabilistic reasoning to be included in the rule language, provides higher-level predicates that capture text features and relationships, and defines and supports advanced features such as token consumption and stratified negation in the rule language and its semantics. The XAR framework also allows semantic information to be incorporated as integrity constraints in the extraction and annotation process. XAR aims to fill a gap, the authors claim, in Web-based information extraction systems: it provides an extraction and annotation framework that permits the integrated use of hand-crafted extraction rules, machine-learning based extractors, and semantic information about the particular domain of interest. The XAR system has been deployed in an emergency response scenario with civic agencies in North America and in a scenario with the IT department of a county-level community clinic.

Introduction

The vision of semantic interoperability on a large scale, such as that envisioned by the concept of the Semantic Web (Berners-Lee, Hendler & Lassila, 2001), continues to sustain interest and excitement. The availability of automated tools for semantic annotation of data on the open Web is recognized as critical for Semantic Web enablement. In semantic annotation we annotate significant entities and relationships in documents and pages on the Web, making them amenable to machine processing. The time and investment required to mark up and annotate Web content manually is prohibitive for all but a handful of Web content providers, which motivates the development of automated tools for this task. As an example, consider Web pages of academic researchers with their biographies in free text, as shown in Figure 1.

Figure 1. Semantic Annotation of Web Content

The annotation of significant concepts on such pages, such as a researcher’s current job title, academic degrees, alma maters, and the dates of those degrees (as shown in Figure 1), can then enable Semantic Web agents or integration applications over such data. Such annotation or mark-up tools are largely based on information extraction technology. While information extraction itself is a widely investigated area, powerful, general-purpose, and yet easy-to-use frameworks and systems for information extraction are still lacking, particularly for extraction from free text, which makes up a significant fraction of the content on the open Web. In this chapter we describe XAR, a framework and system for free text information extraction and semantic annotation. XAR powers semantic extraction and annotation by permitting the integrated use of 1) hand-crafted extraction rules, 2) existing machine-learning based extractors, and 3) semantic information about the particular domain of interest, in the form of database integrity constraints.

We have designed XAR to be an open-source framework that can be used by end-user application developers with minimal training and prior expertise, as well as by the research community as a platform for information extraction research. Over the last year we have used XAR for semantic annotation of Web documents in a variety of domains. These applications range from annotating the details of particular events in online news stories, as part of an application for internet news monitoring, to annotating free text clinical notes as part of a business intelligence application in the health-care domain. This chapter is organized as follows. In the next section we provide an overview of XAR from a user perspective, i.e., as a framework for developing extraction applications. We then present the technical details of our approach, including the XAR system architecture, algorithmic issues, and implementation details. We present experimental evaluations assessing the effectiveness of the system in a variety of domains. We also describe use case studies of application development using XAR in two different organizations. Finally, we discuss related work and provide a conclusion.

The XAR System

We first describe XAR from a user perspective, i.e., as a framework for developing extraction applications and performing annotation tasks. The extraction step in annotation is treated as one of slot-filling. For instance, in the researcher bios task, each Web page provides values for slots or attributes such as the job title, academic degrees, and dates. The two primary paradigms (Feldman et al., 2002) for automated information extraction systems are (i) using hand-crafted extraction rules, and (ii) using a machine-learning based extractor that can be trained for information extraction in a particular domain. Extraction applications in XAR are developed using either hand-crafted extraction rules (Feldman et al., 2002) or machine-learning based extractors (Kayed 2006), and in both cases are complemented with semantic information in the form of integrity constraints. We describe and illustrate each of these aspects.
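
To make the slot-filling view concrete, the following is a minimal Python sketch, not XAR's actual rule language or API. It assumes two hypothetical hand-crafted rules written as regular expressions (JOB_TITLE_RULE and DEGREE_RULE), a made-up biography string, and one illustrative integrity constraint (a PhD cannot be dated earlier than a BS). All names and the sample text are assumptions introduced for illustration only.

```python
import re

# Hypothetical sample text; not taken from the chapter or from Figure 1.
BIO = ("Jane Doe is an Associate Professor at Example University. "
       "She received her PhD in 1998 and her BS in 1992.")

# Hypothetical hand-crafted extraction rules, expressed as regular expressions.
JOB_TITLE_RULE = re.compile(r"\bis an? ((?:Assistant |Associate )?Professor)\b")
DEGREE_RULE = re.compile(r"\b(PhD|MS|BS) in (\d{4})\b")


def extract(text):
    """Fill the 'job_title' and 'degrees' slots from free text."""
    slots = {"job_title": None, "degrees": {}}
    title_match = JOB_TITLE_RULE.search(text)
    if title_match:
        slots["job_title"] = title_match.group(1)
    for degree_match in DEGREE_RULE.finditer(text):
        # Map each degree name to the year it was awarded.
        slots["degrees"][degree_match.group(1)] = int(degree_match.group(2))
    return slots


def constraint_violations(slots):
    """Check an illustrative integrity constraint over the filled slots."""
    degrees = slots["degrees"]
    violations = []
    if "PhD" in degrees and "BS" in degrees and degrees["PhD"] < degrees["BS"]:
        violations.append("PhD year precedes BS year")
    return violations


if __name__ == "__main__":
    filled = extract(BIO)
    print(filled)
    print(constraint_violations(filled))
```

In this sketch the constraint only flags inconsistent slot values; a fuller system could instead use such violations to choose among competing candidate values, or swap the regular expressions for a trained extractor while keeping the same constraint check.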
