Semantic Document Networks to Support Concept Retrieval

Simon Boese (University of Hamburg, Germany), Torsten Reiners (Curtin University, Australia & University of Hamburg, Germany) and Lincoln C. Wood (University of Otago, New Zealand)
Copyright: © 2014 | Pages: 12
DOI: 10.4018/978-1-4666-5202-6.ch192

Abstract

Many disciplines produce unstructured documents that must be (pre-)processed before they can be integrated into and used by IT systems. The predominance of the Internet and large corporate databases means that large volumes of documents need to be analysed and searched to retrieve information, particularly within the fields of machine translation, text analysis, semantic mining, and information extraction and retrieval. We explicate a framework based on concept-based indexing that supports the analysis, storage, and retrieval of documents. Natural-language reduction is used to calculate semantic cores for concept-based indexing of the concepts found within documents. The processed documents are stored within a semantic network, enabling effective analysis of core concepts within documents and rapid retrieval of specific ideas from multiple documents based on provided concepts.

Introduction

This chapter focuses on a framework to support advanced document storage and fast queries to retrieve documents based on concept-focused searches. These searches are ‘semantic’: they evaluate and use the meanings of words and phrases, rather than matching keywords. The framework rests on three stages: pre-processing (semantic analysis influences the storage quality within a semantic database), conceptualization (extraction of key concepts to establish document networks), and storage within a semantic database, facilitating advanced future retrieval. The objective is to decompose documents and extract all relevant information about structure and content to allow comprehensive storage in a semantic document network, including the interpretation according to domains, contexts, languages, or readers. For example, the word ‘trunk’ may refer to a storage area (in the context of motor vehicles), a clothes storage box (in the context of travelling), or an elephant’s appendage (in the context of a safari); see Figure 1. The arrows in the figure represent parameters associated with relations. The related words can themselves have multiple meanings, and it is only the clustering of words that provides the context readers need to determine the meaning; e.g., Safari is also the name of an Internet browser.

Figure 1. Evaluation of the meaning of 'trunk' based on the context. This supports semantic-based retrieval of documents rather than merely keyword-based retrieval [Source: Boese, Reiners, and Wood (2012, p. 5)].
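
The context-dependent reading of ‘trunk’ can be illustrated with a minimal sketch. It is not the chapter's method: it simply assumes that each sense is described by a small, hand-picked set of cue words and chooses the sense whose cues overlap most with the words surrounding the term. The `SENSES` inventory, the cue words, and the function names are all hypothetical.

```python
import re

# Hypothetical sense inventory: each sense of 'trunk' is described by a few
# context cue words. The cue words are illustrative assumptions only.
SENSES = {
    "car storage area": {"car", "vehicle", "boot", "repair", "tyre"},
    "travel box": {"travel", "luggage", "clothes", "packing", "journey"},
    "elephant appendage": {"elephant", "safari", "animal", "water", "savannah"},
}


def tokenize(text):
    """Lowercase the text and return its alphabetic words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(text, senses=SENSES):
    """Return the sense whose cue words overlap most with the text.

    Returns None when no cue word of any sense occurs in the text.
    """
    words = tokenize(text)
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate("The elephant sprayed water with its trunk on safari."))
    print(disambiguate("He opened the trunk of the car to fetch the spare tyre."))
```

A simple overlap count is used here only because it makes the role of word clustering visible; a single cue word such as ‘safari’ can still mislead, which is the ambiguity the text points out.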

A brief introduction to conceptualization and the semantic document network provides an overview of how information can be stored in an interlinked network. Using a short sample, we demonstrate the calculation of the semantic core using concept-based indexing and how the concepts are embedded within the existing semantic document network.

Background

Organizations face increasingly significant document management challenges as they seek to leverage vast volumes of internally-focused documents (e.g., emails or internal reports) or provide document-based services to others. The challenge is to design document management systems that support the storage and retrieval of unstructured electronic documents; in contrast, there are well-established document management methods for structured documents, such as those used by libraries. Limited meta-information (particularly key terms) has historically been used to support simple indexing and classification procedures. However, the rise of user-generated content within Web 2.0 and the ongoing accumulation of digitized documents make it a challenge to maintain, let alone increase, retrieval quality. Improved search engine capabilities enable users to consider synonyms, stem forms, and even translations (He & Wang, 2009). However, these capabilities all require a search request based on words within the document and ignore the meaning and context in which these words occur; that is, they ignore the semantics of the text. Semantic analysis can support the search by determining the key concepts and scenarios that may be associated with a term; e.g., the word ‘trunk’ may be used with a different meaning in documents about car repair, travel accessories, or safari reports. As the Web progresses and evolves, we anticipate that computers will continue to process information on increasingly higher levels, and will soon enable search and retrieval of documents based on the meaning of words, rather than just the occurrence of words.
The underlying systems that support this process would also enable other applications for handling documents, allowing software agents to extract individualised information from databases, grade unstructured exams with minimal instructor setup, summarise correspondence or articles, and translate documents effectively. In all of these cases, the ability to understand natural, unstructured language is crucial to ensure the robustness and reliability of the results.

Key Terms in this Chapter

Semantic Document Network: A network that contains the semantic representation of the documents' content but not their textual content. Its nodes are connected where the content of documents intersects, so that the connections represent the overlap of semantic content between documents.

Concept Retrieval: The ability to query a document and extract particular segments of text that match concepts or ideas provided by a user.

Semantic Network: A network in which nodes, encapsulating data and information, are connected by edges that include information about how these nodes are related to one another.

Concept: One or multiple words associated with a category that was generated by abstracting the common characteristics from a range of particular ideas, while removing the uncommon characteristics. The remaining common characteristic is the one shared by all of the different individuals, and it represents the meaning, or sense, of the ideas.

Text Analysis: The process of deriving meaningful information from the data and ideas expressed within the document. It includes meta-information, structural information, and content information.

Semantic Core: The document-specific component of the semantic network that contains the ideas and concepts that best represent the meaning of the document, rather than the best-matching words.

Concept-Based Indexing (CBI): A method for indexing that differs from text-based indexing (which uses keywords or headings); CBI instead uses descriptions, ideas, and concepts to index documents.

Semantic Analysis: The elicitation of knowledge from documents, accounting for context and understanding. The extracted units are arranged and grouped within meaningful categories.
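
Several of the terms above can be tied together in a small sketch. It is an illustration under stated assumptions, not the chapter's implementation: each document is assumed to have already been reduced to a semantic core (here a hand-written set of concepts), documents are linked wherever their cores overlap, and a query is answered from a concept index rather than from the document text. The document identifiers, concepts, and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical semantic cores: document id -> set of core concepts.
# Only concepts are stored, not the textual content of the documents.
SEMANTIC_CORES = {
    "doc_car": {"vehicle", "storage", "repair"},
    "doc_travel": {"luggage", "storage", "journey"},
    "doc_safari": {"elephant", "journey", "wildlife"},
}


def build_concept_index(cores):
    """Concept-based index: map each concept to the documents containing it."""
    index = defaultdict(set)
    for doc_id, concepts in cores.items():
        for concept in concepts:
            index[concept].add(doc_id)
    return dict(index)


def build_document_network(cores):
    """Link each pair of documents whose cores overlap.

    The edge carries the shared concepts, i.e. the semantic overlap.
    """
    edges = {}
    for a, b in combinations(sorted(cores), 2):
        shared = cores[a] & cores[b]
        if shared:
            edges[(a, b)] = shared
    return edges


def retrieve(index, query_concepts):
    """Rank documents by the number of query concepts they cover."""
    hits = defaultdict(int)
    for concept in query_concepts:
        for doc_id in index.get(concept, ()):
            hits[doc_id] += 1
    # Most matching concepts first; ties are broken alphabetically.
    return sorted(hits, key=lambda d: (-hits[d], d))


if __name__ == "__main__":
    index = build_concept_index(SEMANTIC_CORES)
    print(build_document_network(SEMANTIC_CORES))
    print(retrieve(index, {"storage", "journey"}))
```

Plain dictionaries and sets keep the sketch self-contained; a real semantic network would also record how nodes are related, which this example reduces to the set of shared concepts on each edge.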
