Foundation of Keyword Search in XML

Weidong Yang (Fudan University, China) and Hao Zhu (Fudan University, China)
DOI: 10.4018/978-1-4666-1975-3.ch001


It has become desirable to offer keyword search as a way for users to query structured information in an XML database (data-centric retrieval) by combining database and information retrieval techniques. The key challenges of keyword search in an XML database are therefore how to define appropriate result models that meet users' search intents, how to compute the results with efficient algorithms, and how to rank the results. This chapter first presents the foundational knowledge of XML keyword search, such as XML data models, XML query languages, inverted indexes, and Dewey encoding. It then presents representative existing research on keyword search in XML, including result models such as Smallest Lowest Common Ancestor (SLCA), Exclusive Lowest Common Ancestor (ELCA), and Meaningful Lowest Common Ancestor (MLCA), the related search algorithms, and the ranking approaches.
Chapter Preview

1.1 Introduction

As a standard for the representation and exchange of semi-structured data on the Internet, XML has attracted much research on XML retrieval, which enables information discovery in XML data.

With regard to the retrieval mode, traditional structured query languages, such as XPath and XQuery, are used to search XML data. They can convey complex semantic meanings and therefore retrieve precisely the desired results. Nevertheless, the syntax of such a language is often rather complicated, which makes it inappropriate for a naive user. The user also needs sufficient knowledge of the structure and of the role of the requested objects in order to formulate a meaningful query. In contrast, keyword search is a proven user-friendly way of querying XML data, since the user does not need to know either a query language or the structure of the underlying data. Its main disadvantages are limited expressivity and inherent ambiguity, which make it challenging to interpret the semantics of a keyword query over XML data.
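The contrast between the two retrieval modes can be illustrated with a minimal sketch. It is not taken from the chapter: the sample document, the element names, and the helper `keyword_search` are hypothetical, and the keyword semantics used here (return the deepest elements whose subtree text contains every keyword, by plain substring matching) is only a naive approximation in the spirit of the SLCA model named in the abstract. The structured query relies on the limited XPath subset of Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for illustration only.
DOC = """
<bib>
  <book>
    <title>XML Retrieval</title>
    <author>Smith</author>
  </book>
  <book>
    <title>Database Systems</title>
    <author>Jones</author>
  </book>
</bib>
"""

root = ET.fromstring(DOC)

# Structured query: the user must already know that <book> elements
# have <author> and <title> children.
structured = [t.text for t in root.findall("./book[author='Smith']/title")]


def keyword_search(node, keywords):
    """Return the deepest elements whose subtree text contains every keyword.

    Naive assumption: a keyword matches by case-insensitive substring test.
    """
    text = " ".join(node.itertext()).lower()
    if not all(k.lower() in text for k in keywords):
        return []
    hits = []
    for child in node:
        hits.extend(keyword_search(child, keywords))
    # If no child subtree covers all keywords, this node is a smallest match.
    return hits or [node]


# Keyword query: no knowledge of the schema or of a query language is needed.
keyword_hits = keyword_search(root, ["smith", "xml"])

print(structured)
print([e.tag for e in keyword_hits])
```

In this sketch the structured query returns exactly the requested title, while the keyword query returns the enclosing `book` element and leaves it to the system to decide which part of that subtree the user actually wanted. This is the ambiguity that result models and ranking approaches aim to resolve.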

With regard to the organization of the underlying XML data, two kinds of retrieval are distinguished: data-centric retrieval and text-centric (or document-centric) retrieval. While both text and structure are important, text-centric retrieval gives higher priority to text. The premise of this approach is that XML document retrieval is characterized by long text fields and inexact matching. In contrast, data-centric XML retrieval can execute exact-match conditions over data that is mainly numerical or encoded as non-text attribute-value pairs, and it therefore puts the emphasis on the structural aspects of XML data and queries. In short, text-centric approaches are appropriate for data that are essentially text documents, marked up as XML to capture document structure, while data-centric approaches are commonly used for data collections with complex structures that mainly contain non-text data. (Further comparisons can be found in later sections.)

In this chapter, some basic XML concepts are described in Section 1.2. We then mainly discuss data-centric keyword search from several aspects, namely semantic models, search algorithms, and relevance ranking, which are presented in Sections 1.3 to 1.5. Other related issues can be found in Section 1.6.
