Review on Keyword Search and Ranking Techniques for Semi-Structured Data

Dayananda P. (JSSATEB, India) and Sowmyarani C. N. (Rashtreeya Vidyalaya College of Engineering, India)
DOI: 10.4018/978-1-5225-7347-0.ch013

Abstract

The volume of semi-structured data is growing continuously, and handling it efficiently is a challenging task. Keyword search is important because it lets users retrieve the required information without knowing the data storage hierarchy. Handling XML data poses several challenges. This chapter discusses these challenges in terms of lowest common ancestor (LCA) semantics, efficient query processing, and retrieval of top-k results for the data users need. Existing approaches are grouped into classes based on how they frame the problem and its solution. Keyword search and ranking techniques for retrieving the desired information are analyzed in detail.
Chapter Preview

When XML documents do not contain IDREF, they can be modeled as trees. Approaches that handle such documents are called tree-based approaches because they rely on the tree model. Inspired by the hierarchical structure of this model, most existing tree-based approaches build on the LCA (Lowest Common Ancestor) semantics, which returns the lowest common ancestors of the nodes that match a keyword query.
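To make the tree model concrete, the following is a minimal sketch of how an IDREF-free XML document might be turned into a labeled tree with a keyword index. It assumes Dewey-style labels (each node is identified by the path of child positions from the root), which is a common labeling choice but is not prescribed by the text above. The function name `build_index` and the sample document are hypothetical, and only element text is indexed, not tag names or attributes.

```python
import xml.etree.ElementTree as ET


def build_index(xml_text):
    """Map each lowercased word to the Dewey labels of the elements containing it.

    Assumption: a label is a tuple of child positions, with the root at (0,).
    """
    root = ET.fromstring(xml_text)
    index = {}

    def visit(elem, label):
        # Index only the text directly inside this element.
        for word in (elem.text or "").lower().split():
            index.setdefault(word, []).append(label)
        # Children extend the parent's label with their position.
        for position, child in enumerate(elem):
            visit(child, label + (position,))

    visit(root, (0,))
    return index


# Hypothetical sample document, used only for illustration.
SAMPLE = (
    "<bib>"
    "<paper><title>XML search</title><author>Guo</author></paper>"
    "<paper><title>Ranking XML</title></paper>"
    "</bib>"
)

index = build_index(SAMPLE)
```

With labels of this kind, the ancestor relationship reduces to a prefix test: a node is an ancestor of another exactly when its label is a proper prefix of the other's label.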

Many subsequent semantics have been proposed to filter out less meaningful answers. Existing works either improve effectiveness by proposing a new semantics or improve efficiency by proposing a new method for a given semantics. The widely accepted LCA-based semantics include LCA itself, SLCA, VLCA, MLCA, and ELCA, among which SLCA and ELCA are the most popular. We classify the existing research works according to these semantics; the result of our classification is shown in Figure 1. Some research works study more than one semantics, such as XRANK (Dayananda, 2016; Guo et al., 2003), Set-intersection (Bao et al., 2012), and Top-K (Dayananda, 2016).

2.1 LCA Semantics

The LCA semantics for XML keyword search was first proposed in XRANK (Guo et al., 2003). Under the LCA semantics, consider a set of matching nodes in which each node contains at least one query keyword and each query keyword matches at least one node; the lowest common ancestor (LCA) of this set is a returned node. An answer is either a subtree rooted at a returned node (i.e., an LCA) or a path from the returned node to the matching nodes. For ranking, XRANK extends Google's PageRank algorithm. It takes into account the proximity of the keywords and the references between attributes. XRANK implements a naive approach, followed by three optimized approaches that improve the search.
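The definition above can be illustrated with a small sketch. It is not XRANK's algorithm; it is a brute-force enumeration that assumes Dewey-style labels (tuples of child positions, so the LCA of several nodes is the longest common prefix of their labels) and, for simplicity, picks exactly one matching node per keyword. The names `lca`, `lca_answers`, and the sample index are hypothetical.

```python
from itertools import product


def lca(labels):
    """Return the longest common prefix of Dewey labels, i.e. the label of their LCA."""
    prefix = []
    for parts in zip(*labels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def lca_answers(index, query):
    """Return the LCA of every combination that takes one matching node per keyword."""
    if not query:
        return set()
    match_lists = [index.get(keyword, []) for keyword in query]
    if not all(match_lists):
        # Some keyword matches no node, so no set covers the whole query.
        return set()
    return {lca(combo) for combo in product(*match_lists)}


# Hypothetical keyword index: the root is (0,), its children are (0, 0) and (0, 1).
index = {
    "xml": [(0, 0, 0), (0, 1, 0)],
    "guo": [(0, 0, 1)],
}

answers = lca_answers(index, ["xml", "guo"])
```

In this sample, `answers` contains both `(0, 0)`, the element whose children match both keywords, and `(0,)`, the document root, which arises from pairing matches in different subtrees. Answers of the second kind are the sort of less meaningful result that the later LCA-based semantics aim to filter out. Enumerating every combination grows with the product of the match-list sizes, so this sketch is only suitable for illustration.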
