A Cost-Based Range Estimation for Mapping Top-k Selection Queries over Relational Databases

A Cost-Based Range Estimation for Mapping Top-k Selection Queries over Relational Databases

Anteneh Ayanso (Brock University, Canada), Paulo B. Goes (University of Arizona, USA) and Kumar Mehta (George Mason University, USA)
Copyright: © 2009 |Pages: 25
DOI: 10.4018/jdm.2009062501

Abstract

Finding efficient methods for supporting top-k relational queries has received significant attention in academic research. One of the approaches in the recent literature is query-mapping, in which top-k queries are mapped (translated) into equivalent range queries that relational database systems (RDBMSs) normally support. This approach combines the advantage of simplicity as well as practicality by avoiding the need for modifications to the query engine, or specialized data structures or indexing techniques to handle top-k queries separately. However, existing methods following this approach fall short of adequately modeling the problem environment and providing consistent results. In this article, the authors propose a cost-based range estimation model for the query-mapping approach. They provide a methodology for trading-off relevant query execution cost components and mapping a top-k query into a cost-optimal range query for efficient execution. Their experiments on real world and synthetic data sets show that the proposed strategy not only avoids the need to calibrate workloads on specific database contents, but also performs at least as well as prior methods.
Article Preview

Introduction

Relational databases are increasingly being used to support a wide range of interactive applications that require efficient methods for exploratory search and retrieval (for example, search for airline tickets, hotel rooms, real estates, used cars). In such applications, users specify target values for certain attributes without necessarily requiring exact matches to these values in return. However, relational queries normally establish rigid qualification to deal only with data that exactly match selection conditions (Motro, 1988). Due to the exactness in nature of relational databases and the query languages, users face the challenge of routinely specifying value ranges of attributes in search of approximate results, particularly in the event of empty answers or too many answers (Huh & Lee, 2001; Shin, Huh, Park & Lee, 2008). Top-k querying aims to address this problem, and is defined in the recent literature as a method of specifying relational queries by target values of attributes in order to obtain a desired number of best matches based on some ranking (distance) functions (Bruno, Chaudhuri & Gravano, 2002; Chaudhuri & Gravano, 1999). The objective is to retrieve a ranked list of the k records that are the closest to the query specifications.

Given the enormous installed base of relational database management systems (RDBMSs) in many emerging applications, finding efficient top-k retrieval methods that are easily implementable within these environments is of great importance. Our work in this article is particularly motivated by the mechanism of mapping or translating a top-k query into an equivalent range query that is normally supported by any RDBMS (Bruno et al., 2002; Chaudhuri & Gravano, 1999; Donjerkovic & Ramakrishnan, 1999). The rationale in the query-mapping approach is to exploit existing features of RDBMSs (for example, query language, summary statistics, and available index utilities) and provide a specific method to translate a top-k query into a conventional range query for efficient execution. This approach not only eliminates the need for additional specialized data structures and separate execution mechanisms for top-k queries, but also maintains the transparency in the operations of the RDBMS (Ayanso, Goes & Mehta, 2007; Bruno et al., 2002; Chaudhuri & Gravano, 1999). Nevertheless, existing top-k query-mapping methods are either simple heuristics that lead to significant performance variation by database settings (Chaudhuri & Gravano, 1999) or workload-adaptive that requires learning the data using queries with pre-specified data characteristics (Bruno et al., 2002; Chen & Ling, 2002).

In this article, we present a cost-based, query level model for mapping a top-k query into a cost-optimal range query. To this end, we explicitly formulate the relevant cost components and provide a methodology that allows evaluation of cost tradeoffs for range estimation. The top-k retrieval problem and the query-mapping solution involve many related issues and conflicting efficiency objectives. The use of tradeoff analysis not only captures the complexity of the problem better, but also helps to identify and incorporate the major factors that influence database operations and the associated costs in top-k querying. Our query level model is an improvement over the workload-based approach, which involves calibration of parameters using pre-specified workloads. In the workload-based approach, queries that deviate in characteristics or distribution from pre-specified workloads suffer from performance inefficiency. We adopt an approach that accounts for cost and performance implications of queries individually and show the relative consistency in efficiency under different experimental settings.

Complete Article List

Search this Journal:
Reset
Open Access Articles
Volume 30: 4 Issues (2019): Forthcoming, Available for Pre-Order
Volume 29: 4 Issues (2018): 1 Released, 3 Forthcoming
Volume 28: 4 Issues (2017)
Volume 27: 4 Issues (2016)
Volume 26: 4 Issues (2015)
Volume 25: 4 Issues (2014)
Volume 24: 4 Issues (2013)
Volume 23: 4 Issues (2012)
Volume 22: 4 Issues (2011)
Volume 21: 4 Issues (2010)
Volume 20: 4 Issues (2009)
Volume 19: 4 Issues (2008)
Volume 18: 4 Issues (2007)
Volume 17: 4 Issues (2006)
Volume 16: 4 Issues (2005)
Volume 15: 4 Issues (2004)
Volume 14: 4 Issues (2003)
Volume 13: 4 Issues (2002)
Volume 12: 4 Issues (2001)
Volume 11: 4 Issues (2000)
Volume 10: 4 Issues (1999)
Volume 9: 4 Issues (1998)
Volume 8: 4 Issues (1997)
Volume 7: 4 Issues (1996)
Volume 6: 4 Issues (1995)
Volume 5: 4 Issues (1994)
Volume 4: 4 Issues (1993)
Volume 3: 4 Issues (1992)
Volume 2: 4 Issues (1991)
Volume 1: 2 Issues (1990)
View Complete Journal Contents Listing