Evaluating Top-k Skyline Queries Efficiently

Evaluating Top-k Skyline Queries Efficiently

Marlene Goncalves, María Esther Vidal
DOI: 10.4018/978-1-60960-475-2.ch004
OnDemand:
(Individual Chapters)
Available
$37.50
No Current Special Offers
TOTAL SAVINGS: $37.50

Abstract

Criteria that induce a Skyline naturally represent user’s preference conditions useful to discard irrelevant data in large datasets. However, in the presence of high-dimensional Skyline spaces, the size of the Skyline can still be very large. To identify the best k points among the Skyline, the Top-k Skyline approach has been proposed. This chapter describes existing solutions and proposes to use the TKSI algorithm for the Top-k Skyline problem. TKSI reduces the search space by computing only a subset of the Skyline that is required to produce the top-k objects. In addition, the Skyline Frequency Metric is implemented to discriminate among the Skyline objects those that best meet the multidimensional criteria. This chapter’s authors have empirically studied the quality of TKSI, and their experimental results show the TKSI may be able to speed up the computation of the Top-k Skyline in at least 50% percent with regard to the state-of-the-art solutions.
Chapter Preview
Top

Introduction

Emerging technologies such as Semantic Web, Grid, Semantic Search, Linked Data and Cloud and Peer-to-Peer computing have become available very large datasets. For example, by the time this paper has been written at least 21.59 billion pages are indexed by the Web (De Kunder, 2010) and the Cloud of Linked Data has at least 13,112,409,691 triples (W3C, 2010). The enormous growth in the size of data has a direct impact on the performance of tasks that are required to process on very large datasets and whose complexity depends on the size of the database. Particularly, the task of evaluating queries based on user preferences may be considerably affected by this situation.

Skyline (Börzsönyi et al., 2001) approaches have been successfully used to naturally express user preference conditions useful to characterize relevant data in large datasets. Even though, Skyline may be a good choice for huge data sets its cardinality may become very large as the number of criteria or dimensions increases. The estimated cardinality of the Skyline is O(lnd-1n) when the dimensions are independent where n is the size of the input data and d the number of dimensions (Bentley et al., 1978).

Consider Table 1 that shows estimates of the skyline cardinality when the number of dimensions ranges from 2 to 10 in a database comprised of 1,000,000 tuples. We may observe in Table 1 that Skyline cardinality rapidly increases making unfeasible for users to process the whole skyline set. In consequence, users may have to discard useless data manually and consider just a small subset or a subset of the Skyline that best meet the multidimensional criteria. To identify these points, the Top-k Skyline has been proposed (Goncalves and Vidal, 2009; Chan et al., 2006b; Lin et al., 2007). Top-k Skyline uses a score function to induce a total order of the Skyline points, and recognizes the top-k objects based on these criteria.

Table 1.
Estimated Skyline Cardinality
#DimensionsCardinality
2191
32,637
436,431
5503,309
66,953,471
796,065,749
81,327,197,371
918,335,909,288
10253,319,948,365

Complete Chapter List

Search this Book:
Reset