Article Preview
TopIntroduction
Emerging technologies such as Semantic Web, Grid, Semantic Search, Linked Data and Cloud and Peer-to-Peer computing have become available very large datasets. For example, by the time this paper has been written at least 3.8 billion pages are indexed by the Web (Kunder, 2013) and the Linking Open Data cloud which by September 2011 it has at least 31 billion RDF triples, interlinked by around 504 million RDF links1.
The enormous growth in the size of data has a direct impact on the performance of tasks that are required to process on very large datasets and whose complexity depends on the size of the database. Particularly, the task of evaluating queries based on user preferences may be considerably affected by this situation.
Skyline (Börzsönyi et al., 2001) approaches have been successfully used to naturally express user preference conditions useful to characterize relevant data in large datasets. Even though, Skyline may be a good choice for huge data sets its cardinality may become very large as the number of criteria or dimensions increases. The estimated cardinality of the Skyline is O(lnd-1n) when the dimensions are independent where n is the size of the input data and d the number of dimensions (Bentleyet al., 1978).
Consider Table 1 that shows estimates of the skyline cardinality when the number of dimensions ranges from 2 to 10 in a database comprised of 1,000,000 tuples. We may observe in Table 1 that Skyline cardinality rapidly increases making unfeasible for users to process the whole skyline set. In consequence, users may have to discard useless data manually and consider just a small subset or a subset of the Skyline that best meet the multidimensional criteria. To identify these points, the Top-k Skyline has been proposed (Goncalves & Vidal, 2009; Chan et al., 2006b; Lin et al., 2007). Top-k Skyline uses a score function to induce a total order of the Skyline points, and recognizes the top-k points based on these criteria.
Table 1. Estimated skyline cardinality
Several algorithms have been defined to compute the Top-k Skyline (Goncalves & Vidal, 2009; Chan et al., 2006b; Lin et al., 2007; Vlachou & Vazirgiannis, 2007), but they may be very costly. First, they require the computation of the whole Skyline; second, they execute probes of the multi-dimensional function over the whole Skyline points. Thus, if k is much smaller than the cardinality of the Skyline, these solutions may be very inefficient because a large number of non-necessary probes may be evaluated, i.e., at least Skyline size minus k performed probes will be non-necessaries.
Top-k Skyline has become necessary in many real-world situations (Vlachou & Vazirgiannis, 2007), and a wide range of ranking metrics to measure the interestingness of each Skyline tuple has been proposed.