Extended Adaptive Join Operator with Bind-Bloom Join for Federated SPARQL Queries

Extended Adaptive Join Operator with Bind-Bloom Join for Federated SPARQL Queries

Damla Oguz, Shaoyi Yin, Belgin Ergenç, Abdelkader Hameurlain, Oguz Dikenelli
Copyright: © 2017 |Pages: 26
DOI: 10.4018/IJDWM.2017070103
(Individual Articles)
No Current Special Offers


The goal of query optimization in query federation over linked data is to minimize the response time and the completion time. Communication time has the highest impact on them both. Static query optimization can end up with inefficient execution plans due to unpredictable data arrival rates and missing statistics. This study is an extension of adaptive join operator which always begins with symmetric hash join to minimize the response time, and can change the join method to bind join to minimize the completion time. The authors extend adaptive join operator with bind-bloom join to further reduce the communication time and, consequently, to minimize the completion time. They compare the new operator with symmetric hash join, bind join, bind-bloom join, and adaptive join operator with respect to the response time and the completion time. Performance evaluation shows that the extended operator provides optimal response time and further reduces the completion time. Moreover, it has the adaptation ability to different data arrival rates.
Article Preview

1. Introduction

As the increase in the number of data sources on linked data, a distributed data space on the web is generated. This huge global data space can be automatically queried by using two approaches called link traversal (Hartig, Bizer, & Freytag, 2009) and query federation (Görlitz & Staab, 2011a). The first approach is based on discovering potentially relevant data by following the links between them. In other words, it finds the related data sources during the query execution. The second approach, query federation, divides the query into subqueries and distributes them to the SPARQL endpoints of the relevant data sources. The intermediate results from the data sources are aggregated and the final results are generated. Although both approaches have the advantage of providing up-to-date results, link traversal cannot guarantee finding all results because the relevant data sources change according to the starting point. For this reason, we focus on the query federation approach.

The objective of engines in query federation is to minimize both the response time and the completion time. Response time is the time to generate the first result tuple, whereas completion time is the time to provide all result tuples. Response time and completion time include communication time, I/O time and CPU time. Since the communication time dominates other costs, the main objective of the federated query engines can be stated as to minimize the communication cost. Static query optimization (Selinger, Astrahan, Chamberlin, Lorie, & Price, 1979) is not adequate for federated queries, because they are executed over the SPARQL endpoints of the selected distributed data sources on the web, and the data arrival rates are unexpected. Moreover, most of the statistics about the data sources are missing or unreliable. These constraints show that adaptive query optimization (Deshpande, Ives, & Raman, 2007) is a necessity for query federation over linked data.

Adaptive query optimization has been studied in detail in relational databases (Babu & Bizarro, 2005; Deshpande et al., 2007; Morvan & Hameurlain, 2009; Gounaris, Tsamoura, & Manolopoulos, 2013). However, it is a new research area for linked data. There are only two engines which consider adaptive query optimization for federated queries over SPARQL endpoints: ANAPSID (Acosta, Vidal, Lampo, Castillo, & Ruckhaus, 2011) and ADERIS (Lynden, Kojima, Matono, & Tanimura, 2010, 2011). The first one proposes a non-blocking join method based on symmetric hash join (Wilschut & Apers, 1991) and Xjoin (Urhan & Franklin, 2000), while the second one uses a cost model for dynamically changing the join order. Other than these, AVALANCHE (Basca & Bernstein, 2010, 2014) collects statistical information about relevant data sources and then generates its execution plan to provide the first k tuples. In addition, there are several studies which concentrate on join ordering for SPARQL queries by using different techniques such as evolutionary algorithms (Oren, Guéret, & Schlobach, 2008; Hogenboom, Milea, Frasincar, & Kaymak, 2009) and ant colony (Hogenboom, Frasincar, & Kaymak, 2013; Kalayci, Kalayci, & Birant, 2015). To the best of our knowledge, adaptive join operator (Oguz, Yin, Hameurlain, Ergenc, & Dikenelli, 2016) is the first study which aims to reduce both the response time and the completion time for query federation over SPARQL endpoints.

As mentioned above, the communication cost is the dominant cost in distributed environments. Bloom filter (Bloom, 1970), which is a space efficient data structure, is widely used in relational databases (Mackert & Lohman, 1986; Mullin, 1990; Michael, Nejdl, Papapetrou, & Siberski, 2007; Ives & Taylor, 2008). It is utilized in different linked data tasks such as identity reasoning (Williams, 2008) and data source selection (Hose & Schenkel, 2012). Bloom filter is also employed to reduce the communication cost in two studies of linked data (Basca & Bernstein, 2014; Groppe, Heinrich, & Werner, 2015).

Complete Article List

Search this Journal:
Volume 20: 1 Issue (2024)
Volume 19: 6 Issues (2023)
Volume 18: 4 Issues (2022): 2 Released, 2 Forthcoming
Volume 17: 4 Issues (2021)
Volume 16: 4 Issues (2020)
Volume 15: 4 Issues (2019)
Volume 14: 4 Issues (2018)
Volume 13: 4 Issues (2017)
Volume 12: 4 Issues (2016)
Volume 11: 4 Issues (2015)
Volume 10: 4 Issues (2014)
Volume 9: 4 Issues (2013)
Volume 8: 4 Issues (2012)
Volume 7: 4 Issues (2011)
Volume 6: 4 Issues (2010)
Volume 5: 4 Issues (2009)
Volume 4: 4 Issues (2008)
Volume 3: 4 Issues (2007)
Volume 2: 4 Issues (2006)
Volume 1: 4 Issues (2005)
View Complete Journal Contents Listing