The Graph Signature: A Scalable Query Optimization Index for RDF Graph Databases Using Bisimulation and Trace Equivalence Summarization

The Graph Signature: A Scalable Query Optimization Index for RDF Graph Databases Using Bisimulation and Trace Equivalence Summarization

Mustafa Jarrar (Sina Institute, Birzeit University, Birzeit, Palestine) and Anton Deik (Sina Institute, Birzeit University, Birzeit, Palestine)
Copyright: © 2015 |Pages: 30
DOI: 10.4018/IJSWIS.2015040102
OnDemand PDF Download:
No Current Special Offers


Querying large data graphs has brought the attention of the research community. Many solutions were proposed, such as Oracle Semantic Technologies, Virtuoso, RDF3X, and C-Store, among others. Although such approaches have shown good performance in queries with medium complexity, they perform poorly when the complexity of the queries increases. In this paper, the authors propose the Graph Signature Index, a novel and scalable approach to index and query large data graphs. The idea is that they summarize a graph and instead of executing the query on the original graph, they execute it on the summaries. The authors' experiments with Yago (16M triples) have shown that e.g., a query with 4 levels costs 62 sec using Oracle but it only costs about 0.6 sec with their index. Their index can be implemented on top of any Graph database, but they chose to implement it as an extension to Oracle on top of the SEM_MATCH table function. The paper also introduces disk-based versions of the Trace Equivalence and Bisimilarity algorithms to summarize data graphs, and discusses their complexity and usability for RDF graphs.
Article Preview

1. Introduction And Motivation

Big Data, Data Web, Linked and Open Data are examples of an emerging era of data industry and data science. We are witnessing a rapid growth of the amount of available structured and linked data. As of June 2015, the Linking Open Data Statistics project (LODStats) records 3308 published datasets consisting of around 89.9 billion RDF triples.1 Examples of published datasets are: DBPedia, Yago, DBLP, CiteSeer, ACM, Freebase, Geonames, MusicBrainz, as well as open governmental datasets such as those of the UK (, the US (, and Ireland (, and many others. The biggest software companies are encouraging the trend of publishing and linking structured data on the web. For example, Google, Yahoo, and Microsoft have joined efforts in developing a shared ontology.2 Facebook is also providing access to parts of its data via its Graph API.3

To exploit structured data on the web to its full potential, people need efficient querying methods. SPARQL was introduced by W3C as a standardized query language that enables querying decentralized collections of RDF data. However, SPARQL is oriented for technical people. So, in order to allow people with limited IT skills to query structured data, many solutions were proposed, among which are those which proposed an interactive approach that allows the user to formulate queries without prior knowledge of the underlying data or its structure. Examples of such approaches are: Lore (Goldman & Widom, 1997) which was developed for querying schema-free XML, and MashQL, which is a query formulation language for RDF introduced in previous work (Jarrar & Dikaiakos, 2008; 2009; 2010; 2012). In July 2013, Facebook started rolling out a “Graph Search” functionality,4 allowing users to formulate structured queries over the Facebook data graph. Not only do such approaches motivate the importance of querying data graphs, but they also emphasize the significance of having fast responses for queries executed over large data graphs in an interactive environment.

The most widely adopted data model specification for representing structured data on the web is RDF (Resource Description Framework). RDF syntax is based on XML and reflexes simple graph-based data model. RDF represents data as triples <Subject, Predicate, Object>.5 For instance, the fact that the book called Wamadat is authored by Naima can be represented by the following three triples which form a directed labeled graph (see Figure 1):

Figure 1.

A data graph and it Graph Signature (SO and SI)

< BK3, Name, Wamadat >
< BK3, Author, AU3 >
< AU3, Name, Naima >

Data representation using RDF is more elementary than relational databases and XML models, which enables easy data integration and interoperability of systems. However, querying RDF data graphs (especially large graphs) is a major challenge that faces all querying approaches, and therefore has brought the attention of the research community (e.g., Abadi et al., 2007; Chong et al., 2005; Schätzle et al., 2013; Tran et al., 2013; Yuan et al., 2013). Querying such data, which is typically stored in one relational table denoted by <S,P,O> is of high complexity because traversing a graph stored in relational model involves many self-joins of that table. More specifically, a query with IJSWIS.2015040102.m01edges on such a table requires IJSWIS.2015040102.m02 self joins of that table (Abadi et al., 2007).

Complete Article List

Search this Journal:
Open Access Articles
Volume 18: 4 Issues (2022): Forthcoming, Available for Pre-Order
Volume 17: 4 Issues (2021): 2 Released, 2 Forthcoming
Volume 16: 4 Issues (2020)
Volume 15: 4 Issues (2019)
Volume 14: 4 Issues (2018)
Volume 13: 4 Issues (2017)
Volume 12: 4 Issues (2016)
Volume 11: 4 Issues (2015)
Volume 10: 4 Issues (2014)
Volume 9: 4 Issues (2013)
Volume 8: 4 Issues (2012)
Volume 7: 4 Issues (2011)
Volume 6: 4 Issues (2010)
Volume 5: 4 Issues (2009)
Volume 4: 4 Issues (2008)
Volume 3: 4 Issues (2007)
Volume 2: 4 Issues (2006)
Volume 1: 4 Issues (2005)
View Complete Journal Contents Listing