TBSGM: A Fast Subgraph Matching Method on Large Scale Graphs

TBSGM: A Fast Subgraph Matching Method on Large Scale Graphs

Fusheng Jin (Beijing Institute of Technology, Beijing, China), Yifeng Yang (Beijing Institute of Technology, Beijing, China), Shuliang Wang (School of Software, Beijing Institute of Technology, Beijing, China), Ye Xue (Northwestern University, Evanston, USA) and Zhen Yan (Technical University of Munich, Munich, Germany)
Copyright: © 2018 |Pages: 23
DOI: 10.4018/IJDWM.2018100104
OnDemand PDF Download:
No Current Special Offers


Subgraph matching, which belongs to NP-hard, faces significant challenges on a large scale graph with billions of nodes, and existing methods are usually confronted with greater challenges from both stability and efficiency. In this article, a subgraph matching method in a distributed system, tree model-based subgraph matching method (TBSGM) is proposed. The authors provide a transformed efficient query tree as a replacement for a query graph. In order to get the tree, they present a cost evaluation model which may help to generate the efficient query tree according to network communication-cost and calculation-cost evaluation. Also, a key set based indexing strategy for intermediate results is given to simplify the matching results during network communication. Extensive experiments with real-world datasets show that TBSGM significantly outperforms other methods in the aspects of scalability and efficiency.
Article Preview

1. Introduction

Graph data have been increasing exponentially in recent years. In 2015, the number of active users on Facebook in a month has reached 1.3 billion (http://www.statisticbrain.com/facebook-statistics/). It is unfeasible to handle these large-scale graphs that have more than billion nodes on a single machine. That is, in the context of large scale graph, processing graph in distributed systems become an inevitable trend. Among all graph algorithms, subgraph matching has been widely used in many applications, such as finding isomorphism in organic polymer and protein matching (Zhao & Han, 2010), knowledge bases (Kasneci, Suchanek, Ifrim, Ramanath, & Weikum, 2008; Wu, Li, Wang, & Zhu, 2012) and program analysis (Zhang, Yang, & Jin, 2010; Zhao & Han, 2010). Many methods like (Cordella, Foggia, Sansone, & Vento, 2004; Han, Lee, & Lee, 2013; He & Singh, 2010; Shang, Zhang, Lin, & Yu, 2008; Sun, Wang, Wang, Shao, & Li, 2012; Ullmann, 1976; Zhang, Li, & Yang, 2009; Zhao & Han, 2010) have been given to deal with the subgraph matching efficiently.

However, dealing with subgraph matching on large data graph is still an unresolved issue. When dealing with large scale graphs, all existing methods are either unstable or inefficiency. Matching order and redundant results are the most common reasons that result in instability and inefficiency (Han et al., 2013; Lee et al., 2012).

1.1. Subgraph Matching

In this paper, when handling subgraph matching problem of large scale graph with more than billion nodes in distributed systems, the subgraph matching problem is defined as follows: For a data graph G and a query graph Q, retrieve all subgraphs of G that are isomorphic to Q (Sun et al., 2012). Subgraph matching approaches are mainly categorized into two groups under different implementing environment, centralized approaches and distributed approaches:

  • 1.

    Centralized Approaches: (Cordella, Foggia, Sansone, & Vento, 2004; Han, Lee, & Lee, 2013; He & Singh, 2010; Shang, Zhang, Lin, & Yu, 2008; Ullmann, 1976; Zhang, Li, & Yang, 2009; Zhao & Han, 2010). Given a matching order of query nodes, it compares each query node with the data node and check the edges that connect to this query node, until every query node has been matched. Centralized approaches are usually efficient when the scale of graph data is small. However, when it comes to large scale graph data, these approaches are very low efficiency, because the data graph cannot be stored in computer memory completely, and it takes a lot of extra resources, such as visiting external store or communicating on network.

  • 2.

    Distributed Approaches: Sun Z (Sun et al., 2012) is a typical example. It transforms query graph to many STwigs and matching STwigs with data nodes. Then it performs the operation of Join to get the final result. However, the problem is that different matching orders and excessive intermediate results will significantly influence the efficiency.

Complete Article List

Search this Journal:
Open Access Articles
Volume 18: 4 Issues (2022): Forthcoming, Available for Pre-Order
Volume 17: 4 Issues (2021): 2 Released, 2 Forthcoming
Volume 16: 4 Issues (2020)
Volume 15: 4 Issues (2019)
Volume 14: 4 Issues (2018)
Volume 13: 4 Issues (2017)
Volume 12: 4 Issues (2016)
Volume 11: 4 Issues (2015)
Volume 10: 4 Issues (2014)
Volume 9: 4 Issues (2013)
Volume 8: 4 Issues (2012)
Volume 7: 4 Issues (2011)
Volume 6: 4 Issues (2010)
Volume 5: 4 Issues (2009)
Volume 4: 4 Issues (2008)
Volume 3: 4 Issues (2007)
Volume 2: 4 Issues (2006)
Volume 1: 4 Issues (2005)
View Complete Journal Contents Listing