Article Preview
Top1. Introduction
Graph data have been increasing exponentially in recent years. In 2015, the number of active users on Facebook in a month has reached 1.3 billion (http://www.statisticbrain.com/facebook-statistics/). It is unfeasible to handle these large-scale graphs that have more than billion nodes on a single machine. That is, in the context of large scale graph, processing graph in distributed systems become an inevitable trend. Among all graph algorithms, subgraph matching has been widely used in many applications, such as finding isomorphism in organic polymer and protein matching (Zhao & Han, 2010), knowledge bases (Kasneci, Suchanek, Ifrim, Ramanath, & Weikum, 2008; Wu, Li, Wang, & Zhu, 2012) and program analysis (Zhang, Yang, & Jin, 2010; Zhao & Han, 2010). Many methods like (Cordella, Foggia, Sansone, & Vento, 2004; Han, Lee, & Lee, 2013; He & Singh, 2010; Shang, Zhang, Lin, & Yu, 2008; Sun, Wang, Wang, Shao, & Li, 2012; Ullmann, 1976; Zhang, Li, & Yang, 2009; Zhao & Han, 2010) have been given to deal with the subgraph matching efficiently.
However, dealing with subgraph matching on large data graph is still an unresolved issue. When dealing with large scale graphs, all existing methods are either unstable or inefficiency. Matching order and redundant results are the most common reasons that result in instability and inefficiency (Han et al., 2013; Lee et al., 2012).
1.1. Subgraph Matching
In this paper, when handling subgraph matching problem of large scale graph with more than billion nodes in distributed systems, the subgraph matching problem is defined as follows: For a data graph G and a query graph Q, retrieve all subgraphs of G that are isomorphic to Q (Sun et al., 2012). Subgraph matching approaches are mainly categorized into two groups under different implementing environment, centralized approaches and distributed approaches:
- 1.
Centralized Approaches: (Cordella, Foggia, Sansone, & Vento, 2004; Han, Lee, & Lee, 2013; He & Singh, 2010; Shang, Zhang, Lin, & Yu, 2008; Ullmann, 1976; Zhang, Li, & Yang, 2009; Zhao & Han, 2010). Given a matching order of query nodes, it compares each query node with the data node and check the edges that connect to this query node, until every query node has been matched. Centralized approaches are usually efficient when the scale of graph data is small. However, when it comes to large scale graph data, these approaches are very low efficiency, because the data graph cannot be stored in computer memory completely, and it takes a lot of extra resources, such as visiting external store or communicating on network.
- 2.
Distributed Approaches: Sun Z (Sun et al., 2012) is a typical example. It transforms query graph to many STwigs and matching STwigs with data nodes. Then it performs the operation of Join to get the final result. However, the problem is that different matching orders and excessive intermediate results will significantly influence the efficiency.