## Abstract

Application domains such as bioinformatics and web technology represent complex objects as graphs where nodes represent basic objects (i.e. atoms, web pages etc.) and edges model relations among them. In biochemical databases proteins are naturally represented as labeled graphs: the nodes are atoms and the edges are chemical links. In computer vision, graphs can be used to represent images at different levels of abstraction. In a low-level representation (Pailloncy, Deruyver, & Jolion, 1999), the nodes of a graph correspond to pixels and the edges to spatial relations between pixels. At higher levels of description (Hong & Huang, 2001), nodes are image regions and edges are spatial relations among them. In a Web graph (Deutsch, Fernandez, Florescu, Levy, & Suciu, 1999) nodes are web pages and edges are links between them. In all these domains, substructure queries that search for all exact or approximate occurrences of a given query graph in the graphs of the database can be useful. Research efforts in graph searching have taken three directions: the first is to study matching algorithms for particular graph structures (planar graphs, bounded valence graphs and association graphs); the second is to use elaborate tricks to reduce the number of generated matching maps (Cordella, Foggia, Sansone, & Vento, 2004); and the third is, since graph searching problem is NP-complete, to provide polynomial approximate algorithms. In the context of querying in a database of graphs many of the existing methods are designed for specific applications. For example, several querying methods for semi-structured databases have been proposed. In addition, commercial products and academic projects (Kelley 2002) for subgraph searching in biochemical databases are available. These two examples have a different underlying data model (a web-page database is a large graph whereas a biochemical database is a collection of graphs). In these two applications regular path expressions and indexing methods are used during query time to respectively locate substructures in the database and to avoid unnecessary traversals of the database. In general graph databases, there are some searching methods where the data graph and the query graph must have the same size. Other methods allow the query to be considerably smaller than the database graphs. A common idea in the above algorithms is to index the subgraph matches in the database and organize them in suitable data structures.

Top## Background

A graph database can be viewed as either a single (large) labeled graph (e.g. web) or a collection of labeled graphs (e.g., chemical molecules). By keygraph searching it is intended graph or subgraph matching in a graph database. Although (sub)graph-to-graph matching algorithms can be used for keygraph searching, efficiency considerations suggest the use of indexing techniques to reduce the search space and the time complexity especially in large databases. Keygraph searching in databases consists of three basic steps:

*1. *Reduce the search space by filtering. For a database of graphs a filter finds the most relevant graphs; for a single-graph database it identifies the most relevant subgraphs. In this work, the attention is restricted to filtering techniques based on the structure of the labeled graphs (paths, subgraphs). Since searching for subgraph structures is quite difficult, most algorithms choose to locate paths of node labels.

*2. *Formulate query into simple structures. The query graph can be given directly as a set of nodes and edges or as the intersection of a set of paths. Furthermore the query can contain wildcards (representing nodes or paths) to allow more general searches. This step normally reduces the query graph to a collection of small paths.

*3. *Match. Matching is implemented by either traditional (sub)graph-to-graph matching techniques, or combining the set of paths resulting from processing the path expressions in the query through the database.